Bug 75211 - Division error in radeon_compute_pll_avivo (X hang)
Summary: Division error in radeon_compute_pll_avivo (X hang)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-01 01:45 UTC by Darren Salt
Modified: 2016-03-23 18:56 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.15-rc3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Possible fix. (990 bytes, patch)
2014-05-02 12:29 UTC, Christian König
Details | Diff

Description Darren Salt 2014-05-01 01:45:16 UTC
I'm seeing a divide error during X startup, causing X to hang (requiring reboot to clear). The trigger is a commit in xserver git:

commit 4c3932620c29c91dfbbc8eb09c84efcaa7ec873e
Author: Keith Packard <keithp@keithp.com>
Date:   Fri Apr 25 08:22:15 2014 -0700

    hw/xfree86: Restore API compatibility for cursor loading functions

The effect is this:

divide error: 0000 [#1] PREEMPT SMP 
Modules linked in: [...]
CPU: 0 PID: 6579 Comm: Xorg Not tainted 3.15.0-rc3 #2
Hardware name: Gigabyte Technology Co., Ltd. GA-870A-UD3/GA-870A-UD3, BIOS F5 08/01/2011
task: ffff8800b12d1850 ti: ffff88021b08e000 task.ti: ffff88021b08e000
RIP: 0010:[<ffffffff812513d0>]  [<ffffffff812513d0>] radeon_compute_pll_avivo+0xb6/0x324
RSP: 0018:ffff88021b08fa08  EFLAGS: 00010246
RAX: 00000000000927c0 RBX: ffff880236dc8210 RCX: 0000000000124f80
RDX: 0000000000000000 RSI: 00000000000927c0 RDI: 0000000000000400
RBP: ffff8800bd6d2c00 R08: ffff88021b08fb08 R09: ffff88021b08fb00
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880236dc8000
R13: 0000000000000000 R14: 0000000000004ff6 R15: 00000000000003ff
FS:  00007f39298d1980(0000) GS:ffff88023fc00000(0000) knlGS:00000000e7063b40
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f392a5d3000 CR3: 000000021b00b000 CR4: 00000000000007f0
Stack:
 0000cf3700000004 ffff88021b08fa88 0000002800000000 ffff88021b08fb04
 0000000000000000 0000000000000002 ffff88021b08fafc 000000000000cf0c
 ffff88021b08fb00 ffffffff81227da9 0000001e00000004 ffff8802364e6c00
Call Trace:
 [<ffffffff81227da9>] ? drm_detect_hdmi_monitor+0x54/0x69
 [<ffffffff8123ed95>] ? atombios_crtc_mode_set+0x100/0x593
 [<ffffffff812d6725>] ? class_dev_iter_next+0xc/0x33
 [<ffffffff812d68ed>] ? class_for_each_device+0x8a/0x9f
 [<ffffffff81213ccd>] ? drm_crtc_helper_set_mode+0x2a3/0x424
 [<ffffffff812145a5>] ? drm_crtc_helper_set_config+0x601/0x853
 [<ffffffff81250650>] ? radeon_crtc_set_config+0x3e/0xd9
 [<ffffffff81222183>] ? drm_mode_set_config_internal+0x48/0xc0
 [<ffffffff81224843>] ? drm_mode_setcrtc+0x3d3/0x487
 [<ffffffff81250dac>] ? radeon_crtc_load_lut+0x275/0x630
 [<ffffffff81219f06>] ? drm_ioctl+0x337/0x3a6
 [<ffffffff81224470>] ? drm_mode_setplane+0x325/0x325
 [<ffffffff81468fcc>] ? _raw_spin_unlock_irqrestore+0xf/0x21
 [<ffffffff81234864>] ? radeon_drm_ioctl+0x42/0x6e
 [<ffffffff810b10a7>] ? do_vfs_ioctl+0x356/0x420
 [<ffffffff810b8747>] ? __fget+0x64/0x6c
 [<ffffffff810b11a4>] ? SyS_ioctl+0x33/0x58
 [<ffffffff81469ca6>] ? system_call_fastpath+0x1a/0x1f
Code: 09 8b 6b 08 89 6c 24 20 eb 5f 80 e5 20 74 08 8b 73 1c 8b 4b 20 eb 06 8b 73 14 8b 4b 18 85 ff 74 06 6b f6 0a 6b c9 0a 31 d2 89 f0 <41> f7 f5 89 c2 8d 68 01 41 0f af d5 39 f2 8b 53 30 0f 43 e8 89 
RIP  [<ffffffff812513d0>] radeon_compute_pll_avivo+0xb6/0x324
 RSP <ffff88021b08fa08>

Graphics card is R7 260X; two monitors (both 1080p) are attached, one via HDMI-0 and one via DVI-0 (as reported by xrandr). It may be relevant that as part of my user-specific X startup scripts, the HDMI-attached monitor (AOC I2367F) has its frame rate set to 50Hz.
Comment 1 Michel Dänzer 2014-05-01 09:55:19 UTC
(In reply to Darren Salt from comment #0)
> I'm seeing a divide error during X startup, causing X to hang (requiring
> reboot to clear). The trigger is a commit in xserver git:
> 
> commit 4c3932620c29c91dfbbc8eb09c84efcaa7ec873e
> Author: Keith Packard <keithp@keithp.com>
> Date:   Fri Apr 25 08:22:15 2014 -0700
> 
>     hw/xfree86: Restore API compatibility for cursor loading functions

That commit broke the Xorg video driver ABI. Did you rebuild xf86-video-ati Git[0] when crossing this xserver Git commit?
Comment 2 Christian König 2014-05-01 09:57:57 UTC
Apart from the X problem the kernel shouldn't run into a div zero error but return -EINVAL instead when some of the parameters are invalid.

Going to take a look into that direction as well.
Comment 3 Darren Salt 2014-05-01 14:16:21 UTC
I didn't rebuild the DDX. (May be worth doing just to see if the kernel bug is still triggered.)
Comment 4 Darren Salt 2014-05-01 15:16:30 UTC
It appears that you do actually need mismatched builds of xserver git and xf86-video-ati git, the latter built against (ideally) 1.15.99.902, to trigger this kernel bug... I wonder why it's gone apparently unnoticed until now. ☺
Comment 5 Christian König 2014-05-01 16:30:41 UTC
(In reply to Darren Salt from comment #4)
> It appears that you do actually need mismatched builds of xserver git and
> xf86-video-ati git, the latter built against (ideally) 1.15.99.902, to
> trigger this kernel bug... I wonder why it's gone apparently unnoticed until
> now. ☺

I've reworked that function for 3.15. It's likely that with mismatched DDX and X you pass some parameters as zero into the kernel and so trigger a divide by zero error.

Allready digging into this, just give me a day or two to catch up with the bugs.
Comment 6 Christian König 2014-05-01 17:08:06 UTC
And please provide the output of:

gdb /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/radeon/radeon.ko -ex 'list *(radeon_compute_pll_avivo+0xb6)' -ex q

Thanks in advance,
Christian.
Comment 7 Darren Salt 2014-05-01 20:33:50 UTC
Can't do that (radeon module is built in, and no debug symbols – should switch that on), but ksymoops output (below) is clear enough for me to determine that the marked line is where the bug makes itself known:

                 if (pll->flags & RADEON_PLL_USE_FRAC_FB_DIV) {
                        vco_min *= 10;
                        vco_max *= 10;
                }

/* here */      post_div_min = vco_min / target_clock;
                if ((target_clock * post_div_min) < vco_min)
                        ++post_div_min;
                if (post_div_min < pll->min_post_div)    
                        post_div_min = pll->min_post_div;

Which means that target_clock is zero, which means that freq is 0 or freq/10 is 0.


>>RIP; ffffffff812513d0 <radeon_compute_pll_avivo+b6/324>   <=====

>>RAX; 00000000000927c0 <__per_cpu_end+804c0/fedd00>
>>RBX; ffff880236dc8210 <phys_startup_64+ffff880235dc8210/ffffffff80000000>
>>RCX; 0000000000124f80 <__per_cpu_end+112c80/fedd00>
>>RSI; 00000000000927c0 <__per_cpu_end+804c0/fedd00>
>>RBP; ffff8800bd6d2c00 <phys_startup_64+ffff8800bc6d2c00/ffffffff80000000>
>>R08; ffff88021b08fb08 <phys_startup_64+ffff88021a08fb08/ffffffff80000000>
>>R09; ffff88021b08fb00 <phys_startup_64+ffff88021a08fb00/ffffffff80000000>
>>R12; ffff880236dc8000 <phys_startup_64+ffff880235dc8000/ffffffff80000000>
>>R14; 0000000000004ff6 <exception_stacks+ff6/6000>

Trace; ffffffff81227da9 <drm_detect_hdmi_monitor+54/69>
Trace; ffffffff8123ed95 <atombios_crtc_mode_set+100/593>
Trace; ffffffff812d6725 <class_dev_iter_next+c/33>
Trace; ffffffff812d68ed <class_for_each_device+8a/9f>
Trace; ffffffff81213ccd <drm_crtc_helper_set_mode+2a3/424>
Trace; ffffffff812145a5 <drm_crtc_helper_set_config+601/853>
Trace; ffffffff81250650 <radeon_crtc_set_config+3e/d9>
Trace; ffffffff81222183 <drm_mode_set_config_internal+48/c0>
Trace; ffffffff81224843 <drm_mode_setcrtc+3d3/487>
Trace; ffffffff81250dac <radeon_crtc_load_lut+275/630>
Trace; ffffffff81219f06 <drm_ioctl+337/3a6>
Trace; ffffffff81224470 <drm_mode_setcrtc+0/487>
Trace; ffffffff81468fcc <_raw_spin_unlock_irqrestore+f/21>
Trace; ffffffff81234864 <radeon_drm_ioctl+42/6e>
Trace; ffffffff810b10a7 <do_vfs_ioctl+356/420>
Trace; ffffffff810b8747 <__fget+64/6c>
Trace; ffffffff810b11a4 <sys_ioctl+33/58>
Trace; ffffffff81469ca6 <system_call_fastpath+1a/1f>

Code;  ffffffff812513a5 <radeon_compute_pll_avivo+8b/324>
0000000000000000 <_RIP>:
Code;  ffffffff812513a5 <radeon_compute_pll_avivo+8b/324>
   0:   09 8b 6b 08 89 6c         or     %ecx,0x6c89086b(%rbx)
Code;  ffffffff812513ab <radeon_compute_pll_avivo+91/324>
   6:   24 20                     and    $0x20,%al
Code;  ffffffff812513ad <radeon_compute_pll_avivo+93/324>
   8:   eb 5f                     jmp    69 <_RIP+0x69>
Code;  ffffffff812513af <radeon_compute_pll_avivo+95/324>
   a:   80 e5 20                  and    $0x20,%ch
Code;  ffffffff812513b2 <radeon_compute_pll_avivo+98/324>
   d:   74 08                     je     17 <_RIP+0x17>
Code;  ffffffff812513b4 <radeon_compute_pll_avivo+9a/324>
   f:   8b 73 1c                  mov    0x1c(%rbx),%esi
Code;  ffffffff812513b7 <radeon_compute_pll_avivo+9d/324>
  12:   8b 4b 20                  mov    0x20(%rbx),%ecx
Code;  ffffffff812513ba <radeon_compute_pll_avivo+a0/324>
  15:   eb 06                     jmp    1d <_RIP+0x1d>
Code;  ffffffff812513bc <radeon_compute_pll_avivo+a2/324>
  17:   8b 73 14                  mov    0x14(%rbx),%esi
Code;  ffffffff812513bf <radeon_compute_pll_avivo+a5/324>
  1a:   8b 4b 18                  mov    0x18(%rbx),%ecx
Code;  ffffffff812513c2 <radeon_compute_pll_avivo+a8/324>
  1d:   85 ff                     test   %edi,%edi
Code;  ffffffff812513c4 <radeon_compute_pll_avivo+aa/324>
  1f:   74 06                     je     27 <_RIP+0x27>
Code;  ffffffff812513c6 <radeon_compute_pll_avivo+ac/324>
  21:   6b f6 0a                  imul   $0xa,%esi,%esi
Code;  ffffffff812513c9 <radeon_compute_pll_avivo+af/324>
  24:   6b c9 0a                  imul   $0xa,%ecx,%ecx
Code;  ffffffff812513cc <radeon_compute_pll_avivo+b2/324>
  27:   31 d2                     xor    %edx,%edx
Code;  ffffffff812513ce <radeon_compute_pll_avivo+b4/324>
  29:   89 f0                     mov    %esi,%eax
Code;  ffffffff812513d0 <radeon_compute_pll_avivo+b6/324>   <=====
  2b:   41 f7 f5                  div    %r13d   <=====
Code;  ffffffff812513d3 <radeon_compute_pll_avivo+b9/324>
  2e:   89 c2                     mov    %eax,%edx
Code;  ffffffff812513d5 <radeon_compute_pll_avivo+bb/324>
  30:   8d 68 01                  lea    0x1(%rax),%ebp
Code;  ffffffff812513d8 <radeon_compute_pll_avivo+be/324>
  33:   41 0f af d5               imul   %r13d,%edx
Code;  ffffffff812513dc <radeon_compute_pll_avivo+c2/324>
  37:   39 f2                     cmp    %esi,%edx
Code;  ffffffff812513de <radeon_compute_pll_avivo+c4/324>
  39:   8b 53 30                  mov    0x30(%rbx),%edx
Code;  ffffffff812513e1 <radeon_compute_pll_avivo+c7/324>
  3c:   0f 43 e8                  cmovae %eax,%ebp
Code;  ffffffff812513e4 <radeon_compute_pll_avivo+ca/324>
  3f:   89                        .byte 0x89
Comment 8 Christian König 2014-05-02 12:29:34 UTC
Created attachment 134701 [details]
Possible fix.

Please try to reproduce the issue with the attached patch applied.

It would still not work correctly (specifying no clock can't work correctly), but it shouldn't crash any more.
Comment 9 Darren Salt 2014-05-02 15:49:33 UTC
No crash now.

However, the kernel log is spammed with
  [drm:drm_crtc_helper_set_config] *ERROR* failed to set mode on [CRTC:14]

I did notice what looked like some X spam too, but that's been lost over an X restart; fairly sure that it was ‘invalid argument’ (which would correspond to that -EINVAL). I had to kill it (telling it that it had segfaulted worked nicely) and there's no evidence of that in Xorg.0.log.old.

I'd say that that DRM error needs to be rate-limited and/or X needs to give up, or at least do something different than what it currently does, if it can't set the mode. (I may be able to get more information about what X is doing if it's needed, but it seemed more important to report these findings first; and that information belongs elsewhere anyway.)
Comment 10 Christian König 2014-05-03 15:24:14 UTC
(In reply to Darren Salt from comment #9)
> No crash now.

Good, thanks for testing.

> However, the kernel log is spammed with
>   [drm:drm_crtc_helper_set_config] *ERROR* failed to set mode on [CRTC:14]

Well, X is sending totally nonsense to the kernel. It's actually good that the logs get spammed so somebody notices that something is really going wrong here.

Note You need to log in before you can comment on or make changes to this bug.