I'm seeing a divide error during X startup, causing X to hang (requiring reboot to clear). The trigger is a commit in xserver git: commit 4c3932620c29c91dfbbc8eb09c84efcaa7ec873e Author: Keith Packard <keithp@keithp.com> Date: Fri Apr 25 08:22:15 2014 -0700 hw/xfree86: Restore API compatibility for cursor loading functions The effect is this: divide error: 0000 [#1] PREEMPT SMP Modules linked in: [...] CPU: 0 PID: 6579 Comm: Xorg Not tainted 3.15.0-rc3 #2 Hardware name: Gigabyte Technology Co., Ltd. GA-870A-UD3/GA-870A-UD3, BIOS F5 08/01/2011 task: ffff8800b12d1850 ti: ffff88021b08e000 task.ti: ffff88021b08e000 RIP: 0010:[<ffffffff812513d0>] [<ffffffff812513d0>] radeon_compute_pll_avivo+0xb6/0x324 RSP: 0018:ffff88021b08fa08 EFLAGS: 00010246 RAX: 00000000000927c0 RBX: ffff880236dc8210 RCX: 0000000000124f80 RDX: 0000000000000000 RSI: 00000000000927c0 RDI: 0000000000000400 RBP: ffff8800bd6d2c00 R08: ffff88021b08fb08 R09: ffff88021b08fb00 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880236dc8000 R13: 0000000000000000 R14: 0000000000004ff6 R15: 00000000000003ff FS: 00007f39298d1980(0000) GS:ffff88023fc00000(0000) knlGS:00000000e7063b40 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f392a5d3000 CR3: 000000021b00b000 CR4: 00000000000007f0 Stack: 0000cf3700000004 ffff88021b08fa88 0000002800000000 ffff88021b08fb04 0000000000000000 0000000000000002 ffff88021b08fafc 000000000000cf0c ffff88021b08fb00 ffffffff81227da9 0000001e00000004 ffff8802364e6c00 Call Trace: [<ffffffff81227da9>] ? drm_detect_hdmi_monitor+0x54/0x69 [<ffffffff8123ed95>] ? atombios_crtc_mode_set+0x100/0x593 [<ffffffff812d6725>] ? class_dev_iter_next+0xc/0x33 [<ffffffff812d68ed>] ? class_for_each_device+0x8a/0x9f [<ffffffff81213ccd>] ? drm_crtc_helper_set_mode+0x2a3/0x424 [<ffffffff812145a5>] ? drm_crtc_helper_set_config+0x601/0x853 [<ffffffff81250650>] ? radeon_crtc_set_config+0x3e/0xd9 [<ffffffff81222183>] ? drm_mode_set_config_internal+0x48/0xc0 [<ffffffff81224843>] ? drm_mode_setcrtc+0x3d3/0x487 [<ffffffff81250dac>] ? radeon_crtc_load_lut+0x275/0x630 [<ffffffff81219f06>] ? drm_ioctl+0x337/0x3a6 [<ffffffff81224470>] ? drm_mode_setplane+0x325/0x325 [<ffffffff81468fcc>] ? _raw_spin_unlock_irqrestore+0xf/0x21 [<ffffffff81234864>] ? radeon_drm_ioctl+0x42/0x6e [<ffffffff810b10a7>] ? do_vfs_ioctl+0x356/0x420 [<ffffffff810b8747>] ? __fget+0x64/0x6c [<ffffffff810b11a4>] ? SyS_ioctl+0x33/0x58 [<ffffffff81469ca6>] ? system_call_fastpath+0x1a/0x1f Code: 09 8b 6b 08 89 6c 24 20 eb 5f 80 e5 20 74 08 8b 73 1c 8b 4b 20 eb 06 8b 73 14 8b 4b 18 85 ff 74 06 6b f6 0a 6b c9 0a 31 d2 89 f0 <41> f7 f5 89 c2 8d 68 01 41 0f af d5 39 f2 8b 53 30 0f 43 e8 89 RIP [<ffffffff812513d0>] radeon_compute_pll_avivo+0xb6/0x324 RSP <ffff88021b08fa08> Graphics card is R7 260X; two monitors (both 1080p) are attached, one via HDMI-0 and one via DVI-0 (as reported by xrandr). It may be relevant that as part of my user-specific X startup scripts, the HDMI-attached monitor (AOC I2367F) has its frame rate set to 50Hz.
(In reply to Darren Salt from comment #0) > I'm seeing a divide error during X startup, causing X to hang (requiring > reboot to clear). The trigger is a commit in xserver git: > > commit 4c3932620c29c91dfbbc8eb09c84efcaa7ec873e > Author: Keith Packard <keithp@keithp.com> > Date: Fri Apr 25 08:22:15 2014 -0700 > > hw/xfree86: Restore API compatibility for cursor loading functions That commit broke the Xorg video driver ABI. Did you rebuild xf86-video-ati Git[0] when crossing this xserver Git commit?
Apart from the X problem the kernel shouldn't run into a div zero error but return -EINVAL instead when some of the parameters are invalid. Going to take a look into that direction as well.
I didn't rebuild the DDX. (May be worth doing just to see if the kernel bug is still triggered.)
It appears that you do actually need mismatched builds of xserver git and xf86-video-ati git, the latter built against (ideally) 1.15.99.902, to trigger this kernel bug... I wonder why it's gone apparently unnoticed until now. ☺
(In reply to Darren Salt from comment #4) > It appears that you do actually need mismatched builds of xserver git and > xf86-video-ati git, the latter built against (ideally) 1.15.99.902, to > trigger this kernel bug... I wonder why it's gone apparently unnoticed until > now. ☺ I've reworked that function for 3.15. It's likely that with mismatched DDX and X you pass some parameters as zero into the kernel and so trigger a divide by zero error. Allready digging into this, just give me a day or two to catch up with the bugs.
And please provide the output of: gdb /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/radeon/radeon.ko -ex 'list *(radeon_compute_pll_avivo+0xb6)' -ex q Thanks in advance, Christian.
Can't do that (radeon module is built in, and no debug symbols – should switch that on), but ksymoops output (below) is clear enough for me to determine that the marked line is where the bug makes itself known: if (pll->flags & RADEON_PLL_USE_FRAC_FB_DIV) { vco_min *= 10; vco_max *= 10; } /* here */ post_div_min = vco_min / target_clock; if ((target_clock * post_div_min) < vco_min) ++post_div_min; if (post_div_min < pll->min_post_div) post_div_min = pll->min_post_div; Which means that target_clock is zero, which means that freq is 0 or freq/10 is 0. >>RIP; ffffffff812513d0 <radeon_compute_pll_avivo+b6/324> <===== >>RAX; 00000000000927c0 <__per_cpu_end+804c0/fedd00> >>RBX; ffff880236dc8210 <phys_startup_64+ffff880235dc8210/ffffffff80000000> >>RCX; 0000000000124f80 <__per_cpu_end+112c80/fedd00> >>RSI; 00000000000927c0 <__per_cpu_end+804c0/fedd00> >>RBP; ffff8800bd6d2c00 <phys_startup_64+ffff8800bc6d2c00/ffffffff80000000> >>R08; ffff88021b08fb08 <phys_startup_64+ffff88021a08fb08/ffffffff80000000> >>R09; ffff88021b08fb00 <phys_startup_64+ffff88021a08fb00/ffffffff80000000> >>R12; ffff880236dc8000 <phys_startup_64+ffff880235dc8000/ffffffff80000000> >>R14; 0000000000004ff6 <exception_stacks+ff6/6000> Trace; ffffffff81227da9 <drm_detect_hdmi_monitor+54/69> Trace; ffffffff8123ed95 <atombios_crtc_mode_set+100/593> Trace; ffffffff812d6725 <class_dev_iter_next+c/33> Trace; ffffffff812d68ed <class_for_each_device+8a/9f> Trace; ffffffff81213ccd <drm_crtc_helper_set_mode+2a3/424> Trace; ffffffff812145a5 <drm_crtc_helper_set_config+601/853> Trace; ffffffff81250650 <radeon_crtc_set_config+3e/d9> Trace; ffffffff81222183 <drm_mode_set_config_internal+48/c0> Trace; ffffffff81224843 <drm_mode_setcrtc+3d3/487> Trace; ffffffff81250dac <radeon_crtc_load_lut+275/630> Trace; ffffffff81219f06 <drm_ioctl+337/3a6> Trace; ffffffff81224470 <drm_mode_setcrtc+0/487> Trace; ffffffff81468fcc <_raw_spin_unlock_irqrestore+f/21> Trace; ffffffff81234864 <radeon_drm_ioctl+42/6e> Trace; ffffffff810b10a7 <do_vfs_ioctl+356/420> Trace; ffffffff810b8747 <__fget+64/6c> Trace; ffffffff810b11a4 <sys_ioctl+33/58> Trace; ffffffff81469ca6 <system_call_fastpath+1a/1f> Code; ffffffff812513a5 <radeon_compute_pll_avivo+8b/324> 0000000000000000 <_RIP>: Code; ffffffff812513a5 <radeon_compute_pll_avivo+8b/324> 0: 09 8b 6b 08 89 6c or %ecx,0x6c89086b(%rbx) Code; ffffffff812513ab <radeon_compute_pll_avivo+91/324> 6: 24 20 and $0x20,%al Code; ffffffff812513ad <radeon_compute_pll_avivo+93/324> 8: eb 5f jmp 69 <_RIP+0x69> Code; ffffffff812513af <radeon_compute_pll_avivo+95/324> a: 80 e5 20 and $0x20,%ch Code; ffffffff812513b2 <radeon_compute_pll_avivo+98/324> d: 74 08 je 17 <_RIP+0x17> Code; ffffffff812513b4 <radeon_compute_pll_avivo+9a/324> f: 8b 73 1c mov 0x1c(%rbx),%esi Code; ffffffff812513b7 <radeon_compute_pll_avivo+9d/324> 12: 8b 4b 20 mov 0x20(%rbx),%ecx Code; ffffffff812513ba <radeon_compute_pll_avivo+a0/324> 15: eb 06 jmp 1d <_RIP+0x1d> Code; ffffffff812513bc <radeon_compute_pll_avivo+a2/324> 17: 8b 73 14 mov 0x14(%rbx),%esi Code; ffffffff812513bf <radeon_compute_pll_avivo+a5/324> 1a: 8b 4b 18 mov 0x18(%rbx),%ecx Code; ffffffff812513c2 <radeon_compute_pll_avivo+a8/324> 1d: 85 ff test %edi,%edi Code; ffffffff812513c4 <radeon_compute_pll_avivo+aa/324> 1f: 74 06 je 27 <_RIP+0x27> Code; ffffffff812513c6 <radeon_compute_pll_avivo+ac/324> 21: 6b f6 0a imul $0xa,%esi,%esi Code; ffffffff812513c9 <radeon_compute_pll_avivo+af/324> 24: 6b c9 0a imul $0xa,%ecx,%ecx Code; ffffffff812513cc <radeon_compute_pll_avivo+b2/324> 27: 31 d2 xor %edx,%edx Code; ffffffff812513ce <radeon_compute_pll_avivo+b4/324> 29: 89 f0 mov %esi,%eax Code; ffffffff812513d0 <radeon_compute_pll_avivo+b6/324> <===== 2b: 41 f7 f5 div %r13d <===== Code; ffffffff812513d3 <radeon_compute_pll_avivo+b9/324> 2e: 89 c2 mov %eax,%edx Code; ffffffff812513d5 <radeon_compute_pll_avivo+bb/324> 30: 8d 68 01 lea 0x1(%rax),%ebp Code; ffffffff812513d8 <radeon_compute_pll_avivo+be/324> 33: 41 0f af d5 imul %r13d,%edx Code; ffffffff812513dc <radeon_compute_pll_avivo+c2/324> 37: 39 f2 cmp %esi,%edx Code; ffffffff812513de <radeon_compute_pll_avivo+c4/324> 39: 8b 53 30 mov 0x30(%rbx),%edx Code; ffffffff812513e1 <radeon_compute_pll_avivo+c7/324> 3c: 0f 43 e8 cmovae %eax,%ebp Code; ffffffff812513e4 <radeon_compute_pll_avivo+ca/324> 3f: 89 .byte 0x89
Created attachment 134701 [details] Possible fix. Please try to reproduce the issue with the attached patch applied. It would still not work correctly (specifying no clock can't work correctly), but it shouldn't crash any more.
No crash now. However, the kernel log is spammed with [drm:drm_crtc_helper_set_config] *ERROR* failed to set mode on [CRTC:14] I did notice what looked like some X spam too, but that's been lost over an X restart; fairly sure that it was ‘invalid argument’ (which would correspond to that -EINVAL). I had to kill it (telling it that it had segfaulted worked nicely) and there's no evidence of that in Xorg.0.log.old. I'd say that that DRM error needs to be rate-limited and/or X needs to give up, or at least do something different than what it currently does, if it can't set the mode. (I may be able to get more information about what X is doing if it's needed, but it seemed more important to report these findings first; and that information belongs elsewhere anyway.)
(In reply to Darren Salt from comment #9) > No crash now. Good, thanks for testing. > However, the kernel log is spammed with > [drm:drm_crtc_helper_set_config] *ERROR* failed to set mode on [CRTC:14] Well, X is sending totally nonsense to the kernel. It's actually good that the logs get spammed so somebody notices that something is really going wrong here.