My laptop is running OpenSUSE 13.1 on a self-compiled kernel. As of last week, the kernel panics when suspending. This only happens when I'm logged in to my KDE session. If I log out of KDE and run pm-suspend as root from VT1 with kdm running, the laptop suspends as expected. I've also suspended successfully via pm-suspend while my user was logged in to both IceWM and TWM.

My suspicion is that some part of my KDE system was upgraded, and something is preventing radeon from suspending properly.

Here is dmesg from the crash:

[ 267.445563] PM: Syncing filesystems ... done.
[ 267.704349] PM: Preparing system for mem sleep
[ 267.953474] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 267.955241] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 267.956352] PM: Entering mem sleep
[ 267.956405] Suspending console(s) (use no_console_suspend to debug)
[ 267.956924] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 267.957167] sd 0:0:0:0: [sda] Stopping disk
[ 280.034180] radeon 0000:02:00.0: **** DPM device timeout ****
[ 280.034185] ffff88022885b748 ffff8802288546d0 0000000000013180 ffff88022885bfd8
[ 280.034188] 0000000000013180 ffff88022e1fa3d0 ffff8802288546d0 ffff88022885b728
[ 280.034191] ffff88022e39c000 ffff88022885b788 00000000ffffec3b ffff88022e39c000
[ 280.034192] Call Trace:
[ 280.034201] [<ffffffff81631854>] schedule+0x24/0x70
[ 280.034205] [<ffffffff8163426b>] schedule_timeout+0x14b/0x2d0
[ 280.034211] [<ffffffff810b7ec0>] ? internal_add_timer+0x80/0x80
[ 280.034255] [<ffffffffa01104a9>] radeon_fence_default_wait+0xc9/0x220 [radeon]
[ 280.034260] [<ffffffff810930a0>] ? prepare_to_wait_event+0x100/0x100
[ 280.034265] [<ffffffff81433c94>] fence_wait_timeout+0x34/0x120
[ 280.034278] [<ffffffffa00d8ebf>] ttm_bo_wait+0x14f/0x1b0 [ttm]
[ 280.034289] [<ffffffffa00dba42>] ttm_bo_move_accel_cleanup+0x52/0x3c0 [ttm]
[ 280.034310] [<ffffffffa01122d0>] radeon_move_blit.isra.12+0xc0/0x150 [radeon]
[ 280.034321] [<ffffffffa011309b>] radeon_bo_move+0xab/0x210 [radeon]
[ 280.034325] [<ffffffffa00d9e25>] ttm_bo_handle_move_mem+0x265/0x5c0 [ttm]
[ 280.034330] [<ffffffffa00da851>] ? ttm_bo_mem_space+0x181/0x350 [ttm]
[ 280.034334] [<ffffffffa00da2b2>] ttm_bo_evict+0x132/0x300 [ttm]
[ 280.034348] [<ffffffffa00663ee>] ? drm_vma_offset_add+0x2e/0xc0 [drm]
[ 280.034352] [<ffffffffa00da640>] ttm_mem_evict_first+0x1c0/0x250 [ttm]
[ 280.034357] [<ffffffffa00daa82>] ttm_bo_force_list_clean+0x62/0xb0 [ttm]
[ 280.034361] [<ffffffffa00dac9e>] ttm_bo_evict_mm+0x2e/0x60 [ttm]
[ 280.034372] [<ffffffffa0114485>] radeon_bo_evict_vram+0x15/0x20 [radeon]
[ 280.034380] [<ffffffffa00f66f3>] radeon_suspend_kms+0x163/0x2c0 [radeon]
[ 280.034388] [<ffffffffa00f419a>] radeon_pmops_suspend+0x1a/0x20 [radeon]
[ 280.034390] [<ffffffff8134b535>] pci_pm_suspend+0x75/0x150
[ 280.034391] [<ffffffff8134b4c0>] ? pci_pm_freeze+0xf0/0xf0
[ 280.034393] [<ffffffff8141e449>] dpm_run_callback+0x49/0x160
[ 280.034394] [<ffffffff8141f15d>] __device_suspend+0x12d/0x370
[ 280.034395] [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[ 280.034396] [<ffffffff8141f3ba>] async_suspend+0x1a/0xa0
[ 280.034398] [<ffffffff81076c97>] async_run_entry_fn+0x47/0x160
[ 280.034400] [<ffffffff8106ef40>] process_one_work+0x140/0x420
[ 280.034401] [<ffffffff8106f33b>] worker_thread+0x11b/0x490
[ 280.034402] [<ffffffff8106f220>] ? process_one_work+0x420/0x420
[ 280.034403] [<ffffffff810742fd>] kthread+0xcd/0xf0
[ 280.034404] [<ffffffff81074230>] ? kthread_create_on_node+0x180/0x180
[ 280.034405] [<ffffffff816357ec>] ret_from_fork+0x7c/0xb0
[ 280.034406] [<ffffffff81074230>] ? kthread_create_on_node+0x180/0x180
[ 280.034408] Kernel panic - not syncing: radeon 0000:02:00.0: unrecoverable failure
[ 280.034410] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0-rc2-219-g693a30b8+ #1
[ 280.034410] Hardware name: Dell Inc. Studio XPS 1645/0Y517R, BIOS A13 04/01/2011
[ 280.034411] ffff88022885bd30 ffff880237c03d88 ffffffff8162e880 ffffffff81c45c38
[ 280.034412] ffffffff81a7b1e5 ffff880237c03e08 ffffffff81629367 ffff880237bfffc0
[ 280.034413] ffff880200000018 ffff880237c03e18 ffff880237c03db8 ffff880237c03dc0
[ 280.034414] Call Trace:
[ 280.034416] <IRQ> [<ffffffff8162e880>] dump_stack+0x4c/0x6e
[ 280.034418] [<ffffffff81629367>] panic+0xb6/0x1e4
[ 280.034420] [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[ 280.034421] [<ffffffff8141e22d>] dpm_watchdog_handler+0x4d/0x60
[ 280.034423] [<ffffffff810b7f04>] call_timer_fn+0x34/0x150
[ 280.034424] [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[ 280.034425] [<ffffffff810b8354>] run_timer_softirq+0x254/0x300
[ 280.034427] [<ffffffff8105be24>] __do_softirq+0xe4/0x2a0
[ 280.034429] [<ffffffff8105c1ed>] irq_exit+0x9d/0xb0
[ 280.034431] [<ffffffff81004b85>] do_IRQ+0x55/0xf0
[ 280.034432] [<ffffffff816365ea>] common_interrupt+0x6a/0x6a
[ 280.034435] <EOI> [<ffffffff814f7589>] ? cpuidle_enter_state+0x69/0x1b0
[ 280.034436] [<ffffffff814f7578>] ? cpuidle_enter_state+0x58/0x1b0
[ 280.034438] [<ffffffff814f7782>] cpuidle_enter+0x12/0x20
[ 280.034439] [<ffffffff81093674>] cpu_startup_entry+0x354/0x3f0
[ 280.034440] [<ffffffff81620a30>] rest_init+0x80/0x90
[ 280.034443] [<ffffffff81cf1035>] start_kernel+0x470/0x47d
[ 280.034444] [<ffffffff81cf09b5>] ? set_init_arg+0x57/0x57
[ 280.034446] [<ffffffff81cf05ad>] x86_64_start_reservations+0x2a/0x2c
[ 280.034447] [<ffffffff81cf06a2>] x86_64_start_kernel+0xf3/0xf7

And here is a backtrace from the crash utility:

crash> bt
PID: 0 TASK: ffffffff81c19480 CPU: 0 COMMAND: "swapper/0"
#0 [ffff880237c03c70] machine_kexec at ffffffff8103c0e1
#1 [ffff880237c03cc0] crash_kexec at ffffffff810d89ee
#2 [ffff880237c03d90] panic at ffffffff81629377
#3 [ffff880237c03e10] dpm_watchdog_handler at ffffffff8141e22d
#4 [ffff880237c03e30] call_timer_fn at ffffffff810b7f04
#5 [ffff880237c03e70] run_timer_softirq at ffffffff810b8354
#6 [ffff880237c03ef0] __do_softirq at ffffffff8105be24
#7 [ffff880237c03f60] irq_exit at ffffffff8105c1ed
#8 [ffff880237c03f70] do_IRQ at ffffffff81004b85
--- <IRQ stack> ---
#9 [ffffffff81c03dc8] ret_from_intr at ffffffff816365ea
    [exception RIP: cpuidle_enter_state+105]
    RIP: ffffffff814f7589 RSP: ffffffff81c03e78 RFLAGS: 00000202
    RAX: 0000004133273fbb RBX: 0000000000000046 RCX: 0000000000000018
    RDX: 0000000000940ffe RSI: 0000000000000046 RDI: ffffffff81c1da40
    RBP: ffffffff81c03eb8 R8: 0000000000000001 R9: 0000000000060e2d
    R10: 00000000000935d2 R11: 0000000000000001 R12: 0000000000000046
    R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81c03e38
    ORIG_RAX: ffffffffffffffbe CS: 0010 SS: 0018
#10 [ffffffff81c03ec0] cpuidle_enter at ffffffff814f7782
#11 [ffffffff81c03ed0] cpu_startup_entry at ffffffff81093674
#12 [ffffffff81c03f50] rest_init at ffffffff81620a30
#13 [ffffffff81c03f70] start_kernel at ffffffff81cf1035
#14 [ffffffff81c03fc0] x86_64_start_reservations at ffffffff81cf05ad
#15 [ffffffff81c03fd0] x86_64_start_kernel at ffffffff81cf06a2
crash>

Note, I'm running with radeon.dpm=1. I should probably try suspending without that option. I might be able to bisect if needed. The oldest kernel I tried is 3.11.10-25-desktop from OpenSUSE, but it still crashed on suspend.

Just ask if you need more info. I have a coredump of the panic, and can probably get more info if needed.
Please attach your xorg log and full dmesg output.
Created attachment 162731 [details] full dmesg from panic
Created attachment 162741 [details] Complete Xorg.log This is the Xorg.log from a reproduced crash, because I forgot to save the Xorg log when I ran the crash kernel.
Created attachment 162751 [details] Full pm-suspend log This is also from a reproduced crash.
I was also able to reproduce the panic when suspending with radeon loaded without radeon.dpm=1. I was _not_ able to reproduce the panic with kernel-3.10.0-rc7 that I had lying around in my /boot directory.
Looks like dpm_watchdog_handler kicks in while the radeon driver is waiting for a fence to signal. Maybe the Radeon GPU IRQ is disabled too early during suspend or something like that?

(In reply to Jon Arne Jørgensen from comment #5)
> I was _not_ able to reproduce the panic with kernel-3.10.0-rc7 that I had
> lying around in my /boot directory.

Can you bisect?
I retried some kernels, and it seems like the problem was introduced in the 3.18 merge window. In v3.17 I can't reproduce the panic, while in v3.18-rc1 and later I can reproduce the panic. I'm trying to bisect now, but I'm troubled by a gpu freeze bug in some of the commits that crashes Xorg before I'm able to suspend the computer. It looks like a bug early in the v3.18 merge window, but I'm not sure if I should skip the crashing commits while doing the bisect?
(In reply to Jon Arne Jørgensen from comment #7)
> I'm trying to bisect now, but I'm troubled by a gpu freeze bug in some of
> the commits that crashes Xorg before I'm able to suspend the computer.
>
> It looks like a bug early in the v3.18 merge window, but I'm not sure if I
> should skip the crashing commits while doing the bisect?

I'd skip them for now.
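For reference, skipping untestable commits during a bisect can be sketched on a throwaway repository as below. This is only an illustration under stated assumptions: the temporary repository, the state and rev files, the tag names and the commit layout are all made up and have nothing to do with the kernel tree. It relies on "git bisect run" treating exit code 125 from the test command as "skip this commit". Whether git actually lands on the untestable commit depends on which midpoints it picks.

```shell
#!/bin/sh
# Toy demonstration of bisecting past an untestable commit.
# Everything here (repo, file names, tags, commit layout) is hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.invalid"
git config user.name "Bisect Demo"

# Eight commits: 1, 2 and 4 are good, 3 cannot be tested, 5-8 are bad.
for i in 1 2 3 4 5 6 7 8; do
    case "$i" in
        1|2|4) state=good ;;
        3)     state=untestable ;;
        *)     state=bad ;;
    esac
    echo "$state" > state
    echo "$i" > rev
    git add state rev
    git commit -q -m "commit $i"
    git tag "c$i"
done

# Bad revision first, then the known-good one.
git bisect start c8 c1 >/dev/null

# Exit 0 marks the commit good, 125 asks git to skip it, 1 marks it bad.
git bisect run sh -c 'case "$(cat state)" in
    good)       exit 0 ;;
    untestable) exit 125 ;;
    *)          exit 1 ;;
esac' >/dev/null

# refs/bisect/bad points at the first bad commit once the run has finished.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"

git bisect reset >/dev/null 2>&1
```

When testing by hand, as with a kernel that has to be booted for every step, the equivalent of exit code 125 is running "git bisect skip" on a commit that cannot be tested, instead of "git bisect good" or "git bisect bad".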
Finally managed to get a clean bisect. This is the culprit:

commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Wed Apr 2 17:14:48 2014 +0200

    drm/ttm: flip the switch, and convert to dma_fence

    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Possibly related to bug 90741.
Maarten, any ideas?
I tested attachment 164421 [details] from bug 90741, and can report that the patch seems to fix the crash.
Does attachment 166571 [details] fix the bug for you too? I think that would be the best version to upstream.
I'm sorry to report that attachment 166571 [details] doesn't fix the crash, but I can also report that your suggestion from Bug 90741 comment 60 does fix the crash. That is, attachment 166571 [details] +

if (fence->ring == R600_RING_TYPE_DMA_INDEX) udelay(50);

in radeon_fence_enable_signaling? after the irq_get.

I tried with udelay(50); and with mb(); both seem to work. I was not able to compile the kernel with "RREG32(DMA_CNTL + DMA0_REGISTER_OFFSET);" because of missing defines. What header should I include?
Ok, in that case it's probably a duplicate. Can you try attachment 166571 [details] on top of http://cgit.freedesktop.org/~agd5f/linux/log/?h=posting-read?
Yep, that's a fix.
This bug should be closed: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.4&id=b6610101718d4ab90d793c482625e98eb1262cad