Bug 90861 - Panic on suspend from KDE with radeon
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-06 20:41 UTC by Jon Arne Jørgensen
Modified: 2017-04-26 13:35 UTC

See Also:
Kernel Version: 3.19.0-rc2-219-g693a30b8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
full dmesg from panic (80.26 KB, text/plain)
2015-01-06 21:19 UTC, Jon Arne Jørgensen
Complete Xorg.log (51.17 KB, text/plain)
2015-01-06 21:26 UTC, Jon Arne Jørgensen
Full pm-suspend log (7.52 KB, text/plain)
2015-01-06 21:27 UTC, Jon Arne Jørgensen

Description Jon Arne Jørgensen 2015-01-06 20:41:54 UTC
My laptop is running OpenSUSE 13.1 on a self-compiled kernel.
As of last week, my kernel panics when suspending.

This only happens when I'm logged in to my KDE session.
If I log out of KDE and run pm-suspend as root from VT1 with kdm running, the laptop suspends as expected.

I've also suspended successfully via pm-suspend while my user was logged in to IceWM and to TWM.

My suspicion is that some part of my KDE system was upgraded, and something is preventing radeon from suspending properly.

Here is dmesg from the crash:
[  267.445563] PM: Syncing filesystems ... done.
[  267.704349] PM: Preparing system for mem sleep
[  267.953474] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  267.955241] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  267.956352] PM: Entering mem sleep
[  267.956405] Suspending console(s) (use no_console_suspend to debug)
[  267.956924] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  267.957167] sd 0:0:0:0: [sda] Stopping disk
[  280.034180] radeon 0000:02:00.0: **** DPM device timeout ****
[  280.034185]  ffff88022885b748 ffff8802288546d0 0000000000013180 ffff88022885bfd8
[  280.034188]  0000000000013180 ffff88022e1fa3d0 ffff8802288546d0 ffff88022885b728
[  280.034191]  ffff88022e39c000 ffff88022885b788 00000000ffffec3b ffff88022e39c000
[  280.034192] Call Trace:
[  280.034201]  [<ffffffff81631854>] schedule+0x24/0x70
[  280.034205]  [<ffffffff8163426b>] schedule_timeout+0x14b/0x2d0
[  280.034211]  [<ffffffff810b7ec0>] ? internal_add_timer+0x80/0x80
[  280.034255]  [<ffffffffa01104a9>] radeon_fence_default_wait+0xc9/0x220 [radeon]
[  280.034260]  [<ffffffff810930a0>] ? prepare_to_wait_event+0x100/0x100
[  280.034265]  [<ffffffff81433c94>] fence_wait_timeout+0x34/0x120
[  280.034278]  [<ffffffffa00d8ebf>] ttm_bo_wait+0x14f/0x1b0 [ttm]
[  280.034289]  [<ffffffffa00dba42>] ttm_bo_move_accel_cleanup+0x52/0x3c0 [ttm]
[  280.034310]  [<ffffffffa01122d0>] radeon_move_blit.isra.12+0xc0/0x150 [radeon]
[  280.034321]  [<ffffffffa011309b>] radeon_bo_move+0xab/0x210 [radeon]
[  280.034325]  [<ffffffffa00d9e25>] ttm_bo_handle_move_mem+0x265/0x5c0 [ttm]
[  280.034330]  [<ffffffffa00da851>] ? ttm_bo_mem_space+0x181/0x350 [ttm]
[  280.034334]  [<ffffffffa00da2b2>] ttm_bo_evict+0x132/0x300 [ttm]
[  280.034348]  [<ffffffffa00663ee>] ? drm_vma_offset_add+0x2e/0xc0 [drm]
[  280.034352]  [<ffffffffa00da640>] ttm_mem_evict_first+0x1c0/0x250 [ttm]
[  280.034357]  [<ffffffffa00daa82>] ttm_bo_force_list_clean+0x62/0xb0 [ttm]
[  280.034361]  [<ffffffffa00dac9e>] ttm_bo_evict_mm+0x2e/0x60 [ttm]
[  280.034372]  [<ffffffffa0114485>] radeon_bo_evict_vram+0x15/0x20 [radeon]
[  280.034380]  [<ffffffffa00f66f3>] radeon_suspend_kms+0x163/0x2c0 [radeon]
[  280.034388]  [<ffffffffa00f419a>] radeon_pmops_suspend+0x1a/0x20 [radeon]
[  280.034390]  [<ffffffff8134b535>] pci_pm_suspend+0x75/0x150
[  280.034391]  [<ffffffff8134b4c0>] ? pci_pm_freeze+0xf0/0xf0
[  280.034393]  [<ffffffff8141e449>] dpm_run_callback+0x49/0x160
[  280.034394]  [<ffffffff8141f15d>] __device_suspend+0x12d/0x370
[  280.034395]  [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[  280.034396]  [<ffffffff8141f3ba>] async_suspend+0x1a/0xa0
[  280.034398]  [<ffffffff81076c97>] async_run_entry_fn+0x47/0x160
[  280.034400]  [<ffffffff8106ef40>] process_one_work+0x140/0x420
[  280.034401]  [<ffffffff8106f33b>] worker_thread+0x11b/0x490
[  280.034402]  [<ffffffff8106f220>] ? process_one_work+0x420/0x420
[  280.034403]  [<ffffffff810742fd>] kthread+0xcd/0xf0
[  280.034404]  [<ffffffff81074230>] ? kthread_create_on_node+0x180/0x180
[  280.034405]  [<ffffffff816357ec>] ret_from_fork+0x7c/0xb0
[  280.034406]  [<ffffffff81074230>] ? kthread_create_on_node+0x180/0x180
[  280.034408] Kernel panic - not syncing: radeon 0000:02:00.0: unrecoverable failure
               
[  280.034410] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0-rc2-219-g693a30b8+ #1
[  280.034410] Hardware name: Dell Inc. Studio XPS 1645/0Y517R, BIOS A13 04/01/2011
[  280.034411]  ffff88022885bd30 ffff880237c03d88 ffffffff8162e880 ffffffff81c45c38
[  280.034412]  ffffffff81a7b1e5 ffff880237c03e08 ffffffff81629367 ffff880237bfffc0
[  280.034413]  ffff880200000018 ffff880237c03e18 ffff880237c03db8 ffff880237c03dc0
[  280.034414] Call Trace:
[  280.034416]  <IRQ>  [<ffffffff8162e880>] dump_stack+0x4c/0x6e
[  280.034418]  [<ffffffff81629367>] panic+0xb6/0x1e4
[  280.034420]  [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[  280.034421]  [<ffffffff8141e22d>] dpm_watchdog_handler+0x4d/0x60
[  280.034423]  [<ffffffff810b7f04>] call_timer_fn+0x34/0x150
[  280.034424]  [<ffffffff8141e1e0>] ? pm_dev_dbg+0x80/0x80
[  280.034425]  [<ffffffff810b8354>] run_timer_softirq+0x254/0x300
[  280.034427]  [<ffffffff8105be24>] __do_softirq+0xe4/0x2a0
[  280.034429]  [<ffffffff8105c1ed>] irq_exit+0x9d/0xb0
[  280.034431]  [<ffffffff81004b85>] do_IRQ+0x55/0xf0
[  280.034432]  [<ffffffff816365ea>] common_interrupt+0x6a/0x6a
[  280.034435]  <EOI>  [<ffffffff814f7589>] ? cpuidle_enter_state+0x69/0x1b0
[  280.034436]  [<ffffffff814f7578>] ? cpuidle_enter_state+0x58/0x1b0
[  280.034438]  [<ffffffff814f7782>] cpuidle_enter+0x12/0x20
[  280.034439]  [<ffffffff81093674>] cpu_startup_entry+0x354/0x3f0
[  280.034440]  [<ffffffff81620a30>] rest_init+0x80/0x90
[  280.034443]  [<ffffffff81cf1035>] start_kernel+0x470/0x47d
[  280.034444]  [<ffffffff81cf09b5>] ? set_init_arg+0x57/0x57
[  280.034446]  [<ffffffff81cf05ad>] x86_64_start_reservations+0x2a/0x2c
[  280.034447]  [<ffffffff81cf06a2>] x86_64_start_kernel+0xf3/0xf7

And here is a backtrace from the crash utility:
crash> bt
PID: 0      TASK: ffffffff81c19480  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff880237c03c70] machine_kexec at ffffffff8103c0e1
 #1 [ffff880237c03cc0] crash_kexec at ffffffff810d89ee
 #2 [ffff880237c03d90] panic at ffffffff81629377
 #3 [ffff880237c03e10] dpm_watchdog_handler at ffffffff8141e22d
 #4 [ffff880237c03e30] call_timer_fn at ffffffff810b7f04
 #5 [ffff880237c03e70] run_timer_softirq at ffffffff810b8354
 #6 [ffff880237c03ef0] __do_softirq at ffffffff8105be24
 #7 [ffff880237c03f60] irq_exit at ffffffff8105c1ed
 #8 [ffff880237c03f70] do_IRQ at ffffffff81004b85
--- <IRQ stack> ---
 #9 [ffffffff81c03dc8] ret_from_intr at ffffffff816365ea
    [exception RIP: cpuidle_enter_state+105]
    RIP: ffffffff814f7589  RSP: ffffffff81c03e78  RFLAGS: 00000202
    RAX: 0000004133273fbb  RBX: 0000000000000046  RCX: 0000000000000018
    RDX: 0000000000940ffe  RSI: 0000000000000046  RDI: ffffffff81c1da40
    RBP: ffffffff81c03eb8   R8: 0000000000000001   R9: 0000000000060e2d
    R10: 00000000000935d2  R11: 0000000000000001  R12: 0000000000000046
    R13: 0000000000000004  R14: 0000000000000005  R15: ffffffff81c03e38
    ORIG_RAX: ffffffffffffffbe  CS: 0010  SS: 0018
#10 [ffffffff81c03ec0] cpuidle_enter at ffffffff814f7782
#11 [ffffffff81c03ed0] cpu_startup_entry at ffffffff81093674
#12 [ffffffff81c03f50] rest_init at ffffffff81620a30
#13 [ffffffff81c03f70] start_kernel at ffffffff81cf1035
#14 [ffffffff81c03fc0] x86_64_start_reservations at ffffffff81cf05ad
#15 [ffffffff81c03fd0] x86_64_start_kernel at ffffffff81cf06a2
crash> 

Note, I'm running with radeon.dpm=1.
I should probably try suspend without that option.

I might be able to bisect if needed. The oldest kernel I tried is 3.11.10-25-desktop from OpenSUSE, but it still crashed on suspend.

Just ask if you need more info.
I have a coredump of the panic, and can probably get more info if needed.
Comment 1 Alex Deucher 2015-01-06 20:44:17 UTC
Please attach your xorg log and full dmesg output.
Comment 2 Jon Arne Jørgensen 2015-01-06 21:19:39 UTC
Created attachment 162731
full dmesg from panic
Comment 3 Jon Arne Jørgensen 2015-01-06 21:26:04 UTC
Created attachment 162741
Complete Xorg.log

This is the Xorg.log from a reproduced crash, because I forgot to save the Xorg log when I ran the crashkernel.
Comment 4 Jon Arne Jørgensen 2015-01-06 21:27:06 UTC
Created attachment 162751
Full pm-suspend log

This is also from a reproduced crash.
Comment 5 Jon Arne Jørgensen 2015-01-06 21:30:49 UTC
I also reproduced the panic when suspending with radeon loaded without radeon.dpm=1.

I was _not_ able to reproduce the panic with kernel-3.10.0-rc7 that I had lying around in my /boot directory.
Comment 6 Michel Dänzer 2015-01-07 01:26:48 UTC
Looks like dpm_watchdog_handler kicks in while the radeon driver is waiting for a fence to signal. Maybe the Radeon GPU IRQ is disabled too early during suspend or something like that?
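
A minimal userspace model of that hypothesis, not kernel code: the suspend callback blocks until a fence signals while a watchdog runs in parallel. The names simulate_suspend, fence_signal_secs and watchdog_secs are hypothetical, and the 12-second figure used in the note below is only read off the dmesg timestamps in the description (267.957 to 280.034).

```c
#include <assert.h>
#include <stdio.h>

/* Outcome of a simulated device suspend callback. */
enum suspend_result { SUSPEND_OK, SUSPEND_WATCHDOG_PANIC };

/*
 * Hypothetical model, not kernel code. The suspend callback blocks until
 * the fence signals, while a watchdog armed at t=0 expires after
 * watchdog_secs. A negative fence_signal_secs means the fence never
 * signals (for example because its interrupt is never delivered).
 */
static enum suspend_result simulate_suspend(int fence_signal_secs,
                                            int watchdog_secs)
{
    assert(watchdog_secs > 0);

    for (int now = 0; ; now++) {
        /* The waiter wakes up as soon as the fence has signaled. */
        if (fence_signal_secs >= 0 && now >= fence_signal_secs)
            return SUSPEND_OK;
        /* Otherwise the watchdog gives up once its timeout elapses. */
        if (now >= watchdog_secs)
            return SUSPEND_WATCHDOG_PANIC;
    }
}

/* Human-readable name for a result, for printing. */
static const char *suspend_result_name(enum suspend_result r)
{
    return r == SUSPEND_OK ? "suspend ok" : "watchdog panic";
}
```

In this model, simulate_suspend(-1, 12) returns SUSPEND_WATCHDOG_PANIC (the fence never signals, so the watchdog expires first), while simulate_suspend(1, 12) returns SUSPEND_OK.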

(In reply to Jon Arne Jørgensen from comment #5)
> I was _not_ able to reproduce the panic with kernel-3.10.0-rc7 that I had
> lying around in my /boot directory.

Can you bisect?
Comment 7 Jon Arne Jørgensen 2015-01-13 20:14:27 UTC
I retried some kernels, and it seems like the problem was introduced in the 3.18 merge window.

In v3.17 I can't reproduce the panic, while in v3.18-rc1 and later I can reproduce the panic.

I'm trying to bisect now, but I'm troubled by a gpu freeze bug in some of the commits that crashes Xorg before I'm able to suspend the computer.

It looks like a bug early in the v3.18 merge window, but I'm not sure if I should skip the crashing commits while doing the bisect?
Comment 8 Michel Dänzer 2015-01-14 01:34:13 UTC
(In reply to Jon Arne Jørgensen from comment #7)
> I'm trying to bisect now, but I'm troubled by a gpu freeze bug in some of
> the commits that crashes Xorg before I'm able to suspend the computer.
> 
> It looks like a bug early in the v3.18 merge window, but I'm not sure if I
> should skip the crashing commits while doing the bisect?

I'd skip them for now.
Comment 9 Jon Arne Jørgensen 2015-01-22 20:48:22 UTC
Finally managed to get a clean bisect. This is the culprit:

commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Wed Apr 2 17:14:48 2014 +0200

    drm/ttm: flip the switch, and convert to dma_fence
    
    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Comment 10 Alex Deucher 2015-01-22 20:55:36 UTC
Possibly related to bug 90741.
Comment 11 Michel Dänzer 2015-01-23 03:24:41 UTC
Maarten, any ideas?
Comment 12 Jon Arne Jørgensen 2015-01-25 23:02:42 UTC
I tested attachment 164421 from bug 90741, and can report that the patch seems to fix the crash.
Comment 13 Maarten Lankhorst 2015-02-23 07:26:50 UTC
Does attachment 166571 fix the bug for you too? I think that would be the best version to upstream.
Comment 14 Jon Arne Jørgensen 2015-03-03 10:48:51 UTC
I'm sorry to report that attachment 166571 alone doesn't fix the crash,
but your suggestion from Bug 90741 comment 60 does fix it.

That is, attachment 166571 plus
if (fence->ring == R600_RING_TYPE_DMA_INDEX) udelay(50);
in radeon_fence_enable_signaling, after the irq_get.

I tried both udelay(50); and mb(); and both seem to work.

I was not able to compile the kernel with "RREG32(DMA_CNTL + DMA0_REGISTER_OFFSET);" because of missing defines.
What header should I include?
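
The placement described above can be sketched in userspace with mocked functions. Everything prefixed mock_ or MOCK_ is a hypothetical stand-in and not the radeon driver's API; only the ring condition and the 50 microsecond delay come from this comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in ring identifiers; the real definitions live in the radeon driver. */
enum mock_ring { MOCK_RING_GFX, MOCK_RING_DMA };

struct mock_fence { enum mock_ring ring; };

/* Records the order of mocked calls so the placement can be inspected. */
static char call_log[64];

static void log_call(const char *name)
{
    strncat(call_log, name, sizeof(call_log) - strlen(call_log) - 1);
}

/* Mock of "take a reference on the ring's software interrupt". */
static void mock_irq_get(enum mock_ring ring)
{
    (void)ring;
    log_call("irq_get;");
}

/* Mock of a busy-wait delay; it only records that it was called. */
static void mock_udelay(unsigned int usecs)
{
    (void)usecs;
    log_call("udelay;");
}

/* Sketch of the tested workaround: delay only for DMA-ring fences,
 * and only after the interrupt reference has been taken. */
static bool mock_fence_enable_signaling(struct mock_fence *fence)
{
    assert(fence != NULL);

    mock_irq_get(fence->ring);
    if (fence->ring == MOCK_RING_DMA)
        mock_udelay(50);
    return true;
}
```

Calling mock_fence_enable_signaling on a fence with ring MOCK_RING_DMA leaves "irq_get;udelay;" in call_log, while a MOCK_RING_GFX fence records only "irq_get;".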
Comment 15 Maarten Lankhorst 2015-03-03 10:56:26 UTC
Ok, in that case it's probably a duplicate.

Can you try attachment 166571 on top of http://cgit.freedesktop.org/~agd5f/linux/log/?h=posting-read?
Comment 16 Jon Arne Jørgensen 2015-03-03 11:34:38 UTC
Yep, that's a fix.
