Bug 81001 - radeon: fence wait failed (-35) after hybrid suspend, leading to GPU reset and hangs
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: i386 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-23 12:06 UTC by Ivan Kalvachev
Modified: 2016-03-23 18:56 UTC

See Also:
Kernel Version: 3.15+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg of suspend/resume session with radeon.ko dpm debug turned on (54.89 KB, application/x-gunzip)
2014-07-23 12:06 UTC, Ivan Kalvachev

Description Ivan Kalvachev 2014-07-23 12:06:35 UTC
Created attachment 144041
dmesg of suspend/resume session with radeon.ko dpm debug turned on

My hardware is a Radeon HD5670 (Redwood).
To reproduce the problem, boot a vanilla 3.15.x kernel and run in KMS mode (no Xorg server is needed). Then do a hybrid suspend (suspend to both RAM and disk) with the following command:

`echo suspend  > /sys/power/disk; echo disk > /sys/power/state`
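The same reproduction step can be wrapped in a small guarded helper. This is only a sketch: the function name `hybrid_suspend` and its optional directory argument are hypothetical, added so the logic can be tried against a scratch directory instead of the real `/sys/power`. It assumes `/sys/power/disk` lists the available hibernation modes when read.

```shell
# Hypothetical helper around the command above.
# $1 (optional) is the power-management sysfs directory; defaults to /sys/power.
hybrid_suspend() {
    power_dir="${1:-/sys/power}"

    # Refuse to continue if the kernel does not list the "suspend" mode.
    if ! grep -qw suspend "$power_dir/disk"; then
        echo "mode 'suspend' is not offered by $power_dir/disk" >&2
        return 1
    fi

    # Select the hybrid mode, then trigger the suspend itself.
    echo suspend > "$power_dir/disk" || return 1
    echo disk > "$power_dir/state"
}
```

Calling `hybrid_suspend` with no argument as root would suspend the machine right away, exactly like the one-liner.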

Resume with the power button. After resume, `dmesg` shows:

[   83.997399] [drm] ring test on 5 succeeded in 1 usecs
[   83.997403] [drm] UVD initialized successfully.
[   83.997450] [drm] ib test on ring 0 succeeded in 0 usecs
[   83.997494] [drm] ib test on ring 3 succeeded in 1 usecs
[   94.137259] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[   94.137263] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[   94.137265] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[   94.137268] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
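To check quickly whether a given resume hit this failure, the kernel log can be filtered for the two error lines above. The function name `find_ib_test_failure` is hypothetical, and the pattern is derived only from the messages quoted in this report, so other kernel versions may word them differently. For reference, on x86 Linux errno 35 is EDEADLK.

```shell
# Hypothetical filter: reads kernel log text on stdin and prints the
# radeon IB-test / fence-wait error lines. Exit status is 0 if any matched.
find_ib_test_failure() {
    grep -E 'radeon: (fence wait failed|failed testing IB on ring [0-9]+) \(-[0-9]+\)'
}
```

Typical use would be `dmesg | find_ib_test_failure && echo "bug reproduced"`.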

At this point, if the Xorg server is started, any attempt at VDPAU hardware decoding fails.
If the computer is left running without a reboot, a timeout eventually triggers and a GPU reset may be attempted, which usually hangs the system. (The following log is from an earlier GPU reset, which probably succeeded.)

[    0.000000] Linux version 3.15.2 (root) (gcc version 4.8.3 (GCC) ) #2 SMP
[12398.387691] radeon 0000:01:00.0: ring 5 stalled for more than 242796msec
[12398.387699] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000005 last fence id 0x0000000000000004 on ring 5)
[12398.425151] radeon 0000:01:00.0: Saved 23 dwords of commands on ring 0.
[12398.425167] radeon 0000:01:00.0: GPU softreset: 0x00000009
[12398.425169] radeon 0000:01:00.0:   GRBM_STATUS               = 0xF5703828
[12398.425171] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFC000007
[12398.425173] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.425175] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.425177] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.425179] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.425181] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
[12398.425183] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008004
[12398.425185] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
[12398.425187] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.441572] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[12398.441626] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[12398.442782] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[12398.442783] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[12398.442785] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.442787] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.442789] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.442791] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.442793] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[12398.442795] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[12398.442796] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[12398.442798] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.442813] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[12398.512161] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[12398.516583] [drm] PCIE GART of 1024M enabled (table at 0x000000000025D000).
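The log above contains two dumps of the status registers, one before and one after the soft reset. A small awk filter can list only the registers whose value changed. This is a sketch under the assumption that each register line ends in `NAME = 0xVALUE` as in this log; the function name `reset_register_diff` is hypothetical.

```shell
# Hypothetical helper: reads a dmesg excerpt on stdin and prints
# "NAME old -> new" for every register whose value differs from its
# first occurrence in the input.
reset_register_diff() {
    awk '
        / = 0x/ {
            # Line layout: "... radeon 0000:01:00.0:   NAME   = VALUE"
            name = $(NF - 2)
            value = $NF
            if (name in before) {
                if (before[name] != value)
                    print name, before[name], "->", value
            } else {
                before[name] = value
            }
        }'
}
```

Usage would be along the lines of `dmesg | reset_register_diff`. If the log holds more than two dumps, later values are still compared against the first one seen.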


The bug does not occur if plain suspend to RAM or plain suspend to disk is used instead of the hybrid mode.

I tried to do a bisect, but a number of rc kernels hang on me long before the radeon module is loaded, so a bisect is not feasible at this point.

I suspected that the new async suspend/resume code might be at fault, because I was seeing the video card being turned off (the monitor going off) and then turned on again a moment before the machine shut down completely.

So I found the commits of the async suspend changes (from an article) and reverted them.
Reverting 200421a80f6e0a9e39d698944cc35cba103eb6ce and 3c31b52f96f7b559d950b16113c0f68c72a1985e seems to avoid the effect described above, where the monitor turns off, on, then off again. But it does not fix this bug.

Reverting
7cd0602d7836c0056fe9bdab014d5ac5ec5cb291, 92858c476ec4e99cf0425f05dee109b6a55eb6f8 and
9e5e7910df824ba02aedd2b5d2ca556426ea6d0b, 76569faa62c46382e080c3e190c66e19515aae1c, de377b3972729f00ee236ae4a97393e282ffe391, 28b6fd6e37792b16a56d324841bdb20ab78e4522, a59ffb2062df3a5c346dbed931fa1e587fd0f0f3
does not affect the bug either, so I assume that this bug is not related to the async suspend/resume changes.

If you cannot reproduce the problem, please advise me which commits to revert.

This bug report is a copy of https://bugs.freedesktop.org/show_bug.cgi?id=81620
