Bug 77001 - Radeon R9 270X GPU lockup and resume failure after all night inactivity [drm,radeonsi]
Summary: Radeon R9 270X GPU lockup and resume failure after all night inactivity [drm,...
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-28 12:50 UTC by Grzegorz Kowal
Modified: 2014-07-02 20:49 UTC (History)
1 user (show)

See Also:
Kernel Version: 3.14.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log with the lockup and following boot messages (19.00 KB, application/gzip)
2014-05-28 12:50 UTC, Grzegorz Kowal
Details
Boot log when GPU cannot be initialized. (121.39 KB, application/octet-stream)
2014-06-03 11:25 UTC, Grzegorz Kowal
Details

Description Grzegorz Kowal 2014-05-28 12:50:29 UTC
Created attachment 137581 [details]
kernel log with the lockup and following boot messages

After leaving the computer on during the night it hung up in the morning, while I tried to use it again, with the following message:

------
May 28 06:13:06 [kernel] [153149.666146] radeon 0000:01:00.0: GPU lockup CP stall for more than 10033msec
May 28 06:13:06 [kernel] [153149.666150] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000071debe last fence id 0x000000000071debd on ring 0)
May 28 06:13:06 [kernel] [153150.122657] radeon 0000:01:00.0: GPU lockup CP stall for more than 10490msec
May 28 06:13:06 [kernel] [153150.122661] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000071debe last fence id 0x000000000071debd on ring 0)
May 28 06:13:06 [kernel] [153150.122664] radeon 0000:01:00.0: failed to get a new IB (-35)
May 28 06:13:06 [kernel] [153150.124014] radeon 0000:01:00.0: sa_manager is not empty, clearing anyway
May 28 06:13:07 [kernel] [153150.927575] radeon 0000:01:00.0: Saved 3296 dwords of commands on ring 0.
May 28 06:13:07 [kernel] [153150.927713] radeon 0000:01:00.0: GPU softreset: 0x0000004D
May 28 06:13:07 [kernel] [153150.927721] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA3503028
May 28 06:13:07 [kernel] [153150.927726] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x28000006
May 28 06:13:07 [kernel] [153150.927732] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x2D000006
May 28 06:13:07 [kernel] [153150.927736] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
May 28 06:13:07 [kernel] [153150.927850] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
May 28 06:13:07 [kernel] [153150.927854] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
May 28 06:13:07 [kernel] [153150.927859] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00004100
May 28 06:13:07 [kernel] [153150.927863] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00028986
May 28 06:13:07 [kernel] [153150.927867] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x800282E7
May 28 06:13:07 [kernel] [153150.927874] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483146
May 28 06:13:07 [kernel] [153150.927878] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:07 [kernel] [153150.927883] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
May 28 06:13:07 [kernel] [153150.927887] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
May 28 06:13:08 [kernel] [153152.248694] radeon 0000:01:00.0: Wait for MC idle timedout !
May 28 06:13:08 [kernel] [153152.248703] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
May 28 06:13:08 [kernel] [153152.248760] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100100
May 28 06:13:08 [kernel] [153152.249931] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
May 28 06:13:08 [kernel] [153152.249937] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
May 28 06:13:08 [kernel] [153152.249941] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
May 28 06:13:08 [kernel] [153152.249945] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
May 28 06:13:08 [kernel] [153152.250059] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
May 28 06:13:08 [kernel] [153152.250088] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
May 28 06:13:08 [kernel] [153152.250092] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
May 28 06:13:08 [kernel] [153152.250099] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
May 28 06:13:08 [kernel] [153152.250109] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
May 28 06:13:08 [kernel] [153152.250114] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:08 [kernel] [153152.250118] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:08 [kernel] [153152.250372] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
May 28 06:13:13 [kernel] [153157.261944] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
May 28 06:13:13 [kernel] [153157.261952] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing C008 (len 254, WS 0, PS 4) @ 0xC032
May 28 06:13:13 [kernel] [153157.261957] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing B67A (len 94, WS 12, PS 8) @ 0xB6C3
May 28 06:13:13 [kernel] [153157.276909] [drm] probing gen 2 caps for device 1002:5a16 = 33ed02/0
May 28 06:13:13 [kernel] [153157.276918] [drm] PCIE gen 2 link speeds already enabled
May 28 06:13:14 [kernel] [153157.761994] radeon 0000:01:00.0: Wait for MC idle timedout !
May 28 06:13:14 [kernel] [153157.982771] radeon 0000:01:00.0: Wait for MC idle timedout !
May 28 06:13:14 [kernel] [153157.988649] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
May 28 06:13:14 [kernel] [153157.988769] radeon 0000:01:00.0: WB enabled
May 28 06:13:14 [kernel] [153157.988775] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8800ba368c00
May 28 06:13:14 [kernel] [153157.988780] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8800ba368c04
May 28 06:13:14 [kernel] [153157.988785] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8800ba368c08
May 28 06:13:14 [kernel] [153157.988792] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8800ba368c0c
May 28 06:13:14 [kernel] [153157.988798] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8800ba368c10
May 28 06:13:14 [kernel] [153157.989912] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90010335a18
May 28 06:13:15 [kernel] [153158.516551] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
May 28 06:13:15 [kernel] [153158.516557] [drm:si_resume] *ERROR* si startup failed on resume
May 28 06:13:15 [kernel] [153158.516629] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
May 28 06:13:16 [kernel] [153159.392111] radeon 0000:01:00.0: Saved 9824 dwords of commands on ring 0.
May 28 06:13:16 [kernel] [153159.392244] radeon 0000:01:00.0: GPU softreset: 0x00000048
May 28 06:13:16 [kernel] [153159.392246] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
May 28 06:13:16 [kernel] [153159.392249] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
May 28 06:13:16 [kernel] [153159.392251] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
May 28 06:13:16 [kernel] [153159.392253] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
May 28 06:13:16 [kernel] [153159.392363] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
May 28 06:13:16 [kernel] [153159.392365] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
May 28 06:13:16 [kernel] [153159.392367] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010100
May 28 06:13:16 [kernel] [153159.392369] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00420182
May 28 06:13:16 [kernel] [153159.392371] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84038243
May 28 06:13:16 [kernel] [153159.392375] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:16 [kernel] [153159.392377] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:16 [kernel] [153159.392380] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
May 28 06:13:16 [kernel] [153159.392383] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
May 28 06:13:17 [kernel] [153160.661846] radeon 0000:01:00.0: Wait for MC idle timedout !
May 28 06:13:17 [kernel] [153160.661850] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
May 28 06:13:17 [kernel] [153160.661904] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
May 28 06:13:17 [kernel] [153160.663060] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
May 28 06:13:17 [kernel] [153160.663062] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
May 28 06:13:17 [kernel] [153160.663067] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
May 28 06:13:17 [kernel] [153160.663075] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
May 28 06:13:17 [kernel] [153160.663186] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
May 28 06:13:17 [kernel] [153160.663193] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
May 28 06:13:17 [kernel] [153160.663197] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
May 28 06:13:17 [kernel] [153160.663199] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
May 28 06:13:17 [kernel] [153160.663200] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
May 28 06:13:17 [kernel] [153160.663207] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:17 [kernel] [153160.663209] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
May 28 06:13:17 [kernel] [153160.663478] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
------

After that it didn't respond anymore, not even through ssh, so hard reset was required.

I've noticed that simply pressing the reset button is not enough, and hard reset with powering off the computer is necessary.
Comment 1 Alex Deucher 2014-05-28 16:17:18 UTC
Is this easily reproducible?  The GPU hang is probably caused by a particular application.  Were you running anything in particular when it hung?  Maybe a GL screensaver?  If so, can you narrow down what app caused this?
Comment 2 Grzegorz Kowal 2014-05-29 13:37:36 UTC
This type of hang happened for the first time here. I was logged in the KDE with KWin compositing based on OpenGL 3.1. I don't use any screensaver.

I switched compositing type to OpenGL 2.0 now, and recompiled the kernel with suspend to RAM and standby turned off.

My software stack is:

Gentoo default/linux/amd64/13.0/desktop/kde
xorg-server-1.15.0

mesa git -> aa78c4586d42856dcc86b143c2e72c05a1422016
libdrm git -> 8fc62ca8ac010659023bb63c4759eb683de4f9af
glamor git -> a4fbc7732a1e56de385f50b778aafbdd186e015c
xf86-video-ati git -> 851b2cf8714618843725f6d067915375485ade9d

I'll try to do some tests to see if the problem can be reproduced with this configuration.
Comment 3 Grzegorz Kowal 2014-05-30 12:39:41 UTC
This morning at 6:27am it happened again. The computer was idle during the whole night, even during the hang. When I touched the keyboard around 8am, the monitor turned on, but I saw dark flickering screen only. To check if kwin was responsible for the hang, I logged out from the KDE session yesterday, so only KDM was running.

This time I could log in through ssh, copy the kernel log and reboot the machine though.

The kernel log contained:

radeon 0000:01:00.0: GPU lockup CP stall for more than 15500msec
radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000026efd3 last fence id 0x000000000026efd2 on ring 0)
radeon 0000:01:00.0: failed to get a new IB (-35)
radeon 0000:01:00.0: failed to get a new IB (-35)
radeon 0000:01:00.0: failed to get a new IB (-35)
radeon 0000:01:00.0: sa_manager is not empty, clearing anyway
radeon 0000:01:00.0: Saved 2365 dwords of commands on ring 0.
radeon 0000:01:00.0: GPU softreset: 0x0000004D
radeon 0000:01:00.0:   GRBM_STATUS               = 0xB3507028
radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x28000006
radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x2F000006
radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00408002
radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84228647
radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483146
radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
radeon 0000:01:00.0: Wait for MC idle timedout !
radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100100
radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000EC0
radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing C008 (len 254, WS 0, PS 4) @ 0xC032
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing B67A (len 94, WS 12, PS 8) @ 0xB6C3
[drm] probing gen 2 caps for device 1002:5a16 = 33ed01/0
radeon 0000:01:00.0: Wait for MC idle timedout !
radeon 0000:01:00.0: Wait for MC idle timedout !
[drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
radeon 0000:01:00.0: WB enabled
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8800ba363c00
radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8800ba363c04
radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8800ba363c08
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8800ba363c0c
radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8800ba363c10
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90010335a18
[drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
[drm:si_resume] *ERROR* si startup failed on resume
[drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
Comment 4 Grzegorz Kowal 2014-06-03 11:24:30 UTC
I've got this again. This time with radeon.dpm=0 and power profile set to "low".

Sometimes when I switch the computer on, it seems that system does not initiate the GPU properly and monitor stays off. The system loads but with the following error:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
IP: [<ffffffff81337bc3>] radeon_driver_open_kms+0x83/0x200

The full log is attached.
Comment 5 Grzegorz Kowal 2014-06-03 11:25:12 UTC
Created attachment 137941 [details]
Boot log when GPU cannot be initialized.
Comment 6 Grzegorz Kowal 2014-07-02 20:49:30 UTC
With the recent kernels 3.14.8 to 3.14.10 and mesa, xf86-video-ati, and libdrm from git I didn't get the resume failures anymore. My usual uptime is of several days (reboot only after the minor kernel version update).

However, when the machine is booted after being shut down completely, sometimes I am getting problems with the video initialization at the BIOS POST stage. But it looks like a different problem, maybe BIOS settings or hardware issues, so I am closing this bug report as resolved.

Note You need to log in before you can comment on or make changes to this bug.