Bug 78221

Summary: 3.16 RC1: AMD R9 270 GPU locks up on some heavy 2D activity - GPU VM fault occurs. (possibly DMA copying issue strikes back?)
Product: Drivers
Reporter: t3st3r
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: high
CC: abandonedaccountubdprczb8hs, Actualize.in.Material+bugzillakernel, alexandre.f.demers, alexdeucher, darkbasic, jean.michel.sm, kernel, linux.tester, muhomor.d, q, zazdxscf+bugzilla.kernel.org
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.16-rc1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: patch 1/2
patch 2/2
patch 1/2
patch 2/2

Description t3st3r 2014-06-18 02:20:57 UTC
Configuration:
 AMD R9 270 with 3.16-rc1 kernel.
 Ubuntu 14.04 with the oibaf PPA as a recent open-source graphics stack (Mesa 10.3 git).
 XFCE is used as the DE; compositing is off.

To reproduce:
 Intermittent bug. The R9 270 GPU can sometimes lock up under some heavy 2D-based loads.
 A known way to trigger the bug is to install the Battle for Wesnoth game from the Ubuntu repos and let it display a map with many units and heavy animations while running in windowed mode (BfW is an SDL-based, 2D-only game; it does not make GL calls). At random intervals, usually within roughly 30 minutes, the GPU can lose stability and lock up.

Special considerations:
 1) GPU recovery often succeeds in this case. Still, it is bad to have 20 seconds of black screen, and recovery can sometimes fail after multiple GPU crashes.
 2) Kernels prior to 3.15 had a similar issue. Kernel 3.15 (the release version, but not the -RCs) has been rock solid and, to the best of my knowledge, never crashed the GPU no matter what I attempted. Now 3.16-rc1 crashes again.
 3) Taking 2) and this version history into account, I suspect it could have something to do with commit b5be1a839a33634393394e4782edaa37a4bc1a1e or something around it. Possibly that is what reintroduced the lockups in 3.16. Maybe the underlying causes of the deadlocks were not fixed for the R9 270?
 4) The GPU seems to be stable under the 3D loads I've attempted: it does not crash even after hours of quite demanding 3D loads such as games. Only some specific 2D loads cause such issues.

Crash details look like this:
===CUT===
Jun 18 03:10:01 localhost kernel: [26125.102351] radeon 0000:01:00.0: ring 0 stalled for more than 10263msec
Jun 18 03:10:01 localhost kernel: [26125.102362] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000445274 last fence id 0x0000000000445273 on ring 0)
Jun 18 03:10:01 localhost kernel: [26125.102370] radeon 0000:01:00.0: failed to get a new IB (-35)
Jun 18 03:10:01 localhost kernel: [26125.671219] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x00000000803dae00 flags=0x0000]
Jun 18 03:10:01 localhost kernel: [26125.671232] AMD-Vi: Event logged [
Jun 18 03:10:01 localhost kernel: [26125.671232] radeon 0000:01:00.0: Saved 23200 dwords of commands on ring 0.
Jun 18 03:10:01 localhost kernel: [26125.671240] IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x00000000803dae30 flags=0x0020]
Jun 18 03:10:01 localhost kernel: [26125.671243] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080000100 flags=0x0020]
Jun 18 03:10:01 localhost kernel: [26125.671248] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x00000000803dad00 flags=0x0000]
Jun 18 03:10:01 localhost kernel: [26125.671361] radeon 0000:01:00.0: GPU softreset: 0x0000006C
Jun 18 03:10:01 localhost kernel: [26125.671365] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
Jun 18 03:10:01 localhost kernel: [26125.671367] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun 18 03:10:01 localhost kernel: [26125.671370] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun 18 03:10:01 localhost kernel: [26125.671372] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jun 18 03:10:01 localhost kernel: [26125.671483] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun 18 03:10:01 localhost kernel: [26125.671485] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun 18 03:10:01 localhost kernel: [26125.671487] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
Jun 18 03:10:01 localhost kernel: [26125.671489] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
Jun 18 03:10:01 localhost kernel: [26125.671492] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
Jun 18 03:10:01 localhost kernel: [26125.671494] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483146
Jun 18 03:10:01 localhost kernel: [26125.671496] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C84246
Jun 18 03:10:01 localhost kernel: [26125.671499] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:01 localhost kernel: [26125.671501] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.218193] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Jun 18 03:10:02 localhost kernel: [26126.218247] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
Jun 18 03:10:02 localhost kernel: [26126.219404] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Jun 18 03:10:02 localhost kernel: [26126.219407] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun 18 03:10:02 localhost kernel: [26126.219409] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun 18 03:10:02 localhost kernel: [26126.219411] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jun 18 03:10:02 localhost kernel: [26126.219522] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.219524] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.219526] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.219528] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.219530] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Jun 18 03:10:02 localhost kernel: [26126.219533] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:02 localhost kernel: [26126.219535] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:02 localhost kernel: [26126.219780] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Jun 18 03:10:02 localhost kernel: [26126.246602] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
Jun 18 03:10:02 localhost kernel: [26126.246606] [drm] PCIE gen 2 link speeds already enabled
Jun 18 03:10:02 localhost kernel: [26126.250269] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
Jun 18 03:10:02 localhost kernel: [26126.250405] radeon 0000:01:00.0: WB enabled
Jun 18 03:10:02 localhost kernel: [26126.250408] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff88041651fc00
Jun 18 03:10:02 localhost kernel: [26126.250411] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff88041651fc04
Jun 18 03:10:02 localhost kernel: [26126.250413] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff88041651fc08
Jun 18 03:10:02 localhost kernel: [26126.250415] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff88041651fc0c
Jun 18 03:10:02 localhost kernel: [26126.250417] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff88041651fc10
Jun 18 03:10:02 localhost kernel: [26126.251393] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18
Jun 18 03:10:02 localhost kernel: [26126.436426] [drm] ring test on 0 succeeded in 3 usecs
Jun 18 03:10:02 localhost kernel: [26126.436432] [drm] ring test on 1 succeeded in 1 usecs
Jun 18 03:10:02 localhost kernel: [26126.436437] [drm] ring test on 2 succeeded in 1 usecs
Jun 18 03:10:02 localhost kernel: [26126.436500] [drm] ring test on 3 succeeded in 2 usecs
Jun 18 03:10:02 localhost kernel: [26126.436510] [drm] ring test on 4 succeeded in 1 usecs
Jun 18 03:10:02 localhost kernel: [26126.613588] [drm] ring test on 5 succeeded in 2 usecs
Jun 18 03:10:02 localhost kernel: [26126.613595] [drm] UVD initialized successfully.
Jun 18 03:10:04 localhost kernel: [26128.294193] SysRq : Emergency Sync
Jun 18 03:10:04 localhost kernel: [26128.309427] Emergency Sync complete
Jun 18 03:10:12 localhost kernel: [26136.610250] radeon 0000:01:00.0: ring 0 stalled for more than 10001msec
Jun 18 03:10:12 localhost kernel: [26136.610260] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000445396 last fence id 0x0000000000445273 on ring 0)
Jun 18 03:10:12 localhost kernel: [26136.610266] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
Jun 18 03:10:12 localhost kernel: [26136.610273] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
Jun 18 03:10:12 localhost kernel: [26136.610278] radeon 0000:01:00.0: ib ring test failed (-35).
Jun 18 03:10:13 localhost kernel: [26137.083011] SysRq : Emergency Sync
Jun 18 03:10:13 localhost kernel: [26137.098644] Emergency Sync complete
Jun 18 03:10:13 localhost kernel: [26137.164779] radeon 0000:01:00.0: GPU softreset: 0x00000048
Jun 18 03:10:13 localhost kernel: [26137.164782] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
Jun 18 03:10:13 localhost kernel: [26137.164785] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun 18 03:10:13 localhost kernel: [26137.164787] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun 18 03:10:13 localhost kernel: [26137.164789] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jun 18 03:10:13 localhost kernel: [26137.164911] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.164913] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.164918] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
Jun 18 03:10:13 localhost kernel: [26137.164925] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
Jun 18 03:10:13 localhost kernel: [26137.164930] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
Jun 18 03:10:13 localhost kernel: [26137.164933] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:13 localhost kernel: [26137.164935] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:13 localhost kernel: [26137.164937] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:13 localhost kernel: [26137.164940] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.709342] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
Jun 18 03:10:13 localhost kernel: [26137.709397] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
Jun 18 03:10:13 localhost kernel: [26137.710554] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
Jun 18 03:10:13 localhost kernel: [26137.710556] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
Jun 18 03:10:13 localhost kernel: [26137.710558] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
Jun 18 03:10:13 localhost kernel: [26137.710560] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Jun 18 03:10:13 localhost kernel: [26137.710681] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.710683] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.710685] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.710687] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.710689] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Jun 18 03:10:13 localhost kernel: [26137.710695] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:13 localhost kernel: [26137.710706] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
Jun 18 03:10:13 localhost kernel: [26137.710960] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Jun 18 03:10:13 localhost kernel: [26137.724290] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
Jun 18 03:10:13 localhost kernel: [26137.724294] [drm] PCIE gen 2 link speeds already enabled
Jun 18 03:10:13 localhost kernel: [26137.728012] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
Jun 18 03:10:13 localhost kernel: [26137.728147] radeon 0000:01:00.0: WB enabled
Jun 18 03:10:13 localhost kernel: [26137.728150] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff88041651fc00
Jun 18 03:10:13 localhost kernel: [26137.728152] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff88041651fc04
Jun 18 03:10:13 localhost kernel: [26137.728154] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff88041651fc08
Jun 18 03:10:13 localhost kernel: [26137.728156] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff88041651fc0c
Jun 18 03:10:13 localhost kernel: [26137.728158] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff88041651fc10
Jun 18 03:10:13 localhost kernel: [26137.729140] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18
Jun 18 03:10:14 localhost kernel: [26137.914127] [drm] ring test on 0 succeeded in 3 usecs
Jun 18 03:10:14 localhost kernel: [26137.914133] [drm] ring test on 1 succeeded in 1 usecs
Jun 18 03:10:14 localhost kernel: [26137.914138] [drm] ring test on 2 succeeded in 1 usecs
Jun 18 03:10:14 localhost kernel: [26137.914202] [drm] ring test on 3 succeeded in 2 usecs
Jun 18 03:10:14 localhost kernel: [26137.914211] [drm] ring test on 4 succeeded in 1 usecs
Jun 18 03:10:14 localhost kernel: [26138.091289] [drm] ring test on 5 succeeded in 2 usecs
Jun 18 03:10:14 localhost kernel: [26138.091296] [drm] UVD initialized successfully.
Jun 18 03:10:14 localhost kernel: [26138.091385] [drm] ib test on ring 0 succeeded in 0 usecs
Jun 18 03:10:14 localhost kernel: [26138.091475] [drm] ib test on ring 1 succeeded in 0 usecs
Jun 18 03:10:14 localhost kernel: [26138.091601] [drm] ib test on ring 2 succeeded in 0 usecs
Jun 18 03:10:14 localhost kernel: [26138.091644] [drm] ib test on ring 3 succeeded in 0 usecs
Jun 18 03:10:14 localhost kernel: [26138.091679] [drm] ib test on ring 4 succeeded in 0 usecs
Jun 18 03:10:16 localhost kernel: [26140.121922] SysRq : Keyboard mode set to system default
Jun 18 03:10:18 localhost kernel: [26141.841306] SysRq : Emergency Sync
Jun 18 03:10:18 localhost kernel: [26141.866011] Emergency Sync complete
Jun 18 03:10:18 localhost kernel: [26142.361117] SysRq : Emergency Sync
Jun 18 03:10:18 localhost kernel: [26142.383541] Emergency Sync complete
Jun 18 03:10:24 localhost kernel: [26148.238943] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
Jun 18 03:10:24 localhost kernel: [26148.238952] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
Jun 18 03:10:24 localhost kernel: [26148.238958] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
Jun 18 03:10:24 localhost kernel: [26148.238966] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Jun 18 03:10:24 localhost kernel: [26148.238995] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
Jun 18 03:10:24 localhost kernel: [26148.262523] radeon 0000:01:00.0: GPU fault detected: 146 0x05428804
Jun 18 03:10:24 localhost kernel: [26148.262534] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000A1AA
Jun 18 03:10:24 localhost kernel: [26148.262539] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02088004
Jun 18 03:10:24 localhost kernel: [26148.262544] VM fault (0x04, vmid 1) at page 41386, read from TC (136)
Jun 18 03:10:24 localhost kernel: [26148.262552] radeon 0000:01:00.0: GPU fault detected: 146 0x06824804
Jun 18 03:10:24 localhost kernel: [26148.262556] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262559] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262563] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262570] radeon 0000:01:00.0: GPU fault detected: 146 0x05428404
Jun 18 03:10:24 localhost kernel: [26148.262573] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262577] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262580] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262586] radeon 0000:01:00.0: GPU fault detected: 146 0x06c24404
Jun 18 03:10:24 localhost kernel: [26148.262590] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262593] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262597] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262603] radeon 0000:01:00.0: GPU fault detected: 146 0x06228804
Jun 18 03:10:24 localhost kernel: [26148.262606] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262610] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262613] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262619] radeon 0000:01:00.0: GPU fault detected: 146 0x07024804
Jun 18 03:10:24 localhost kernel: [26148.262623] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262626] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262629] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262635] radeon 0000:01:00.0: GPU fault detected: 146 0x0802c804
Jun 18 03:10:24 localhost kernel: [26148.262639] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262642] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262646] VM fault (0x00, vmid 0) at page 0, read from unknown (0)
Jun 18 03:10:24 localhost kernel: [26148.262652] radeon 0000:01:00.0: GPU fault detected: 146 0x0882c404
Jun 18 03:10:24 localhost kernel: [26148.262655] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 18 03:10:24 localhost kernel: [26148.262659] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
===CUT===
(The VM fault message repeats several thousand times; some messages are omitted because the full log is about 3.5 MB.)
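The "VM fault" lines in the log are a decoded form of the two VM_CONTEXT1_PROTECTION_FAULT registers printed just before them. A minimal sketch of that decoding follows. The bit layout used here is an assumption: it was chosen because it reproduces the kernel's own decoded line for the first fault in the log (page 41386, vmid 1, read, client 136), not taken from this report. The function name `decode_vm_fault` is hypothetical.

```python
# Assumed bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS.
# It is inferred from the log: with these fields, status 0x02088004
# decodes to the same values the kernel printed for that fault.
PROTECTIONS_MASK = 0xF        # bits 0-3: protection fault flags
MEMORY_CLIENT_ID_SHIFT = 12   # bits 12-19: memory client id
MEMORY_CLIENT_ID_MASK = 0xFF
MEMORY_CLIENT_RW_SHIFT = 24   # bit 24: 0 = read, 1 = write
FAULT_VMID_SHIFT = 25         # bits 25-28: VM id
FAULT_VMID_MASK = 0xF


def decode_vm_fault(addr, status):
    """Split the two fault registers into the fields shown in the log."""
    is_write = (status >> MEMORY_CLIENT_RW_SHIFT) & 1
    return {
        # The log prints the FAULT_ADDR register value as a page number.
        "page": addr,
        "protections": status & PROTECTIONS_MASK,
        "mc_id": (status >> MEMORY_CLIENT_ID_SHIFT) & MEMORY_CLIENT_ID_MASK,
        "access": "write" if is_write else "read",
        "vmid": (status >> FAULT_VMID_SHIFT) & FAULT_VMID_MASK,
    }


if __name__ == "__main__":
    # Register values from the first "GPU fault detected" entry in the log.
    print(decode_vm_fault(0x0000A1AA, 0x02088004))
```

Under this layout, the later entries with both registers reading 0x00000000 decode to vmid 0, page 0 and client 0, which matches the "read from unknown (0)" lines; they carry no additional fault information.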
Comment 1 t3st3r 2014-06-18 02:22:15 UTC
*** Bug 78211 has been marked as a duplicate of this bug. ***
Comment 2 Alex Deucher 2014-06-18 15:12:31 UTC
This is more likely a bug in the mesa 3D driver than a kernel bug.  The 3D driver is used for both 2D and 3D acceleration.
Comment 3 t3st3r 2014-06-19 07:37:58 UTC
That does not look like it. I re-checked, using the very same driver versions throughout. I'm unable to trigger the bug with the 3.15 mainline kernel, no matter what. But it happens easily with older kernels like 3.14 or the early 3.15 RCs, and also happens with 3.16-rc1. This makes me think it comes down to DMA issues. What makes me even more suspicious is that the early 3.15 RCs were crashing as well, but in the release version it's gone. It looks like last-minute DMA-related changes fixed this instability. But now it has reappeared :(.
Comment 4 Alex Deucher 2014-06-19 13:46:28 UTC
Can you bisect and see what fixed it in 3.15 or what broke it again in 3.16?
Comment 5 t3st3r 2014-06-21 04:04:18 UTC
I will try that, since this bug is nasty enough. It can take some time.

As an initial investigation, from looking at the commit log and matching it against when the bug occurred, it appears the stability issues were fixed as a result of commit 0a4ae727d6aa459247b027387edb6ff99f657792 (which landed between 3.15-rc8 and the 3.15 release).

So none of the 3.15 RCs were stable on the R9 270, but the 3.15 release is okay thanks to these last-minute fixes. Yet 0a4ae727d6aa459247b027387edb6ff99f657792 seems to be composed of a few commits, so let's chew on it a bit more. Most likely it comes down to 91b0275c0ecd1870c5f8bfb73e2da2d6c29414b3.

I think I will try a little experiment first: restore CPDMA to how it was in the 3.15 last-minute fix and see whether stability returns on 3.16-rc1 with the R9 270.
Comment 6 t3st3r 2014-06-22 07:12:51 UTC
Hmm, wrong guess about CPDMA. Trying harder; due to the nature of the bug it can take some time.
Comment 7 Alex Deucher 2014-06-23 14:44:44 UTC
Created attachment 140711
patch 1/2

Does this patch set help?
Comment 8 Alex Deucher 2014-06-23 14:45:06 UTC
Created attachment 140721
patch 2/2
Comment 9 t3st3r 2014-06-24 11:40:13 UTC
Hmm, this patch does not apply cleanly to 3.16-rc1 or -rc2; it mostly produces a bunch of conflicts in radeon_vm.c, which are a bit over my head to resolve at this point. Which kernel version am I supposed to try?
Comment 10 Alex Deucher 2014-06-24 16:23:36 UTC
Created attachment 140871
patch 1/2

Sorry, updated patches.  These apply against 3.15.
Comment 11 Alex Deucher 2014-06-24 16:23:57 UTC
Created attachment 140881
patch 2/2
Comment 12 t3st3r 2014-06-25 01:05:17 UTC
And what am I supposed to test if the patch is against 3.15? The 3.15 release is fine "on its own" and does not expose this bug. So it's impossible to see that the bug appears -> apply the patch -> check that the bug is gone (which looks like the most logical course of action to me, unless I got something wrong), because 3.15 lacks this bug even before the patch.

The bug appears to have been fixed between 3.15-rc8 and 3.15 as a result of the mentioned merge. Then it reappeared in 3.16-rc1 (and up) as a result of other merges. So wouldn't it be logical for the patch to be against some 3.15-rc* or 3.16-rc*? Unfortunately, those have so many changes related to VM management that I'm not skilled enough to port the patch to these versions myself (most notably, the radeon_vm.c changes are quite complicated). So I can't check whether the GPU lockup is gone after patching some "known-bad" version.

Or do you mean something like this: take 3.15 (which is OK) and check that the patch does not break anything? But that wouldn't be a direct check of whether the bug is actually gone, right?
Comment 13 Alex Deucher 2014-06-25 02:11:02 UTC
Sorry, I misread your comments and thought it was broken on 3.15 as well.  You can follow the thread here:
http://lists.freedesktop.org/archives/dri-devel/2014-June/062305.html
Comment 14 t3st3r 2014-06-25 09:45:28 UTC
No, no, v3.15 (release) is okay on my GPU with regard to this bug. That is what makes testing the patch tricky.

The bug has been here for an unknown length of time. I can tell for sure it plagued all 3.15 RCs (maybe earlier versions as well). But between 3.15-rc8 and the 3.15 release, a bunch of last-minute DRM fixes landed (0a4ae727d6aa459247b027387edb6ff99f657792). Among everything else, they corrected this GPU deadlock problem. So v3.15 (the one and only) does not expose the bug.

When I gave v3.16-rc1 a try, I found the bug had reappeared. Hence it looked like a regression in the 3.16 RCs.
Comment 15 Alex Deucher 2014-06-25 13:17:50 UTC
Any luck narrowing down what fixed it in 3.15 or what broke it again in 3.16?
Comment 16 t3st3r 2014-08-05 08:06:47 UTC
I have to admit this bug really suxx. I've attempted to bisect 3.15 -> 3.16-rc1 several times, but these attempts have failed so far.

While I generally found quite fast ways to trigger this bug in lucky cases, in other cases it does not trigger for many hours, or even requires a reboot on the same kernel version to increase the chance that it appears. The bug also seems to be really picky about the previous history of GPU usage (e.g. launching some 3D game before BfW can change everything and the bug would not trigger for literally days, but can occasionally backstab).

In some cases, deciding whether a kernel is bugged or not turned out to be a really daunting and time-consuming task. My last attempt was also wrong. I bet some of the "good" kernels were not as good as they should be. The bad kernels, on the other hand, are reliably bad, i.e. the GPU crashed.
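How long a kernel has to run without a crash before it can be called "good" can be estimated with a small calculation. The sketch below is only a rough model under a loud assumption: that on a bad kernel the time to the first lockup is exponentially distributed with a fixed mean. The history-dependent behaviour described above breaks that assumption, so the result is at best a lower bound. The function names and the example numbers (mean of 30 minutes, 1% error rate) are illustrative, not measured.

```python
import math


def min_soak_minutes(mean_minutes_to_crash, max_false_good):
    """Crash-free run time needed before calling a kernel "good".

    Assumes (hypothetically) an exponential time-to-crash on a bad
    kernel, so P(no crash within t) = exp(-t / mean).  Solving
    exp(-t / mean) <= max_false_good for t gives the soak time.
    """
    if mean_minutes_to_crash <= 0:
        raise ValueError("mean_minutes_to_crash must be positive")
    if not 0 < max_false_good < 1:
        raise ValueError("max_false_good must be between 0 and 1")
    return mean_minutes_to_crash * math.log(1.0 / max_false_good)


def all_goods_correct(max_false_good, good_verdicts):
    """Lower bound on the chance that every "good" verdict is right,
    treating the verdicts as independent."""
    return (1.0 - max_false_good) ** good_verdicts


if __name__ == "__main__":
    # Example: lockups arrive after 30 minutes on average (assumed),
    # and a bad kernel may be misjudged as good at most 1% of the time.
    print(round(min_soak_minutes(30, 0.01), 1), "minutes per kernel")
    print(round(all_goods_correct(0.01, 3), 4), "chance 3 goods are all right")
```

Only "good" verdicts need this soak time; a single observed lockup is enough to mark a kernel "bad".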

The last attempt also led me into a really strange area: I don't even have the hardware in question, so this module is never used.

P.S. As far as I understand, the http://lists.freedesktop.org/archives/dri-devel/2014-June/062305.html fix wasn't ported to the 3.16 series? So 3.16 keeps failing for me.


As an example, the last bisect looked like this:
$ git bisect log
git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# good: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect good aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [16b9057804c02e2d351e9c8f606e909b43cbd9e7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good 16b9057804c02e2d351e9c8f606e909b43cbd9e7
# bad: [249c8b8d7e2d1bf9505dc46458537e77326c24fd] i40evf: remove unnecessary log messages
git bisect bad 249c8b8d7e2d1bf9505dc46458537e77326c24fd
# good: [758bd61aa987e82765bd432f37bd81bd197c4b1a] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
git bisect good 758bd61aa987e82765bd432f37bd81bd197c4b1a
# bad: [9db7cb6901740453a442e598563b576987dd471b] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
git bisect bad 9db7cb6901740453a442e598563b576987dd471b
# bad: [99abe65ff18b6bbac2e55524827b571c3eccfa86] Merge tag 'nfc-next-3.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
git bisect bad 99abe65ff18b6bbac2e55524827b571c3eccfa86
# bad: [75e58071c0c64f331ccc4c0037990a1e50099f7f] Merge branch 'for-linville' of git://github.com/kvalo/ath
git bisect bad 75e58071c0c64f331ccc4c0037990a1e50099f7f
# bad: [d5738b41e555f97f597b19bc549fa811b516d6b6] Revert "wl1251: enforce changed hw encryption support on monitor state change"
git bisect bad d5738b41e555f97f597b19bc549fa811b516d6b6
# bad: [0aa7142812c19af25ad21405eefc499e83da2fcc] iwlwifi: mvm: fix sparse warning when _DEBUGFS isn't set
git bisect bad 0aa7142812c19af25ad21405eefc499e83da2fcc
# bad: [14b485f041e35f60212317017c2127b8a9b6be31] iwlwifi: mvm: prevent nic to powered up at driver load
git bisect bad 14b485f041e35f60212317017c2127b8a9b6be31
# bad: [1e9551debacdaa044eeb514f4366beac6e18f6d9] iwlwifi: mvm: rs: don't allow TPC when power save is disabled
git bisect bad 1e9551debacdaa044eeb514f4366beac6e18f6d9
# bad: [cebeb0f1885fa93c44be5d4e0b9b640210ff088c] Merge remote-tracking branch 'wireless-next/master' into iwlwifi-next
git bisect bad cebeb0f1885fa93c44be5d4e0b9b640210ff088c
# bad: [939ecf6b14c46e3448411a934418311b492bfee4] Merge remote-tracking branch 'iwlwifi-fixes/master' into iwlwifi-next
git bisect bad 939ecf6b14c46e3448411a934418311b492bfee4
# first bad commit: [939ecf6b14c46e3448411a934418311b492bfee4] Merge remote-tracking branch 'iwlwifi-fixes/master' into iwlwifi-next

Obviously iwlwifi has nothing to do with this bug. I bet I failed to judge the quality of some kernel(s) correctly one more time.
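With an intermittent bug, a "bad" verdict is backed by an observed crash, while a "good" verdict only means no crash was seen. A minimal sketch for listing the "good" verdicts in a `git bisect log` follows, since those are the ones worth re-testing. The helper names are hypothetical, and the sample input is the first few command lines of the log above.

```python
import re

# Matches the command lines of a `git bisect log`; comment lines that
# start with "#" and the "git bisect start" line are ignored.
VERDICT = re.compile(r"^git bisect (good|bad) ([0-9a-f]{40})$")


def verdicts(log_text):
    """Return (verdict, sha) pairs in the order they were recorded."""
    found = []
    for line in log_text.splitlines():
        match = VERDICT.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def suspect_goods(log_text):
    """Commits marked good, i.e. verdicts that rest on not seeing a crash."""
    return [sha for verdict, sha in verdicts(log_text) if verdict == "good"]


SAMPLE = """\
git bisect start
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
git bisect good aaeb2554337217dfa4eac2fcc90da7be540b9a73
git bisect good 16b9057804c02e2d351e9c8f606e909b43cbd9e7
git bisect bad 249c8b8d7e2d1bf9505dc46458537e77326c24fd
"""

if __name__ == "__main__":
    for sha in suspect_goods(SAMPLE):
        print(sha)
```

The first entry in the output is the 3.15 baseline, which was tested far longer than any intermediate kernel; the later "good" entries are the more likely misjudgements.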
Comment 17 Tomasz Mloduchowski 2014-08-14 11:56:02 UTC
I can confirm that the bug still occurs on 3.16 as well. 

Different hardware:
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT [Radeon R9 270X]

Non-AMD-Vi  (Intel Xeon), IO-MMU disabled. 

Occasionally on large window resizes (4K display running awesome WM, moving a 2D window from a small tile to a large one) this issue triggers. 
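The "GPU lockup" lines in both reports share one format, so the fence numbers can be pulled out and compared mechanically. This is a minimal sketch with a hypothetical helper name, fed two lockup lines from this comment's log; the regular expression only assumes the message format seen in this bug.

```python
import re

# Format as it appears in the logs of this report.
LOCKUP = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def parse_lockups(log_text):
    """Extract ring, awaited fence, last signaled fence and their gap."""
    events = []
    for line in log_text.splitlines():
        match = LOCKUP.search(line)
        if match:
            waiting = int(match.group(1), 16)
            last = int(match.group(2), 16)
            events.append({
                "ring": int(match.group(3)),
                "waiting": waiting,
                "last": last,
                "gap": waiting - last,  # fences emitted but not signaled
            })
    return events


SAMPLE = """\
[ 6735.965958] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041872 last fence id 0x0000000000041871 on ring 0)
[ 6747.574410] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041920 last fence id 0x0000000000041871 on ring 0)
"""

if __name__ == "__main__":
    for event in parse_lockups(SAMPLE):
        print(event)
```

In this sample the last signaled fence is identical in both events, so ring 0 signaled no new fence between the first lockup report and the one that follows the reset.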


[ 6735.965953] radeon 0000:02:00.0: ring 0 stalled for more than 10081msec
[ 6735.965958] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041872 last fence id 0x0000000000041871 on ring 0)
[ 6735.965962] radeon 0000:02:00.0: failed to get a new IB (-35)
[ 6736.546504] radeon 0000:02:00.0: Saved 12093 dwords of commands on ring 0.
[ 6736.546647] radeon 0000:02:00.0: GPU softreset: 0x0000006C
[ 6736.546651] radeon 0000:02:00.0:   GRBM_STATUS               = 0xA0003028
[ 6736.546654] radeon 0000:02:00.0:   GRBM_STATUS_SE0           = 0x00000006
[ 6736.546657] radeon 0000:02:00.0:   GRBM_STATUS_SE1           = 0x00000006
[ 6736.546660] radeon 0000:02:00.0:   SRBM_STATUS               = 0x200000C0
[ 6736.546773] radeon 0000:02:00.0:   SRBM_STATUS2              = 0x00000000
[ 6736.546777] radeon 0000:02:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 6736.546780] radeon 0000:02:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[ 6736.546783] radeon 0000:02:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[ 6736.546786] radeon 0000:02:00.0:   R_008680_CP_STAT          = 0x80010243
[ 6736.546789] radeon 0000:02:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83146
[ 6736.546793] radeon 0000:02:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C84246
[ 6736.546796] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6736.546802] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6737.119141] radeon 0000:02:00.0: GRBM_SOFT_RESET=0x0000DDFF
[ 6737.119197] radeon 0000:02:00.0: SRBM_SOFT_RESET=0x00100140
[ 6737.120382] radeon 0000:02:00.0:   GRBM_STATUS               = 0x00003028
[ 6737.120385] radeon 0000:02:00.0:   GRBM_STATUS_SE0           = 0x00000006
[ 6737.120388] radeon 0000:02:00.0:   GRBM_STATUS_SE1           = 0x00000006
[ 6737.120391] radeon 0000:02:00.0:   SRBM_STATUS               = 0x20000AC0
[ 6737.120503] radeon 0000:02:00.0:   SRBM_STATUS2              = 0x00000000
[ 6737.120507] radeon 0000:02:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 6737.120510] radeon 0000:02:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 6737.120513] radeon 0000:02:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 6737.120516] radeon 0000:02:00.0:   R_008680_CP_STAT          = 0x00000000
[ 6737.120519] radeon 0000:02:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 6737.120522] radeon 0000:02:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[ 6737.120770] radeon 0000:02:00.0: GPU reset succeeded, trying to resume
[ 6737.169219] [drm] probing gen 2 caps for device 8086:340a = 3b3d02/0
[ 6737.169230] [drm] PCIE gen 2 link speeds already enabled
[ 6737.172143] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[ 6737.172320] radeon 0000:02:00.0: WB enabled
[ 6737.172324] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880197695c00
[ 6737.172327] radeon 0000:02:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880197695c04
[ 6737.172330] radeon 0000:02:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880197695c08
[ 6737.172335] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880197695c0c
[ 6737.172338] radeon 0000:02:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880197695c10
[ 6737.216900] radeon 0000:02:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90001735a18
[ 6737.402614] [drm] ring test on 0 succeeded in 3 usecs
[ 6737.402627] [drm] ring test on 1 succeeded in 1 usecs
[ 6737.402634] [drm] ring test on 2 succeeded in 1 usecs
[ 6737.402701] [drm] ring test on 3 succeeded in 2 usecs
[ 6737.402713] [drm] ring test on 4 succeeded in 1 usecs
[ 6737.579764] [drm] ring test on 5 succeeded in 2 usecs
[ 6737.579778] [drm] UVD initialized successfully.
[ 6747.574404] radeon 0000:02:00.0: ring 0 stalled for more than 10000msec
[ 6747.574410] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041920 last fence id 0x0000000000041871 on ring 0)
[ 6747.574414] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 6747.574418] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[ 6747.574421] radeon 0000:02:00.0: ib ring test failed (-35).
[ 6748.140502] radeon 0000:02:00.0: GPU softreset: 0x00000048
[ 6748.140507] radeon 0000:02:00.0:   GRBM_STATUS               = 0xA0003028
[ 6748.140510] radeon 0000:02:00.0:   GRBM_STATUS_SE0           = 0x00000006
[ 6748.140513] radeon 0000:02:00.0:   GRBM_STATUS_SE1           = 0x00000006
[ 6748.140516] radeon 0000:02:00.0:   SRBM_STATUS               = 0x200000C0
[ 6748.140628] radeon 0000:02:00.0:   SRBM_STATUS2              = 0x00000000
[ 6748.140631] radeon 0000:02:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 6748.140635] radeon 0000:02:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[ 6748.140638] radeon 0000:02:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[ 6748.140641] radeon 0000:02:00.0:   R_008680_CP_STAT          = 0x80010243
[ 6748.140644] radeon 0000:02:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 6748.140647] radeon 0000:02:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[ 6748.140651] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6748.140654] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 6748.692401] radeon 0000:02:00.0: GRBM_SOFT_RESET=0x0000DDFF
[ 6748.692457] radeon 0000:02:00.0: SRBM_SOFT_RESET=0x00000100
[ 6748.693617] radeon 0000:02:00.0:   GRBM_STATUS               = 0x00003028
[ 6748.693621] radeon 0000:02:00.0:   GRBM_STATUS_SE0           = 0x00000006
[ 6748.693624] radeon 0000:02:00.0:   GRBM_STATUS_SE1           = 0x00000006
[ 6748.693627] radeon 0000:02:00.0:   SRBM_STATUS               = 0x200000C0
[ 6748.693746] radeon 0000:02:00.0:   SRBM_STATUS2              = 0x00000000
[ 6748.693751] radeon 0000:02:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 6748.693754] radeon 0000:02:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 6748.693757] radeon 0000:02:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 6748.693760] radeon 0000:02:00.0:   R_008680_CP_STAT          = 0x00000000
[ 6748.693763] radeon 0000:02:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 6748.693767] radeon 0000:02:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[ 6748.694014] radeon 0000:02:00.0: GPU reset succeeded, trying to resume
[ 6748.709717] [drm] probing gen 2 caps for device 8086:340a = 3b3d02/0
[ 6748.709721] [drm] PCIE gen 2 link speeds already enabled
[ 6748.712059] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[ 6748.712221] radeon 0000:02:00.0: WB enabled
[ 6748.712224] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880197695c00
[ 6748.712225] radeon 0000:02:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880197695c04
[ 6748.712227] radeon 0000:02:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880197695c08
[ 6748.712229] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880197695c0c
[ 6748.712231] radeon 0000:02:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880197695c10
[ 6748.755479] radeon 0000:02:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90001735a18
[ 6748.941259] [drm] ring test on 0 succeeded in 3 usecs
[ 6748.941266] [drm] ring test on 1 succeeded in 1 usecs
[ 6748.941272] [drm] ring test on 2 succeeded in 1 usecs
[ 6748.941338] [drm] ring test on 3 succeeded in 2 usecs
[ 6748.941350] [drm] ring test on 4 succeeded in 1 usecs
[ 6749.118470] [drm] ring test on 5 succeeded in 2 usecs
[ 6749.118480] [drm] UVD initialized successfully.
[ 6749.118615] [drm] ib test on ring 0 succeeded in 0 usecs
[ 6749.118672] [drm] ib test on ring 1 succeeded in 0 usecs
[ 6749.118732] [drm] ib test on ring 2 succeeded in 0 usecs
[ 6749.118768] [drm] ib test on ring 3 succeeded in 0 usecs
[ 6749.118804] [drm] ib test on ring 4 succeeded in 0 usecs
[ 6759.264624] radeon 0000:02:00.0: ring 5 stalled for more than 10000msec
[ 6759.264630] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[ 6759.264634] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 6759.264640] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
[ 6759.264667] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
[ 6759.279402] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc33d04
[ 6759.279407] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.279410] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.279413] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61)
[ 6759.280478] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804
[ 6759.280482] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.280484] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.280487] VM fault (0x04, vmid 1) at page 138462, read from TC (72)
[ 6759.281017] radeon 0000:02:00.0: GPU fault detected: 146 0x01033d04
[ 6759.281020] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6759.281023] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.281025] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61)
[ 6759.281062] radeon 0000:02:00.0: GPU fault detected: 146 0x01033d04
[ 6759.281064] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.281066] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.281069] VM fault (0x04, vmid 1) at page 0, read from TC (72)
[ 6759.283614] radeon 0000:02:00.0: GPU fault detected: 146 0x0143a004
[ 6759.283619] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001D38A
[ 6759.283621] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030A0004
[ 6759.283624] VM fault (0x04, vmid 1) at page 119690, write from CB (160)
[ 6759.283841] radeon 0000:02:00.0: GPU fault detected: 146 0x05439004
[ 6759.283844] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.283846] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.283848] VM fault (0x04, vmid 1) at page 0, read from TC (72)
[ 6759.283853] radeon 0000:02:00.0: GPU fault detected: 146 0x05439004
[ 6759.283856] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001D391
[ 6759.283858] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.283861] VM fault (0x04, vmid 1) at page 119697, read from TC (72)
[ 6759.283889] radeon 0000:02:00.0: GPU fault detected: 146 0x05c3a004
[ 6759.283891] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.283894] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.283896] VM fault (0x04, vmid 1) at page 0, write from DMA1 (61)
[ 6759.283901] radeon 0000:02:00.0: GPU fault detected: 146 0x05a32004
[ 6759.283904] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6759.283906] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.283908] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61)
[ 6759.283914] radeon 0000:02:00.0: GPU fault detected: 146 0x05a31004
[ 6759.283916] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6759.283918] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.283921] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61)
[ 6759.283965] radeon 0000:02:00.0: GPU fault detected: 146 0x06e3d004
[ 6759.283967] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.283969] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.283972] VM fault (0x04, vmid 1) at page 0, read from TC (72)
[ 6759.284178] radeon 0000:02:00.0: GPU fault detected: 146 0x01424804
[ 6759.284180] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001D38C
[ 6759.284183] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6759.284185] VM fault (0x04, vmid 1) at page 119692, write from CB (96)
[ 6759.284190] radeon 0000:02:00.0: GPU fault detected: 146 0x03224804
[ 6759.284193] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001D3A4
[ 6759.284195] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03050004
[ 6759.284197] VM fault (0x04, vmid 1) at page 119716, write from CB (80)
[ 6759.284422] radeon 0000:02:00.0: GPU fault detected: 146 0x01224804
[ 6759.284424] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.284427] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.284429] VM fault (0x04, vmid 1) at page 0, read from TC (72)
[ 6759.284444] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004
[ 6759.284447] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001D395
[ 6759.284449] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.284451] VM fault (0x04, vmid 1) at page 119701, read from TC (72)
[ 6759.284556] radeon 0000:02:00.0: GPU fault detected: 146 0x03035004
[ 6759.284558] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.284561] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6759.284563] VM fault (0x04, vmid 1) at page 0, write from CB (96)
[ 6759.284568] radeon 0000:02:00.0: GPU fault detected: 146 0x0343a004
[ 6759.284570] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021B93
[ 6759.284573] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03090004
[ 6759.284575] VM fault (0x04, vmid 1) at page 138131, write from CB (144)
[ 6759.284612] radeon 0000:02:00.0: GPU fault detected: 146 0x03232004
[ 6759.284615] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 6759.284617] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.284619] VM fault (0x04, vmid 1) at page 0, write from DMA1 (61)
[ 6759.284624] radeon 0000:02:00.0: GPU fault detected: 146 0x03c39004
[ 6759.284627] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDF
[ 6759.284629] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.284631] VM fault (0x04, vmid 1) at page 138463, write from DMA1 (61)
[ 6759.284637] radeon 0000:02:00.0: GPU fault detected: 146 0x03231004
[ 6759.284639] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.284641] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.284644] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61)
[ 6759.284649] radeon 0000:02:00.0: GPU fault detected: 146 0x03231004
[ 6759.284651] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.284653] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6759.284656] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61)
[ 6759.284716] radeon 0000:02:00.0: GPU fault detected: 146 0x05035004
[ 6759.284718] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021B93
[ 6759.284720] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02044004
[ 6759.284723] VM fault (0x04, vmid 1) at page 138131, read from TC (68)
[ 6759.284728] radeon 0000:02:00.0: GPU fault detected: 146 0x05035004
[ 6759.284730] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDF
[ 6759.284732] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C4004
[ 6759.284735] VM fault (0x04, vmid 1) at page 138463, read from TC (196)
[ 6759.516471] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004
[ 6759.516475] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6759.516477] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6759.516479] VM fault (0x04, vmid 1) at page 137608, write from CB (96)
[ 6759.516483] radeon 0000:02:00.0: GPU fault detected: 146 0x01035004
[ 6759.516485] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6759.516486] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6759.516488] VM fault (0x04, vmid 1) at page 137608, write from CB (96)
[ 6759.533652] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc36004
[ 6759.533656] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.533658] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6759.533659] VM fault (0x04, vmid 1) at page 138462, write from CB (96)
[ 6759.533858] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc32004
[ 6759.533860] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.533861] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03020004
[ 6759.533863] VM fault (0x04, vmid 1) at page 138462, write from CB (32)
[ 6759.547549] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804
[ 6759.547552] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6759.547554] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6759.547555] VM fault (0x04, vmid 1) at page 138462, read from TC (72)
[ 6761.492893] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc33d04
[ 6761.492896] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6761.492898] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
[ 6761.492900] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61)
[ 6761.493081] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804
[ 6761.493083] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6761.493085] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
[ 6761.493087] VM fault (0x04, vmid 1) at page 138462, read from TC (72)
[ 6761.493486] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc36004
[ 6761.493489] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6761.493491] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6761.493493] VM fault (0x04, vmid 1) at page 138462, write from CB (96)
[ 6762.236056] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc21004
[ 6762.236060] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6762.236062] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02010004
[ 6762.236064] VM fault (0x04, vmid 1) at page 138462, read from CB (16)
[ 6762.236240] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc22004
[ 6762.236244] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE
[ 6762.236246] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02020004
[ 6762.236248] VM fault (0x04, vmid 1) at page 138462, read from CB (32)
[ 6770.359479] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004
[ 6770.359483] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6770.359489] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004
[ 6770.359492] VM fault (0x04, vmid 1) at page 137608, write from CB (96)
[ 6770.359496] radeon 0000:02:00.0: GPU fault detected: 146 0x01039004
[ 6770.359498] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021988
[ 6770.359500] radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02044004
[ 6770.359502] VM fault (0x04, vmid 1) at page 137608, read from TC (68)
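
For reading the fault lines above, here is a minimal Python sketch that decodes a VM_CONTEXT1_PROTECTION_FAULT_ADDR / VM_CONTEXT1_PROTECTION_FAULT_STATUS pair into the same fields the kernel prints in its "VM fault (...)" line. The bit layout is an assumption inferred from the log itself (for example, status 0x0303D004 is reported as "write from DMA1 (61)" with vmid 1), not taken from hardware documentation. The names `decode_vm_fault` and `decode_log` are hypothetical helpers, not part of any driver or tool.

```python
import re

# Matches the two register dump lines that precede each "VM fault" summary.
FAULT_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_(ADDR|STATUS)\s+0x([0-9A-Fa-f]+)"
)


def decode_vm_fault(addr, status):
    """Split a fault address/status pair into readable fields.

    Field positions are assumed from the log lines in this report:
    low bits = protection flags, bits 12-19 = memory client id,
    bit 24 = write flag, bits 25-28 = vmid.
    """
    return {
        "page": addr,                        # printed in decimal by the kernel
        "protections": status & 0xF,
        "client_id": (status >> 12) & 0xFF,  # e.g. 61 for DMA1, 72 for TC
        "access": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


def decode_log(lines):
    """Pair up ADDR and STATUS lines from a dmesg excerpt and decode them."""
    faults = []
    addr = None
    for line in lines:
        match = FAULT_RE.search(line)
        if not match:
            continue
        kind, value = match.group(1), int(match.group(2), 16)
        if kind == "ADDR":
            addr = value
        elif addr is not None:
            faults.append(decode_vm_fault(addr, value))
            addr = None
    return faults


if __name__ == "__main__":
    sample = [
        "radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CDE",
        "radeon 0000:02:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004",
    ]
    for fault in decode_log(sample):
        print(fault)
```

Feeding a saved dmesg excerpt through `decode_log` makes it easier to see which pages and which clients (DMA1, TC, CB) keep faulting without reading every line by hand.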
Comment 18 t3st3r 2014-08-24 01:05:04 UTC
This is getting even more interesting. After some investigation I have an idea why the bisect never succeeds. It looks like there were no stable kernels at all: 3.15 is also broken. However, it takes "almost forever" to crash it with the previously used methods.

I stumbled on a similar but far more effective test case (another map in the BfW game) which locks up the GPU within seconds to a minute. That's what I need :). It has also proven to knock down "good" 3.15 kernels in about 30 seconds, so 3.15 was not good at all. Obviously my bisect can't succeed.

On the other hand, now I can try the mentioned patches...
Comment 19 Michel Dänzer 2014-08-25 09:58:48 UTC
Does a 3.17 based drm-fixes kernel tree work better? There have been a couple of stability fixes.
Comment 20 t3st3r 2014-09-08 12:19:50 UTC
1) About 3.15 + patch: I gave it a try, and it took quite a while to form an opinion about it. Overall it is quite stable and survives several days of running the problematic load, but eventually the GPU can still crash. The interesting thing about the occurrence I caught is that, despite the scary message about a failed DPM resume, the GPU seems to be operable after a successful recovery. I got a couple of similar crashes within a week as well. It looked like this:

===cut===
[815114.959250] SysRq : Emergency Sync
[815115.071974] Emergency Sync complete
[815116.935547] radeon 0000:01:00.0: ring 0 stalled for more than 10082msec
[815116.935556] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000007f39f60 last fence id 0x0000000007f39f5f on ring 0)
[815116.935564] radeon 0000:01:00.0: failed to get a new IB (-35)
[815116.942472] radeon 0000:01:00.0: sa_manager is not empty, clearing anyway
[815117.134467] SysRq : Keyboard mode set to system default
[815117.500079] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080406640 flags=0x0000]
[815117.500092] radeon 0000:01:00.0: Saved 6061 dwords of commands on ring 0.
[815117.500097] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080406650 flags=0x0020]
[815117.500104] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080000100 flags=0x0020]
[815117.500110] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080404500 flags=0x0000]
[815117.500222] radeon 0000:01:00.0: GPU softreset: 0x0000006C
[815117.500226] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[815117.500229] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[815117.500231] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[815117.500233] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200002C0
[815117.500349] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[815117.500351] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[815117.500353] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[815117.500356] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[815117.500358] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
[815117.500360] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
[815117.500362] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C84246
[815117.500365] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[815117.500368] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[815118.057253] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[815118.057308] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
[815118.058465] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[815118.058468] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[815118.058470] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[815118.058472] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[815118.058583] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[815118.058585] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[815118.058588] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[815118.058590] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[815118.058592] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[815118.058594] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[815118.058597] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[815118.058843] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[815118.086936] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[815118.086939] [drm] PCIE gen 2 link speeds already enabled
[815118.090599] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[815118.090704] radeon 0000:01:00.0: WB enabled
[815118.090707] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880414545c00
[815118.090709] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880414545c04
[815118.090711] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880414545c08
[815118.090713] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880414545c0c
[815118.090715] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880414545c10
[815118.091689] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90012135a18
[815118.278813] [drm] ring test on 0 succeeded in 3 usecs
[815118.278819] [drm] ring test on 1 succeeded in 1 usecs
[815118.278824] [drm] ring test on 2 succeeded in 1 usecs
[815118.278888] [drm] ring test on 3 succeeded in 2 usecs
[815118.278897] [drm] ring test on 4 succeeded in 1 usecs
[815118.455982] [drm] ring test on 5 succeeded in 2 usecs
[815118.455989] [drm] UVD initialized successfully.
[815128.453467] radeon 0000:01:00.0: ring 0 stalled for more than 10001msec
[815128.453477] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000007f39fad last fence id 0x0000000007f39f5f on ring 0)
[815128.453483] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[815128.453491] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[815128.453496] radeon 0000:01:00.0: ib ring test failed (-35).
[815129.011900] radeon 0000:01:00.0: GPU softreset: 0x00000048
[815129.011904] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[815129.011907] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[815129.011909] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[815129.011911] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[815129.012022] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[815129.012025] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[815129.012027] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[815129.012029] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[815129.012031] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
[815129.012034] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[815129.012036] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[815129.012039] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[815129.012041] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[815129.561916] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[815129.561971] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[815129.563128] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[815129.563131] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[815129.563133] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[815129.563135] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[815129.563246] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[815129.563249] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[815129.563251] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[815129.563253] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[815129.563255] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[815129.563257] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[815129.563260] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[815129.563506] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[815129.576411] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[815129.576415] [drm] PCIE gen 2 link speeds already enabled
[815129.580147] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[815129.580250] radeon 0000:01:00.0: WB enabled
[815129.580253] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880414545c00
[815129.580255] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880414545c04
[815129.580257] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880414545c08
[815129.580259] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880414545c0c
[815129.580261] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880414545c10
[815129.581232] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90012135a18
[815129.767993] [drm] ring test on 0 succeeded in 3 usecs
[815129.767999] [drm] ring test on 1 succeeded in 1 usecs
[815129.768004] [drm] ring test on 2 succeeded in 1 usecs
[815129.768068] [drm] ring test on 3 succeeded in 2 usecs
[815129.768077] [drm] ring test on 4 succeeded in 1 usecs
[815129.945157] [drm] ring test on 5 succeeded in 2 usecs
[815129.945164] [drm] UVD initialized successfully.
[815129.946125] [drm] ib test on ring 0 succeeded in 0 usecs
[815129.946210] [drm] ib test on ring 1 succeeded in 0 usecs
[815129.946301] [drm] ib test on ring 2 succeeded in 0 usecs
[815129.946345] [drm] ib test on ring 3 succeeded in 0 usecs
[815129.946380] [drm] ib test on ring 4 succeeded in 0 usecs
[815137.847012] SysRq : Emergency Sync
[815137.965713] Emergency Sync complete
[815139.742325] SysRq : Emergency Sync
[815139.864190] Emergency Sync complete
[815140.093163] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[815140.093173] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[815140.093179] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[815140.093188] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
[815140.093217] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
===cut===
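
To compare lockups across logs like the one above, a small Python sketch such as the following can pull out every "GPU lockup" line and report the ring and the gap between the awaited fence value and the last fence id. Reading that gap as "how far behind the ring is" is an interpretation, not something the log states, and `summarize_lockups` is a hypothetical helper name.

```python
import re

# Matches the radeon "GPU lockup" message as it appears in the logs here.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for 0x([0-9A-Fa-f]+) "
    r"last fence id 0x([0-9A-Fa-f]+) on ring (\d+)\)"
)


def summarize_lockups(lines):
    """Return one record per 'GPU lockup' line found in a dmesg excerpt."""
    records = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if not match:
            continue
        waiting = int(match.group(1), 16)
        last = int(match.group(2), 16)
        records.append({
            "ring": int(match.group(3)),
            "waiting": waiting,
            "last": last,
            # Difference between the awaited and the last fence value.
            "gap": waiting - last,
        })
    return records


if __name__ == "__main__":
    sample = [
        "radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000007f39f60 "
        "last fence id 0x0000000007f39f5f on ring 0)",
        "radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 "
        "last fence id 0x0000000000000002 on ring 5)",
    ]
    for record in summarize_lockups(sample):
        print(record)
```

Run over a full dmesg capture, this gives a compact list of which rings stalled and in what order, which is handy when comparing kernels.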
Comment 21 t3st3r 2014-09-08 12:22:32 UTC
2) About 3.17... I tried 3.17-rc1 and it crashed after about 30 seconds of running the problematic workload.

I will try newer -RCs as well, since I can see there were some extra changes to radeon-related code.
Comment 22 t3st3r 2014-09-09 03:09:08 UTC
Tested on 3.17-rc4. Result: it crashed after about 3 minutes of running (see below).

Are some stability fixes missing from 3.17-rc4 mainline? At first glance I do not see any radeon-related commits in drm-fixes which haven't made it into -rc4. Am I missing something?

===cut===
 kernel: [  599.949295] radeon 0000:01:00.0: ring 3 stalled for more than 10167msec
 kernel: [  599.949305] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000001eb0 last fence id 0x0000000000001eaf on ring 3)
 kernel: [  599.949312] radeon 0000:01:00.0: scheduling IB failed (-35).
 kernel: [  600.507409] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a840 flags=0x0010]
 kernel: [  600.507420] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a870 flags=0x0030]
 kernel: [  600.507426] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080000100 flags=0x0030]
 kernel: [  600.507431] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a700 flags=0x0010]
 kernel: [  600.507460] radeon 0000:01:00.0: Saved 19308 dwords of commands on ring 0.
 kernel: [  600.507590] radeon 0000:01:00.0: GPU softreset: 0x0000006C
 kernel: [  600.507593] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
 kernel: [  600.507596] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
 kernel: [  600.507598] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
 kernel: [  600.507600] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
 kernel: [  600.507711] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
 kernel: [  600.507714] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
 kernel: [  600.507716] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
 kernel: [  600.507718] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
 kernel: [  600.507720] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
 kernel: [  600.507723] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
 kernel: [  600.507725] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44E84266
 kernel: [  600.507728] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 kernel: [  600.507730] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 kernel: [  601.054357] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
 kernel: [  601.054411] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
 kernel: [  601.055568] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
 kernel: [  601.055571] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
 kernel: [  601.055573] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
 kernel: [  601.055575] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
 kernel: [  601.055686] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
 kernel: [  601.055689] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
 kernel: [  601.055691] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
 kernel: [  601.055693] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
 kernel: [  601.055695] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
 kernel: [  601.055698] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  601.055700] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  601.055951] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
 kernel: [  601.083744] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
 kernel: [  601.083747] [drm] PCIE gen 2 link speeds already enabled
 kernel: [  601.084938] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
 kernel: [  601.085046] radeon 0000:01:00.0: WB enabled
 kernel: [  601.085049] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880413fbec00
 kernel: [  601.085052] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880413fbec04
 kernel: [  601.085054] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880413fbec08
 kernel: [  601.085056] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880413fbec0c
 kernel: [  601.085057] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880413fbec10
 kernel: [  601.086030] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18
 kernel: [  601.271000] [drm] ring test on 0 succeeded in 3 usecs
 kernel: [  601.271006] [drm] ring test on 1 succeeded in 1 usecs
 kernel: [  601.271011] [drm] ring test on 2 succeeded in 1 usecs
 kernel: [  601.271075] [drm] ring test on 3 succeeded in 2 usecs
 kernel: [  601.271084] [drm] ring test on 4 succeeded in 1 usecs
 kernel: [  601.448164] [drm] ring test on 5 succeeded in 2 usecs
 kernel: [  601.448172] [drm] UVD initialized successfully.
 kernel: [  611.444226] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
 kernel: [  611.444237] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001a60a last fence id 0x000000000001a4dd on ring 0)
 kernel: [  611.444244] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
 kernel: [  611.444252] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
 kernel: [  611.444257] radeon 0000:01:00.0: ib ring test failed (-35).
 kernel: [  611.997330] radeon 0000:01:00.0: GPU softreset: 0x00000048
 kernel: [  611.997333] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
 kernel: [  611.997336] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
 kernel: [  611.997338] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
 kernel: [  611.997341] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
 kernel: [  611.997452] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
 kernel: [  611.997454] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
 kernel: [  611.997456] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
 kernel: [  611.997458] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
 kernel: [  611.997461] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x84010243
 kernel: [  611.997463] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  611.997465] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  611.997468] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 kernel: [  611.997470] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 kernel: [  612.542126] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
 kernel: [  612.542180] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
 kernel: [  612.543338] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
 kernel: [  612.543340] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
 kernel: [  612.543343] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
 kernel: [  612.543345] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
 kernel: [  612.543456] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
 kernel: [  612.543458] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
 kernel: [  612.543460] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
 kernel: [  612.543462] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
 kernel: [  612.543465] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
 kernel: [  612.543467] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  612.543469] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
 kernel: [  612.543724] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
 kernel: [  612.556911] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
 kernel: [  612.556915] [drm] PCIE gen 2 link speeds already enabled
 kernel: [  612.558107] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
 kernel: [  612.558216] radeon 0000:01:00.0: WB enabled
 kernel: [  612.558219] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880413fbec00
 kernel: [  612.558222] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880413fbec04
 kernel: [  612.558224] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880413fbec08
 kernel: [  612.558226] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880413fbec0c
 kernel: [  612.558228] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880413fbec10
 kernel: [  612.559203] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18
 kernel: [  612.744297] [drm] ring test on 0 succeeded in 3 usecs
 kernel: [  612.744302] [drm] ring test on 1 succeeded in 1 usecs
 kernel: [  612.744308] [drm] ring test on 2 succeeded in 1 usecs
 kernel: [  612.744371] [drm] ring test on 3 succeeded in 2 usecs
 kernel: [  612.744380] [drm] ring test on 4 succeeded in 1 usecs
 kernel: [  612.921464] [drm] ring test on 5 succeeded in 2 usecs
 kernel: [  612.921472] [drm] UVD initialized successfully.
 kernel: [  612.921539] [drm] ib test on ring 0 succeeded in 0 usecs
 kernel: [  612.921634] [drm] ib test on ring 1 succeeded in 0 usecs
 kernel: [  612.921722] [drm] ib test on ring 2 succeeded in 0 usecs
 kernel: [  612.921762] [drm] ib test on ring 3 succeeded in 0 usecs
 kernel: [  612.921796] [drm] ib test on ring 4 succeeded in 0 usecs
 kernel: [  623.068910] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
 kernel: [  623.068921] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
 kernel: [  623.068927] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
 kernel: [  623.068935] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
 kernel: [  623.098333] radeon 0000:01:00.0: GPU fault detected: 146 0x07a23d0c
 kernel: [  623.098342] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000BDBD
 kernel: [  623.098347] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0203D00C
 kernel: [  623.098352] VM fault (0x0c, vmid 1) at page 48573, read from DMA1 (61)
 kernel: [  623.098364] radeon 0000:01:00.0: GPU fault detected: 146 0x07c23d0c
 kernel: [  623.098368] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 kernel: [  623.098372] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0208400C
 kernel: [  623.098377] VM fault (0x0c, vmid 1) at page 0, read from TC (132)
 kernel: [  623.098383] radeon 0000:01:00.0: GPU fault detected: 146 0x07e23d0c
 kernel: [  623.098387] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000BDBC
 kernel: [  623.098391] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
 kernel: [  623.098395] VM fault (0x0c, vmid 1) at page 48572, read from TC (8)
 kernel: [  623.128770] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14
 kernel: [  623.128781] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000BDB0
 kernel: [  623.128787] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D014
 kernel: [  623.128793] VM fault (0x04, vmid 1) at page 48560, write from DMA1 (61)
 kernel: [  623.128820] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14
 kernel: [  623.128825] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 kernel: [  623.128830] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C
 kernel: [  623.128835] VM fault (0x0c, vmid 1) at page 0, read from TC (68)
 kernel: [  623.128842] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14
 kernel: [  623.128847] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000BDB8
 kernel: [  623.128852] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C
 kernel: [  623.128857] VM fault (0x0c, vmid 1) at page 48568, read from TC (68)
 kernel: [  623.129932] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14
 kernel: [  623.129940] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000BDB0
 kernel: [  623.129944] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D014
 kernel: [  623.129948] VM fault (0x04, vmid 1) at page 48560, write from DMA1 (61)
 kernel: [  623.129965] radeon 0000:01:00.0: GPU fault detected: 146 0x06233d14
===cut===
Note: several megabytes of similar "VM fault" flood skipped.
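
For anyone sifting through a flood like this, here is a minimal sketch of how the raw VM_CONTEXT1_PROTECTION_FAULT_STATUS values seem to map onto the decoded "VM fault" lines. The bit layout is an assumption inferred from the register/decoded-line pairs in the log above, not taken from a hardware spec, and the client names are only the ones that appear in this log. The helper names (decode_fault_status, describe_fault) are made up for illustration.

```python
# Assumed field layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS, inferred from
# the pairs of register dumps and decoded lines in the log (an assumption).
PROTECTIONS_MASK = 0xF   # bits 0-3: protection fault bits
CLIENT_ID_SHIFT = 12     # bits 12-19: memory client id
CLIENT_ID_MASK = 0xFF
RW_SHIFT = 24            # bit 24: 1 = write, 0 = read
VMID_SHIFT = 25          # bits 25-28: VM context id
VMID_MASK = 0xF

# Client names copied from the decoded lines in this log only; incomplete.
CLIENT_NAMES = {61: "DMA1", 132: "TC", 8: "TC", 68: "TC"}


def decode_fault_status(status):
    """Split a raw status value into the fields the log line reports."""
    return {
        "protections": status & PROTECTIONS_MASK,
        "client_id": (status >> CLIENT_ID_SHIFT) & CLIENT_ID_MASK,
        "access": "write" if (status >> RW_SHIFT) & 1 else "read",
        "vmid": (status >> VMID_SHIFT) & VMID_MASK,
    }


def describe_fault(status, addr):
    """Rebuild a 'VM fault' line from the status and fault address registers."""
    f = decode_fault_status(status)
    name = CLIENT_NAMES.get(f["client_id"], "client")
    return "VM fault (0x%02x, vmid %d) at page %d, %s from %s (%d)" % (
        f["protections"], f["vmid"], addr, f["access"], name, f["client_id"])


if __name__ == "__main__":
    # Values taken from the log above.
    print(describe_fault(0x0203D00C, 0x0000BDBD))
    print(describe_fault(0x0303D014, 0x0000BDB0))
```

Under that assumed layout, the page number in the decoded line is just VM_CONTEXT1_PROTECTION_FAULT_ADDR printed in decimal (0xBDBD is 48573).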
Comment 23 Jean-Michel Smith 2014-09-30 04:03:23 UTC
I've seen this bug as well, through quite a few versions of 3.15 and 3.16. Sometimes it just freezes X, other times it hangs the entire system. Here is the output of the last hang (I was able to log in remotely, as this time it didn't completely crash the system):

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT [Radeon R9 270X]

(uname -a)
Linux prime 3.16.3-gentoo #1 SMP PREEMPT Thu Sep 18 20:59:58 CDT 2014 x86_64 Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz GenuineIntel GNU/Linux

(lsmod)
cfbfillrect             3634  1 radeon
cfbimgblt               2055  1 radeon
cfbcopyarea             3110  1 radeon
i2c_algo_bit            5055  1 radeon
drm_kms_helper         33715  1 radeon
ttm                    59052  1 radeon
drm                   226864  6 ttm,drm_kms_helper,radeon
firmware_class          8187  1 radeon
radeon               1258462  3 

(relevant dmesg info)

[120499.589293] radeon 0000:01:00.0: ring 0 stalled for more than 10473msec
[120499.589296] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000000783d0 last fence id 0x00000000000783cf on ring 0)
[120499.589299] radeon 0000:01:00.0: failed to get a new IB (-35)
[120500.099613] radeon 0000:01:00.0: Saved 3600 dwords of commands on ring 0.
[120500.099743] radeon 0000:01:00.0: GPU softreset: 0x0000006C
[120500.099746] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[120500.099748] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[120500.099750] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[120500.099751] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
[120500.099862] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[120500.099864] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[120500.099866] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[120500.099868] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[120500.099870] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
[120500.099872] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83146
[120500.099874] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44E84266
[120500.099876] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[120500.099879] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[120500.592138] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[120500.592192] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140
[120500.593350] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[120500.593352] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[120500.593354] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[120500.593356] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
[120500.593466] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[120500.593468] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[120500.593470] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[120500.593472] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[120500.593473] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[120500.593475] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[120500.593477] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[120500.593718] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[120500.621478] [drm] probing gen 2 caps for device 8086:3c04 = 7a7103/e
[120500.621482] [drm] PCIE gen 3 link speeds already enabled
[120500.623908] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[120500.624051] radeon 0000:01:00.0: WB enabled
[120500.624054] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00 and cpu addr 0xffff8807fb4aac00
[120500.624056] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04 and cpu addr 0xffff8807fb4aac04
[120500.624058] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08 and cpu addr 0xffff8807fb4aac08
[120500.624059] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c and cpu addr 0xffff8807fb4aac0c
[120500.624061] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10 and cpu addr 0xffff8807fb4aac10
[120500.624680] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900142b5a18
[120500.789277] [drm] ring test on 0 succeeded in 3 usecs
[120500.789283] [drm] ring test on 1 succeeded in 1 usecs
[120500.789287] [drm] ring test on 2 succeeded in 1 usecs
[120500.789351] [drm] ring test on 3 succeeded in 2 usecs
[120500.789361] [drm] ring test on 4 succeeded in 1 usecs
[120500.981448] [drm] ring test on 5 succeeded in 2 usecs
[120500.981456] [drm] UVD initialized successfully.
[120510.981602] radeon 0000:01:00.0: ring 0 stalled for more than 10002msec
[120510.981604] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000078407 last fence id 0x00000000000783cf on ring 0)
[120510.981606] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[120510.981608] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[120510.981609] radeon 0000:01:00.0: ib ring test failed (-35).
[120511.461309] radeon 0000:01:00.0: GPU softreset: 0x00000048
[120511.461310] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003028
[120511.461312] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[120511.461313] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[120511.461314] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[120511.461428] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[120511.461429] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[120511.461431] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[120511.461432] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000002
[120511.461434] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80010243
[120511.461435] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[120511.461437] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[120511.461439] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[120511.461440] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[120511.933287] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
[120511.933340] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[120511.934495] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003028
[120511.934496] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000006
[120511.934498] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000006
[120511.934499] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[120511.934609] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[120511.934610] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[120511.934612] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[120511.934613] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[120511.934614] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[120511.934616] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[120511.934617] radeon 0000:01:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[120511.934857] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[120511.945176] [drm] probing gen 2 caps for device 8086:3c04 = 7a7103/e
[120511.945179] [drm] PCIE gen 3 link speeds already enabled
[120511.947127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000).
[120511.947253] radeon 0000:01:00.0: WB enabled
[120511.947255] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00 and cpu addr 0xffff8807fb4aac00
[120511.947256] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04 and cpu addr 0xffff8807fb4aac04
[120511.947257] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08 and cpu addr 0xffff8807fb4aac08
[120511.947258] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c and cpu addr 0xffff8807fb4aac0c
[120511.947259] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10 and cpu addr 0xffff8807fb4aac10
[120511.947868] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900142b5a18
[120512.109348] [drm] ring test on 0 succeeded in 4 usecs
[120512.109352] [drm] ring test on 1 succeeded in 1 usecs
[120512.109355] [drm] ring test on 2 succeeded in 1 usecs
[120512.109417] [drm] ring test on 3 succeeded in 2 usecs
[120512.109426] [drm] ring test on 4 succeeded in 1 usecs
[120512.286478] [drm] ring test on 5 succeeded in 2 usecs
[120512.286483] [drm] UVD initialized successfully.
[120512.286534] [drm] ib test on ring 0 succeeded in 0 usecs
[120512.286580] [drm] ib test on ring 1 succeeded in 0 usecs
[120512.286623] [drm] ib test on ring 2 succeeded in 0 usecs
[120512.286648] [drm] ib test on ring 3 succeeded in 0 usecs
[120512.286672] [drm] ib test on ring 4 succeeded in 0 usecs
[120522.435679] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[120522.435685] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[120522.435688] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[120522.435695] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
[120522.435730] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
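
A rough sketch for pulling the lockup events out of a dmesg dump like the one above, in case it helps compare reports. It reads "waiting for" as the fence sequence number being awaited and "last fence id" as the last one that completed, which is an interpretation of the message text rather than something confirmed here. The function name find_lockups is hypothetical.

```python
import re

# Matches the "GPU lockup" lines as they appear in the log above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def find_lockups(lines):
    """Return (ring, outstanding) pairs, one per lockup line.

    'outstanding' is the awaited fence number minus the last completed one.
    """
    events = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            waiting = int(m.group(1), 16)
            last = int(m.group(2), 16)
            events.append((int(m.group(3)), waiting - last))
    return events


if __name__ == "__main__":
    # Three lockup lines copied from the dmesg output above.
    sample = [
        "[120499.589296] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000000783d0 last fence id 0x00000000000783cf on ring 0)",
        "[120510.981604] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000078407 last fence id 0x00000000000783cf on ring 0)",
        "[120522.435685] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)",
    ]
    for ring, outstanding in find_lockups(sample):
        print("ring %d: %d fence(s) outstanding" % (ring, outstanding))
```

In this log the last completed fence on ring 0 stays at 0x783cf across both resets, which fits the ring never making progress after the first hang.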
Comment 24 Linux Tester 2015-07-10 23:38:39 UTC
After some updates to MESA and the kernel, this bug no longer happens at all: 2D is now rock solid for me on the R9 270, and I can run even the most troublesome workloads for weeks without any issues. While I failed to pinpoint what exactly fixed the bug, thanks anyway. I think it is now safe to close this bug as fixed. I bet you've got dozens of other GPU lockups to chew on, so I'm glad to inform you that at least one nasty thing has been nailed down.

(It's me, the original bug reporter, who has forgotten the password to both the original mailbox and the bugzilla account, dammit.)