Bug 78221
Summary: | 3.16 RC1: AMD R9 270 GPU locks up on some heavy 2D activity - GPU VM fault occurs. (possibly DMA copying issue strikes back?) | | |
---|---|---|---|
Product: | Drivers | Reporter: | t3st3r |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW | | |
Severity: | high | CC: | abandonedaccountubdprczb8hs, Actualize.in.Material+bugzillakernel, alexandre.f.demers, alexdeucher, darkbasic, jean.michel.sm, kernel, linux.tester, muhomor.d, q, zazdxscf+bugzilla.kernel.org |
Priority: | P1 | | |
Hardware: | x86-64 | | |
OS: | Linux | | |
Kernel Version: | 3.16-rc1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | patch 1/2, patch 2/2, patch 1/2, patch 2/2 | | |
Description
t3st3r
2014-06-18 02:20:57 UTC
*** Bug 78211 has been marked as a duplicate of this bug. ***

This is more likely a bug in the Mesa 3D driver than a kernel bug. The 3D driver is used for both 2D and 3D acceleration.

It does not look like that. I re-checked, using the very same driver versions throughout. I am unable to trigger the bug with the 3.15 mainline kernel, no matter what. But it happens easily with older kernels such as 3.14 or the early 3.15 RCs, and it also happens with 3.16-rc1. This makes me think it comes down to DMA issues. What makes me even more suspicious is that the early 3.15 RCs crashed as well, but in the release version the crash is gone. It looks like last-minute DMA-related changes fixed this instability. But now it has reappeared :(.

Can you bisect and see what fixed it in 3.15 or what broke it again in 3.16?

I will try that, since the bug is nasty enough. It can take some time.

As an initial investigation, from looking at the commit log and matching it against occurrences of the bug, it appears the stability issues were fixed by commit 0a4ae727d6aa459247b027387edb6ff99f657792 (which landed between 3.15-rc8 and the 3.15 release). So none of the 3.15 RCs were stable on the R9 270, while the 3.15 release is okay thanks to these last-minute fixes. Yet 0a4ae727d6aa459247b027387edb6ff99f657792 seems to be composed of several commits, so let's chew on it a bit more. Most likely it comes down to 91b0275c0ecd1870c5f8bfb73e2da2d6c29414b3.

I think I will try a little experiment first: restore CPDMA as it was in the 3.15 last-minute fix and see whether stability returns to 3.16-rc1 with the R9 270.

Hmm, wrong guess about CPDMA. Trying harder; due to the nature of the bug it can take some time.

Created attachment 140711 [details]
patch 1/2
Does this patch set help?
Created attachment 140721 [details]
patch 2/2
Hmm, this patch does not apply cleanly to 3.16-rc1 or -rc2; it mostly produces a bunch of conflicts in radeon_vm.c, which are a bit over my head to resolve at this point. Which kernel version am I supposed to try?
Created attachment 140871 [details]
patch 1/2
Sorry, updated patches. These apply against 3.15.
Created attachment 140881 [details]
patch 2/2
And what am I supposed to test if the patch is against 3.15? The 3.15 release is fine "on its own" and does not expose this bug. So it is impossible to follow the sequence bug appears -> apply patch -> check that the bug is gone (which looks like the most logical course of action to me, unless I got something wrong), because 3.15 lacks this bug even before the patch. The bug appears to have been fixed between 3.15-rc8 and 3.15 as a result of the mentioned merge. Then it reappeared in 3.16-rc1 (and later) as a result of other merges. So wouldn't it be logical for the patch to be against some 3.15-rc* or 3.16-rc*? Unfortunately, those have so many changes related to VM management that I am not skilled enough to port the patch to these versions myself (most notably, the radeon_vm.c changes are quite complicated). So I can't see whether the GPU lockup is gone after patching some "known-bad" version. Or do you mean something like this: take 3.15 (which is OK) and check that the patch does not break anything? But that wouldn't be a direct check of whether the bug is actually gone, right?

Sorry, I misread your comments and thought it was broken on 3.15 as well. You can follow the thread here: http://lists.freedesktop.org/archives/dri-devel/2014-June/062305.html

No, no, v3.15 (release) is okay on my GPU with regard to this bug. That is what makes testing the patch tricky. The bug has been there for an unknown time. I can tell for sure that it plagued all 3.15 RCs (maybe earlier versions as well). But between 3.15-rc8 and the 3.15 release, a bunch of last-minute DRM fixes landed (0a4ae727d6aa459247b027387edb6ff99f657792). Among everything else, they corrected this GPU deadlock problem. So v3.15 (the one and only) does not expose the bug. When I gave v3.16-rc1 a try, I found that the bug had reappeared. Hence it looked like a regression in the 3.16 RCs.

Any luck narrowing down what fixed it in 3.15 or what broke it again in 3.16?

I have to admit this bug really sucks. I've attempted to bisect 3.15 -> 3.16-rc1 several times, but these attempts have failed so far.
It looks like, while I generally found quite fast ways to trigger this bug in lucky cases, in other cases the bug does not trigger for many hours, or can even require a reboot on the same kernel version to increase the chance that it appears. The bug also seems to be really picky about the previous history of GPU usage (e.g. launching some 3D game before BfW can change things so that the bug does not trigger for literally days, but it can still occasionally backstab). In some cases deciding whether a kernel is bugged or not turned out to be a really daunting and time-consuming task. My last attempt was also wrong. I bet some of the "good" kernels were not as good as they should have been. Bad kernels, on the other hand, are reliably bad, i.e. the GPU crashed. So the last attempt also led me into a really strange area: I don't even have the hardware in question, so this module is never used.

P.S. As far as I understand, the http://lists.freedesktop.org/archives/dri-devel/2014-June/062305.html fix wasn't ported to the 3.16 series? So 3.16 keeps failing for me.

As an example, the last bisect looked like this:

$ git bisect log
git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# good: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect good aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [16b9057804c02e2d351e9c8f606e909b43cbd9e7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good 16b9057804c02e2d351e9c8f606e909b43cbd9e7
# bad: [249c8b8d7e2d1bf9505dc46458537e77326c24fd] i40evf: remove unnecessary log messages
git bisect bad 249c8b8d7e2d1bf9505dc46458537e77326c24fd
# good: [758bd61aa987e82765bd432f37bd81bd197c4b1a] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
git bisect good 758bd61aa987e82765bd432f37bd81bd197c4b1a
# bad: [9db7cb6901740453a442e598563b576987dd471b] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
git bisect bad 9db7cb6901740453a442e598563b576987dd471b
# bad: [99abe65ff18b6bbac2e55524827b571c3eccfa86] Merge tag 'nfc-next-3.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
git bisect bad 99abe65ff18b6bbac2e55524827b571c3eccfa86
# bad: [75e58071c0c64f331ccc4c0037990a1e50099f7f] Merge branch 'for-linville' of git://github.com/kvalo/ath
git bisect bad 75e58071c0c64f331ccc4c0037990a1e50099f7f
# bad: [d5738b41e555f97f597b19bc549fa811b516d6b6] Revert "wl1251: enforce changed hw encryption support on monitor state change"
git bisect bad d5738b41e555f97f597b19bc549fa811b516d6b6
# bad: [0aa7142812c19af25ad21405eefc499e83da2fcc] iwlwifi: mvm: fix sparse warning when _DEBUGFS isn't set
git bisect bad 0aa7142812c19af25ad21405eefc499e83da2fcc
# bad: [14b485f041e35f60212317017c2127b8a9b6be31] iwlwifi: mvm: prevent nic to powered up at driver load
git bisect bad 14b485f041e35f60212317017c2127b8a9b6be31
# bad: [1e9551debacdaa044eeb514f4366beac6e18f6d9] iwlwifi: mvm: rs: don't allow TPC when power save is disabled
git bisect bad 1e9551debacdaa044eeb514f4366beac6e18f6d9
# bad: [cebeb0f1885fa93c44be5d4e0b9b640210ff088c] Merge remote-tracking branch 'wireless-next/master' into iwlwifi-next
git bisect bad cebeb0f1885fa93c44be5d4e0b9b640210ff088c
# bad: [939ecf6b14c46e3448411a934418311b492bfee4] Merge remote-tracking branch 'iwlwifi-fixes/master' into iwlwifi-next
git bisect bad 939ecf6b14c46e3448411a934418311b492bfee4
# first bad commit: [939ecf6b14c46e3448411a934418311b492bfee4] Merge remote-tracking branch 'iwlwifi-fixes/master' into iwlwifi-next

Obviously iwlwifi has nothing to do with this bug. I bet I failed to judge the quality of some kernel(s) correctly one more time.
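A note on why a bisect of an intermittent bug can end up in an unrelated subsystem: every "good" verdict is only as reliable as the number of clean test runs behind it. The sketch below is a rough illustration, not something taken from this report. It assumes each test run triggers the lockup independently with a fixed probability, which the comment above suggests is optimistic (the bug depends on earlier GPU usage), so the numbers are best read as lower bounds. The function names and the example probabilities are hypothetical.

```python
import math


def clean_runs_needed(p_trigger: float, max_miss: float = 0.05) -> int:
    """Number of clean runs needed before calling a kernel "good".

    Assumption (not from the report): each run triggers the bug
    independently with probability p_trigger. A bad kernel survives
    n runs with probability (1 - p_trigger) ** n, so we pick the
    smallest n that pushes this below max_miss.
    """
    if not 0.0 < p_trigger < 1.0:
        raise ValueError("p_trigger must be strictly between 0 and 1")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1.0 - p_trigger))


def chance_bisect_is_clean(max_miss: float, good_verdicts: int) -> float:
    """Lower-bound probability that no "good" verdict was a missed bad kernel.

    Each "good" verdict is wrong with probability at most max_miss.
    """
    return (1.0 - max_miss) ** good_verdicts


if __name__ == "__main__":
    # Hypothetical trigger rates, for illustration only.
    for p in (0.5, 0.1):
        print(p, clean_runs_needed(p))
    # The bisect log above contains four "good" verdicts
    # (one of them is the starting point, Linux 3.15).
    print(chance_bisect_is_clean(0.05, 4))
```

A single mislabelled "good" is enough to steer `git bisect` into commits that never touch the radeon driver, which is consistent with the iwlwifi result above.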
I can confirm that the bug still occurs on 3.16 as well. Different hardware: 02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT [Radeon R9 270X] Non-AMD-Vi (Intel Xeon), IO-MMU disabled. Occasionally on large window resizes (4K display running awesome WM, moving a 2D window from a small tile to a large one) this issue triggers. [ 6735.965953] radeon 0000:02:00.0: ring 0 stalled for more than 10081msec [ 6735.965958] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041872 last fence id 0x0000000000041871 on ring 0) [ 6735.965962] radeon 0000:02:00.0: failed to get a new IB (-35) [ 6736.546504] radeon 0000:02:00.0: Saved 12093 dwords of commands on ring 0. [ 6736.546647] radeon 0000:02:00.0: GPU softreset: 0x0000006C [ 6736.546651] radeon 0000:02:00.0: GRBM_STATUS = 0xA0003028 [ 6736.546654] radeon 0000:02:00.0: GRBM_STATUS_SE0 = 0x00000006 [ 6736.546657] radeon 0000:02:00.0: GRBM_STATUS_SE1 = 0x00000006 [ 6736.546660] radeon 0000:02:00.0: SRBM_STATUS = 0x200000C0 [ 6736.546773] radeon 0000:02:00.0: SRBM_STATUS2 = 0x00000000 [ 6736.546777] radeon 0000:02:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 6736.546780] radeon 0000:02:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [ 6736.546783] radeon 0000:02:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [ 6736.546786] radeon 0000:02:00.0: R_008680_CP_STAT = 0x80010243 [ 6736.546789] radeon 0000:02:00.0: R_00D034_DMA_STATUS_REG = 0x44C83146 [ 6736.546793] radeon 0000:02:00.0: R_00D834_DMA_STATUS_REG = 0x44C84246 [ 6736.546796] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6736.546802] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 6737.119141] radeon 0000:02:00.0: GRBM_SOFT_RESET=0x0000DDFF [ 6737.119197] radeon 0000:02:00.0: SRBM_SOFT_RESET=0x00100140 [ 6737.120382] radeon 0000:02:00.0: GRBM_STATUS = 0x00003028 [ 6737.120385] radeon 0000:02:00.0: GRBM_STATUS_SE0 = 0x00000006 [ 6737.120388] radeon 0000:02:00.0: GRBM_STATUS_SE1 = 0x00000006 [ 
6737.120391] radeon 0000:02:00.0: SRBM_STATUS = 0x20000AC0 [ 6737.120503] radeon 0000:02:00.0: SRBM_STATUS2 = 0x00000000 [ 6737.120507] radeon 0000:02:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 6737.120510] radeon 0000:02:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [ 6737.120513] radeon 0000:02:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [ 6737.120516] radeon 0000:02:00.0: R_008680_CP_STAT = 0x00000000 [ 6737.120519] radeon 0000:02:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [ 6737.120522] radeon 0000:02:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [ 6737.120770] radeon 0000:02:00.0: GPU reset succeeded, trying to resume [ 6737.169219] [drm] probing gen 2 caps for device 8086:340a = 3b3d02/0 [ 6737.169230] [drm] PCIE gen 2 link speeds already enabled [ 6737.172143] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). [ 6737.172320] radeon 0000:02:00.0: WB enabled [ 6737.172324] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880197695c00 [ 6737.172327] radeon 0000:02:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880197695c04 [ 6737.172330] radeon 0000:02:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880197695c08 [ 6737.172335] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880197695c0c [ 6737.172338] radeon 0000:02:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880197695c10 [ 6737.216900] radeon 0000:02:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90001735a18 [ 6737.402614] [drm] ring test on 0 succeeded in 3 usecs [ 6737.402627] [drm] ring test on 1 succeeded in 1 usecs [ 6737.402634] [drm] ring test on 2 succeeded in 1 usecs [ 6737.402701] [drm] ring test on 3 succeeded in 2 usecs [ 6737.402713] [drm] ring test on 4 succeeded in 1 usecs [ 6737.579764] [drm] ring test on 5 succeeded in 2 usecs [ 6737.579778] [drm] UVD 
initialized successfully. [ 6747.574404] radeon 0000:02:00.0: ring 0 stalled for more than 10000msec [ 6747.574410] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000041920 last fence id 0x0000000000041871 on ring 0) [ 6747.574414] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35). [ 6747.574418] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35). [ 6747.574421] radeon 0000:02:00.0: ib ring test failed (-35). [ 6748.140502] radeon 0000:02:00.0: GPU softreset: 0x00000048 [ 6748.140507] radeon 0000:02:00.0: GRBM_STATUS = 0xA0003028 [ 6748.140510] radeon 0000:02:00.0: GRBM_STATUS_SE0 = 0x00000006 [ 6748.140513] radeon 0000:02:00.0: GRBM_STATUS_SE1 = 0x00000006 [ 6748.140516] radeon 0000:02:00.0: SRBM_STATUS = 0x200000C0 [ 6748.140628] radeon 0000:02:00.0: SRBM_STATUS2 = 0x00000000 [ 6748.140631] radeon 0000:02:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 6748.140635] radeon 0000:02:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [ 6748.140638] radeon 0000:02:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [ 6748.140641] radeon 0000:02:00.0: R_008680_CP_STAT = 0x80010243 [ 6748.140644] radeon 0000:02:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [ 6748.140647] radeon 0000:02:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [ 6748.140651] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6748.140654] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 6748.692401] radeon 0000:02:00.0: GRBM_SOFT_RESET=0x0000DDFF [ 6748.692457] radeon 0000:02:00.0: SRBM_SOFT_RESET=0x00000100 [ 6748.693617] radeon 0000:02:00.0: GRBM_STATUS = 0x00003028 [ 6748.693621] radeon 0000:02:00.0: GRBM_STATUS_SE0 = 0x00000006 [ 6748.693624] radeon 0000:02:00.0: GRBM_STATUS_SE1 = 0x00000006 [ 6748.693627] radeon 0000:02:00.0: SRBM_STATUS = 0x200000C0 [ 6748.693746] radeon 0000:02:00.0: SRBM_STATUS2 = 0x00000000 [ 6748.693751] radeon 0000:02:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 6748.693754] radeon 0000:02:00.0: 
R_008678_CP_STALLED_STAT2 = 0x00000000 [ 6748.693757] radeon 0000:02:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [ 6748.693760] radeon 0000:02:00.0: R_008680_CP_STAT = 0x00000000 [ 6748.693763] radeon 0000:02:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [ 6748.693767] radeon 0000:02:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [ 6748.694014] radeon 0000:02:00.0: GPU reset succeeded, trying to resume [ 6748.709717] [drm] probing gen 2 caps for device 8086:340a = 3b3d02/0 [ 6748.709721] [drm] PCIE gen 2 link speeds already enabled [ 6748.712059] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). [ 6748.712221] radeon 0000:02:00.0: WB enabled [ 6748.712224] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880197695c00 [ 6748.712225] radeon 0000:02:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880197695c04 [ 6748.712227] radeon 0000:02:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880197695c08 [ 6748.712229] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880197695c0c [ 6748.712231] radeon 0000:02:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880197695c10 [ 6748.755479] radeon 0000:02:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90001735a18 [ 6748.941259] [drm] ring test on 0 succeeded in 3 usecs [ 6748.941266] [drm] ring test on 1 succeeded in 1 usecs [ 6748.941272] [drm] ring test on 2 succeeded in 1 usecs [ 6748.941338] [drm] ring test on 3 succeeded in 2 usecs [ 6748.941350] [drm] ring test on 4 succeeded in 1 usecs [ 6749.118470] [drm] ring test on 5 succeeded in 2 usecs [ 6749.118480] [drm] UVD initialized successfully. 
[ 6749.118615] [drm] ib test on ring 0 succeeded in 0 usecs [ 6749.118672] [drm] ib test on ring 1 succeeded in 0 usecs [ 6749.118732] [drm] ib test on ring 2 succeeded in 0 usecs [ 6749.118768] [drm] ib test on ring 3 succeeded in 0 usecs [ 6749.118804] [drm] ib test on ring 4 succeeded in 0 usecs [ 6759.264624] radeon 0000:02:00.0: ring 5 stalled for more than 10000msec [ 6759.264630] radeon 0000:02:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5) [ 6759.264634] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35). [ 6759.264640] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). [ 6759.264667] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed [ 6759.279402] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc33d04 [ 6759.279407] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.279410] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.279413] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) [ 6759.280478] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804 [ 6759.280482] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.280484] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.280487] VM fault (0x04, vmid 1) at page 138462, read from TC (72) [ 6759.281017] radeon 0000:02:00.0: GPU fault detected: 146 0x01033d04 [ 6759.281020] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6759.281023] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.281025] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61) [ 6759.281062] radeon 0000:02:00.0: GPU fault detected: 146 0x01033d04 [ 6759.281064] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.281066] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.281069] VM fault (0x04, vmid 1) at page 0, read from TC (72) [ 6759.283614] 
radeon 0000:02:00.0: GPU fault detected: 146 0x0143a004 [ 6759.283619] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001D38A [ 6759.283621] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030A0004 [ 6759.283624] VM fault (0x04, vmid 1) at page 119690, write from CB (160) [ 6759.283841] radeon 0000:02:00.0: GPU fault detected: 146 0x05439004 [ 6759.283844] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.283846] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.283848] VM fault (0x04, vmid 1) at page 0, read from TC (72) [ 6759.283853] radeon 0000:02:00.0: GPU fault detected: 146 0x05439004 [ 6759.283856] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001D391 [ 6759.283858] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.283861] VM fault (0x04, vmid 1) at page 119697, read from TC (72) [ 6759.283889] radeon 0000:02:00.0: GPU fault detected: 146 0x05c3a004 [ 6759.283891] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.283894] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.283896] VM fault (0x04, vmid 1) at page 0, write from DMA1 (61) [ 6759.283901] radeon 0000:02:00.0: GPU fault detected: 146 0x05a32004 [ 6759.283904] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6759.283906] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.283908] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61) [ 6759.283914] radeon 0000:02:00.0: GPU fault detected: 146 0x05a31004 [ 6759.283916] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6759.283918] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.283921] VM fault (0x04, vmid 1) at page 137608, write from DMA1 (61) [ 6759.283965] radeon 0000:02:00.0: GPU fault detected: 146 0x06e3d004 [ 6759.283967] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 
6759.283969] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.283972] VM fault (0x04, vmid 1) at page 0, read from TC (72) [ 6759.284178] radeon 0000:02:00.0: GPU fault detected: 146 0x01424804 [ 6759.284180] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001D38C [ 6759.284183] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6759.284185] VM fault (0x04, vmid 1) at page 119692, write from CB (96) [ 6759.284190] radeon 0000:02:00.0: GPU fault detected: 146 0x03224804 [ 6759.284193] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001D3A4 [ 6759.284195] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03050004 [ 6759.284197] VM fault (0x04, vmid 1) at page 119716, write from CB (80) [ 6759.284422] radeon 0000:02:00.0: GPU fault detected: 146 0x01224804 [ 6759.284424] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.284427] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.284429] VM fault (0x04, vmid 1) at page 0, read from TC (72) [ 6759.284444] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004 [ 6759.284447] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001D395 [ 6759.284449] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.284451] VM fault (0x04, vmid 1) at page 119701, read from TC (72) [ 6759.284556] radeon 0000:02:00.0: GPU fault detected: 146 0x03035004 [ 6759.284558] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.284561] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6759.284563] VM fault (0x04, vmid 1) at page 0, write from CB (96) [ 6759.284568] radeon 0000:02:00.0: GPU fault detected: 146 0x0343a004 [ 6759.284570] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021B93 [ 6759.284573] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03090004 [ 6759.284575] VM fault (0x04, vmid 1) at page 138131, write from CB (144) [ 
6759.284612] radeon 0000:02:00.0: GPU fault detected: 146 0x03232004 [ 6759.284615] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 6759.284617] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.284619] VM fault (0x04, vmid 1) at page 0, write from DMA1 (61) [ 6759.284624] radeon 0000:02:00.0: GPU fault detected: 146 0x03c39004 [ 6759.284627] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDF [ 6759.284629] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.284631] VM fault (0x04, vmid 1) at page 138463, write from DMA1 (61) [ 6759.284637] radeon 0000:02:00.0: GPU fault detected: 146 0x03231004 [ 6759.284639] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.284641] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.284644] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) [ 6759.284649] radeon 0000:02:00.0: GPU fault detected: 146 0x03231004 [ 6759.284651] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.284653] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6759.284656] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) [ 6759.284716] radeon 0000:02:00.0: GPU fault detected: 146 0x05035004 [ 6759.284718] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021B93 [ 6759.284720] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02044004 [ 6759.284723] VM fault (0x04, vmid 1) at page 138131, read from TC (68) [ 6759.284728] radeon 0000:02:00.0: GPU fault detected: 146 0x05035004 [ 6759.284730] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDF [ 6759.284732] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C4004 [ 6759.284735] VM fault (0x04, vmid 1) at page 138463, read from TC (196) [ 6759.516471] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004 [ 6759.516475] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 
0x00021988 [ 6759.516477] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6759.516479] VM fault (0x04, vmid 1) at page 137608, write from CB (96) [ 6759.516483] radeon 0000:02:00.0: GPU fault detected: 146 0x01035004 [ 6759.516485] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6759.516486] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6759.516488] VM fault (0x04, vmid 1) at page 137608, write from CB (96) [ 6759.533652] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc36004 [ 6759.533656] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.533658] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6759.533659] VM fault (0x04, vmid 1) at page 138462, write from CB (96) [ 6759.533858] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc32004 [ 6759.533860] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.533861] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03020004 [ 6759.533863] VM fault (0x04, vmid 1) at page 138462, write from CB (32) [ 6759.547549] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804 [ 6759.547552] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.547554] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.547555] VM fault (0x04, vmid 1) at page 138462, read from TC (72) [ 6761.492893] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc33d04 [ 6761.492896] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6761.492898] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6761.492900] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) [ 6761.493081] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804 [ 6759.547552] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6759.547554] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6759.547555] VM fault (0x04, vmid 1) at page 
138462, read from TC (72) [ 6761.492893] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc33d04 [ 6761.492896] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6761.492898] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004 [ 6761.492900] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) [ 6761.493081] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc24804 [ 6761.493083] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6761.493085] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004 [ 6761.493087] VM fault (0x04, vmid 1) at page 138462, read from TC (72) [ 6761.493486] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc36004 [ 6761.493489] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6761.493491] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6761.493493] VM fault (0x04, vmid 1) at page 138462, write from CB (96) [ 6762.236056] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc21004 [ 6762.236060] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6762.236062] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02010004 [ 6762.236064] VM fault (0x04, vmid 1) at page 138462, read from CB (16) [ 6762.236240] radeon 0000:02:00.0: GPU fault detected: 146 0x0bc22004 [ 6762.236244] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CDE [ 6762.236246] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02020004 [ 6762.236248] VM fault (0x04, vmid 1) at page 138462, read from CB (32) [ 6770.359479] radeon 0000:02:00.0: GPU fault detected: 146 0x01036004 [ 6770.359483] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6770.359489] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060004 [ 6770.359492] VM fault (0x04, vmid 1) at page 137608, write from CB (96) [ 6770.359496] radeon 0000:02:00.0: GPU fault detected: 146 0x01039004 [ 6770.359498] radeon 0000:02:00.0: 
VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021988 [ 6770.359500] radeon 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02044004 [ 6770.359502] VM fault (0x04, vmid 1) at page 137608, read from TC (68)

This is getting even more interesting. After some investigation I got an idea of why the bisect never succeeds. It looks like there were no stable kernels at all: 3.15 is also broken. However, it takes "almost forever" to crash it with the previously used methods. Somehow I stumbled on a similar but far more effective use case (another map in the BfW game), which locks up the GPU within seconds to a minute. That's what I need :). It has also proven to knock down "good" 3.15 kernels in about 30 seconds. So 3.15 was not good at all. Obviously my bisect couldn't succeed. On the other hand, now I can try the mentioned patches...

Does a 3.17-based drm-fixes kernel tree work better? There have been a couple of stability fixes.

1) About 3.15 + patch: I gave it a try, and it took quite a while to form an opinion about it. Overall it is quite stable and survives several days of running the problematic load. But eventually the GPU can still crash. The interesting thing about the occurrence I caught is that, regardless of the scary message about the failed DPM resume, the GPU seems to be operable after a successful recovery. I got a couple of similar crashes as well within a week.
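As a side note on reading the 3.16 log earlier in this report: each `VM_CONTEXT1_PROTECTION_FAULT_STATUS` value is followed by a decoded `VM fault (...)` line, and the two can be matched up by hand. The sketch below does that in Python and tallies faults by client. The bit positions are an assumption inferred from pairs of lines in that log (for example `0x0303D004` next to `write from DMA1 (61)`), not copied from the driver source, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS, inferred from the
# status/summary line pairs in the log (not taken from the driver headers):
#   bits 0-3   protection fault bits
#   bits 12-19 memory client id
#   bit  24    1 = write, 0 = read
#   bits 25-28 vmid (field width is a guess; only vmid 1 appears in the log)


def decode_fault_status(status: int) -> dict:
    """Split a fault status word into the fields printed by the kernel."""
    return {
        "protections": status & 0xF,
        "client_id": (status >> 12) & 0xFF,
        "write": bool((status >> 24) & 0x1),
        "vmid": (status >> 25) & 0xF,
    }


# Matches the summary lines, e.g.
# "VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61)"
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from (\w+) \((\d+)\)"
)


def tally_faults(log_text: str) -> Counter:
    """Count faults per (access type, client block) in a dmesg excerpt."""
    counts = Counter()
    for _prot, _vmid, _page, access, client, _cid in FAULT_RE.findall(log_text):
        counts[(access, client)] += 1
    return counts


if __name__ == "__main__":
    print(decode_fault_status(0x0303D004))
    sample = (
        "[ 6759.279413] VM fault (0x04, vmid 1) at page 138462, write from DMA1 (61) "
        "[ 6759.280487] VM fault (0x04, vmid 1) at page 138462, read from TC (72)"
    )
    print(tally_faults(sample))
```

Running `tally_faults` over a saved dmesg excerpt gives a quick view of which blocks (DMA1, TC, CB in the log above) report faults most often, and whether they cluster on the same pages.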
It looked like this: ===cut=== [815114.959250] SysRq : Emergency Sync [815115.071974] Emergency Sync complete [815116.935547] radeon 0000:01:00.0: ring 0 stalled for more than 10082msec [815116.935556] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000007f39f60 last fence id 0x0000000007f39f5f on ring 0) [815116.935564] radeon 0000:01:00.0: failed to get a new IB (-35) [815116.942472] radeon 0000:01:00.0: sa_manager is not empty, clearing anyway [815117.134467] SysRq : Keyboard mode set to system default [815117.500079] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080406640 flags=0x0000] [815117.500092] radeon 0000:01:00.0: Saved 6061 dwords of commands on ring 0. [815117.500097] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080406650 flags=0x0020] [815117.500104] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080000100 flags=0x0020] [815117.500110] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080404500 flags=0x0000] [815117.500222] radeon 0000:01:00.0: GPU softreset: 0x0000006C [815117.500226] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 [815117.500229] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [815117.500231] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [815117.500233] radeon 0000:01:00.0: SRBM_STATUS = 0x200002C0 [815117.500349] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [815117.500351] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [815117.500353] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [815117.500356] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [815117.500358] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80010243 [815117.500360] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44483106 [815117.500362] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C84246 [815117.500365] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [815117.500368] 
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [815118.057253] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF [815118.057308] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140 [815118.058465] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 [815118.058468] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [815118.058470] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [815118.058472] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 [815118.058583] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [815118.058585] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [815118.058588] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [815118.058590] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [815118.058592] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 [815118.058594] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [815118.058597] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [815118.058843] radeon 0000:01:00.0: GPU reset succeeded, trying to resume [815118.086936] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0 [815118.086939] [drm] PCIE gen 2 link speeds already enabled [815118.090599] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). 
[815118.090704] radeon 0000:01:00.0: WB enabled [815118.090707] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880414545c00 [815118.090709] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880414545c04 [815118.090711] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880414545c08 [815118.090713] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880414545c0c [815118.090715] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880414545c10 [815118.091689] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90012135a18 [815118.278813] [drm] ring test on 0 succeeded in 3 usecs [815118.278819] [drm] ring test on 1 succeeded in 1 usecs [815118.278824] [drm] ring test on 2 succeeded in 1 usecs [815118.278888] [drm] ring test on 3 succeeded in 2 usecs [815118.278897] [drm] ring test on 4 succeeded in 1 usecs [815118.455982] [drm] ring test on 5 succeeded in 2 usecs [815118.455989] [drm] UVD initialized successfully. [815128.453467] radeon 0000:01:00.0: ring 0 stalled for more than 10001msec [815128.453477] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000007f39fad last fence id 0x0000000007f39f5f on ring 0) [815128.453483] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35). [815128.453491] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35). [815128.453496] radeon 0000:01:00.0: ib ring test failed (-35). 
[815129.011900] radeon 0000:01:00.0: GPU softreset: 0x00000048 [815129.011904] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 [815129.011907] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [815129.011909] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [815129.011911] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 [815129.012022] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [815129.012025] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [815129.012027] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [815129.012029] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [815129.012031] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80010243 [815129.012034] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [815129.012036] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [815129.012039] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [815129.012041] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [815129.561916] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF [815129.561971] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100 [815129.563128] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 [815129.563131] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [815129.563133] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [815129.563135] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 [815129.563246] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [815129.563249] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [815129.563251] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [815129.563253] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [815129.563255] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 [815129.563257] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [815129.563260] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [815129.563506] radeon 0000:01:00.0: GPU reset succeeded, trying to resume [815129.576411] [drm] probing gen 2 caps for device 
1002:5a16 = 31cd02/0 [815129.576415] [drm] PCIE gen 2 link speeds already enabled [815129.580147] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). [815129.580250] radeon 0000:01:00.0: WB enabled [815129.580253] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880414545c00 [815129.580255] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880414545c04 [815129.580257] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880414545c08 [815129.580259] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880414545c0c [815129.580261] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880414545c10 [815129.581232] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90012135a18 [815129.767993] [drm] ring test on 0 succeeded in 3 usecs [815129.767999] [drm] ring test on 1 succeeded in 1 usecs [815129.768004] [drm] ring test on 2 succeeded in 1 usecs [815129.768068] [drm] ring test on 3 succeeded in 2 usecs [815129.768077] [drm] ring test on 4 succeeded in 1 usecs [815129.945157] [drm] ring test on 5 succeeded in 2 usecs [815129.945164] [drm] UVD initialized successfully. 
[815129.946125] [drm] ib test on ring 0 succeeded in 0 usecs [815129.946210] [drm] ib test on ring 1 succeeded in 0 usecs [815129.946301] [drm] ib test on ring 2 succeeded in 0 usecs [815129.946345] [drm] ib test on ring 3 succeeded in 0 usecs [815129.946380] [drm] ib test on ring 4 succeeded in 0 usecs [815137.847012] SysRq : Emergency Sync [815137.965713] Emergency Sync complete [815139.742325] SysRq : Emergency Sync [815139.864190] Emergency Sync complete [815140.093163] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec [815140.093173] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5) [815140.093179] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35). [815140.093188] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). [815140.093217] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
===cut===

2) About 3.17: I tried 3.17-rc1, and it crashed after about 30 seconds of the problematic workload. I will try newer RCs as well, since I can see there were some further changes to the radeon-related code.

Tested 3.17-rc4. Result: it crashed after about 3 minutes of running (see below). Are some stability fixes missing from 3.17-rc4 mainline? At first glance I do not see any radeon-related commits in drm-fixes that haven't made it into -rc4. Am I missing something?

===cut===
kernel: [ 599.949295] radeon 0000:01:00.0: ring 3 stalled for more than 10167msec kernel: [ 599.949305] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000001eb0 last fence id 0x0000000000001eaf on ring 3) kernel: [ 599.949312] radeon 0000:01:00.0: scheduling IB failed (-35).
kernel: [ 600.507409] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a840 flags=0x0010] kernel: [ 600.507420] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a870 flags=0x0030] kernel: [ 600.507426] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x0000000080000100 flags=0x0030] kernel: [ 600.507431] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0018 address=0x000000008040a700 flags=0x0010] kernel: [ 600.507460] radeon 0000:01:00.0: Saved 19308 dwords of commands on ring 0. kernel: [ 600.507590] radeon 0000:01:00.0: GPU softreset: 0x0000006C kernel: [ 600.507593] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 kernel: [ 600.507596] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 kernel: [ 600.507598] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 kernel: [ 600.507600] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 kernel: [ 600.507711] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 kernel: [ 600.507714] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 kernel: [ 600.507716] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 kernel: [ 600.507718] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 kernel: [ 600.507720] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80010243 kernel: [ 600.507723] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44483106 kernel: [ 600.507725] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44E84266 kernel: [ 600.507728] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 kernel: [ 600.507730] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 kernel: [ 601.054357] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF kernel: [ 601.054411] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140 kernel: [ 601.055568] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 kernel: [ 601.055571] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 kernel: [ 601.055573] radeon 0000:01:00.0: 
GRBM_STATUS_SE1 = 0x00000006 kernel: [ 601.055575] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0 kernel: [ 601.055686] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 kernel: [ 601.055689] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 kernel: [ 601.055691] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 kernel: [ 601.055693] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 kernel: [ 601.055695] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 kernel: [ 601.055698] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 kernel: [ 601.055700] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 kernel: [ 601.055951] radeon 0000:01:00.0: GPU reset succeeded, trying to resume kernel: [ 601.083744] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0 kernel: [ 601.083747] [drm] PCIE gen 2 link speeds already enabled kernel: [ 601.084938] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). kernel: [ 601.085046] radeon 0000:01:00.0: WB enabled kernel: [ 601.085049] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880413fbec00 kernel: [ 601.085052] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880413fbec04 kernel: [ 601.085054] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880413fbec08 kernel: [ 601.085056] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880413fbec0c kernel: [ 601.085057] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880413fbec10 kernel: [ 601.086030] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18 kernel: [ 601.271000] [drm] ring test on 0 succeeded in 3 usecs kernel: [ 601.271006] [drm] ring test on 1 succeeded in 1 usecs kernel: [ 601.271011] [drm] ring test on 2 succeeded in 1 usecs kernel: [ 
601.271075] [drm] ring test on 3 succeeded in 2 usecs kernel: [ 601.271084] [drm] ring test on 4 succeeded in 1 usecs kernel: [ 601.448164] [drm] ring test on 5 succeeded in 2 usecs kernel: [ 601.448172] [drm] UVD initialized successfully. kernel: [ 611.444226] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec kernel: [ 611.444237] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000001a60a last fence id 0x000000000001a4dd on ring 0) kernel: [ 611.444244] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35). kernel: [ 611.444252] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35). kernel: [ 611.444257] radeon 0000:01:00.0: ib ring test failed (-35). kernel: [ 611.997330] radeon 0000:01:00.0: GPU softreset: 0x00000048 kernel: [ 611.997333] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 kernel: [ 611.997336] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 kernel: [ 611.997338] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 kernel: [ 611.997341] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 kernel: [ 611.997452] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 kernel: [ 611.997454] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 kernel: [ 611.997456] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 kernel: [ 611.997458] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00400002 kernel: [ 611.997461] radeon 0000:01:00.0: R_008680_CP_STAT = 0x84010243 kernel: [ 611.997463] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 kernel: [ 611.997465] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 kernel: [ 611.997468] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 kernel: [ 611.997470] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 kernel: [ 612.542126] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF kernel: [ 612.542180] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100 kernel: [ 612.543338] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 kernel: [ 
612.543340] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 kernel: [ 612.543343] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 kernel: [ 612.543345] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 kernel: [ 612.543456] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 kernel: [ 612.543458] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 kernel: [ 612.543460] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 kernel: [ 612.543462] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 kernel: [ 612.543465] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 kernel: [ 612.543467] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 kernel: [ 612.543469] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 kernel: [ 612.543724] radeon 0000:01:00.0: GPU reset succeeded, trying to resume kernel: [ 612.556911] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0 kernel: [ 612.556915] [drm] PCIE gen 2 link speeds already enabled kernel: [ 612.558107] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). 
kernel: [ 612.558216] radeon 0000:01:00.0: WB enabled kernel: [ 612.558219] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff880413fbec00 kernel: [ 612.558222] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff880413fbec04 kernel: [ 612.558224] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff880413fbec08 kernel: [ 612.558226] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff880413fbec0c kernel: [ 612.558228] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff880413fbec10 kernel: [ 612.559203] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90011db5a18 kernel: [ 612.744297] [drm] ring test on 0 succeeded in 3 usecs kernel: [ 612.744302] [drm] ring test on 1 succeeded in 1 usecs kernel: [ 612.744308] [drm] ring test on 2 succeeded in 1 usecs kernel: [ 612.744371] [drm] ring test on 3 succeeded in 2 usecs kernel: [ 612.744380] [drm] ring test on 4 succeeded in 1 usecs kernel: [ 612.921464] [drm] ring test on 5 succeeded in 2 usecs kernel: [ 612.921472] [drm] UVD initialized successfully. kernel: [ 612.921539] [drm] ib test on ring 0 succeeded in 0 usecs kernel: [ 612.921634] [drm] ib test on ring 1 succeeded in 0 usecs kernel: [ 612.921722] [drm] ib test on ring 2 succeeded in 0 usecs kernel: [ 612.921762] [drm] ib test on ring 3 succeeded in 0 usecs kernel: [ 612.921796] [drm] ib test on ring 4 succeeded in 0 usecs kernel: [ 623.068910] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec kernel: [ 623.068921] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5) kernel: [ 623.068927] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35). 
kernel: [ 623.068935] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). kernel: [ 623.098333] radeon 0000:01:00.0: GPU fault detected: 146 0x07a23d0c kernel: [ 623.098342] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000BDBD kernel: [ 623.098347] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0203D00C kernel: [ 623.098352] VM fault (0x0c, vmid 1) at page 48573, read from DMA1 (61) kernel: [ 623.098364] radeon 0000:01:00.0: GPU fault detected: 146 0x07c23d0c kernel: [ 623.098368] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 kernel: [ 623.098372] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0208400C kernel: [ 623.098377] VM fault (0x0c, vmid 1) at page 0, read from TC (132) kernel: [ 623.098383] radeon 0000:01:00.0: GPU fault detected: 146 0x07e23d0c kernel: [ 623.098387] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000BDBC kernel: [ 623.098391] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C kernel: [ 623.098395] VM fault (0x0c, vmid 1) at page 48572, read from TC (8) kernel: [ 623.128770] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14 kernel: [ 623.128781] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000BDB0 kernel: [ 623.128787] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D014 kernel: [ 623.128793] VM fault (0x04, vmid 1) at page 48560, write from DMA1 (61) kernel: [ 623.128820] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14 kernel: [ 623.128825] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 kernel: [ 623.128830] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C kernel: [ 623.128835] VM fault (0x0c, vmid 1) at page 0, read from TC (68) kernel: [ 623.128842] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14 kernel: [ 623.128847] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000BDB8 kernel: [ 623.128852] radeon 0000:01:00.0: 
VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C kernel: [ 623.128857] VM fault (0x0c, vmid 1) at page 48568, read from TC (68) kernel: [ 623.129932] radeon 0000:01:00.0: GPU fault detected: 146 0x06033d14 kernel: [ 623.129940] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0000BDB0 kernel: [ 623.129944] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D014 kernel: [ 623.129948] VM fault (0x04, vmid 1) at page 48560, write from DMA1 (61) kernel: [ 623.129965] radeon 0000:01:00.0: GPU fault detected: 146 0x06233d14
===cut===
Note: several megabytes of similar "VM fault" messages were skipped.

I've seen this bug as well, across quite a few versions of 3.15 and 3.16. Sometimes it just freezes X; other times it hangs the entire system. Here is the output from the last hang (I was able to log in remotely, as this time it didn't completely crash the system):

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT [Radeon R9 270X]

(uname -a)
Linux prime 3.16.3-gentoo #1 SMP PREEMPT Thu Sep 18 20:59:58 CDT 2014 x86_64 Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz GenuineIntel GNU/Linux

(lsmod)
cfbfillrect 3634 1 radeon
cfbimgblt 2055 1 radeon
cfbcopyarea 3110 1 radeon
i2c_algo_bit 5055 1 radeon
drm_kms_helper 33715 1 radeon
ttm 59052 1 radeon
drm 226864 6 ttm,drm_kms_helper,radeon
firmware_class 8187 1 radeon
radeon 1258462 3

(relevant dmesg info)
[120499.589293] radeon 0000:01:00.0: ring 0 stalled for more than 10473msec [120499.589296] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000000783d0 last fence id 0x00000000000783cf on ring 0) [120499.589299] radeon 0000:01:00.0: failed to get a new IB (-35) [120500.099613] radeon 0000:01:00.0: Saved 3600 dwords of commands on ring 0.
[120500.099743] radeon 0000:01:00.0: GPU softreset: 0x0000006C [120500.099746] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 [120500.099748] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [120500.099750] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [120500.099751] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0 [120500.099862] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [120500.099864] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [120500.099866] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [120500.099868] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [120500.099870] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80010243 [120500.099872] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83146 [120500.099874] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44E84266 [120500.099876] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [120500.099879] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [120500.592138] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF [120500.592192] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00100140 [120500.593350] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 [120500.593352] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [120500.593354] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [120500.593356] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0 [120500.593466] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [120500.593468] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [120500.593470] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [120500.593472] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [120500.593473] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 [120500.593475] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [120500.593477] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [120500.593718] radeon 0000:01:00.0: GPU reset succeeded, trying to resume [120500.621478] [drm] probing gen 2 caps for device 
8086:3c04 = 7a7103/e [120500.621482] [drm] PCIE gen 3 link speeds already enabled [120500.623908] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). [120500.624051] radeon 0000:01:00.0: WB enabled [120500.624054] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00 and cpu addr 0xffff8807fb4aac00 [120500.624056] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04 and cpu addr 0xffff8807fb4aac04 [120500.624058] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08 and cpu addr 0xffff8807fb4aac08 [120500.624059] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c and cpu addr 0xffff8807fb4aac0c [120500.624061] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10 and cpu addr 0xffff8807fb4aac10 [120500.624680] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900142b5a18 [120500.789277] [drm] ring test on 0 succeeded in 3 usecs [120500.789283] [drm] ring test on 1 succeeded in 1 usecs [120500.789287] [drm] ring test on 2 succeeded in 1 usecs [120500.789351] [drm] ring test on 3 succeeded in 2 usecs [120500.789361] [drm] ring test on 4 succeeded in 1 usecs [120500.981448] [drm] ring test on 5 succeeded in 2 usecs [120500.981456] [drm] UVD initialized successfully. [120510.981602] radeon 0000:01:00.0: ring 0 stalled for more than 10002msec [120510.981604] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000078407 last fence id 0x00000000000783cf on ring 0) [120510.981606] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35). [120510.981608] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35). [120510.981609] radeon 0000:01:00.0: ib ring test failed (-35). 
[120511.461309] radeon 0000:01:00.0: GPU softreset: 0x00000048 [120511.461310] radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028 [120511.461312] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [120511.461313] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [120511.461314] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 [120511.461428] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [120511.461429] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [120511.461431] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010000 [120511.461432] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000002 [120511.461434] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80010243 [120511.461435] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [120511.461437] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [120511.461439] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [120511.461440] radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [120511.933287] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF [120511.933340] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100 [120511.934495] radeon 0000:01:00.0: GRBM_STATUS = 0x00003028 [120511.934496] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006 [120511.934498] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006 [120511.934499] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0 [120511.934609] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000 [120511.934610] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [120511.934612] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [120511.934613] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [120511.934614] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000 [120511.934616] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [120511.934617] radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57 [120511.934857] radeon 0000:01:00.0: GPU reset succeeded, trying to resume [120511.945176] [drm] probing gen 2 caps for device 
8086:3c04 = 7a7103/e [120511.945179] [drm] PCIE gen 3 link speeds already enabled [120511.947127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000276000). [120511.947253] radeon 0000:01:00.0: WB enabled [120511.947255] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00 and cpu addr 0xffff8807fb4aac00 [120511.947256] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04 and cpu addr 0xffff8807fb4aac04 [120511.947257] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08 and cpu addr 0xffff8807fb4aac08 [120511.947258] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c and cpu addr 0xffff8807fb4aac0c [120511.947259] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10 and cpu addr 0xffff8807fb4aac10 [120511.947868] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc900142b5a18 [120512.109348] [drm] ring test on 0 succeeded in 4 usecs [120512.109352] [drm] ring test on 1 succeeded in 1 usecs [120512.109355] [drm] ring test on 2 succeeded in 1 usecs [120512.109417] [drm] ring test on 3 succeeded in 2 usecs [120512.109426] [drm] ring test on 4 succeeded in 1 usecs [120512.286478] [drm] ring test on 5 succeeded in 2 usecs [120512.286483] [drm] UVD initialized successfully. [120512.286534] [drm] ib test on ring 0 succeeded in 0 usecs [120512.286580] [drm] ib test on ring 1 succeeded in 0 usecs [120512.286623] [drm] ib test on ring 2 succeeded in 0 usecs [120512.286648] [drm] ib test on ring 3 succeeded in 0 usecs [120512.286672] [drm] ib test on ring 4 succeeded in 0 usecs [120522.435679] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec [120522.435685] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5) [120522.435688] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35). 
[120522.435695] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). [120522.435730] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed

After some updates to Mesa and the kernel, this bug no longer happens at all: 2D is now rock solid for me on the R9 270, and I can run even the most troublesome workloads for weeks without any issues. While I failed to pinpoint what exactly fixed the bug, thanks anyway. I think it is now safe to close this bug as fixed. I bet you've got dozens of other GPU lockups to chew on, so I'm glad to inform you that at least one nasty thing has been nailed down.

(It's me, the original bug reporter, who has forgotten the password for both the original mailbox and the Bugzilla account, dammit.)
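
For anyone re-reading the "VM fault" lines in the 3.17-rc4 log above: the human-readable line appears to be derived from the two register values printed just before it. Below is a minimal Python sketch of that decoding. The bit layout (protections in bits 0-3, client id in bits 12-19, read/write flag in bit 24, vmid in bits 25-28) is an assumption inferred from the log lines in this report, not taken from a register specification, and the helper names `decode_vm_fault` and `format_vm_fault` are hypothetical. Client names such as DMA1 or TC are not reproduced; only the numeric id is shown.

```python
# Sketch: decode VM_CONTEXT1_PROTECTION_FAULT_STATUS / _ADDR pairs as seen
# in the logs of this report. The field positions are ASSUMED from matching
# the raw values against the "VM fault (...)" lines in the same logs.

def decode_vm_fault(status: int, addr: int) -> dict:
    """Split a fault status word into the fields the log line reports."""
    return {
        "protections": status & 0xF,          # assumed bits 0-3
        "client_id": (status >> 12) & 0xFF,   # assumed bits 12-19
        "write": bool((status >> 24) & 0x1),  # assumed bit 24: 1 = write
        "vmid": (status >> 25) & 0xF,         # assumed bits 25-28
        "page": addr,                         # fault address is a page number
    }


def format_vm_fault(fault: dict) -> str:
    """Render the decoded fields in a form similar to the kernel message."""
    direction = "write" if fault["write"] else "read"
    return (
        f"VM fault (0x{fault['protections']:02x}, vmid {fault['vmid']}) "
        f"at page {fault['page']}, {direction} from client {fault['client_id']}"
    )


if __name__ == "__main__":
    # Register pairs copied from the 3.17-rc4 log in this report.
    samples = [
        (0x0203D00C, 0x0000BDBD),
        (0x0303D014, 0x0000BDB0),
        (0x0208400C, 0x00000000),
    ]
    for status, addr in samples:
        print(format_vm_fault(decode_vm_fault(status, addr)))
```

Fed the status/address pairs from the log, this reproduces the page numbers, vmid, read/write direction and numeric client ids that the kernel printed, which makes it easier to spot that the faulting accesses cluster around pages 48560-48573 and come from client 61 (reported as DMA1 in the log).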