Bug 94081
| Summary: | [radeon 3.18 regression] GPU reset recovery fails | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Jan Vesely (jano.vesely) |
| Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | NEW --- | | |
| Severity: | normal | CC: | abandonedaccountubdprczb8hs, Actualize.in.Material+bugzillakernel, szg00000, zazdxscf+bugzilla.kernel.org |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.18.x | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Jan Vesely
2015-03-01 19:02:03 UTC
here's dmesg output for 3.17 kernel: (looks like ib test on ring 5 is missing in 3.18)

[ 249.015280] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[ 249.015287] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000742 last fence id 0x0000000000000741 on ring 0)
[ 249.027303] radeon 0000:01:00.0: ring 0 stalled for more than 10012msec
[ 249.027309] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000745 last fence id 0x0000000000000741 on ring 0)
[ 249.131987] radeon 0000:01:00.0: Saved 183 dwords of commands on ring 0.
[ 249.132005] radeon 0000:01:00.0: GPU softreset: 0x00000009
[ 249.132007] radeon 0000:01:00.0: GRBM_STATUS = 0xA0433828
[ 249.132009] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x08000007
[ 249.132010] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[ 249.132012] radeon 0000:01:00.0: SRBM_STATUS = 0x20000AC0
[ 249.132014] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 249.132015] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 249.132017] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00018000
[ 249.132019] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00010002
[ 249.132021] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80038647
[ 249.132023] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 249.142433] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[ 249.142486] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 249.143643] radeon 0000:01:00.0: GRBM_STATUS = 0x00003828
[ 249.143644] radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000007
[ 249.143646] radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000007
[ 249.143648] radeon 0000:01:00.0: SRBM_STATUS = 0x200000C0
[ 249.143649] radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
[ 249.143651] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 249.143653] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 249.143654] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[ 249.143656] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
[ 249.143658] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 249.143681] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.165960] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[ 249.167181] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[ 249.167273] radeon 0000:01:00.0: WB enabled
[ 249.167274] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880401eeac00
[ 249.167276] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[ 249.168728] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc900062b2118
[ 249.185032] [drm] ring test on 0 succeeded in 3 usecs
[ 249.185042] [drm] ring test on 3 succeeded in 6 usecs
[ 249.362220] [drm] ring test on 5 succeeded in 2 usecs
[ 249.362228] [drm] UVD initialized successfully.
[ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[ 249.362282] radeon 0000:01:00.0: ib ring test failed (-35).
[ 249.370592] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.377656] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[ 249.378860] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[ 249.378951] radeon 0000:01:00.0: WB enabled
[ 249.378952] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880401eeac00
[ 249.378954] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[ 249.380537] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc900062b2118
[ 249.396825] [drm] ring test on 0 succeeded in 3 usecs
[ 249.396835] [drm] ring test on 3 succeeded in 7 usecs
[ 249.574006] [drm] ring test on 5 succeeded in 2 usecs
[ 249.574014] [drm] UVD initialized successfully.
[ 249.574061] [drm] ib test on ring 0 succeeded in 0 usecs
[ 249.574105] [drm] ib test on ring 3 succeeded in 0 usecs
[ 249.726529] [drm] ib test on ring 5 succeeded

Can you bisect?

(In reply to Jan Vesely from comment #1)
> here's dmesg output for 3.17 kernel:

[...]

> [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).

Actually, this looks like the reset didn't fully work with 3.17 either though...

(In reply to Michel Dänzer from comment #2)
> Can you bisect?

It took a while (the first bisect pointed at an unrelated i915 display commit).
The failure was introduced in:

commit dd7cfd641228abb2669d8d047d5ec377b1835900
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date: Tue Jan 21 13:07:31 2014 +0100

    drm/ttm: kill fence_lock

    No users are left, kill it off! :D
    Conversion to the reservation api is next on the list, after
    that the functionality can be restored with rcu.

    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>

The commit moves a call to fence get below two "goto cleanup" in error paths; however, fence_put is left in the cleanup: error target. Moving the fence_put call to pflip_cleanup fixes the issue.

I've posted a patch.
> (In reply to Jan Vesely from comment #1)
> > here's dmesg output for 3.17 kernel:
>
> [...]
>
> > [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> > [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
>
> Actually, this looks like the reset didn't fully work with 3.17 either though...

I don't remember seeing this during bisection. This log is from the Fedora 3.17.8 kernel. I'll check vanilla 3.17.8 and see whether it's Fedora-specific.

This does not make sense: the work structure is zeroed, so fence_put should be OK.
It looks like the lockup sometimes needs more than one GPU restart to manifest. I'll replay the bisect without the good entries (at least this explains the inconsistent bisect results).
Sorry for the noise.

(In reply to Jan Vesely from comment #3)
> (In reply to Michel Dänzer from comment #2)
> > Can you bisect?
>
> It took a while (first bisect found unrelated i915 display commit).
> the failure was introduced in:
>
> commit dd7cfd641228abb2669d8d047d5ec377b1835900
> Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> Date: Tue Jan 21 13:07:31 2014 +0100
>
> drm/ttm: kill fence_lock
>
> No users are left, kill it off! :D
> Conversion to the reservation api is next on the list, after
> that the functionality can be restored with rcu.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>
> the commit moves a call to fence get below two "goto cleanup" in error
> paths, however, fence_put is left in the cleanup: error target. Moving the
> fence_put call to pflip_cleanup fixes the issue.
>
> I've posted a patch.
>
> > (In reply to Jan Vesely from comment #1)
> > > here's dmesg output for 3.17 kernel:
> >
> > [...]
> >
> > > [ 249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> > > [ 249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
> >
> > Actually, this looks like the reset didn't fully work with 3.17 either though...
>
> I don't remember seeing this during bisection. This log is from fedora
> 3.17.8 kernel. I'll check 3.17.8 vanilla and see whether it's fedora specific
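
When comparing reset behaviour between kernels (for example, to check whether an IB test is missing for a ring, as suspected for 3.18 in comment #1), it can help to tally the ring and IB test messages per reset attempt instead of reading the raw dmesg. The following is a minimal sketch in Python, assuming the message formats match the 3.17 log in this report. The names `summarize_resets` and `missing_ib_tests` and the embedded `SAMPLE` excerpt are illustrative only and are not part of any radeon or kernel tooling.

```python
import re

# One alternative per message of interest; formats are copied from the
# 3.17 dmesg output in this report and may differ on other kernels.
EVENT = re.compile(
    r"(?P<reset>GPU reset succeeded, trying to resume)"
    r"|ring test on (?P<ring>\d+) succeeded"
    r"|ib test on ring (?P<ib>\d+) succeeded"
    r"|ib ring test failed \((?P<err>-?\d+)\)"
)


def summarize_resets(dmesg_text):
    """Group ring/IB test results by the reset attempt that preceded them."""
    attempts = []
    for m in EVENT.finditer(dmesg_text):
        if m.group("reset"):
            # A new resume attempt starts here.
            attempts.append({"ring_ok": [], "ib_ok": [], "ib_error": None})
        elif not attempts:
            # Test messages before the first reset (e.g. at boot) are ignored.
            continue
        elif m.group("ring") is not None:
            attempts[-1]["ring_ok"].append(int(m.group("ring")))
        elif m.group("ib") is not None:
            attempts[-1]["ib_ok"].append(int(m.group("ib")))
        else:
            attempts[-1]["ib_error"] = int(m.group("err"))
    return attempts


def missing_ib_tests(attempt):
    """Rings that passed the ring test but have no passing IB test."""
    return sorted(set(attempt["ring_ok"]) - set(attempt["ib_ok"]))


# Excerpt of the 3.17 log from comment #1 (two consecutive reset attempts).
SAMPLE = """\
[ 249.143681] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.185032] [drm] ring test on 0 succeeded in 3 usecs
[ 249.185042] [drm] ring test on 3 succeeded in 6 usecs
[ 249.362220] [drm] ring test on 5 succeeded in 2 usecs
[ 249.362282] radeon 0000:01:00.0: ib ring test failed (-35).
[ 249.370592] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 249.396825] [drm] ring test on 0 succeeded in 3 usecs
[ 249.396835] [drm] ring test on 3 succeeded in 7 usecs
[ 249.574006] [drm] ring test on 5 succeeded in 2 usecs
[ 249.574061] [drm] ib test on ring 0 succeeded in 0 usecs
[ 249.574105] [drm] ib test on ring 3 succeeded in 0 usecs
[ 249.726529] [drm] ib test on ring 5 succeeded
"""

if __name__ == "__main__":
    for number, attempt in enumerate(summarize_resets(SAMPLE), start=1):
        print(number, attempt, "missing IB tests:", missing_ib_tests(attempt))
```

For the excerpt above, the first attempt ends with error -35 and no passing IB test on rings 0, 3 and 5, while the second attempt has no missing IB tests. The pattern is matched against the whole text rather than line by line, so it should also work on a capture whose newlines were lost in copy and paste.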