Bug 94081 - [radeon 3.18 regression] GPU reset recovery fails
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
Reported: 2015-03-01 19:02 UTC by Jan Vesely
Modified: 2016-06-05 03:34 UTC

Kernel Version: 3.18.x
Regression: No

Description Jan Vesely 2015-03-01 19:02:03 UTC
Starting with kernel 3.18 (Fedora build), the radeon driver fails to recover from an OpenCL-induced GPU lockup.
Reproducer:
Run the noise-hurl.xml OpenCL test in the gegl library:
[354672.707822] radeon 0000:01:00.0: ring 0 stalled for more than 10020msec

On 3.17 (Fedora again) I observe one or two display flashes, followed by full recovery.

Starting with 3.18 I see the flash, and the display stays frozen. The task itself (gegl) stays in uninterruptible state.

Here are the relevant lines from dmesg on 3.18:
[354672.707822] radeon 0000:01:00.0: ring 0 stalled for more than 10020msec
[354672.707828] radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000007778a3 last fence id 0x00000000007778b3 on ring 0)
[354672.828879] radeon 0000:01:00.0: Saved 503 dwords of commands on ring 0.
[354672.828898] radeon 0000:01:00.0: GPU softreset: 0x00000009
[354672.828900] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0433828
[354672.828902] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x08000007
[354672.828903] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[354672.828905] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
[354672.828907] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[354672.828908] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[354672.828910] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
[354672.828912] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00010002
[354672.828913] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80038647
[354672.828915] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[354672.842214] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[354672.842267] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[354672.843423] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[354672.843425] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[354672.843426] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[354672.843428] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[354672.843429] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[354672.843431] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[354672.843432] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[354672.843434] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[354672.843435] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[354672.843437] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[354672.843456] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[354672.865723] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[354672.868296] [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
[354672.868388] radeon 0000:01:00.0: WB enabled
[354672.868390] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880401c54c00
[354672.868391] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880401c54c0c
[354672.869865] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc900062b2118
[354672.886233] [drm] ring test on 0 succeeded in 3 usecs
[354672.886244] [drm] ring test on 3 succeeded in 7 usecs
[354673.063433] [drm] ring test on 5 succeeded in 2 usecs
[354673.063441] [drm] UVD initialized successfully.
[354673.187403] [drm] ib test on ring 0 succeeded in 0 usecs
[354673.187432] [drm] ib test on ring 3 succeeded in 0 usecs
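
The register dump above can be compared before and after the soft reset with a small helper. This is only a sketch for reading the log: the function names are hypothetical, and treating "current fence id" as the last signaled fence and "last fence id" as the last emitted one is an assumption about the log wording, not something stated in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bits that were set before the reset and clear after it. */
static uint32_t bits_cleared(uint32_t before, uint32_t after)
{
    return before & ~after;
}

/*
 * Hypothetical helper: number of fences still pending at lockup time,
 * assuming current_id is the last signaled fence and last_id the last
 * emitted one (an assumption about the log format).
 */
static uint64_t fences_outstanding(uint64_t current_id, uint64_t last_id)
{
    assert(last_id >= current_id);
    return last_id - current_id;
}
```

With the values logged above, GRBM_STATUS goes from 0xA0433828 to 0x00003828, so the cleared bits are 0xA0430000, and SRBM_STATUS goes from 0x20000AC0 to 0x200000C0, so the cleared bits are 0x00000A00. Under the stated assumption, the fence ids 0x7778a3 and 0x7778b3 are 16 apart. The helper does not say what the individual bits mean; that needs the register definitions for the specific chip.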
Comment 1 Jan Vesely 2015-03-01 19:05:45 UTC
Here's the dmesg output for the 3.17 kernel:
(it looks like the ib test on ring 5 is missing in 3.18)

[  249.015280] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec
[  249.015287] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000742 last fence id 0x0000000000000741 on ring 0)
[  249.027303] radeon 0000:01:00.0: ring 0 stalled for more than 10012msec
[  249.027309] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000745 last fence id 0x0000000000000741 on ring 0)
[  249.131987] radeon 0000:01:00.0: Saved 183 dwords of commands on ring 0.
[  249.132005] radeon 0000:01:00.0: GPU softreset: 0x00000009
[  249.132007] radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0433828
[  249.132009] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x08000007
[  249.132010] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[  249.132012] radeon 0000:01:00.0:   SRBM_STATUS               = 0x20000AC0
[  249.132014] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[  249.132015] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  249.132017] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
[  249.132019] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00010002
[  249.132021] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80038647
[  249.132023] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  249.142433] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[  249.142486] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  249.143643] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[  249.143644] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[  249.143646] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[  249.143648] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
[  249.143649] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[  249.143651] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  249.143653] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  249.143654] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  249.143656] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[  249.143658] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  249.143681] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  249.165960] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[  249.167181] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[  249.167273] radeon 0000:01:00.0: WB enabled
[  249.167274] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880401eeac00
[  249.167276] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[  249.168728] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc900062b2118
[  249.185032] [drm] ring test on 0 succeeded in 3 usecs
[  249.185042] [drm] ring test on 3 succeeded in 6 usecs
[  249.362220] [drm] ring test on 5 succeeded in 2 usecs
[  249.362228] [drm] UVD initialized successfully.
[  249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[  249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[  249.362282] radeon 0000:01:00.0: ib ring test failed (-35).
[  249.370592] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  249.377656] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[  249.378860] [drm] PCIE GART of 1024M enabled (table at 0x0000000000273000).
[  249.378951] radeon 0000:01:00.0: WB enabled
[  249.378952] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880401eeac00
[  249.378954] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880401eeac0c
[  249.380537] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc900062b2118
[  249.396825] [drm] ring test on 0 succeeded in 3 usecs
[  249.396835] [drm] ring test on 3 succeeded in 7 usecs
[  249.574006] [drm] ring test on 5 succeeded in 2 usecs
[  249.574014] [drm] UVD initialized successfully.
[  249.574061] [drm] ib test on ring 0 succeeded in 0 usecs
[  249.574105] [drm] ib test on ring 3 succeeded in 0 usecs
[  249.726529] [drm] ib test on ring 5 succeeded
Comment 2 Michel Dänzer 2015-03-02 06:58:59 UTC
Can you bisect?

(In reply to Jan Vesely from comment #1)
> here's dmesg output for 3.17 kernel:

[...]

> [  249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> [  249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> on GFX ring (-35).

Actually, this looks like the reset didn't fully work with 3.17 either though...
Comment 3 Jan Vesely 2015-03-04 21:10:26 UTC
(In reply to Michel Dänzer from comment #2)
> Can you bisect?

It took a while (the first bisect found an unrelated i915 display commit).
The failure was introduced in:

commit dd7cfd641228abb2669d8d047d5ec377b1835900
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Tue Jan 21 13:07:31 2014 +0100

    drm/ttm: kill fence_lock
    
    No users are left, kill it off! :D
    Conversion to the reservation api is next on the list, after
    that the functionality can be restored with rcu.
    
    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>

The commit moves a call to fence get below two "goto cleanup" statements in error paths; however, fence_put is left in the cleanup: error target. Moving the fence_put call to pflip_cleanup fixes the issue.

I've posted a patch.
> 
> (In reply to Jan Vesely from comment #1)
> > here's dmesg output for 3.17 kernel:
> 
> [...]
> 
> > [  249.362280] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
> > [  249.362281] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB
> > on GFX ring (-35).
> 
> Actually, this looks like the reset didn't fully work with 3.17 either
> though...

I don't remember seeing this during bisection. This log is from the Fedora 3.17.8 kernel. I'll check vanilla 3.17.8 and see whether it's Fedora-specific.
Comment 4 Jan Vesely 2015-03-05 01:12:11 UTC
This does not make sense: the work structure is zeroed, so the fence put should be OK.
It looks like the lockup sometimes needs more than one GPU restart to manifest.
I'll replay the bisection without the "good" entries (at least this explains the inconsistent bisect results).

sorry for the noise
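
To illustrate the point about the zeroed work structure, here is a minimal userspace sketch of the pattern, assuming the put helper ignores a NULL pointer. All names (demo_fence, demo_flip_work, demo_queue_flip and so on) are hypothetical stand-ins, not the radeon or fence API, and the sketch drops the reference on both paths for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a refcounted fence; not the kernel's type. */
struct demo_fence {
    int refcount;
};

/* Hypothetical stand-in for the flip work item that holds a fence pointer. */
struct demo_flip_work {
    struct demo_fence *fence;
};

/* Take a reference; a NULL fence is passed through unchanged. */
static struct demo_fence *demo_fence_get(struct demo_fence *f)
{
    if (f)
        f->refcount++;
    return f;
}

/* Drop a reference; assumed to be a no-op for NULL. */
static void demo_fence_put(struct demo_fence *f)
{
    if (!f)
        return;
    assert(f->refcount > 0);
    f->refcount--;
}

/*
 * fail_early != 0 simulates an error path that jumps to cleanup
 * before the reference is taken. Returns 0 on success, -1 on error.
 */
static int demo_queue_flip(struct demo_fence *f, int fail_early)
{
    struct demo_flip_work work;
    int ret = 0;

    memset(&work, 0, sizeof(work)); /* work.fence starts out NULL */

    if (fail_early) {
        ret = -1;
        goto cleanup;               /* the get below has not run yet */
    }

    work.fence = demo_fence_get(f);
    /* ... the flip itself would be set up here ... */

cleanup:
    demo_fence_put(work.fence);     /* NULL on the early path: no-op */
    return ret;
}
```

Because the work item is zeroed, the early error path calls the put helper with NULL and the caller's reference count is unchanged, so under this assumption the ordering of the get relative to the "goto cleanup" statements does not unbalance the count.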

