Bug 198239

Summary: [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Product: Drivers
Component: PCI
Status: RESOLVED DUPLICATE
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.11 and newer
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Andreas Brogle (anbro)
Assignee: drivers_pci (drivers_pci)
CC: bjorn, Ivan.iraci

Description Andreas Brogle 2017-12-23 16:40:50 UTC
Hello,

This bug was originally reported at https://bugzilla.kernel.org/show_bug.cgi?id=196197
Because it appears to be caused by the PCI subsystem, it has been refiled here.


The bug was bisected exactly:
git rev-list v4.11-rc1
...
60e8d3e11645a1b9c4197d9786df3894332c1685 (BAD)
190c3ee06a0f0660839785b7ad8a830e832d9481 (GOOD)
...


Description:

On every boot the system hangs while trying to start Xorg, retrying every 5 seconds in an endless loop. The keyboard is locked as well.
The only way to get a console is via ssh from another machine.

The bug is always reproducible and can always be avoided by downgrading to kernel 4.10 while leaving the rest of the system unchanged.
It was introduced between kernel 4.10 and kernel 4.11.

Hardware:
AMD Opteron board with the graphics card in a PCIe x8 slot.
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV630 XT [Radeon HD 2600 XT]

Relevant sections of kern.log:
Jun  4 09:19:41 a1 kernel: [   46.719247] radeon 0000:06:00.0: ring 0 stalled for more than 10093msec
Jun  4 09:19:41 a1 kernel: [   46.719253] radeon 0000:06:00.0: GPU lockup (current fence id 0x000000000000000e last fence id 0x0000000000000012 on ring 0)
Jun  4 09:19:41 a1 kernel: [   46.729651] radeon 0000:06:00.0: Saved 121 dwords of commands on ring 0.
Jun  4 09:19:41 a1 kernel: [   46.729666] radeon 0000:06:00.0: GPU softreset: 0x00000019
Jun  4 09:19:41 a1 kernel: [   46.729669] radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xE57C24E0
Jun  4 09:19:41 a1 kernel: [   46.729672] radeon 0000:06:00.0:   R_008014_GRBM_STATUS2     = 0x00113303
Jun  4 09:19:41 a1 kernel: [   46.729674] radeon 0000:06:00.0:   R_000E50_SRBM_STATUS      = 0x200030C0
Jun  4 09:19:41 a1 kernel: [   46.729676] radeon 0000:06:00.0:   R_008674_CP_STALLED_STAT1 = 0x01000000
Jun  4 09:19:41 a1 kernel: [   46.729678] radeon 0000:06:00.0:   R_008678_CP_STALLED_STAT2 = 0x00001002
Jun  4 09:19:41 a1 kernel: [   46.729680] radeon 0000:06:00.0:   R_00867C_CP_BUSY_STAT     = 0x00028C86
Jun  4 09:19:41 a1 kernel: [   46.729683] radeon 0000:06:00.0:   R_008680_CP_STAT          = 0x808386C5
Jun  4 09:19:41 a1 kernel: [   46.729685] radeon 0000:06:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun  4 09:19:41 a1 kernel: [   46.965016] radeon 0000:06:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
Jun  4 09:19:41 a1 kernel: [   46.965070] radeon 0000:06:00.0: SRBM_SOFT_RESET=0x00000100
Jun  4 09:19:41 a1 kernel: [   46.967176] radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
Jun  4 09:19:41 a1 kernel: [   46.967178] radeon 0000:06:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
Jun  4 09:19:41 a1 kernel: [   46.967181] radeon 0000:06:00.0:   R_000E50_SRBM_STATUS      = 0x2000B0C0
Jun  4 09:19:41 a1 kernel: [   46.967183] radeon 0000:06:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Jun  4 09:19:41 a1 kernel: [   46.967185] radeon 0000:06:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Jun  4 09:19:41 a1 kernel: [   46.967187] radeon 0000:06:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Jun  4 09:19:41 a1 kernel: [   46.967189] radeon 0000:06:00.0:   R_008680_CP_STAT          = 0x80100000
Jun  4 09:19:41 a1 kernel: [   46.967191] radeon 0000:06:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Jun  4 09:19:41 a1 kernel: [   46.967200] radeon 0000:06:00.0: GPU reset succeeded, trying to resume
Jun  4 09:19:41 a1 kernel: [   47.139539] [drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
Jun  4 09:19:41 a1 kernel: [   47.139556] radeon 0000:06:00.0: WB enabled
Jun  4 09:19:41 a1 kernel: [   47.139559] radeon 0000:06:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00 and cpu addr 0xffff88033fc2cc00
Jun  4 09:19:41 a1 kernel: [   47.140372] radeon 0000:06:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0 and cpu addr 0xffffc90004a121d0
Jun  4 09:19:42 a1 kernel: [   47.343830] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Jun  4 09:19:42 a1 kernel: [   47.343855] [drm:r600_resume [radeon]] *ERROR* r600 startup failed on resume
Jun  4 09:19:52 a1 kernel: [   57.358140] radeon 0000:06:00.0: ring 0 stalled for more than 10186msec
Jun  4 09:19:52 a1 kernel: [   57.358148] radeon 0000:06:00.0: GPU lockup (current fence id 0x000000000000000e last fence id 0x0000000000000012 on ring 0)
Jun  4 09:19:52 a1 kernel: [   57.367850] radeon 0000:06:00.0: Saved 261817 dwords of commands on ring 0.
Jun  4 09:19:52 a1 kernel: [   57.367866] radeon 0000:06:00.0: GPU softreset: 0x00000008
Jun  4 09:19:52 a1 kernel: [   57.367869] radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
Jun  4 09:19:52 a1 kernel: [   57.367871] radeon 0000:06:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
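
For anyone comparing the register dumps before and after the soft reset, here is a minimal Python sketch that groups the R_xxxxxx_* lines into consecutive dumps and lists the values that changed. The function names parse_register_dumps and changed_registers are made up for illustration and are not part of any radeon tooling. The sample lines are copied from the log above. The sketch assumes a new dump begins whenever a register name repeats.

```python
import re

# Register lines in the radeon dump look like:
#   "radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xE57C24E0"
REG_RE = re.compile(r"R_[0-9A-F]{6}_(\w+)\s*=\s*0x([0-9A-Fa-f]+)")


def parse_register_dumps(lines):
    """Group register values into consecutive dumps (hypothetical helper).

    Assumption: a new dump starts whenever a register name repeats.
    """
    dumps, current = [], {}
    for line in lines:
        m = REG_RE.search(line)
        if not m:
            continue
        name, value = m.group(1), int(m.group(2), 16)
        if name.endswith("SOFT_RESET"):
            continue  # reset masks written by the driver, not status registers
        if name in current:
            dumps.append(current)
            current = {}
        current[name] = value
    if current:
        dumps.append(current)
    return dumps


def changed_registers(before, after):
    """Return {name: (old, new)} for registers whose value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# A few lines taken from the kern.log excerpt above.
sample = [
    "radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xE57C24E0",
    "radeon 0000:06:00.0:   R_008680_CP_STAT          = 0x808386C5",
    "radeon 0000:06:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
    "radeon 0000:06:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF",
    "radeon 0000:06:00.0:   R_008010_GRBM_STATUS      = 0xA0003030",
    "radeon 0000:06:00.0:   R_008680_CP_STAT          = 0x80100000",
    "radeon 0000:06:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
]

dumps = parse_register_dumps(sample)
changed = changed_registers(dumps[0], dumps[1])
for name, (old, new) in changed.items():
    print(f"{name}: {old:#010x} -> {new:#010x}")
```

To run it on a full log, pass the lines of kern.log instead of the sample list.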

Greetings,
Andreas
Comment 1 Ivan Iraci 2018-01-18 08:04:22 UTC
I've been experiencing the same problem with more or less the same setup.
I bought a new mainboard because the frequent reboots exposed some hardware problems that otherwise showed up less frequently.

Anyway, the new mainboard has an [AMD/ATI] RS780L [Radeon 3000] and suffered from this bug as well.

Suddenly I realized that the reboots (or just the kernel errors) happened during the fb splash loading.

I uninstalled splashutils and disabled fb_con_decor in my kernel config, and all has been well since then. Uninstalling splashutils alone solved the issue for me, but I wanted to be 200% sure I would never have it again, so I also disabled fb_con_decor.

I don't know if the others who are experiencing the same issue are using splashutils or something similar, but I think that what happened to me could be of some use to someone else.
Comment 2 Bjorn Helgaas 2018-01-18 18:44:39 UTC
Marking as duplicate of 196197 (the original report).  We can continue this there.

*** This bug has been marked as a duplicate of bug 196197 ***