Bug 204683

Summary: amdgpu: ring sdma0 timeout
Product: Drivers
Reporter: Matthias Heinz (mh)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED DUPLICATE
Severity: normal
CC: alexdeucher, jordan, soeren.grunewald
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.3.0-rc5
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Kernel trace
Kernel trace
Kernel trace

Description Matthias Heinz 2019-08-24 09:28:31 UTC
Hi,

When playing some games, I randomly (sometimes after 5 minutes, sometimes after 2 hours) get a blank screen. Sometimes audio still works; sometimes the whole system locks up. I've seen this with Rise of the Tomb Raider and 7 Days to Die so far.

I finally managed to sync the log files to disk and capture an error before the whole thing locked up:

Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=368056, emitted seq=368057
Aug 24 11:13:33 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 8108 thread 7DaysToDie:cs0
Aug 24 11:13:33 egalite kernel: amdgpu 0000:0c:00.0: GPU reset begin!
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Only a hard reset made me recover from that.
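
For anyone comparing traces, the ring name and the two fence sequence numbers can be pulled out of the timeout line above with a small script. This is only a sketch: the regular expression is an assumption derived from the single log format shown in this report, and the helper name parse_ring_timeout is hypothetical.

```python
import re

# Pattern derived from the "ring ... timeout" line in this report (an assumption,
# other kernel versions may format the message differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted) for a ring timeout line, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return (
        match.group("ring"),
        int(match.group("signaled")),
        int(match.group("emitted")),
    )


line = (
    "Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring sdma0 timeout, signaled seq=368056, emitted seq=368057"
)
ring, signaled, emitted = parse_ring_timeout(line)
# Number of emitted jobs that never signaled before the timeout fired.
print(ring, emitted - signaled)
```

In the trace above the difference is 1, i.e. a single emitted job on sdma0 had not signaled when the timeout fired. Lines without sequence numbers, such as the "soft recovered" message, simply return None.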


This is with a self-built kernel 5.3.0-rc5. Also happens with 5.2.1.
Mesa: 19.1.4-1
GPU: Vega 56

Best
Matthias
Comment 1 Matthias Heinz 2019-08-25 08:44:22 UTC
So I tried the oldest installed kernel (a 4.19) I could find and didn't have any problems with it. This seems to be a regression.
Comment 2 Alex Deucher 2019-08-26 15:25:25 UTC
Can you bisect?
Comment 3 Matthias Heinz 2019-08-26 16:25:19 UTC
Already on it. It seems to be somewhere between 5.0.2 and 5.1.21.

This will take a while. It can take some hours to trigger it...
Comment 4 Matthias Heinz 2019-08-28 19:40:48 UTC
I still have 11 steps to go (as I mentioned, it's a pretty lengthy task), but I got some more debug output before the system stopped working. Please see the attached file; maybe it has some clues about what's going wrong.
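
To put the remaining steps in perspective: bisection halves the candidate range with every tested kernel, so the step count grows with the base-2 logarithm of the number of commits. The sketch below is a rough estimate under that assumption only; git's own step count can differ slightly on non-linear history, and both function names are hypothetical.

```python
def bisect_steps(num_commits):
    """Approximate number of kernels to test to isolate one of num_commits commits."""
    if num_commits <= 1:
        return 0
    # ceil(log2(n)) computed with integers to avoid floating-point rounding.
    return (num_commits - 1).bit_length()


def commits_for_steps(steps):
    """Largest candidate range that can be resolved in the given number of steps."""
    return 2 ** steps


print(commits_for_steps(11))  # 2048
print(bisect_steps(2048))     # 11
```

So 11 remaining steps correspond to a range on the order of two thousand commits, and each step can need up to a couple of hours of play time before the timeout shows up.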
Comment 5 Matthias Heinz 2019-08-28 19:41:40 UTC
Created attachment 284667 [details]
Kernel trace
Comment 6 Matthias Heinz 2019-09-05 12:08:36 UTC
I had to switch to drm-next to do further bisecting and I think 634092b1b9f67bea23a87b77880df5e8012a411a is causing the problem.

I might be wrong though.
Comment 7 Soeren Grunewald 2019-09-06 20:22:29 UTC
Created attachment 284869 [details]
Kernel trace

It seems I have the same issue. I'm running Fedora 30 with testing-updates enabled. The GPU is a Sapphire Pulse RX 56.
Comment 8 Matthias Heinz 2019-09-07 13:44:56 UTC
I was wrong. 2c3cd66f4c66, which is the predecessor of 634092b1b9f6, just crashed on me. Well, back to the drawing board... (eh, bisecting)
Comment 9 Matthias Heinz 2019-09-12 12:51:13 UTC
A small update.

I managed to go down even further. I'm currently at e6d2421343a7 in drm-next and I see the following error:

Sep 12 14:32:44 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1023042, emitted seq=1023043
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 6696 thread 7DaysToDie:cs0 pid 6698

Now it looks a lot like #201957, but I have no problems with kernel 5.0 and earlier. It started with 5.1. So I'm not sure how similar it is.


I have one last idea of what to do. The commit before e6d2421343a7 results in a similar problem, but the display doesn't go blank and into standby. Only the picture freezes and that's it.
I will try to find the commit that introduces this freeze and then see whether the kernel built from the commit before it still has my main problem. If not, I'll post the range; the error itself is probably hidden somewhere in between. Otherwise testing is not possible, since it freezes pretty fast and the ring timeout bug takes up to 45 minutes to appear.

Since I'm on my 40th kernel so far, any help or even a hint is highly appreciated. (I could use a faster testing solution.)
Comment 10 Matthias Heinz 2019-09-12 22:18:45 UTC
Created attachment 284945 [details]
Kernel trace

Update 2 for today.

With de00d253bc85, which is the predecessor of e6d2421343a7, I get this kernel bug.

I have never seen this one after de00d253bc85, so my guess is that e6d2421343a7 fixes it partially.

I will now start a bisect between the last known good kernel and de00d253bc85 and try to figure out when this one was introduced. (Back to kernel baking hell...)
Comment 11 Matthias Heinz 2019-10-05 09:25:43 UTC
The bug is still present in linux 5.3.

Also, I'm not done bisecting yet. But the older kernels seem to have nasty fs bugs and I'm not sure if I'm really willing to put my data on the line for this. Is there really no other way to figure out where this originates from?
Comment 12 Matthias Heinz 2019-10-11 15:02:42 UTC
My last update, because I have no way to go forward from here on.

This bug seems to go back much further than I initially thought. I'm currently at "drm-fixes-2018-08-31" in linux-drm and it's already in there, so it's probably pretty old.

I can't use any older kernel, because I need Steam to run the games to test this. But Steam won't work with anything older than 4.19.

BUT I found a game that almost instantly triggers this bug on startup: Insurgency.

Just start it and if that doesn't trigger it immediately, quit the game and start it again. It can take two or three tries, and joining a match helps too, but each test takes less than 5 minutes.

So, please go ahead and fix this already, it's annoying.
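
With a reproducer this quick, the verdict for each bisect step could be read from the saved kernel log instead of judged by eye. The following is a minimal sketch: the marker strings are taken from the logs quoted in this report and are an assumption for any other setup, and the helper name bisect_verdict is hypothetical.

```python
# Error signatures copied from the kernel logs in this report.
BAD_MARKERS = (
    "ring sdma0 timeout",
    "ring gfx timeout",
    "flip_done timed out",
)


def bisect_verdict(log_text):
    """Return "bad" if the log shows one of the known signatures, else "good"."""
    if any(marker in log_text for marker in BAD_MARKERS):
        return "bad"
    return "good"


sample = (
    "Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=1023042, emitted seq=1023043"
)
print(bisect_verdict(sample))
```

The returned word can be passed to git bisect good or git bisect bad after rebooting. A "good" result only means that no signature was logged, so a kernel should be marked good only after several game launches without a hit.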
Comment 13 Matthias Heinz 2019-10-14 17:18:59 UTC

*** This bug has been marked as a duplicate of bug 201957 ***