When playing some games I randomly (sometimes after 5 minutes, sometimes after 2 hours) get a blank screen. Sometimes audio still works, sometimes the whole system locks up. I've seen this with Rise of the Tomb Raider and 7 Days to Die so far.
I finally managed to sync the log files to disk and capture an error before the whole thing locked up:
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=368056, emitted seq=368057
Aug 24 11:13:33 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 8108 thread 7DaysToDie:cs0
Aug 24 11:13:33 egalite kernel: amdgpu 0000:0c:00.0: GPU reset begin!
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Only a hard reset made me recover from that.
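In case it helps anyone comparing their own logs against the lines above, here is a rough Python sketch that pulls the ring name and the two sequence numbers out of the amdgpu_job_timedout messages. The regular expression is only based on the wording quoted in this report, and the function name parse_ring_timeouts is made up for illustration; other kernel versions may format the message differently.

```python
import re

# Pattern derived from the log lines quoted in this report (an assumption:
# other kernel versions may word the message differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return (ring, signaled_seq, emitted_seq) for each matching log line.

    Lines without sequence numbers, such as
    "ring gfx timeout, but soft recovered", are skipped.
    """
    results = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            results.append(
                (match["ring"], int(match["signaled"]), int(match["emitted"]))
            )
    return results


if __name__ == "__main__":
    sample = [
        "Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring sdma0 timeout, signaled seq=368056, emitted seq=368057",
        "Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring gfx timeout, but soft recovered",
    ]
    for ring, signaled, emitted in parse_ring_timeouts(sample):
        print(ring, signaled, emitted)
```

Feeding it the kernel log line by line makes it easier to see which ring (sdma0 or gfx) timed out on each run.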
This is with a self-built kernel 5.3.0-rc5. Also happens with 5.2.1.
GPU: Vega 56
So I tried the oldest installed kernel (a 4.19) I could find and didn't have any problems with it. This seems to be a regression.
Can you bisect?
Already on it. It seems to be somewhere between 5.0.2 and 5.1.21.
This will take a while. Can take some hours to trigger it...
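To give an idea of why this takes so long, here is a minimal Python sketch of the idea behind bisecting, assuming a flat, ordered list of commits and a test that reliably says good or bad. Both assumptions are simplifications: git bisect works on the commit graph rather than a list, and with a bug that can take hours to show up, a "good" verdict is only as trustworthy as the time spent testing. The names bisect_steps and first_bad are illustrative only.

```python
def bisect_steps(num_commits):
    """Upper bound on builds needed to isolate one commit: ceil(log2(n))."""
    return (num_commits - 1).bit_length() if num_commits > 1 else 0


def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    commits is ordered oldest to newest and the last entry is known bad;
    a known-good commit is assumed to sit just before commits[0].
    Returns (first_bad_commit, number_of_commits_tested).
    """
    lo, hi = 0, len(commits) - 1
    tested = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or something older
        else:
            lo = mid + 1    # the culprit is newer than mid
    return commits[lo], tested


if __name__ == "__main__":
    # Stand-in data: 100 numbered "commits", bug introduced at number 37.
    commits = list(range(100))
    culprit, tested = first_bad(commits, lambda c: c >= 37)
    print(culprit, tested, bisect_steps(len(commits)))
```

Each step halves the remaining range, but every step is a full kernel build plus however long it takes to be reasonably sure the bug does not appear.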
I still have 11 steps to go (as I mentioned it's a pretty lengthy task), but I got some more debug output, before the system stopped working. Please see the attached file, maybe it has some clues what's going wrong.
Created attachment 284667
I had to switch to drm-next to do further bisecting and I think 634092b1b9f67bea23a87b77880df5e8012a411a is causing the problem.
I might be wrong though.
Created attachment 284869
It seems I have the same issue. I'm running Fedora 30 with testing-updates enabled. The GPU is a Sapphire Pulse RX 56.
I was wrong. 2c3cd66f4c66, which is the predecessor of 634092b1b9f6, just crashed on me. Well, back to the drawing board... (eh, bisecting)
A small update.
I managed to go down even further. I'm currently at e6d2421343a7 in drm-next and I see the following error:
Sep 12 14:32:44 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1023042, emitted seq=1023043
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 6696 thread 7DaysToDie:cs0 pid 6698
Now it looks a lot like bug 201957, but I have no problems with kernels up to and including 5.0. It started with 5.1, so I'm not sure how similar it is.
I have one last idea for what to do. The commit before e6d2421343a7 results in a similar problem, but the display doesn't go blank and into standby. The picture just freezes and that's it.
I will try to find the commit that introduced this freeze and then check whether the kernel at the commit before that one still has my main problem. If not, I'll post the range; the actual error is probably hidden somewhere in between. Otherwise testing is not possible, since the freeze hits pretty fast while the ring timeout bug takes up to 45 minutes to appear.
Since I'm at my 40th kernel so far, any help or even a hint is highly appreciated. (I could use a faster testing solution.)
Created attachment 284945
Update 2 for today.
With de00d253bc85, which is the predecessor of e6d2421343a7, I get this kernel bug.
I have never seen this one after de00d253bc85, so my guess is that e6d2421343a7 fixes it partially.
I will now start a bisect between the last known good kernel and de00d253bc85 and try to figure out when this one was introduced. (Back to kernel baking hell...)
The bug is still present in Linux 5.3.
Also, I'm not done bisecting yet. But the older kernels seem to have nasty fs bugs, and I'm not sure I'm really willing to put my data on the line for this. Is there really no other way to figure out where this originates from?
This is my last update, because I have no way forward from here.
This bug seems to go back much further than I initially thought. I'm currently at "drm-fixes-2018-08-31" in linux-drm and it's already in there, so it's probably pretty old.
I can't use any older kernel, because I need Steam to run the games to test this, and Steam won't work with anything older than 4.19.
BUT I found a game that almost instantly triggers this bug on startup: Insurgency.
Just start it, and if that doesn't trigger it immediately, quit the game and start it again. It can take two to three tries (joining a match helps, too), but each test takes less than 5 minutes.
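For anyone who wants to script the start/quit cycle described here, this is a rough Python sketch of the loop. The three callables (start_game, timeout_seen, quit_game) are hypothetical hooks that the tester would have to supply; for example, timeout_seen could check the output of journalctl -k for the "ring ... timeout" message. Nothing here is specific to Insurgency or Steam, and if the whole system locks up the script obviously won't get to report anything.

```python
def reproduce(start_game, timeout_seen, quit_game, max_attempts=3):
    """Start and quit the game up to max_attempts times.

    All three callables are hypothetical hooks provided by the tester:
      start_game()   launches the game and waits until it is running
      timeout_seen() returns True if the kernel log shows a ring timeout
      quit_game()    shuts the game down again

    Returns the attempt number on which the timeout was seen, or None
    if it never showed up within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        start_game()
        hit = timeout_seen()
        quit_game()
        if hit:
            return attempt
    return None


if __name__ == "__main__":
    # Dry run with fake hooks: the "bug" shows up on the second start.
    outcomes = iter([False, True, False])
    result = reproduce(
        start_game=lambda: None,
        timeout_seen=lambda: next(outcomes),
        quit_game=lambda: None,
    )
    print(result)
```

A None result after three attempts is not proof that a kernel is good, only that the bug did not show up in that many tries.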
So, please go ahead and fix this already, it's annoying.
*** This bug has been marked as a duplicate of bug 201957 ***