Bug 204683
| Summary: | amdgpu: ring sdma0 timeout | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Matthias Heinz (mh) |
| Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | RESOLVED DUPLICATE | | |
| Severity: | normal | CC: | alexdeucher, jordan, soeren.grunewald |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 5.3.0-rc5 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Kernel trace, Kernel trace, Kernel trace | | |
Description
Matthias Heinz
2019-08-24 09:28:31 UTC
So I tried the oldest installed kernel (a 4.19) I could find and didn't have any problems with it. This seems to be a regression.

Can you bisect?

Already on it. It seems to be somewhere between 5.0.2 and 5.1.21. This will take a while. It can take some hours to trigger it...

I still have 11 steps to go (as I mentioned, it's a pretty lengthy task), but I got some more debug output before the system stopped working. Please see the attached file; maybe it has some clues about what's going wrong.

Created attachment 284667 [details]
Kernel trace
I had to switch to drm-next to do further bisecting and I think 634092b1b9f67bea23a87b77880df5e8012a411a is causing the problem. I might be wrong though.

Created attachment 284869 [details]
Kernel trace
It seems I have the same issue. I run Fedora 30 with testing-updates enabled. The GPU is a Sapphire Pulse RX 56.
I was wrong. 2c3cd66f4c66, which is the predecessor of 634092b1b9f6, just crashed on me. Well, back to the drawing board... (eh, bisecting)

A small update. I managed to go down even further. I'm currently at e6d2421343a7 in drm-next and I see the following error:

Sep 12 14:32:44 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1023042, emitted seq=1023043
Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 6696 thread 7DaysToDie:cs0 pid 6698

Now it looks a lot like #201957, but I have no problems with kernels up to and including 5.0. It started with 5.1, so I'm not sure how similar it is.

I have one last idea what to do. The commit before e6d2421343a7 results in a similar problem, but the display doesn't go blank and into standby. Only the picture freezes and that's it. I will try to find the commit that introduces this bug and then see whether the kernel at the commit before that one still has my main problem. If not, I'll post the range; the error itself is probably hidden somewhere in between. Otherwise testing is not possible, since it freezes pretty fast and the ring timeout bug takes up to 45 minutes to appear.

Since I'm at my 40th kernel so far, any help or even a hint is highly appreciated. (I could use a faster testing solution.)

Created attachment 284945 [details]
Kernel trace
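For anyone who wants to scan long kernel logs for this signature, here is a minimal Python sketch. It assumes the message keeps exactly the wording of the lines quoted above (other kernel versions may phrase it differently), and the name `find_ring_timeouts` is made up for illustration.

```python
import re

# Pattern modelled on the amdgpu_job_timedout line quoted in this report.
# Assumption: the wording "ring <name> timeout, signaled seq=..., emitted seq=..."
# is stable; adjust the regex if your kernel prints something else.
RING_TIMEOUT = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring-timeout line found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            hits.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
            })
    return hits


sample = (
    "Sep 12 14:32:44 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx timeout, signaled seq=1023042, emitted seq=1023043"
)
for hit in find_ring_timeouts(sample):
    # Difference between the last emitted and last signaled sequence numbers.
    print(hit["ring"], hit["emitted"] - hit["signaled"])  # prints "gfx 1"
```

The same function should also pick up the `sdma0` variant from the bug summary, since the ring name is matched generically.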
Update 2 for today.
With de00d253bc85, which is the predecessor of e6d2421343a7, I get this kernel bug.
I have never seen this one after de00d253bc85, so my guess is that e6d2421343a7 fixes it partially.
I will now start a bisect between the last known good kernel and de00d253bc85 and try to figure out when this one was introduced. (Back to kernel baking hell...)
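As a rough way to budget a bisect like this one, here is a small Python sketch. It assumes each step halves the candidate range, which is only approximately what `git bisect` does on a history with merges. The helper names are hypothetical, and the 11 steps and 45 minutes are the figures mentioned earlier in this thread. Build and reboot time is not counted.

```python
def bisect_steps(candidate_commits):
    """Approximate number of kernels to test for this many candidate commits.

    Assumption: every test halves the range, i.e. ceil(log2(n)) steps.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (candidate_commits - 1).bit_length()


def worst_case_hours(steps, minutes_per_test):
    """Time spent waiting for the bug if every step needs the full wait."""
    return steps * minutes_per_test / 60


print(bisect_steps(2048))        # 11
print(worst_case_hours(11, 45))  # 8.25
```

So 11 remaining steps correspond to a range on the order of 2048 commits, and at up to 45 minutes per run the waiting alone can add up to roughly a working day.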
The bug is still present in Linux 5.3. Also, I'm not done bisecting yet. But the older kernels seem to have nasty fs bugs and I'm not sure if I'm really willing to put my data on the line for this. Is there really no other way to figure out where this originates from?

My last update, because I have no way to go forward from here. This bug seems to go back much further than I initially thought. I'm currently at "drm-fixes-2018-08-31" in linux-drm and it's already in there, so it's probably pretty old. I can't use any older kernel, because I need Steam to run the games to test this, and Steam won't work with anything older than 4.19.

BUT I found a game that almost instantly triggers this bug on startup: Insurgency. Just start it, and if that doesn't trigger it immediately, quit the game and start it again. It can take two to three tries (joining a match helps, too), but it takes less than 5 minutes for each test.

So, please go ahead and fix this already, it's annoying.

*** This bug has been marked as a duplicate of bug 201957 ***