Created attachment 303084 [details]
dmesg since problems began

Ever since I upgraded from Fedora 34 to Fedora 35 I've gotten random GPU lockups. This machine has otherwise been stable for years.

I don't really know what triggers the issue. I *think* it happens in some cases when I try to play a video in Firefox, but I'm not completely sure.

Reported here, but Fedora generally doesn't give any attention to GPU driver issues:

https://bugzilla.redhat.com/show_bug.cgi?id=2131923

Last working system:

kernel-5.13.8-100.fc33.x86_64
libglvnd-1:1.3.3-1.fc34.x86_64
mesa-libGL-21.1.8-3.fc34.x86_64
libdrm-2.4.109-1.fc34.x86_64
xorg-x11-server-Xorg-1.20.14-3.fc34.x86_64

First broken system:

kernel-5.19.8-100.fc35.x86_64
libglvnd-1:1.3.4-2.fc35.x86_64
mesa-libGL-21.3.9-1.fc35.x86_64
libdrm-2.4.110-1.fc35.x86_64
xorg-x11-server-Xorg-1.20.14-7.fc35.x86_64

Attached are all kernel logs since the issue started happening. They also include a fresh boot from the last good kernel, and a good run with the new kernel. I think that first run with the new kernel was just a fluke, though. The only package upgraded after the system upgrade and before the lockups started is annobin.
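To narrow down the candidates for a regression, the two package sets above can be compared mechanically. The following is a minimal sketch, not part of the original report: `parse_nvra` and `changed_versions` are hypothetical helper names, and the parser assumes the plain `name-[epoch:]version-release.arch` layout used in the lists above. The package strings are copied from the report as written.

```python
from typing import NamedTuple


class Package(NamedTuple):
    name: str
    version: str  # may carry an epoch prefix such as "1:"
    release: str
    arch: str


def parse_nvra(text: str) -> Package:
    """Split 'name-[epoch:]version-release.arch', working from the right."""
    rest, arch = text.rsplit(".", 1)
    name, version, release = rest.rsplit("-", 2)
    return Package(name, version, release, arch)


def changed_versions(good, bad):
    """Return {name: (old_version, new_version)} for packages whose
    upstream version differs; release-only rebuilds are ignored."""
    good_by_name = {p.name: p for p in map(parse_nvra, good)}
    changes = {}
    for pkg in map(parse_nvra, bad):
        old = good_by_name.get(pkg.name)
        if old is not None and old.version != pkg.version:
            changes[pkg.name] = (old.version, pkg.version)
    return changes


# Package strings as listed in the report above.
LAST_WORKING = [
    "kernel-5.13.8-100.fc33.x86_64",
    "libglvnd-1:1.3.3-1.fc34.x86_64",
    "mesa-libGL-21.1.8-3.fc34.x86_64",
    "libdrm-2.4.109-1.fc34.x86_64",
    "xorg-x11-server-Xorg-1.20.14-3.fc34.x86_64",
]
FIRST_BROKEN = [
    "kernel-5.19.8-100.fc35.x86_64",
    "libglvnd-1:1.3.4-2.fc35.x86_64",
    "mesa-libGL-21.3.9-1.fc35.x86_64",
    "libdrm-2.4.110-1.fc35.x86_64",
    "xorg-x11-server-Xorg-1.20.14-7.fc35.x86_64",
]

if __name__ == "__main__":
    for name, (old, new) in changed_versions(LAST_WORKING, FIRST_BROKEN).items():
        print(f"{name}: {old} -> {new}")
```

With these inputs, the X server drops out of the result because only its release field changed, which leaves the kernel, libglvnd, mesa and libdrm as the components with a new upstream version.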
Any chance you could bisect? There have been very few changes to the radeon kernel driver over the last few years. It could also be a mesa regression. Does upgrading or downgrading mesa help?
A bisect will be difficult, given that I can't reproduce it. :/

Any clues from the dmesg that could tell how to provoke it? Or some settings that could provide more information?

I can try a few versions and see if I'm able to narrow it down somewhat. It's difficult to know when to assume it's a good version, as in some cases it has gone weeks without a lockup...
This is wrong, I checked the wrong lines in dnf's history:

> Last working system:
>
> kernel-5.13.8-100.fc33.x86_64

The last working kernel is actually 5.17.12-100.fc34.x86_64. So if it's the kernel, it's likely 5.18 or 5.19 that regressed. I'll give 5.18.1 a spin.
I just got a GPU lockup on 5.18.4. So it's either not the kernel, or a bug that appeared in the 5.18 series. I'll go back to the known good kernel now and see if I can get the bug there. One thought though, even if it is mesa that happens to issue a bad sequence of commands, shouldn't the kernel driver be able to reset the GPU? It certainly indicates that it is trying.
The lockup happens on 5.17.2 as well, so it seems the kernel is not the most likely suspect. I'll see if I can try an older mesa next.

Could the issue be with the firmware? Has that changed recently for these devices?

The last good firmware should be:

linux-firmware-20220509-132.fc34.noarch

And the first bad firmware should be:

linux-firmware-20220708-136.fc35.noarch
(In reply to Pierre Ossman from comment #5)
> Could the issue be with the firmware? Has that changed recently for these
> devices?
>
> The last good firmware should be:
>
> linux-firmware-20220509-132.fc34.noarch
>
> And the first bad firmware should be:
>
> linux-firmware-20220708-136.fc35.noarch

Not likely. The firmware for this chip has not changed in years.
Sorry, I haven't had time to look at downgrading Mesa yet. But FYI, it does still happen with mesa 22.1.7 and kernel 6.0.10. I am now almost 100% certain that it is videos that are triggering this. And possibly not all videos. So I'm thinking, perhaps the video acceleration? Is that also handled by mesa, or some other component?
(In reply to Pierre Ossman from comment #7)
> Is that also handled by mesa, or some other component?

Yes, mesa handles video APIs (VAAPI, OpenMAX, VDPAU) as well as 3D (OpenGL, Vulkan).
FYI, it seems to have gotten worse since upgrading from kernel-6.1.8-100.fc36.x86_64 to kernel-6.1.13-100.fc36.x86_64.

It now hangs more arbitrarily, not just when trying to play a video. Having done a suspend/resume cycle is still a requirement though.

I'm struggling to build the old version of mesa that still worked. It isn't very compatible with newer LLVM, and there is something wrong with Fedora's packaging of LLVM 12 (which seems to be the matching LLVM version for that old mesa). I'll need some more effort to get that test up and running.
I finally got that old version of mesa to build. Unfortunately, the hangs still happen even with that. :/

> Mar 09 07:18:30 kernel: radeon 0000:00:01.0: ring 3 stalled for more than
> 10028msec
> Mar 09 07:18:30 kernel: radeon 0000:00:01.0: GPU lockup (current fence id
> 0x000000000000fa91 last fence id 0x000000000000fabc on ring 3)
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 5 stalled for more than
> 10077msec
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: GPU lockup (current fence id
> 0x00000000000018fb last fence id 0x00000000000018fe on ring 5)
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 0 stalled for more than
> 10202msec
> ...

What can we do next to pinpoint this? It seems to fail rather reliably after a suspend/resume. Is there some test suite I can run to provoke things?
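When comparing many lockups across a long dmesg, it can help to summarize which rings stall and how far behind each one is. The sketch below is illustrative only and not from the thread: `summarize` is a hypothetical helper, the regular expressions are written against the message wording quoted above, and the sample lines are the quoted ones joined back onto single lines. Reading the difference between "last fence id" and "current fence id" as the number of outstanding fences is an interpretation, not something stated in the thread.

```python
import re
from collections import defaultdict

# Patterns modelled on the radeon messages quoted above.
STALL_RE = re.compile(r"ring (\d+) stalled for more than\s+(\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id\s+(0x[0-9a-fA-F]+) "
    r"last fence id\s+(0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize(lines):
    """Per ring: longest reported stall (ms) and the fence-id gap
    (last - current) from the most recent lockup line, or None
    if no lockup line was seen for that ring."""
    rings = defaultdict(lambda: {"max_stall_ms": 0, "outstanding": None})
    for line in lines:
        stall = STALL_RE.search(line)
        if stall:
            ring = int(stall.group(1))
            rings[ring]["max_stall_ms"] = max(
                rings[ring]["max_stall_ms"], int(stall.group(2))
            )
            continue
        lockup = LOCKUP_RE.search(line)
        if lockup:
            current = int(lockup.group(1), 16)
            last = int(lockup.group(2), 16)
            rings[int(lockup.group(3))]["outstanding"] = last - current
    return dict(rings)


# The quoted log lines, with the mail-style wrapping undone.
SAMPLE = [
    "Mar 09 07:18:30 kernel: radeon 0000:00:01.0: ring 3 stalled for more than 10028msec",
    "Mar 09 07:18:30 kernel: radeon 0000:00:01.0: GPU lockup (current fence id "
    "0x000000000000fa91 last fence id 0x000000000000fabc on ring 3)",
    "Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 5 stalled for more than 10077msec",
    "Mar 09 07:18:31 kernel: radeon 0000:00:01.0: GPU lockup (current fence id "
    "0x00000000000018fb last fence id 0x00000000000018fe on ring 5)",
    "Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10202msec",
]

if __name__ == "__main__":
    for ring, info in sorted(summarize(SAMPLE).items()):
        print(ring, info)
```

On a real system the input would come from the journal or dmesg output, one message per line. Which hardware block each ring number corresponds to is not stated in the thread, so the sketch reports ring numbers only.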
(In reply to Pierre Ossman from comment #9)
> It now hangs more arbitrarily, not just when trying to play a video. Having
> done a suspend/resume cycle is still a requirement though.

I tried disabling video acceleration, and the hangs are now gone. So it does seem to be the culprit after all.

Could this help you pinpoint things somehow?