Bug 216625 - [regression] GPU lockup on Radeon R7 Kaveri
Summary: [regression] GPU lockup on Radeon R7 Kaveri
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-25 15:52 UTC by Pierre Ossman
Modified: 2023-03-24 17:52 UTC
CC List: 1 user

See Also:
Kernel Version: 5.19.16-100.fc35.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg since problems began (1.37 MB, text/plain)
2022-10-25 15:52 UTC, Pierre Ossman

Description Pierre Ossman 2022-10-25 15:52:04 UTC
Created attachment 303084
dmesg since problems began

Ever since I upgraded from Fedora 34 to Fedora 35 I've gotten random GPU lockups. This machine has otherwise been stable for years.

I don't really know what triggers the issue. I *think* it happens in some cases when I try to play a video in Firefox, but I'm not completely sure.

Also reported here, but Fedora generally doesn't pay any attention to GPU driver issues:

https://bugzilla.redhat.com/show_bug.cgi?id=2131923

Last working system:

  kernel-5.13.8-100.fc33.x86_64
  libglvnd-1:1.3.3-1.fc34.x86_64
  mesa-libGL-21.1.8-3.fc34.x86_64
  libdrm-2.4.109-1.fc34.x86_64
  xorg-x11-server-Xorg-1.20.14-3.fc34.x86_64

First broken system:

  kernel-5.19.8-100.fc35.x86_64
  libglvnd-1:1.3.4-2.fc35.x86_64
  mesa-libGL-21.3.9-1.fc35.x86_64
  libdrm-2.4.110-1.fc35.x86_64
  xorg-x11-server-Xorg-1.20.14-7.fc35.x86_64

Attached are all kernel logs since the issue started happening. They also include a fresh boot from the last good kernel, and a good run with the new kernel.

I think that first run with the new kernel was just a fluke, though. The only package upgraded after the system upgrade and before the lockups started is annobin.
Comment 1 Alex Deucher 2022-10-25 16:24:56 UTC
Any chance you could bisect?  There have been very few changes to the radeon kernel driver over the last few years.  It could also be a mesa regression.  Does upgrading or downgrading mesa help?
Comment 2 Pierre Ossman 2022-10-26 05:40:29 UTC
A bisect will be difficult, given that I can't reproduce it. :/

Any clues from the dmesg that could tell how to provoke it? Or some settings that could provide more information?

I can try a few versions and see if I'm able to narrow it down somewhat. It's difficult to know when to assume a version is good, as in some cases it has gone weeks without a lockup...
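
A rough way to reason about "when to assume it's a good version" for an intermittent lockup is sketched below. It assumes, purely for illustration, that a bad version locks up independently each day with some fixed probability; that model, the function name runs_needed, and the example rates are assumptions and were not measured on this machine.

```python
import math


def runs_needed(p_lockup_per_day, confidence=0.95):
    """Days without a lockup needed before calling a version good.

    Assumes (hypothetically) that a bad version locks up independently
    each day with probability p_lockup_per_day.
    """
    if not 0.0 < p_lockup_per_day < 1.0:
        raise ValueError("p_lockup_per_day must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # A bad version survives n clean days with probability (1 - p)^n.
    # Find the smallest n for which that is at most 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_lockup_per_day))


if __name__ == "__main__":
    # Illustrative rates only: one lockup every ~2 days vs. every ~10 days.
    for p in (0.5, 0.1):
        print(f"p={p}: {runs_needed(p)} clean days needed")
```

With a lockup that only shows up every week or two, each bisect step under this model would take about a month of clean running, which is why a reliable reproducer matters more than the bisect itself.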
Comment 3 Pierre Ossman 2022-10-26 05:58:39 UTC
This is wrong, I checked the wrong lines in dnf's history:

> Last working system:
> 
>   kernel-5.13.8-100.fc33.x86_64

The last working kernel is actually 5.17.12-100.fc34.x86_64. So if it's the kernel it's likely 5.18 or 5.19 that regressed. I'll give 5.18.1 a spin.
Comment 4 Pierre Ossman 2022-10-28 05:40:02 UTC
I just got a GPU lockup on 5.18.4. So it's either not the kernel, or a bug that appeared in the 5.18 series. I'll go back to the known good kernel now and see if I can get the bug there.


One thought though, even if it is mesa that happens to issue a bad sequence of commands, shouldn't the kernel driver be able to reset the GPU? It certainly indicates that it is trying.
Comment 5 Pierre Ossman 2022-11-11 06:47:59 UTC
The lockup happens on 5.17.2 as well, so it seems the kernel is not the most likely suspect.

I'll see if I can try an older mesa next.

Could the issue be with the firmware? Has that changed recently for these devices?

The last good firmware should be:

  linux-firmware-20220509-132.fc34.noarch

And the first bad firmware should be:

  linux-firmware-20220708-136.fc35.noarch
Comment 6 Alex Deucher 2022-11-11 14:54:23 UTC
(In reply to Pierre Ossman from comment #5)
> 
> Could the issue be with the firmware? Has that changed recently for these
> devices?
> 
> The last good firmware should be:
> 
>   linux-firmware-20220509-132.fc34.noarch
> 
> And the first bad firmware should be:
> 
>   linux-firmware-20220708-136.fc35.noarch

Not likely. The firmware for this chip has not changed in years.
Comment 7 Pierre Ossman 2022-12-20 06:53:26 UTC
Sorry, I haven't had time to look at downgrading Mesa yet. But FYI, it does still happen with mesa 22.1.7 and kernel 6.0.10.

I am now almost 100% certain that it is videos that are triggering this. And possibly not all videos. So I'm thinking, perhaps the video acceleration?

Is that also handled by mesa, or some other component?
Comment 8 Alex Deucher 2022-12-20 15:13:03 UTC
(In reply to Pierre Ossman from comment #7)
> 
> Is that also handled by mesa, or some other component?

Yes, mesa handles video APIs (VAAPI, OpenMAX, VDPAU) as well as 3D (OpenGL, Vulkan).
Comment 9 Pierre Ossman 2023-03-06 06:13:25 UTC
FYI, it seems to have gotten worse since upgrading from kernel-6.1.8-100.fc36.x86_64 to kernel-6.1.13-100.fc36.x86_64.

It now hangs more arbitrarily, not just when trying to play a video. Having done a suspend/resume cycle is still a requirement though.

I'm struggling to build the old version of mesa that still worked. It isn't very compatible with newer LLVM, and there is something wrong with Fedora's packaging of LLVM 12 (which seems to be the matching LLVM version for that old mesa). It will take some more effort to get that test up and running.
Comment 10 Pierre Ossman 2023-03-09 06:23:05 UTC
I finally got that old version of mesa to build. Unfortunately, the hangs still happen even with that. :/

> Mar 09 07:18:30 kernel: radeon 0000:00:01.0: ring 3 stalled for more than 10028msec
> Mar 09 07:18:30 kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000000fa91 last fence id 0x000000000000fabc on ring 3)
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 5 stalled for more than 10077msec
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000018fb last fence id 0x00000000000018fe on ring 5)
> Mar 09 07:18:31 kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10202msec
> ...

What can we do next to pinpoint this?

It seems to fail rather reliably after a suspend/resume. Is there some test suite I can run to provoke things?
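
To compare lockups across boots, the "GPU lockup" messages quoted above could be summarized per ring with a small script. This is only a sketch: it assumes the messages are unwrapped (one per line, as in the raw journal), and it treats the gap between "last fence id" and "current fence id" as the number of fences still outstanding on that ring, which is an interpretation rather than something confirmed in this report. The function name summarize_lockups is made up for the example.

```python
import re

# Matches the radeon "GPU lockup" message quoted above (unwrapped form).
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize_lockups(log_text):
    """Return a list of (ring, current_fence, last_fence, gap) tuples."""
    results = []
    for match in LOCKUP_RE.finditer(log_text):
        current = int(match.group(1), 16)
        last = int(match.group(2), 16)
        ring = int(match.group(3))
        # Assumption: gap = fences emitted but not yet completed.
        results.append((ring, current, last, last - current))
    return results


# The two complete lockup messages from the log excerpt above.
sample = (
    "kernel: radeon 0000:00:01.0: GPU lockup (current fence id "
    "0x000000000000fa91 last fence id 0x000000000000fabc on ring 3)\n"
    "kernel: radeon 0000:00:01.0: GPU lockup (current fence id "
    "0x00000000000018fb last fence id 0x00000000000018fe on ring 5)\n"
)

if __name__ == "__main__":
    for ring, current, last, gap in summarize_lockups(sample):
        print(f"ring {ring}: current={current:#x} last={last:#x} gap={gap}")
```

Running this over the attached dmesg (saved to a file and read into a string) would show whether the same rings stall first every time, which might help narrow down which engine hangs after resume.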
Comment 11 Pierre Ossman 2023-03-24 17:52:21 UTC
(In reply to Pierre Ossman from comment #9)
> 
> It now hangs more arbitrarily, not just when trying to play a video. Having
> done a suspend/resume cycle is still a requirement though.
> 

I tried disabling video acceleration, and the hangs are now gone. So it does seem to be the culprit after all.

Could this help you pinpoint things somehow?
