Bug 213391
Summary: | AMDGPU retries page fault with some specific processes amdgpu and sometimes followed [gfxhub0] retry page fault until *ERROR* ring gfx timeout, but soft recovered | ||
---|---|---|---|
Product: | Drivers | Reporter: | Lahfa Samy (samy) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED UNREPRODUCIBLE | ||
Severity: | low | CC: | dimitris.on.linux, dominic.letz, dushistov, himself, linus.kardell, lsrzj, matejm98mthw, mcmarius, nirmoy.aiemd, philipp.list, samy, ville.aakko, xiehuanjun |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
See Also: | https://bugzilla.kernel.org/show_bug.cgi?id=101831 | ||
Kernel Version: | Linux 5.12.9-arch-1-1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg-chromium-amdgpu-retry-page-fault
journalctl-amdgpu-qutebrowser-page-retry
Crash log for kernel 5.12.10
amdgpu crash log for kernel 5.4.126
Firmware info
amdgpu-xorg-page-faults-screen-blackout-when-memory-heavily-used
Firmware information for a T495 with an AMD Vega RX 10
Archlinux-part-of-modinfo-amdgpu
Kernel crash log for linux firmware version 20210511.7685cf4
Linux Firmware version info 20210511.7685cf4 |
Description
Lahfa Samy
2021-06-10 11:16:11 UTC
Created attachment 297287 [details]
dmesg-chromium-amdgpu-retry-page-fault
The dmesg shows the tail end of the laptop entering a sleep state and then leaving it. A USB-C dock with screens attached was connected to the laptop; however, the errors happened both with the dock plugged in and with it unplugged.
Created attachment 297291 [details]
journalctl-amdgpu-qutebrowser-page-retry
This time there was no gfx timeout and thus the X11 server did not freeze, and I didn't notice the retry page faults until I ran dmesg.
There is a call trace at the beginning (irq 7: nobody cared (try booting with the "irqpoll" option), followed by a call trace). This is a known and reported bug that hasn't affected my computer's functionality in any way since I acquired it.
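For anyone who wants to check their own logs for the same symptoms, here is a minimal Python sketch that counts the two messages named in this bug's summary. The marker strings come from the summary; the rest of each log line's format is an assumption, and `count_amdgpu_faults` is a hypothetical helper name, not an existing tool.

```python
import re
from collections import Counter

# Marker substrings taken from this bug's summary. The text surrounding
# them in real kernel logs (timestamps, PCI address, ...) is not assumed.
MARKERS = {
    "retry_page_fault": re.compile(r"\[gfxhub0\] retry page fault"),
    "ring_timeout": re.compile(r"\*ERROR\* ring gfx timeout"),
}


def count_amdgpu_faults(log_text: str) -> Counter:
    """Count log lines that contain each marker (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feed it the text of a saved dmesg or journalctl capture. Retry page faults with no ring timeout would match the case above where X11 did not freeze.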
How much VRAM do you have? I can't seem to find that from dmesg. We recently fixed a similar issue using https://patchwork.freedesktop.org/patch/437369/. I wonder if you can try this patch out.

I have about 1GB of VRAM currently set according to glxinfo:

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0) (0x15d8)
    Version: 21.2.0
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 42 MB, largest block: 42 MB
    VBO free aux. memory - total: 2442 MB, largest block: 2442 MB
    Texture free memory - total: 42 MB, largest block: 42 MB
    Texture free aux. memory - total: 2442 MB, largest block: 2442 MB
    Renderbuffer free memory - total: 42 MB, largest block: 42 MB
    Renderbuffer free aux. memory - total: 2442 MB, largest block: 2442 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 4096 MB
    Currently available dedicated video memory: 42 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0)

How would I go about testing a patch? (I probably need to rebuild the Linux kernel with the patch and boot with it, right?) I found this link, but it says that the information in there is probably deprecated: https://www.kernel.org/doc/html/v5.12/process/applying-patches.html

Please let me know what distro you are using, then I can prepare a complete guide.

I'm on ArchLinux running with the ZFS module (I can't boot and mount the root/home "partition" without it). Thanks for the time you'll be taking to write this guide; I'll try my best to test the patch in any way I can.

Actually, I was wrong: I checked out v5.12.9-arch1 from Arch and realized the fix I mentioned before isn't valid.
In the meantime, I'll be trying to find a way to reproduce this issue reliably. If you have any plans to write a patch for this issue, I would be glad to help with testing in order to squash this bug.

If you can, reverting to an older version of the files under /lib/firmware/amdgpu/ may avoid the hangs.

Seeing the same thing on a T495 running Fedora 33 and Wayland, typically involving Firefox: https://bugzilla.redhat.com/show_bug.cgi?id=1966384 Would it be possible for me to try that patch?

Hi Dimitris, what is your current kernel version under Fedora (the output of "uname --kernel-release" in a terminal)? I cannot try the patch that was given; however, I haven't run into the issue again, and I haven't had the time to put my RAM under heavy load.

Hi, I've seen this under 5.12.10-200.fc33.x86_64, two incidents hours apart. Earlier I had a number of incidents under 5.12.9. In all of my cases I was using Firefox "heavily": creating tabs and using graphics-heavy pages.

Hi Dimitris and Lahfa, please try Michel's suggestion.

Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade /lib/firmware/amdgpu. Any hint as to which git tag you would consider safe?

(In reply to Dominic Letz from comment #14)
> Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade
> the /lib/firmware/amdgpu any hint to which git tag you would consider safe?

20210315 seems to work fine here (on an E595).

(In reply to Michel Dänzer from comment #15)
> (In reply to Dominic Letz from comment #14)
> > Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade
> > the /lib/firmware/amdgpu any hint to which git tag you would consider safe?
>
> 20210315 seems to work fine here (on an E595).

+1 trying that

Created attachment 297413 [details]
Crash log for kernel 5.12.10
I'm having issues with amdgpu since kernel 5.10. I had to downgrade to 5.4 LTS to get rid of any kind of issue.
Created attachment 297467 [details]
amdgpu crash log for kernel 5.4.126
Before 5.4.126 I had no issues at all, downgrading to 5.4.123 to check if the problem will be gone.
I've also just replaced /lib/firmware/amdgpu with the `20210315` version; I'll see how this goes. Currently running Fedora kernel 5.12.11-200.fc33.x86_64 on a T495. Question: don't I also need to update the initrd? `lsinitrd` shows that all the amdgpu modules are included in the initrd image. Or is the firmware reloaded once root is mounted?

(In reply to dimitris from comment #19)
> Question, don't I also need to update the initrd?

Yes you do, if it didn't happen automatically.

I've been running 20210315 since the 16th and it has been stable so far, versus multiple freezes a day before.

Updated the initrd to 20210315 as well, ran under 5.12.11-200.fc33 for a day or so without issues, now under 5.12.12-200.fc33; we'll see how it goes. For reference, what's the best way to check the active/loaded firmware? I don't see anything obvious in dmesg or lspci -vv.

/sys/kernel/debug/dri/0/amdgpu_firmware_info has all the info.

Created attachment 297557 [details]
Firmware info
The downgrade to kernel 5.4.123 didn't have any effect; I hit the same bug. I'm now attaching my firmware version information.
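To compare firmware versions between machines, the contents of /sys/kernel/debug/dri/0/amdgpu_firmware_info can be turned into a dictionary. The sketch below assumes each relevant line looks like `<NAME> feature version: <N>, firmware version: 0x<HEX>`; that layout is an assumption about the file and may differ between kernel versions. `parse_firmware_info` is a hypothetical helper name.

```python
import re

# Assumed line layout: "<NAME> feature version: <N>, firmware version: 0x<HEX>".
# Lines that do not match (for example a VBIOS line) are skipped.
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text: str) -> dict:
    """Map firmware block name -> (feature version, firmware version)."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match["name"]] = (int(match["feature"]), int(match["fw"], 16))
    return result
```

Reading the debugfs file normally requires root and a mounted debugfs, so it is easiest to save its contents to a file first and parse that. Two parsed dictionaries (one per firmware package version) can then be compared key by key.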
(In reply to Dominic Letz from comment #21)

Trying the same linux-firmware version, 20210315. Let's see how it goes.

Created attachment 297669 [details]
amdgpu-xorg-page-faults-screen-blackout-when-memory-heavily-used
Here are other logs. I have seen that when triggering the bug yet again on the 5.12.10-arch1-1 linux kernel running on ArchLinux, the computer didn't freeze this time like before, it just stopped displaying anything (Xorg was affected so I guess that's why).
I'm using this version of the linux-firmware package under Arch : linux-firmware-20210511.7685cf4-1
I have not yet downgraded to test with a downgraded linux-firmware package, may try this soon, if I get affected by the issue too frequently.
Created attachment 297671 [details]
Firmware information for a T495 with an AMD Vega RX 10
Here is again my Linux firmware package version (given by pacman coming from ArchLinux core repositories) : 20210511.7685cf4-1
(In reply to Leandro Jacques from comment #25)

Until now, no problems. So the problem is with newer firmware versions: it has been working without any issues since 2021-06-21 19:26:28 UTC with version 20210315. How do I file a bug with the linux-firmware project for the amdgpu driver?

After the downgrade I haven't experienced any issues anymore.

(In reply to Dominic Letz from comment #21)

I did what you suggested; no issues anymore. It was a linux-firmware package problem, not a kernel driver problem.

I have just hit the same error even after downgrading. The current version of the package is linux-firmware 20210315.3568f96-3.

I have hit the error again: the computer froze for a few seconds, and the logs show many retry page faults for the amdgpu driver. Furthermore, I'm on ArchLinux and I will attach the output of `modinfo amdgpu`. It seems that downgrading linux-firmware on my distro wasn't enough to downgrade the AMDGPU driver.

Created attachment 297781 [details]
Archlinux-part-of-modinfo-amdgpu
I think that my kernel is using the latest amdgpu driver that comes with 5.12.13-arch1-2 and not the version that comes with the linux-firmware pkg. If anyone can enlighten me, or explain whether I'm mistaken, please do.
Created attachment 297881 [details]
Kernel crash log for linux firmware version 20210511.7685cf4
amdgpu kernel crash log from when the problem occurred, with the exact same page fault message.
Created attachment 297883 [details]
Linux Firmware version info 20210511.7685cf4
Firmware info as of the moment when the system crashed
I have a Lenovo L340 and the same problem. Here is the complete dmesg log: https://gist.github.com/McMarius11/36c8d21a2dcaf5c2289c91a74af4f7fb

Operating System: Manjaro Linux
KDE Plasma Version: 5.22.4
KDE Frameworks Version: 5.84.0
Qt Version: 5.15.2
Kernel Version: 5.11.22-2-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
Memory: 5,6 GiB of RAM
Graphics Processor: AMD Radeon™ Vega 10 Graphics

Did anyone test whether this has been fixed in newer firmware updates, or should we still stay on version 20210315.3568f96-3?

(In reply to Lahfa Samy from comment #36)
> Did anyone test whether this has been fixed in newer firmware updates, or
> should we still stay on version 20210315.3568f96-3 ?

It's fixed in upstream linux-firmware 20210818. |
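As a rough summary of the versions reported in this thread, the following Python sketch flags linux-firmware package versions that fall between the last version several commenters found stable (20210315) and the upstream version reported as fixed (20210818). The function names are hypothetical. The sketch assumes the package version starts with an eight-digit date, as in the Arch versions quoted above. Note that one reporter still saw page faults on 20210315, and distro packages may carry backported fixes, so treat the result as a hint only.

```python
import re

# Bounds taken from comments in this thread, not from an official advisory.
LAST_REPORTED_GOOD = 20210315    # "seems to work fine here"
FIRST_REPORTED_FIXED = 20210818  # "fixed in upstream linux-firmware 20210818"


def firmware_date(pkg_version: str) -> int:
    """Extract the leading YYYYMMDD date from a package version string."""
    match = re.match(r"(\d{8})", pkg_version)
    if match is None:
        raise ValueError(f"no leading date in {pkg_version!r}")
    return int(match.group(1))


def in_suspect_range(pkg_version: str) -> bool:
    """True if the version lies strictly between the two reported bounds."""
    return LAST_REPORTED_GOOD < firmware_date(pkg_version) < FIRST_REPORTED_FIXED
```

For example, `in_suspect_range("20210511.7685cf4-1")` returns `True`, while the downgraded `20210315.3568f96-3` and the fixed `20210818` both return `False`.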