Bug 213391

Summary: amdgpu retry page faults with some specific processes, sometimes followed by [gfxhub0] retry page faults until *ERROR* ring gfx timeout, but soft recovered
Product: Drivers Reporter: Lahfa Samy (samy)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: RESOLVED UNREPRODUCIBLE    
Severity: low CC: dimitris, dominic.letz, dushistov, himself, linus.kardell, lsrzj, matejm98mthw, mcmarius, nirmoy.aiemd, philipp.list, samy, ville.aakko, xiehuanjun
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=101831
Kernel Version: Linux 5.12.9-arch-1-1 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg-chromium-amdgpu-retry-page-fault
journalctl-amdgpu-qutebrowser-page-retry
Crash log for kernel 5.12.10
amdgpu crash log for kernel 5.4.126
Firmware info
amdgpu-xorg-page-faults-screen-blackout-when-memory-heavily-used
Firmware information for a T495 with an AMD Vega RX 10
Archlinux-part-of-modinfo-amdgpu
Kernel crash log for linux firmware version 20210511.7685cf4
Linux Firmware version info 20210511.7685cf4

Description Lahfa Samy 2021-06-10 11:16:11 UTC
Hi,

I recently updated from mainstream kernel 5.11.16 to 5.12.9 and have run into this issue. I also updated the Mesa driver from mesa-git (21.1.0_devel.137307.f8e5f945b8f-1) to mesa-git (21.2.0_devel.140633.c04f20e7e01-1).

Current kernel parameters : /vmlinuz-linux zfs=zroot/ROOT/default rw loglevel=3 quiet radeon.si_support=0 amdgpu.si_support=1 radeon.cik_support=0 amdgpu.cik_support=1

My computer is a ThinkPad T495 laptop (AMD Ryzen 7 3700 Pro with an RX Vega 10 iGPU, 16GB DDR4 3200MHz). An important detail is that the BIOS can reserve up to 2GB of DDR4 RAM as iGPU VRAM; I currently have 1GB (1024MB) set in the BIOS. I suspect the page fault retries could be linked to this in some way.

I think this has a higher chance of happening when my RAM is under heavy load and the system is swapping heavily (I have 12.3GB of swap on an NVMe PCIe 3.0 drive).

I cannot reproduce this issue consistently yet. It has happened mostly with Qutebrowser, and only once with Chromium. In the Chromium case the X11 server crashed and the computer froze completely, but the kernel was still responsive to SysRq keys, so I could get out of the situation safely.

I'll be uploading the logs of both crashes I have encountered, along with lspci output and other log files that could be useful.

Kind regards,

Lahfa Samy
Comment 1 Lahfa Samy 2021-06-10 11:33:33 UTC
Created attachment 297287 [details]
dmesg-chromium-amdgpu-retry-page-fault

The dmesg shows the tail end of the system entering a sleep state and then resuming from it. A USB-C dock with screens attached was connected to the laptop, but the errors happened both with the dock plugged in and with it unplugged.
Comment 2 Lahfa Samy 2021-06-10 11:43:11 UTC
Created attachment 297291 [details]
journalctl-amdgpu-qutebrowser-page-retry

This time there was no gfx timeout and thus the X11 server did not freeze, and I didn't notice the retry page faults until I ran dmesg.

There is a call trace at the beginning (irq 7: nobody cared (try booting with the "irqpoll" option)). This is a known and reported bug that hasn't affected my computer's functionality in any way since I acquired it.
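
For anyone who wants to check their own logs without reading them line by line, here is a minimal sketch that counts the two messages named in the bug summary in a saved dmesg or journalctl dump. The match strings are taken from the summary of this bug; the exact kernel wording may differ between kernel versions, so treat the patterns (and the function name, which is made up for illustration) as assumptions.

```python
import re
from collections import Counter

# Substrings taken from this bug's summary. The exact kernel wording may
# vary between versions, so these patterns are assumptions.
PATTERNS = {
    "retry_page_fault": re.compile(r"\[gfxhub0\] retry page fault"),
    "gfx_timeout": re.compile(r"ring gfx timeout"),
}


def count_amdgpu_events(log_text: str) -> Counter:
    """Count the lines of a log dump that match each pattern."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Usage would be something like `count_amdgpu_events(open("dmesg.txt").read())` after saving the output of `dmesg` to a file. A non-zero `retry_page_fault` count with a zero `gfx_timeout` count would correspond to the case described in this comment, where the faults happened without a freeze.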
Comment 3 Nirmoy 2021-06-10 12:36:42 UTC
How much VRAM do you have? I can't seem to find that in dmesg. We recently fixed a similar issue with https://patchwork.freedesktop.org/patch/437369/. I wonder if you can try this patch out.
Comment 4 Lahfa Samy 2021-06-10 12:51:52 UTC
I have about 1GB of VRAM currently set according to glxinfo:

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0) (0x15d8)
    Version: 21.2.0
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 42 MB, largest block: 42 MB
    VBO free aux. memory - total: 2442 MB, largest block: 2442 MB
    Texture free memory - total: 42 MB, largest block: 42 MB
    Texture free aux. memory - total: 2442 MB, largest block: 2442 MB
    Renderbuffer free memory - total: 42 MB, largest block: 42 MB
    Renderbuffer free aux. memory - total: 2442 MB, largest block: 2442 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 4096 MB
    Currently available dedicated video memory: 42 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0)

How would I go about testing a patch? I probably need to rebuild the Linux kernel with the patch applied and boot into it, right? I found this link, but it says that the information in it is probably deprecated: https://www.kernel.org/doc/html/v5.12/process/applying-patches.html
Comment 5 Nirmoy 2021-06-10 13:09:16 UTC
Please let me know which distro you are using, then I can prepare a complete guide.
Comment 6 Lahfa Samy 2021-06-10 13:19:07 UTC
I'm on ArchLinux with the ZFS module (I can't boot and mount the root/home "partition" without it). Thanks for the time you'll be taking to write this guide; I'll do my best to test the patch in any way I can.
Comment 7 Nirmoy 2021-06-10 17:41:06 UTC
Actually, I am wrong, I checked out v5.12.9-arch1 from Arch and realized the fix I mentioned before isn't valid.
Comment 8 Lahfa Samy 2021-06-10 19:45:49 UTC
In the meantime, I'll be trying to find a way to reproduce this issue reliably, if you have any plans on writing a patch for this issue, I would be glad to help in any testing in order to help squash this bug.
Comment 9 Michel Dänzer 2021-06-11 07:31:28 UTC
If you can, reverting to an older version of the files under /lib/firmware/amdgpu/ may avoid the hangs.
Comment 10 dimitris 2021-06-11 23:32:36 UTC
Seeing the same thing on a T495 running Fedora 33 and Wayland, typically involving Firefox: https://bugzilla.redhat.com/show_bug.cgi?id=1966384

Would it be possible for me to try that patch?
Comment 11 Lahfa Samy 2021-06-12 23:02:02 UTC
Hi Dimitris, what is your current kernel version under Fedora (the output of "uname --kernel-release" in a terminal)? I cannot try the patch that was given, but I haven't run into the issue again; I haven't had the time to put my RAM under heavy load.
Comment 12 dimitris 2021-06-13 17:43:58 UTC
Hi, I've seen this under 5.12.10-200.fc33.x86_64, two incidents hours apart.  Earlier had a number of incidents under 5.12.9.

In all of my cases I was using Firefox "heavily".  Creating tabs and using graphics-heavy pages.
Comment 13 Nirmoy 2021-06-14 08:01:50 UTC
Hi Dimitris and Lahfa, please try Michel's suggestion.
Comment 14 Dominic Letz 2021-06-15 22:14:10 UTC
Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade the /lib/firmware/amdgpu any hint to which git tag you would consider safe?
Comment 15 Michel Dänzer 2021-06-16 08:51:31 UTC
(In reply to Dominic Letz from comment #14)
> Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade
> the /lib/firmware/amdgpu any hint to which git tag you would consider safe?

20210315 seems to work fine here (on an E595).
Comment 16 Dominic Letz 2021-06-16 10:46:31 UTC
(In reply to Michel Dänzer from comment #15)
> (In reply to Dominic Letz from comment #14)
> > Having the same issue on an E495 with Kernel 5.12.9. Will try to downgrade
> > the /lib/firmware/amdgpu any hint to which git tag you would consider safe?
> 
> 20210315 seems to work fine here (on an E595).

+1 trying that
Comment 17 Leandro Jacques 2021-06-16 20:55:22 UTC
Created attachment 297413 [details]
Crash log for kernel 5.12.10

I'm having issues with amdgpu since kernel 5.10. I had to downgrade to 5.4 LTS to get rid of any kind of issue.
Comment 18 Leandro Jacques 2021-06-18 18:27:38 UTC
Created attachment 297467 [details]
amdgpu crash log for kernel 5.4.126

Before 5.4.126 I had no issues at all. Downgrading to 5.4.123 to check whether the problem goes away.
Comment 19 dimitris 2021-06-18 20:30:50 UTC
I've also just replaced /lib/firmware/amdgpu with the `20210315` version, I'll see how this goes.  Currently running Fedora kernel 5.12.11-200.fc33.x86_64 on a T495.

Question, don't I also need to update the initrd?  `lsinitrd` shows that all the amdgpu modules are included in the initrd image.  Or is the firmware reloaded once root is mounted?
Comment 20 Michel Dänzer 2021-06-19 12:15:51 UTC
(In reply to dimitris from comment #19)
> Question, don't I also need to update the initrd?

Yes you do, if it didn't happen automatically.
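
One way to see whether amdgpu firmware is actually packed into the initrd is to filter a file listing of the image. Below is a minimal sketch that works on the text of such a listing (for example, saved output of `lsinitrd`). It assumes an `ls -l` style listing where the path is the last whitespace-separated field, which would misread symlink entries; the function name is hypothetical.

```python
def amdgpu_firmware_in_listing(listing: str) -> list[str]:
    """Return paths from an initrd file listing that look like amdgpu firmware."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Assumption: the file path is the last field on the line.
        path = fields[-1]
        if "firmware/amdgpu/" in path:
            found.append(path)
    return found
```

If the returned list is non-empty, the initrd carries its own copy of the firmware, so replacing the files under /lib/firmware/amdgpu/ alone would not change what is loaded at early boot until the initrd is regenerated.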
Comment 21 Dominic Letz 2021-06-20 13:02:08 UTC
I've been running 20210315 since the 16th and it has been stable so far, versus multiple freezes a day before.
Comment 22 dimitris 2021-06-20 21:07:27 UTC
Updated initrd also to 20210315, ran under 5.12.11-200.fc33 for a day or so without issues, now under 5.12.12-200.fc33, we'll see how it goes.

For reference what's the best way to check the active/loaded firmware?  I don't see anything obvious on dmesg or lspci -vv.
Comment 23 Michel Dänzer 2021-06-21 07:04:54 UTC
/sys/kernel/debug/dri/0/amdgpu_firmware_info has all the info.
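
To compare the loaded firmware before and after a downgrade, the contents of that file can be parsed and diffed. The following is a minimal sketch that assumes each relevant line has the form `<block> feature version: <n>, firmware version: 0x<hex>`; that layout is an assumption about the file and may differ between kernel versions. The function names are made up for illustration, and reading the file normally requires root and a mounted debugfs.

```python
import re

# Assumed line layout: "<block> feature version: <n>, firmware version: 0x<hex>".
# Lines that do not match (for example a VBIOS line) are skipped.
LINE_RE = re.compile(
    r"^(?P<block>.+?) feature version: (?P<feature>\d+),"
    r".*?firmware version: (?P<firmware>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text: str) -> dict[str, tuple[int, int]]:
    """Map each firmware block name to (feature version, firmware version)."""
    info = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            info[match["block"]] = (
                int(match["feature"]),
                int(match["firmware"], 16),
            )
    return info


def changed_blocks(before: dict, after: dict) -> list[str]:
    """Return the sorted names of blocks whose versions differ between two dumps."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Saving one dump before the downgrade and one after the reboot, then calling `changed_blocks` on the two parsed results, would show whether the downgrade actually took effect. An empty result would suggest the old firmware is still being loaded, for instance from a stale initrd.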
Comment 24 Leandro Jacques 2021-06-21 18:55:06 UTC
Created attachment 297557 [details]
Firmware info

The downgrade to kernel 5.4.123 didn't have any effect; I hit the same bug. I'm attaching my firmware version information.
Comment 25 Leandro Jacques 2021-06-21 19:26:28 UTC
(In reply to Dominic Letz from comment #21)

Trying the same linux-firmware version, 20210315. Let's see how it goes.
Comment 26 Lahfa Samy 2021-06-29 23:55:11 UTC
Created attachment 297669 [details]
amdgpu-xorg-page-faults-screen-blackout-when-memory-heavily-used

Here are more logs. I triggered the bug again on the 5.12.10-arch1-1 kernel on ArchLinux. This time the computer didn't freeze like before; it just stopped displaying anything (I guess because Xorg was affected).
I'm using this version of the linux-firmware package under Arch: linux-firmware-20210511.7685cf4-1

I have not yet tested with a downgraded linux-firmware package; I may try this soon if the issue hits me too frequently.
Comment 27 Lahfa Samy 2021-06-29 23:58:30 UTC
Created attachment 297671 [details]
Firmware information for a T495 with an AMD Vega RX 10

Here is again my Linux firmware package version (given by pacman coming from ArchLinux core repositories) : 20210511.7685cf4-1
Comment 28 Leandro Jacques 2021-06-30 19:00:07 UTC
(In reply to Leandro Jacques from comment #25)

No problems so far. I've been running version 20210315 without any issues since 2021-06-21 19:26:28 UTC, so the problem seems to be with newer firmware versions.
Comment 29 Leandro Jacques 2021-07-05 16:55:54 UTC
How do I file a bug with the linux-firmware project for the amdgpu driver? Since the downgrade I haven't experienced any more issues.
Comment 30 Leandro Jacques 2021-07-06 17:35:07 UTC
(In reply to Dominic Letz from comment #21)
I did what you suggested and have had no issues since. It was a linux-firmware package problem, not a kernel driver problem.
Comment 31 Lahfa Samy 2021-07-08 18:24:43 UTC
I have just hit the same error even after downgrading. The currently installed package is linux-firmware 20210315.3568f96-3.

The computer froze for a few seconds, and the logs show many retry page faults from the amdgpu driver.

I'm on ArchLinux and will attach the output of `modinfo amdgpu`. It seems that downgrading linux-firmware on my distro wasn't enough to downgrade the AMDGPU driver.
Comment 32 Lahfa Samy 2021-07-08 18:28:20 UTC
Created attachment 297781 [details]
Archlinux-part-of-modinfo-amdgpu

I think that my kernel is using the latest amdgpu driver that comes with 5.12.13-arch1-2 and not the version that comes with the linux-firmware package. Could anyone explain whether I'm mistaken?
Comment 33 Leandro Jacques 2021-07-15 13:29:43 UTC
Created attachment 297881 [details]
Kernel crash log for linux firmware version 20210511.7685cf4

amdgpu kernel crash log from when the problem occurred, with the exact same message about page faults.
Comment 34 Leandro Jacques 2021-07-15 13:31:03 UTC
Created attachment 297883 [details]
Linux Firmware version info 20210511.7685cf4

Firmware info as of the moment when the system crashed
Comment 35 mcmarius 2021-08-14 21:00:21 UTC
I have a Lenovo L340 and the same problem.

Here is the complete dmesg log:

https://gist.github.com/McMarius11/36c8d21a2dcaf5c2289c91a74af4f7fb

Operating System: Manjaro Linux
KDE Plasma Version: 5.22.4
KDE Frameworks Version: 5.84.0
Qt Version: 5.15.2
Kernel Version: 5.11.22-2-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
Memory: 5,6 GiB of RAM
Graphics Processor: AMD Radeon™ Vega 10 Graphics
Comment 36 Lahfa Samy 2021-09-10 11:46:14 UTC
Did anyone test whether this has been fixed in newer firmware updates, or should we still stay on version 20210315.3568f96-3 ?
Comment 37 Michel Dänzer 2021-09-10 13:29:28 UTC
(In reply to Lahfa Samy from comment #36)
> Did anyone test whether this has been fixed in newer firmware updates, or
> should we still stay on version 20210315.3568f96-3 ?

It's fixed in upstream linux-firmware 20210818.
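
The firmware dates reported in this thread can be summarized in a small helper. This is only a sketch of what commenters reported here, not an authoritative compatibility table: 20210315 was reported stable by several people (although comment 31 still saw faults with that package installed), 20210511 was reported affected, and 20210818 is the upstream release said to contain the fix. The function names and labels are made up for illustration.

```python
import re

# Dates as reported in this thread; not an official compatibility list.
REPORTED_GOOD = 20210315  # stable for several commenters
REPORTED_BAD = 20210511   # crash logs attached for this version
FIXED_FROM = 20210818     # upstream release reported to contain the fix


def firmware_date(version: str) -> int:
    """Extract the leading YYYYMMDD date from a linux-firmware version string."""
    match = re.match(r"(\d{8})", version)
    if match is None:
        raise ValueError(f"no date prefix in {version!r}")
    return int(match.group(1))


def classify(version: str) -> str:
    """Label a linux-firmware version based on the reports in this thread."""
    date = firmware_date(version)
    if date >= FIXED_FROM:
        return "fixed"
    if date >= REPORTED_BAD:
        return "suspect"
    if date == REPORTED_GOOD:
        return "reported-good"
    # Other releases were not tested in this thread.
    return "unknown"
```

For example, the Arch package version string `20210511.7685cf4-1` from comment 27 has the date prefix 20210511, so it falls in the "suspect" range.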