Bug 204559

Summary: amdgpu: kernel oops with constant gpu resets while using mpv
Product: Drivers
Reporter: Maxim Sheviakov (shoegaze)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: alexdeucher, kode54, postix, reuben_p, shoegaze, thejoe
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.2.7
Subsystem:
Regression: No
Bisected commit-id:
Attachments: oops.txt
journalctl --dmesg output
dmesg -w without runpm parameter
dmesg -w with runpm=0 parameter

Description Maxim Sheviakov 2019-08-12 10:53:06 UTC
Created attachment 284335 [details]
oops.txt

While watching a video with mpv (default config), the system eventually hangs. This is actually a kernel oops that follows a long run of GPU resets, roughly one every second or so, over a span of about 5 minutes (everything seems fine at first):
> Aug 12 00:46:49 mashedpotato kernel: [drm] UVD and UVD ENC initialized
> successfully.
> Aug 12 00:46:49 mashedpotato kernel: [drm] VCE initialized successfully.
> Aug 12 00:46:56 mashedpotato kernel: amdgpu 0000:01:00.0: GPU pci config
> reset
> Aug 12 00:46:59 mashedpotato kernel: [drm] PCIE GART of 256M enabled (table
> at 0x000000F400000000).


This block of messages repeats many times and is then followed by this error:
> Aug 12 00:52:20 mashedpotato kernel: amdgpu 0000:01:00.0:
> [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
> Aug 12 00:52:20 mashedpotato kernel: [drm:amdgpu_device_resume [amdgpu]]
> *ERROR* resume of IP block <sdma_v3_0> failed -110
> Aug 12 00:52:20 mashedpotato kernel: [drm:amdgpu_device_resume [amdgpu]]
> *ERROR* amdgpu_device_ip_resume failed (-110).
> Aug 12 00:52:25 mashedpotato kernel: BUG: kernel NULL pointer dereference,
> address: 0000000000000000
> Aug 12 00:52:25 mashedpotato kernel: #PF: supervisor instruction fetch in
> kernel mode
> Aug 12 00:52:25 mashedpotato kernel: #PF: error_code(0x0010) - not-present
> page


It ends in a kernel oops; the log is in the attachment. Afterwards the system can only be recovered with a hard reset, although the audio from the video keeps playing just fine.
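
The pattern in the excerpts above (repeated reset messages, then a ring test failure) can be tallied from a saved log. This is only a minimal sketch, assuming the log is plain journalctl/dmesg text with one message per line; `summarize_resets` is a hypothetical helper name, and the matched strings are copied from the excerpts above rather than from any amdgpu tooling.

```python
import re

# Substring taken from the "GPU pci config reset" message quoted above.
RESET_MARKER = "GPU pci config reset"
# Matches e.g. "*ERROR* ring sdma0 test failed (-110)".
RING_FAIL_RE = re.compile(r"ring (\S+) test failed \((-?\d+)\)")


def summarize_resets(lines):
    """Count reset messages and report the first ring test failure, if any."""
    resets = 0
    first_failure = None
    for line in lines:
        if RESET_MARKER in line:
            resets += 1
        elif first_failure is None:
            match = RING_FAIL_RE.search(line)
            if match:
                # (ring name, error code), e.g. ("sdma0", -110)
                first_failure = (match.group(1), int(match.group(2)))
    return resets, first_failure
```

It could be fed the lines of a file produced by `journalctl --dmesg`. On Linux x86-64 the error code -110 corresponds to ETIMEDOUT, which is consistent with the ring test timing out.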


My system is an ASUS TUF FX505-DY laptop with the latest BIOS. lspci:
> 00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root
> Complex
> 00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
> 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models
> 00h-1fh) PCIe Dummy Host Bridge
> 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP
> Bridge [6:0]
> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP
> Bridge [6:0]
> 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP
> Bridge [6:0]
> 00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP
> Bridge [6:0]
> 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models
> 00h-1fh) PCIe Dummy Host Bridge
> 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal
> PCIe GPP Bridge 0 to Bus A
> 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev
> 61)
> 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev
> 51)
> 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 0
> 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 1
> 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 2
> 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 3
> 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 4
> 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 5
> 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 6
> 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device
> 24: Function 7
> 01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin
> [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
> 02:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc.
> Device 5008 (rev 01)
> 03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8821CE
> 802.11ac PCIe Wireless Network Adapter
> 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
> 05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Picasso (rev c2)
> 05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI]
> Raven/Raven2/Fenghuang HDMI/DP Audio Controller
> 05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h
> (Models 10h-1fh) Platform Security Processor
> 05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1
> 05:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1
> 05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models
> 10h-1fh) HD Audio Controller

I have amdgpu.gpu_reset=1 on my kernel command line because I am trying to figure out another issue: sometimes the system hangs after locking and disabling the screen, and I guess it is related to GPU resets.
Comment 1 Alex Deucher 2019-08-12 13:04:46 UTC
Please attach your full dmesg output from boot.
Comment 2 Maxim Sheviakov 2019-08-12 13:30:28 UTC
Created attachment 284337 [details]
journalctl --dmesg output

As I can't run dmesg after the system has hung, here's the journalctl --dmesg output from that particular boot.
Comment 3 Alex Deucher 2019-08-12 14:19:13 UTC
You can fetch the output before the hang.
Comment 4 Alex Deucher 2019-08-12 14:22:06 UTC
Looks like your system has two GPUs.  Can you try booting with amdgpu.runpm=0?  Does that fix the issue?
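
To confirm that the parameter was really picked up after rebooting, the kernel command line can be inspected. A minimal sketch, assuming the option is passed as `amdgpu.runpm=<value>` on the command line as suggested above; `cmdline_param` is a hypothetical helper name, not part of any existing tool.

```python
def cmdline_param(cmdline, name):
    """Return the value of `name=value` from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep the last occurrence if the option is repeated
    return value
```

On a running system it would be called with the contents of `/proc/cmdline`, for example `cmdline_param(open("/proc/cmdline").read(), "amdgpu.runpm")`. On kernels where the driver exports the parameter, `/sys/module/amdgpu/parameters/runpm` should show the effective value as well.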
Comment 5 Maxim Sheviakov 2019-08-12 15:56:40 UTC
Created attachment 284341 [details]
dmesg -w without runpm parameter

Here's the whole dmesg from a fresh boot up until the hang; no kernel parameters were modified.
Comment 6 Maxim Sheviakov 2019-08-12 16:42:10 UTC
Created attachment 284345 [details]
dmesg -w with runpm=0 parameter

I left my laptop playing a video for about half an hour, and no GPU-related warnings have been produced so far, only RTL8821CE spam. As far as I can see, the root cause of the problem is somewhere in runtime power management and/or GPU switching.
Comment 7 Maxim Sheviakov 2019-08-12 16:56:29 UTC
By the way, how *exactly* does disabling runpm affect the system? Does it leave the discrete GPU always-on or vice versa? Or does it vary on each system?
I have tried running The Crew via Wine + DXVK while having amdgpu.runpm=0 in my kernel params and it seems that discrete GPU was being used as the framerate was more than fine.
Comment 8 Alex Deucher 2019-08-12 17:01:49 UTC
(In reply to Maxim Sheviakov from comment #7)
> By the way, how *exactly* does disabling runpm affect the system? Does it
> leave the discrete GPU always-on or vice versa? Or does it vary on each
> system?

It leaves the dGPU powered up all the time rather than dynamically powering it on/off as needed.
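
Whether the dGPU is currently powered can be read from the PCI runtime power management attributes in sysfs. A minimal sketch, assuming the dGPU sits at PCI address 0000:01:00.0 as in the log above and that sysfs is mounted at /sys; `runtime_status` is a hypothetical helper name.

```python
from pathlib import Path


def runtime_status(pci_addr, sysfs_root="/sys"):
    """Read power/runtime_status for a PCI device (e.g. 'active', 'suspended')."""
    path = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr
        / "power" / "runtime_status"
    )
    try:
        return path.read_text().strip()
    except OSError:
        return None  # device or attribute not present
```

With runtime power management enabled and the dGPU idle, `runtime_status("0000:01:00.0")` would be expected to return `suspended`; with `amdgpu.runpm=0` it should not.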

> I have tried running The Crew via Wine + DXVK while having amdgpu.runpm=0 in
> my kernel params and it seems that discrete GPU was being used as the
> framerate was more than fine.

You can use xrandr to pick which GPU you want to use for rendering.
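
As a starting point for picking a GPU, the providers known to the X server can be listed with `xrandr --listproviders`. A minimal sketch that parses that output, assuming each provider line has the shape `Provider <n>: id: <hex> ... name:<name>`; the helper names are hypothetical and the regular expression may need adjusting if the local xrandr prints a different layout.

```python
import re
import subprocess

# Assumed line shape: "Provider 0: id: 0x54 cap: 0xf, ... name:<name>"
PROVIDER_RE = re.compile(r"^Provider (\d+): id: (0x[0-9a-fA-F]+) .*name:(.*)$")


def parse_providers(text):
    """Extract (index, id, name) tuples from `xrandr --listproviders` output."""
    providers = []
    for line in text.splitlines():
        match = PROVIDER_RE.match(line.strip())
        if match:
            providers.append(
                (int(match.group(1)), match.group(2), match.group(3).strip())
            )
    return providers


def list_providers():
    """Run xrandr and parse its provider list (requires a running X session)."""
    result = subprocess.run(
        ["xrandr", "--listproviders"], capture_output=True, text=True, check=True
    )
    return parse_providers(result.stdout)
```

The indices returned here are the ones that `xrandr --setprovideroffloadsink <provider> <sink>` accepts when setting up render offload.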
Comment 9 Maxim Sheviakov 2019-08-12 17:26:31 UTC
Thanks for your explanation. By the way, disabling runpm also seems to fix the other issue with disabling the display after activating the lockscreen as a powersaving measure.
Is there anything else I can do to help with this one? The whole thing seems to be an issue somewhere in the dynamic switching mechanism, which works but is not really stable, given all these hangs under certain conditions.
Comment 10 Christopher Snowhill 2019-09-07 06:55:57 UTC
This looks like an issue I'm having intermittently with the GPU failing to resume from system sleep mode. Do I need to report a separate issue for this? Should I also bother to test the runpm=0 workaround?
Comment 11 Christopher Snowhill 2019-09-07 06:58:35 UTC
Oops, I neglected to mention: the system is unresponsive to input devices, as the USB inputs all appear to be completely powered off after the GPU crashes, but the network interface is still working, as is sound output, and I'm able to log into the machine via SSH. It does, however, lock up if I attempt to soft reboot it.

The full dmesg from the session that eventually crashed is still available in the journal, up to where it was flooding sdma0 timeouts and failures.
Comment 12 thejoe 2019-10-25 16:38:23 UTC
I have seen the same kernel oops on a Dell XPS 15 2-in-1 9575 with Vega M hybrid graphics.  As far as I could tell, it was not triggered by anything specific (e.g. mpv playback), though.  I am running with runpm=0 now and haven't seen it again yet, but I had only seen it once or twice without runpm=0.
Comment 13 Maxim Sheviakov 2020-01-03 17:40:49 UTC
I'm on kernel 5.4.7 now and it seems like this particular issue is fixed: I tried playing some movies with runpm enabled and things seemed to be okay. It does look like dGPU performance with runpm is considerably worse than without runpm, but I guess that's another issue :)

Can anyone confirm if everything's fine now?
Comment 14 thejoe 2020-01-07 22:31:03 UTC
I have not seen the oops on a 5.3.x kernel (Ubuntu Eoan), even without tweaking the runpm setting (again, I only saw it a few times on an earlier kernel).