Bug 204611 - amdgpu error scheduling IBs when waking from sleep
Summary: amdgpu error scheduling IBs when waking from sleep
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
Reported: 2019-08-18 20:32 UTC by tones111
Modified: 2020-07-26 12:40 UTC
CC List: 9 users

Kernel Version: 5.2.9
Tree: Mainline
Regression: Yes


Attachments
journalctl: amdgpu lockup on resume from sleep. (944.13 KB, text/plain)
2019-08-18 20:32 UTC, tones111
journalctl output on Thinkpad X395 (3.46 MB, text/plain)
2019-10-06 13:21 UTC, Carmen Bianca Bakker
journalctl amdgpu fails on resume (21.36 KB, text/plain)
2020-05-24 11:41 UTC, Bastian Luettig
dmesg output when switching to console (8.13 KB, text/plain)
2020-05-24 17:19 UTC, Bastian Luettig
4700u journal (545.55 KB, text/plain)
2020-06-14 18:27 UTC, Daniel Parks

Description tones111 2019-08-18 20:32:53 UTC
Created attachment 284485
journalctl: amdgpu lockup on resume from sleep.

My system locks up when trying to wake from sleep (open lid).  The screen remains black and is unresponsive to keyboard/mouse input.  I'm able to ssh from another machine and have attached the output from journalctl -b.  The log shows scrolling errors...

kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
kernel: amdgpu 0000:05:00.0: couldn't schedule ib on ring <gfx>

This is a Lenovo E585 laptop with an AMD R5 2500U APU.
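
The `-22` in the error above is a negated Linux errno value. As a minimal sketch for decoding such codes (the helper name `decode_kernel_errno` is hypothetical and not part of any kernel tooling):

```python
import errno
import os


def decode_kernel_errno(code: int) -> str:
    """Return 'NAME (message)' for a kernel-style negative error code."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names such as 'EINVAL'.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


print(decode_kernel_errno(-22))
```

On Linux, errno 22 is EINVAL, so the messages indicate that an invalid argument or state was reported somewhere in the call chain. This does not by itself identify the root cause.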
Comment 1 Alex Deucher 2019-08-26 03:16:44 UTC
If this is a regression can you bisect?
Comment 2 tones111 2019-08-26 23:47:21 UTC
The problem was introduced after v5.1 and before v5.2.  It's very reproducible on v5.2 but might become less frequent as the bisect progresses.  My attempts so far have driven me into the weeds, but I'm still trying.

It looks like another user reported the same issue here:
https://bugzilla.kernel.org/show_bug.cgi?id=204227

During my bisect I was seeing visual artifacts without the lockup so I believe they're separate issues.
Comment 3 tones111 2019-09-16 00:44:28 UTC
I'm still working on trying to bisect the problem, but it's been challenging.  Following the advice at https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues I turned on the initcall_debug and no_console_suspend boot options.

I then see the following messages in the boot log after bringing the system back up.

> Sep 15 17:36:39 mobile kernel: [drm] reserve 0x400000 from 0xf400c00000 for PSP TMR SIZE
> ...
> Sep 15 17:36:39 mobile kernel: [drm] psp command failed and response status is (0)
> Sep 15 17:36:39 mobile kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
> Sep 15 17:36:39 mobile kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> Sep 15 17:36:39 mobile kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
> Sep 15 17:36:39 mobile kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
> Sep 15 17:36:39 mobile kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -22
> Sep 15 17:36:39 mobile kernel: amdgpu 0000:05:00.0: pci_pm_resume+0x0/0x90 returned -22 after 19543535 usecs
> Sep 15 17:36:39 mobile kernel: PM: Device 0000:05:00.0 failed to resume async: error -22
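
For anyone comparing their own logs against this failure chain, the following is a rough sketch of how the amdgpu `*ERROR*` lines could be pulled out of saved `journalctl -b` output. The function name `amdgpu_errors` and the regular expression are illustrative assumptions based only on the message format quoted above, and the sketch assumes one kernel message per line (not wrapped as in an email quote).

```python
import re

# Matches kernel messages of the form "[drm:<function> [amdgpu]] *ERROR* <message>".
# The pattern is inferred from the log excerpt in this report, not from a kernel spec.
ERROR_RE = re.compile(
    r"\[drm:(?P<func>\w+) \[amdgpu\]\]\s+\*ERROR\*\s+(?P<msg>.*)"
)


def amdgpu_errors(lines):
    """Yield (function, message) for each amdgpu *ERROR* line found."""
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            yield match.group("func"), match.group("msg").strip()


# Sample lines copied from the excerpt above.
sample = [
    "Sep 15 17:36:39 mobile kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!",
    "Sep 15 17:36:39 mobile kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed",
    "Sep 15 17:36:39 mobile kernel: PM: Device 0000:05:00.0 failed to resume async: error -22",
]

for func, msg in amdgpu_errors(sample):
    print(f"{func}: {msg}")
```

Listing the reporting functions in order makes it easier to see where the chain starts; in the excerpt above, `psp_hw_start` is the first to report an error.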
Comment 4 tones111 2019-10-02 01:13:05 UTC
I've been able to narrow the problem down a bit.

The first commit where I get the scrolling amdgpu errors is
4f8b49092c37cf0c87c43bb2698d43c71cf0e4e5

Unfortunately that's a merge commit.
One of the parents appears to be good
ceacbc0e145e3b27d8b12eecb881f9d87702765a

The other parent
5dd6c49339126c2c8df2179041373222362d6e49
causes lockups that don't have any journal messages after going to sleep.  I've tried bisecting this back to v5.1-rc1 (good) but the lockups become much less consistent.
Comment 5 Carmen Bianca Bakker 2019-10-06 13:20:43 UTC
I have the same problem on a Thinkpad X395, Ryzen 5 3500U. I have a downstream bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1731915
Comment 6 Carmen Bianca Bakker 2019-10-06 13:21:53 UTC
Created attachment 285365
journalctl output on Thinkpad X395
Comment 7 Vic Luo 2020-04-08 14:52:06 UTC
Same for Thinkpad E585 with Ryzen 5 2500U.

kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -22
kernel: PM: Device 0000:05:00.0 failed to resume async: error -22
kernel: acpi LNXPOWER:01: Turning OFF
kernel: OOM killer enabled.
kernel: Restarting tasks ...
Comment 8 aeon.descriptor 2020-05-22 21:09:42 UTC
Issue also present on Lenovo E585 -> "AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx"

I can provide debugging information upon request, availability permitting.  Omitted for now, as it is substantially similar to Vic Luo's.  I'm not just posting this as a 'me too'; I'll try to make myself available to help out however I can.
Comment 9 Bastian Luettig 2020-05-24 11:41:21 UTC
Created attachment 289265
journalctl amdgpu fails on resume

Confirming the bug on an
AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx

Fedora 32

Kernel: 5.6.14-300.fc32.x86_64

Resume presumably fails (the fan is still active).
Booting with iommu=pt and amd_iommu=on does not help, and neither does disabling pageflip.

The latest BIOS version from HP is installed.
Comment 10 Bastian Luettig 2020-05-24 17:19:56 UTC
Created attachment 289269
dmesg output when switching to console

Update: when switching to a console (Ctrl+Alt+F4) before suspend, the PC wakes up again.
Switching directly back to Wayland freezes the PC.

When restarting gdm from the console instead, the computer can resume again in Wayland (it took two logins).
Attached is the dmesg output of suspend and resume in console mode.
Comment 11 Daniel Parks 2020-06-14 18:27:34 UTC
Created attachment 289653
4700u journal

I am also affected by this issue on a Dell Inspiron 14 2-in-1 7405 with a Ryzen 7 4700U. I am also willing to help debug and test, but unfortunately I cannot help bisect because amdgpu did not support my GPU at all when the regression occurred.
