Bug 204611

Summary: amdgpu error scheduling IBs when waking from sleep
Product: Drivers
Reporter: tones111
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: aeon.descriptor, alexdeucher, bastian, bjo, carmen, danielrparks, dario.tislar, sevenever, thierry.monnier5, vicluo96
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.2.9
Regression: Yes
Attachments: journalctl: amdgpu lockup on resume from sleep.
journalctl output on Thinkpad X395
journalctl amdgpu fails on resume
dmesg output when switching to console
4700u journal
attachment-29192-0.html
attachment-26652-0.html

Description tones111 2019-08-18 20:32:53 UTC
Created attachment 284485
journalctl: amdgpu lockup on resume from sleep.

My system locks up when trying to wake from sleep (opening the lid). The screen stays black and does not respond to keyboard or mouse input. I can still ssh in from another machine, and I have attached the output of journalctl -b. The log shows repeating errors:

kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
kernel: amdgpu 0000:05:00.0: couldn't schedule ib on ring <gfx>

This is a Lenovo E585 laptop with an AMD R5 2500U APU.
Comment 1 Alex Deucher 2019-08-26 03:16:44 UTC
If this is a regression, can you bisect?
Comment 2 tones111 2019-08-26 23:47:21 UTC
The problem was introduced after v5.1 and before v5.2. It is very reproducible on v5.2, but it may become less frequent as the bisect progresses. My attempts so far have led me into the weeds, but I'm still trying.

It looks like another user reported the same issue here:
https://bugzilla.kernel.org/show_bug.cgi?id=204227

During my bisect I also saw visual artifacts without the lockup, so I believe that is a separate issue.
Comment 3 tones111 2019-09-16 00:44:28 UTC
I'm still trying to bisect the problem, but it has been challenging. Following the advice at https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues, I enabled the initcall_debug and no_console_suspend boot options.

After bringing the system back up, I see the following messages in the log:

> Sep 15 17:36:39 mobile kernel: [drm] reserve 0x400000 from 0xf400c00000 for PSP TMR SIZE
> ...
> Sep 15 17:36:39 mobile kernel: [drm] psp command failed and response status is (0)
> Sep 15 17:36:39 mobile kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
> Sep 15 17:36:39 mobile kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> Sep 15 17:36:39 mobile kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
> Sep 15 17:36:39 mobile kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
> Sep 15 17:36:39 mobile kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -22
> Sep 15 17:36:39 mobile kernel: amdgpu 0000:05:00.0: pci_pm_resume+0x0/0x90 returned -22 after 19543535 usecs
> Sep 15 17:36:39 mobile kernel: PM: Device 0000:05:00.0 failed to resume async: error -22
Comment 4 tones111 2019-10-02 01:13:05 UTC
I've been able to narrow the problem down a bit.

The first commit where I see the repeating amdgpu errors is 4f8b49092c37cf0c87c43bb2698d43c71cf0e4e5.

Unfortunately, that is a merge commit. One of its parents, ceacbc0e145e3b27d8b12eecb881f9d87702765a, appears to be good.

The other parent, 5dd6c49339126c2c8df2179041373222362d6e49, causes lockups that leave no journal messages after going to sleep. I've tried bisecting this back to v5.1-rc1 (good), but the lockups become much less consistent.
Comment 5 Carmen Bianca Bakker 2019-10-06 13:20:43 UTC
I have the same problem on a ThinkPad X395 (Ryzen 5 3500U). I filed a downstream bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1731915
Comment 6 Carmen Bianca Bakker 2019-10-06 13:21:53 UTC
Created attachment 285365
journalctl output on Thinkpad X395
Comment 7 Vic Luo 2020-04-08 14:52:06 UTC
Same here on a ThinkPad E585 with a Ryzen 5 2500U.

kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -22
kernel: PM: Device 0000:05:00.0 failed to resume async: error -22
kernel: acpi LNXPOWER:01: Turning OFF
kernel: OOM killer enabled.
kernel: Restarting tasks ...
Comment 8 aeon.descriptor 2020-05-22 21:09:42 UTC
The issue is also present on a Lenovo E585 ("AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx").

I can provide debugging information on request, availability permitting. I have omitted it for now because it is substantially similar to Vic Luo's. This is not just a "me too" post; I'll try to make myself available to help in whatever way I can.
Comment 9 Bastian Luettig 2020-05-24 11:41:21 UTC
Created attachment 289265
journalctl amdgpu fails on resume

Confirming the bug on:
AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Fedora 32
Kernel: 5.6.14-300.fc32.x86_64

Resume apparently fails (the fan is still running). Setting iommu=pt and amd_iommu=on does not help, and neither does disabling pageflip.

The latest BIOS version from HP is installed.
Comment 10 Bastian Luettig 2020-05-24 17:19:56 UTC
Created attachment 289269
dmesg output when switching to console

Update: when I switch to a console (Ctrl+Alt+F4) before suspending, the PC wakes up again. Switching directly back to Wayland then freezes the PC.

If I instead restart GDM from the console, the computer can resume in Wayland again (it took two logins). I've attached the dmesg output of suspend and resume in console mode.
Comment 11 Daniel Parks 2020-06-14 18:27:34 UTC
Created attachment 289653
4700u journal

I am also affected by this issue on a Dell Inspiron 14 2-in-1 7405 with a Ryzen 7 4700U. I am also willing to help debug and test, but unfortunately I cannot help bisect, because amdgpu did not support my GPU at all when the regression occurred.
Comment 12 tones111 2022-01-29 19:55:12 UTC
I haven't seen problems resuming from sleep in some time. Is anyone still experiencing this problem on newer kernels? If not, I'd like to propose that this issue be marked as resolved.
Comment 13 aeon.descriptor 2022-01-29 19:59:02 UTC
Created attachment 300344
attachment-29192-0.html

I still have this issue, but I'm using the latest patched Ubuntu 20.04 kernels, so I don't know how "latest" that is.

What kernel versions work?  I could try them out.
Comment 14 Thierry 2022-01-30 02:39:03 UTC
Created attachment 300346
attachment-26652-0.html

I haven't had the problem for a long time either. I think it's solved.