Bug 204241

Summary: amdgpu fails to resume from suspend
Product: Drivers
Reporter: Alexander Kitaev (kitaev)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: a.geno, acjones8, Ahzo, alexdeucher, andre, andreas.jackisch, andrew, andrey.grodzovsky, beggarsfair.2014, bjo, blazra, bucket.size, cousinmarc, crab2313, dav.per, dimitris.on.linux, dushistov, fjfcavalcanti, frans.skarman, hvbakel, igor, jman6495, kernel.org, kernel, kitaev, matteo.kloiber, me, nikola, postix, reuben_p, rmuncrief, samy, sethchhim, sevenever, tyrell.rutledge, ulf, vkrevs, waltibaba
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.2.1-arch1-1-ARCH
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg
lspci
var_log_messages for amdgpu_ERROR
lspci from ryzen system
amdgpu firmware from ryzen system
resume_failure.log
var_log_meesages_5_2_11
/var/log/messages w/ kernel 5.3.0-gentoo
Patch to prevent frequent resume failures
Patch to fix the resume failures
Patch to prevent kernel NULL pointer dereferences
possible fix uvd6
possible fix uvd7
possible fix for uvd6
possible fix uvd7
possible fix for vcn
suspend crash on Lenovo Thinkpad T495 Kernel 5.3.13-arch1-1
log of x395 when suspend
arch linux 5.6 resolution with kernel params
dmesg log on suspend
dmesg 5.8.9-arch2-1
Resume fail with RX 580 GPU

Description Alexander Kitaev 2019-07-20 09:50:31 UTC
Created attachment 283863 [details]
dmesg

Computer fails to resume from suspend.
From the logs it looks like AMDGPU fails to resume.
Comment 1 Alexander Kitaev 2019-07-20 09:50:55 UTC
Created attachment 283865 [details]
lspci
Comment 2 Andrey Grodzovsky 2019-08-08 20:13:57 UTC
Can you post the full dmesg log from boot? What card are you using?
Comment 3 Andrey Grodzovsky 2019-08-08 20:22:17 UTC
OK, checked lspci and it's Ellesmere... Never mind.
Comment 4 Andrey Grodzovsky 2019-08-08 21:02:03 UTC
I tried to reproduce it with a kernel that is just a few commits different from this one - https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next

I tried with X enabled and in FB console. Was able to suspend and resume with no errors.

I would suggest building your kernel from the branch above to see if it helps.
Also, please post your FW info using this command:
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
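
For anyone collecting the same information, here is a minimal sketch that captures the firmware table together with the kernel release so both end up in one attachment. The debugfs path is the one given above; the function name `dump_fw_info` and the output layout are only assumptions for illustration, and reading debugfs normally requires root.

```shell
#!/bin/sh
# Sketch: print the kernel release followed by the amdgpu firmware table.
# The path argument defaults to the debugfs file named in this comment;
# card index 0 is an assumption and may differ on multi-GPU systems.
dump_fw_info() {
    fw_file=${1:-/sys/kernel/debug/dri/0/amdgpu_firmware_info}
    if [ ! -r "$fw_file" ]; then
        echo "cannot read $fw_file (is debugfs mounted, are you root?)" >&2
        return 1
    fi
    echo "kernel: $(uname -r)"
    cat "$fw_file"
}
```

Calling `dump_fw_info > fwinfo.txt` as root should produce a file that can be attached as-is.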
Comment 5 Andreas Jackisch 2019-08-14 19:29:29 UTC
The same issue started to hit me on Gentoo when switching from 5.1.5-gentoo to 5.2.5-gentoo. I reverted back to the latest 5.1.21-gentoo and the issue did not come up again. The failure on resume happens once every 5 to 20 attempts. I'll add message logs, lspci and firmware info.
Comment 6 Andreas Jackisch 2019-08-14 19:38:28 UTC
Created attachment 284411 [details]
var_log_messages for amdgpu_ERROR

Search for "amdgpu" to see it fail after resume.
Comment 7 Andreas Jackisch 2019-08-14 19:39:11 UTC
Created attachment 284413 [details]
lspci from ryzen system
Comment 8 Andreas Jackisch 2019-08-14 19:39:40 UTC
Created attachment 284415 [details]
amdgpu firmware from ryzen system
Comment 9 Andrey Grodzovsky 2019-08-14 21:04:57 UTC
I was able to reproduce.
Comment 10 Andrey Grodzovsky 2019-08-15 20:55:41 UTC
Created attachment 284445 [details]
resume_failure.log

The kernel oops is just a result of the preceding GFX ring test failure. The attached log from UMR shows that the gfx ring hangs at (or right after) the first PKT3_SET_CONTEXT_REG, because the latest PFP_HEADER_DUMP shows 0xc0d46900. This suggests that some of the payload within SET_CONTEXT_REG (in gfx_v8_0_get_csb_buffer) may cause the hang, which later results in the ring test failure.

Alex Deucher - Any idea how to confirm this?
Comment 12 Andreas Jackisch 2019-09-03 18:56:07 UTC
Created attachment 284807 [details]
var_log_meesages_5_2_11

I tested w/ kernel 5.2.11 as it contains the referenced patch "drm/amdgpu: pin the csb buffer on hw init for gfx v8". However, as before, the system did not resume properly. This was on the 3rd attempt after almost 24 hours in S3. Reverted back to 5.1.21.
Comment 13 Andreas Jackisch 2019-09-21 18:31:37 UTC
Created attachment 285079 [details]
/var/log/messages w/ kernel 5.3.0-gentoo

As there was no success w/ 5.2.x at all, I tested 5.3.0. However, the system did not resume after the 2nd attempt, with a comparable failure message.
 
amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring sdma0 test failed (-110)

This is slightly different from 5.2.x where it was 

amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring gfx test failed (-110)

but the result seems to be the same.
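
To compare which ring fails across kernel versions, the log lines quoted above can be summarised mechanically. The following is a minimal sketch, assuming the messages keep the format shown in this comment; the function name `failed_rings` is made up for illustration. The error code -110 is -ETIMEDOUT on Linux, i.e. the ring test timed out.

```shell
#!/bin/sh
# Sketch: list which rings failed their test in one or more saved logs.
# Matches lines such as
#   [drm:amdgpu_ring_test_helper] *ERROR* ring sdma0 test failed (-110)
# and prints a count, the ring name and the error code for each distinct pair.
failed_rings() {
    sed -n 's/.*\*ERROR\* ring \([^ ]*\) test failed (\(-[0-9]*\)).*/\1 \2/p' "$@" |
        sort | uniq -c
}
```

For example, `failed_rings /var/log/messages` would print one line per ring that failed, which makes the gfx versus sdma0 difference between 5.2.x and 5.3.0 easy to spot.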

I'm not sure whether anybody is working on this or whether the bug opener still sees the issue. As the latest kernel series 5.1.x is somewhat outdated now, I will revert to 4.19.x LTS.
If there is any hint or advice on what I can do to help, please let me know.
Comment 14 Ahzo 2019-10-05 00:08:04 UTC
Created attachment 285349 [details]
Patch to prevent frequent resume failures

While this issue happens rather randomly, it can be quite reliably reproduced on linux 5.2 and later by performing successive suspend-resume cycles.
Usually the error occurs after fewer than 10 cycles, but occasionally only after more than 20. Thus one can use the following command to reproduce it almost certainly:
$ for i in $(seq 30); do sudo rtcwake -m mem -s 5; sleep 15; done
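
A variant of the loop above that stops at the first detected failure could look like the sketch below. The rtcwake invocation and the 15 second pause are taken from the command above; the function name, the overridable `SUSPEND_CMD`, `LOG_CMD` and `SETTLE_SECS` variables, and the grep pattern are assumptions for illustration. Note that it scans the whole kernel log, so an error left over from before the run would be reported on the first cycle.

```shell
#!/bin/sh
# Sketch: run suspend/resume cycles and stop as soon as the kernel log
# contains a ring or IB test failure. The three variables are hooks so the
# loop can be dry-run without actually suspending the machine.
: "${SUSPEND_CMD:=sudo rtcwake -m mem -s 5}"
: "${LOG_CMD:=sudo dmesg}"
: "${SETTLE_SECS:=15}"

suspend_cycles() {
    cycles=$1
    i=1
    while [ "$i" -le "$cycles" ]; do
        # Suspend to RAM; the RTC alarm wakes the machine up again.
        $SUSPEND_CMD || return 2
        # Give the driver time to finish its resume-time tests.
        sleep "$SETTLE_SECS"
        if $LOG_CMD | grep -Eq 'ring [a-z0-9_.]+ test failed|IB test failed'; then
            echo "failure detected after cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $cycles cycles"
}
```

Calling `suspend_cycles 30` then behaves like the one-liner, except that the number of cycles needed to trigger the failure is printed, which is convenient when judging good versus bad during a bisection.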

A bisection using this method led to:
commit 533aed278afeaa68bb5d0600856ab02268cfa3b8
Author: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date:   Wed Mar 6 16:16:28 2019 -0500

    drm/amdgpu: Move IB pool init and fini v2
    
    Problem:
    Using SDMA for TLB invalidation in certain ASICs exposed a problem
    of IB pool not being ready while SDMA already up on Init and already
    shutt down while SDMA still running on Fini. This caused
    IB allocation failure. Temproary fix was commited into a
    bringup branch but this is the generic fix.
    
    Fix:
    Init IB pool rigth after GMC is ready but before SDMA is ready.
    Do th opposite for Fini.
    
    v2: Remove restriction on SDMA early init and move amdgpu_ib_pool_fini
    
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


Reverting this commit makes the problem unreproducible with the above command.

Another way to prevent these frequent resume failures, while preserving the intention of this commit, is to simply call amdgpu_ib_pool_init directly after calling amdgpu_ucode_create_bo instead of directly before that. Attached is a patch doing it that way.
Comment 15 Andreas Jackisch 2019-10-05 10:35:25 UTC
(In reply to Ahzo from comment #14)
> Created attachment 285349 [details]
> Patch to prevent frequent resume failures
> ....
> Another way to prevent these frequent resume failures, while preserving the
> intention of this commit, is to simply call amdgpu_ib_pool_init directly
> after calling amdgpu_ucode_create_bo instead of directly before that.
> Attached is a patch doing it that way.
I applied the patch above to 5.3.2-gentoo. All 30 Suspend/Resume cycles using rtcwake and a couple of manual cycles went OK.

I'll continue to use this setup and will report if it fails again or is still OK after one week.

Thanks for bisecting this issue and providing this fix; I assume it took some time.
Comment 16 Christian Schwarz 2019-10-07 10:13:07 UTC
Can confirm the patch 'drm/amdgpu: Move IB pool init after ucode bo creation' fixed the issue for me (96h and counting, failure normally within 24h, with ~2 suspend/resume cycles per day).
Comment 17 Alex Deucher 2019-10-07 18:10:11 UTC
(In reply to Ahzo from comment #14)
> Another way to prevent these frequent resume failures, while preserving the
> intention of this commit, is to simply call amdgpu_ib_pool_init directly
> after calling amdgpu_ucode_create_bo instead of directly before that.
> Attached is a patch doing it that way.

I'm not sure I understand why the patch helps.  You are just changing the order of two memory allocations.  The order shouldn't matter.
Comment 18 Michel Dänzer 2019-10-08 07:56:53 UTC
(In reply to Alex Deucher from comment #17)
> I'm not sure I understand why the patch helps.  You are just changing the
> order of two memory allocations.  The order shouldn't matter.

My guess would be that the exact location of the ucode BO matters somehow.
Comment 19 Christian Schwarz 2019-10-08 09:40:24 UTC
Just had the first (but different kind of) crash since applying the patch on top of 5.3.2, but didn't have kdump configured:
The system woke, everything seemed to work for about 30s, then the screen went black and the machine rebooted.
Comment 20 Ahzo 2019-10-11 18:33:10 UTC
Created attachment 285469 [details]
Patch to fix the resume failures

(In reply to Alex Deucher from comment #17)
> I'm not sure I understand why the patch helps.  You are just changing the
> order of two memory allocations.  The order shouldn't matter.

My hypothesis is that the order here is not the root cause of the problem, but rather affects the likelihood of that manifesting itself.
This is based on the fact that I have seen a resume failure typical for this bug on linux 5.0 once, but I'm unable to reproduce it with that version.

As commit 533aed278afe apparently makes the failures much more likely to happen, it provides an opportunity to debug this further by backporting it to older linux versions.
Doing that for versions down to linux 4.15 exposes the resume failures, but not on linux 4.14.

A bisection between these two, while backporting 533aed278afe on every step, led to commit 2a91f272e34c, which failed to boot and thus had to be skipped, and:
commit e0128efb08b3d628d767ec8578e77cdd7ecc8f81
Author: James Zhu <James.Zhu@amd.com>
Date:   Fri Sep 29 16:42:27 2017 -0400

    drm/amdgpu: add uvd enc ib test
    
    Generate create/destroy messages to test UVD encode indirect buffer function.
    And enable UVD encode IB test during device initialization.
    
    Signed-off-by: James Zhu <James.Zhu@amd.com>
    Reviewed-and-Tested-by: Leo Liu <leo.liu@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

This looks like a likely root cause. Indeed, adding 'return 0;' at the beginning of uvd_v6_0_enc_ring_test_ib makes the problem unreproducible, even on the latest linux 5.4-rc2.

Comparing with amdgpu_uvd_get_{create,destroy}_msg shows that these use 0 as dummy GPU pointer, while uvd_v6_0_enc_get_{create,destroy}_msg use a real GPU memory address.
Changing them to also use 0 as dummy pointer, as is done in the attached patch, actually fixes the resume failures.

Maybe a similar change should also be made for UVD 7.
Comment 21 Ahzo 2019-10-11 18:37:48 UTC
Created attachment 285471 [details]
Patch to prevent kernel NULL pointer dereferences

By the way, some of the kernel NULL pointer dereferences that can happen after a resume failure also always happen on shutdown:
RIP: 0010:build_audio_output.isra.0+0x97/0x110 [amdgpu]
RIP: 0010:enable_link_dp+0x186/0x300 [amdgpu]

The attached patch prevents them.

Note that these oopses are difficult to notice on shutdown, because they only leave traces in /sys/fs/pstore, not on the disk, as they happen after unmounting.
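
To check whether a machine is hitting these shutdown oopses, the records left in pstore can be scanned after the next boot. This is only a sketch: the function name `pstore_rips` is made up, and since record file names depend on the pstore backend in use, it simply looks at every regular file in the directory.

```shell
#!/bin/sh
# Sketch: show the faulting instruction pointer of every oops record that
# survived in pstore. The directory argument defaults to /sys/fs/pstore.
pstore_rips() {
    dir=${1:-/sys/fs/pstore}
    for f in "$dir"/*; do
        # Skip the unexpanded glob (empty directory) and non-regular files.
        [ -f "$f" ] || continue
        grep -H 'RIP: ' "$f"
    done
    return 0
}
```

Running `pstore_rips` as root prints nothing if no oops was recorded; otherwise each line names the record file and the RIP, e.g. one of the two functions listed above.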
Comment 22 Alex Deucher 2019-10-11 20:47:45 UTC
Created attachment 285473 [details]
possible fix uvd6

Nice work.  I think the attached patch should fix it.
Comment 23 Alex Deucher 2019-10-11 20:48:08 UTC
Created attachment 285475 [details]
possible fix uvd7

Same fix for uvd7.
Comment 24 Ahzo 2019-10-12 10:37:51 UTC
(In reply to Alex Deucher from comment #22)
> Created attachment 285473 [details]
> possible fix uvd6
> 
> Nice work.  I think the attached patch should fix it.

Thanks for finding the correct solution. I can confirm that the patch for uvd6 works. The one for uvd7 also looks good, but I don't have the hardware to test it.
Furthermore, I think vcn also needs a similar change. I'm not sure about vce, as that uses 'ib_size_dw = 1024' thus allocating a much larger buffer.
Comment 25 Christian Schwarz 2019-10-12 16:25:32 UTC
If it is of any help: I would be willing to test any of the more recent patches.

Hardware:
- Radeon RX 550
- Ryzen 1700X

The first patch by Ahzo@ already worked for me:
5.3.2 with "drm/amdgpu: Move IB pool init after ucode bo creation"

What other patches should I test with which kernel version?
Please provide Bugzilla attachment numbers.
Comment 26 Ahzo 2019-10-12 18:35:31 UTC
You can test Alex Deucher's uvd6 patch (attachment 285473 [details]), which is the proper fix for your RX 550.
Testing on linux 5.3 is fine, as this patch should fix the problem on any affected version.

The patch you tested previously just makes the problem unlikely to cause resume failures, but it doesn't fix the root cause of overwriting random GPU memory, so it might still cause random issues.
Comment 27 Andreas Jackisch 2019-10-12 20:47:09 UTC
In brief - the patch "0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test.patch" didn't work for me. After about 10 suspend/resume cycles the typical issue occurred again and I had to SysRq the system.

Status, all gentoo kernels:
5.1.x  OK
4.19.74 OK
5.2.x FAIL
5.3.0 FAIL
5.3.2 w/ patch from comment#14 OK
5.3.6 FAIL
5.3.6 w/ patch 0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test FAIL
5.3.6 w/ patch 0001-drm-amdgpu-uvd6-use-0-as-dummy-pointer-in-enc-ring-t OK

The last setup has seen 30+ suspend/resume cycles. I'll continue to use this.

So, to me it looks like increasing the allocation did not help, but assigning 0 to the dummy pointer did.

My hardware is comparable to the one listed in comment#25
- Radeon RX550
- Ryzen 1700
Comment 28 Ahzo 2019-10-13 10:47:29 UTC
(In reply to Andreas Jackisch from comment #27)
> In brief - the patch
> "0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test.patch" didn't
> work for me. After about 10 suspend/resume cycles the typical issue occurred
> again and I had to SysRq the system.

Indeed, the 0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test patch (attachment 285473) doesn't work.
Apparently I got (un)lucky enough that it survived 30 suspend/resume cycles, but testing it again, it failed.

On the other hand, the 0001-drm-amdgpu-uvd6-use-0-as-dummy-pointer-in-enc-ring-t patch (attachment 285469) survived 100 cycles.
Comment 29 Alex Deucher 2019-10-15 22:11:53 UTC
Created attachment 285507 [details]
possible fix for uvd6

The session info is 128K according to mesa.
Comment 30 Alex Deucher 2019-10-15 22:12:20 UTC
Created attachment 285509 [details]
possible fix uvd7

Updated patch for uvd7
Comment 31 Alex Deucher 2019-10-15 22:12:58 UTC
Created attachment 285511 [details]
possible fix for vcn

Same fix for vcn.
Comment 32 Christian Schwarz 2019-10-16 14:27:16 UTC
@Alex: Didn't have a crash with the old uvd6 patch (attachment 285473 [details]) so far, but apparently I am just lucky.

Which patch (series?) should I test now?
Comment 33 Alex Deucher 2019-10-16 14:29:04 UTC
(In reply to me from comment #32)
> @Alex: Didn't have a crash with the old uvd6 patch (attachment 285473 [details])
> so far, but apparently I am just lucky.
> 
> Which patch (series?) should I test now?

Please try attachment 285507 [details].
Comment 34 Ahzo 2019-10-16 17:29:27 UTC
(In reply to Alex Deucher from comment #29)
> Created attachment 285507 [details]
> possible fix for uvd6
> 
> The session info is 128K according to mesa.

This version of the patch didn't fail for 100 suspend/resume cycles, so I think it actually fixes the problem.
Comment 35 Andreas Jackisch 2019-10-16 22:03:18 UTC

(In reply to Ahzo from comment #34)
> (In reply to Alex Deucher from comment #29)
> > Created attachment 285507 [details]
> > possible fix for uvd6
> > 
> > The session info is 128K according to mesa.
> 
> This version of the patch didn't fail for 100 suspend/resume cycles, so I
> think it actually fixes the problem.

I can confirm that the patch seems to work OK. 30+ suspend/resume cycles so far where it normally fails after 10 cycles.
Comment 36 Mario 2019-10-20 20:06:13 UTC
I can also confirm this patch (285507) fixed the problem on Arch Linux 5.3.7. 

The stock kernel failed after ~5 sleep-wake cycles. Patched kernel was able to survive the complete 30 cycles:

```for i in $(seq 30); do sudo rtcwake -m mem -s 5; sleep 15; done```

Thanks for the patch. I also suspect that bug 204965 is a duplicate of this one.
Comment 37 David 2019-10-23 16:46:53 UTC
*** Bug 204965 has been marked as a duplicate of this bug. ***
Comment 38 Andrew Hutchings 2019-10-28 20:16:20 UTC
Also confirmed Alex Deucher's patches work great for me, patched Fedora 31 kernel 5.3.7 on a ThinkPad T495 Ryzen 7 PRO 3700U with a Vega 10 GPU (vcn).

Many thanks!
Comment 39 Christian Schwarz 2019-11-30 18:24:19 UTC
(In reply to Alex Deucher from comment #33)
> (In reply to me from comment #32)
> > @Alex: Didn't have a crash with the old uvd6 patch (attachment 285473 [details])
> > so far, but apparently I am just lucky.
> > 
> > Which patch (series?) should I test now?
> 
> Please try attachment 285507 [details].

Can confirm this patch works, 40 days uptime, _many_ suspend-resume cycles, no problems.
Comment 40 Frans Skarman 2019-12-07 10:28:55 UTC
This patch did not solve the issue for me, or rather, the arch build system says the patch is already applied in 5.4.2-arch.

Suspend consistently doesn't work, and the first issue reported by journalctl is the aforementioned amdgpu (-110) error.

This is with a Ryzen 7 3800X and RX 580.
Comment 41 Ulf Winkelvos 2019-12-07 23:50:15 UTC
Created attachment 286215 [details]
suspend crash on Lenovo Thinkpad T495 Kernel 5.3.13-arch1-1
Comment 42 Ulf Winkelvos 2019-12-07 23:55:37 UTC
On my system, a Lenovo ThinkPad T495 (model 20NKS01Y00), the crashes still happen on every 1st to 4th suspend (see above log).

---
amdgpu 0000:06:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
---

I found out, though, that if I disable my fingerprint reader as well as the smartcard reader in the BIOS, the crashes do not occur anymore:

---
-Bus 003 Device 006: ID 06cb:00bd Synaptics, Inc. 
-Bus 003 Device 005: ID 058f:9540 Alcor Micro Corp. AU9540 Smartcard Reader
---
Comment 43 crab2313 2019-12-11 00:37:24 UTC
Same problem with my ThinkPad x395 (model 20NL000YCD). The system consistently refused to suspend and showed a blurred screen. Also, the LED on the power button does not turn off.

The issue still exists when I disable the fingerprint reader and SD card reader in the BIOS.
Comment 44 crab2313 2019-12-11 00:39:40 UTC
Created attachment 286253 [details]
log of x395 when suspend
Comment 45 crab2313 2019-12-12 05:00:06 UTC
Kernel 5.4.2 and kernel 5.3 are affected. I switched to kernel 5.2.19 and do not have this issue.
Comment 46 Alex Deucher 2019-12-12 14:37:40 UTC
Can you bisect?  It sounds like you may be experiencing a different issue.
Comment 47 crab2313 2019-12-13 10:38:51 UTC
@Alex Deucher

Unfortunately, I discovered that switching to 5.2.19 just lowers the likelihood of my issue. I don't think a bisect can find the root cause.
Comment 48 Robert M. Muncrief 2019-12-16 04:40:30 UTC
I have an R9-390 that just started having the resume from suspend problem as of 5.5-rc1. And I just tested 5.5-rc2 and the problem persists.

The problem looks exactly the same as the one that plagued the R9-390 starting with the 4.20 kernel, but was fixed a few releases later.

My system goes into suspend mode normally, but when resuming my monitor says "Signal not recognized" and I have to SSH into my system and reboot it.

I'm running Manjaro with Mesa 19.2.7 amdgpu, and the last working kernel is 5.4.2. So something in the new 5.5 amdgpu has borked the R9-390 again.
Comment 49 Alex Deucher 2019-12-16 13:40:03 UTC
(In reply to muncrief from comment #48)
> I have an R9-390 that just started having the resume from suspend problem as
> of 5.5-rc1. And I just tested 5.5-rc2 and the problem persists.
> 
> The problem looks exactly the same as the one that plagued the R9-390
> starting with the 4.20 kernel, but was fixed a few releases later.
> 
> My system goes into suspend mode normally, but when resuming my monitor says
> "Signal not recognized" and I have to SSH into my system and reboot it.
> 
> I'm running Manjaro with Mesa 19.2.7 amdgpu, and the last working kernel is
> 5.4.2. So something in the new 5.5 amdgpu has borked the R9-390 again.

This sounds like a different issue, please file a different ticket.
Comment 50 Ulf Winkelvos 2019-12-20 22:17:06 UTC
I tried to bisect this issue in the past days, but it is almost impossible to track it down, as it is so hard to reproduce it reliably. It seems that 5.2 is "better"; the closer the commits get to 5.3, the "worse" it gets. Now all of a sudden 5.4.3-arch1-1 is completely stable so far... I am going to create a new bug, whenever this comes back.
Comment 51 Alexander Jones 2020-02-25 01:22:37 UTC
For what it's worth, I believe I might be suffering from this bug. I have a ThinkPad A275, with an AMD A12 9800B CPU and an R7 integrated GPU, and I can reliably produce a crash on suspend every single time. It produces an image like this when it wakes up: https://imgur.com/tKAxlI7

As you can see, a complete garbled mess. X11 becomes completely unresponsive; I can't quit it, switch to a VT, or do anything whatsoever, only a hard reset fixes it. The screen glitchiness does seem to flicker and slightly change while mashing buttons though. Other aspects of the computer work fine though; the CPU fan maintains the same speed, the power LED blinks normally, the dot on the i on the back of the lid pulses like normal, and I can still change the keyboard's backlight with no problems. 

I'm running OpenSUSE Tumbleweed at the moment. With kernel 5.2.X, I never had any crashes whatsoever, but once Tumbleweed updated to 5.3 or 5.4, it will fail every single time to resume. I'm currently running 5.5.4.1 and the issue is still here. I don't have any kernel hacking or debugging experience, but I'm willing to upload any logs that might prove helpful, if you can tell me which ones those might be.
Comment 52 Dimitris 2020-02-25 02:00:38 UTC
This is a shot in the dark/cargo culting it, but in case it helps:

I had a very similar problem on a T495 (Ryzen 3700U), running Fedora 31, which resolved itself when the 5.4 series was available in Fedora.

Before 5.4 was available, I came across reports linking this to a USB controller of all things, like https://www.mail-archive.com/debian-kernel@lists.debian.org/msg116563.html.

In my case the culprit was:

06:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1

so I started removing the device from the PCI tree before suspend using /sys/bus/pci/devices/0000:06:00.4/remove and rescanning the PCI bus on resume.  First manually and later through a systemd hook.  That worked around the problem until 5.4 "fixed" this.
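
A minimal sketch of what such a hook might look like is given below; it is not the script actually used in this comment. It relies on systemd running executables in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument, and on the standard sysfs `remove` and `rescan` files. The PCI address is the controller named above; the function name and the overridable `SYSFS` root are assumptions that exist only so the logic can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch of a systemd system-sleep hook that detaches one PCI device before
# suspend and rescans the bus after resume.
PCI_ADDR=${PCI_ADDR:-0000:06:00.4}
SYSFS=${SYSFS:-/sys}

pci_sleep_hook() {
    case "$1" in
        pre)
            # Detach the device before suspending, if it is present.
            if [ -e "$SYSFS/bus/pci/devices/$PCI_ADDR/remove" ]; then
                echo 1 > "$SYSFS/bus/pci/devices/$PCI_ADDR/remove"
            fi
            ;;
        post)
            # Ask the kernel to rediscover devices after resume.
            echo 1 > "$SYSFS/bus/pci/rescan"
            ;;
    esac
}
```

Installed as an executable file in the system-sleep directory, the script would end with a call such as `pci_sleep_hook "$1"`; the address has to be adapted to the output of lspci on the machine in question.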
Comment 53 Alexander Jones 2020-02-25 03:06:24 UTC
(In reply to dimitris from comment #52)
> This is a shot in the dark/cargo culting it, but in case it helps:
> 
> I had a very similar problem on a T495 (Ryzen 3700U), running Fedora 31,
> which resolved itself when the 5.4 series was available in Fedora.
> 
> Before 5.4 was available, I came across reports linking this to a USB
> controller of all things, like
> https://www.mail-archive.com/debian-kernel@lists.debian.org/msg116563.html.
> 
> In my case the culprit was:
> 
> 06:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1
> 
> so I started removing the device from the PCI tree before suspend using
> /sys/bus/pci/devices/0000:06:00.4/remove and rescanning the PCI bus on
> resume.  First manually and later through a systemd hook.  That worked around
> the problem until 5.4 "fixed" this.

Thank you for the suggestion! I tried that out on my ThinkPad, disabling all of my USB devices just in case. They are:

00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49)
01:00.4 USB controller: Realtek Semiconductor Co., Ltd. Device 816d (rev 0e)

Unfortunately, this didn't fix the suspend issue; I still get the glitchy screen. On a whim, I tried to disable Bluetooth in the BIOS, as well as the Fingerprint Scanner and TPM chip, but that also didn't have any effect. Coincidentally, I DID hear Kmail pop a notification after resuming, so it seems it's not as dead as I thought; the kernel and even the userland seem to still work then. It must be AMDGPU or something else in the graphics stack that dies, since I can't switch to a VT, and suspending while in a VT doesn't work either and results in the same glitched-out mess.
Comment 54 Bjoern Franke 2020-02-25 17:26:17 UTC
@Alexander Jones:

Regarding the garbled screen after resume, there's another bugreport: https://bugzilla.kernel.org/show_bug.cgi?id=206393
Comment 55 Alexander Jones 2020-02-25 21:26:51 UTC
Thank you very much for that link, Mr. Franke! That bug report much more closely approximates my situation, down to a T. I tried the older kernel suggestion listed there; I still have a backup copy of 5.4.10 (but not anything earlier), and it works perfectly again! That solves my problem in the short term with Tumbleweed. I'm not sure then if it's related to this AMDGPU bug and just manifesting itself differently, or if they're actually different bugs, but I'll switch over to that thread then. Thank you once again!
Comment 56 Frans Skarman 2020-02-26 09:54:01 UTC
I experienced this issue (black screen after resuming from suspend) for a while on my Ryzen 3800X + RX 580 setup. The same issue happened with every kernel I tried. Eventually, I figured out that a BIOS update fixed it (this was an MSI B450 Tomahawk Max).
Comment 57 Jordan Maris 2020-04-10 21:16:47 UTC
I'm also experiencing this issue on a HP Envy 13 x360 with the Ryzen 3500U APU.
Has anyone found any potential solutions?
Comment 58 Jaya Balan Aaron 2020-04-15 19:43:11 UTC
Created attachment 288507 [details]
arch linux 5.6 resolution with kernel params
Comment 59 Jaya Balan Aaron 2020-04-15 19:50:08 UTC
Comment on attachment 288507 [details]
arch linux 5.6 resolution with kernel params

Hi,

I'm using Arch Linux with kernels 5.5zen and 5.6. Not sure if it's a solution, but it's interesting to note.

With 5.6 and the kernel params 'amd_iommu=on iommu=pt', I was able to suspend/resume correctly 10/10 times. Without the params, resume hung with a blank, backlit screen 2/2 times.

With 5.5zen, even with the same kernel params, resume hung 2/2 times.

The reason for the kernel params is that I was trying to set up GPU passthrough with KVM.

Suspend sometimes resumes immediately, but I think that's because of misconfigured keyboard/mouse/USB wake triggers.
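
When comparing results like these, it helps to confirm that the parameters really reached the running kernel. A minimal sketch follows, assuming the parameters are separated by single spaces as they normally are in /proc/cmdline; the function name `check_params` is made up for illustration, and the file argument exists only so a saved copy of the command line can be checked too.

```shell
#!/bin/sh
# Sketch: report whether each given parameter appears as a whole word on the
# kernel command line. Returns non-zero if any parameter is missing.
check_params() {
    file=$1
    shift
    missing=0
    for p in "$@"; do
        # One parameter per line, then an exact fixed-string line match.
        if tr ' ' '\n' < "$file" | grep -Fxq -- "$p"; then
            echo "$p: present"
        else
            echo "$p: missing"
            missing=1
        fi
    done
    return "$missing"
}
```

For the setup described here, the call would be `check_params /proc/cmdline amd_iommu=on iommu=pt`.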
Comment 60 Jordan Maris 2020-05-13 20:45:16 UTC
Created attachment 289129 [details]
dmesg log on suspend
Comment 61 igor 2020-05-17 20:40:35 UTC
(In reply to Ulf Winkelvos from comment #50)
> I tried to bisect this issue in the past days, but it is almost impossible
> to track it down, as it is so hard to reproduce it reliably. It seems that
> 5.2 is "better"; the closer the commits get to 5.3, the "worse" it gets. Now
> all of a sudden 5.4.3-arch1-1 is completely stable so far... I am going to
> create a new bug, whenever this comes back.

Thank you,
You saved my day. Switched from 5.4.0 to 5.4.3 and now I am able to properly suspend and resume.
With 5.4.0 the system crashed and reset.

Stay safe and healthy.
Bye,
Igor
Comment 62 poinck 2020-06-24 19:15:50 UTC
I am having the same issue with:

Platform: Linux-5.4.38-gentoo-x86_64-Intel-R-_Core-TM-i5-3570K_CPU@_3.40GHz-with-gentoo-2.6, 64bit
Graphics: 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] (rev cf)
DE: Gnome 3.34.4

Steps to reproduce:
- start qutebrowser (uses Qt-5.14.2) under Gnome
- hibernate
- system freezes immediately (or just shows a blank screen while remaining reachable remotely), or eventually blanks after resume
- restart or hard reset necessary

Workaround:
- stop qutebrowser before hibernating
- resume works and I can login normally and resume the session
Comment 63 felipejfc 2020-07-27 16:02:17 UTC
I'm having the same issue with an AMD RX5700 and kernel version 5.7.9-1 on manjaro linux.

For me, adding the kernel params 'amd_iommu=on iommu=pt' didn't solve the problem. Graphics won't turn on, so the monitor just keeps blinking.
Comment 64 felipejfc 2020-07-27 16:03:14 UTC
(In reply to felipejfc from comment #63)
> I'm having the same issue with an AMD RX5700 and kernel version 5.7.9-1 on
> manjaro linux.
> 
> for me adding kernel params 'amd_iommu=on iommu=pt' didn't solve the
> problem. graphics won't turn on so monitor just keeps blinking

To add to my last comment, the "fix" that worked for me was to disable IOMMU in the BIOS.
Comment 65 waltibaba 2020-09-25 07:31:59 UTC
I'm getting the issue again on 5.8.9-arch2-1 - though compounding it is that suspending fails and it instantly tries to resume with a black screen.
Can confirm that it was not present on 5.8.8-arch1-1 before it (that ran with many suspend/resume cycles for a week).

Prime B350M-A
R7 1700
Fury X (Fiji)
Comment 66 waltibaba 2020-09-25 07:34:32 UTC
Created attachment 292637 [details]
dmesg 5.8.9-arch2-1

truncated dmesg logs of resume failures on 5.8.9-arch2-1
3 boot/suspend/resume/fail cycles occurred
Comment 67 Lahfa Samy 2020-09-30 09:22:40 UTC
I'm currently having this issue on a T495 with a Ryzen 3700U with integrated Vega RX 10 graphics on Arch Linux with ZFS.

Before 5.8.12-arch1-1, I can suspend however right when I resume the system freezes. 

I have to hard reset using the power button. Nothing is present in journalctl besides systemd saying it did suspend; it doesn't mention any AMDGPU failure.

However, I have seen a call trace in dmesg about the wifi driver (RIP: 0010:iwl_pcie_rx_handle+0x9c7/0xbb0 [iwlwifi]), but this happens during boot and thus may not affect the suspend process.

The thing is, this issue started when I upgraded the kernel from 5.8.11-arch to 5.8.12, but I have also installed AMDGPU (bad timing) and Mesa-git, so I'm not sure whether the latter is part of the issue or is the very problem of this bug.

I have removed the git packages and installed their stable counterparts, and also removed the kernel parameters amdgpu.cik_support=1 amdgpu.sk_support=1 radeon.sk_support=0 radeon.cik_support=0. I'll do some tests and report back if I find a way to mitigate the issue.
Comment 68 Robert M. Muncrief 2020-09-30 16:31:49 UTC
Created attachment 292729 [details]
Resume fail with RX 580 GPU

I've been having random resume problems from around kernel 5.5, and they persist even up to 5.9-rc6. When this occurs I can still log in via SSH and issue a reboot command, but although SSH disconnects, my computer doesn't reboot and I have to press the reset button.

I have an ASUS Gaming TUF X570 motherboard, R7 3700X CPU, RX 580 GPU, and 16GB of RAM.  

The primary error recorded in dmesg is:  

[xxxxx.xxxxxx] amdgpu:  
                last message was failed ret is 65535  

I've included the part of dmesg beginning with suspend event through the resume failure.
Comment 69 Lahfa Samy 2020-09-30 19:23:12 UTC
I've got news of a current workaround for my T495 with a Ryzen 7 3700U and a Vega RX 10 on kernel 5.8.12arch, I have disabled the Network card (which means no more WiFi at all) in the BIOS and this has solved the problem of the resuming freeze. This is most likely due to a bug in the driver iwlwifi used by the Intel Wireless AC-9260 network card, I can also confirm that the same bug affects the package linux-lts for ArchLinux 5.4.68-1-lts.

The logs show a watchdog :soft-lockup on CPU#0 stuck for 22s! [irq/87-iwlwifi::979].

Later in the log there is this line :
RIP : 0010:iwl_trans_pcie_read32+0x10/0x20 [iwlwifi]

There is some more information that might help someone write a patch, and finally a call trace.
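
A minimal sketch of how RIP lines like the one quoted above could be parsed to see which kernel module a trace points at. It assumes the `RIP: <segment>:<symbol>+0x<offset>/0x<size> [<module>]` layout seen in this thread; the function name `rip_modules` is hypothetical.

```python
import re

# Tolerates the optional space before the colon, as in "RIP : ...".
# The trailing "[module]" part is optional: it is absent for built-in code.
RIP_LINE = re.compile(
    r"RIP\s*:\s*[0-9a-fA-F]{4}:(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def rip_modules(log_text):
    """Return (symbol, module) pairs for every RIP line found in the log."""
    return [
        (m.group("symbol"), m.group("module"))
        for m in RIP_LINE.finditer(log_text)
    ]


sample = "RIP : 0010:iwl_trans_pcie_read32+0x10/0x20 [iwlwifi]\n"
print(rip_modules(sample))
```

If every pair names the same module, as with iwlwifi here, that is a hint about where to file the report, not proof of the root cause.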
Comment 70 Lahfa Samy 2020-09-30 20:03:44 UTC
I've opened a new bug report as the issue is clearly related to networking and the iwlwifi driver and not to the AMDGPU driver in my case.
Here is the link to the bug report: https://bugzilla.kernel.org/show_bug.cgi?id=209435
Comment 71 Alex Deucher 2020-10-01 14:59:58 UTC
The original issue reported in this bug was fixed long ago.  If you are having issues, please file a new report.
Comment 72 Robert M. Muncrief 2020-10-01 17:21:38 UTC
(In reply to Alex Deucher from comment #71)
> The original issue reported in this bug was fixed long ago.  If you are
> having issues, please file a new report.

I just filed a new bug for the resume issue at your request. It's 209457.
Comment 73 Lahfa Samy 2020-10-01 17:27:43 UTC
(In reply to Robert M. Muncrief from comment #72)
> (In reply to Alex Deucher from comment #71)
> > The original issue reported in this bug was fixed long ago.  If you are
> > having issues, please file a new report.
> 
> I just filed a new bug for the resume issue at your request. It's 209457.

My issue seems unrelated to your bug report; my suspend/resume freeze is related to my Intel Wireless AC9260, not to my AMD Ryzen 7 3700U with integrated Vega RX10 graphics.

Disabling the wireless card in the BIOS fixes the suspend/resume problem for my specific configuration (Thinkpad T495 20NK model).

Your issue, on the other hand, seems to be with the AMDGPU driver and related to your graphics card, I suppose.
Comment 74 Robert M. Muncrief 2020-10-01 17:55:40 UTC
(In reply to Lahfa Samy from comment #73)
> (In reply to Robert M. Muncrief from comment #72)
> > (In reply to Alex Deucher from comment #71)
> > > The original issue reported in this bug was fixed long ago.  If you are
> > > having issues, please file a new report.
> > 
> > I just filed a new bug for the resume issue at your request. It's 209457.
> 
> My issue seems unrelated to your bug report, my suspend/resume freeze issue
> is related to my Intel Wireless AC9260 not to my AMD Ryzen 7 3700U with
> integrated graphics Vega RX10. 
> 
> Disabling the wireless card in the BIOS fixes the suspend/resume problem for
> my specific configuration (Thinkpad T495 20NK model).
> 
> Although your issue seems to be with the AMDGPU driver and related to your
> graphics card I suppose.

Yes, I filed a new bug for my issue at https://bugzilla.kernel.org/show_bug.cgi?id=209457.  
  
Hopefully this bug will be closed to avoid further confusion for users, and to relieve the hard-working developers of our confusion as well :)
Comment 75 Allexj 2021-02-08 22:00:57 UTC
I have this problem too. It is still happening. I am currently on the 5.10.7-3 kernel.
Comment 76 Robert M. Muncrief 2021-02-08 22:13:47 UTC
I also continue to have this problem on Arch with kernel 5.10.14.
Comment 77 Alex Deucher 2021-02-08 22:15:16 UTC
Please open a new ticket; this issue was fixed.
Comment 78 TheRinger 2023-04-13 20:11:20 UTC
After this happened to me on Debian, I started digging to find the source, as it came with a payload which ultimately flashed my BIOS after flashing my wireless card's firmware. I found two files that were modified from the original installation and which may have been injected, as their hashes differ from the original source. Researching further, I found some interesting comments about how this is done by manipulating systemd after resuming from hibernation and pulling modified memory back from swap.

The rabbit hole goes further, as it then returns from sleep after modifying the libraries that control fonts and their storage. You browse Google and your searches return websites with web fonts. In these fonts there are strange emojis and symbols which at first seem like poorly designed icons and graphics, but actually contain raw code that is downloaded to your cache. At some point another part goes in and assembles these code blocks to copy your /home/user/.ssh files, because of weak userland file and directory attributes. Anyway, this goes on and on, as you can imagine.

When this happens, or after you restart because the computer doesn't return from sleep, you end up with modifications to your BIOS, graphics, hard drive, firmware and anything else it can find in order to stay present. Your gparted code will contain code blocks that swap out code from the end of your hard drive to the start. You will need to start from scratch by clearing CMOS, then uploading new firmware and zeroing out hard drives. It's a huge headache. It may only get so far, so you may never end up downloading the cached fonts or completing some other step it needs, and you will think it's just a glitch. Check the known_hosts file in your SSH directory, and also compare hashes against the original source code.

I switched to Slackware years ago, despite enjoying the simplicity of package management, as its appeal to me was that it didn't contain systemd. Recently I decided to try a mainline distro again, only to discover this gem.

The library files, among others, were libfribidi.o and libgraphite2.so; they are notable only because they were in the original initramfs.
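
A minimal sketch of the hash comparison suggested above, assuming you already have trusted SHA-256 digests for the files in question (for example from the distribution's package). The helper names `sha256_of` and `changed_files` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(known_good):
    """known_good maps path -> expected hex digest.

    Returns the paths whose current digest differs from the expected one.
    A missing file raises FileNotFoundError rather than being skipped.
    """
    return [
        path
        for path, expected in known_good.items()
        if sha256_of(path) != expected.lower()
    ]
```

Calling `changed_files` with a dictionary of paths and trusted digests returns only the entries that no longer match; an empty list means every listed file matched.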