Bug 204241 - amdgpu fails to resume from suspend
Summary: amdgpu fails to resume from suspend
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Duplicates: 204965
Depends on:
Blocks:
 
Reported: 2019-07-20 09:50 UTC by Alexander Kitaev
Modified: 2019-10-28 20:16 UTC
CC List: 11 users

See Also:
Kernel Version: 5.2.1-arch1-1-ARCH
Tree: Mainline
Regression: No


Attachments
dmesg (29.88 KB, text/plain)
2019-07-20 09:50 UTC, Alexander Kitaev
lspci (10.52 KB, text/plain)
2019-07-20 09:50 UTC, Alexander Kitaev
var_log_messages for amdgpu_ERROR (34.03 KB, text/plain)
2019-08-14 19:38 UTC, Andreas Jackisch
lspci from ryzen system (3.81 KB, text/plain)
2019-08-14 19:39 UTC, Andreas Jackisch
amdgpu firmware from ryzen system (1.08 KB, text/plain)
2019-08-14 19:39 UTC, Andreas Jackisch
resume_failure.log (207.39 KB, text/plain)
2019-08-15 20:55 UTC, Andrey Grodzovsky
var_log_meesages_5_2_11 (35.24 KB, text/plain)
2019-09-03 18:56 UTC, Andreas Jackisch
/var/log/messages w/ kernel 5.3.0-gentoo (20.40 KB, text/plain)
2019-09-21 18:31 UTC, Andreas Jackisch
Patch to prevent frequent resume failures (1.36 KB, patch)
2019-10-05 00:08 UTC, Ahzo
Patch to fix the resume failures (1.35 KB, patch)
2019-10-11 18:33 UTC, Ahzo
Patch to prevent kernel NULL pointer dereferences (1.98 KB, patch)
2019-10-11 18:37 UTC, Ahzo
possible fix uvd6 (1.66 KB, patch)
2019-10-11 20:47 UTC, Alex Deucher
possible fix uvd7 (1.67 KB, patch)
2019-10-11 20:48 UTC, Alex Deucher
possible fix for uvd6 (4.04 KB, patch)
2019-10-15 22:11 UTC, Alex Deucher
possible fix uvd7 (4.12 KB, patch)
2019-10-15 22:12 UTC, Alex Deucher
possible fix for vcn (3.95 KB, patch)
2019-10-15 22:12 UTC, Alex Deucher

Description Alexander Kitaev 2019-07-20 09:50:31 UTC
Created attachment 283863 [details]
dmesg

Computer fails to resume from suspend.
From the logs it looks like AMDGPU fails to resume.
Comment 1 Alexander Kitaev 2019-07-20 09:50:55 UTC
Created attachment 283865 [details]
lspci
Comment 2 Andrey Grodzovsky 2019-08-08 20:13:57 UTC
Can you post the full dmesg log from boot? What card are you using?
Comment 3 Andrey Grodzovsky 2019-08-08 20:22:17 UTC
OK, I checked lspci and it's Ellesmere. Never mind.
Comment 4 Andrey Grodzovsky 2019-08-08 21:02:03 UTC
I tried to reproduce it with a kernel that differs from this one by only a few commits: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next

I tried with X enabled and in FB console. Was able to suspend and resume with no errors.

I suggest building your kernel from the branch above to see if it helps.
Also, please post your FW info using this command:
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
Comment 5 Andreas Jackisch 2019-08-14 19:29:29 UTC
The same issue started to hit me on Gentoo when switching from 5.1.5-gentoo to 5.2.5-gentoo. I reverted to the latest 5.1.21-gentoo and the issue did not come up again. The resume failure occurs after 5 to 20 attempts. I'll add message logs, lspci and firmware info.
Comment 6 Andreas Jackisch 2019-08-14 19:38:28 UTC
Created attachment 284411 [details]
var_log_messages for amdgpu_ERROR

Search for "amdgpu" to see it fail after resume.
Comment 7 Andreas Jackisch 2019-08-14 19:39:11 UTC
Created attachment 284413 [details]
lspci from ryzen system
Comment 8 Andreas Jackisch 2019-08-14 19:39:40 UTC
Created attachment 284415 [details]
amdgpu firmware from ryzen system
Comment 9 Andrey Grodzovsky 2019-08-14 21:04:57 UTC
I was able to reproduce.
Comment 10 Andrey Grodzovsky 2019-08-15 20:55:41 UTC
Created attachment 284445 [details]
resume_failure.log

The kernel oops is just a result of the preceding GFX ring test failure. The attached log from UMR shows that the gfx ring hangs at (or right after) the first PKT3_SET_CONTEXT_REG, because the latest PFP_HEADER_DUMP shows 0xc0d46900. This suggests that some of the payload within SET_CONTEXT_REG (in gfx_v8_0_get_csb_buffer) may cause the hang, which later results in the ring test failure.

Alex Deucher - Any idea how to confirm this?
Comment 12 Andreas Jackisch 2019-09-03 18:56:07 UTC
Created attachment 284807 [details]
var_log_meesages_5_2_11

I tested w/ kernel 5.2.11 as it contains the referenced patch "drm/amdgpu: pin the csb buffer on hw init for gfx v8". However, as before, the system did not resume properly. This was on the 3rd attempt after almost 24 hours in S3. Reverted to 5.1.21.
Comment 13 Andreas Jackisch 2019-09-21 18:31:37 UTC
Created attachment 285079 [details]
/var/log/messages w/ kernel 5.3.0-gentoo

As there was no success w/ 5.2.x at all I tested 5.3.0. However, the system did not resume after the 2nd attempt with a comparable failure message.
 
amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring sdma0 test failed (-110)

This is slightly different from 5.2.x where it was 

amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper] *ERROR* ring gfx test failed (-110)

but the result seems to be the same.

I'm not sure whether anybody is working on this or whether the bug opener still sees the issue. As the 5.1.x kernel series is somewhat outdated now, I will revert to 4.19.x LTS.
If there is any hint or advice on what I can do to help, please let me know.
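
The two failure messages quoted in comment 13 differ only in the ring name, so they can be picked out of a log mechanically. The following is a minimal Python sketch, not part of the original report: the regular expression is modelled only on the two lines quoted above, and the function name find_ring_failures is made up for illustration. On Linux, the code -110 corresponds to -ETIMEDOUT.

```python
import re

# Pattern modelled on the two log lines quoted in comment 13; other amdgpu
# messages may be formatted differently (assumption).
RING_FAIL = re.compile(
    r"amdgpu (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"\[drm:amdgpu_ring_test_helper\] \*ERROR\* "
    r"ring (?P<ring>\S+) test failed \((?P<code>-?\d+)\)"
)


def find_ring_failures(log_text):
    """Return (pci_address, ring_name, error_code) for each matching line."""
    return [
        (m.group("pci"), m.group("ring"), int(m.group("code")))
        for m in RING_FAIL.finditer(log_text)
    ]
```

Feeding the contents of /var/log/messages (or dmesg output) to find_ring_failures would list which ring failed on each occurrence, which makes it easier to compare kernels such as 5.2.x (gfx) and 5.3.0 (sdma0).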
Comment 14 Ahzo 2019-10-05 00:08:04 UTC
Created attachment 285349 [details]
Patch to prevent frequent resume failures

While this issue happens rather randomly, it can be quite reliably reproduced on linux 5.2 and later by performing successive suspend-resume cycles.
Usually the error occurs after less than 10 cycles, but occasionally only after more than 20. Thus one can use the following command to reproduce it almost certainly:
$ for i in $(seq 30); do sudo rtcwake -m mem -s 5; sleep 15; done

A bisection using this method led to:
commit 533aed278afeaa68bb5d0600856ab02268cfa3b8
Author: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date:   Wed Mar 6 16:16:28 2019 -0500

    drm/amdgpu: Move IB pool init and fini v2
    
    Problem:
    Using SDMA for TLB invalidation in certain ASICs exposed a problem
    of IB pool not being ready while SDMA already up on Init and already
    shutt down while SDMA still running on Fini. This caused
    IB allocation failure. Temproary fix was commited into a
    bringup branch but this is the generic fix.
    
    Fix:
    Init IB pool rigth after GMC is ready but before SDMA is ready.
    Do th opposite for Fini.
    
    v2: Remove restriction on SDMA early init and move amdgpu_ib_pool_fini
    
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


Reverting this commit makes the problem unreproducible with the above command.

Another way to prevent these frequent resume failures, while preserving the intention of this commit, is to simply call amdgpu_ib_pool_init directly after calling amdgpu_ucode_create_bo instead of directly before that. Attached is a patch doing it that way.
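
The reproduction loop from this comment could also be scripted so that it stops at the first cycle that logs a ring test failure. This is a rough Python sketch under several assumptions: the helper names are invented here, the failure check only looks for the "*ERROR* ring ... test failed" wording seen in the attached logs, reading dmesg may require root, and a machine that hangs on resume will of course never reach the check.

```python
import subprocess
import time


def rtcwake_command(wake_after_s=5):
    """Build the same rtcwake call as the shell loop above."""
    return ["sudo", "rtcwake", "-m", "mem", "-s", str(wake_after_s)]


def count_ring_failures(log_text):
    """Count log lines that look like an amdgpu ring test failure."""
    return sum(
        1
        for line in log_text.splitlines()
        if "*ERROR* ring" in line and "test failed" in line
    )


def run_cycles(cycles=30, wake_after_s=5, settle_s=15,
               suspend=None, read_log=None, sleep=time.sleep):
    """Return the 1-based cycle at which a new failure appeared, or None."""
    if suspend is None:
        suspend = lambda: subprocess.run(rtcwake_command(wake_after_s), check=True)
    if read_log is None:
        read_log = lambda: subprocess.run(
            ["dmesg"], capture_output=True, text=True
        ).stdout
    baseline = count_ring_failures(read_log())  # ignore failures logged earlier
    for cycle in range(1, cycles + 1):
        suspend()
        sleep(settle_s)
        if count_ring_failures(read_log()) > baseline:
            return cycle
    return None
```

The suspend, read_log and sleep parameters exist only so the loop can be exercised without actually suspending the machine; calling run_cycles() with the defaults mirrors the 30-cycle shell loop.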
Comment 15 Andreas Jackisch 2019-10-05 10:35:25 UTC
(In reply to Ahzo from comment #14)
> Created attachment 285349 [details]
> Patch to prevent frequent resume failures
> ....
> Another way to prevent these frequent resume failures, while preserving the
> intention of this commit, is to simply call amdgpu_ib_pool_init directly
> after calling amdgpu_ucode_create_bo instead of directly before that.
> Attached is a patch doing it that way.
I applied the patch above to 5.3.2-gentoo. All 30 Suspend/Resume cycles using rtcwake and a couple of manual cycles went OK.

I'll continue to use this setup and will report if it fails again or is still OK after one week.

Thx for bisecting this issue and providing this fix as I assume it took some time.
Comment 16 me 2019-10-07 10:13:07 UTC
Can confirm the patch 'drm/amdgpu: Move IB pool init after ucode bo creation' fixed the issue for me (96h and counting, failure normally within 24h, with ~2 suspend/resume cycles per day).
Comment 17 Alex Deucher 2019-10-07 18:10:11 UTC
(In reply to Ahzo from comment #14)
> Another way to prevent these frequent resume failures, while preserving the
> intention of this commit, is to simply call amdgpu_ib_pool_init directly
> after calling amdgpu_ucode_create_bo instead of directly before that.
> Attached is a patch doing it that way.

I'm not sure I understand why the patch helps.  You are just changing the order of two memory allocations.  The order shouldn't matter.
Comment 18 Michel Dänzer 2019-10-08 07:56:53 UTC
(In reply to Alex Deucher from comment #17)
> I'm not sure I understand why the patch helps.  You are just changing the
> order of two memory allocations.  The order shouldn't matter.

My guess would be that the exact location of the ucode BO matters somehow.
Comment 19 me 2019-10-08 09:40:24 UTC
Just had the first (but different kind of) crash since applying the patch on top of 5.3.2, but didn't have kdump configured:
The system woke, everything seemed to work for about 30s, then the screen went black and the machine rebooted.
Comment 20 Ahzo 2019-10-11 18:33:10 UTC
Created attachment 285469 [details]
Patch to fix the resume failures

(In reply to Alex Deucher from comment #17)
> I'm not sure I understand why the patch helps.  You are just changing the
> order of two memory allocations.  The order shouldn't matter.

My hypothesis is that the order here is not the root cause of the problem, but rather affects the likelihood of that manifesting itself.
This is based on the fact that I have seen a resume failure typical for this bug on linux 5.0 once, but I'm unable to reproduce it with that version.

As commit 533aed278afe apparently makes the failures much more likely to happen, it provides an opportunity to debug this further by backporting it to older linux versions.
Doing that for versions down to linux 4.15 exposes the resume failures, but not on linux 4.14.

A bisection between these two, while backporting 533aed278afe at every step, led to commit 2a91f272e34c, which failed to boot and thus had to be skipped, and:
commit e0128efb08b3d628d767ec8578e77cdd7ecc8f81
Author: James Zhu <James.Zhu@amd.com>
Date:   Fri Sep 29 16:42:27 2017 -0400

    drm/amdgpu: add uvd enc ib test
    
    Generate create/destroy messages to test UVD encode indirect buffer function.
    And enable UVD encode IB test during device initialization.
    
    Signed-off-by: James Zhu <James.Zhu@amd.com>
    Reviewed-and-Tested-by: Leo Liu <leo.liu@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

This looks like a likely root cause. Indeed, adding 'return 0;' at the beginning of uvd_v6_0_enc_ring_test_ib makes the problem unreproducible, even on the latest linux 5.4-rc2.

Comparing with amdgpu_uvd_get_{create,destroy}_msg shows that these use 0 as dummy GPU pointer, while uvd_v6_0_enc_get_{create,destroy}_msg use a real GPU memory address.
Changing them to also use 0 as dummy pointer, as is done in the attached patch, actually fixes the resume failures.

Maybe a similar change should also be made for UVD 7.
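
For anyone repeating a bisection like this one, the three possible outcomes per step (good, bad, or untestable because the kernel does not boot) map onto the exit statuses that `git bisect run` understands: 0 for good, 125 for skip, and any other status from 1 to 127 for bad. The thread does not say how the original bisection was driven, so the following Python helper is only an illustration; the function name and its two boolean inputs are hypothetical.

```python
# Exit statuses understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_verdict(booted, resume_failed):
    """Map one test outcome to a `git bisect run` exit status."""
    if not booted:
        # The commit cannot be judged, like the one that failed to boot.
        return SKIP
    return BAD if resume_failed else GOOD
```

A wrapper script would still have to apply the 533aed278afe backport, build, boot and run the suspend/resume loop before calling this; those steps are hardware-specific and are not shown.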
Comment 21 Ahzo 2019-10-11 18:37:48 UTC
Created attachment 285471 [details]
Patch to prevent kernel NULL pointer dereferences

By the way, some of the kernel NULL pointer dereferences that can happen after a resume failure also always happen on shutdown:
RIP: 0010:build_audio_output.isra.0+0x97/0x110 [amdgpu]
RIP: 0010:enable_link_dp+0x186/0x300 [amdgpu]

The attached patch prevents them.

Note that these oopses are difficult to notice on shutdown, because they only leave traces in /sys/fs/pstore, not on the disk, as they happen after unmounting.
Comment 22 Alex Deucher 2019-10-11 20:47:45 UTC
Created attachment 285473 [details]
possible fix uvd6

Nice work.  I think the attached patch should fix it.
Comment 23 Alex Deucher 2019-10-11 20:48:08 UTC
Created attachment 285475 [details]
possible fix uvd7

Same fix for uvd7.
Comment 24 Ahzo 2019-10-12 10:37:51 UTC
(In reply to Alex Deucher from comment #22)
> Created attachment 285473 [details]
> possible fix uvd6
> 
> Nice work.  I think the attached patch should fix it.

Thanks for finding the correct solution. I can confirm that the patch for uvd6 works. The one for uvd7 also looks good, but I don't have the hardware to test it.
Furthermore, I think vcn also needs a similar change. I'm not sure about vce, as that uses 'ib_size_dw = 1024' thus allocating a much larger buffer.
Comment 25 me 2019-10-12 16:25:32 UTC
If it is of any help: I would be willing to test any of the more recent patches.

Hardware:
- Radeon RX 550
- Ryzen 1700X

The first patch by Ahzo@ already worked for me:
5.3.2 with "drm/amdgpu: Move IB pool init after ucode bo creation"

What other patches should I test with which kernel version?
Please provide Bugzilla attachment numbers.
Comment 26 Ahzo 2019-10-12 18:35:31 UTC
You can test Alex Deucher's uvd6 patch (attachment 285473 [details]), which is the proper fix for your RX 550.
Testing on linux 5.3 is fine, as this patch should fix the problem on any affected version.

The patch you tested previously just makes the problem unlikely to cause resume failures, but it doesn't fix the root cause of overwriting random GPU memory, so it might still cause random issues.
Comment 27 Andreas Jackisch 2019-10-12 20:47:09 UTC
In brief - the patch "0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test.patch" didn't work for me. After about 10 suspend/resume cycles the typical issue occurred again and I had to SysRq the system.

Status, all gentoo kernels:
5.1.x  OK
4.19.74 OK
5.2.x FAIL
5.3.0 FAIL
5.3.2 w/ patch from comment#14 OK
5.3.6 FAIL
5.3.6 w/ patch 0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test FAIL
5.3.6 w/ patch 0001-drm-amdgpu-uvd6-use-0-as-dummy-pointer-in-enc-ring-t OK

The last setup has seen 30+ suspend/resume cycles. I'll continue to use this.

So, to me it looks like increasing the allocation did not help, but assigning 0 to the dummy pointer did.

My hardware is comparable to the one listed in comment#25
- Radeon RX550
- Ryzen 1700
Comment 28 Ahzo 2019-10-13 10:47:29 UTC
(In reply to Andreas Jackisch from comment #27)
> In brief - the patch
> "0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test.patch" didn't
> work for me. After about 10 suspend/resume cycles the typical issue occurred
> again and I had to SysRq the system.

Indeed, the 0001-drm-amdgpu-uvd6-fix-allocation-size-in-enc-ring-test patch (attachment 285473) doesn't work.
Apparently I got (un)lucky enough that it survived 30 suspend/resume cycles, but testing it again, it failed.

On the other hand, the 0001-drm-amdgpu-uvd6-use-0-as-dummy-pointer-in-enc-ring-t patch (attachment 285469) survived 100 cycles.
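
To put the "(un)lucky" 30-cycle pass in perspective, one can estimate how often a broken kernel would survive a test run by chance. This is a back-of-the-envelope Python sketch, assuming that cycles fail independently with a fixed probability; the value 0.1 is an assumed figure loosely based on failures showing up after roughly 10 cycles, not a measurement.

```python
def survival_probability(p_fail, cycles):
    """Chance of seeing no failure in `cycles` independent suspend/resume cycles."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return (1.0 - p_fail) ** cycles


# p_fail = 0.1 is an assumption, see the note above.
for n in (30, 100):
    print(n, survival_probability(0.1, n))
```

Under that assumption a broken kernel would still pass a 30-cycle run about 4% of the time, while passing 100 cycles by chance is far less likely, which is consistent with treating the 100-cycle result as the more convincing one.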
Comment 29 Alex Deucher 2019-10-15 22:11:53 UTC
Created attachment 285507 [details]
possible fix for uvd6

The session info is 128K according to mesa.
Comment 30 Alex Deucher 2019-10-15 22:12:20 UTC
Created attachment 285509 [details]
possible fix uvd7

Updated patch for uvd7
Comment 31 Alex Deucher 2019-10-15 22:12:58 UTC
Created attachment 285511 [details]
possible fix for vcn

Same fix for vcn.
Comment 32 me 2019-10-16 14:27:16 UTC
@Alex: Didn't have a crash with the old uvd6 patch (attachment 285473 [details]) so far, but apparently I am just lucky.

Which patch (series?) should I test now?
Comment 33 Alex Deucher 2019-10-16 14:29:04 UTC
(In reply to me from comment #32)
> @Alex: Didn't have a crash with the old uvd6 patch (attachment 285473
> [details]) so far, but apparently I am just lucky.
> 
> Which patch (series?) should I test now?

Please try attachment 285507 [details].
Comment 34 Ahzo 2019-10-16 17:29:27 UTC
(In reply to Alex Deucher from comment #29)
> Created attachment 285507 [details]
> possible fix for uvd6
> 
> The session info is 128K according to mesa.

This version of the patch didn't fail for 100 suspend/resume cycles, so I think it actually fixes the problem.
Comment 35 Andreas Jackisch 2019-10-16 22:03:18 UTC

(In reply to Ahzo from comment #34)
> (In reply to Alex Deucher from comment #29)
> > Created attachment 285507 [details]
> > possible fix for uvd6
> > 
> > The session info is 128K according to mesa.
> 
> This version of the patch didn't fail for 100 suspend/resume cycles, so I
> think it actually fixes the problem.

I can confirm that the patch seems to work OK. 30+ suspend/resume cycles so far where it normally fails after 10 cycles.
Comment 36 Mario 2019-10-20 20:06:13 UTC
I can also confirm this patch (285507) fixed the problem on Arch Linux 5.3.7. 

The stock kernel failed after ~5 sleep-wake cycles. Patched kernel was able to survive the complete 30 cycles:

$ for i in $(seq 30); do sudo rtcwake -m mem -s 5; sleep 15; done

Thanks for the patch. I also suspect that bug 204965 is a duplicate of this one.
Comment 37 David 2019-10-23 16:46:53 UTC
*** Bug 204965 has been marked as a duplicate of this bug. ***
Comment 38 Andrew Hutchings 2019-10-28 20:16:20 UTC
Also confirmed Alex Deucher's patches work great for me, patched Fedora 31 kernel 5.3.7 on a ThinkPad T495 Ryzen 7 PRO 3700U with a Vega 10 GPU (vcn).

Many thanks!
