Bug 216119

Summary: 087451f372bf76d breaks hibernation on amdgpu Radeon R9 390
Product: Drivers
Reporter: Harald Judt (h.judt)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: alexdeucher, daniel, h.judt, mario.limonciello
Priority: P1
Hardware: All
OS: Linux
Kernel Version: v5.17+
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Error when resuming fails
dmesg.out
patch 1/4
patch 2/4
patch 3/4
patch 4/4
mailing-list-patch-adapted-for-5.18.7.patch
dmesg.out
patch 1/2
patch 2/2
dmesg.out
dmesg.out
dmesg.out
possible fix
another fix

Description Harald Judt 2022-06-12 21:42:43 UTC
This is a problem with amdgpu only, as I did not experience it on my other machines. It is caused by [087451f372bf76d971184caa258807b7c35aac8f] drm/amdgpu: use generic fb helpers instead of setting up AMD own's
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=087451f372bf76d971184caa258807b7c35aac8f

Reverting this patch makes me a happy amdgpu user with working hibernation again on Linux 5.18.3.

Note I did the bisect on the stable linux repo, but that shouldn't matter.

git bisect start
# good: [c19a885e12f114b799b5d0d877219f0695e0d4de] Linux 5.16.20
git bisect good c19a885e12f114b799b5d0d877219f0695e0d4de
# bad: [f443e374ae131c168a065ea1748feac6b2e76613] Linux 5.17
git bisect bad f443e374ae131c168a065ea1748feac6b2e76613
# good: [df0cc57e057f18e44dac8e6c18aba47ab53202f9] Linux 5.16
git bisect good df0cc57e057f18e44dac8e6c18aba47ab53202f9
# bad: [22ef12195e13c5ec58320dbf99ef85059a2c0820] Merge tag 'staging-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad 22ef12195e13c5ec58320dbf99ef85059a2c0820
# bad: [9bcbf894b6872216ef61faf17248ec234e3db6bc] Merge tag 'media/v5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad 9bcbf894b6872216ef61faf17248ec234e3db6bc
# bad: [cb6846fbb83b574c85c2a80211b402a6347b60b1] Merge tag 'amd-drm-next-5.17-2021-12-30' of ssh://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad cb6846fbb83b574c85c2a80211b402a6347b60b1
# bad: [15bb79910fe734ad21c765d1cae762e855969caa] Merge tag 'drm-misc-next-2021-12-09' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad 15bb79910fe734ad21c765d1cae762e855969caa
# good: [03848335b5b1faa4a4641fcf30b7c233579a45aa] drm/bridge: sn65dsi86: defer if there is no dsi host
git bisect good 03848335b5b1faa4a4641fcf30b7c233579a45aa
# good: [6bb0a0e0fd358d4f9f6ce4c2d36c1f80d7496f6a] drm/i915: Clean up FPGA_DBG/CLAIM_ER bits
git bisect good 6bb0a0e0fd358d4f9f6ce4c2d36c1f80d7496f6a
# bad: [13d20aabd6ef501229ac002493c6f237482c47de] drm/amd/display: remove no need NULL check before kfree
git bisect bad 13d20aabd6ef501229ac002493c6f237482c47de
# bad: [a6506cd845824fe92b1760aaf104011cc04dfa78] drm/radeon: correct indentation
git bisect bad a6506cd845824fe92b1760aaf104011cc04dfa78
# bad: [f0d0c39149f817e5ecdff8fa164f44da455b3317] drm/amd/display: Pass panel inst to a PSR command
git bisect bad f0d0c39149f817e5ecdff8fa164f44da455b3317
# good: [574c4183ef75117f763e9f2b35e08c85f5dcad2d] drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs
git bisect good 574c4183ef75117f763e9f2b35e08c85f5dcad2d
# bad: [b5f57384805a34f497edb8b04d694a8a1b3d81d4] drm/amdkfd: Add sysfs bitfields and enums to uAPI
git bisect bad b5f57384805a34f497edb8b04d694a8a1b3d81d4
# good: [56c5977eae8799c9a71ee2112802fd1f1591dc3a] drm/amdkfd: replace/remove remaining kgd_dev references
git bisect good 56c5977eae8799c9a71ee2112802fd1f1591dc3a
# bad: [087451f372bf76d971184caa258807b7c35aac8f] drm/amdgpu: use generic fb helpers instead of setting up AMD own's.
git bisect bad 087451f372bf76d971184caa258807b7c35aac8f
# good: [b5d1d755c1344075d4f16a3e6183ed04b4d022ef] drm/amdkfd: remove kgd_dev declaration and initialization
git bisect good b5d1d755c1344075d4f16a3e6183ed04b4d022ef
# first bad commit: [087451f372bf76d971184caa258807b7c35aac8f] drm/amdgpu: use generic fb helpers instead of setting up AMD own's.
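
For readers who want to pull the verdict out of a log like the one above, here is a minimal Python sketch. It assumes the `# first <term> commit: [<sha>] <subject>` summary line seen in this report (the term is `bad` here and `broken` in a later log in this thread); the name `first_bad_commit` is made up for illustration.

```python
import re

# Matches the summary comment that ends a finished bisect log. The middle
# word is "bad" by default, or a custom term such as "broken".
FIRST_BAD = re.compile(r"^# first (\w+) commit: \[([0-9a-f]{40})\] (.*)$")


def first_bad_commit(log_text):
    """Return (term, sha, subject) from a bisect log, or None if unfinished."""
    for line in log_text.splitlines():
        match = FIRST_BAD.match(line.strip())
        if match:
            return match.groups()
    return None


# Shortened copy of the log above, kept only to make the sketch runnable.
sample = """\
git bisect start
# bad: [087451f372bf76d971184caa258807b7c35aac8f] drm/amdgpu: use generic fb helpers instead of setting up AMD own's.
git bisect bad 087451f372bf76d971184caa258807b7c35aac8f
# first bad commit: [087451f372bf76d971184caa258807b7c35aac8f] drm/amdgpu: use generic fb helpers instead of setting up AMD own's.
"""

term, sha, subject = first_bad_commit(sample)
print(term, sha, subject)
```

A log that was saved before the bisect finished has no summary line, in which case the function returns `None`.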

lspci:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] (rev 80) (prog-if 00 [VGA controller])
        Subsystem: XFX Pine Group Inc. Hawaii PRO [Radeon R9 290/390]
        Flags: bus master, fast devsel, latency 0, IRQ 25
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=8M]
        I/O ports at e000 [size=256]
        Memory at f7e00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
Comment 1 Harald Judt 2022-06-12 21:46:17 UTC
Created attachment 301161 [details]
Error when resuming fails

Screenshot with error message when resuming fails. Screen will blank out and come back, looping endlessly.
Comment 2 Artem S. Tashkinov 2022-06-13 10:27:14 UTC
Please refile here: https://gitlab.freedesktop.org/drm/amd/-/issues (when the website gets restored - it's currently down).
Comment 4 Harald Judt 2022-06-16 18:40:01 UTC
I have tried 5.18.4, which includes this patch, and I think 5.18.3 also came with it. Unfortunately, this does not fix hibernation for me, though the symptoms are a bit different: I no longer see any kernel messages like before (see comment #1); instead, the screen stays black. I can still reboot using the keyboard sysrq keys, though.
Comment 5 Alex Deucher 2022-06-16 19:15:51 UTC
Does setting amdgpu.runpm=0 help?
Comment 6 Alex Deucher 2022-06-16 19:16:09 UTC
On the kernel command line in grub.
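
When testing a setting like this, it helps to confirm that the parameter really reached the kernel. The following is a minimal Python sketch that looks for a `name=value` token in a command-line string; on a running system that string can be read from `/proc/cmdline`. The helper name `module_param` and the example command line are hypothetical, and quoted values containing spaces are not handled.

```python
def module_param(cmdline, name):
    """Return the value of the first `name=value` token, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


# Hypothetical command line, used only for illustration. On a live system:
#     with open("/proc/cmdline") as f: cmdline = f.read()
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet amdgpu.runpm=0"

print(module_param(example, "amdgpu.runpm"))  # prints 0
print(module_param(example, "amdgpu.dc"))     # prints None
```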
Comment 7 Harald Judt 2022-06-16 19:24:52 UTC
Unfortunately, no it did not help (tested with 5.18.4).
Comment 8 Alex Deucher 2022-06-16 19:35:18 UTC
Can you attach your full dmesg output?
Comment 9 Harald Judt 2022-06-16 19:47:07 UTC
Created attachment 301189 [details]
dmesg.out

Here is the dmesg out from my linux-5.18.3 right after boot (that version is with the reverted patch). If you need something more specific or the dmesg from another version, please tell me.
Comment 10 Harald Judt 2022-06-16 19:55:19 UTC
Again, the same applies to 5.18.4: Reverting the fb helper patch gets me a working resume from hibernation.
Comment 11 Alex Deucher 2022-06-16 21:57:03 UTC
Does changing the prefer_shadow hint help?  E.g., something like:

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c2bc7db85d7e..4b6bd1a5804a 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3852,7 +3852,7 @@ static int amdgpu_dm_mode_config_init(struct amdgpu_device *adev)
        adev_to_drm(adev)->mode_config.max_height = 16384;
 
        adev_to_drm(adev)->mode_config.preferred_depth = 24;
-       adev_to_drm(adev)->mode_config.prefer_shadow = 1;
+       adev_to_drm(adev)->mode_config.prefer_shadow = 0;
        /* indicates support for immediate flip */
        adev_to_drm(adev)->mode_config.async_page_flip = true;
Comment 12 Harald Judt 2022-06-16 22:36:03 UTC
Yes, changing this single line indeed solves the issue! Thanks.
Comment 13 Alex Deucher 2022-06-21 14:08:15 UTC
Created attachment 301243 [details]
patch 1/4

Can you try this patch set instead?
Comment 14 Alex Deucher 2022-06-21 14:08:36 UTC
Created attachment 301244 [details]
patch 2/4
Comment 15 Alex Deucher 2022-06-21 14:08:55 UTC
Created attachment 301245 [details]
patch 3/4
Comment 16 Alex Deucher 2022-06-21 14:09:17 UTC
Created attachment 301246 [details]
patch 4/4
Comment 17 Harald Judt 2022-06-21 22:33:39 UTC
Yes, this patch set seems to work fine. Reporting two successful hibernate/resume cycles with them applied to 5.18.5.
Comment 18 Alex Deucher 2022-06-22 13:22:14 UTC
Just to verify, you removed the patch from comment 11 before testing the new patches?  Also, can you try just patch 1/4 and then again with just patches 2/4 and 3/4?  Do either of those combinations also work?
Comment 19 Harald Judt 2022-06-22 21:08:07 UTC
Yes, I definitely removed the patch from comment 11 before testing the new patch set. I will try the other combinations later this week, I am a bit short on time at the moment.
Comment 20 Harald Judt 2022-06-24 19:51:21 UTC
One thing that I have noticed: Since these changes, the kernel seems to switch to text mode when hibernating. Before that I think it remained (frozen) on the X screen.

Here are the results:
- 1+4: screen black, display suspend LED, keyboard comes online again but no ssh. Can use sysrq to reboot.
- 2+4: screen black at first, comes back after some time, restores the screen with the hibernation snapshot progress visible (not that of resume), continues to resume (resume progress visible), but then the screen goes black again with DPMS on; ssh available. Was able to compile a kernel and reboot via ssh.
- 3+4: does not compile (unknown fb struct or so). Because of that, I tried 2+3+4, since 2+4 compiled fine and mostly worked.
- 2+3+4: works as well as 1+2+3+4. It seems patch 1 is not necessary.
Comment 21 Daniel Vetter 2022-06-27 16:20:43 UTC
Two thoughts:

- It's entirely possible that the working kernel has the same warning. It happens right before we switch to the resumed kernel from the hibernation image (if I read this all right), so a) it will show only for an extremely short amount of time and b) it won't be captured in the dmesg once we're in the resumed kernel, I think.

- Something later on is going wrong, and we don't know. Have we tried the no_console_suspend trick already? Another possibility might be to connect a serial console, that tends to be able to get more dmesg output out of dying kernels.
Comment 23 Harald Judt 2022-06-27 23:48:12 UTC
Created attachment 301291 [details]
mailing-list-patch-adapted-for-5.18.7.patch

I had to adapt the patch for 5.18.7. I hope this is right, there is only a difference in a struct it seems.
Unfortunately, it did not help. Screen suspended, comes back after a while, goes blank again (but powered on), also keyboard back for sysrq, but ssh unresponsive.
Comment 24 Alex Deucher 2022-06-28 15:43:36 UTC
Does setting amdgpu.dc=0 on the kernel command line also exhibit the behavior?  If so, does patch 4/4 alone fix that?
Comment 25 Harald Judt 2022-06-28 21:12:29 UTC
Created attachment 301298 [details]
dmesg.out

I report success, partly.

Setting amdgpu.dc=0 on the kernel command line also exhibits the behaviour => Screen suspended, comes back after a while, goes blank again (but powered on), also keyboard back for sysrq, but ssh unresponsive.

After applying patch 4 and setting amdgpu.dc=0, the system restores again, but not smoothly. Screen suspends, comes back after a while, also keyboard, display switches back on, machine is responsive and I could generate a dmesg which shows that there have been test failures. dmesg.out attached, but the relevant messages are these:

amdgpu 0000:01:00.0: [drm:amdgpu_ib_ring_tests] *ERROR* IB test failed on sdma0 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).
Comment 26 Alex Deucher 2022-06-28 21:33:04 UTC
Created attachment 301299 [details]
patch 1/2

Can you try the attached 2 patches (without any previous patches) both with and without amdgpu.dc=0?
Comment 27 Alex Deucher 2022-06-28 21:33:26 UTC
Created attachment 301300 [details]
patch 2/2
Comment 28 Harald Judt 2022-06-29 06:16:15 UTC
Created attachment 301304 [details]
dmesg.out

The new 2 patches alone work fine _without_ setting amdgpu.dc=0.
Comment 29 Alex Deucher 2022-06-29 13:00:57 UTC
Do they also work when you set amdgpu.dc=0 or has that always had problems for you?
Comment 30 Harald Judt 2022-06-29 13:30:36 UTC
I will have to test this. I did not know this option existed, what it does, or whether disabling it is good or bad, so I have never used it before.
Comment 31 Harald Judt 2022-06-29 13:40:40 UTC
Created attachment 301306 [details]
dmesg.out

amdgpu.dc=0 doesn't seem to make any practical difference.

BTW: With the new patchset, the machine also no longer VT-switches to the console. I do not care much about that, but the old behaviour has been restored.

This time, the following message did not appear in dmesg.out after resume:
[drm] Fence fallback timer expired on ring sdma0
Comment 32 Harald Judt 2022-07-16 22:39:51 UTC
There still seem to be issues with the shared fb implementation and hibernation. After resume, chvt to another vt causes the following errors in dmesg:

[  975.920944] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[  986.160803] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[  996.400610] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[ 1006.640916] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[ 1016.880923] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[ 1027.120907] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
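
The timestamps above suggest the error repeats at a steady pace. A small Python sketch can make that explicit by computing the gap between consecutive lines; it assumes the usual `[seconds.microseconds]` dmesg prefix shown in this excerpt, and the helper name `intervals` is made up for illustration.

```python
import re

# Leading "[  975.920944]" is the dmesg timestamp in seconds since boot.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def intervals(lines):
    """Return the gaps in seconds between consecutive timestamped lines."""
    stamps = [float(m.group(1)) for m in map(STAMP.match, lines) if m]
    return [round(b - a, 6) for a, b in zip(stamps, stamps[1:])]


MESSAGE = "[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered"
log = [
    "[  975.920944] " + MESSAGE,
    "[  986.160803] " + MESSAGE,
    "[  996.400610] " + MESSAGE,
    "[ 1006.640916] " + MESSAGE,
    "[ 1016.880923] " + MESSAGE,
    "[ 1027.120907] " + MESSAGE,
]

gaps = intervals(log)
print(gaps)
```

For the six lines quoted here, every gap comes out at roughly 10.24 seconds, so the timeout fires periodically rather than in a burst.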

And the machine gets stuck/freezes, except one can ssh in and reboot it. So only the graphical display will not restore properly. Also, if I reboot the machine and the error above does not occur, then the console (not X display) is somehow misaligned, that is, the messages are not printed starting at the leftmost column, but a bit more towards the middle of the screen.
Comment 33 Harald Judt 2022-07-16 22:41:02 UTC
By "if I reboot the machine" I mean: if I shut down or reboot it later after resume, then the shutdown messages get printed in that strange way.
Comment 34 Mario Limonciello (AMD) 2022-08-09 05:04:35 UTC
I recently became aware that the WX3200 in my workstation wasn't working properly after suspend-to-ram.

61:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981] (rev 10)

I ran a bisect, and unfortunately it came up to the commit that was created for this bug (https://github.com/torvalds/linux/commit/3a4b1cc28fbdc2325b3e3ed7d8024995a75f9216)

$ git bisect log
git bisect start '--term-new=broken' '--term-old=good'
# good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
# broken: [8843bf1f0737ecea456d2bbd19d4263d49f2d110] Linux 5.18.16
git bisect broken 8843bf1f0737ecea456d2bbd19d4263d49f2d110
# good: [ffd4c4d5293e4985092ea45ba21cad9326e2e434] drivers: staging: rtl8192e: Fix deadlock in rtllib_beacons_stop()
git bisect good ffd4c4d5293e4985092ea45ba21cad9326e2e434
# good: [164f0714bae175e2f5737070d037d7475417228d] pinctrl: sunxi: a83t: Fix NAND function name for some pins
git bisect good 164f0714bae175e2f5737070d037d7475417228d
# broken: [86fbd2844858c5aef57a28ebc3d53d298f37cc67] x86/retpoline: Use -mfunction-return
git bisect broken 86fbd2844858c5aef57a28ebc3d53d298f37cc67
# broken: [7fc7c6d053cfca70bb81892f3f00937e5c459d5a] arm64: dts: broadcom: bcm4908: Fix cpu node for smp boot
git bisect broken 7fc7c6d053cfca70bb81892f3f00937e5c459d5a
# good: [cd52154b924f2ea05069d4296045d9fd56a8da23] ALSA: hda - Add fixup for Dell Latitidue E5430
git bisect good cd52154b924f2ea05069d4296045d9fd56a8da23
# good: [b8651049bdd77fa652bcf0f3157911a3a6fc4f2f] net/mlx5e: CT: Use own workqueue instead of mlx5e priv
git bisect good b8651049bdd77fa652bcf0f3157911a3a6fc4f2f
# broken: [594cea2c09f7cd440d1ee1c4547d5bc6a646b0e4] netfilter: conntrack: remove the percpu dying list
git bisect broken 594cea2c09f7cd440d1ee1c4547d5bc6a646b0e4
# broken: [cd486308d773d6d062a0140062458b48f8a0eb6b] ASoC: tas2764: Add post reset delays
git bisect broken cd486308d773d6d062a0140062458b48f8a0eb6b
# broken: [4ffcacab7145080187330accafae69e87a481eec] drm/amdgpu/display: disable prefer_shadow for generic fb helpers
git bisect broken 4ffcacab7145080187330accafae69e87a481eec
# good: [16427298f3dc02ec90bdfa31c8ef9b384ea5534a] net/mlx5e: Ring the TX doorbell on DMA errors
git bisect good 16427298f3dc02ec90bdfa31c8ef9b384ea5534a
# good: [27dccf616a0a82f4d8004b7ee04560e7de419e63] drm/amdgpu: keep fbdev buffers pinned during suspend
git bisect good 27dccf616a0a82f4d8004b7ee04560e7de419e63
# first broken commit: [4ffcacab7145080187330accafae69e87a481eec] drm/amdgpu/display: disable prefer_shadow for generic fb helpers

It seems that a revert of that commit isn't the best solution as it's just trading the S3 failure I see for your S4 failure.  But also it seems that from your comment #32 there is still an underlying problem with using the fbdev helper, albeit improved for you in S3.

Would you mind contrasting if S3 is working for you with/without that commit?
Comment 35 Harald Judt 2022-08-09 18:53:00 UTC
I have not had time yet to try any patches, but here are more detailed dmesg messages from when things go awry after resuming from hibernation and vt-switching (see symptoms described above). Maybe they give someone additional hints about what's going wrong:

[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x000cc40c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C0C400C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 0) at page 0, read from 'TC7' (0x54433700) (196)
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0004c40c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040C400C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 0) at page 0, read from 'TC7' (0x54433700) (196)
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x000ac40c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C400C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 5, pasid 0) at page 0, read from 'TC7' (0x54433700) (196)
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0004c40c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040C400C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 0) at page 0, read from 'TC7' (0x54433700) (196)
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0004480c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0404800C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 0) at page 0, read from 'TC0' (0x54433000) (72)
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0004480c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0404800C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 0) at page 0, read from 'TC0' (0x54433000) (72)
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x000a480c
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 5, pasid 0) at page 0, read from 'TC0' (0x54433000) (72)
[drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, but soft recovered
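
One detail of the fault lines above can be checked mechanically: the hex word printed after the client name looks like the same characters packed as ASCII (0x54 is 'T', 0x43 is 'C', 0x37 is '7'). The Python sketch below decodes the word and pulls the vmid out of a fault line. This is an observation about the log text in this report, not a statement about driver internals, and the names `decode_client` and `parse_vm_fault` are hypothetical.

```python
import re

# Captures vmid, client name, and the hex word from a "VM fault" log line.
VM_FAULT = re.compile(
    r"vmid (\d+), pasid \d+\) at page \d+, \w+ from '(\w+)' \((0x[0-9a-fA-F]+)\)"
)


def decode_client(word):
    """Decode a 32-bit value into big-endian ASCII, dropping trailing NULs."""
    raw = word.to_bytes(4, "big")
    return raw.rstrip(b"\x00").decode("ascii")


def parse_vm_fault(line):
    """Return (vmid, client_name, decoded_hex_word) or None if no match."""
    match = VM_FAULT.search(line)
    if not match:
        return None
    vmid, name, word = match.groups()
    return int(vmid), name, decode_client(int(word, 16))


line = (
    "amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 0) "
    "at page 0, read from 'TC7' (0x54433700) (196)"
)
print(parse_vm_fault(line))        # prints (6, 'TC7', 'TC7')
print(decode_client(0x54433000))   # prints TC0
```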

The funny thing was that another X session was still somehow usable (it takes a while to switch to it because of the hangs). But in general, those hangs when vt-switching suck.

I will try to revert all the fbdev patches again to see if that also happens with the old fb code, though I cannot remember it did.

I will also test whether that happens when using only S3 instead of S4.

It will probably take me a few days until I can get to it though.
Comment 36 Harald Judt 2022-08-09 18:55:59 UTC
Maybe these are also relevant, they occur right after resuming, before all those other messages:
[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring sdma0
Comment 37 Mario Limonciello (AMD) 2022-08-09 19:07:24 UTC
Hmm.  I get the impression that those are /probably/ collateral damage from the underlying issue of something not initializing properly during resume from S4.
Comment 38 Alex Deucher 2022-08-09 19:38:22 UTC
(In reply to Harald Judt from comment #36)
> Maybe these are also relevant, they occur right after resuming, before all
> those other messages:
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring sdma0

Those are usually a sign of interrupt issues on the platform.  E.g., interrupts not being properly enabled on the platform after resume.
Comment 39 Harald Judt 2022-08-25 08:37:12 UTC
Switching back to the original amdgpu_fb implementation (I've reverted this and all relevant patches on stable-5.18.19), all remaining issues vanish. After two hibernate/resume cycles, I can still successfully switch vts to the console and back, and nothing freezes.

Regarding the "[drm] Fence fallback timer expired on ring sdma0" messages, they also appear when using the amdgpu_fb implementation.

As for the display corruption on the console, here is some ASCII art of what it looks like - next time I'll try to take a picture.

-----------------------------------------------------------------------
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
|                       Beginning of line starts here not at column 0 |
-----------------------------------------------------------------------

Maybe that can give some clue what else could be wrong.
Comment 40 Harald Judt 2022-10-13 19:30:53 UTC
Any updates on this? Hibernate/resume is still borked after a few cycles (rather sooner than later) and it gets harder and harder to revert these fb patches on newer kernels. The prefer_shadow patch alone does not really fix this.
Comment 41 Alex Deucher 2022-10-13 20:39:14 UTC
Does setting amdgpu.dc=0 make any difference?
Comment 42 Harald Judt 2022-10-15 09:02:07 UTC
Created attachment 303004 [details]
dmesg.out

Here is a new dmesg with linux-stable-5.19.11 and amdgpu.dc=0.

There are a couple more "Fence fallback timer expired on ring sdma0" messages than before, but nothing else directly after resuming.

However, VT-switching to the console and back to X causes the GPU fault in the last lines. I could log in and produce this dmesg, but the machine wouldn't reboot with sudo reboot and I finally had to use sysrq.
Comment 43 Alex Deucher 2022-10-20 14:13:05 UTC
Created attachment 303055 [details]
possible fix

Can you try this patch?  You might also try adding:
https://patchwork.kernel.org/project/dri-devel/patch/20221020103755.24058-2-tzimmermann@suse.de/

If those two don't help, can you try this whole set:
https://patchwork.kernel.org/project/dri-devel/list/?series=687097
Comment 44 Alex Deucher 2022-10-20 19:22:35 UTC
Created attachment 303064 [details]
another fix

You'll probably want this patch too.
Comment 45 Harald Judt 2022-10-20 21:59:11 UTC
Thanks for the new patches.

Applying attachment 303055 and https://patchwork.kernel.org/project/dri-devel/patch/20221020103755.24058-2-tzimmermann@suse.de/ on linux-5.19.16 seemed to have an effect at first. But after a few cycles (I booted into Windows and then resumed Linux again, switched VTs to another X session and consoles and back, then hibernated/resumed again), the machine crashed at resume, with the screen not frozen but dead.

I will try the whole patchset, however it seems I will have to adapt it to 5.19.16 - which needs more work - or update to the latest kernel dev release and then try to apply it on this.

I have not yet tried the last fix, which reverts the prefer_shadow setting for hawaii.
Comment 46 Alex Deucher 2022-10-21 20:11:08 UTC
Please try with the prefer_shadow patch.  It sounds like we may be on the right track.
Comment 47 Harald Judt 2022-10-24 21:19:06 UTC
I have tried attachment 303055 and https://patchwork.kernel.org/project/dri-devel/patch/20221020103755.24058-2-tzimmermann@suse.de/ together with attachment 303064, but unfortunately it made no difference: the screen goes off for a while, the keyboard comes back, but the screen stays dark. I might be able to test the whole patchset in a week or so, but not earlier.
Comment 48 Harald Judt 2022-12-15 20:39:34 UTC
Ok. I am quite sure about this, but I have new observations which could mean that the hibernation issue could be fixed but there is another issue.

First, I have tried 6.1.0. Hibernation/resume seems to work as long as I do not VT-switch; switching causes the "ring gfx timeout, but soft recovered" and "Waiting for fences timed out" messages. I have not tested this thoroughly enough, though.

Second, the
amdgpu: GPU fault detected: 146 0x0006c40c
amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C400C
amdgpu: VM fault (0x0c, vmid 3, pasid 0) at page 0, read from 'TC7' (0x54433700) (196)
and similar messages also happen with 5.18.19 when VT-switching, though *much* more rarely.

It seems these remaining problems are likely more related to VT-switching than having to do with hibernate/resume now. Maybe the new fb code or other changes in the driver just cause these problems to happen more often?
Comment 49 Harald Judt 2022-12-15 20:40:09 UTC
I meant "I am not quite sure about this."
Comment 50 Harald Judt 2022-12-16 08:12:11 UTC
On the other hand, I have not experienced any problems with vt-switching so far _before_ hibernate/resume.
Comment 51 Harald Judt 2023-02-03 11:25:40 UTC
Ok. After some tests with 6.1.8 and later, things seem to be a bit more stable, though I do not know whether the reason is actually the newer kernel or more exact tests on my part.

It now seems that as long as I only switch between X window sessions (VT-switching from one X session to another one), the crash does not occur. As soon as I switch to any non-X VT, the screen stays blank or freezes with the text cursor not visible. I can restart the computer using Ctrl-Alt-Delete, and switching back to another X VT also works, but of course the X session is frozen too.
Comment 52 Harald Judt 2023-05-16 16:21:30 UTC
I believe my last issue regarding hibernation after this rewrite has been fixed somewhere in linux-stable 6.2.13: The desktop no longer freezes when VT-switching after resuming from hibernation. Seems to work fine for some days now.

It could have been this commit as there has not been much else, but I have not investigated nor verified it:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.2.y&id=56a03f64fedf49a4f81c5605167b6e7bb0300a59
Comment 53 Mario Limonciello (AMD) 2023-05-16 17:02:04 UTC
> I believe my last issue regarding hibernation after this rewrite has been
> fixed somewhere in linux-stable 6.2.13: The desktop no longer freezes when
> VT-switching after resuming from hibernation. Seems to work fine for some
> days now.

That's great thanks!

> It could have been this commit as there has not been much else, but I have
> not investigated nor verified it:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.2.y&id=56a03f64fedf49a4f81c5605167b6e7bb0300a59

This seems plausible to me because the GPU does reset on the way down for hibernate and if it's poorly timed it could manifest similarly.
Comment 54 Alex Deucher 2023-05-23 18:21:01 UTC
Does the patch in https://bugzilla.kernel.org/attachment.cgi?id=303064 make a difference?  It would be nice to remove the special treatment for hawaii if it is not required.
Comment 55 Harald Judt 2023-05-26 08:28:48 UTC
Unfortunately, with the patch applied to 6.3.4, the old problem returns: the screen stays dark and the USB keyboard is offline when trying to resume. So it seems special treatment is still required.
Comment 56 Harald Judt 2023-07-06 08:04:29 UTC
I think that the issue with VT-switching is not fixed by the commit, but by another measure, that is switching to a text console before hibernating (the script I use has an option to do this). This seems to prevent the freezes.
Comment 57 Alex Deucher 2023-07-06 14:51:47 UTC
Does this patch help?
https://gitlab.freedesktop.org/drm/amd/uploads/64dc2a05039b583e89da17309969fa50/0001-client-register-fix-plus-fbdev-debug-noise-2.patch

It's pretty noisy.  The meat of the patch is this hunk:

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 76e46713b2f0..5d28c54b2512 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -2634,10 +2678,12 @@ void drm_fbdev_generic_setup(struct drm_device *dev,
 		preferred_bpp = 32;
 	fb_helper->preferred_bpp = preferred_bpp;
 
+	drm_client_register(&fb_helper->client);
+
 	ret = drm_fbdev_client_hotplug(&fb_helper->client);
 	if (ret)
 		drm_dbg_kms(dev, "client hotplug ret=%d\n", ret);
 
-	drm_client_register(&fb_helper->client);
+	drm_warn(dev, "%s:%d\n", __FILE__, __LINE__);
 }
 EXPORT_SYMBOL(drm_fbdev_generic_setup);
Comment 58 Alex Deucher 2023-07-06 15:03:04 UTC
Never mind, sorry, wrong bug.