Bug 215436

Summary: admgpu: suspend and resuming from suspend don't work
Product: Drivers Reporter: spasswolf
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX    
Severity: normal CC: alexdeucher, juha-matti.rensujeff, mario.limonciello, superm1
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.16-rc{1-8} Subsystem:
Regression: No Bisected commit-id:
Attachments: full kern.log from boot to reboot
dmesg with linux-5.16-rc8
dmesg with linux-5.16-rc8 while ignoring adev->in_s0ix
Possible patch that ignores WA in s0ix
Possible patch that ignores WA in s0ix w/ debugging statements (v2)
debugging patch
corrected debugging patch
new debuggin patch

Description spasswolf 2021-12-29 12:28:51 UTC
Created attachment 300180 [details]
full kern.log from boot to reboot

System:
MSI Alpha 15 B5EEK Lapttop with newest BIOS
lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT] [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. Device [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD [c0a9:2263] (rev 03)
07:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. Device [2646:500c] (rev 01)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/Renoir Sensor Fusion Hub [1022:15e4]
When going into suspend on the above System suspend does not work properly
Dec 29 13:12:38 lisa kernel: [   52.085072] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Dec 29 13:12:38 lisa kernel: [   52.407841] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
resuming from suspend freezes the System:
Dec 29 13:12:38 lisa kernel: [   53.663566] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
Dec 29 13:12:38 lisa kernel: [   53.715562] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Dec 29 13:12:38 lisa kernel: [   53.715805] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
Dec 29 13:12:38 lisa kernel: [   53.716007] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
Dec 29 13:12:38 lisa kernel: [   53.716010] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
Dec 29 13:12:38 lisa kernel: [   53.716023] amdgpu 0000:08:00.0: PM: failed to resume async: error -110
Dec 29 13:12:38 lisa kernel: [   53.734885] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Dec 29 13:12:38 lisa kernel: [   53.744042] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Dec 29 13:12:38 lisa kernel: [   53.744048] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Dec 29 13:12:38 lisa kernel: [   53.744053] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000012, smu fw version = 0x003b2200 (59.34.0)
Dec 29 13:12:38 lisa kernel: [   53.744059] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Dec 29 13:12:38 lisa kernel: [   53.794983] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Suspend and resume used to work with linux.5.15.11, the first bad commit is
commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e
Author: Qingqing Zhuo <qingqing.zhuo@amd.com>
Date:   Tue Aug 17 16:29:07 2021 -0400

    drm/amd/display: Apply w/a for hard hang on HPD
    
    [Why]
    HPD disable and enable sequences are not mutually exclusive
    on Linux. For HPDs that spans under 1s (i.e. HPD low = 1s),
    part of the disable sequence (specifically, a request to SMU
    to lower refclk) could come right before the call to PHY
    enablement, causing DMUB to access an irresponsive PHY
    and thus a hard hang on the system.
    
    [How]
    Disable 48mhz refclk off when there is any HPD status in
    connected state.
    
    Reviewed-by: Hersen Wu <hersenwu@amd.com>
    Acked-by: Mikita Lipski <mikita.lipski@amd.com>
    Signed-off-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
    Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 .../drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c  | 12 ++++++++++-
 .../amd/display/dc/irq/dcn21/irq_service_dcn21.c   | 25 ++++++++++++++++++++++
 .../amd/display/dc/irq/dcn21/irq_service_dcn21.h   |  2 ++
 drivers/gpu/drm/amd/display/dc/irq/irq_service.c   |  2 +-
 drivers/gpu/drm/amd/display/dc/irq/irq_service.h   |  4 ++++
 5 files changed, 43 insertions(+), 2 deletions(-)
Comment 1 Alex Deucher 2021-12-29 15:08:27 UTC
See also:
https://gitlab.freedesktop.org/drm/amd/-/issues/1821
Comment 2 Alex Deucher 2021-12-29 15:10:31 UTC
Could you please help add some prints and confirm that the commit doesn't change the code sequence in rn_update_clocks() especially for the following part in suspend and wake sequence:
if (display_count == 0 && !hpd_state) {
rn_vbios_smu_set_dcn_low_power_state(clk_mgr, DCN_PWR_STATE_LOW_POWER);
/* update power state */
clk_mgr_base->clks.pwr_state = DCN_PWR_STATE_LOW_POWER;
It would also be appreciated if you could print out the time that was spent in the function rn_update_clocks() with and w/o the commit.
Comment 3 spasswolf 2021-12-30 13:03:46 UTC
I added printk at the beginnig and exit(s) of rn_clocks_update:
This is the result when resuming from suspend on commit d020970959169627d59a711769f8c4b87bf5f90c

Dec 30 13:55:42 lisa kernel: [  140.198849] PM: suspend exit
Dec 30 13:55:42 lisa kernel: [  140.208063] amdgpu: entering rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.209251] amdgpu: leaving rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.211147] amdgpu: entering rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.211148] amdgpu: leaving rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.226830] Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC)
Dec 30 13:55:42 lisa kernel: [  140.277915] amdgpu: entering rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.277925] amdgpu: leaving rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.278233] amdgpu: entering rn_update_clocks ...
Dec 30 13:55:42 lisa kernel: [  140.278234] amdgpu: leaving rn_update_clocks ...

In this case the system does not hang, but suspend doesn't seem to work properly, either (Fan and keyboard do not switch off). I probably missed that during the first bisection.
Comment 4 spasswolf 2021-12-30 13:16:41 UTC
Unfortunately I optimized my .config to make bisections faster, but now I cannot reproduce the Hang with commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e.
So now I will start over with the bisection ...
Comment 5 spasswolf 2021-12-30 14:33:14 UTC
There might also be different suspend errors we're dealing with:
linux-5.16-rc7 e.g. seems to hang when going into suspend with the last log message before reboot being:
Dec 30 14:53:19 lisa kernel: [   85.056243] PM: suspend entry (s2idle)
Comment 6 spasswolf 2021-12-31 00:17:17 UTC
After sorting out some .config troubles I've reverted commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e in linux-5.16-rc7 and this seems to cure both the not properly suspending issue (fan and keyboard lights stayed on) and the hang on resuming from suspend issue.
Comment 7 spasswolf 2021-12-31 00:22:30 UTC
The
 [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
are still there, but they don't seem to be harmful, in fact they were there even
 in linux.5.15.11 where suspend and resume seemed to work.
Comment 8 spasswolf 2021-12-31 15:10:50 UTC
Just did a bit of experimentation and changed dc_get_hpd_state_dcn21 (in 5.16-rc7) to

uint32_t dc_get_hpd_state_dcn21(struct irq_service *irq_service, enum dc_irq_source source)
{
	const struct irq_source_info *info;
	uint32_t addr;
	uint32_t value;
	uint32_t current_status;

	info = find_irq_source_info(irq_service, source);
	if (!info)
		return 0;

	addr = info->status_reg;
	if (!addr)
		return 0;

	value = dm_read_reg(irq_service->ctx, addr);
	//current_status =
	//	get_reg_field_value(
	//		value,
	//		HPD0_DC_HPD_INT_STATUS,
	//		DC_HPD_SENSE);

	//return current_status;
	return 0;
}
This also leads to working suspend resume, the get_reg_field_value seems to trigger the problems.
Comment 9 spasswolf 2021-12-31 16:14:03 UTC
get_reg_field_value(value, HPD0_DC_HPD_INT_STATUS, DC_HPD_SENSE)
is just a macro expanding to an inline function, i.e.
current_status = (value & HPD0_DC_HPD_INT_STATUS__DC_HPD_SENSE_MASK) >> HPD0_DC_HPD_IN_STATUS__DC_HPD_SENSE__SHIFT;
with ..._MASK = 0x2 being 0x2 and ...__SHIFT = 0x1, problems seem to occur when current_status != 0.
Comment 10 Mario Limonciello 2022-01-01 01:42:19 UTC
current_status != 0 causes rn_update_clocks to behave differently, basically rn_vbios_smu_set_dcn_low_power_state can never get called.

I wonder if we want to just not apply this WA as part of the suspend/resume paths, but only during the runtime paths?
Comment 11 spasswolf 2022-01-03 09:29:51 UTC
Created attachment 300211 [details]
dmesg with linux-5.16-rc8

With linux-5.16-rc8, suspend does still not work properly (fan stays on). Resume seemed to work at first, but on the next suspend/resume test I get the usual hang.
Comment 12 spasswolf 2022-01-03 09:35:56 UTC
Suspend works again if I ignore hpd_state in rn_update_clocks.
	// if (display_count == 0 && !hpd_state) {
	if (display_count == 0) {
Comment 13 spasswolf 2022-01-03 10:20:53 UTC
I added a some dev_info calls:

static int amdgpu_pmops_suspend(struct device *dev)
{
	struct drm_device *drm_dev = dev_get_drvdata(dev);
	struct amdgpu_device *adev = drm_to_adev(drm_dev);
	int r;

	if (amdgpu_acpi_is_s0ix_active(adev))
		adev->in_s0ix = true;
	adev->in_s3 = true;
	r = amdgpu_device_suspend(drm_dev, true);
	dev_info(adev->dev, "amdgpu_device_suspend(...) = %d\n", r);
	adev->in_s3 = false;
	if (r)
		return r;
	dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix);
	if (!adev->in_s0ix)
		r = amdgpu_asic_reset(adev);
	return r;
}

which gives

Jan  3 11:13:40 lisa kernel: [   38.792335] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_suspend(...) = 0
Jan  3 11:13:40 lisa kernel: [   38.792341] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1

So the gpu is not actually reset ...
Comment 14 spasswolf 2022-01-03 10:29:48 UTC
Created attachment 300212 [details]
dmesg with linux-5.16-rc8 while ignoring adev->in_s0ix

Simply ignoring the value of adev->in_s0ix and calling reset anyway does not work either:
Jan  3 11:23:51 lisa kernel: [   32.649142] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_suspend(...) = 0
Jan  3 11:23:51 lisa kernel: [   32.649146] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1
Jan  3 11:23:51 lisa kernel: [   32.649149] amdgpu 0000:08:00.0: amdgpu: MODE2 reset
Jan  3 11:23:51 lisa kernel: [   37.404614] amdgpu 0000:08:00.0: amdgpu: Mode2 reset failed!
Jan  3 11:23:51 lisa kernel: [   37.404616] PM: pci_pm_suspend(): amdgpu_pmops_suspend+0x0/0x60 [amdgpu] returns -62
Jan  3 11:23:51 lisa kernel: [   37.404733] PM: dpm_run_callback(): pci_pm_suspend+0x0/0x150 returns -62
Comment 15 Mario Limonciello (AMD) 2022-01-03 15:48:01 UTC
Created attachment 300214 [details]
Possible patch that ignores WA in s0ix

Actually this is more like what I was suggesting.  It effectively doesn't use the WA during s0ix suspend or resume.
Comment 16 spasswolf 2022-01-03 17:12:38 UTC
Added a dev_info:

dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix);
if (display_count == 0 && (adev->in_s0ix || !hpd_state)) {

with the result:
It seems that adev->in_s0ix = 0 during suspend. Only later during resume (after ACPI: EC: interrupt unblocked) do we have adev->in_s0ix = 1. 

Jan  3 18:00:24 lisa kernel: [   40.367310] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
Jan  3 18:00:24 lisa kernel: [   40.437382] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
Jan  3 18:00:27 lisa kernel: [   44.031004] PM: suspend entry (s2idle)
Jan  3 18:00:28 lisa kernel: [   44.252353] Filesystems sync: 0.221 seconds
Jan  3 18:00:58 lisa kernel: [   44.253185] Freezing user space processes ... (elapsed 0.002 seconds) done.
Jan  3 18:00:58 lisa kernel: [   44.255407] OOM killer disabled.
Jan  3 18:00:58 lisa kernel: [   44.255407] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
Jan  3 18:00:58 lisa kernel: [   44.256406] printk: Suspending console(s) (use no_console_suspend to debug)
Jan  3 18:00:58 lisa kernel: [   44.277805] queueing ieee80211 work while going to suspend
Jan  3 18:00:58 lisa kernel: [   44.831434] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
Jan  3 18:00:58 lisa kernel: [   44.831464] [drm] PSP is resuming...
Jan  3 18:00:58 lisa kernel: [   45.019729] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
Jan  3 18:00:58 lisa kernel: [   45.091174] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan  3 18:00:58 lisa kernel: [   45.100301] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan  3 18:00:58 lisa kernel: [   45.100305] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Jan  3 18:00:58 lisa kernel: [   45.100308] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw version = 0x003b2500 (59.37.0)
Jan  3 18:00:58 lisa kernel: [   45.100311] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Jan  3 18:00:58 lisa kernel: [   45.151255] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Jan  3 18:00:58 lisa kernel: [   45.152433] [drm] DMUB hardware initialized: version=0x02020007
Jan  3 18:00:58 lisa kernel: [   45.194058] [drm] kiq ring mec 2 pipe 1 q 0
Jan  3 18:00:58 lisa kernel: [   45.198331] [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan  3 18:00:58 lisa kernel: [   45.198603] [drm] JPEG decode initialized successfully.
Jan  3 18:00:58 lisa kernel: [   45.198652] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198654] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198655] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198655] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198656] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198657] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198658] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198660] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198661] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198661] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Jan  3 18:00:58 lisa kernel: [   45.198662] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
Jan  3 18:00:58 lisa kernel: [   45.198662] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
Jan  3 18:00:58 lisa kernel: [   45.198663] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
Jan  3 18:00:58 lisa kernel: [   45.198663] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Jan  3 18:00:58 lisa kernel: [   45.202326] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
Jan  3 18:00:58 lisa kernel: [   45.527398] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Jan  3 18:00:58 lisa kernel: [   45.844705] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Jan  3 18:00:58 lisa kernel: [   45.889752] [drm] free PSP TMR buffer
Jan  3 18:00:58 lisa kernel: [   45.939048] amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Jan  3 18:00:58 lisa kernel: [   45.939052] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Jan  3 18:00:58 lisa kernel: [   45.939132] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Jan  3 18:00:58 lisa kernel: [   46.450444] ACPI: EC: interrupt blocked
Jan  3 18:00:58 lisa kernel: [   46.472489] xhci_hcd 0000:08:00.3: refused to change power state from D0 to D3hot
Jan  3 18:00:58 lisa kernel: [   55.081262] ACPI: EC: interrupt unblocked
Jan  3 18:00:58 lisa kernel: [   55.632167] pci 0000:00:00.2: can't derive routing for PCI INT A
Jan  3 18:00:58 lisa kernel: [   55.632172] pci 0000:00:00.2: PCI INT A: no GSI
Jan  3 18:00:58 lisa kernel: [   55.635561] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
Jan  3 18:00:58 lisa kernel: [   55.637155] [drm] PSP is resuming...
Jan  3 18:00:58 lisa kernel: [   55.637252] [drm] PCIE GART of 1024M enabled.
Jan  3 18:00:58 lisa kernel: [   55.637254] [drm] PTB located at 0x000000F400900000
Jan  3 18:00:58 lisa kernel: [   55.637769] amdgpu 0000:08:00.0: amdgpu: SMU is resuming...
Jan  3 18:00:58 lisa kernel: [   55.640076] amdgpu 0000:08:00.0: amdgpu: dpm has been disabled
Jan  3 18:00:58 lisa kernel: [   55.643194] amdgpu 0000:08:00.0: amdgpu: SMU is resumed successfully!
Jan  3 18:00:58 lisa kernel: [   55.649205] [drm] DMUB hardware initialized: version=0x0101001C
Jan  3 18:00:58 lisa kernel: [   55.661720] nvme nvme0: 15/0/0 default/read/poll queues
Jan  3 18:00:58 lisa kernel: [   55.702951] nvme nvme1: Shutdown timeout set to 10 seconds
Jan  3 18:00:58 lisa kernel: [   55.704396] nvme nvme1: 8/0/0 default/read/poll queues
Jan  3 18:00:58 lisa kernel: [   55.737425] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1
Jan  3 18:00:58 lisa kernel: [   55.828624] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
Jan  3 18:00:58 lisa kernel: [   55.898619] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan  3 18:00:58 lisa kernel: [   55.907754] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan  3 18:00:58 lisa kernel: [   55.907757] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Jan  3 18:00:58 lisa kernel: [   55.907761] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw version = 0x003b2500 (59.37.0)
Jan  3 18:00:58 lisa kernel: [   55.907767] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Jan  3 18:00:58 lisa kernel: [   55.958092] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Jan  3 18:00:58 lisa kernel: [   55.959299] [drm] DMUB hardware initialized: version=0x02020007
Jan  3 18:00:58 lisa kernel: [   55.968918] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Jan  3 18:00:58 lisa kernel: [   55.969120] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
Jan  3 18:00:58 lisa kernel: [   55.969298] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
Jan  3 18:00:58 lisa kernel: [   55.969301] PM: dpm_run_callback(): pci_pm_resume+0x0/0xd0 returns -110
Jan  3 18:00:58 lisa kernel: [   55.969313] amdgpu 0000:08:00.0: PM: failed to resume async: error -110
Comment 17 Mario Limonciello (AMD) 2022-01-03 17:49:28 UTC
Created attachment 300215 [details]
Possible patch that ignores WA in s0ix w/ debugging statements (v2)

OK thanks for trying it!  Here's an updated patch, to try to give an extra explicit call to update the clocks again during the suspend path.  Also included a debugging statement so we can see the state of the variables that matter in each call to update_clocks.

If you can please share the output from your kernel log again with it applied.
Comment 18 spasswolf 2022-01-03 17:51:56 UTC
The problem seems to be that rn_update_clocks tries to call rn_vbios_smu_set_dcn_low power before adev->in_s0ix is set to 1 in amdpgu_pmops_suspend:

Changes to rn_update_clocks:
dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix);
//if (display_count == 0 && (adev->in_s0ix || !hpd_state)) {
if (display_count == 0) {
	dev_info(adev->dev, "calling rn_vbios_smu_set_dcn_low_power\n");
	rn_vbios_smu_set_dcn_low_power_state(clk_mgr, DCN_PWR_STATE_LOW_POWER);
Changes to amdgpu_pmops_suspend:
if (amdgpu_acpi_is_s0ix_active(adev))
	adev->in_s0ix = true;
dev_info(adev->dev, "amdgpu_pmops_suspend: adev->in_s0ix = %d\n", adev->in_s0ix);


[   36.010259] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
[   36.043902] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
[   36.043904] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
[   40.015570] PM: suspend entry (s2idle)
[   40.239013] Filesystems sync: 0.223 seconds
[   40.239912] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   40.242407] OOM killer disabled.
[   40.242407] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   40.243513] printk: Suspending console(s) (use no_console_suspend to debug)
[   40.246850] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1
[   40.266376] queueing ieee80211 work while going to suspend
Comment 19 spasswolf 2022-01-04 10:17:36 UTC
Created attachment 300222 [details]
debugging patch

I added some more dev_info calls to figure out where things are called:
[   34.946911] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
[   35.079021] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn10_set_clock
[   35.079026] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
[   35.079027] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
[   39.030889] PM: suspend entry (s2idle)
[   39.270864] Filesystems sync: 0.239 seconds
[   39.271929] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   39.274694] OOM killer disabled.
The WLAN message is the first message one gets after initiating suspend. The interesting thing is that rn_vbios_set_dc_low_power is called before suspend entry (and there's also a long 4s gap).
   I also have no idea where dcn10_set_clock is called. The only place I found is from within dc_set_clock, but dc_set_clock is called nowhere in the entire amdgpu module.
Comment 20 spasswolf 2022-01-04 10:25:25 UTC
Created attachment 300223 [details]
corrected debugging patch

The patch above contained an error
rn_update_clocks is actually called from dcn20_optimize_bandwidth
Comment 21 spasswolf 2022-01-04 11:15:11 UTC
Created attachment 300224 [details]
new debuggin patch

Seems the call trace of rn_vbios_smu_set_dcn_low_power is amdgpu_dm_atomic_commit_tail -> dc_commit_state -> dc_commit_state_nocheck -> dcn20_optimize_bandwith -> rn_update_clocks -> rn_vbios_smu_set_dcn_low_power

  53.562322] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail
[   53.562426] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail
[   53.563183] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
[   53.673655] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check
[   53.673669] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth
[   53.673672] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
[   53.673673] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
[   53.674562] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail

This seems to be not really connected to suspend.
Comment 22 spasswolf 2022-01-04 11:27:37 UTC
I get similar message if I just lock the screen which then switches off:
[ 1732.526162] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail
[ 1732.640992] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check
[ 1732.641005] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth
[ 1732.641007] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
[ 1732.641008] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
Comment 23 Mario Limonciello (AMD) 2022-01-04 14:14:14 UTC
Did you try out my patch?  It should have given an extra explicit call in the suspend path.
Comment 24 spasswolf 2022-01-04 14:59:52 UTC
That's it! The extra call to clk_mgr_optimize_pwr_state now leads to rn_update_clock being called with adev->in_s0ix = 1:


[   31.142514] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
[   31.176805] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check
[   31.176815] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth
[   31.176818] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0
[   35.056209] PM: suspend entry (s2idle)
[   36.535688] Filesystems sync: 1.479 seconds
[   36.536404] Freezing user space processes ... (elapsed 0.024 seconds) done.
[   36.561024] OOM killer disabled.
[   36.561025] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   36.562212] printk: Suspending console(s) (use no_console_suspend to debug)
[   36.566381] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1
[   36.566557] amdgpu 0000:08:00.0: amdgpu: calling update_clocks from dcn21_optimize_pwr_state
[   36.566567] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1
[   36.566569] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
Comment 25 Mario Limonciello (AMD) 2022-01-04 15:03:44 UTC
Can you check the state of the other variables though with that extra call?  That was in my debugging patch but not inyours.
Is the HPD active?
Comment 26 spasswolf 2022-01-04 15:27:38 UTC
For the extra call hpd_state = 0:
[   30.684201] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING)
[   30.794571] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check
[   30.794609] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth
[   30.794613] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0, hpd_state = 1
[   35.026604] PM: suspend entry (s2idle)
[   35.256153] Filesystems sync: 0.229 seconds
[   35.257044] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   35.259636] OOM killer disabled.
[   35.259636] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   35.260642] printk: Suspending console(s) (use no_console_suspend to debug)
[   35.264212] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1
[   35.264369] amdgpu 0000:08:00.0: amdgpu: calling update_clocks from dcn21_optimize_pwr_state
[   35.264379] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1, hpd_state = 0
[   35.264381] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power

So one could go back to the old version of rn_update_clocks with		if (display_count == 0 && !hpd_state) {
Comment 27 Mario Limonciello (AMD) 2022-01-04 15:28:40 UTC
Great! 
OK, let me send that out for review then and see what folks think.
Comment 28 Mario Limonciello (AMD) 2022-01-24 22:45:42 UTC
As this is now in 5.17-rc1: https://github.com/torvalds/linux/commit/c4849f88164b13dd141885e28210f599741b304b this issue can be closed.