Created attachment 300180 [details] full kern.log from boot to reboot System: MSI Alpha 15 B5EEK Lapttop with newest BIOS lspci -nn 00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630] 00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631] 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632] 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633] 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632] 00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634] 00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634] 00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634] 00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634] 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632] 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635] 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51) 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51) 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a] 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b] 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c] 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d] 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e] 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f] 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670] 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671] 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3) 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479] 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3) 03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT] [1002:ab28] 04:00.0 Network controller [0280]: MEDIATEK Corp. Device [14c3:0608] 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15) 06:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD [c0a9:2263] (rev 03) 07:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. Device [2646:500c] (rev 01) 08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c5) 08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637] 08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df] 08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639] 08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639] 08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor [1022:15e2] (rev 01) 08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller [1022:15e3] 08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/Renoir Sensor Fusion Hub [1022:15e4] When going into suspend on the above System suspend does not work properly Dec 29 13:12:38 lisa kernel: [ 52.085072] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002 Dec 29 13:12:38 lisa kernel: [ 52.407841] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002 resuming from suspend freezes the System: Dec 29 13:12:38 lisa kernel: [ 53.663566] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR Dec 29 13:12:38 lisa kernel: [ 53.715562] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110) Dec 29 13:12:38 lisa kernel: [ 53.715805] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110 Dec 29 13:12:38 lisa kernel: [ 53.716007] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). Dec 29 13:12:38 lisa kernel: [ 53.716010] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110 Dec 29 13:12:38 lisa kernel: [ 53.716023] amdgpu 0000:08:00.0: PM: failed to resume async: error -110 Dec 29 13:12:38 lisa kernel: [ 53.734885] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available Dec 29 13:12:38 lisa kernel: [ 53.744042] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available Dec 29 13:12:38 lisa kernel: [ 53.744048] amdgpu 0000:03:00.0: amdgpu: SMU is resuming... Dec 29 13:12:38 lisa kernel: [ 53.744053] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000012, smu fw version = 0x003b2200 (59.34.0) Dec 29 13:12:38 lisa kernel: [ 53.744059] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched Dec 29 13:12:38 lisa kernel: [ 53.794983] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully! Suspend and resume used to work with linux.5.15.11, the first bad commit is commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e Author: Qingqing Zhuo <qingqing.zhuo@amd.com> Date: Tue Aug 17 16:29:07 2021 -0400 drm/amd/display: Apply w/a for hard hang on HPD [Why] HPD disable and enable sequences are not mutually exclusive on Linux. For HPDs that spans under 1s (i.e. HPD low = 1s), part of the disable sequence (specifically, a request to SMU to lower refclk) could come right before the call to PHY enablement, causing DMUB to access an irresponsive PHY and thus a hard hang on the system. [How] Disable 48mhz refclk off when there is any HPD status in connected state. Reviewed-by: Hersen Wu <hersenwu@amd.com> Acked-by: Mikita Lipski <mikita.lipski@amd.com> Signed-off-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> .../drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c | 12 ++++++++++- .../amd/display/dc/irq/dcn21/irq_service_dcn21.c | 25 ++++++++++++++++++++++ .../amd/display/dc/irq/dcn21/irq_service_dcn21.h | 2 ++ drivers/gpu/drm/amd/display/dc/irq/irq_service.c | 2 +- drivers/gpu/drm/amd/display/dc/irq/irq_service.h | 4 ++++ 5 files changed, 43 insertions(+), 2 deletions(-)
See also: https://gitlab.freedesktop.org/drm/amd/-/issues/1821
Could you please help add some prints and confirm that the commit doesn't change the code sequence in rn_update_clocks() especially for the following part in suspend and wake sequence: if (display_count == 0 && !hpd_state) { rn_vbios_smu_set_dcn_low_power_state(clk_mgr, DCN_PWR_STATE_LOW_POWER); /* update power state */ clk_mgr_base->clks.pwr_state = DCN_PWR_STATE_LOW_POWER; It would also be appreciated if you could print out the time that was spent in the function rn_update_clocks() with and w/o the commit.
I added printk at the beginnig and exit(s) of rn_clocks_update: This is the result when resuming from suspend on commit d020970959169627d59a711769f8c4b87bf5f90c Dec 30 13:55:42 lisa kernel: [ 140.198849] PM: suspend exit Dec 30 13:55:42 lisa kernel: [ 140.208063] amdgpu: entering rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.209251] amdgpu: leaving rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.211147] amdgpu: entering rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.211148] amdgpu: leaving rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.226830] Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC) Dec 30 13:55:42 lisa kernel: [ 140.277915] amdgpu: entering rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.277925] amdgpu: leaving rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.278233] amdgpu: entering rn_update_clocks ... Dec 30 13:55:42 lisa kernel: [ 140.278234] amdgpu: leaving rn_update_clocks ... In this case the system does not hang, but suspend doesn't seem to work properly, either (Fan and keyboard do not switch off). I probably missed that during the first bisection.
Unfortunately I optimized my .config to make bisections faster, but now I cannot reproduce the Hang with commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e. So now I will start over with the bisection ...
There might also be different suspend errors we're dealing with: linux-5.16-rc7 e.g. seems to hang when going into suspend with the last log message before reboot being: Dec 30 14:53:19 lisa kernel: [ 85.056243] PM: suspend entry (s2idle)
After sorting out some .config troubles I've reverted commit 2a50edbf10c8c56e930bfb53d8f9f00a33fd837e in linux-5.16-rc7 and this seems to cure both the not properly suspending issue (fan and keyboard lights stayed on) and the hang on resuming from suspend issue.
The [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002 are still there, but they don't seem to be harmful, in fact they were there even in linux.5.15.11 where suspend and resume seemed to work.
Just did a bit of experimentation and changed dc_get_hpd_state_dcn21 (in 5.16-rc7) to uint32_t dc_get_hpd_state_dcn21(struct irq_service *irq_service, enum dc_irq_source source) { const struct irq_source_info *info; uint32_t addr; uint32_t value; uint32_t current_status; info = find_irq_source_info(irq_service, source); if (!info) return 0; addr = info->status_reg; if (!addr) return 0; value = dm_read_reg(irq_service->ctx, addr); //current_status = // get_reg_field_value( // value, // HPD0_DC_HPD_INT_STATUS, // DC_HPD_SENSE); //return current_status; return 0; } This also leads to working suspend resume, the get_reg_field_value seems to trigger the problems.
get_reg_field_value(value, HPD0_DC_HPD_INT_STATUS, DC_HPD_SENSE) is just a macro expanding to an inline function, i.e. current_status = (value & HPD0_DC_HPD_INT_STATUS__DC_HPD_SENSE_MASK) >> HPD0_DC_HPD_IN_STATUS__DC_HPD_SENSE__SHIFT; with ..._MASK = 0x2 being 0x2 and ...__SHIFT = 0x1, problems seem to occur when current_status != 0.
current_status != 0 causes rn_update_clocks to behave differently, basically rn_vbios_smu_set_dcn_low_power_state can never get called. I wonder if we want to just not apply this WA as part of the suspend/resume paths, but only during the runtime paths?
Created attachment 300211 [details] dmesg with linux-5.16-rc8 With linux-5.16-rc8, suspend does still not work properly (fan stays on). Resume seemed to work at first, but on the next suspend/resume test I get the usual hang.
Suspend works again if I ignore hpd_state in rn_update_clocks. // if (display_count == 0 && !hpd_state) { if (display_count == 0) {
I added a some dev_info calls: static int amdgpu_pmops_suspend(struct device *dev) { struct drm_device *drm_dev = dev_get_drvdata(dev); struct amdgpu_device *adev = drm_to_adev(drm_dev); int r; if (amdgpu_acpi_is_s0ix_active(adev)) adev->in_s0ix = true; adev->in_s3 = true; r = amdgpu_device_suspend(drm_dev, true); dev_info(adev->dev, "amdgpu_device_suspend(...) = %d\n", r); adev->in_s3 = false; if (r) return r; dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix); if (!adev->in_s0ix) r = amdgpu_asic_reset(adev); return r; } which gives Jan 3 11:13:40 lisa kernel: [ 38.792335] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_suspend(...) = 0 Jan 3 11:13:40 lisa kernel: [ 38.792341] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1 So the gpu is not actually reset ...
Created attachment 300212 [details] dmesg with linux-5.16-rc8 while ignoring adev->in_s0ix Simply ignoring the value of adev->in_s0ix and calling reset anyway does not work either: Jan 3 11:23:51 lisa kernel: [ 32.649142] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_suspend(...) = 0 Jan 3 11:23:51 lisa kernel: [ 32.649146] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1 Jan 3 11:23:51 lisa kernel: [ 32.649149] amdgpu 0000:08:00.0: amdgpu: MODE2 reset Jan 3 11:23:51 lisa kernel: [ 37.404614] amdgpu 0000:08:00.0: amdgpu: Mode2 reset failed! Jan 3 11:23:51 lisa kernel: [ 37.404616] PM: pci_pm_suspend(): amdgpu_pmops_suspend+0x0/0x60 [amdgpu] returns -62 Jan 3 11:23:51 lisa kernel: [ 37.404733] PM: dpm_run_callback(): pci_pm_suspend+0x0/0x150 returns -62
Created attachment 300214 [details] Possible patch that ignores WA in s0ix Actually this is more like what I was suggesting. It effectively doesn't use the WA during s0ix suspend or resume.
Added a dev_info: dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix); if (display_count == 0 && (adev->in_s0ix || !hpd_state)) { with the result: It seems that adev->in_s0ix = 0 during suspend. Only later during resume (after ACPI: EC: interrupt unblocked) do we have adev->in_s0ix = 1. Jan 3 18:00:24 lisa kernel: [ 40.367310] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) Jan 3 18:00:24 lisa kernel: [ 40.437382] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 Jan 3 18:00:27 lisa kernel: [ 44.031004] PM: suspend entry (s2idle) Jan 3 18:00:28 lisa kernel: [ 44.252353] Filesystems sync: 0.221 seconds Jan 3 18:00:58 lisa kernel: [ 44.253185] Freezing user space processes ... (elapsed 0.002 seconds) done. Jan 3 18:00:58 lisa kernel: [ 44.255407] OOM killer disabled. Jan 3 18:00:58 lisa kernel: [ 44.255407] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. Jan 3 18:00:58 lisa kernel: [ 44.256406] printk: Suspending console(s) (use no_console_suspend to debug) Jan 3 18:00:58 lisa kernel: [ 44.277805] queueing ieee80211 work while going to suspend Jan 3 18:00:58 lisa kernel: [ 44.831434] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000). Jan 3 18:00:58 lisa kernel: [ 44.831464] [drm] PSP is resuming... Jan 3 18:00:58 lisa kernel: [ 45.019729] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR Jan 3 18:00:58 lisa kernel: [ 45.091174] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available Jan 3 18:00:58 lisa kernel: [ 45.100301] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available Jan 3 18:00:58 lisa kernel: [ 45.100305] amdgpu 0000:03:00.0: amdgpu: SMU is resuming... Jan 3 18:00:58 lisa kernel: [ 45.100308] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw version = 0x003b2500 (59.37.0) Jan 3 18:00:58 lisa kernel: [ 45.100311] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched Jan 3 18:00:58 lisa kernel: [ 45.151255] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully! Jan 3 18:00:58 lisa kernel: [ 45.152433] [drm] DMUB hardware initialized: version=0x02020007 Jan 3 18:00:58 lisa kernel: [ 45.194058] [drm] kiq ring mec 2 pipe 1 q 0 Jan 3 18:00:58 lisa kernel: [ 45.198331] [drm] VCN decode and encode initialized successfully(under DPG Mode). Jan 3 18:00:58 lisa kernel: [ 45.198603] [drm] JPEG decode initialized successfully. Jan 3 18:00:58 lisa kernel: [ 45.198652] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198654] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198655] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198655] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198656] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198657] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198658] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198660] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198661] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198661] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0 Jan 3 18:00:58 lisa kernel: [ 45.198662] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1 Jan 3 18:00:58 lisa kernel: [ 45.198662] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1 Jan 3 18:00:58 lisa kernel: [ 45.198663] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1 Jan 3 18:00:58 lisa kernel: [ 45.198663] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1 Jan 3 18:00:58 lisa kernel: [ 45.202326] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes Jan 3 18:00:58 lisa kernel: [ 45.527398] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002 Jan 3 18:00:58 lisa kernel: [ 45.844705] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002 Jan 3 18:00:58 lisa kernel: [ 45.889752] [drm] free PSP TMR buffer Jan 3 18:00:58 lisa kernel: [ 45.939048] amdgpu 0000:03:00.0: amdgpu: MODE1 reset Jan 3 18:00:58 lisa kernel: [ 45.939052] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset Jan 3 18:00:58 lisa kernel: [ 45.939132] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset Jan 3 18:00:58 lisa kernel: [ 46.450444] ACPI: EC: interrupt blocked Jan 3 18:00:58 lisa kernel: [ 46.472489] xhci_hcd 0000:08:00.3: refused to change power state from D0 to D3hot Jan 3 18:00:58 lisa kernel: [ 55.081262] ACPI: EC: interrupt unblocked Jan 3 18:00:58 lisa kernel: [ 55.632167] pci 0000:00:00.2: can't derive routing for PCI INT A Jan 3 18:00:58 lisa kernel: [ 55.632172] pci 0000:00:00.2: PCI INT A: no GSI Jan 3 18:00:58 lisa kernel: [ 55.635561] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000). Jan 3 18:00:58 lisa kernel: [ 55.637155] [drm] PSP is resuming... Jan 3 18:00:58 lisa kernel: [ 55.637252] [drm] PCIE GART of 1024M enabled. Jan 3 18:00:58 lisa kernel: [ 55.637254] [drm] PTB located at 0x000000F400900000 Jan 3 18:00:58 lisa kernel: [ 55.637769] amdgpu 0000:08:00.0: amdgpu: SMU is resuming... Jan 3 18:00:58 lisa kernel: [ 55.640076] amdgpu 0000:08:00.0: amdgpu: dpm has been disabled Jan 3 18:00:58 lisa kernel: [ 55.643194] amdgpu 0000:08:00.0: amdgpu: SMU is resumed successfully! Jan 3 18:00:58 lisa kernel: [ 55.649205] [drm] DMUB hardware initialized: version=0x0101001C Jan 3 18:00:58 lisa kernel: [ 55.661720] nvme nvme0: 15/0/0 default/read/poll queues Jan 3 18:00:58 lisa kernel: [ 55.702951] nvme nvme1: Shutdown timeout set to 10 seconds Jan 3 18:00:58 lisa kernel: [ 55.704396] nvme nvme1: 8/0/0 default/read/poll queues Jan 3 18:00:58 lisa kernel: [ 55.737425] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1 Jan 3 18:00:58 lisa kernel: [ 55.828624] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR Jan 3 18:00:58 lisa kernel: [ 55.898619] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available Jan 3 18:00:58 lisa kernel: [ 55.907754] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available Jan 3 18:00:58 lisa kernel: [ 55.907757] amdgpu 0000:03:00.0: amdgpu: SMU is resuming... Jan 3 18:00:58 lisa kernel: [ 55.907761] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw version = 0x003b2500 (59.37.0) Jan 3 18:00:58 lisa kernel: [ 55.907767] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched Jan 3 18:00:58 lisa kernel: [ 55.958092] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully! Jan 3 18:00:58 lisa kernel: [ 55.959299] [drm] DMUB hardware initialized: version=0x02020007 Jan 3 18:00:58 lisa kernel: [ 55.968918] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110) Jan 3 18:00:58 lisa kernel: [ 55.969120] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110 Jan 3 18:00:58 lisa kernel: [ 55.969298] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). Jan 3 18:00:58 lisa kernel: [ 55.969301] PM: dpm_run_callback(): pci_pm_resume+0x0/0xd0 returns -110 Jan 3 18:00:58 lisa kernel: [ 55.969313] amdgpu 0000:08:00.0: PM: failed to resume async: error -110
Created attachment 300215 [details] Possible patch that ignores WA in s0ix w/ debugging statements (v2) OK thanks for trying it! Here's an updated patch, to try to give an extra explicit call to update the clocks again during the suspend path. Also included a debugging statement so we can see the state of the variables that matter in each call to update_clocks. If you can please share the output from your kernel log again with it applied.
The problem seems to be that rn_update_clocks tries to call rn_vbios_smu_set_dcn_low power before adev->in_s0ix is set to 1 in amdpgu_pmops_suspend: Changes to rn_update_clocks: dev_info(adev->dev, "adev->in_s0ix = %d\n", adev->in_s0ix); //if (display_count == 0 && (adev->in_s0ix || !hpd_state)) { if (display_count == 0) { dev_info(adev->dev, "calling rn_vbios_smu_set_dcn_low_power\n"); rn_vbios_smu_set_dcn_low_power_state(clk_mgr, DCN_PWR_STATE_LOW_POWER); Changes to amdgpu_pmops_suspend: if (amdgpu_acpi_is_s0ix_active(adev)) adev->in_s0ix = true; dev_info(adev->dev, "amdgpu_pmops_suspend: adev->in_s0ix = %d\n", adev->in_s0ix); [ 36.010259] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) [ 36.043902] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 [ 36.043904] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power [ 40.015570] PM: suspend entry (s2idle) [ 40.239013] Filesystems sync: 0.223 seconds [ 40.239912] Freezing user space processes ... (elapsed 0.002 seconds) done. [ 40.242407] OOM killer disabled. [ 40.242407] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [ 40.243513] printk: Suspending console(s) (use no_console_suspend to debug) [ 40.246850] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1 [ 40.266376] queueing ieee80211 work while going to suspend
Created attachment 300222 [details] debugging patch I added some more dev_info calls to figure out where things are called: [ 34.946911] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) [ 35.079021] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn10_set_clock [ 35.079026] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 [ 35.079027] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power [ 39.030889] PM: suspend entry (s2idle) [ 39.270864] Filesystems sync: 0.239 seconds [ 39.271929] Freezing user space processes ... (elapsed 0.002 seconds) done. [ 39.274694] OOM killer disabled. The WLAN message is the first message one gets after initiating suspend. The interesting thing is that rn_vbios_set_dc_low_power is called before suspend entry (and there's also a long 4s gap). I also have no idea where dcn10_set_clock is called. The only place I found is from within dc_set_clock, but dc_set_clock is called nowhere in the entire amdgpu module.
Created attachment 300223 [details] corrected debugging patch The patch above contained an error rn_update_clocks is actually called from dcn20_optimize_bandwidth
Created attachment 300224 [details] new debuggin patch Seems the call trace of rn_vbios_smu_set_dcn_low_power is amdgpu_dm_atomic_commit_tail -> dc_commit_state -> dc_commit_state_nocheck -> dcn20_optimize_bandwith -> rn_update_clocks -> rn_vbios_smu_set_dcn_low_power 53.562322] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail [ 53.562426] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail [ 53.563183] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) [ 53.673655] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check [ 53.673669] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth [ 53.673672] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 [ 53.673673] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power [ 53.674562] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail This seems to be not really connected to suspend.
I get similar message if I just lock the screen which then switches off: [ 1732.526162] amdgpu 0000:08:00.0: amdgpu: Calling dc_commit_state from amdgpu_dm_atomic_commit_tail [ 1732.640992] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check [ 1732.641005] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth [ 1732.641007] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 [ 1732.641008] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
Did you try out my patch? It should have given an extra explicit call in the suspend path.
That's it! The extra call to clk_mgr_optimize_pwr_state now leads to rn_update_clock being called with adev->in_s0ix = 1: [ 31.142514] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) [ 31.176805] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check [ 31.176815] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth [ 31.176818] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0 [ 35.056209] PM: suspend entry (s2idle) [ 36.535688] Filesystems sync: 1.479 seconds [ 36.536404] Freezing user space processes ... (elapsed 0.024 seconds) done. [ 36.561024] OOM killer disabled. [ 36.561025] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [ 36.562212] printk: Suspending console(s) (use no_console_suspend to debug) [ 36.566381] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1 [ 36.566557] amdgpu 0000:08:00.0: amdgpu: calling update_clocks from dcn21_optimize_pwr_state [ 36.566567] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1 [ 36.566569] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power
Can you check the state of the other variables though with that extra call? That was in my debugging patch but not inyours. Is the HPD active?
For the extra call hpd_state = 0: [ 30.684201] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local choice (Reason: 3=DEAUTH_LEAVING) [ 30.794571] amdgpu 0000:08:00.0: amdgpu: Calling optimize_bandwidth from dc_commit_state_no_check [ 30.794609] amdgpu 0000:08:00.0: amdgpu: Calling update_clocks from dcn20_optimize_bandwidth [ 30.794613] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 0, hpd_state = 1 [ 35.026604] PM: suspend entry (s2idle) [ 35.256153] Filesystems sync: 0.229 seconds [ 35.257044] Freezing user space processes ... (elapsed 0.002 seconds) done. [ 35.259636] OOM killer disabled. [ 35.259636] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [ 35.260642] printk: Suspending console(s) (use no_console_suspend to debug) [ 35.264212] amdgpu 0000:08:00.0: amdgpu: amdgpu_pmops_suspend: adev->in_s0ix = 1 [ 35.264369] amdgpu 0000:08:00.0: amdgpu: calling update_clocks from dcn21_optimize_pwr_state [ 35.264379] amdgpu 0000:08:00.0: amdgpu: adev->in_s0ix = 1, hpd_state = 0 [ 35.264381] amdgpu 0000:08:00.0: amdgpu: calling rn_vbios_smu_set_dcn_low_power So one could go back to the old version of rn_update_clocks with if (display_count == 0 && !hpd_state) {
Great! OK, let me send that out for review then and see what folks think.
As this is now in 5.17-rc1: https://github.com/torvalds/linux/commit/c4849f88164b13dd141885e28210f599741b304b this issue can be closed.