Bug 214091

Summary: Failed to do s2idle on AMD Cezanne platform
Product: Platform Specific/Hardware
Reporter: KaiChuan-Hsieh (kaichuan.hsieh)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX
Severity: blocking
CC: fabriziobertocci, koba.ko, mario.limonciello, vicamo, yihunglin
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.14-rc6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel log with capture after device hang
5.14-rc7 s2idle hang kernel log
5.14-rc7 s2idle dyndbg enable
acpidump of the system
boot log with acpi dyndbg enabled
lspci log of the system

Description KaiChuan-Hsieh 2021-08-18 07:54:38 UTC
Created attachment 298343 [details]
kernel log with capture after device hang

The AMD Cezanne platform fails to suspend when mem_sleep is set to s2idle.
The system hangs, and the power key must be pressed to boot it again.
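To confirm which suspend mode is actually selected, the contents of /sys/power/mem_sleep can be inspected: the kernel lists the supported modes and wraps the active one in square brackets. The following is a minimal sketch, not part of the original report; the helper name parse_mem_sleep is made up for illustration.

```python
from pathlib import Path

MEM_SLEEP = Path("/sys/power/mem_sleep")


def parse_mem_sleep(text):
    """Return (active_mode, all_modes) from the contents of /sys/power/mem_sleep.

    The kernel separates supported modes with spaces and marks the selected
    one with square brackets, for example "[s2idle] deep".
    """
    active = None
    names = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        names.append(token)
    return active, names


if __name__ == "__main__":
    try:
        active, modes = parse_mem_sleep(MEM_SLEEP.read_text())
        print(f"active: {active}, supported: {', '.join(modes)}")
    except OSError as exc:
        print(f"cannot read {MEM_SLEEP}: {exc}")
```

Writing the string s2idle to the same file (as root) selects that mode for the next suspend.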

The driver reports the following errors when attempting s2idle:

kernel: amd_pmc AMDI0005:00: SMU response timed out
kernel: amd_pmc AMDI0005:00: suspend failed
kernel: PM: dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110
kernel: amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110
kernel: PM: noirq suspend of devices failed
kernel: pci 0000:00:00.2: can't derive routing for PCI INT A
kernel: pci 0000:00:00.2: PCI INT A: no GSI
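The -110 in the log above is a negated errno value. On Linux, errno 110 is ETIMEDOUT, which matches the "SMU response timed out" message from amd_pmc. The sketch below pulls the code out of such log lines and looks up its symbolic name; the helper names are illustrative, and the name lookup depends on the platform the script runs on.

```python
import errno
import os
import re

# Matches the "returns -110" and "error -110" forms seen in the PM core messages.
_CODE = re.compile(r"(?:returns|error) (-\d+)")


def extract_error_code(line):
    """Return the negative error code found in a kernel log line, or None."""
    match = _CODE.search(line)
    return int(match.group(1)) if match else None


def describe(code):
    """Map a negated errno to 'NAME: message' using the local platform's table."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    line = "kernel: amd_pmc AMDI0005:00: PM: failed to suspend noirq: error -110"
    code = extract_error_code(line)
    if code is not None:
        print(code, describe(code))
```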
Comment 1 Mario Limonciello (AMD) 2021-08-23 13:38:29 UTC
Please try with 5.14-rc7 or later; 5.11 doesn't have all the s2i patches.
Comment 2 Mario Limonciello (AMD) 2021-08-23 13:43:00 UTC
Specifically rc7 contains https://github.com/torvalds/linux/commit/4753b46e16073c3100551a61024989d50f5e4874 which may be important for this system.
Comment 3 KaiChuan-Hsieh 2021-08-23 14:18:51 UTC
Created attachment 298435 [details]
5.14-rc7 s2idle hang kernel log

With the 5.14-rc7 kernel, there are fewer log messages than with 5.14-rc6.

I can see the s2idle entry, then the device hangs. I need to press the power key to boot.

Aug 23 22:07:51 u-Inspiron-15-3525 kernel: [   12.832308] Bluetooth: RFCOMM ver 1.11
Aug 23 22:07:51 u-Inspiron-15-3525 kernel: [   13.009304] rfkill: input handler enabled
Aug 23 22:07:53 u-Inspiron-15-3525 kernel: [   14.306839] rfkill: input handler disabled
Aug 23 22:09:07 u-Inspiron-15-3525 kernel: [   88.802924] wlp2s0: deauthenticating from 24:4b:fe:25:a7:ec by local choice (Reason: 3=DEAUTH_LEAVING)
Aug 23 22:09:07 u-Inspiron-15-3525 kernel: [   88.830493] ath10k_pci 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xffef5e50 flags=0x0070]
Aug 23 22:09:08 u-Inspiron-15-3525 kernel: [   89.907796] PM: suspend entry (s2idle)

I've tried adding the module parameter amd_pmc.dyndbg=+pt, but there is still no significant error message. The system hangs so quickly that journald can't even capture the log.

Do you have a way to dump more errors for debugging?
Comment 4 Mario Limonciello (AMD) 2021-08-23 14:27:04 UTC
>I can see the s2idle entry, then the device hangs. I need to press the power key to boot.

When you say it's hanging, do you know it's actually hung, or it can't "wakeup"?  The difference here is whether it's a problem going down or back up.

Can you try other sources for wakeup like lid, keyboard, xhci?

>Aug 23 22:09:07 u-Inspiron-15-3525 kernel: [   88.830493] ath10k_pci
>0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b
>address=0xffef5e50 flags=0x0070]

As the ath10k card is causing a page fault right before, can you remove the ath10k card from the system and see if the hang keeps happening?

>Aug 23 21:48:48 u-Inspiron-15-3525 kernel: [    0.738388] ACPI Error: Aborting
>method \_SB.GPIO._EVT due to previous error (AE_NOT_EXIST)
>(20210604/psparse-529)

Particularly worrying is this - if ASL GPIO events have an interpreter problem, then they might not be configured properly.

>Do you have a way to dump more errors for debugging?

Can you turn on dynamic debugging for uPEP (drivers/acpi/x86/s2idle.c) to see that all those events are sent properly and which method they're using?
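For reference, dynamic debug is controlled by writing queries of the form "file <path> <flags>" to the dynamic_debug control file in debugfs, where +p enables printing and t adds the thread ID. The sketch below only builds such a query; the function names are made up for illustration, and apply_query assumes root privileges, a mounted debugfs, and a kernel built with CONFIG_DYNAMIC_DEBUG.

```python
from pathlib import Path

# Usual debugfs location; this is an assumption about the running system.
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def build_query(source_file, flags="+p"):
    """Build a dynamic debug query that selects every callsite in one source file."""
    return f"file {source_file} {flags}"


def apply_query(query, control=CONTROL):
    """Write the query to the control file (needs root and a mounted debugfs)."""
    control.write_text(query + "\n")


if __name__ == "__main__":
    # Only print the query here; call apply_query() deliberately, as root.
    print(build_query("drivers/acpi/x86/s2idle.c", "+pt"))
```

Because some of the s2idle.c messages are emitted only during boot, the same query can also be passed on the kernel command line through the dyndbg= parameter.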
Comment 5 KaiChuan-Hsieh 2021-08-23 15:03:02 UTC
Created attachment 298437 [details]
5.14-rc7 s2idle dyndbg enable

I tried blacklisting the ath10k_pci module and enabling all dynamic debug messages for s2idle.c.

drivers/acpi/x86/s2idle.c:399 [acpi]lps0_device_attach =pt "_DSM Using AMD method\012"
drivers/acpi/x86/s2idle.c:395 [acpi]lps0_device_attach =pt "_DSM UUID %s: Adjusted function mask: 0x%x\012"
drivers/acpi/x86/s2idle.c:357 [acpi]validate_dsm =pt "_DSM UUID %s rev %d function mask: 0x%x\012"
drivers/acpi/x86/s2idle.c:351 [acpi]validate_dsm =pt "_DSM UUID %s rev %d function 0 evaluation failed\012"
drivers/acpi/x86/s2idle.c:331 [acpi]acpi_sleep_run_lps0_dsm =pt "_DSM function %u evaluation %s\012"
drivers/acpi/x86/s2idle.c:301 [acpi]lpi_check_constraints =pt "LPI: required min power state:%s current power state:%s\012"
drivers/acpi/x86/s2idle.c:284 [acpi]lpi_device_get_constraints =pt "LPI: constraints list end\012"
drivers/acpi/x86/s2idle.c:276 [acpi]lpi_device_get_constraints =pt "Incomplete constraint defined\012"
drivers/acpi/x86/s2idle.c:265 [acpi]lpi_device_get_constraints =pt "uid:%d min_dstate:%s\012"
drivers/acpi/x86/s2idle.c:240 [acpi]lpi_device_get_constraints =pt "index:%d Name:%s\012"
drivers/acpi/x86/s2idle.c:201 [acpi]lpi_device_get_constraints =pt "LPI: constraints list begin:\012"
drivers/acpi/x86/s2idle.c:189 [acpi]lpi_device_get_constraints =pt "_DSM function 1 eval %s\012"
drivers/acpi/x86/s2idle.c:174 [acpi]lpi_device_get_constraints_amd =pt "LPI: constraints list end\012"
drivers/acpi/x86/s2idle.c:164 [acpi]lpi_device_get_constraints_amd =pt "Incomplete constraint defined\012"
drivers/acpi/x86/s2idle.c:158 [acpi]lpi_device_get_constraints_amd =pt "Name:%s\012"
drivers/acpi/x86/s2idle.c:119 [acpi]lpi_device_get_constraints_amd =pt "LPI: constraints list begin:\012"
drivers/acpi/x86/s2idle.c:102 [acpi]lpi_device_get_constraints_amd =pt "_DSM function 1 eval %s\012"

However, the kernel log still has no significant error.
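The listing above is the dynamic_debug control file format: source file and line, module in brackets, function, the flags after "=", and the format string. A small parser like the following can confirm which callsites have printing enabled. It is an illustrative sketch with made-up helper names, and it assumes disabled callsites are shown with the flags "=_".

```python
import re

_CALLSITE = re.compile(
    r'^(?P<file>\S+):(?P<line>\d+) '
    r'\[(?P<module>[^\]]+)\](?P<function>\S+) '
    r'=(?P<flags>\S+) "(?P<format>.*)"$'
)


def parse_callsite(text):
    """Parse one line of the dynamic_debug control file into a dict, or None."""
    match = _CALLSITE.match(text.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["line"] = int(entry["line"])
    # The 'p' flag means the callsite prints; '_' alone means no flags are set.
    entry["enabled"] = "p" in entry["flags"]
    return entry


if __name__ == "__main__":
    sample = (
        'drivers/acpi/x86/s2idle.c:331 [acpi]acpi_sleep_run_lps0_dsm '
        '=pt "_DSM function %u evaluation %s\\012"'
    )
    print(parse_callsite(sample))
```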
Comment 6 KaiChuan-Hsieh 2021-08-23 15:05:09 UTC
I also tried to wake the system with the keyboard and the touchpad, but both failed. I don't have to long-press the power key to force a shutdown; pressing the power key once as usual boots the system after the hang. It seems that suspending makes the system shut down directly.
Comment 7 Mario Limonciello (AMD) 2021-08-23 15:06:50 UTC
>I tried blacklisting the ath10k_pci module

Can you physically remove the card or is it soldered to the board?  If you can, please remove it physically.

> and enabling all dynamic debug messages for s2idle.c.

I don't see any of the output related to uPEP in your logs.  Is the uPEP device not in the ACPI tables?
Comment 8 KaiChuan-Hsieh 2021-08-23 15:17:45 UTC
Created attachment 298439 [details]
acpidump of the system

I dumped the ACPI tables, but I didn't see the uPEP device you mentioned in dsdt.dsl. Could you indicate which file should contain it? Would you please check whether the ACPI tables are configured correctly to support s2idle?

The tables in the attached acpi.log can be extracted with the acpixtract tool, as introduced at
http://alexhungdmz.blogspot.com/2012/05/how-to-dump-acpi-tables-in-ubuntu.html

Thanks,
Comment 9 Mario Limonciello (AMD) 2021-08-23 15:31:18 UTC
It's present in ssdt15.dat in your attachment.
I see that it should be effectively using Microsoft UUID 11e00d56-ce64-47ce-837b-1f898f9aa461 which we have support for in 5.14-rc7.
I also do see that FACP does set low power idle to 1, so the function should be initializing.
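The FACP check mentioned above can be reproduced from a raw table dump. In the ACPI specification the FADT Flags field is a 32-bit little-endian value at byte offset 112, and bit 21 is LOW_POWER_S0_IDLE_CAPABLE. The following is a minimal sketch under that layout; the function name is made up, and reading /sys/firmware/acpi/tables/FACP normally requires root.

```python
import struct
from pathlib import Path

FADT_PATH = Path("/sys/firmware/acpi/tables/FACP")
FADT_FLAGS_OFFSET = 112               # byte offset of the Flags field in the FADT
LOW_POWER_S0_IDLE_CAPABLE = 1 << 21   # bit 21 of Flags


def low_power_s0_idle(fadt):
    """Return True if the raw FADT bytes advertise Low Power S0 Idle."""
    if len(fadt) < FADT_FLAGS_OFFSET + 4:
        raise ValueError("FADT too short to contain the Flags field")
    (flags,) = struct.unpack_from("<I", fadt, FADT_FLAGS_OFFSET)
    return bool(flags & LOW_POWER_S0_IDLE_CAPABLE)


if __name__ == "__main__":
    try:
        print("low power S0 idle:", low_power_s0_idle(FADT_PATH.read_bytes()))
    except (OSError, ValueError) as exc:
        print(f"cannot check {FADT_PATH}: {exc}")
```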

Can you please turn on dynamic debugging for s2idle.c at bootup?  Some of those messages only happen at bootup.

Lastly, do you have a SKU with NVMe?  Does the failure happen only on SATA, while NVMe works?
Comment 10 KaiChuan-Hsieh 2021-08-24 01:31:43 UTC
Created attachment 298449 [details]
boot log with acpi dyndbg enabled

Hello,

This is the boot log with acpi dynamic debug enabled. I saw a lot of IOMMU failures. May I know if they are related to the BIOS or the kernel driver?

Thanks,
Comment 11 KaiChuan-Hsieh 2021-08-24 01:35:15 UTC
Created attachment 298451 [details]
lspci log of the system

My OS is installed on NVMe, not on a SATA HDD. The hang after entering suspend still reproduces. Please check the lspci result of the system.
Comment 12 Mario Limonciello (AMD) 2021-08-25 03:29:42 UTC
>My OS is installed on NVMe, not on a SATA HDD. The hang after entering suspend
>still reproduces. Please check the lspci result of the system.

Sorry, the internal ticket on this indicated it was SATA.

>This is the boot log with acpi dynamic debug enabled. I saw a lot of IOMMU
>failures. May I know if they are related to the BIOS or the kernel driver?

I think you'll need to open up some internal tickets to let the appropriate team dig into this.

>boot log with acpi dyndbg enabled

It does confirm it's using Microsoft UUID, but your debug log doesn't show a call into s2idle and the functions called to see any issues there.

Are you running this test on battery or connected to the AC adapter?  Can you please check both?  There are some community reports of problems when issuing suspend on battery without AC, and this might be the same issue they're seeing.  If it is the same, it will need deeper firmware debugging on an internal ticket.
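When comparing the two cases, it helps to record the power source alongside each suspend attempt. The sketch below reads the type and online attributes under /sys/class/power_supply. The helper names are illustrative, and it only treats supplies of type "Mains" as AC, which is an assumption: some systems expose their charger as a "USB" type supply instead.

```python
from pathlib import Path


def read_supplies(base="/sys/class/power_supply"):
    """Collect (type, online) pairs for every supply that exposes both files."""
    supplies = []
    for entry in sorted(Path(base).glob("*")):
        try:
            kind = (entry / "type").read_text().strip()
            online = (entry / "online").read_text().strip() == "1"
        except OSError:
            continue  # e.g. batteries usually have no "online" attribute
        supplies.append((kind, online))
    return supplies


def on_ac_power(supplies):
    """True if any Mains supply is online, False if all are offline,
    None if no Mains supply was found."""
    mains = [online for kind, online in supplies if kind == "Mains"]
    if not mains:
        return None
    return any(mains)


if __name__ == "__main__":
    print("on AC power:", on_ac_power(read_supplies()))
```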
Comment 13 You-Sheng Yang 2021-09-10 06:06:54 UTC
Filed another bug https://bugzilla.kernel.org/show_bug.cgi?id=214365 reproduced with AMD Barcelo CRB and SATA device.
Comment 14 Mario Limonciello (AMD) 2021-09-14 20:54:58 UTC
Something I noted in the logs:

>Nov 29 02:37:02 u-Inspiron-15-3525 kernel: [   39.653393] amd_pmc AMDI0005:00:
>SMU response timed out
>Nov 29 02:37:02 u-Inspiron-15-3525 kernel: [   39.653399] amd_pmc AMDI0005:00:
>suspend failed
>Nov 29 02:37:02 u-Inspiron-15-3525 kernel: [   39.653400] PM:
>dpm_run_callback(): acpi_subsys_suspend_noirq+0x0/0x50 returns -110
>Nov 29 02:37:02 u-Inspiron-15-3525 kernel: [   39.653407] amd_pmc AMDI0005:00:
>PM: failed to suspend noirq: error -110

This reminds me of another issue that was being reported and caused the timeout to be extended for amd-pmc.  You might try to see if https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git/commit/?h=fixes&id=3c3c8e88c8712bfe06cd10d7ca77a94a33610cd6 helps.
Comment 15 KobaKo 2021-10-05 04:52:15 UTC
@Mario, I tried this patch and it did not help.
After suspending the machine, I couldn't wake it up.
Comment 16 Fabrizio Bertocci 2021-11-16 14:17:18 UTC
@Mario: I have a similar system with the same problem. In my case the problem occurs only when running on battery power. If the laptop is connected to AC, s2idle works well.

From comparing the system logs when the failure occurs, it seems that my system simply fails to go to sleep and starts the resume functionality immediately, then hangs with a black screen.
Comment 17 Mario Limonciello (AMD) 2021-11-16 14:26:30 UTC
@Fabrizio,

Can you please open your own issue with all of the details of your system and configuration?
Preferably here instead: https://gitlab.freedesktop.org/drm/amd/-/issues/

There has been a lot of movement in recent kernels and we need to look more closely at individual issues.

Thanks,
Comment 18 Mario Limonciello (AMD) 2022-05-06 18:06:29 UTC
The original issue for this with SATA is resolved via https://github.com/torvalds/linux/commit/7c5f641a5914ce0303b06bcfcd7674ee64aeebe9 when SATA is properly configured for DEVSLP.