Bug 218641

Summary: Framework Laptop 13 AMD suspend/wakeup regression with 6.9rc1
Product: Timers Reporter: David Markey (david)
Component: OtherAssignee: anna-maria
Status: RESOLVED CODE_FIX    
Severity: normal CC: anna-maria, david, dimitris.on.linux, dominik.foerderer, linux, linux, mario.limonciello, regressions
Priority: P3    
Hardware: AMD   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: revert against 6.9-rc1
Patch which adds trace printks to locate the timeout
script to enable all required trace options
Fix - please try
6.9-rc4 report showing it working

Description David Markey 2024-03-25 22:08:29 UTC
Was trying out 6.9rc1 to give perfcore a whirl but I've noticed that when I go to s2idle I cannot wake up, the power button does not wake the laptop up again and stays flashing and requires a full reboot to become responsive again.

Arch linux, linux-mainline. 

>>options root=/dev/nvme0n1p3 rw resume=UUID=x-x-x-x-x 
>>rtc_cmos.use_acpi_alarm=1 pcie_aspm.policy=powersupersave amdgpu.abmlevel=3
>>iomem=relaxed amdgpu.ppfeaturemask=0xffffffff  amdgpu.sg_display=0


AMD Ryzen 5 7640U w/ Radeon 760M Graphics.

Works fine on 6.8.1.
Comment 1 David Markey 2024-03-25 22:16:42 UTC
sorry, prefcore!
Comment 2 David Markey 2024-03-25 23:33:45 UTC
AMD Dev Mario Limonciello seems to think they're onto to something over at the Framework forums.

https://community.frame.work/t/fyi-amd-framework-13-suspend-resume-regression-in-6-9rc1/47697/8?u=david_markey
Comment 3 Mario Limonciello (AMD) 2024-03-26 02:38:06 UTC
I finished up bisecting and it comes down to this commit:

```
commit 7ee988770326fca440472200c3eb58935fe712f6
Author: Anna-Maria Behnsen <anna-maria@linutronix.de>                          
Date:   Thu Feb 22 11:37:10 2024 +0100                                                   
                                                                                         
    timers: Implement the hierarchical pull model
```

When the issue happens, the system isn't able to be woken up from the GPIO controller anymore.  However, it can still be woken up from programming a wakeup timer.

Here is a snippet of the logs when it occurs:

```
Mar 25 21:30:14 fedora kernel: PM: suspend-to-idle
Mar 25 21:30:14 fedora kernel: ACPI: EC: ACPI EC GPE status set
Mar 25 21:30:14 fedora kernel: ACPI: PM: Rearming ACPI SCI for wakeup
Mar 25 21:30:14 fedora kernel: amd_pmc: SMU idlemask s0i3: 0x8fff9eb5
Mar 25 21:30:14 fedora kernel: Timekeeping suspended for 2.303 seconds
Mar 25 21:30:14 fedora kernel: PM: Triggering wakeup from IRQ 7
Mar 25 21:30:14 fedora kernel: PM: Triggering wakeup from IRQ 9
Mar 25 21:30:14 fedora kernel: ACPI: EC: ACPI EC GPE status set
Mar 25 21:30:14 fedora kernel: ACPI: EC: ACPI EC GPE dispatched
Mar 25 21:30:14 fedora kernel: ACPI: EC: ACPI EC work flushed
Mar 25 21:30:14 fedora kernel: ACPI: PM: Wakeup after ACPI Notify sync
Mar 25 21:30:14 fedora kernel: PM: resume from suspend-to-idle
Mar 25 21:30:14 fedora kernel: PM: Triggering wakeup from IRQ 0
Mar 25 21:30:14 fedora kernel: ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0.EMBF due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
Mar 25 21:30:14 fedora kernel: ACPI Error: Aborting method \_SB.PEP._DSM due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
Mar 25 21:30:14 fedora kernel: ACPI: \_SB_.PEP_: failed to evaluate _DSM 560de011-64ce-ce47-837b-1f898f9aa461 (0x3021)
Mar 25 21:30:14 fedora kernel: ACPI: \_SB_.PEP_: Failed to transitioned to state lps0 exit
Mar 25 21:30:14 fedora kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 8010624773 wd_nsec: 8010615594
Mar 25 21:30:14 fedora kernel: ACPI: \_SB_.PEP_: Successfully transitioned to state lps0 ms exit
Mar 25 21:30:14 fedora kernel: ACPI: \_SB_.PEP_: Successfully transitioned to state screen on
Mar 25 21:30:14 fedora kernel: ACPI: EC: interrupt unblocked
Mar 25 21:30:14 fedora kernel: PM: noirq resume of devices complete after 64.225 msecs
Mar 25 21:30:14 fedora kernel: GPIO 0 is active: 0x3805780f
Mar 25 21:30:14 fedora kernel: GPIO 8 is active: 0x18147b00
```

Conversely here's a snippet of logs from 7ee988770326fca440472200c3eb58935fe712f6~1:

```
Mar 25 21:20:34 fedora kernel: PM: suspend-to-idle
Mar 25 21:20:34 fedora kernel: ACPI: EC: ACPI EC GPE status set
Mar 25 21:20:34 fedora kernel: ACPI: PM: Rearming ACPI SCI for wakeup
Mar 25 21:20:34 fedora kernel: amd_pmc: SMU idlemask s0i3: 0x8fff9eb5
Mar 25 21:20:34 fedora kernel: Timekeeping suspended for 1.835 seconds
Mar 25 21:20:34 fedora kernel: PM: Triggering wakeup from IRQ 9
Mar 25 21:20:34 fedora kernel: PM: Triggering wakeup from IRQ 7
Mar 25 21:20:34 fedora kernel: ACPI: EC: ACPI EC GPE status set
Mar 25 21:20:34 fedora kernel: ACPI: EC: ACPI EC GPE dispatched
Mar 25 21:20:34 fedora kernel: ACPI: EC: ACPI EC work flushed
Mar 25 21:20:34 fedora kernel: ACPI: PM: Wakeup after ACPI Notify sync
Mar 25 21:20:34 fedora kernel: PM: resume from suspend-to-idle
Mar 25 21:20:34 fedora kernel: PM: Triggering wakeup from IRQ 0
Mar 25 21:20:34 fedora kernel: ACPI: \_SB_.PEP_: Successfully transitioned to state lps0 exit
Mar 25 21:20:34 fedora kernel: ACPI: \_SB_.PEP_: Successfully transitioned to state lps0 ms exit
Mar 25 21:20:34 fedora kernel: ACPI: \_SB_.PEP_: Successfully transitioned to state screen on
Mar 25 21:20:34 fedora kernel: ACPI: EC: interrupt unblocked
Mar 25 21:20:34 fedora kernel: PM: noirq resume of devices complete after 64.161 msecs
Mar 25 21:20:34 fedora kernel: GPIO 0 is active: 0x3805780f
```
Comment 4 Mario Limonciello (AMD) 2024-03-26 02:52:20 UTC
Created attachment 306042 [details]
revert against 6.9-rc1

Reverting the commit I posted above against 6.9-rc1 fixes it for me.  Can you see if that works for you too?
Comment 5 anna-maria 2024-03-26 16:50:47 UTC
Thanks for the Bug report! I have access to a similar machine where this behavior is reproducible. So I should be able to investigate it the next days. If I have an update, I'll come back to you.

Thanks
Comment 6 anna-maria 2024-04-02 11:04:59 UTC
Created attachment 306077 [details]
Patch which adds trace printks to locate the timeout
Comment 7 anna-maria 2024-04-02 11:06:25 UTC
Created attachment 306078 [details]
script to enable all required trace options
Comment 8 anna-maria 2024-04-02 11:08:00 UTC
Sorry for the late reply - Easter vacation...

I was able to reproduce it. But in contrast to the above, the laptop always returned from suspend after quite a time (~50 sec). This behavior was seen when pressing the power button and also using a wake timer. During the tests, I also had two (accidental) working resume from suspend when programming a wake timer. I was also able to track it down where this timeout occurs.

At the moment if have no access to the machine which I used last week, so I would be happy to get some help to investigate it further:

- Could you please wait some more time to see if your laptop resumes from suspend as well after some time?

- Could you please apply the patch  which adds trace printks to locate the timeout and run a script (which enables trace points and function tracers) before you send the system into suspend? When it comes back from suspend (doesn't matter whether it is a wakeup timer or pressing the butting), please cat the trace into a file and share it with me.
Comment 9 Mario Limonciello (AMD) 2024-04-02 20:58:47 UTC
I've applied your patch and turned on tracing before suspend.  Here is the trace from /sys/kernel/trace captured on a Framework 13 running 6.9-rc2.

https://drive.google.com/file/d/1jb6_lTyUPTCl_1ii_fQWOzxFHA_p3U_h/view?usp=sharing

It was too big to put on a gist or attach here.

I'd like to point out a few other observations.
1. I don't believe the notification made it over to the EC to exit suspend because the LED is still blinking and the keyboard doesn't work properly.  I had to get this stuff from SSH.

2. Here is the kernel log matching the wakeup this trace was captured after.

[   67.372518] PM: suspend-to-idle
[   67.393190] amd_pmc: SMU idlemask s0i3: 0x3ffbbebd
[   69.680668] Timekeeping suspended for 0.881 seconds
[   69.680715] PM: Triggering wakeup from IRQ 1
[   69.680721] PM: Triggering wakeup from IRQ 9
[   69.680840] ACPI: EC: ACPI EC GPE status set
[   69.680907] ACPI: EC: ACPI EC GPE dispatched
[   72.008808] PM: Triggering wakeup from IRQ 0
[  102.488319] ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0.SIQ1 due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
[  102.488348] ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0._Q0D due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
[  102.488450] ACPI: EC: ACPI EC work flushed
[  102.488463] ACPI: PM: Wakeup after ACPI Notify sync
[  102.488465] PM: resume from suspend-to-idle
[  102.490133] ACPI: \_SB_.PEP_: Successfully transitioned to state lps0 exit
[  134.487330] ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0.SIQ1 due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
[  134.487360] ACPI Error: Aborting method \_SB.PEP._DSM due to previous error (AE_AML_LOOP_TIMEOUT) (20230628/psparse-529)
[  134.487395] ACPI: \_SB_.PEP_: failed to evaluate _DSM 560de011-64ce-ce47-837b-1f898f9aa461 (0x3021)
[  134.487403] ACPI: \_SB_.PEP_: Failed to transitioned to state lps0 ms exit
[  134.488319] ACPI: \_SB_.PEP_: Successfully transitioned to state screen on
[  134.488645] ACPI: EC: interrupt unblocked
[  134.489049] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 64629783981 wd_nsec: 64629428453
[  134.701727] PM: noirq resume of devices complete after 213.402 msecs
[  134.701790] GPIO 8 is active: 0x18147b00
[  134.701805] PM: Triggering wakeup from IRQ 0
[  134.776787] PM: early resume of devices complete after 2.452 msecs
[  134.777893] GPIO 8 is active: 0x18141b00
[  134.779407] GPIO 8 is active: 0x18141b00
Comment 12 Mario Limonciello (AMD) 2024-04-02 21:07:44 UTC
Github is truncating, here is a second captured trace.

https://drive.google.com/file/d/1LnfWJ6FaKtZ_xu6kDDttTra6LbBvKCQa/view?usp=sharing
Comment 13 anna-maria 2024-04-04 16:09:55 UTC
Created attachment 306085 [details]
Fix - please try

Thanks for the trace.

Can you please test this patch (should work on top of linus tree) and see, whether suspend/resume works again? I will write a detailed explanation for the 'why' later on...
Comment 14 D.F. 2024-04-04 18:03:38 UTC
Thank you for the Patch!

Tested it on Framework13 AMD and suspend / resume is now working as it should...
Comment 15 Mario Limonciello (AMD) 2024-04-05 02:36:17 UTC
Works for me too on F13 AMD, BIOS 3.05. Thanks!

Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Comment 16 anna-maria 2024-04-05 08:37:08 UTC
Here you can find the patch (with detailed explanation):

https://lore.kernel.org/r/20240405083410.4896-1-anna-maria@linutronix.de/
Comment 17 David Markey 2024-04-12 17:59:27 UTC
Hey, sorry for the delay. I tested the patch against 6.9 rc3, The laptop wakes up but the screen remains Frozen and I need to hard reboot.
Comment 18 Mario Limonciello (AMD) 2024-04-12 18:06:11 UTC
I didn't test against rc3, I tested against rc2. Can you see if it works on rc2 for you? If so there might be a separate regression in rc3.
Comment 19 D.F. 2024-04-12 18:34:16 UTC
(In reply to David Markey from comment #17)
> Hey, sorry for the delay. I tested the patch against 6.9 rc3, The laptop
> wakes up but the screen remains Frozen and I need to hard reboot.

I can not confirm this behavior. Applied the patch against 6.9rc3 five days ago. 
I've suspended / resumed my machine at least 30 times since and had no problems with it. Suspend / Resume works as intended.
Comment 20 David Markey 2024-04-12 20:43:51 UTC
So to be sure I tried building rc2 and ruling out another patch related to a separate hibernation issue, but this new issue s2idle is still present. 

Suspend, you can wake up, but screen immediately freezes.

I guess the only only thing is I'm on the latest BIOS from Framework which Mario mightn't be on.
Comment 21 Mario Limonciello (AMD) 2024-04-12 20:56:20 UTC
I am on the latest BIOS as well. I suggest running the s2idle script and see if you get any other hints of what's going on if it dumps on resume before you freeze.
Comment 22 David Markey 2024-04-12 21:57:23 UTC
https://gist.github.com/dmarkey/ceef070ded9635828499fd1694e3b39f Here is an s2idle script report.
Comment 23 Mario Limonciello (AMD) 2024-04-12 22:31:58 UTC
The only thing that stands out different to me in that report is mine doesn't have Intel WiFi and I have been testing in balanced mode.
Comment 24 David Markey 2024-04-12 22:39:34 UTC
Hmm tried quickly there with iwlmvm and iwlmvm unloaded, same result, and balanced mode selected.

There are 2 FW 13 AMD models though, one with more cores, mine is the 12 thread model.. could there be a slight discrepancy there?
Comment 25 David Markey 2024-04-12 22:40:12 UTC
* and iwlwifi unloaded.
Comment 26 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-13 06:49:39 UTC
David, I'm wondering if it might be a good idea to rule out other problems on your machine by testing these two things:

* retry if 7ee988770326fca440472200c3eb58935fe712f6^ (the parent of the culprit) really works

* checkout 7ee988770326fca440472200c3eb58935fe712f6 and applying the fix on top of it to see if it fixes this problem.
Comment 27 Mario Limonciello (AMD) 2024-04-15 18:35:26 UTC
Created attachment 306159 [details]
6.9-rc4 report showing it working

I did double check 6.9-rc4 with no patches applied.  Works for me now.
Comment 28 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-15 18:39:59 UTC
David, if you really still have some problem, it's a different one -- hence please open a new bug and afterwards drop a link here, as things otherwise quickly can become confusing.
Comment 29 David Markey 2024-04-16 22:04:32 UTC
Alright, I'll try to isolate this new problem and maybe raise a new bug. Thanks.
Comment 30 David Markey 2024-04-16 23:06:13 UTC
So this seems to be that my disk disappears on resume, opened a new bug here: https://bugzilla.kernel.org/show_bug.cgi?id=218736

Thanks for addressing this one, all.