Created attachment 305960 [details]
Log of system after suspend/resume with frequencies locked

There have been various reports of issues on Lenovo P15v G3 AMD and P16v G1 AMD platforms where CPU frequency after suspend/resume is limited to 544MHz.

I was able to reproduce this reliably on my P16v G1 AMD with the following steps:
- Be plugged in
- Suspend (s0ix)
- Unplug
- Resume

If you then check the CPU frequencies, they are limited to 400 to 544MHz, which reduces system performance. The only recovery is to power cycle.

I was able to bisect the issue and tracked it down to this commit:
https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50

I've done a 6.8-rc6 build with and without this change and confirmed it is broken/fixed (kernel logs attached).

Please let me know how to proceed from here.
Thanks
Mark
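For anyone trying to confirm the same symptom, the frequency check after resume can be sketched as below. This is a minimal illustration, not something taken from the attached logs: it assumes the standard cpufreq sysfs file `scaling_cur_freq` (values in kHz), and the helper names and the 544MHz ceiling are hypothetical choices based on the numbers in this report. An idle machine can legitimately sit at low clocks, so it is best run while the system is under load.

```python
from pathlib import Path

# The report above saw cores capped at 400 to 544 MHz; sysfs reports kHz.
# This ceiling is an illustrative threshold, not a value defined by the kernel.
STUCK_CEILING_KHZ = 544_000


def read_cur_freqs_khz(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu name: current frequency in kHz} for CPUs exposing cpufreq."""
    freqs = {}
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        # path is <root>/cpuN/cpufreq/scaling_cur_freq, so two parents up is cpuN
        freqs[path.parent.parent.name] = int(path.read_text().strip())
    return freqs


def looks_stuck(freqs_khz, ceiling_khz=STUCK_CEILING_KHZ):
    """True when at least one core was sampled and none exceeds the ceiling."""
    return bool(freqs_khz) and max(freqs_khz.values()) <= ceiling_khz


if __name__ == "__main__":
    freqs = read_cur_freqs_khz()
    for cpu, khz in sorted(freqs.items(), key=lambda kv: int(kv[0][3:])):
        print(f"{cpu}: {khz / 1000:.0f} MHz")
    print("possibly stuck" if looks_stuck(freqs) else "not capped at the low range")
```

Sampling a few times under load before and after the suspend/unplug/resume sequence should make the difference obvious if the cap is in effect.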
Created attachment 305961 [details] Kernel log with issue causing commit reverted
As you probably know, the commit pointed to above is a regression fix that restores the previously existing behavior and so it cannot be reverted.

Besides, in the "good case" log there is this line:

kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

for every resume except for the first one, which is never present in the "bad case" log, so it looks like the "fix" is to prevent the platform from reaching the deepest state (and that's what happens without the commit in question).
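The comparison between the two logs can be repeated mechanically. The sketch below only counts the amd_pmc message quoted above in a saved kernel log; the function name is hypothetical and the log is assumed to be plain text such as the output of `dmesg` or `journalctl -k`.

```python
# Message text as it appears in the "good case" log quoted above.
DEEPEST_MISS = "Last suspend didn't reach deepest state"


def count_shallow_suspends(log_text):
    """Count amd_pmc warnings that a suspend cycle missed the deepest state."""
    return sum(
        1
        for line in log_text.splitlines()
        if "amd_pmc" in line and DEEPEST_MISS in line
    )
```

A count of zero across several suspend cycles suggests the platform did reach the deepest state each time; a non-zero count means the run is not a fair comparison for this bug.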
HP laptops also exhibit this error; I first reported it over half a year ago. See https://bugzilla.kernel.org/show_bug.cgi?id=218305

Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller)

however it's weird and alarming we now have _two_ vendors with the same issue. Maybe AMD could do something to prevent vendors from breaking things.

Exactly as in your case (I failed to mention this in my existing bug report), all it takes to encounter this bug is to put the laptop to sleep, unplug it, wait a bit, plug it back in and resume/wake it up. It's broken. A full reboot/power cycle fixes this.

Windows is not affected for some reason, or maybe I haven't tested enough.
> Besides, in the "good case" log there is this line:
> kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

If the last suspend didn't reach the deepest state with that reverted I do agree it's not actually fixing the root of the issue; it's masking it.

Could you repeat your bisect keeping this in mind?

> Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller)
>
> however it's weird and alarming we now have _two_ vendors with the same
> issue.

I need to point out that the EC is proprietary to each vendor. I have no knowledge of their codebase. It's entirely possible they issue some of the same commands to the SoC though.

Mark, would you be able to find out more from your EC team what their expectations are for this situation compared to how Windows behaves?

I wonder if we have a "mismatch" scenario that the EC is "expecting" the system to wake up and react; but we don't do that in Linux - we wait for a second interrupt to be active (like the GPIO controller) before we wake the system.
Thanks Mario

> Could you repeat your bisect keeping this in mind?

What would I be looking for? When sleep stops working? (this platform was certified so I'm assuming it was OK at cert time - but oh boy that's going to take some tracking down and be painful :( (the last round of bisects wasn't a barrel of laughs...I was so happy to have found something concrete. Sigh)

> would you be able to find out more from your EC team what their expectations
> are for this situation compared to how Windows behaves?

Absolutely. I'll take that conversation offline to work thru with you and Renjith.

Artem, I scanned your bug only briefly - but any chance your system has an Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards and it just makes me suspicious that we're not (so far...touch wood) seeing this anywhere else. I have an action item to track down a non-Nvidia SKU to confirm this theory.

Mark
> What would I be looking for? When sleep stops working? (this platform was
> certified so I'm assuming it was OK at cert time - but oh boy that's going to
> take some tracking down and be painful :( (the last round of bisects wasn't a
> barrel of laughs...I was so happy to have found something concrete. Sigh)

I think basically reproduce the issue as you've said, but you need to look at where you are on the timeline for your bisect and might need to do some extra steps.

1) Make sure you're getting to the deepest state after resume or it's a "skip".
2) As you bisect between 6.4 and 6.5 any step that has
https://github.com/torvalds/linux/commit/896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05
but not
https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50
you should revert 896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05.

> Absolutely. I'll take that conversation offline

👍
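The two bisect rules above could be captured in a small helper along these lines. It is only a sketch of the decision logic: the function names are hypothetical, the "reached deepest state" and "frequency stuck" inputs are assumed to come from a manual check after each test boot, and the ancestry test relies on `git merge-base --is-ancestor`. Exit code 125 is the value `git bisect run` interprets as "skip".

```python
import subprocess

# Commits named in the comment above.
GPE_CHANGE = "896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05"
GPE_FIX = "b5539eb5ee70257520e40bb636a295217c329a50"

BISECT_GOOD = 0
BISECT_BAD = 1
BISECT_SKIP = 125  # `git bisect run` treats exit code 125 as "skip this commit"


def contains(commit, head="HEAD", repo="."):
    """True if `commit` is an ancestor of `head` in the given repository."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, head],
        capture_output=True,
    )
    return result.returncode == 0


def needs_revert(has_change, has_fix):
    """Rule 2: revert 896e97bf... when it is present without b5539eb5..."""
    return has_change and not has_fix


def verdict(reached_deepest, freq_stuck):
    """Rule 1: a cycle that missed the deepest state says nothing, so skip it."""
    if not reached_deepest:
        return BISECT_SKIP
    return BISECT_BAD if freq_stuck else BISECT_GOOD
```

At each step, `needs_revert(contains(GPE_CHANGE), contains(GPE_FIX))` indicates whether to revert before building, and `verdict(...)` gives the result to report once the suspend/unplug/resume sequence has been run.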
Hi Mark,

> Artem, I scanned your bug only briefly - but any chance your system has an
> Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards
> and it just makes me suspicious that we're not (so far...touch wood) seeing
> this anywhere else. I have an action item to track down a non-Nvidia SKU to
> confirm this theory.

Nope, it's an HP EliteBook 845 14" G10 laptop and it only has a built-in Radeon 780M iGPU.

Windows is seemingly not affected by this issue.
@Artem: You can try to disable the deepest idle states on all CPUs via the cpuidle sysfs before suspending and see if that makes any difference. The suspicion being that if the SoC gets deep enough with low power, it may need some extra work to restore the previous configuration properly.
That should block VDDOFF which will prevent the SoC from getting into the deepest state over suspend which will mean it behaves similarly to what Mark found. I don't think it's the most useful datapoint.

I *think* the common bit with Artem's issue and Mark's issue is that there are some APU thermal coefficients that are set by the EC that aren't getting updated properly over s2idle and the system is staying throttled. The closest analogy to this for Intel is the EC setting PL1 or PL2.

The specifics of which are used are different for HP and Lenovo, so I think we should treat them separately for now although I admit that they have a very similar reproduction and might have a similar root cause. I've posted some more debugging steps to Artem's bug.
@Rafael

> @Artem: You can try to disable the deepest idle states on all CPUs via the
> cpuidle sysfs before suspending and see if that makes any difference.

I did:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state3/disable

(not sure if it's the right one, state3/name says "C3" which looks like it's the lowest).

Let me check. And I will get to Mario's new debugging steps in a moment.
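To avoid guessing which state index is the deepest, the same experiment could be scripted roughly as follows. This is a sketch under assumptions: it treats the highest-numbered `stateN` directory as the deepest state (cpuidle lists states from shallowest to deepest), the helper names are hypothetical, and writing to the real sysfs tree needs root.

```python
from pathlib import Path


def deepest_state_dir(cpu_dir):
    """Return the highest-numbered cpuidle stateN directory for one CPU, or None."""
    states = [p for p in Path(cpu_dir).glob("cpuidle/state[0-9]*") if p.is_dir()]
    return max(states, key=lambda p: int(p.name[len("state"):]), default=None)


def disable_deepest(cpu_root="/sys/devices/system/cpu"):
    """Write 1 to the deepest state's `disable` file on every CPU.

    Returns {cpu name: name of the state that was disabled} so the caller can
    confirm which state (for example "C3") was actually affected.
    """
    changed = {}
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        state = deepest_state_dir(cpu_dir)
        if state is None:
            continue  # this CPU exposes no cpuidle states
        (state / "disable").write_text("1")
        changed[cpu_dir.name] = (state / "name").read_text().strip()
    return changed
```

The returned mapping makes it easy to verify that the state being disabled is the one whose `name` file reports the deepest C-state on the machine under test.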
Nope, didn't help, now on to Mario's suggestions. Sorry for spamming in this bug report, it doesn't look related to my issue.
Framework Phoenix laptops also seem to be affected even when running Windows: https://community.frame.work/t/amd-cpu-stuck-in-low-speed-state-after-system-resume/39921
I've developed a fix for this issue that addresses the CPU frequency limitation on AMD platforms after suspend/resume.

After investigating the problem, I found that the root cause is related to commit b5539eb5ee70 ("ACPI: EC: Fix acpi_ec_dispatch_gpe()"). This commit restored the behavior of clearing the GPE in acpi_ec_dispatch_gpe() function to prevent GPE storms during suspend-to-idle. While this fix is necessary for most platforms, it causes problems specifically on certain AMD platforms by interfering with the EC's ability to properly restore power management settings after resume.

My patch implements a targeted workaround that:

1. Adds DMI-based detection for the affected AMD platforms (Lenovo P15v Gen 3, P16v Gen 1, HP EliteBook 845 G10)
2. Adds a function to check if we're in suspend-to-idle mode
3. Modifies the acpi_ec_dispatch_gpe() function to handle AMD platforms specially:
   - For affected AMD platforms during suspend-to-idle, it advances the transaction without clearing the GPE status bit
   - For all other platforms, it maintains the existing behavior of clearing the GPE status bit

I've tested this fix on a Lenovo P16v Gen 1 with AMD Ryzen 7 PRO 7840HS and confirmed that:

- Without the patch, the CPU frequency is limited to 544MHz after the suspend/unplug/resume sequence
- With the patch applied, the CPU properly scales up to its maximum frequency (5.1GHz) after the same sequence
- No regressions were observed in other EC functionality
- Multiple suspend/resume cycles with different power states were tested without issues

I've submitted this patch to the Linux kernel mailing list for review and inclusion in the mainline kernel. The patch is designed to be a minimal and targeted fix that addresses the specific issue without affecting the behavior on non-AMD platforms.

I'll update this ticket once there's feedback from the kernel maintainers.
(In reply to Marcus Bergo from comment #13)
> I've developed a fix for this issue that addresses the CPU frequency
> limitation on AMD platforms after suspend/resume.
>

Wonderful news, I just wonder why Windows seemingly doesn't need any quirks and works properly out of the box.
As discussed on the mailing list, it appears that the workaround from Marcus isn't needed, as amd-pmc now mandates that any cycles after the first are 2.5s to give components time to settle. This helps to avoid hitting this bug in the EC.

https://git.kernel.org/torvalds/c/9f5595d5f03f

> Wonderful news, I just wonder why Windows seemingly doesn't need any quirks
> and works properly out of the box.

Windows is a lot slower at suspend entry. It takes tens of seconds.