Created attachment 305960: Log of system after suspend/resume with frequencies locked

There have been various reports of issues on Lenovo P15v G3 AMD and P16v G1 AMD platforms where the CPU frequency after suspend/resume is limited to 544 MHz.

I was able to reproduce this reliably on my P16v G1 AMD with the following steps:
- Be plugged in
- Suspend (s0ix)
- Unplug
- Resume

If I then check the CPU frequencies, they are limited to 400-544 MHz, which reduces system performance. The only recovery is to power cycle.

I was able to bisect the issue and tracked it down to this commit:
https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50

I've done a 6.8-rc6 build with and without this change and confirmed that it is broken with it and fixed without it (kernel logs attached).

Please let me know how to proceed from here.

Thanks
Mark
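One way to capture the frequencies before suspend and after resume is a minimal sketch like the following. It assumes the standard cpufreq sysfs layout, with scaling_cur_freq and scaling_max_freq in kHz under /sys/devices/system/cpu/cpuN/cpufreq. The function name report_cpu_freqs is hypothetical, and its optional argument exists only so it can be pointed at a different sysfs root.

```shell
#!/bin/sh
# Hypothetical helper: print the current and maximum cpufreq frequency
# (in MHz) for every CPU. Run it once before suspending and again after
# resuming (ideally with some load running), then compare the cur values.
report_cpu_freqs() {
    # Optional argument: alternative sysfs root (defaults to the real one).
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip non-matching globs and CPUs without a cpufreq directory.
        [ -d "$dir" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        # cpufreq reports frequencies in kHz; convert them to MHz.
        cur=$(cat "$dir/scaling_cur_freq")
        max=$(cat "$dir/scaling_max_freq")
        echo "$cpu cur=$((cur / 1000))MHz max=$((max / 1000))MHz"
    done
}
```

A single idle sample of scaling_cur_freq can be low even on a healthy system, so the comparison is most meaningful with the CPUs busy.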
Created attachment 305961: Kernel log with the issue-causing commit reverted
As you probably know, the commit pointed to above is a regression fix that restores the previously existing behavior, so it cannot be reverted.

Besides, the "good case" log contains this line:

kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

for every resume except the first one, and the line never appears in the "bad case" log. So it looks like the "fix" works by preventing the platform from reaching the deepest state (which is what happens without the commit in question).
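To check this pattern in a given log, one option is to count resumes against the amd_pmc warnings. The sketch below is minimal and makes two assumptions: the kernel log is piped in on stdin (for example from journalctl -k -b), and every completed cycle prints the usual "PM: suspend exit" line. summarize_suspends is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: summarize a kernel log read from stdin, reporting how
# many resumes it contains and how many amd_pmc flagged as shallow.
summarize_suspends() {
    log=$(cat)
    # "PM: suspend exit" is printed once per completed suspend/resume cycle.
    resumes=$(printf '%s\n' "$log" | grep -c 'PM: suspend exit')
    # The amd_pmc warning quoted from the "good case" log.
    shallow=$(printf '%s\n' "$log" | grep -c "Last suspend didn't reach deepest state")
    echo "resumes=$resumes shallow=$shallow"
}
```

Based on the observation above, the kernel with the commit reverted would be expected to show shallow equal to resumes minus one, and the unmodified kernel shallow=0.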
HP laptops also exhibit this error; I first reported it over half a year ago. See https://bugzilla.kernel.org/show_bug.cgi?id=218305

Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller). However, it's weird and alarming that we now have _two_ vendors with the same issue. Maybe AMD could do something to prevent vendors from breaking things.

And, exactly as in your case (something I failed to mention in the already known bug report), all it takes to trigger this bug is to put the laptop to sleep, unplug it, wait a bit, plug it back in and resume/wake it up. It's broken. A full reboot/power cycle fixes it.

Windows is not affected for some reason, or maybe I haven't tested enough.
> Besides, in the "good case" log there is this line:
> kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

If the last suspend didn't reach the deepest state with that commit reverted, I agree it's not actually fixing the root of the issue; it's masking it. Could you repeat your bisect keeping this in mind?

> Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller)
>
> however it's weird and alarming we now have _two_ vendors with the same
> issue.

I need to point out that the EC firmware is proprietary to each vendor, and I have no knowledge of their codebases. It's entirely possible that they issue some of the same commands to the SoC, though.

Mark, would you be able to find out more from your EC team about what their expectations are for this situation, compared to how Windows behaves? I wonder if we have a "mismatch" scenario where the EC "expects" the system to wake up and react, but Linux doesn't do that: we wait for a second interrupt to be active (such as one from the GPIO controller) before we wake the system.
Thanks, Mario

> Could you repeat your bisect keeping this in mind?

What would I be looking for? When sleep stops working? (This platform was certified, so I'm assuming it was OK at cert time, but oh boy, that's going to take some tracking down and be painful :( The last round of bisects wasn't a barrel of laughs... I was so happy to have found something concrete. Sigh.)

> would you be able to find out more from your EC team what their expectations
> are for this situation compared to how Windows behaves?

Absolutely. I'll take that conversation offline to work through with you and Renjith.

Artem, I only scanned your bug briefly, but is there any chance your system has an Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards, and that makes me suspicious given that we're not (so far... touch wood) seeing this anywhere else. I have an action item to track down a non-Nvidia SKU to confirm this theory.

Mark
> What would I be looking for? When sleep stops working? (this platform was
> certified so I'm assuming it was OK at cert time - but oh boy that's going to
> take some tracking down and be painful :( (the last round of bisects wasn't a
> barrel of laughs...I was so happy to have found something concrete. Sigh)

Basically, reproduce the issue the way you've described, but keep track of where you are on the timeline during the bisect, because some steps need extra work:

1) Make sure you're getting to the deepest state after resume; if you aren't, mark that step as a "skip".
2) As you bisect between 6.4 and 6.5, at any step that contains https://github.com/torvalds/linux/commit/896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05 but not https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50, revert 896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05.

> Absolutely. I'll take that conversation offline

👍
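Step 2 can be scripted at each bisect step. The sketch below is minimal and assumes it runs from the top of a kernel git tree that git bisect has checked out. The helper names needs_revert and prepare_bisect_step are hypothetical. The revert uses git revert --no-commit, which keeps the change in the working tree instead of creating a new commit that git bisect good/bad would then end up marking.

```shell
#!/bin/sh
# Commits named in step 2; overridable through the environment.
REGRESSION="${REGRESSION:-896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05}"
FIX="${FIX:-b5539eb5ee70257520e40bb636a295217c329a50}"

# Hypothetical helper: succeed when HEAD contains REGRESSION but not FIX.
# (git merge-base --is-ancestor treats a commit as its own ancestor.)
needs_revert() {
    git merge-base --is-ancestor "$REGRESSION" HEAD &&
        ! git merge-base --is-ancestor "$FIX" HEAD
}

# Hypothetical helper: apply the revert to the working tree and index only,
# so the bisect state still points at the commit git bisect checked out.
prepare_bisect_step() {
    if needs_revert; then
        echo "reverting $REGRESSION for this step"
        git revert --no-commit "$REGRESSION"
    fi
}
```

After building and testing a step, discard the temporary revert with git reset --hard before running git bisect good, git bisect bad, or git bisect skip. Skip is the outcome from step 1 when the deepest state isn't reached.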
Hi Mark,

> Artem, I scanned your bug only briefly - but any chance your system have an
> Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards
> and it just makes me suspicious that we're not (so far...touch wood) seeing
> this anywhere else. I have an action item to track down a non-Nvidia SKU to
> confirm this theory.

Nope, it's an HP EliteBook 845 14" G10 laptop, and it only has a built-in Radeon 780M iGPU. Windows seemingly isn't affected by this issue.
@Artem: You can try disabling the deepest idle states on all CPUs via the cpuidle sysfs interface before suspending and see if that makes any difference. The suspicion is that if the SoC gets into a deep enough low-power state, it may need some extra work to restore the previous configuration properly.
That should block VDDOFF, which prevents the SoC from getting into the deepest state over suspend, so the system will behave much as it did in Mark's reverted case. I don't think it's the most useful data point.

I *think* the common element between Artem's issue and Mark's issue is that the EC sets some APU thermal coefficients that aren't getting updated properly over s2idle, so the system stays throttled. The closest Intel analogy is the EC setting PL1 or PL2.

The specific coefficients used differ between HP and Lenovo, so I think we should treat the two issues separately for now. I admit, though, that they have a very similar reproduction and might share a root cause.

I've posted some more debugging steps to Artem's bug.
@Rafael

> @Artem: You can try to disable the deepest idle states on all CPUs via the
> cpuidle sysfs before suspending and see if that makes any difference.

I did:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state3/disable

(I'm not sure it's the right one; state3/name says "C3", which looks like the deepest.) Let me check.

I will get to Mario's new debugging steps in a moment.
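To confirm which state is the deepest before disabling it, a minimal sketch can print the highest-numbered cpuidle state on each CPU together with its name. It assumes, as cpuidle conventionally does, that states are numbered from shallowest to deepest. deepest_idle_state is a hypothetical helper name, and its optional argument only overrides the sysfs root.

```shell
#!/bin/sh
# Hypothetical helper: for each CPU, print the highest-numbered cpuidle
# state and the contents of its "name" file.
deepest_idle_state() {
    root="${1:-/sys/devices/system/cpu}"
    for cpu in "$root"/cpu[0-9]*; do
        [ -d "$cpu/cpuidle" ] || continue
        deepest=""
        best=-1
        for state in "$cpu"/cpuidle/state[0-9]*; do
            [ -d "$state" ] || continue
            # Compare numerically, because the glob sorts state10 before state2.
            n=${state##*/state}
            if [ "$n" -gt "$best" ]; then
                best=$n
                deepest=$state
            fi
        done
        if [ -n "$deepest" ]; then
            echo "$(basename "$cpu") state$best $(cat "$deepest/name")"
        fi
    done
}
```

If every CPU reports state3 with the name C3, the tee command above targeted the deepest state.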
Nope, that didn't help; now on to Mario's suggestions. Sorry for spamming this bug report; it doesn't look related to my issue.
Framework Phoenix laptops also seem to be affected, even when running Windows: https://community.frame.work/t/amd-cpu-stuck-in-low-speed-state-after-system-resume/39921