Bug 218557 - Kernel regression in acpi/ec.c causing some AMD platforms to limit CPU frequency to < 544 MHz
Status: NEW
Alias: None
Product: ACPI
Classification: Unclassified
Component: EC
Hardware: AMD Linux
Importance: P3 blocking
Assignee: acpi_ec
URL:
Keywords:
Depends on: 218305
Blocks:
Reported: 2024-03-04 18:19 UTC by Mark Pearson
Modified: 2024-04-15 17:42 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Log of system after suspend/resume with frequencies locked (138.45 KB, text/plain)
2024-03-04 18:19 UTC, Mark Pearson
Kernel log with issue causing commit reverted (150.09 KB, text/plain)
2024-03-04 18:20 UTC, Mark Pearson

Description Mark Pearson 2024-03-04 18:19:17 UTC
Created attachment 305960 [details]
Log of system after suspend/resume with frequencies locked

There have been various reports of issues on Lenovo P15v G3 AMD and P16v G1 AMD platforms where the CPU frequency is limited to 544 MHz after suspend/resume.

I was able to reproduce this reliably on my P16v G1 AMD with the following steps:
 - Be plugged in
 - Suspend (s0ix)
 - Unplug
 - Resume

Checking the CPU frequencies afterwards shows that they are limited to between 400 and 544 MHz, which reduces system performance. The only recovery is a power cycle.
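
For anyone who wants to check their own system after the resume step, here is a minimal sketch that reads the standard cpufreq sysfs files. The 544 MHz ceiling comes from the report above; the function names and the "every CPU at or below the ceiling" rule are my own assumptions, not something the reporter used.

```python
from pathlib import Path

# Ceiling taken from the report above: affected systems top out at 544 MHz.
THROTTLE_CEILING_KHZ = 544_000


def read_cur_freqs_khz(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: current frequency in kHz} from the cpufreq sysfs files."""
    freqs = {}
    for path in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        try:
            freqs[path.parent.parent.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # CPU offline or file unreadable; skip it
    return freqs


def all_stuck_low(freqs, ceiling_khz=THROTTLE_CEILING_KHZ):
    """True if at least one CPU was sampled and none exceeds the ceiling."""
    return bool(freqs) and all(f <= ceiling_khz for f in freqs.values())


if __name__ == "__main__":
    freqs = read_cur_freqs_khz()
    for cpu, khz in freqs.items():
        print(f"{cpu}: {khz / 1000:.0f} MHz")
    print("stuck low" if all_stuck_low(freqs) else "not stuck (or no cpufreq data)")
```

An idle system can legitimately sit at low frequencies, so this is only meaningful while something CPU-heavy is running.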

I was able to bisect the issue and tracked it down to this commit:
https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50

I've done a 6.8-rc6 build with and without this change and confirmed that it is broken with the commit and fixed without it (kernel logs attached).

Please let me know how to proceed from here.
Thanks
Mark
Comment 1 Mark Pearson 2024-03-04 18:20:06 UTC
Created attachment 305961 [details]
Kernel log with issue causing commit reverted
Comment 2 Rafael J. Wysocki 2024-03-05 11:25:23 UTC
As you probably know, the commit pointed to above is a regression fix that restores the previously existing behavior and so it cannot be reverted.

Besides, in the "good case" log there is this line:

kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

for every resume except for the first one, which is never present in the "bad case" log, so it looks like the "fix" is to prevent the platform from reaching the deepest state (and that's what happens without the commit in question).
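
The same check can be run on any saved kernel log. Below is a minimal sketch that pulls out the amd_pmc lines carrying the message quoted above; the function name is hypothetical, and matching on the two substrings is an assumption about how the line appears in the log.

```python
# Message text copied from the "good case" log line quoted above.
MARKER = "Last suspend didn't reach deepest state"


def shallow_suspend_lines(log_text):
    """Return the amd_pmc log lines reporting a suspend that missed the deepest state."""
    return [
        line
        for line in log_text.splitlines()
        if "amd_pmc" in line and MARKER in line
    ]
```

For example, `len(shallow_suspend_lines(open("kernel.log").read()))` gives the number of such suspends in a log saved as kernel.log (an illustrative file name). A count of zero over several suspend cycles suggests the platform did reach the deepest state each time.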
Comment 3 Artem S. Tashkinov 2024-03-05 16:57:15 UTC
HP laptops also exhibit this error, I first reported it over half a year ago.

See https://bugzilla.kernel.org/show_bug.cgi?id=218305

Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller); however, it's weird and alarming that we now have _two_ vendors with the same issue.

Maybe AMD could do something to prevent vendors from breaking things.

The reproduction is much like yours (and I failed to mention it in my existing bug report): all it takes to hit this bug is to put the laptop to sleep, unplug it, wait a bit, plug it back in, and resume it. After that it's broken, and only a full reboot/power cycle fixes it.

Windows is not affected for some reason, or maybe I haven't tested enough.
Comment 4 Mario Limonciello (AMD) 2024-03-05 17:06:54 UTC
> Besides, in the "good case" log there is this line:
> kernel: amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

If the last suspend didn't reach the deepest state with that reverted I do agree it's not actually fixing the root of the issue; it's masking it.

Could you repeat your bisect keeping this in mind?

> Mario Limonciello from AMD said it's "an HP EC bug" (embedded controller)
> however it's weird and alarming we now have _two_ vendors with the same
> issue.

I need to point out that the EC is proprietary to each vendor.  I have no knowledge of their codebase.  It's entirely possible they issue some of the same commands to the SoC though.

Mark, would you be able to find out more from your EC team what their expectations are for this situation compared to how Windows behaves?

I wonder if we have a "mismatch" scenario where the EC is "expecting" the system to wake up and react, but we don't do that in Linux: we wait for a second interrupt to be active (like the GPIO controller) before we wake the system.
Comment 5 Mark Pearson 2024-03-05 18:54:51 UTC
Thanks Mario

> Could you repeat your bisect keeping this in mind?
What would I be looking for? When sleep stops working? This platform was certified, so I'm assuming it was OK at cert time, but oh boy, that's going to take some tracking down and be painful :( The last round of bisects wasn't a barrel of laughs... I was so happy to have found something concrete. Sigh.

> would you be able to find out more from your EC team what their expectations
> are for this situation compared to how Windows behaves?
Absolutely. I'll take that conversation offline to work through with you and Renjith.

Artem, I scanned your bug only briefly, but any chance your system has an Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards, and it just makes me suspicious that we're not (so far...touch wood) seeing this anywhere else. I have an action item to track down a non-Nvidia SKU to confirm this theory.

Mark
Comment 6 Mario Limonciello (AMD) 2024-03-05 19:04:56 UTC
> What would I be looking for? When sleep stops working? (this platform was
> certified so I'm assuming it was OK at cert time - but oh boy that's going to
> take some tracking down and be painful :( (the last round of bisects wasn't a
> barrel of laughs...I was so happy to have found something concrete. Sigh)

I think you should basically reproduce the issue as you described, but you need to keep track of where you are on the bisect timeline and might need to do some extra steps.

1) After each resume, make sure the system actually reached the deepest state during suspend; if it didn't, treat that step as a "skip".
2) As you bisect between 6.4 and 6.5, at any step that has https://github.com/torvalds/linux/commit/896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05 but not https://github.com/torvalds/linux/commit/b5539eb5ee70257520e40bb636a295217c329a50, you should revert 896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05.
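
The two rules above can be sketched as a small helper for each bisect step. This is only an illustration: the function names are hypothetical, and it assumes git is on PATH and that `git merge-base --is-ancestor` is run inside the kernel tree being bisected.

```python
import subprocess

# Commit IDs copied from the two links above.
COMMIT_896E = "896e97bf99ecf0ecb6cc420bc2c9eb268d3edc05"
COMMIT_B553 = "b5539eb5ee70257520e40bb636a295217c329a50"


def is_ancestor(commit, repo="."):
    """True if `commit` is contained in the currently checked-out HEAD."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, "HEAD"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def needs_revert(has_896e, has_b553):
    """Rule 2: revert 896e97bf... when it is present but b5539eb5... is not."""
    return has_896e and not has_b553


def bisect_verdict(reached_deepest, stuck_low):
    """Rule 1: a step only counts if the deepest state was reached."""
    if not reached_deepest:
        return "skip"
    return "bad" if stuck_low else "good"
```

At each step, `needs_revert(is_ancestor(COMMIT_896E), is_ancestor(COMMIT_B553))` says whether to revert before building, and the string returned by `bisect_verdict` is meant to be passed to `git bisect` by hand once the suspend/resume test has been run.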

> Absolutely. I'll take that conversation offline 

👍
Comment 7 Artem S. Tashkinov 2024-03-06 09:51:46 UTC
Hi Mark,

> Artem, I scanned your bug only briefly - but any chance your system have an
> Nvidia card? The two Lenovo systems impacted by this both have Nvidia cards
> and it just makes me suspicious that we're not (so far...touch wood) seeing
> this anywhere else. I have an action item to track down a non-Nvidia SKU to
> confirm this theory.

Nope, it's an HP EliteBook 845 14" G10 laptop and it only has a built-in Radeon 780M iGPU.

Windows is seemingly not affected by this issue.
Comment 8 Rafael J. Wysocki 2024-03-06 12:16:04 UTC
@Artem: You can try to disable the deepest idle states on all CPUs via the cpuidle sysfs before suspending and see if that makes any difference.

The suspicion is that if the SoC gets into a deep enough low-power state, it may need some extra work to restore the previous configuration properly.
Comment 9 Mario Limonciello (AMD) 2024-03-06 16:25:58 UTC
That should block VDDOFF, which will prevent the SoC from getting into the deepest state over suspend, which in turn means it will behave similarly to what Mark found.

I don't think it's the most useful datapoint.

I *think* the common bit with Artem's issue and Mark's issue is that there are some APU thermal coefficients that are set by the EC that aren't getting updated properly over s2idle and the system is staying throttled.

The closest analogy to this for Intel is the EC setting PL1 or PL2.

The specific coefficients used differ between HP and Lenovo, so I think we should treat the two issues separately for now, although I admit that they have very similar reproduction steps and might have a similar root cause.

I've posted some more debugging steps to Artem's bug.
Comment 10 Artem S. Tashkinov 2024-03-06 16:53:47 UTC
@Rafael

> @Artem: You can try to disable the deepest idle states on all CPUs via the
> cpuidle sysfs before suspending and see if that makes any difference.

I did:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state3/disable

(not sure if it's the right one, state3/name says "C3" which looks like it's the lowest).
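
To avoid guessing at the state number, a sketch like the following lists the highest-numbered cpuidle state per CPU together with its name, and can then disable it. It assumes the highest-numbered stateN directory is the deepest state, which is how cpuidle usually orders them but is worth confirming from the printed names; the function names are hypothetical, and writing to "disable" requires root.

```python
import re
from pathlib import Path

STATE_DIR = re.compile(r"state(\d+)")


def deepest_idle_states(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU to (path, name) of its highest-numbered cpuidle state."""
    result = {}
    for cpuidle in sorted(Path(cpu_root).glob("cpu[0-9]*/cpuidle")):
        states = []
        for path in cpuidle.iterdir():
            match = STATE_DIR.fullmatch(path.name)
            if match and (path / "name").is_file():
                states.append((int(match.group(1)), path))
        if not states:
            continue
        _, deepest = max(states, key=lambda item: item[0])
        result[cpuidle.parent.name] = (deepest, (deepest / "name").read_text().strip())
    return result


def set_disabled(state_dir, disabled=True):
    """Write 1 (disable) or 0 (re-enable) to the state's 'disable' file."""
    (Path(state_dir) / "disable").write_text("1" if disabled else "0")
```

Printing `deepest_idle_states()` first shows which state would be touched on each CPU; calling `set_disabled(path)` on each returned path has the same effect as the echo command above when state3 is indeed the deepest.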

Let me check. And I will get to Mario's new debugging steps in a moment.
Comment 11 Artem S. Tashkinov 2024-03-06 17:14:50 UTC
Nope, that didn't help. Now on to Mario's suggestions. Sorry for spamming this bug report; it doesn't look related to my issue.
Comment 12 Artem S. Tashkinov 2024-03-11 11:30:49 UTC
Framework Phoenix laptops also seem to be affected even when running Windows:

https://community.frame.work/t/amd-cpu-stuck-in-low-speed-state-after-system-resume/39921
