I'm almost sure it's a bug in the firmware, but since I cannot make HP fix it, I'll try to report it here.

The CPU gets stuck at this extremely low frequency after N suspend/resume cycles, where N can be 1, 2, 3 or 4, but at most 5. The laptop is plugged in at all times.

This is happening with both acpi-cpufreq and amd-pstate-epp.

# cpupower frequency-info
analyzing CPU 10:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 10
  CPUs which need to have their frequency coordinated by software: 10
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 6.08 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 6.08 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 544 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 232. Maximum Frequency: 6.08 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.
Some CPU parameters look completely wrong after it happens:

# ryzenadj -i
| Name                | Value     | Parameter          |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    30.000 | stapm-limit        |
| STAPM VALUE         |     4.181 |                    |
| PPT LIMIT FAST      |    30.000 | fast-limit         |
| PPT VALUE FAST      |     5.347 |                    |
| PPT LIMIT SLOW      |    20.000 | slow-limit         |
| PPT VALUE SLOW      |     3.747 |                    |
| StapmTimeConst      |       nan | stapm-time         |
| SlowPPTTimeConst    |       nan | slow-time          |
| PPT LIMIT APU       |       nan | apu-slow-limit     |
| PPT VALUE APU       |       nan |                    |
| TDC LIMIT VDD       |       nan | vrm-current        |
| TDC VALUE VDD       |       nan |                    |
| TDC LIMIT SOC       |       nan | vrmsoc-current     |
| TDC VALUE SOC       |       nan |                    |
| EDC LIMIT VDD       |       nan | vrmmax-current     |
| EDC VALUE VDD       |       nan |                    |
| EDC LIMIT SOC       |       nan | vrmsocmax-current  |
| EDC VALUE SOC       |       nan |                    |
| THM LIMIT CORE      |       nan | tctl-temp          |
| THM VALUE CORE      |       nan |                    |
| STT LIMIT APU       |       nan | apu-skin-temp      |
| STT VALUE APU       |       nan |                    |
| STT LIMIT dGPU      |       nan | dgpu-skin-temp     |
| STT VALUE dGPU      |       nan |                    |
| CCLK Boost SETPOINT |       nan | power-saving /     |
| CCLK BUSY VALUE     |       nan | max-performance    |
I do use this command to constrain CPU thermals:

ryzenadj --tctl-temp=75 --stapm-limit=30000 --fast-limit=30000 --slow-limit=20000

https://github.com/FlyGoat/RyzenAdj

Perhaps on resume the firmware sees altered limits and wreaks havoc on everything. These last three parameters suddenly become read-only after the bug occurs.
Does the issue also happen if you don't use ryzenadj?
(In reply to Armin Wolf from comment #2)
> Does the issue also happen if you don't use ryzenadj?

Yes, today a single suspend/resume cycle has been enough to trigger this bug.

This is the result of this bug (no settings have been altered prior):

ryzenadj -i
CPU Family: Phoenix
SMU_SERVICE REQ_ID:0x3
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0xe, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU BIOS Interface Version: 14
Version: v0.14.0
init_table
SMU_SERVICE REQ_ID:0x6
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x4c0008, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x66
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x9e300000, arg1:0x7, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0xfd, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
PM Table Version: 4c0008
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
| Name                | Value     | Parameter          |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    51.000 | stapm-limit        |
| STAPM VALUE         |     4.150 |                    |
| PPT LIMIT FAST      |    51.000 | fast-limit         |
| PPT VALUE FAST      |     6.040 |                    |
| PPT LIMIT SLOW      |    41.000 | slow-limit         |
| PPT VALUE SLOW      |     4.056 |                    |
| StapmTimeConst      |       nan | stapm-time         |
| SlowPPTTimeConst    |       nan | slow-time          |
| PPT LIMIT APU       |       nan | apu-slow-limit     |
| PPT VALUE APU       |       nan |                    |
| TDC LIMIT VDD       |       nan | vrm-current        |
| TDC VALUE VDD       |       nan |                    |
| TDC LIMIT SOC       |       nan | vrmsoc-current     |
| TDC VALUE SOC       |       nan |                    |
| EDC LIMIT VDD       |       nan | vrmmax-current     |
| EDC VALUE VDD       |       nan |                    |
| EDC LIMIT SOC       |       nan | vrmsocmax-current  |
| EDC VALUE SOC       |       nan |                    |
| THM LIMIT CORE      |       nan | tctl-temp          |
| THM VALUE CORE      |       nan |                    |
| STT LIMIT APU       |       nan | apu-skin-temp      |
| STT VALUE APU       |       nan |                    |
| STT LIMIT dGPU      |       nan | dgpu-skin-temp     |
| STT VALUE dGPU      |       nan |                    |
| CCLK Boost SETPOINT |       nan | power-saving /     |
| CCLK BUSY VALUE     |       nan | max-performance    |

STAPM, PPT FAST and PPT SLOW all have broken values.
It could be that the firmware fails to properly restore those values after suspend. Does the issue also happen under Windows?
(In reply to Armin Wolf from comment #4)
> Could be that the firmware fails to properly restore those values after
> suspend, does the issue also happen under Windows?

I rarely boot into Windows but I may check.
I've not been able to reproduce this issue under Windows, but then I didn't try hard enough (that would take multiple attempts spanning several days).
Have you verified that you are using the latest BIOS for your machine?
The issue is reproducible with the latest BIOS release (V82: 01.03.09 Rev.A, released on Dec 15, 2023) and the two versions prior. HP doesn't allow downloading and flashing earlier versions.
Since you are pointing to the STAPM and PPT limits, can you blacklist the amd_pmf driver and see if that helps after a suspend/resume cycle?
This is reproducible without the amd-pmf module:

[root@hp policy0]# lsmod | grep pmf
[root@hp policy0]# cpupower frequency-info
analyzing CPU 13:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 13
  CPUs which need to have their frequency coordinated by software: 13
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 5.76 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 5.76 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 542 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 220. Maximum Frequency: 5.76 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.
Created attachment 305659 [details]
The contents of /sys/devices/system/cpu/cpufreq/*

Switching between power modes using /sys/devices/system/cpu/cpufreq/*/energy_performance_preference does nothing.

The exact per-CPU frequency stats:

# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
400000
544395
400000
544099
400000
400000
400000
400000
400000
544189
400000
544181
400000
400000
544007
542947

No idea where 544 MHz comes from.

BTW here's another bug: either the firmware or something in the kernel reports the wrong max frequency:

# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
5137000
6080000
6080000
5764000
5764000
5924000
5924000
5137000
6080000
6080000
5608000
5608000
5293000
5293000
5449000
5449000

I'm not aware of any Zen 4 CPUs which can run at 6080000 kHz by default, let alone mobile parts.
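A quick sanity check on the 6.08 GHz figure: it is at least self-consistent with the cpupower output earlier in this report, where scaling the 3.80 GHz nominal frequency by the ratio of highest to nominal performance (232/145) gives exactly that number. The sketch below only reproduces the arithmetic; the function name is made up for illustration and the kernel's exact rounding is an assumption, which is why the 220 case lands about 1.5 MHz away from the 5764000 in sysfs.

```python
def max_freq_khz(nominal_khz: int, highest_perf: int, nominal_perf: int) -> int:
    """Scale the nominal frequency by the highest/nominal performance ratio.

    Illustrative only: the real driver may round differently.
    """
    return nominal_khz * highest_perf // nominal_perf


# Inputs taken from the cpupower output in this report.
print(max_freq_khz(3_800_000, 232, 145))  # 6080000
print(max_freq_khz(3_800_000, 220, 145))  # 5765517
```

So the odd per-core maxima would follow from the per-core "highest performance" values rather than from a separate frequency table, if this reading is right.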
> BTW here's another bug, either firmware or something in the kernel reports
> wrong max frequency:

Max frequency is reported correctly only for two out of sixteen logical cores. It's wrong for all other cores. Would be great if AMD fixed this.

Speaking of my firmware, it's:

https://support.hp.com/us-en/drivers/hp-elitebook-845-14-inch-g10-notebook-pc/2101628462

> Description:
>
> This package is used to update the supported firmware on HP Business Notebook
> systems with a V82 family BIOS. This package is provided for supported
> computer systems that are running a supported operating system.
>
> Fix and enhancements:
>
> - Fixes an issue where the Performance page in AMD Software: Adrenalin
>   Edition does not display correctly.
> - Adds the Gaming Optimized mode to video memory size.
>
> - Includes the following firmware:
>   AMD Graphics Output Protocol (GOP) Firmware, version 3.7.10
>   AMD PSP Firmware, version 0.2D.6.6C
>   AMD SMU Firmware, version 0.76.65.0
>   Embedded Controller (EC) Firmware, version 60.28.00
>   Intel/Realtek UEFI PXE ROM, version 2.041
>   TI Power Delivery (PD) Firmware, version 4.1.0
I am seeing similar behaviour, to the extent that my CPU cores get capped at some low frequency. Sometimes it is a few cores stuck at ~1600 MHz, and sometimes it is all cores stuck at 544 MHz.

It typically happens for me when rebooting. I tried suspend/resume several times but could not reproduce it that way.

The CPU is an AMD Ryzen 5 7640U on a Framework 13 laptop, running the 6.6.8 kernel on Fedora 39.

We may not be having the same issue, but in case it helps: I can get all cores back to normal by switching the scaling_governor from powersave to performance and back. I am using "sudo cpupower frequency-set -g <GOV>" to switch it.
I read through this thread and I currently think that Artem and Dan have encountered two separate bugs.

@Artem:

Under the presumption that ryzenadj is actually retrieving the correct values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a specific power adapter, or sequence of events. Like suspend on power, resume on battery or suspend on battery resume on power.

If there is a linkage between any of those, then I think this is "most likely" an HP EC bug.

@Dan,

Can you reproduce this if you manually always set the scaling governor on all CPUs to "performance" before you reboot?
(In reply to Mario Limonciello (AMD) from comment #14)
> Under the presumption that ryzenadj is actually retrieving the correct
> values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a
> specific power adapter, or sequence of events. Like suspend on power,
> resume on battery or suspend on battery resume on power.

My laptop is plugged in 100% of the time.

> If there is a linkage between any of those, then I think this is "most
> likely" an HP EC bug.

I've given up on reporting bugs to HP. It's a complicated process which takes forever. When I bought this laptop, its maximum CPU frequency was limited to 4.5 GHz, which took HP over four months to resolve, and that was at least reproducible under both Linux and Windows.

This bug seems to affect only Linux, or maybe I've not used Windows enough to hit it there.
I can reproduce Artem's issue on EliteBook 845 G10 (kernel 6.7.0 on NixOS). Also Dan's workaround works for me 80% of the time, with only a few times when I had to reboot to lift the cpufreq lock.

My max_freqs are also strange, regardless of whether cpufreq is capped to 544 MHz or not.

```
❯ cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
5137000
5137000
6080000
6080000
5449000
5449000
5293000
5293000
5924000
5924000
6080000
6080000
5764000
5764000
5608000
5608000
```
(In reply to Muzhi Yu from comment #16)
> I can reproduce Artem's issue on EliteBook 845 G10 (kernel 6.7.0 on NixOS).
> Also Dan's workaround works for me 80% of the time, with only a few times
> when I had to reboot to lift the cpufreq lock.

I have since found that I don't need to switch to the performance governor at all. It is enough, in my case, to "reset" the scaling governor to powersave. Just "sudo cpupower frequency-set -g powersave".
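For anyone without cpupower installed, the same "reset" can probably be done by writing the governor name straight into sysfs. A minimal sketch, assuming this is roughly equivalent to what cpupower does and that it runs as root; the function name and the `root` parameter are made up here (the parameter exists only so the function can be pointed at a test directory).

```python
from pathlib import Path


def reapply_governor(governor: str = "powersave",
                     root: str = "/sys/devices/system/cpu/cpufreq") -> list:
    """Write `governor` into every policy's scaling_governor file.

    Returns the names of the policies that were written, in sorted order.
    """
    written = []
    for policy in sorted(Path(root).glob("policy*")):
        # Re-writing the value that is already set is intentional: per the
        # reports above, the write itself is what seems to matter.
        (policy / "scaling_governor").write_text(governor + "\n")
        written.append(policy.name)
    return written
```

Calling `reapply_governor()` with no arguments would be the rough equivalent of "sudo cpupower frequency-set -g powersave".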
(In reply to Mario Limonciello (AMD) from comment #14)
> I read through this thread and I currently think that Artem and Dan have
> encountered two separate bugs.
>
> @Artem:
>
> Under the presumption that ryzenadj is actually retrieving the correct
> values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a
> specific power adapter, or sequence of events. Like suspend on power,
> resume on battery or suspend on battery resume on power.
>
> If there is a linkage between any of those, then I think this is "most
> likely" an HP EC bug.
>
> @Dan,
>
> Can you reproduce this if you manually always set the scaling governor on
> all CPUs to "performance" before you reboot?

Mario,

I just tested setting the governor to performance before reboot and yes, it is reproducible in that case too.

1. load the CPU and observe all cores can reach ~4 GHz
2. set governor: sudo cpupower frequency-set -g performance
3. reboot
4. load the CPU and check frequencies: on first reboot, all cores hit the 4 GHz range. On second reboot, cores 6-11 can only reach ~1.7 GHz.

This is in line with previous tests. It is inconsistent, and various power settings don't seem to affect it (epp, platform_profile, scaling_governor). It does seem much more likely to occur when on battery, but will still happen sometimes when plugged in.

A couple of more recent observations:
- I don't need to toggle from performance to powersave to fix it. I can just "sudo cpupower frequency-set -g powersave" even when it is already reporting that it is using the powersave governor.
- On reboot, the scaling_governor is always showing powersave, even when I set it to performance before reboot.
- Using kernel 6.6.11 as of this morning for the above test.

Thanks,
Dan
Can you please dump the values of all of these MSRs from userspace while in a reproduced state?

#define MSR_AMD_CPPC_CAP1		0xc00102b0
#define MSR_AMD_CPPC_ENABLE		0xc00102b1
#define MSR_AMD_CPPC_CAP2		0xc00102b2
#define MSR_AMD_CPPC_REQ		0xc00102b3
#define MSR_AMD_CPPC_STATUS		0xc00102b4
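One way to collect these from userspace is to read /dev/cpu/N/msr directly, where the file offset is the register number and each read returns 8 bytes. A minimal sketch, assuming the msr kernel module is loaded and the script runs as root; the helper names and the `dev_root` parameter are invented for illustration, and `rdmsr` from msr-tools is an alternative.

```python
import glob
import os
import struct

# Register numbers as listed in the request above.
CPPC_MSRS = {
    "MSR_AMD_CPPC_CAP1": 0xC00102B0,
    "MSR_AMD_CPPC_ENABLE": 0xC00102B1,
    "MSR_AMD_CPPC_CAP2": 0xC00102B2,
    "MSR_AMD_CPPC_REQ": 0xC00102B3,
    "MSR_AMD_CPPC_STATUS": 0xC00102B4,
}


def read_msr(cpu: int, reg: int, dev_root: str = "/dev/cpu") -> int:
    """Read one 64-bit MSR of one CPU through the msr character device."""
    fd = os.open(f"{dev_root}/{cpu}/msr", os.O_RDONLY)
    try:
        # The register number is used as the file offset.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def dump_cppc(dev_root: str = "/dev/cpu") -> dict:
    """Return {msr_name: [value for each CPU, in CPU order]}."""
    cpus = sorted(int(p.split("/")[-2])
                  for p in glob.glob(f"{dev_root}/[0-9]*/msr"))
    return {name: [read_msr(cpu, reg, dev_root) for cpu in cpus]
            for name, reg in CPPC_MSRS.items()}
```

Printing each value with `format(value, "x")` gives output comparable to the hex dumps posted in this thread.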
Can you guys please test this and see if it improves the situation at all? https://lore.kernel.org/linux-pm/20240119113319.54158-1-mario.limonciello@amd.com/T/#u Thanks!
(In reply to Mario Limonciello (AMD) from comment #19)
> Can you please dump the values from all of these MSRs from userspace while
> in a reproduced state?
>
> #define MSR_AMD_CPPC_CAP1		0xc00102b0
> #define MSR_AMD_CPPC_ENABLE		0xc00102b1
> #define MSR_AMD_CPPC_CAP2		0xc00102b2
> #define MSR_AMD_CPPC_REQ		0xc00102b3
> #define MSR_AMD_CPPC_STATUS		0xc00102b4

Hi Mario,

Thank you for looking into this. I'll try your kernel patch when I have a bit more time. For now, here are the MSRs.

Good state (from a boot when no cores were limited):

=========================
=========================
MSR_AMD_CPPC_CAP1
d08a2c10 d08a2c10d08a2c10 dc8a2c10 dc8a2c10dc8a2c10
ca8a2c10 ca8a2c10ca8a2c10 dc8a2c10 dc8a2c10dc8a2c10
c48a2c10 c48a2c10c48a2c10 d68a2c10 d68a2c10d68a2c10
=========================
MSR_AMD_CPPC_ENABLE
1 1 1 1 1 1 1 1 1 1 1 1
=========================
MSR_AMD_CPPC_CAP2
0 0 0 0 0 0 0 0 0 0 0 0
=========================
MSR_AMD_CPPC_REQ
10d0 10d0 10dc 10dc 10ca 10ca 10dc 10dc 10c4 10c4 10d6 f0f
=========================
MSR_AMD_CPPC_STATUS
0 0 0 0 0 0 0 0 0 0 0 0
=========================

In reproduced state, where all cores are stuck at ~544 MHz, MSR_AMD_CPPC_REQ values appear to have wrapped around?

========================================
========================================
MSR_AMD_CPPC_CAP1
d08a2c10 d08a2c10d08a2c10 dc8a2c10 dc8a2c10dc8a2c10
ca8a2c10 ca8a2c10ca8a2c10 dc8a2c10 dc8a2c10dc8a2c10
c48a2c10 c48a2c10c48a2c10 d68a2c10 d68a2c10d68a2c10
=========================
MSR_AMD_CPPC_ENABLE
1 1 1 1 1 1 1 1 1 1 1 1
=========================
MSR_AMD_CPPC_CAP2
0 0 0 0 0 0 0 0 0 0 0 0
=========================
MSR_AMD_CPPC_REQ
ff000f0f ff000f0f ff000f0f ff000f0f ff000f0f ff000f0f
ff000f0f ff000f0f ff000f0f ff000f0f ff000f0f ff000f0f
=========================
MSR_AMD_CPPC_STATUS
0 0 0 0 0 0 0 0 0 0 0 0
=========================
=========================

And, when I (re)set the scaling governor, the MSR_AMD_CPPC_REQ values change slightly. Here they are side by side: reproduced state on the left, and after re-setting the governor on the right.

=========================     =========================
MSR_AMD_CPPC_REQ              MSR_AMD_CPPC_REQ
ff000f0f                    | ff0010d0
ff000f0f                    | ff0010d0
ff000f0f                    | ff0010dc
ff000f0f                    | ff0010dc
ff000f0f                    | ff0010ca
ff000f0f                    | ff0010ca
ff000f0f                    | ff0010dc
ff000f0f                    | ff0010dc
ff000f0f                    | ff0010c4
ff000f0f                    | ff0010c4
ff000f0f                    | ff0010d6
ff000f0f                    | ff0010d6

Thanks,
Dan
(In reply to Mario Limonciello (AMD) from comment #20)
> Can you guys please test this and see if it improves the situation at all?
>
> https://lore.kernel.org/linux-pm/20240119113319.54158-1-mario.limonciello@amd.com/T/#u
>
> Thanks!

Hi again Mario,

I tested this patch against Fedora's 6.6.13 kernel and so far, after 6 reboots, have not been able to reproduce. When I switch back to the stock kernel, I can typically reproduce the issue in 1-2 reboots, so the patch seems to have helped so far.

I'll keep using the patched kernel for now and let you know if this issue occurs again.

Thanks,
Dan
(In reply to Dan Martins from comment #21)
> (In reply to Mario Limonciello (AMD) from comment #19)
> > Can you please dump the values from all of these MSRs from userspace while
> > in a reproduced state?

> In reproduced state, where all cores are stuck at ~544MHz, MSR_AMD_CPPC_REQ
> values appear to have wrapped around?

Ignore the comment about values wrapping around. It appears the upper byte is set when I adjust PPD from performance (0x00) to balanced (0x80) and powersave (0xFF). I must have adjusted this between reboots.
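To make the byte layout concrete, here is a small decoder for the low 32 bits of MSR_AMD_CPPC_REQ. The field positions (max perf in bits 7:0, min perf in 15:8, desired perf in 23:16, EPP in 31:24) are an assumption based on the kernel's msr-index.h macros as best recalled, and the function name is made up for illustration.

```python
def decode_cppc_req(value: int) -> dict:
    """Split the low 32 bits of MSR_AMD_CPPC_REQ into its four byte fields."""
    return {
        "max_perf": value & 0xFF,          # bits 7:0
        "min_perf": (value >> 8) & 0xFF,   # bits 15:8
        "des_perf": (value >> 16) & 0xFF,  # bits 23:16
        "epp": (value >> 24) & 0xFF,       # bits 31:24
    }


for raw in (0x10D0, 0xFF000F0F, 0xFF0010D0):
    print(format(raw, "08x"), decode_cppc_req(raw))
```

Under that layout, the stuck value ff000f0f decodes to max_perf 15 and min_perf 15, whereas the values seen after re-setting the governor (for example ff0010d0) have min_perf 16 and max_perf 208. The 0xff upper byte is the EPP field, which matches the explanation above.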
That's great news. Everyone who feels comfortable sharing their email address, feel free to reply to the post with a "Tested-by" tag.
Hi guys,

Just adding my data point here. I've applied the patch and haven't seen this bug this evening after ~5 cycles.

BTW, are the MSR values still relevant? I'm seeing no difference between normal and bad states:

```
c4912a10c4912a10
1
0
ff0010c4
0
```

Thanks!
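For reference, the first value in that dump (MSR_AMD_CPPC_CAP1) can be split into its performance levels. The layout used below (highest perf in bits 31:24, nominal in 23:16, lowest non-linear in 15:8, lowest in 7:0) is an assumption based on the kernel's msr-index.h macros as best recalled; only the low 32 bits are decoded and the function name is invented for illustration.

```python
def decode_cppc_cap1(value: int) -> dict:
    """Decode the low 32 bits of MSR_AMD_CPPC_CAP1 into performance levels."""
    low = value & 0xFFFFFFFF
    return {
        "highest_perf": (low >> 24) & 0xFF,
        "nominal_perf": (low >> 16) & 0xFF,
        "lowest_nonlinear_perf": (low >> 8) & 0xFF,
        "lowest_perf": low & 0xFF,
    }


print(decode_cppc_cap1(0xC4912A10C4912A10))
```

With that layout the value decodes to nominal 145, lowest non-linear 42 and lowest 16, the same numbers cpupower printed in the original report for this laptop model, which suggests the assumed layout is at least plausible.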
No need to share MSR values anymore. I believe this is the correct solution. If there are still problems with it, they may be a secondary problem.

The MSR values are a little difficult to capture properly because each CPU has its own register value. So a proper test would need to capture all of them for all CPUs (not all of them may have this problem occurring).
So I think there are actually two issues in this bug.

* The first one was the one that Artem reported, which looks like a problem with the EC communicating some limits to the APU. This is Artem's issue.
* The second one is that there was a bug in amd-pstate that could cause CPPC requests to have the wrong values. This is (nearly) everyone else's issue in this bug.

The second issue is fixed by https://github.com/torvalds/linux/commit/22fb4f041999f5f16ecbda15a2859b4ef4cbf47e

For the first issue, Artem, can you update to 6.8-rc7, make sure you've added the TEE firmware for the amd-pmf driver from linux-firmware, and see if you can still reproduce it?
> For the first issue, Artem can you update to 6.8-rc7, make sure you've added
> the TEE firmware for the amd-pmf driver from linux-firmware and see if you
> can still reproduce it?

I've just added the firmware file, "773bd96f-b83f-4d52-b12dc529b13d8543.bin" (what a weird name) and I will test 6.8 as soon as it gets released. It's coming pretty soon.

Thanks.
> I've just added the firmware file, "773bd96f-b83f-4d52-b12dc529b13d8543.bin"
> (what a weird name) and I will test 6.8 as soon as it gets released. It's
> coming pretty soon.

OK.

Separately from that, I'd like to understand what you were getting at with your ryzenadj comment. I don't know if ryzenadj accesses all those coefficients correctly, but we *do* export them properly under amd-pmf debugfs.

There is a debugfs file called "current_power_limits". Can you read it before suspend as well as after a suspend that reproduced the failure?
Sorry, two more things.

First - there are two sets of coefficients (one for AC and one for DC). You can see in current_power_limits_show() that it will return the table matching your power mode.

Please capture like this:
1) Start on DC, capture the file.
2) Switch to AC, capture the file.
3) Suspend the machine
4) Unplug adapter
5) Resume
6) Capture the file (while you're on DC)
7) Switch to AC, capture the file.

This will let us confirm whether or not there is a problem with the table.

Second - after the issue has occurred, does changing the ACPI platform profile from sysfs or powerprofilesctl recover it?
1) current_power_limits
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41001 stt_min:25000 stt[APU]:0 stt[HS2]: 0

2) cat current_power_limits
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0

3-4-5) done

6) cat current_power_limits
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0

7) cat current_power_limits
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0

While we have been discussing this, I've just found out that when this bug occurs, all I need to do is to unplug and that fixes everything.

It's actually such a simple workaround, I will leave it up to you whether anything needs to be done to address it.
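Captures like these are easy to compare mechanically. The sketch below parses a line into a dict and reports the keys whose values differ; the function names are made up for illustration, and the parser only assumes the "key:value" shape seen in the pasted output (with an optional space after the colon).

```python
import re


def parse_limits(text: str) -> dict:
    """Parse 'spl:51000 fppt:51000 ... stt[HS2]: 0' into {name: int}."""
    return {key: int(val)
            for key, val in re.findall(r"([\w\[\]]+):\s*(\d+)", text)}


def diff_limits(before: dict, after: dict) -> dict:
    """Return {key: (before, after)} for every key whose value changed."""
    keys = sorted(before.keys() | after.keys())
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


capture_1 = parse_limits("spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41001 "
                         "stt_min:25000 stt[APU]:0 stt[HS2]: 0")
capture_2 = parse_limits("spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 "
                         "stt_min:25000 stt[APU]:0 stt[HS2]: 0")
print(diff_limits(capture_1, capture_2))
```

On the values as pasted, the only difference is sppt_apu_only in capture 1 (41001 versus 41000 in the others), which may or may not be meaningful.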
> cat current_power_limits

It looks like those don't change.

> While we have been discussing this, I've just found out that when this bug
> occurs, all I need to do is to unplug and that fixes everything.

Presumably you mean unplug OR replug (i.e. the opposite of what you did in suspend), right?

> It's actually such a simple workaround, I will leave it up to you whether
> anything needs to be done to address it.

It's good you have that workaround. I'd like to know if powerprofilesctl/ACPI platform profile can also recover it.

If so, we might want to add explicit code in the suspend/resume callbacks to rewrite the state if the power adapter changed over suspend. I think this would be a safe solution for everyone.
> Presumably you mean unplug OR replug (IE opposite of what you did in suspend)
> right?

After unplugging it's already fixed. I do of course replug so as not to waste battery power.

> If so; we might want to add an explicit code in the suspend/resume callbacks
> to rewrite the state if power adapter changed over suspend. I think this
> would be a safe solution for everyone.

If only it doesn't break other systems. That sounds a tad scary to me. So far I seem to have been the only affected person (not that many people seem to be using HP business laptops with Linux).
> If only it doesn't break other systems. That sounds a tad scary to me. So far
> I seem to have been the only affected person (not that many people seem to be
> using HP business laptops with Linux).

The code would basically look like this:

* Capture the state of the power adapter at the suspend callback into a private variable.
* If the state of the power adapter has changed during resume, then rewrite all CPU coefficients.

It should be safe for everyone. But I need to know that it actually helps your problem.
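The two steps above can be modelled in a few lines. This is only a userspace illustration of the described logic, not kernel code and not the actual amd-pmf implementation: the class name and both callbacks (`read_ac_online`, `rewrite_limits`) are hypothetical stand-ins.

```python
class AdapterChangeRestorer:
    """Model of 'rewrite the limits if the adapter state changed over suspend'."""

    def __init__(self, read_ac_online, rewrite_limits):
        self._read_ac_online = read_ac_online  # hypothetical: returns True on AC
        self._rewrite_limits = rewrite_limits  # hypothetical: re-sends coefficients
        self._ac_at_suspend = None

    def on_suspend(self) -> None:
        # Step 1: remember the adapter state in a private variable.
        self._ac_at_suspend = self._read_ac_online()

    def on_resume(self) -> bool:
        # Step 2: rewrite only when the adapter state differs from suspend time.
        changed = self._read_ac_online() != self._ac_at_suspend
        if changed:
            self._rewrite_limits()
        return changed
```

Because the rewrite only happens when the adapter state differs across suspend, systems that never change adapter state while asleep would see no behaviour change under this model.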
Shouldn't the driver generally restore all CPU coefficients after suspend/resume? Or is there a specification saying that the CPU coefficients will be restored by the platform firmware?
AMD-PMF can be used differently by different OEMs and models depending upon their needs and desires. Some will control entirely by their EC. Some will rely on PMF to do more functionality.
Could it be that the Windows equivalent of the amd-pmf driver does restore all/some coefficients after suspend/resume?
The Windows equivalent of the amd-pmf driver on this HP system uses the features in kernel 6.8 that I've been asking Artem to test. Once I know whether the issue happens on kernel 6.8 and whether changing the profile manually helps it we can decide on whether to do anything.
Ok
Hello everyone. I just read through this bug report as I have the same problem on my HP EliteBook 845 G10 with a 7840U CPU. It randomly gets stuck at 544 MHz.

I'm running EndeavourOS (Arch-based) with the latest kernel (6.8.2-arch2-1) and firmware (core/linux-firmware 20240312.3b128b60-1).

I thought things would be fixed by now, but I just had the stuck CPU frequency again. As the bug is not closed and the last comment is nearly 4 weeks old, I just wanted to know if the fix is not official yet...

Thanks for an update, and to all for investigating here :)

(Meanwhile I'll try the "sudo cpupower frequency-set -g powersave" workaround to see if it helps to avoid an annoying reboot.)
Created attachment 306075 [details]
possible patch (v1)

> It randomly gets stuck at 544MHz.

Are you sure it's random? From the above discussions I believe it is triggered specifically by an event sent by the EC when changing the power adapter while suspended.

> with the latest Kernel (6.8.2-arch2-1)

Thanks. I've been waiting for feedback with kernel 6.8. And you have CONFIG_AMD_PMF set?

> I just wanted to know if the fix is not official yet...

There is no fix or workaround right now. Like I said above, this "looks" like a bug caused by HP's EC or BIOS.

Assuming you tested with amd-pmf in place and it really is the same root cause described above (only by power adapter), I was thinking about it and this sounds like it could be a race condition.

I do have an idea for a workaround. Can you see if this patch helps?
> From the above discussions I believe it is triggered specifically from an
> event sent by the EC when changing the power adapter while suspended.

Yep, and like I said, in my case unplugging/plugging the power cord is enough to fix it, which was a relief for me.
I have

$ zcat /proc/config.gz | grep CONFIG_AMD_PMF
CONFIG_AMD_PMF=m
# CONFIG_AMD_PMF_DEBUG is not set

Unfortunately I could not reproduce the effect during the test I did right now. My subjective impression is that the bug occurs less often since the 6.8.x kernel.

Testing procedure was:

Plugged - suspend - unplug - resume - OK
Unplugged - suspend - plug back in - resume - OK
Starting plugged - suspend - resume - repeated 6 times while plugged - OK

Resume was done via "systemctl suspend" command on my hotkey "Strg-Super-End"

Then I tried it 3 times by using the lid - same here - it works for the moment.

So it seems to be more random than for Artem.

I'll check if the un-/plug procedure helps as a quick fix to avoid a reboot when the CPU gets stuck again.

I somehow need to find a reliable procedure to run into this bug before it makes sense to test the patch.
(typo: -Resume-) Suspend was done via "systemctl suspend" command on my hotkey "Strg-Super-End"
If 6.8 is more reliable you can also try to apply the patch to 6.7 or an earlier kernel that could more easily trigger it.
Any testing results for that patch idea?
Hi Mario, sorry for not responding, I still haven't been able to reproduce the bug. I've had it just once since kernel 6.8.x.

I will test once I have a reproducible scenario.
(In reply to Peter Ries from comment #47)
> Hi Mario, sorry for not responding, I still haven't been able to reproduce
> the bug. Just had it once after Kernel 6.8.x.
>
> I will test once I have a reproducible scenario.

Please try what triggers it for me 100% of the time:

1. While the laptop is plugged in/connected to the mains, put it to sleep.
2. Unplug it for a little while - 20 seconds is enough I guess.
3. Plug it back in.
4. Wake it up.
Hi Artem,

this unfortunately works for me - meaning the CPU frequency does NOT get stuck if I do it like this.

- I put laptop to sleep
- unplugged
- waited 1 minute (without resuming on battery)
- plugged back in
- resume

-> CPU scales up and down as expected

I just wonder what happened AFTER I had the effect with kernel 6.8.x (only once).

I currently have
6.8.2-arch2-1
core/linux-firmware-whence 20240312.3b128b60-1
core/linux-firmware 20240312.3b128b60-1

really weird
Coming from Bug #217931, I found the mentions of being stuck at low frequency odd, as I couldn't observe that despite managing multiple hosts, but here I am. The twist is that I have a 7950X3D desktop setup, not a laptop, and apparently I just ran into the same low-frequency issue others experienced.

Unfortunately the usefulness of my information will be limited as I'm on a barely customized Kubuntu 23.10 setup with kernel 6.5.0, but on the other hand I haven't touched anything relevant, not even setting a frequency limit.

I'm observing the CPU being stuck in the 400 MHz - 549 MHz range, which fits this bug report, and the host was never suspended or hibernated. The only relevant oddity I've found so far is that /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq stuck out like a sore thumb with 400000 set while other cores had the value 5759000, but changing that didn't make a difference.

I'm not really sure when this manifested itself, but most likely after (or during?) a case of Bug #204253. I brushed away the slowness for a while as the usual heavy I/O (over NFS) problem, which used to freeze the desktop for more than a minute on a weaker setup; the current higher-performance CPU seemed to take it better, although the experience was still disruptive.

Is this really a laptop bug, then, rather than a more generic problem where a large stutter upsets some logic, possibly due to timing problems? Heavy CPU usage alone surely doesn't do the trick, as I've seen hosts doing fine with that, but heavy I/O seems more brutal, with possibly similar "world-stopping power" to suspending.
Please let's stick to upstream kernels. It will just confuse the issue with the distro kernel, ESPECIALLY a kernel that is EOL upstream. We had other fixes that have landed in amd-pstate that are definitely missing from a 6.5 kernel that could very well be a similar or same issue. So please reproduce with 6.9-rc4 or 6.8.7. If you can still reproduce it then please open a new issue and collect all possible information. If it's indeed the same issue we can mark as a duplicate at that time. This issue is looking like a thermal throttling issue where the APU didn't properly ack a request from the EC while in suspend. I posted a patch that gives the APU more time to ack it during suspend but it needs to be tested still in a case that it can be reproduced reliably. If it doesn't help, I would like to see if extending the time duration in between cycles helps.
I believe the other issue was supposed to be strictly about limiting max frequency causing issues, and I'm definitely not doing that, but possibly I missed other fixes; I surely didn't keep up with everything.

I understood the warning, though; that's exactly why I pointed out that my kernel version might not be helpful. The main point was that while all discussions seem to be about APUs, I encountered an issue that appears to be really similar, if not the same, with a desktop CPU. I just wanted this information to be added as I've found 3 bug reports where this problem is mentioned, but with only laptop CPUs discussed.
7950X3D is a desktop SoC, but IIRC it has integrated graphics. It's a desktop APU. But that aside, thermal throttling can be triggered even in CPU products from the EC. The interface the EC uses to do this applies to both types of parts.
I can reproduce the issue on my HP Elitebook 845 G10 with an AMD Ryzen 7 Pro 7840U running Linux 6.8.6 with the following steps:

1. Ensure that the power adapter is not plugged in
2. Suspend the machine
3. Wait for 10 seconds
4. Plug in the power adapter
5. Wait for 10 seconds
6. Wake the machine
7. The CPU frequency is now stuck at 544 MHz

When I unplug the power adapter now, the frequency will immediately start scaling up again. Replugging the power adapter again while the device is awake is also okay.

I was unable to reproduce the issue by starting with a power adapter plugged in and unplugging it before waking up. This did not seem to cause any issues.
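To confirm step 7 without eyeballing the numbers, something like the following can be run while the CPU is under load (idle cores sit near 400 MHz even on a healthy system). It is a rough sketch: the function names, the `root` parameter and the 600 MHz ceiling are arbitrary choices made here, picked only because the stuck state reported in this thread tops out around 544 MHz.

```python
from pathlib import Path


def read_cur_freqs(root: str = "/sys/devices/system/cpu/cpufreq") -> dict:
    """Return {policy_name: scaling_cur_freq in kHz} for every cpufreq policy."""
    return {p.name: int((p / "scaling_cur_freq").read_text())
            for p in sorted(Path(root).glob("policy*"))}


def looks_stuck(freqs: dict, ceiling_khz: int = 600_000) -> bool:
    """True if there is at least one policy and none exceeds the ceiling."""
    return bool(freqs) and all(f <= ceiling_khz for f in freqs.values())
```

Calling `looks_stuck(read_cur_freqs())` a few times during a CPU-heavy job should return False on a healthy system and stay True in the reproduced state.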
(In reply to Daan Vanoverloop from comment #54)
> I can reproduce the issue on my HP Elitebook 845 G10 with an AMD Ryzen 7 Pro
> 7840U running Linux 6.8.6 with the following steps:
>
> 1. Ensure that the power adapter is not plugged in
> 2. Suspend the machine
> 3. Wait for 10 seconds
> 4. Plug in the power adapter
> 5. Wait for 10 seconds
> 6. Wake the machine
> 7. The CPU frequency is now stuck at 544 MHz
>
> When I unplug the power adapter now, the frequency will immediately start
> scaling up again. Replugging the power adapter again while the device is
> awake is also okay.
>
> I was unable to reproduce the issue by starting with a power adapter plugged
> in and unplugging it before waking up. This did not seem to cause any issues.
Exactly how I experience it and what this bug is about. Mario said he would post a patch to reset the EC on resume and that should fix the issue but I've not seen the patch yet.
> Mario said he would post a patch to reset the EC on resume and that should
> fix the issue but I've not seen the patch yet.
Eh? I don't recall saying I'd post a patch to reset EC on resume.
I did post a patch to this bug that could try to adjust the timing that is waiting for testing though in case it's a race condition. It will force 10-20ms more time spent in the Linux kernel when the power adapter is unplugged over suspend.
Also if it doesn't help, please modify it to make it 100-200ms. This should rule out a race condition.
> Eh? I don't recall saying I'd post a patch to reset EC on resume. My memory is faltering obviously. Sorry. > The Windows equivalent of the amd-pmf driver on this HP system uses the > features in kernel 6.8 that I've been asking Artem to test. No, kernel 6.8 didn't fix the issue for me.
> I did post a patch to this bug that could try to adjust the timing that is
> waiting for testing though in case it's a race condition. It will force
> 10-20ms more time spent in the Linux kernel when the power adapter is
> unplugged over suspend.
>
> Also if it doesn't help, please modify it to make it 100-200ms. This should
> rule out a race condition.
I will apply this patch later today or tomorrow and report back on whether I can still reproduce this issue.
Created attachment 306264: debugging patch
I'm attaching a patch that isn't upstreamed at the moment, but that you can apply to your kernel to capture a debug register for me. Apply it to your kernel and then read the register value like this:
    echo "0x59804" | sudo tee /sys/kernel/debug/amd_nb/smn_address
    sudo cat /sys/kernel/debug/amd_nb/smn_value
Here is what a reasonable value looks like on my local system:
    $ echo "0x59804" | sudo tee /sys/kernel/debug/amd_nb/smn_address
    $ sudo cat /sys/kernel/debug/amd_nb/smn_value
    0x017f1201
Please share the values that you get from smn_value in these 3 situations:
1) At bootup (before you suspend)
2) After you've suspended and reproduced the issue
3) After you've applied the workaround to undo the issue.
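To make collecting the three readings less error-prone, the two commands could be wrapped so that each reading is logged with a label and a timestamp. This is a rough sketch, not part of the patch: it assumes the debugging patch is applied and exposes the smn_address and smn_value files under /sys/kernel/debug/amd_nb as shown above. The function names and the DEBUGFS_NB override are hypothetical, and it needs to run as root.

```shell
#!/bin/sh
# Hypothetical wrappers around the two debugfs files added by the debugging patch.
# DEBUGFS_NB can be overridden; the default matches the path used in this thread.

# Select an SMN address, then print the value read back from it.
read_smn() {
    dir="${DEBUGFS_NB:-/sys/kernel/debug/amd_nb}"
    echo "$1" > "$dir/smn_address" || return 1
    cat "$dir/smn_value"
}

# Print "<UTC timestamp> <label> <value>" for register 0x59804.
log_smn() {
    value=$(read_smn 0x59804) || return 1
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$value"
}
```

For example, from a root shell with these functions loaded: `log_smn boot >> smn.log` right after bootup, `log_smn stuck >> smn.log` once the issue is reproduced, and `log_smn workaround >> smn.log` after undoing it.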
I applied your patch, but I'm not able to reproduce the issue at home. When the issue doesn't occur, I find the same smn_value as you. It might be related to the specific power adapter I use at work, or other devices that were plugged in. I will try to reproduce the issue tomorrow at work, and try to narrow it down to a specific device that's plugged in.
I was able to reproduce the issue consistently at work with the USB power adapter that was included with the laptop, with or without a display and USB devices plugged in. I was not able to reproduce the issue with a different USB power adapter at home. These are the smn values I found:
1) at bootup: 0x017f1201
2) after reproducing the issue: 0x017f1201
3) after doing the workaround (unplugging the power adapter): 0x017f1221
Are you sure you didn't mix up 2 & 3?
Yes, I tried it a few times and I'm pretty sure this is correct. I noticed that the value only changes to 0x017f1221 after unplugging the power adapter. But I'll try again just to make sure.
I just encountered the issue again, but this time I unplugged my power adapter while the device was suspended, which can also trigger this bug. When I look at the smn_value, I find 0x017f1201 again (the "normal" value). When I plug in the adapter, I can work around the issue and find smn_value 0x017f1221. The value will stay at 0x017f1221 until I do a suspend/wake cycle, which resets it back to 0x017f1201, regardless of whether the low clock speed issue occurred or not. Any kind of plugging or unplugging of the power adapter while the device is awake causes it to change to 0x017f1221. Plugging and unplugging while suspended does not seem to have any effect on this value, as it would always reset to 0x017f1201 when waking from suspend.
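For reference, the two values reported here differ in exactly one bit, which a quick XOR shows. The helper name below is made up for illustration, and nothing in this thread says what that bit means, so this only identifies which bit flips.

```shell
#!/bin/sh
# Hypothetical helper: print the bits that differ between two register values.
diff_bits() {
    printf '0x%08x\n' $(( $1 ^ $2 ))
}

# Values taken from the comments above.
diff_bits 0x017f1201 0x017f1221   # prints 0x00000020, i.e. only bit 5 differs
```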
Especially paired with the fact that different adapters don't trigger it I stand by this being an EC issue as the EC controls the throttling behavior. I suggest you guys raise with HP and point them at this issue.
> Especially paired with the fact that different adapters don't trigger it I
> stand by this being an EC issue as the EC controls the throttling behavior.
What does EC stand for? Might this (https://h30434.www3.hp.com/t5/Business-Notebooks/HP-Elitebook-865-G10-w-AMD-Ryzen-9-PRO-7940HS-cannot-sustain/m-p/9061799) be related? What's weird is that it only happens when I'm using the external monitors plugged into the dock, but I don't have any problem if I'm just using the dock's ethernet adapter or USB hub.
> I suggest you guys raise with HP and point them at this issue.
Easier said than done: they don't care about Linux via the official support channels. I'm sure there is someone who cares, because they distribute updates via LVFS and they even sold Linux laptops like the HP Dev One, but I have no idea how to reach whoever could be interested in fixing this.
(In reply to Mario Limonciello (AMD) from comment #65) > Especially paired with the fact that different adapters don't trigger it I > stand by this being an EC issue as the EC controls the throttling behavior. > > I suggest you guys raise with HP and point them at this issue. But why does it affect only Linux?
> What does EC stand for?
EC is "Embedded Controller". Here's the ACPI specification for how it is supposed to be interacted with: https://uefi.org/specs/ACPI/6.5/12_Embedded_Controller_Interface_Specification.html
It's a black box to anyone but the system manufacturer.
> Might this
> (https://h30434.www3.hp.com/t5/Business-Notebooks/HP-Elitebook-865-G10-w-AMD-Ryzen-9-PRO-7940HS-cannot-sustain/m-p/9061799)
> be related?
> What's weird is that it only happens when I'm using the external monitors
> plugged into the dock, but I don't have any problem if I'm just using the
> dock's ethernet adapter or USB hub.
Yes, it "could" be related. This is getting OT, but if you have enough ports on your laptop without a dock you could try to plug dongle(s) for monitor(s) and a regular power adapter and see if you can reproduce the same behavior.
> Easier said than done: they don't care about Linux via the official support
> channels.
:/
> But why does it affect only Linux?
When it comes to how sleep and wakeup work, Linux and Windows behave slightly differently. Windows has a concept of "dark screen wakeup" after any wakeup event and will move in and out of hardware sleep while in this state. On Linux, if a wakeup event is not enough to wake the system (such as the ACPI SCI but no other interrupt), the system goes back to sleep immediately. This difference in behavior has uncovered bugs on earlier hardware where the X86 cores race with the power management firmware for some of the same resources.
So my working theory has been that some timing margins for throttling are not being met when suspend/resume has occurred under Linux. That's why I was suggesting patches to try to keep the kernel alive longer when a power adapter event wakes the APU. But the behavior and timing of when to throttle are totally controlled by the EC.
So if there is a timing problem and forcing the X86 cores to be awake longer doesn't help, I'm not sure what else we can do without HP coming to the table to debug from their EC perspective.
> Yes, it "could" be related. This is getting OT, but if you have enough ports
> on your laptop without a dock you could try to plug dongle(s) for monitor(s)
> and a regular power adapter and see if you can reproduce the same behavior.
https://www.amazon.it/sicotool-Adattatore-DisplayPort-Thunderbolt-Compatibile/dp/B08B647L2X
Would something like this work on Phoenix (HP Elitebook 865 G10)? I'm pretty sure it requires DP Alt mode. Also, would it support DisplayPort MST? I would like to keep my setup the same to make the test more valid, and I'm currently using two Dell UltraSharp U2515H monitors attached via a single mini DP cable.
Yes that should work, but the resolution availability will depend upon how much bandwidth your connection series needs.
> > I suggest you guys raise with HP and point them at this issue.
>
> Easier said than done: they don't care about Linux via the official support
> channels.
> I'm sure there is someone who cares because they distribute updates via LVFS
> and they even sold Linux laptops like the HP Dev One but I have no idea how
> to reach whoever could be interested to fix this.
This also affects Rembrandt (845 G9), in case someone from HP makes it to this bug report.
(In reply to Prasun from comment #71) > This also affects Rembrandt (845 G9), in case someone from HP makes it to > this bug report. Sadly it looks like HP generally doesn't care about Linux and Linux support or compatibility for the G line of laptops has never been mentioned either. I'm marking it as INVALID because Maria has basically said it's a bug in the EC (code).
Mario, I meant Mario. Sorry :-)
The issue seems to be resolved for me when running kernel 6.10.5 and firmware (01.05.11 Rev.A), which can be installed from LVFS using fwupdmgr (or one of the wrapper GUIs like GNOME Software).
Apologies for the false alarm, it is not fixed. I was just unable to reproduce it when I wanted to.