Bug 218305 - Ryzen 7 7840HS gets stuck at 544MHz frequency after resuming after unplugging the power cord during sleep
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Platform_x86
Hardware: AMD Linux
Importance: P3 blocking
Assignee: drivers_platform_x86@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 218557
 
Reported: 2023-12-24 07:21 UTC by Artem S. Tashkinov
Modified: 2024-04-15 20:43 UTC
CC List: 9 users

See Also:
Kernel Version: 6.6.7-200.fc39.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The contents of /sys/devices/system/cpu/cpufreq/* (2.49 KB, application/zstd)
2023-12-26 15:25 UTC, Artem S. Tashkinov
possible patch (v1) (1.66 KB, application/mbox)
2024-04-01 13:19 UTC, Mario Limonciello (AMD)

Description Artem S. Tashkinov 2023-12-24 07:21:48 UTC
I'm almost sure it's a bug in the firmware but since I cannot make HP fix it, I'll try to report it here.

The CPU gets stuck at this extremely low frequency after N suspend/resume cycles, where N varies from 1 to at most 5.

The laptop is plugged in at all times.

This is happening with both acpi-cpufreq and amd-pstate-epp.

# cpupower frequency-info
analyzing CPU 10:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 10
  CPUs which need to have their frequency coordinated by software: 10
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 6.08 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 6.08 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 544 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 232. Maximum Frequency: 6.08 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.

Some CPU parameters look completely wrong after it happens:

# ryzenadj -i
|        Name         |   Value   |     Parameter      |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    30.000 | stapm-limit        |
| STAPM VALUE         |     4.181 |                    |
| PPT LIMIT FAST      |    30.000 | fast-limit         |
| PPT VALUE FAST      |     5.347 |                    |
| PPT LIMIT SLOW      |    20.000 | slow-limit         |
| PPT VALUE SLOW      |     3.747 |                    |
| StapmTimeConst      |       nan | stapm-time         |
| SlowPPTTimeConst    |       nan | slow-time          |
| PPT LIMIT APU       |       nan | apu-slow-limit     |
| PPT VALUE APU       |       nan |                    |
| TDC LIMIT VDD       |       nan | vrm-current        |
| TDC VALUE VDD       |       nan |                    |
| TDC LIMIT SOC       |       nan | vrmsoc-current     |
| TDC VALUE SOC       |       nan |                    |
| EDC LIMIT VDD       |       nan | vrmmax-current     |
| EDC VALUE VDD       |       nan |                    |
| EDC LIMIT SOC       |       nan | vrmsocmax-current  |
| EDC VALUE SOC       |       nan |                    |
| THM LIMIT CORE      |       nan | tctl-temp          |
| THM VALUE CORE      |       nan |                    |
| STT LIMIT APU       |       nan | apu-skin-temp      |
| STT VALUE APU       |       nan |                    |
| STT LIMIT dGPU      |       nan | dgpu-skin-temp     |
| STT VALUE dGPU      |       nan |                    |
| CCLK Boost SETPOINT |       nan | power-saving /     |
| CCLK BUSY VALUE     |       nan | max-performance    |
Comment 1 Artem S. Tashkinov 2023-12-24 07:30:38 UTC
I do use this command to constrain CPU thermals:

ryzenadj --tctl-temp=75 --stapm-limit=30000 --fast-limit=30000 --slow-limit=20000 

https://github.com/FlyGoat/RyzenAdj

Perhaps on resume the firmware sees altered limits and wreaks havoc on everything.

These last three parameters suddenly become read only after the bug occurs.
Comment 2 Armin Wolf 2023-12-24 15:03:36 UTC
Does the issue also happen if you don't use ryzenadj?
Comment 3 Artem S. Tashkinov 2023-12-24 15:12:47 UTC
(In reply to Armin Wolf from comment #2)
> Does the issue also happen if you don't use ryzenadj?

Yes, today a single suspend resume cycle has been enough to trigger this bug.

This is the result of this bug (no settings have been altered prior):

ryzenadj -i
CPU Family: Phoenix
SMU_SERVICE REQ_ID:0x3
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0xe, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU BIOS Interface Version: 14
Version: v0.14.0 
init_table
SMU_SERVICE REQ_ID:0x6
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x4c0008, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x66
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x9e300000, arg1:0x7, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0xfd, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
PM Table Version: 4c0008
SMU_SERVICE REQ_ID:0x65
SMU_SERVICE REQ: arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
SMU_SERVICE REP: REP: 0x1, arg0: 0x0, arg1:0x0, arg2:0x0, arg3:0x0, arg4: 0x0, arg5: 0x0
|        Name         |   Value   |     Parameter      |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    51.000 | stapm-limit        |
| STAPM VALUE         |     4.150 |                    |
| PPT LIMIT FAST      |    51.000 | fast-limit         |
| PPT VALUE FAST      |     6.040 |                    |
| PPT LIMIT SLOW      |    41.000 | slow-limit         |
| PPT VALUE SLOW      |     4.056 |                    |
| StapmTimeConst      |       nan | stapm-time         |
| SlowPPTTimeConst    |       nan | slow-time          |
| PPT LIMIT APU       |       nan | apu-slow-limit     |
| PPT VALUE APU       |       nan |                    |
| TDC LIMIT VDD       |       nan | vrm-current        |
| TDC VALUE VDD       |       nan |                    |
| TDC LIMIT SOC       |       nan | vrmsoc-current     |
| TDC VALUE SOC       |       nan |                    |
| EDC LIMIT VDD       |       nan | vrmmax-current     |
| EDC VALUE VDD       |       nan |                    |
| EDC LIMIT SOC       |       nan | vrmsocmax-current  |
| EDC VALUE SOC       |       nan |                    |
| THM LIMIT CORE      |       nan | tctl-temp          |
| THM VALUE CORE      |       nan |                    |
| STT LIMIT APU       |       nan | apu-skin-temp      |
| STT VALUE APU       |       nan |                    |
| STT LIMIT dGPU      |       nan | dgpu-skin-temp     |
| STT VALUE dGPU      |       nan |                    |
| CCLK Boost SETPOINT |       nan | power-saving /     |
| CCLK BUSY VALUE     |       nan | max-performance    |


STAPM, PPT FAST and PPT SLOW all have broken values.
Comment 4 Armin Wolf 2023-12-24 15:37:25 UTC
Could be that the firmware fails to properly restore those values after suspend, does the issue also happen under Windows?
Comment 5 Artem S. Tashkinov 2023-12-24 15:38:35 UTC
(In reply to Armin Wolf from comment #4)
> Could be that the firmware fails to properly restore those values after
> suspend, does the issue also happen under Windows?

I rarely boot into Windows but I may check.
Comment 6 Artem S. Tashkinov 2023-12-25 10:46:21 UTC
I've not been able to reproduce this issue under Windows but then I didn't try hard enough (which means multiple attempts spanning several days).
Comment 7 Armin Wolf 2023-12-25 10:53:41 UTC
Have you verified that you are using the latest BIOS for your machine?
Comment 8 Artem S. Tashkinov 2023-12-25 14:42:34 UTC
The issue is reproducible with the latest BIOS release (V82: 01.03.09 Rev.A, released on Dec 15, 2023) and the two versions before it. HP doesn't allow downloading or flashing earlier versions.
Comment 9 Shyam Sundar S K (AMD) 2023-12-26 04:21:52 UTC
Since you are pointing to the STAPM and PPT limits, can you blacklist the amd_pmf driver and see if that helps after the suspend/resume cycle?
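
When testing with the driver blacklisted (for example through a `blacklist amd_pmf` line in a file under `/etc/modprobe.d/`), it helps to confirm that the module really is absent before suspending. A minimal sketch, assuming the usual `/proc/modules` format; the helper name `module_loaded` is made up for illustration:

```python
def module_loaded(name, modules_file="/proc/modules"):
    """Return True if a kernel module is listed in the given modules file."""
    # The kernel lists module names with underscores, so normalise dashes.
    wanted = name.replace("-", "_")
    with open(modules_file) as f:
        return any(line.split()[0] == wanted for line in f if line.strip())
```

Calling `module_loaded("amd-pmf")` should return `False` once the blacklist is in effect.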
Comment 10 Artem S. Tashkinov 2023-12-26 15:18:22 UTC
This is reproducible without the amd-pmf module:

[root@hp policy0]# lsmod | grep pmf
[root@hp policy0]# cpupower frequency-info
analyzing CPU 13:
  driver: amd-pstate-epp
  CPUs which run at the same hardware frequency: 13
  CPUs which need to have their frequency coordinated by software: 13
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 5.76 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 5.76 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 542 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes
    AMD PSTATE Highest Performance: 220. Maximum Frequency: 5.76 GHz.
    AMD PSTATE Nominal Performance: 145. Nominal Frequency: 3.80 GHz.
    AMD PSTATE Lowest Non-linear Performance: 42. Lowest Non-linear Frequency: 1.10 GHz.
    AMD PSTATE Lowest Performance: 16. Lowest Frequency: 400 MHz.
Comment 11 Artem S. Tashkinov 2023-12-26 15:25:27 UTC
Created attachment 305659 [details]
The contents of /sys/devices/system/cpu/cpufreq/*

Switching between power modes using /sys/devices/system/cpu/cpufreq/*/energy_performance_preference does nothing.

The exact per CPU frequency stats:
# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
400000
544395
400000
544099
400000
400000
400000
400000
400000
544189
400000
544181
400000
400000
544007
542947

No idea where 544MHz comes from.
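
For anyone who wants to check for the stuck state repeatedly, the per-policy readings can be collected with a short script. This is only a sketch, assuming the standard cpufreq sysfs layout used above; the function names and the 600 MHz threshold are illustrative choices, not kernel-defined values:

```python
import glob
import os
import re


def read_policy_freqs(root="/sys/devices/system/cpu/cpufreq"):
    """Return {policy_number: scaling_cur_freq in kHz} for each policy under root."""
    freqs = {}
    for path in glob.glob(os.path.join(root, "policy*", "scaling_cur_freq")):
        name = os.path.basename(os.path.dirname(path))
        match = re.fullmatch(r"policy(\d+)", name)
        if match is None:
            continue
        with open(path) as f:
            freqs[int(match.group(1))] = int(f.read().strip())
    return freqs


def all_stuck(freqs, threshold_khz=600_000):
    """True when every policy reports a frequency at or below the threshold."""
    return bool(freqs) and all(khz <= threshold_khz for khz in freqs.values())
```

Idle cores legitimately report 400000, so the check is only meaningful while the CPU is under load.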

BTW here's another bug, either firmware or something in the kernel reports wrong max frequency:

# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
5137000
6080000
6080000
5764000
5764000
5924000
5924000
5137000
6080000
6080000
5608000
5608000
5293000
5293000
5449000
5449000

I'm not aware of any Zen 4 CPUs that can run at 6080000 kHz (6.08 GHz) by default, let alone mobile parts.
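
One possible reading of the 6.08 GHz figure, offered as an assumption rather than a confirmed explanation: the `cpupower` output in the description lists a nominal performance of 145 at 3.80 GHz and a highest performance of 232. If the maximum frequency is derived by scaling the nominal frequency linearly with that performance ratio, the reported numbers follow from simple arithmetic:

```python
def scaled_max_mhz(nominal_mhz, nominal_perf, highest_perf):
    """Scale the nominal frequency linearly by highest_perf / nominal_perf."""
    return nominal_mhz * highest_perf / nominal_perf


# Values taken from the cpupower output quoted in this report.
print(scaled_max_mhz(3800, 145, 232))  # CPU 10, highest perf 232 -> 6080.0
print(scaled_max_mhz(3800, 145, 220))  # CPU 13, highest perf 220
```

The second call gives roughly 5765 MHz, close to the 5.76 GHz reported for CPU 13. Under this assumption the differing `scaling_max_freq` values would reflect per-core performance ratings rather than clocks the cores can actually reach.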
Comment 12 Artem S. Tashkinov 2023-12-26 15:29:47 UTC
> BTW here's another bug, either firmware or something in the kernel reports
> wrong max frequency:

Max frequency is reported correctly only for two out of sixteen logical cores. It's wrong for all other cores. Would be great if AMD fixed this.

Speaking of my firmware, it's:

https://support.hp.com/us-en/drivers/hp-elitebook-845-14-inch-g10-notebook-pc/2101628462

> Description:
> 
> This package is used to update the supported firmware on HP Business Notebook
> systems with a V82 family BIOS. This package is provided for supported
> computer systems that are running a supported operating system.
> 
> Fix and enhancements:
> 
> - Fixes an issue where the Performance page in AMD Software: Adrenalin
> Edition does not display correctly.
> - Adds the Gaming Optimized mode to video memory size.
> 
> - Includes the following firmware:
> AMD Graphics Output Protocol (GOP) Firmware, version 3.7.10
> AMD PSP Firmware, version 0.2D.6.6C
> AMD SMU Firmware, version 0.76.65.0
> Embedded Controller (EC) Firmware, version 60.28.00
> Intel/Realtek UEFI PXE ROM, version 2.041
> TI Power Delivery (PD) Firmware, version 4.1.0
Comment 13 Dan Martins 2024-01-03 00:03:22 UTC
I am seeing similar behaviour, to the extent that my CPU cores get capped at some low frequency. Sometimes it is a few cores stuck at ~1600MHz, and sometimes it is all cores stuck at 544MHz. It typically happens for me when rebooting. I tried suspend/resume several times but could not reproduce that way.

CPU is an AMD Ryzen 5 7640U on a Framework 13 laptop. 6.6.8 kernel on Fedora 39.

We may not be having the same issue, but I wanted to mention, I can get all cores back to normal by switching the scaling_governor from powersave to performance and back in case it helps in your case. I am using "sudo cpupower frequency-set -g <GOV>" to switch it.
Comment 14 Mario Limonciello (AMD) 2024-01-17 04:10:54 UTC
I read through this thread and I currently think that Artem and Dan have encountered two separate bugs.

@Artem:

Under the presumption that ryzenadj is actually retrieving the correct values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a specific power adapter, or sequence of events.  Like suspend on power, resume on battery or suspend on battery resume on power.

If there is a linkage between any of those, then I think this is "most likely" an HP EC bug.

@Dan,

Can you reproduce this if you manually always set the scaling governor on all CPUs to "performance" before you reboot?
Comment 15 Artem S. Tashkinov 2024-01-17 19:33:34 UTC
(In reply to Mario Limonciello (AMD) from comment #14)
> Under the presumption that ryzenadj is actually retrieving the correct
> values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a
> specific power adapter, or sequence of events.  Like suspend on power,
> resume on battery or suspend on battery resume on power.

My laptop is plugged in 100% of the time.

> 
> If there is a linkage between any of those, then I think this is "most
> likely" an HP EC bug.

I've given up on reporting bugs to HP. It's a complicated process which takes forever. I bought this laptop and its maximum CPU frequency was limited to 4.5GHz which took HP over four months to resolve and that was at least reproducible under Linux and Windows.

This bug seems to affect only Linux or maybe I've not used Windows enough to face it in the Microsoft OS.
Comment 16 Muzhi Yu 2024-01-19 07:31:13 UTC
I can reproduce Artem's issue on EliteBook 845 G10 (kernel 6.7.0 on NixOS). Also Dan's workaround works for me 80% of the time, with only a few times when I had to reboot to lift the cpufreq lock.

My max_freqs are also strange, regardless of whether cpufreq is capped to 544MHz or not.

```
❯ cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
5137000
5137000
6080000
6080000
5449000
5449000
5293000
5293000
5924000
5924000
6080000
6080000
5764000
5764000
5608000
5608000
```
Comment 17 Dan Martins 2024-01-19 12:45:48 UTC
(In reply to Muzhi Yu from comment #16)
> I can reproduce Artem's issue on EliteBook 845 G10 (kernel 6.7.0 on NixOS).
> Also Dan's workaround works for me 80% of the time, with only a few times
> when I had to reboot to lift the cpufreq lock.
> 

I have since found that I don't need to switch to the performance governor at all. It is enough, in my case, to "reset" the scaling governor to powersave. Just "sudo cpupower frequency-set -g powersave".
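
The same "reset" can be approximated through sysfs directly. A rough sketch, assuming root privileges and assuming that rewriting `scaling_governor` has the same effect as the `cpupower` call (not confirmed in this thread); `rewrite_governor` is a hypothetical helper name:

```python
import glob
import os


def rewrite_governor(governor="powersave", root="/sys/devices/system/cpu/cpufreq"):
    """Write `governor` to every policy's scaling_governor; return the policies touched."""
    written = []
    pattern = os.path.join(root, "policy*", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # Rewriting the current value is intentional: the workaround relies on
        # the write itself, not on the governor actually changing.
        with open(path, "w") as f:
            f.write(governor)
        written.append(os.path.basename(os.path.dirname(path)))
    return written
```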
Comment 18 Dan Martins 2024-01-19 13:52:48 UTC
(In reply to Mario Limonciello (AMD) from comment #14)
> I read through this thread and I currently think that Artem and Dan have
> encountered two separate bugs.
> 
> @Artem:
> 
> Under the presumption that ryzenadj is actually retrieving the correct
> values for STAPM, PPT FAST, and PPT SLOW I want to ask if this is tied to a
> specific power adapter, or sequence of events.  Like suspend on power,
> resume on battery or suspend on battery resume on power.
> 
> If there is a linkage between any of those, then I think this is "most
> likely" an HP EC bug.
> 
> @Dan,
> 
> Can you reproduce this if you manually always set the scaling governor on
> all CPUs to "performance" before you reboot?

Mario,
I just tested setting the governor to performance before reboot and yes, it is reproducible in that case too.
1. load the CPU and observe all cores can reach ~4GHz
2. set governor: sudo cpupower frequency-set -g performance
3. reboot
4. load the CPU and check frequencies: on first reboot, all cores hit 4GHz range. On second reboot, cores 6-11 can only reach ~1.7GHz.

This is in line with previous tests. It is inconsistent, and various power settings don't seem to affect it (epp, platform_profile, scaling_governor). It does seem much more likely to occur when on battery, but will still happen sometimes when plugged in.

A couple of more recent observations:
- I don't need to toggle from performance to powersave to fix it. I can just "sudo cpupower frequency-set -g powersave" even when it is already reporting that it is using the powersave governor.
- on reboot, the scaling_governor is always showing powersave, even when I set it to performance before reboot.
- Using kernel 6.6.11 as of this morning for the above test

Thanks,
Dan
Comment 19 Mario Limonciello (AMD) 2024-01-19 15:25:00 UTC
Can you please dump the values from all of these MSRs from userspace while in a reproduced state?

#define MSR_AMD_CPPC_CAP1		0xc00102b0
#define MSR_AMD_CPPC_ENABLE		0xc00102b1
#define MSR_AMD_CPPC_CAP2		0xc00102b2
#define MSR_AMD_CPPC_REQ		0xc00102b3
#define MSR_AMD_CPPC_STATUS		0xc00102b4
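
One way to dump these registers from userspace is through the `msr` character devices (the `rdmsr` tool from msr-tools is an alternative). The sketch below assumes the `msr` kernel module is loaded and the script runs as root; the helper names are illustrative:

```python
import os
import struct

# Register addresses copied from the #define list above.
CPPC_MSRS = {
    "MSR_AMD_CPPC_CAP1": 0xC00102B0,
    "MSR_AMD_CPPC_ENABLE": 0xC00102B1,
    "MSR_AMD_CPPC_CAP2": 0xC00102B2,
    "MSR_AMD_CPPC_REQ": 0xC00102B3,
    "MSR_AMD_CPPC_STATUS": 0xC00102B4,
}


def read_msr(cpu, reg, dev_root="/dev/cpu"):
    """Read one 64-bit MSR through the msr device node of the given CPU."""
    fd = os.open(os.path.join(dev_root, str(cpu), "msr"), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, reg)  # the file offset selects the register
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def dump_cppc_msrs(cpus, dev_root="/dev/cpu"):
    """Return {msr_name: [value for each CPU in cpus]} for the CPPC registers."""
    return {
        name: [read_msr(cpu, reg, dev_root) for cpu in cpus]
        for name, reg in CPPC_MSRS.items()
    }
```

For a 16-thread part, `dump_cppc_msrs(range(16))` would return one list of 16 values per register.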
Comment 20 Mario Limonciello (AMD) 2024-01-20 00:21:42 UTC
Can you guys please test this and see if it improves the situation at all?

https://lore.kernel.org/linux-pm/20240119113319.54158-1-mario.limonciello@amd.com/T/#u

Thanks!
Comment 21 Dan Martins 2024-01-21 18:05:34 UTC
(In reply to Mario Limonciello (AMD) from comment #19)
> Can you please dump the values from all of these MSRs from userspace while
> in a reproduced state?
> 
> #define MSR_AMD_CPPC_CAP1             0xc00102b0
> #define MSR_AMD_CPPC_ENABLE           0xc00102b1
> #define MSR_AMD_CPPC_CAP2             0xc00102b2
> #define MSR_AMD_CPPC_REQ              0xc00102b3
> #define MSR_AMD_CPPC_STATUS           0xc00102b4

Hi Mario,

Thank you for looking into this. I'll try your kernel patch when I have a bit more time. For now, here are the MSRs:

Good state (from a boot when no cores were limited):
=========================
=========================
MSR_AMD_CPPC_CAP1  
d08a2c10  
d08a2c10d08a2c10  
dc8a2c10  
dc8a2c10dc8a2c10  
ca8a2c10  
ca8a2c10ca8a2c10  
dc8a2c10  
dc8a2c10dc8a2c10  
c48a2c10  
c48a2c10c48a2c10  
d68a2c10  
d68a2c10d68a2c10  
=========================  
MSR_AMD_CPPC_ENABLE  
1  
1  
1  
1  
1  
1  
1  
1  
1  
1  
1  
1  
=========================  
MSR_AMD_CPPC_CAP2  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
=========================  
MSR_AMD_CPPC_REQ  
10d0  
10d0  
10dc  
10dc  
10ca  
10ca  
10dc  
10dc  
10c4  
10c4  
10d6  
f0f  
=========================  
MSR_AMD_CPPC_STATUS  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
0  
=========================

In reproduced state, where all cores are stuck at ~544MHz, MSR_AMD_CPPC_REQ values appear to have wrapped around?
========================================
========================================
MSR_AMD_CPPC_CAP1
d08a2c10
d08a2c10d08a2c10
dc8a2c10
dc8a2c10dc8a2c10
ca8a2c10
ca8a2c10ca8a2c10
dc8a2c10
dc8a2c10dc8a2c10
c48a2c10
c48a2c10c48a2c10
d68a2c10
d68a2c10d68a2c10
=========================
MSR_AMD_CPPC_ENABLE
1
1
1
1
1
1
1
1
1
1
1
1
=========================
MSR_AMD_CPPC_CAP2
0
0
0
0
0
0
0
0
0
0
0
0
=========================
MSR_AMD_CPPC_REQ
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
ff000f0f
=========================
MSR_AMD_CPPC_STATUS
0
0
0
0
0
0
0
0
0
0
0
0
=========================
=========================



And, when I (re)set the scaling governor, the MSR_AMD_CPPC_REQ change slightly. Here is side-by-side. reproduced state on left, and after re-setting the governor on right.
=========================                                       =========================
MSR_AMD_CPPC_REQ                                                MSR_AMD_CPPC_REQ
ff000f0f                                                      | ff0010d0
ff000f0f                                                      | ff0010d0
ff000f0f                                                      | ff0010dc
ff000f0f                                                      | ff0010dc
ff000f0f                                                      | ff0010ca
ff000f0f                                                      | ff0010ca
ff000f0f                                                      | ff0010dc
ff000f0f                                                      | ff0010dc
ff000f0f                                                      | ff0010c4
ff000f0f                                                      | ff0010c4
ff000f0f                                                      | ff0010d6
ff000f0f                                                      | ff0010d6

Thanks,
Dan
Comment 22 Dan Martins 2024-01-22 01:48:15 UTC
(In reply to Mario Limonciello (AMD) from comment #20)
> Can you guys please test this and see if it improves the situation at all?
> 
> https://lore.kernel.org/linux-pm/20240119113319.54158-1-mario.limonciello@amd.com/T/#u
> 
> Thanks!

Hi again Mario,

I tested this patch against Fedora's 6.6.13 kernel and so far, after 6 reboots have not been able to reproduce. When I switch back to the stock kernel, I can typically reproduce the issue in 1-2 reboots so the patch seems to have helped so far. I'll keep using the patched kernel for now and let you know if this issue occurs again.

Thanks,
Dan
Comment 23 Dan Martins 2024-01-22 01:53:23 UTC
(In reply to Dan Martins from comment #21)
> (In reply to Mario Limonciello (AMD) from comment #19)
> > Can you please dump the values from all of these MSRs from userspace while
> > in a reproduced state?

> In reproduced state, where all cores are stuck at ~544MHz, MSR_AMD_CPPC_REQ
> values appear to have wrapped around?

Ignore comment about values wrapping around, it appears the upper byte is set when I adjust PPD from performance (0x00) to balanced (0x80) and powersave (0xFF). I must have adjusted this between reboots.
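
The raw values are easier to compare once split into fields. The sketch below assumes MSR_AMD_CPPC_REQ packs four byte-wide fields, from low byte to high byte: maximum, minimum and desired performance, then the energy performance preference. That layout is an assumption here, though it is consistent with the upper byte following the power profile as described above:

```python
def decode_cppc_req(value):
    """Split an MSR_AMD_CPPC_REQ value into four byte-wide fields (assumed layout)."""
    return {
        "max_perf": value & 0xFF,
        "min_perf": (value >> 8) & 0xFF,
        "des_perf": (value >> 16) & 0xFF,
        "epp": (value >> 24) & 0xFF,
    }
```

Under that layout, `ff000f0f` from the reproduced state decodes to a maximum and minimum performance of 0x0F with a preference of 0xFF, while `ff0010d0` after re-setting the governor has a maximum of 0xD0 and a minimum of 0x10.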
Comment 24 Mario Limonciello (AMD) 2024-01-22 03:22:15 UTC
That's great news. Everyone who feels comfortable sharing your email address feel free to reply to the post with a "Tested-by" tag.
Comment 25 Muzhi Yu 2024-01-22 17:50:56 UTC
Hi guys,

Just adding my data point here. I've applied the patch and haven't seen this bug for the evening after ~5 cycles.

BTW, are the MSR values still relevant, because I'm seeing no difference between normal and bad states?

```
c4912a10c4912a10
1
0
ff0010c4
0
```

Thanks!
Comment 26 Mario Limonciello (AMD) 2024-01-22 17:54:17 UTC
No need to share MSR values anymore.  I believe this is the correct solution.
If there are still problems with it they may be a secondary problem.

The MSR values are a little difficult to properly capture because each CPU has its own register value.  So a proper test would need to capture all of them for all CPUs (not all may have this problem occurring).
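
Since every CPU has its own copy of each register, comparing a good and a bad capture is easiest per register and per CPU. A small sketch, assuming each capture is stored as a dictionary mapping a register name to a list with one value per CPU (a hypothetical format, not something any tool in this thread produces):

```python
def diff_snapshots(before, after):
    """Compare two {msr_name: [value per CPU]} snapshots.

    Returns a list of (msr_name, cpu_index, before_value, after_value)
    tuples, one for every entry that differs.
    """
    changes = []
    for name in sorted(before.keys() & after.keys()):
        for cpu, (old, new) in enumerate(zip(before[name], after[name])):
            if old != new:
                changes.append((name, cpu, old, new))
    return changes
```

An empty result means the two captures agree on every register they both contain.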
Comment 27 Mario Limonciello (AMD) 2024-03-05 19:03:11 UTC
So I think there are actually two issues in this bug.  
* The first one was the one that Artem reported which looks like a problem with the EC communicating some limits to the APU.  This is Artem's issue.
* The second one is that there was a bug in amd-pstate that could cause CPPC requests to have the wrong values.  This is (nearly) everyone else's issue in this bug.

The second issue is fixed by https://github.com/torvalds/linux/commit/22fb4f041999f5f16ecbda15a2859b4ef4cbf47e

For the first issue, Artem can you update to 6.8-rc7, make sure you've added the TEE firmware for the amd-pmf driver from linux-firmware and see if you can still reproduce it?
Comment 28 Artem S. Tashkinov 2024-03-06 09:59:39 UTC
> For the first issue, Artem can you update to 6.8-rc7, make sure you've added
> the TEE firmware for the amd-pmf driver from linux-firmware and see if you
> can still reproduce it?

I've just added the firmware file, "773bd96f-b83f-4d52-b12dc529b13d8543.bin" (what a weird name) and I will test 6.8 as soon as it gets released. It's coming pretty soon.

Thanks.
Comment 29 Mario Limonciello (AMD) 2024-03-06 16:20:25 UTC
> I've just added the firmware file, "773bd96f-b83f-4d52-b12dc529b13d8543.bin"
> (what a weird name) and I will test 6.8 as soon as it gets released. It's
> coming pretty soon.

OK.  Separately from that I'd like to understand what you were getting at with your ryzenadj comment. 

I don't know if ryzenadj accesses all those coefficients correctly; but we *do* export them properly under amd-pmf debugfs.

There is a debugfs file called "current_power_limits".  Can you read it before suspend as well as after a suspend that reproduced the failure?
Comment 30 Mario Limonciello (AMD) 2024-03-06 16:31:33 UTC
Sorry two more things.

First - there are two sets of coefficients (one for AC and one for DC).  You can see in current_power_limits_show() that it will return the table matching your power mode.

Please capture like this:
1) Start on DC, capture the file.
2) Switch to AC, capture the file.
3) Suspend the machine
4) Unplug adapter
5) Resume
6) Capture the file (while you're on DC)
7) Switch to AC, capture the file.

This will let us confirm whether or not there is a problem with the table.
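
To compare the captures without reading them by eye, each one can be parsed into name/value pairs. This is a sketch only; it assumes the file holds a single line of `name:value` pairs such as `spl:51000 fppt:51000 ...`, with an optional space after the colon, and the helper names are made up:

```python
import re


def parse_power_limits(line):
    """Parse one line of current_power_limits output into {name: int}."""
    return {key: int(val) for key, val in re.findall(r"(\S+?):\s*(\d+)", line)}


def changed_limits(before, after):
    """Return {name: (before_value, after_value)} for limits that differ."""
    return {
        key: (before[key], after.get(key))
        for key in before
        if before[key] != after.get(key)
    }
```

An empty dictionary from `changed_limits` means the table was identical in both captures.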

Second - after the issue has occurred, does changing the acpi platform profile from sysfs or powerprofilesctl recover it?
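
For the sysfs route, the platform profile is exposed as `platform_profile` and `platform_profile_choices` under `/sys/firmware/acpi`. A minimal sketch for switching it, assuming root privileges; the function names are illustrative, and `powerprofilesctl set <profile>` is the equivalent when power-profiles-daemon is running:

```python
import os


def read_profile_choices(root="/sys/firmware/acpi"):
    """Return the list of profile names the platform offers."""
    with open(os.path.join(root, "platform_profile_choices")) as f:
        return f.read().split()


def set_platform_profile(profile, root="/sys/firmware/acpi"):
    """Write a profile name to platform_profile after checking it is offered."""
    choices = read_profile_choices(root)
    if profile not in choices:
        raise ValueError(f"{profile!r} is not one of {choices}")
    with open(os.path.join(root, "platform_profile"), "w") as f:
        f.write(profile)
```

Switching to another profile and back after the failure would show whether a profile change alone recovers the frequency.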
Comment 31 Artem S. Tashkinov 2024-03-06 17:38:03 UTC
1)

current_power_limits 
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41001 stt_min:25000 stt[APU]:0 stt[HS2]: 0

2)

cat current_power_limits 
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0


3-4-5) done

6) cat current_power_limits 
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0


7) cat current_power_limits 
spl:51000 fppt:51000 sppt:41000 sppt_apu_only:41000 stt_min:25000 stt[APU]:0 stt[HS2]: 0

While we have been discussing this, I've just found out that when this bug occurs, all I need to do is to unplug and that fixes everything.

It's actually such a simple workaround, I will leave it up to you whether anything needs to be done to address it.
Comment 32 Mario Limonciello (AMD) 2024-03-06 17:46:24 UTC
> cat current_power_limits

It looks like those don't change.

> While we have been discussing this, I've just found out that when this bug
> occurs, all I need to do is to unplug and that fixes everything.

Presumably you mean unplug OR replug (IE opposite of what you did in suspend) right?

> It's actually such a simple workaround, I will leave it up to you whether
> anything needs to be done to address it.

It's good you have that workaround.  I'd like to know if powerprofilesctl/acpi platform profile can also recover it.

If so, we might want to add explicit code in the suspend/resume callbacks to rewrite the state if the power adapter changed over suspend.  I think this would be a safe solution for everyone.
Comment 33 Artem S. Tashkinov 2024-03-06 17:50:48 UTC
> Presumably you mean unplug OR replug (IE opposite of what you did in suspend)
> right?

After unplugging it's already fixed. I do of course replug not to waste battery power.

> If so; we might want to add an explicit code in the suspend/resume callbacks
> to rewrite the state if power adapter changed over suspend.  I think this
> would be a safe solution for everyone.

If only it doesn't break other systems. That sounds a tad scary to me. So far I seem to have been the only affected person (not that many people seem to be using HP business laptops with Linux).
Comment 34 Mario Limonciello (AMD) 2024-03-06 17:55:33 UTC
> If only it doesn't break other systems. That sounds a tad scary to me. So far
> I seem to have been the only affected person (not that many people seem to be
> using HP business laptops with Linux).

The code would basically look like this:
* Capture state of power adapter at suspend callback into a private variable
* If state of power adapter has changed during resume then rewrite all CPU coefficients.
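
The two steps can be modelled as follows. This is a plain Python illustration of the logic only, not kernel code; the class name and both callables are hypothetical stand-ins for whatever the driver's suspend and resume callbacks would use:

```python
class AdapterWatch:
    """Remember the AC adapter state across suspend and react if it changed."""

    def __init__(self, read_online, rewrite_limits):
        self._read_online = read_online        # callable: True when on AC power
        self._rewrite_limits = rewrite_limits  # callable: reapplies the limits
        self._online_at_suspend = None

    def on_suspend(self):
        """Record the adapter state just before the system sleeps."""
        self._online_at_suspend = self._read_online()

    def on_resume(self):
        """Reapply limits only if the adapter state differs from suspend time."""
        changed = self._read_online() != self._online_at_suspend
        if changed:
            self._rewrite_limits()
        return changed
```

When the adapter state is the same before and after suspend, nothing is rewritten, which is why the change should leave unaffected systems alone.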

It should be safe for everyone.  But I need to know that it actually helps your problem.
Comment 35 Armin Wolf 2024-03-06 18:03:25 UTC
Shouldn't the driver generally restore all CPU coefficients after suspend/resume? Or is there a specification saying that the CPU coefficients will be restored by the platform firmware?
Comment 36 Mario Limonciello (AMD) 2024-03-06 18:06:15 UTC
AMD-PMF can be used differently by different OEMs and models depending upon their needs and desires.

Some will control entirely by their EC.  Some will rely on PMF to do more functionality.
Comment 37 Armin Wolf 2024-03-06 18:08:04 UTC
Could it be that the Windows equivalent of the amd-pmf driver does restore all/some coefficients after suspend/resume?
Comment 38 Mario Limonciello (AMD) 2024-03-06 18:09:51 UTC
The Windows equivalent of the amd-pmf driver on this HP system uses the features in kernel 6.8 that I've been asking Artem to test.

Once I know whether the issue happens on kernel 6.8 and whether changing the profile manually helps it we can decide on whether to do anything.
Comment 39 Armin Wolf 2024-03-06 18:10:35 UTC
Ok
Comment 40 Peter Ries 2024-04-01 08:00:47 UTC
Hello everyone. I just read through this bug report as I have the same problem on my HP EliteBook 845 G10 with a 7840U CPU. It randomly gets stuck at 544MHz.

I'm running Endeavour OS (Arch based) with the latest Kernel (6.8.2-arch2-1) and Firmware (core/linux-firmware 20240312.3b128b60-1).

I thought things would be fixed now, but I just had the hanging CPU freq again.

As the bug is not closed and last comment is nearly 4 weeks old I just wanted to know if the fix is not official yet...

Thanks for an update and to all for investigating here :)

(Meanwhile I'll try the "sudo cpupower frequency-set -g powersave" workaround to see if it helps to circumvent an annoying reboot)
Comment 41 Mario Limonciello (AMD) 2024-04-01 13:19:58 UTC
Created attachment 306075 [details]
possible patch (v1)

> It randomly gets stuck at 544MHz.

Are you sure it's random?  From the above discussions I believe it is triggered specifically from an event sent by the EC when changing the power adapter while suspended.

> with the latest Kernel (6.8.2-arch2-1) 

Thanks.  I've been waiting for feedback with kernel 6.8.  And you have CONFIG_AMD_PMF set?

> I just wanted to know if the fix is not official yet...

There is no fix or workaround right now. Like I said above, this "looks" like a bug caused by HP's EC or BIOS.

Assuming you tested with amd-pmf in place and it really is the same root cause described above (only by power adapter) I was thinking about it and this sounds like it could be a race condition. I do have an idea for a workaround.  Can you see if this patch helps?
Comment 42 Artem S. Tashkinov 2024-04-01 13:37:49 UTC
>  From the above discussions I believe it is triggered specifically from an
>  event sent by the EC when changing the power adapter while suspended.

Yep, and like I said in my case unplugging/plugging the power cord is enough to fix it which was a relief for me.
Comment 43 Peter Ries 2024-04-01 14:04:48 UTC
I have 

$ zcat /proc/config.gz | grep CONFIG_AMD_PMF
CONFIG_AMD_PMF=m
# CONFIG_AMD_PMF_DEBUG is not set

Unfortunately I could not reproduce the effect during the test I did right now. Subjective impression is that the bug occurs less often since 6.8.x kernel.

Testing procedure was:

Plugged - suspend - unplug - resume - OK
Unplugged - suspend - plug back in - resume - OK
Starting plugged - suspend - resume - repeated 6 times while plugged - OK
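
The repeated plugged-in cycles could be automated along these lines. This is a sketch under assumptions: it uses `rtcwake -m mem -s <seconds>` to suspend and wake on an RTC alarm, needs root, and leaves any plugging or unplugging to be done by hand; the function names are illustrative:

```python
import subprocess


def rtcwake_cmd(seconds=20):
    """Build the command that suspends and wakes via the RTC alarm after `seconds`."""
    return ["rtcwake", "-m", "mem", "-s", str(seconds)]


def suspend_with_rtcwake(seconds=20):
    """Suspend the machine and return once it has resumed."""
    subprocess.run(rtcwake_cmd(seconds), check=True)


def run_cycles(count, suspend, is_stuck):
    """Run up to `count` suspend/resume cycles.

    Returns the 1-based number of the first cycle after which is_stuck()
    reported True, or None if no cycle triggered it.
    """
    for cycle in range(1, count + 1):
        suspend()
        if is_stuck():
            return cycle
    return None
```

Passing `suspend_with_rtcwake` as `suspend` and a frequency check as `is_stuck` gives a loop that stops at the first cycle that leaves the CPU capped.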

Resume was done via "systemctl suspend" command on my hotkey "Strg-Super-End"

Then I tried it 3 times by using the lid - same here - it works for the moment. So it seems to be more random than for Artem. I'll check if the un-/plug procedure helps as a quick fix to not have to reboot when the CPU gets stuck again.

I somehow need to find a reliable procedure to run into this bug before it makes sense to test the patch.
Comment 44 Peter Ries 2024-04-01 14:14:28 UTC
(typo: -Resume-) Suspend was done via the "systemctl suspend" command on my hotkey "Ctrl-Super-End"
Comment 45 Mario Limonciello (AMD) 2024-04-01 14:23:48 UTC
If 6.8 is more reliable you can also try to apply the patch to 6.7 or an earlier kernel that could more easily trigger it.
Comment 46 Mario Limonciello (AMD) 2024-04-05 02:20:34 UTC
Any testing results for that patch idea?
Comment 47 Peter Ries 2024-04-05 08:09:01 UTC
Hi Mario, sorry for not responding, I still haven't been able to reproduce the bug. Just had it once after Kernel 6.8.x. 

I will test once I have a reproducible scenario.
Comment 48 Artem S. Tashkinov 2024-04-05 11:15:29 UTC
(In reply to Peter Ries from comment #47)
> Hi Mario, sorry for not responding, I still haven't been able to reproduce
> the bug. Just had it once after Kernel 6.8.x. 
> 
> I will test once I have a reproducible scenario.

Please try what triggers it for me 100%:

1. While the laptop is plugged in/connected to the mains, put it to sleep.
2. Unplug it for a little while - 20 seconds is enough I guess.
3. Plug it back in.
4. Wake it up.
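
To check the outcome after step 4, each core's scaling_cur_freq can be read from sysfs and the highest value compared against a threshold. This is only a sketch: the function names `max_cur_freq_khz` and `looks_stuck`, the optional root-directory argument, and the 600000 kHz default threshold are assumptions for illustration (the threshold simply sits a little above the 544 MHz reported here).

```shell
#!/bin/sh
# Print the highest scaling_cur_freq (kHz) across all CPUs, or 0 if none
# could be read.
# $1: optional cpu root directory (hypothetical parameter for testing);
#     defaults to the standard sysfs location.
max_cur_freq_khz() (
    root=${1:-/sys/devices/system/cpu}
    max=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        read -r khz < "$f" || continue
        if [ "$khz" -gt "$max" ]; then
            max=$khz
        fi
    done
    echo "$max"
)

# Succeed (exit 0) when at least one core was read and no core reports
# more than the threshold.
# $1: threshold in kHz (assumed default 600000), $2: optional cpu root.
looks_stuck() (
    threshold=${1:-600000}
    max=$(max_cur_freq_khz "$2")
    [ "$max" -gt 0 ] && [ "$max" -le "$threshold" ]
)
```

For example, `looks_stuck && echo "CPU appears stuck"` could be run while something CPU-heavy is running. An idle machine can legitimately sit at low clocks, so a low reading without load proves nothing.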
Comment 49 Peter Ries 2024-04-06 07:48:22 UTC
Hi Artem, this unfortunately works for me - meaning the CPU frequency does NOT get stuck if I do it like this. 

- I put laptop to sleep
- unplugged
- waited 1 minute (without resuming on battery)
- plugged back in
- resume
-> CPU scales up and down as expected 

I just wonder what happened AFTER I had the effect with kernel 6.8.x (only once)


I currently have

6.8.2-arch2-1

core/linux-firmware-whence 20240312.3b128b60-1
core/linux-firmware 20240312.3b128b60-1 

really weird
Comment 50 Pedro 2024-04-15 17:42:21 UTC
Coming from Bug #217931, I found the mentions of being stuck at low frequency odd as I couldn't observe that despite managing multiple hosts, but then here I am.

The twist is that I have a 7950X3D desktop setup, not a laptop one, and apparently I just ran into the same low frequency issue others experienced.
Unfortunately the usefulness of my information will be limited as I'm on a not really customized Kubuntu 23.10 setup with kernel 6.5.0, but on the other hand I haven't touched anything relevant, not even setting a frequency limit.

I'm observing the CPU being stuck in the 400 MHz - 549 MHz range which is quite fitting for this bug report, and the host was never suspended / hibernated.
The only relevant oddity I've found so far is that /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq was sticking out like a sore thumb with 400000 set while other cores had the value of 5759000, but changing that didn't make a difference.
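
A quick way to spot such an outlier is to compare every core's scaling_max_freq against the most common value. The sketch below is an illustration only: the function name `odd_max_freq` and its optional root-directory argument are assumptions, and on parts where cores legitimately have different maximum frequencies it will also list those cores.

```shell
#!/bin/sh
# Print "cpuN <kHz>" for every CPU whose scaling_max_freq differs from
# the most common value across all CPUs.
# $1: optional cpu root directory (hypothetical parameter for testing);
#     defaults to the standard sysfs location.
odd_max_freq() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        [ -r "$f" ] || continue
        cpu=${f%/cpufreq/scaling_max_freq}
        echo "${cpu##*/} $(cat "$f")"
    done | awk '
        # Remember each line and count how often each value occurs.
        { cpu[NR] = $1; val[NR] = $2; count[$2]++ }
        END {
            for (v in count)
                if (count[v] > best) { best = count[v]; common = v }
            for (i = 1; i <= NR; i++)
                if (val[i] != common) print cpu[i], val[i]
        }'
)
```

On the machine described above, `odd_max_freq` would be expected to print a single line for cpu4 with 400000. No output means all cores agree.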

I'm not really sure when this manifested itself, but it was most likely after (or during?) a case of Bug #204253. I brushed away the slowness for a while as the usual heavy I/O (over NFS) problem, which used to freeze the desktop for more than a minute on a weaker setup; the current higher-performance CPU seemed to take it better, although the experience was still disruptive.
Is this really a laptop bug then instead of a more generic problem with a large stutter causing some logic to get upset possibly due to timing problems? Heavy CPU usage alone surely doesn't do the trick as I've seen hosts doing fine with that, but heavy I/O seems more brutal with possibly similar "world stopping power" as suspending.
Comment 51 Mario Limonciello (AMD) 2024-04-15 19:01:52 UTC
Please let's stick to upstream kernels. It will just confuse the issue with the distro kernel, ESPECIALLY a kernel that is EOL upstream. We had other fixes that have landed in amd-pstate that are definitely missing from a 6.5 kernel that could very well be a similar or same issue.

So please reproduce with 6.9-rc4 or 6.8.7. If you can still reproduce it then please open a new issue and collect all possible information. If it's indeed the same issue we can mark as a duplicate at that time.

This issue is looking like a thermal throttling issue where the APU didn't properly ack a request from the EC while in suspend. I posted a patch that gives the APU more time to ack it during suspend but it needs to be tested still in a case that it can be reproduced reliably.

If it doesn't help, I would like to see if extending the time duration in between cycles helps.
Comment 52 Pedro 2024-04-15 19:35:54 UTC
I believe the other issue was supposed to be strictly about limiting max frequency causing issues, and I'm definitely not doing that, but possibly I missed other fixes; I surely didn't keep up with everything.
Understood the warning though, but that's exactly why I pointed out that my kernel version might not be helpful.

The main point was that while all discussions seem to be about APUs, I encountered an issue that appears to be really similar, if not the same, with a desktop CPU. Just wanted this information to be added as I've found 3 bug reports where this problem is mentioned but with only laptop CPUs discussed.
Comment 53 Mario Limonciello (AMD) 2024-04-15 20:43:16 UTC
7950X3D is a desktop SoC, but IIRC it has integrated graphics. It's a desktop APU.

But that aside, thermal throttling can be triggered even in CPU products from the EC. The interface the EC uses to do this applies to both types of parts.
