Bug 217571 - amd_pmf: AMD 7840HS cpufreq locked at 400-544MHz after power unplugged
Summary: amd_pmf: AMD 7840HS cpufreq locked at 400-544MHz after power unplugged
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Platform_x86
Hardware: All Linux
Importance: P3 normal
Assignee: Shyam Sundar S K (AMD)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-18 14:52 UTC by Allen Zhong
Modified: 2024-03-11 17:24 UTC

See Also:
Kernel Version: 6.3.8; 6.4.0-rc6; GIT
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg w/ options amd-pmf dyndbg=+pflmt (110.31 KB, text/plain)
2023-06-19 07:38 UTC, Allen Zhong
Fix notify handler sequence in amd_pmf driver (3.83 KB, application/mbox)
2023-06-19 14:56 UTC, Shyam Sundar S K (AMD)

Description Allen Zhong 2023-06-18 14:52:19 UTC
I'm using the newly published HP EliteBook 845 14 inch G10 with AMD Ryzen 7 PRO 7840HS CPU.

The system boots normally, with CPU frequency up to ~5.1GHz, but every time I unplug the AC power supply, it drops to 400-544MHz and stays locked there.

I've tried using cpupower to set the frequency and ryzenadj with --max-performance; neither worked. Only rebooting the system with AC power attached resets it back to normal.

Several times even a warm reboot did not reset it, and I had to change a setting in the BIOS to trigger a cold reboot before the CPU frequency returned to normal.

I'm using amd_pstate=active but I also observed the same issue with amd_pstate=passive and acpi-cpufreq.
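
One way to watch for the stuck clocks described above is to read each core's current frequency from sysfs. The sketch below is only an illustration and was not used in this report: the helper name `cpu_freq_range_khz` is made up here, and it assumes the standard cpufreq layout under `/sys/devices/system/cpu`.

```shell
# Hypothetical helper (not from the report): print the lowest and highest
# current core frequency in kHz, as reported by the cpufreq sysfs files.
# The optional argument overrides the sysfs root, which is handy for testing.
cpu_freq_range_khz() {
    root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq 2>/dev/null |
        awk 'NR == 1 { min = max = $1 }
             { if ($1 + 0 < min + 0) min = $1; if ($1 + 0 > max + 0) max = $1 }
             END { if (NR) print min, max }'
}
```

Running `cpu_freq_range_khz` before and after unplugging AC should show whether every core is pinned in the 400000-544000 kHz range. The cpufreq driver in use can be read from `/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver`.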

When the freq is locked to 400-544MHz, I get a warning in dmesg:

[   22.592078] ------------[ cut here ]------------
[   22.592081] Voluntary context switch within RCU read-side critical section!
[   22.592083] WARNING: CPU: 0 PID: 9 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x5e0/0x660
[   22.592089] Modules linked in: tun ccm rfcomm hid_sensor_als hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub cmac algif_hash algif_skcipher af_alg bnep uvcvideo btusb videobuf2_vmalloc btrtl uvc btbcm videobuf2_memops videobuf2_v4l2 btintel btmtk videodev bluetooth videobuf2_common mc ecdh_generic tcp_diag inet_diag snd_sof_amd_rembrandt vfat fat snd_sof_amd_renoir iwlmvm snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof_xtensa_dsp snd_hda_codec_generic snd_sof snd_hda_scodec_cs35l41_spi ledtrig_audio joydev snd_sof_utils mac80211 snd_hda_codec_hdmi mousedev snd_soc_core snd_compress intel_rapl_msr libarc4 snd_hda_intel ac97_bus intel_rapl_common snd_pcm_dmaengine snd_intel_dspcfg snd_pci_ps edac_mce_amd snd_rpl_pci_acp6x snd_intel_sdw_acpi ext4 snd_hda_codec snd_acp_pci snd_pci_acp6x snd_hda_scodec_cs35l41_i2c crc16 kvm_amd snd_hda_core snd_hda_scodec_cs35l41 mbcache iwlwifi snd_pci_acp5x snd_hwdep snd_hda_cs_dsp_ctls jbd2 snd_rn_pci_acp3x kvm
[   22.592117]  snd_pcm cs_dsp ucsi_acpi cfg80211 snd_timer snd_soc_cs35l41_lib irqbypass hp_wmi snd_acp_config hid_multitouch typec_ucsi snd_soc_acpi snd sparse_keymap rapl thunderbolt typec wmi_bmof pcspkr amd_sfh k10temp soundcore snd_pci_acp3x rfkill i2c_piix4 roles i2c_hid_acpi amd_pmf amd_pmc i2c_hid serial_multi_instantiate platform_profile wireless_hotkey mac_hid crypto_user loop fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod hid_logitech_hidpp hid_logitech_dj usbhid amdgpu serio_raw i2c_algo_bit atkbd drm_ttm_helper libps2 ttm crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul drm_suballoc_helper vivaldi_fmap ghash_clmulni_intel nvme sha512_ssse3 drm_buddy aesni_intel gpu_sched nvme_core crypto_simd drm_display_helper xhci_pci cryptd video i8042 ccp cec xhci_pci_renesas nvme_common serio wmi
[   22.592149] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.4.0-rc6-1-mainline #1 f389f89fcf30e775529b6bb0f192e37f43fa3079
[   22.592150] Hardware name: HP HP EliteBook 845 14 inch G10 Notebook PC/8B6E, BIOS V82 Ver. 01.01.08 05/22/2023
[   22.592151] Workqueue: events power_supply_changed_work
[   22.592154] RIP: 0010:rcu_note_context_switch+0x5e0/0x660
[   22.592156] Code: 00 00 00 00 0f 85 07 fd ff ff 49 89 8c 24 a0 00 00 00 e9 fa fc ff ff 48 c7 c7 10 ec 63 ac c6 05 6e 95 e5 01 01 e8 90 4f f4 ff <0f> 0b e9 7b fa ff ff 49 83 bc 24 98 00 00 00 00 49 8b 84 24 a0 00
[   22.592157] RSP: 0018:ffffb60700187bc0 EFLAGS: 00010082
[   22.592158] RAX: 0000000000000000 RBX: ffff99783e834f40 RCX: 0000000000000027
[   22.592158] RDX: ffff99783e821688 RSI: 0000000000000001 RDI: ffff99783e821680
[   22.592159] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffb60700187a50
[   22.592159] R10: 0000000000000003 R11: ffffffffaceca808 R12: ffff99783e834040
[   22.592160] R13: ffff996900832700 R14: 0000000000000000 R15: ffff996909fe9c50
[   22.592160] FS:  0000000000000000(0000) GS:ffff99783e800000(0000) knlGS:0000000000000000
[   22.592161] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.592161] CR2: 0000563c72ba1018 CR3: 0000000e32a20000 CR4: 0000000000750ef0
[   22.592162] PKRU: 55555554
[   22.592162] Call Trace:
[   22.592164]  <TASK>
[   22.592164]  ? rcu_note_context_switch+0x5e0/0x660
[   22.592166]  ? __warn+0x81/0x130
[   22.592171]  ? rcu_note_context_switch+0x5e0/0x660
[   22.592172]  ? report_bug+0x171/0x1a0
[   22.592175]  ? prb_read_valid+0x1b/0x30
[   22.592177]  ? handle_bug+0x3c/0x80
[   22.592178]  ? exc_invalid_op+0x17/0x70
[   22.592179]  ? asm_exc_invalid_op+0x1a/0x20
[   22.592182]  ? rcu_note_context_switch+0x5e0/0x660
[   22.592183]  ? acpi_ut_delete_object_desc+0x86/0xb0
[   22.592186]  ? acpi_ut_update_ref_count.part.0+0x22d/0x930
[   22.592187]  __schedule+0xc0/0x1410
[   22.592189]  ? ktime_get+0x3c/0xa0
[   22.592191]  ? lapic_next_event+0x1d/0x30
[   22.592193]  ? hrtimer_start_range_ns+0x25b/0x350
[   22.592196]  schedule+0x5e/0xd0
[   22.592197]  schedule_hrtimeout_range_clock+0xbe/0x140
[   22.592199]  ? __pfx_hrtimer_wakeup+0x10/0x10
[   22.592200]  usleep_range_state+0x64/0x90
[   22.592203]  amd_pmf_send_cmd+0x106/0x2a0 [amd_pmf bddfe0fe3712aaa99acce3d5487405c5213c6616]
[   22.592207]  amd_pmf_update_slider+0x56/0x1b0 [amd_pmf bddfe0fe3712aaa99acce3d5487405c5213c6616]
[   22.592210]  amd_pmf_set_sps_power_limits+0x72/0x80 [amd_pmf bddfe0fe3712aaa99acce3d5487405c5213c6616]
[   22.592213]  amd_pmf_pwr_src_notify_call+0x49/0x90 [amd_pmf bddfe0fe3712aaa99acce3d5487405c5213c6616]
[   22.592216]  notifier_call_chain+0x5a/0xd0
[   22.592218]  atomic_notifier_call_chain+0x32/0x50
[   22.592219]  power_supply_changed_work+0x7c/0xe0
[   22.592220]  process_one_work+0x1c4/0x3d0
[   22.592223]  worker_thread+0x51/0x390
[   22.592225]  ? __pfx_worker_thread+0x10/0x10
[   22.592226]  kthread+0xe5/0x120
[   22.592228]  ? __pfx_kthread+0x10/0x10
[   22.592229]  ret_from_fork+0x29/0x50
[   22.592233]  </TASK>
[   22.592233] ---[ end trace 0000000000000000 ]---

This is also the case if I boot the system without the AC supply plugged in.

My boot parameters are:

root=/dev/mapper/vg-root rw rootflags=subvol=@ amdgpu.sg_display=0 amd_iommu=on iommu=pt amd_pstate=active loglevel=3 cryptdevice=UUID=xxxx-xxxx:cryptdev resume=/dev/vg/swap nowatchdog modprobe.blacklist=sp5100_tco

I've tried linux-6.3.8 and linux-zen-6.3.8 from ArchLinux, my self-compiled 6.3.8, and mainline 6.4.0-rc6 and 6.4.0-rc6-r269-g1b29d271614a, without any luck; they all behave the same. It's a 100% reproducible issue for me.

With the help of FlyGoat I tried blacklisting amd_pmf as a workaround, and it works.

Without amd_pmf, the CPU frequency is not locked at 400-544MHz after unplugging the power, and there is no warning in dmesg either.

If I modprobe amd_pmf without AC power, the CPU frequency gets locked at 400-544MHz within several seconds and the warning described above is printed to dmesg. modprobe -r amd_pmf does not fix it; only rebooting resets it.
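
The workaround mentioned above (keeping amd_pmf from loading) can be made persistent with a modprobe configuration file. This is a minimal sketch: the function name and the file name `blacklist-amd-pmf.conf` are arbitrary choices made here, and `/etc/modprobe.d` is assumed to be where the distribution reads modprobe configuration.

```shell
# Hypothetical helper: write a modprobe.d snippet that stops amd_pmf from
# being auto-loaded. Pass a directory to write somewhere other than
# /etc/modprobe.d (writing there needs root).
blacklist_amd_pmf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'blacklist amd_pmf\n' > "$conf_dir/blacklist-amd-pmf.conf"
}
```

Adding `modprobe.blacklist=amd_pmf` to the kernel command line is an alternative that needs no file, in the same way the boot parameters above already blacklist sp5100_tco. If the module is included in the initramfs, the image may need to be regenerated before the file-based variant takes effect.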
Comment 1 Mario Limonciello (AMD) 2023-06-19 04:00:54 UTC
Can you please turn on dynamic debugging for amd_pmf on the kernel command line and then share a full dmesg demonstrating from boot until when this happens?
Comment 2 Allen Zhong 2023-06-19 07:38:56 UTC
Created attachment 304452 [details]
dmesg w/ options amd-pmf dyndbg=+pflmt

Attached is the full dmesg output from a boot without AC supply and "options amd-pmf dyndbg=+pflmt" set on the ArchLinux default kernel.
Comment 3 Shyam Sundar S K (AMD) 2023-06-19 14:56:18 UTC
Created attachment 304455 [details]
Fix notify handler sequence in amd_pmf driver

Can you try the attached change and see if that helps?
Comment 4 Allen Zhong 2023-06-20 06:19:16 UTC
Thanks! I can confirm the patch works for me on 6.3.8.

With amd_pmf loaded, there is no warning in the log either when unplugging AC power after boot or when booting without AC, and the CPU frequency is normal.

Unloading and loading amd_pmf with modprobe works normally as well.
Comment 6 Mario Limonciello (AMD) 2023-07-13 12:45:05 UTC
FYI - your system supported PMF functions 0xe0c3.  

Because of the fix for this issue, the static slider is no longer offered. Technically your system *should* offer the static slider, but the targets it uses are stored in the EC, not the BIOS.

This series should enable static slider for you.

https://patchwork.kernel.org/project/platform-driver-x86/list/?series=765217
Comment 7 Jingyuan Deng 2023-09-21 04:02:14 UTC
Will this patch be merged into the mainline kernel at some point?
Comment 8 Mario Limonciello (AMD) 2023-09-21 11:54:06 UTC
Both patches referenced here are now merged.
Comment 9 Thong Pham 2023-11-09 00:41:05 UTC
I also have an HP 845 G10 (7840HS) running kernel 6.6, and the problem is still there. The frequency is locked at 400-544 MHz. Blacklisting amd_pmf and then rebooting does fix the issue.

I saw the problem this morning, after suspending and then connecting the AC power supply overnight. Here is the dmesg output: http://ix.io/4L2b
Comment 10 Mario Limonciello (AMD) 2023-11-09 00:47:32 UTC
Please cherry pick this commit:

https://github.com/torvalds/linux/commit/bbaa6ffa5b6c9609d3b3c431c389b407eea5441f
Comment 11 Jingyuan Deng 2023-11-09 01:09:45 UTC
I have an HP ZBook Power G10 A and I see a CPU frequency problem every time AC is not in use. In that situation, the CPU frequency and power are fine when the system load is low or medium, but when the load is high the frequency drops to about 600 MHz and the CPU power falls to about 8-15 W, then the system will be really 

For example, if I build the Linux kernel with -j 16 while AC is offline (the CPU has 8C16T), the CPU power drops to about 15 W and the frequency to about 800 MHz. However, if I build with, say, -j 8, the CPU power and frequency are fine and the build is not slow.

Plugging AC in helps to unlock the frequency, but the next time AC goes offline the problem comes back.

Blacklisting amd_pmf does not solve this problem.

I also have an HP EliteBook 865 G10 (almost the same as the 845, with the same UEFI but a larger screen). Sometimes suspend causes a similar problem there, but the frequency problem does not happen after a reboot, even with AC disconnected.

I guess the two problems above may not be the same problem. Do you think so?
Comment 12 Thong Pham 2023-11-09 01:31:49 UTC
Thanks Mario for the quick response. I've applied the patch. Let me watch for a few days to see if the problem comes back.
Comment 13 Thong Pham 2023-11-12 08:48:32 UTC
Hi Mario, I still encounter the problem even after applying the patch. I hope these logs can help you debug the issue:

http://ix.io/4Lhi
http://ix.io/4Lhj
Comment 14 Mario Limonciello (AMD) 2023-11-13 21:44:43 UTC
Can you please explain the reproduce steps better?  
Is it specific to the sequence of events:

* Power supply plugged in
* Suspend machine
* Unplug power supply
* Resume machine
* Observe cores stuck


> http://ix.io/4Lhi
> http://ix.io/4Lhj

Are these just two separate reproductions of the issue?

What mode do you have CONFIG_X86_AMD_PSTATE_DEFAULT_MODE set to for your kernel build?
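
For readers unfamiliar with the option: in the kernel's Kconfig, `CONFIG_X86_AMD_PSTATE_DEFAULT_MODE` is an integer where 1 means disabled, 2 passive, 3 active (EPP) and 4 guided. The sketch below turns a kernel config file into the mode name. The helper names are made up for illustration, and the location of the config file varies by distribution (`/proc/config.gz` and `/boot/config-$(uname -r)` are common, but neither is guaranteed to exist).

```shell
# Hypothetical helpers, for illustration only.
# Map the Kconfig integer to the amd-pstate mode name.
pstate_mode_name() {
    case "$1" in
        1) echo disabled ;;
        2) echo passive ;;
        3) echo active ;;
        4) echo guided ;;
        *) echo unknown ;;
    esac
}

# Read CONFIG_X86_AMD_PSTATE_DEFAULT_MODE from a plain-text kernel config
# (for /proc/config.gz, decompress it first, e.g. with zcat).
pstate_default_mode() {
    value=$(sed -n 's/^CONFIG_X86_AMD_PSTATE_DEFAULT_MODE=//p' "$1")
    pstate_mode_name "$value"
}
```

The build-time default can be overridden by the `amd_pstate=` boot parameter; on kernels that provide it, the mode actually in effect is reported in `/sys/devices/system/cpu/amd_pstate/status`.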

Can you please turn on dynamic debugging for drivers/cpufreq/amd-pstate.c and also for the amd-pmf kernel module and reproduce again?
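
If debugfs is mounted and the kernel is built with CONFIG_DYNAMIC_DEBUG, the debug messages requested above can also be switched on at runtime, without rebooting. A minimal sketch, assuming the usual control file path `/sys/kernel/debug/dynamic_debug/control`; the helper name `ddebug_query` is made up here.

```shell
# Hypothetical helper: send one dynamic-debug query to the control file.
# Writing to the real control file needs root; the second argument lets the
# target file be overridden.
ddebug_query() {
    control="${2:-/sys/kernel/debug/dynamic_debug/control}"
    printf '%s\n' "$1" > "$control"
}

# Intended use (as root):
#   ddebug_query 'file drivers/cpufreq/amd-pstate.c +p'
#   ddebug_query 'module amd_pmf +p'
```

The `+p` flag enables printing. The `+pflmt` form used earlier in this report additionally prefixes each message with the function name, line number, module name and thread ID. The messages then appear in dmesg.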
Comment 15 Mario Limonciello (AMD) 2023-11-13 21:46:16 UTC
For both of your machines they can benefit from Smart PC solution builder patch series and firmware.  

After sharing my above asks, it would be really helpful if you can apply the series to your kernel and grab the matching firmware from linux-firmware.git to see if you can still reproduce with them in place.
Comment 16 Thong Pham 2023-11-14 00:28:50 UTC
> Can you please explain the reproduce steps better? 

Honestly I don't know how to reproduce the problem. It just happens after a few days of my normal usage, and I only notice because the machine becomes unresponsive. 

Also, following the steps above does not reproduce the issue. I'll try to find a way to reproduce it.

> Are these just two separate reproductions of the issue?

The logs are from the same machine, captured right after I noticed the issue. The first one is from `dmesg` and the second one is from `journalctl -b0`.

> What mode do you have CONFIG_X86_AMD_PSTATE_DEFAULT_MODE set to for your
> kernel build?

It's set to 3, but I also have these settings in TLP related to P-state and power management.

```
CPU_DRIVER_OPMODE_ON_AC = "active";             
CPU_DRIVER_OPMODE_ON_BAT = "active";            
                                                
CPU_SCALING_GOVERNOR_ON_AC = "powersave";       
CPU_SCALING_GOVERNOR_ON_BAT = "powersave";      
CPU_ENERGY_PERF_POLICY_ON_AC = "power";         
CPU_ENERGY_PERF_POLICY_ON_BAT = "power";        
                      
CPU_HWP_DYN_BOOST_ON_AC = 1;  # doesn't seem to work since /sys/devices/system/cpu/amd_pstate/cppc_dynamic_boost not available           
CPU_HWP_DYN_BOOST_ON_BAT = 1; # doesn't seem to work since /sys/devices/system/cpu/amd_pstate/cppc_dynamic_boost not available                  
                                                
# Runtime Power Management and ASPM             
RUNTIME_PM_ON_AC = "auto";                      
RUNTIME_PM_ON_BAT = "auto";                     
PCIE_ASPM_ON_AC = "powersave";                  
PCIE_ASPM_ON_BAT = "powersave";                 

```

> Can you please turn on dynamic debugging for drivers/cpufreq/amd-pstate.c and
> also for the amd-pmf kernel module and reproduce again?

Sure, but currently I don't know how to reproduce it yet.

> For both of your machines they can benefit from Smart PC solution builder
> patch series and firmware. 

I haven't heard about this but let me try. Thank you for the suggestion.
Comment 17 Mario Limonciello (AMD) 2023-11-14 01:02:27 UTC
> Honestly I don't know how to reproduce the problem. It just happens after a
> few days of my normal usage, and I only notice because the machine becomes
> unresponsive. 

OK.

> The logs are from the same machine, captured right after I noticed the issue.
> The first one is from `dmesg` and the second one is from `journalctl -b0`.

Got it, OK.

> It's set to 3, but I also have these settings in TLP related to P-state and
> power management.

OK.  It's conceivable that a sequence of events triggered by TLP causes this issue.  Can you please stop using TLP while we try to figure out what is happening?  If the problem doesn't occur over a period in which TLP would normally have applied something you configured, that may point at the root cause.

Some other ideas:
* Are you using any other software that may be changing things?
* Is there by chance any correlation with video playback over suspend/resume or adapter plug/unplug?
* Were you changing any DPM settings with TLP?
Comment 18 Thong Pham 2023-11-14 02:43:07 UTC
>  Are you using any other software that may be changing things?

I don't think so, but I can provide more relevant input.

- I have these additional module settings on top of the NixOS defaults:

```
boot.initrd.availableKernelModules =
        [ "nvme" "xhci_pci" "thunderbolt" "usb_storage" "sd_mod" "amdgpu" ];
kernelModules =
        [ "kvm-amd" "synaptics_usb" "hp-wmi" "hp-wmi-sensors" "k10temp" ];
```
- And here is my logind config

```
HandleLidSwitch=hibernate
HandlePowerKey=suspend
HandleLidSwitchDocked=ignore
IdleAction=suspend-then-hibernate
IdleActionSec=5min
```
- Until about a week ago, I always hit this crash when booting (search for WARNING in this log: http://ix.io/4Lrh), but it has gone away recently. I noticed that suspend and hibernate were not stable: the machine still felt hot while suspended, and hibernating overnight drained 30% of the battery. Fortunately, I haven't experienced the issue since that day, which I believe is because of applying your suggested patch. 

>  Is there by chance any correlation with video playback over suspend/resume
>  or adapter plug/unplug?

Just tested now: suspending works fine. But I hit a problem with hibernating while TLP was enabled: it cannot hibernate while a YouTube video is playing. Without TLP, it works fine.

>  Were you changing any DPM settings with TLP?

Yes. These are DPM related config (my whole tlp config: http://ix.io/4Lry):

```
# Runtime Power Management and ASPM 
RADEON_DPM_PERF_LEVEL_ON_AC = "auto";
RADEON_DPM_PERF_LEVEL_ON_BAT = "low";            
RUNTIME_PM_ON_AC = "auto";                      
RUNTIME_PM_ON_BAT = "auto";                     
PCIE_ASPM_ON_AC = "powersave";                  
PCIE_ASPM_ON_BAT = "powersave"; 
```
Comment 19 Thong Pham 2023-11-14 02:45:58 UTC
While hibernation works fine without TLP, it takes 1 min to hibernate and 20s to resume from hibernation. I'm not sure if that's normal. My laptop uses an SK Hynix PC801, which is very fast.
Comment 20 Mario Limonciello (AMD) 2023-11-14 02:57:33 UTC
> - Until about a week ago, I always hit this crash when booting (search for
> WARNING in this log: http://ix.io/4Lrh), but it has gone away recently. I
> noticed that suspend and hibernate were not stable: the machine still felt
> hot while suspended, and hibernating overnight drained 30% of the battery.
> Fortunately, I haven't experienced the issue since that day, which I believe
> is because of applying your suggested patch. 

Ah suspend then hibernate I have a fix for you.  Pick up this patch:

https://lore.kernel.org/linux-rtc/20231106162310.85711-1-mario.limonciello@amd.com/

It sure would be nice if this is actually the fix for your speed problem too, but I think that's unlikely.

> Yes. These are DPM related config (my whole tlp config: http://ix.io/4Lry):

Yes; please turn all these off for now. If you're confident that things are better without them you can isolate which one causes issues.

> While hibernation works fine without TLP, it takes 1 min to hibernate and
> 20s to resume from hibernation. I'm not sure if that's normal. My laptop uses
> an SK Hynix PC801, which is very fast.

Depends on how much memory you have if that's reasonable or not.
Comment 21 Thong Pham 2023-11-14 03:14:31 UTC
> Ah suspend then hibernate I have a fix for you

Thank you. I'll try this. I wonder when will it be merged into the kernel

> Depends on how much memory you have if that's reasonable or not.

I have 32GB of RAM and 40GB of swap. But I tested hibernation right after starting the machine, with only Firefox open.
Comment 22 Thong Pham 2023-11-14 03:27:51 UTC
Just now I hit the issue again, but this time the clock is frozen at 865 MHz. It is happening right now, after (not sure if right after):
- Building the Linux kernel (for the patch you mentioned above).
- Plugging in the power adapter.


TLP is disabled. I haven't enabled the dynamic debugging you mentioned (only because I don't know how yet). How can I help you debug the issue?
Comment 23 Thong Pham 2023-11-14 06:01:27 UTC
This time, I can reproduce it every time just by compiling the kernel. After about 30 seconds of compiling, the frequency drops to 865 MHz. It bounces back if I stop the kernel build. My kernel is compiled with the `-march=znver4` option; I'm not sure if that's the issue.
Comment 24 Thong Pham 2023-11-14 11:14:39 UTC
Here is the video that I recorded at the time: https://www.youtube.com/watch?v=_bIfRg798bI . But I haven't seen the problem since restarting.
Comment 25 Mario Limonciello (AMD) 2023-11-14 15:25:45 UTC
> Thank you. I'll try this. I wonder when will it be merged into the kernel

If you test it and it improves things for you too you can reply with a "Tested-by" tag.  It's up to the maintainer for that subsystem when it will be merged.

> I have 32GB of RAM and 40GB of swap. But I tested hibernation right after
> starting the machine, with only Firefox open.

I think it would require some profiling to see where the problem is.  But I feel that's very likely separate from your issue at hand here unless the CPU cores are running slow when you run hibernate.

> This time, I can reproduce it every time just by compiling the kernel. After
> about 30 seconds of compiling, the frequency drops to 865 MHz. It bounces
> back if I stop the kernel build. My kernel is compiled with the
> `-march=znver4` option; I'm not sure if that's the issue.

If it is indeed only triggered by -march=znver4 it might be fixed by https://github.com/torvalds/linux/commit/f454b18e07f518bcd0c05af17a2239138bff52de

Is that present in your current kernel?
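
When the kernel is built from a git checkout, whether a given commit is already included can be checked directly. A sketch under that assumption; the helper name `has_commit` is made up here.

```shell
# Hypothetical helper: succeed if COMMIT is an ancestor of (or equal to) HEAD
# in the git tree at TREE, and fail otherwise (including for unknown hashes).
has_commit() {
    tree="$1"
    commit="$2"
    git -C "$tree" merge-base --is-ancestor "$commit" HEAD 2>/dev/null
}

# Intended use, from the kernel source directory:
#   has_commit . f454b18e07f518bcd0c05af17a2239138bff52de && echo present
```

This only recognizes the commit under its original hash. A cherry-picked or distribution-backported copy gets a new hash and would be reported as absent, so a negative result is not conclusive for such kernels.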
Comment 26 Thong Pham 2023-11-25 13:27:27 UTC
> Is that present in your current kernel?

It wasn't in my kernel before. But since the video I sent you, I haven't seen the problem again.

I also upgraded my system to kernel 6.7-rc1 about a week ago and haven't hit the issue so far.
Comment 27 Mario Limonciello (AMD) 2023-11-26 19:11:09 UTC
OK, if you end up not reproducing it again on 6.7-rcX please close this issue.
Comment 28 Thong Pham 2023-11-28 12:33:34 UTC
Hi Mario, are you asking me to close the issue? I'm sorry, I don't know how to close it since I'm not familiar with Bugzilla.
