Bug 217571
| Summary: | amd_pmf: AMD 7840HS cpufreq locked at 400-544MHz after power unplugged | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Allen Zhong (allen) |
| Component: | Platform_x86 | Assignee: | Shyam Sundar S K (AMD) (shyam-sundar.s-k) |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | 1700011628, aslightlyrandomemail, L.Bonnaud, mario.limonciello, nikola.ilo, shyam-sundar.s-k, thongpv87 |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 6.3.8; 6.4.0-rc6; GIT | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg w/ options amd-pmf dyndbg=+pflmt; Fix notify handler sequence in amd_pmf driver | | |
Description
Allen Zhong
2023-06-18 14:52:19 UTC
Can you please turn on dynamic debugging for amd_pmf on the kernel command line and then share a full dmesg demonstrating from boot until when this happens?

Created attachment 304452
dmesg w/ options amd-pmf dyndbg=+pflmt
Attached is the full dmesg output from a boot without AC supply, with "options amd-pmf dyndbg=+pflmt" set, on the Arch Linux default kernel.

Created attachment 304455
Fix notify handler sequence in amd_pmf driver
Can you try the attached change and see if that helps?
Thanks! I can confirm the patch works for me on 6.3.8. With amd_pmf loaded, there is no warning in the log either when unplugging AC power after boot or when booting without AC. CPU frequency is normal. Unloading and loading amd_pmf with modprobe works normally as well.

FYI - your system supports PMF functions 0xe0c3. Because of the fix for this issue, static slider is no longer offered. Technically your system *should* offer static slider, but the targets it uses are stored in the EC, not the BIOS. This series should enable static slider for you.
https://patchwork.kernel.org/project/platform-driver-x86/list/?series=765217

Will this patch be merged into the mainline kernel sometime?

Both patches referenced here are now merged.

I also have an HP 845 G10 (7840HS) running kernel 6.6, and the problem is still there. The frequency is locked at 400 MHz - 544 MHz. Blacklisting amd_pmf and rebooting does fix the issue. I saw the problem this morning, after suspending and then connecting the AC power supply overnight. Here is the dmesg output: http://ix.io/4L2b

Please cherry pick this commit: https://github.com/torvalds/linux/commit/bbaa6ffa5b6c9609d3b3c431c389b407eea5441f

I have an HP ZBook Power G10 A and I have a CPU frequency problem all the time when AC is not in use. In that situation, CPU frequency and power are fine when system load is low or medium, but when load is high the frequency drops to about 600 MHz and CPU power goes down to about 8-15 W, and then the system becomes really slow. For example, if I build the Linux kernel on battery with -j 16 (the CPU has 8C16T), CPU power drops to about 15 W and the frequency to about 800 MHz. However, if I build with, say, -j 8, CPU power and frequency are fine and the build is not slow. Plugging AC in unlocks the frequency, but the next time AC is unplugged the problem comes back. Blacklisting amd_pmf does not solve this problem.
I also have an HP EliteBook 865 G10 (almost the same as the 845, with the same UEFI but a larger screen). Sometimes suspend causes a similar problem there, but the frequency problem does not happen after a reboot, even with AC unplugged. I guess the two problems above may not be the same problem. Do you think so?

Thanks Mario for the quick response. I've applied the patch. Let me watch for a few days to see if the problem shows up again.

Hi Mario, I still encounter the problem even after applying the patch. I hope these logs can help you debug the issue: http://ix.io/4Lhi http://ix.io/4Lhj

Can you please explain the reproduce steps better?
Is it specific to the sequence of events:
* Power supply plugged in
* Suspend machine
* Unplug power supply
* Resume machine
* Observe cores stuck
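
For the last step in the sequence above, one way to observe stuck cores is to read each core's `scaling_cur_freq` from the cpufreq sysfs tree. The following is only a minimal Python sketch and not something posted in this thread: the function names and the 600 MHz ceiling are assumptions picked for illustration (the reports here mention 400-544 MHz), and it only gives a meaningful answer while the machine is under load, since idle cores sit at low frequencies anyway.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # standard cpufreq sysfs location


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file, or None."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def stuck_cores(cpu_root=CPU_ROOT, ceiling_mhz=600):
    """Map core name -> current MHz for every core at or below ceiling_mhz.

    ceiling_mhz is a hypothetical threshold, not a value defined by the kernel.
    """
    stuck = {}
    for freq_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        khz = read_khz(freq_file)
        if khz is not None and khz // 1000 <= ceiling_mhz:
            stuck[freq_file.parent.parent.name] = khz // 1000
    return stuck


if __name__ == "__main__":
    # Run this while a heavy job (for example a kernel build) is active.
    for core, mhz in stuck_cores().items():
        print(f"{core}: {mhz} MHz")
```

If every core shows up in the output during a heavy build, that matches the "locked" behaviour described in this bug; an empty result under load suggests the cores are scaling normally.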
> http://ix.io/4Lhi
> http://ix.io/4Lhj
Are these just two separate reproductions of the issue?
What mode do you have CONFIG_X86_AMD_PSTATE_DEFAULT_MODE set to for your kernel build?
Can you please turn on dynamic debugging for drivers/cpufreq/amd-pstate.c and also for the amd-pmf kernel module and reproduce again?
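
For readers unsure how to turn on the dynamic debugging requested above, here is a hedged Python sketch of one way to do it at runtime. It assumes the kernel was built with CONFIG_DYNAMIC_DEBUG, that debugfs is mounted at /sys/kernel/debug, and that the module is matched under the name `amd_pmf`; the helper names `dyndbg_queries` and `apply_queries` are made up for this example. The `+pflmt` flags are the ones already used for the attachment in this bug.

```python
from pathlib import Path

# Control file exposed by dynamic debug when debugfs is mounted (assumed path).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def dyndbg_queries(flags="+p"):
    """Build dynamic-debug queries for the amd_pmf module and amd-pstate.c."""
    return [
        f"module amd_pmf {flags}",
        f"file drivers/cpufreq/amd-pstate.c {flags}",
    ]


def apply_queries(queries, control=CONTROL):
    """Write each query to the control file, one write per query (needs root)."""
    for query in queries:
        control.write_text(query + "\n")


if __name__ == "__main__":
    try:
        apply_queries(dyndbg_queries("+pflmt"))
    except OSError as err:
        # Typical causes: not running as root, or debugfs not mounted.
        print(f"could not enable dynamic debug: {err}")
```

Settings applied this way are lost on reboot, so to capture messages from boot the `options amd-pmf dyndbg=+pflmt` modprobe line used earlier in this bug is still needed.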
For both of your machines they can benefit from Smart PC solution builder patch series and firmware. After sharing my above asks, it would be really helpful if you can apply the series to your kernel and grab the matching firmware from linux-firmware.git to see if you can still reproduce with them in place.

> Can you please explain the reproduce steps better?

Honestly, I don't know how to reproduce the problem. It just happens after a few days of normal usage, and I only notice because the machine becomes unresponsive. Following the steps above does not reproduce the issue either. I'll try to find a way to reproduce it.

> Are these just two separate reproductions of the issue?

The logs are from the same machine, taken right after I noticed the issue. The first one is from `dmesg` and the second one is from `journalctl -b0`.

> What mode do you have CONFIG_X86_AMD_PSTATE_DEFAULT_MODE set to for your
> kernel build?

It's set to 3, but I also have these settings in TLP related to P-state and power management.
```
CPU_DRIVER_OPMODE_ON_AC = "active";
CPU_DRIVER_OPMODE_ON_BAT = "active";
CPU_SCALING_GOVERNOR_ON_AC = "powersave";
CPU_SCALING_GOVERNOR_ON_BAT = "powersave";
CPU_ENERGY_PERF_POLICY_ON_AC = "power";
CPU_ENERGY_PERF_POLICY_ON_BAT = "power";
CPU_HWP_DYN_BOOST_ON_AC = 1; # doesn't seem to work since /sys/devices/system/cpu/amd_pstate/cppc_dynamic_boost not available
CPU_HWP_DYN_BOOST_ON_BAT = 1; # doesn't seem to work since /sys/devices/system/cpu/amd_pstate/cppc_dynamic_boost not available

# Runtime Power Management and ASPM
RUNTIME_PM_ON_AC = "auto";
RUNTIME_PM_ON_BAT = "auto";
PCIE_ASPM_ON_AC = "powersave";
PCIE_ASPM_ON_BAT = "powersave";
```

> Can you please turn on dynamic debugging for drivers/cpufreq/amd-pstate.c and
> also for the amd-pmf kernel module and reproduce again?

Sure, but currently I don't know how to reproduce it yet.

> For both of your machines they can benefit from Smart PC solution builder
> patch series and firmware.
I haven't heard about this, but let me try it. Thank you for the suggestion.

> Honestly, I don't know how to reproduce the problem. It just happens after a
> few days of normal usage, and I only notice because the machine becomes
> unresponsive.

OK.

> The logs are from the same machine, taken right after I noticed the issue. The
> first one is from `dmesg` and the second one is from `journalctl -b0`.

Got it, OK.

> It's set to 3, but I also have these settings in TLP related to P-state and
> power management.

OK. It's conceivable that a sequence of events that TLP does causes this issue. Can you please stop using TLP while we try to figure out what is happening? If it doesn't happen after a period of time in which you normally would have done something that you have configured with TLP, that may point at the root cause.

Some other ideas:
* Are you using any other software that may be changing things?
* Is there by chance any correlation with video playback over suspend/resume or adapter plug/unplug?
* Were you changing any DPM settings with TLP?

> Are you using any other software that may be changing things?

I think not, but I can provide more relevant inputs.
- I have these additional module settings besides the NixOS defaults:
```
boot.initrd.availableKernelModules = [ "nvme" "xhci_pci" "thunderbolt" "usb_storage" "sd_mod" "amdgpu" ];
kernelModules = [ "kvm-amd" "synaptics_usb" "hp-wmi" "hp-wmi-sensors" "k10temp" ];
```
- And here is my logind config:
```
HandleLidSwitch=hibernate
HandlePowerKey=suspend
HandleLidSwitchDocked=ignore
IdleAction=suspend-then-hibernate
IdleActionSec=5min
```
- Until a week ago, I always hit this crash when booting (search for WARNING in this log: http://ix.io/4Lrh), but it has gone away recently. I noticed that suspend and hibernate were not stable. The machine still felt hot while suspended, and hibernating overnight drained 30% of the battery.
Fortunately, I haven't experienced the issue since that day, which I believe is because I applied your suggested patch.

> Is there by chance any correlation with video playback over suspend/resume
> or adapter plug/unplug?

Just tested now: suspending works fine. But I hit an odd issue with hibernation while TLP was enabled: the machine cannot hibernate while playing a YouTube video. Without TLP, it works fine.

> Were you changing any DPM settings with TLP?

Yes. These are the DPM-related settings (my whole TLP config: http://ix.io/4Lry):
```
# Runtime Power Management and ASPM
RADEON_DPM_PERF_LEVEL_ON_AC = "auto";
RADEON_DPM_PERF_LEVEL_ON_BAT = "low";
RUNTIME_PM_ON_AC = "auto";
RUNTIME_PM_ON_BAT = "auto";
PCIE_ASPM_ON_AC = "powersave";
PCIE_ASPM_ON_BAT = "powersave";
```
While hibernation works fine without TLP, it takes 1 min to hibernate and 20 s to resume from hibernation. I'm not sure if that's normal. My laptop uses an SK Hynix PC801, which is very fast.

> - Until a week ago, I always hit this crash when booting (search for WARNING
> in this log: http://ix.io/4Lrh), but it has gone away recently. I noticed
> that suspend and hibernate were not stable. The machine still felt hot while
> suspended, and hibernating overnight drained 30% of the battery.
> Fortunately, I haven't experienced the issue since that day, which I believe
> is because I applied your suggested patch.

Ah, suspend then hibernate: I have a fix for you. Pick up this patch:
https://lore.kernel.org/linux-rtc/20231106162310.85711-1-mario.limonciello@amd.com/
It sure would be nice if this is actually the fix for your speed problem too, but I think that's unlikely.

> Yes. These are the DPM-related settings (my whole TLP config: http://ix.io/4Lry):

Yes; please turn all these off for now. If you're confident that things are better without them, you can isolate which one causes issues.

> While hibernation works fine without TLP, it takes 1 min to hibernate and
> 20 s to resume from hibernation.
> I'm not sure if that's normal. My laptop uses an
> SK Hynix PC801, which is very fast.

Depends on how much memory you have if that's reasonable or not.

> Ah, suspend then hibernate: I have a fix for you

Thank you. I'll try this. I wonder when it will be merged into the kernel.

> Depends on how much memory you have if that's reasonable or not.

I have 32GB of RAM and 40GB of swap. But I tested hibernation right after starting the machine, with only Firefox open.

Just now I hit the issue again, but this time the clock froze at 865 MHz. It's happening right now, after (not sure if right after):
- Building the Linux kernel (for the patch you mentioned above).
- Plugging in the power adapter.

TLP is disabled. I haven't enabled the dynamic debugging you mentioned (just because I don't know how yet). How can I help you debug the issue?

This time, I can reproduce it every time just by compiling the kernel. After about 30 seconds of compiling, the frequency drops to 865 MHz. It bounces back if I stop the kernel build. My kernel is compiled with the `-march=znver4` option; I'm not sure if that's the issue.

Here is the video that I recorded at the time: https://www.youtube.com/watch?v=_bIfRg798bI . But I haven't seen the problem since restarting.

> Thank you. I'll try this. I wonder when it will be merged into the kernel

If you test it and it improves things for you too, you can reply with a "Tested-by" tag. It's up to the maintainer for that subsystem when it will be merged.

> I have 32GB of RAM and 40GB of swap. But I tested hibernation right after
> starting the machine, with only Firefox open.

I think it would require some profiling to see where the problem is. But I feel that's very likely separate from your issue at hand here unless the CPU cores are running slow when you run hibernate.

> This time, I can reproduce it every time just by compiling the kernel. After
> about 30 seconds of compiling, the frequency drops to 865 MHz.
> It bounces back if I stop the kernel build. My kernel is compiled with the
> `-march=znver4` option; I'm not sure if that's the issue.

If it is indeed only triggered by -march=znver4, it might be fixed by https://github.com/torvalds/linux/commit/f454b18e07f518bcd0c05af17a2239138bff52de
Is that present in your current kernel?

> Is that present in your current kernel?
It wasn't in my kernel before. But after the video I sent you, I haven't seen the problem again.
I also upgraded my system to kernel 6.7-rc1 about a week ago and haven't hit the issue so far.
OK, if you end up not reproducing it again on 6.7-rcX, please close this issue.

Hi Mario, are you asking me to close the issue? I'm sorry, I don't know how to close it since I'm not familiar with Bugzilla.

Hi, ZBook Power G10 A user here (7840HS, 4050), running kernel 6.11.0-arch1-1. Not exactly the above issue, but when unplugged and running a more demanding load (e.g. stress -c 16, though it doesn't need to be that demanding), the CPU throttles to 544 MHz. I found that I could fix that by decreasing the power limits with ryzenadj (sudo ryzenadj --tctl-temp=95 --stapm-limit=40000 --fast-limit=40000 --slow-limit=35000 --power-saving).

However, after suspending the system, upon resume it always throttles to around 0.95 GHz when running stress -c 16 (the clock speed is higher with lower core count stress but still way too low; it only affects high core count loads, and in my tests fewer than 4 threads is usually fine), no matter which power profile it is in (set with ppd) and no matter what power limits I set. Hibernating fixes it, so it's probably a BIOS issue.

Additionally, if I suspend in the powersave power profile (which disables turbo), the limits are not changed when I switch profiles after resume; a suspend and resume in balanced or performance (which enable turbo) is required to fix this.

I was wondering if there might be some workaround, or what is actually going on here (especially for the frequency issue)?
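
To narrow down reports like the last one, where settings appear not to be restored after resume, it may help to compare a snapshot of power-related sysfs values taken before suspend with one taken after resume. The Python sketch below is an illustration only: the list of files is an assumption about what is commonly present on an amd-pstate laptop (anything missing is recorded as `None`), and `snapshot` and `changed` are hypothetical helper names rather than an existing tool.

```python
from pathlib import Path

# Paths relative to /sys; assumed to be relevant, missing files read as None.
SNAPSHOT_FILES = {
    "amd_pstate_status": "devices/system/cpu/amd_pstate/status",
    "boost": "devices/system/cpu/cpufreq/boost",
    "platform_profile": "firmware/acpi/platform_profile",
    "cpu0_max_khz": "devices/system/cpu/cpu0/cpufreq/scaling_max_freq",
    "cpu0_epp": "devices/system/cpu/cpu0/cpufreq/energy_performance_preference",
}


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def ac_online(sys_root):
    """Return True/False for the first 'Mains' supply, or None if none is found."""
    base = sys_root / "class" / "power_supply"
    if not base.is_dir():
        return None
    for supply in sorted(base.iterdir()):
        if read_text(supply / "type") == "Mains":
            return read_text(supply / "online") == "1"
    return None


def snapshot(sys_root=Path("/sys")):
    """Collect the selected sysfs values plus the AC adapter state."""
    state = {name: read_text(sys_root / rel) for name, rel in SNAPSHOT_FILES.items()}
    state["ac_online"] = ac_online(sys_root)
    return state


def changed(before, after):
    """Return {key: (old, new)} for every value that differs between snapshots."""
    return {k: (v, after.get(k)) for k, v in before.items() if v != after.get(k)}


if __name__ == "__main__":
    for key, value in snapshot().items():
        print(f"{key}: {value}")
```

A typical use would be to save `snapshot()` before suspending, call it again after resume, and attach the output of `changed(before, after)` together with dmesg to the bug report.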