Created attachment 274151 [details] turbostat log powertop shows the CPUs in C10, core in C6, and package in C6. No residency in PC10. turbostat agrees. From the Intel Software Developers' Manual, Vol. 4 [1], I know that there are MSRs for core C1/C3/C6 and package C2/C3/C6/C10. Also confirmed by the commit message of 5c10b048c37c ("perf/x86/intel: Enable C-state residency events for Apollo Lake") In the attached turbostat log, I noticed something odd: cpu1: MSR_PKG_CST_CONFIG_CONTROL: 0x14008072 (... pkg-cstate-limit=2 ...) The BIOS allows you to limit the deepest cstate, with options disabled, C1, C6, C7, C8, C9, C10. I tried each option and did a "rdmsr 0xe2" with each set: BIOS limit powertop reports rdmsr 0xe2 disabled no pkg residency 8000 C1 PC2 14008012 C6 PC6 14008032 C7 PC6 14008042 C8 PC6 14008052 C9 PC6 14008062 C10 PC6 14008072 According to ISD Vol 4 [1] Table 2-12, bits 3:0 of Goldmont's MSR_PKG_CST_CONFIG_CONTROL are: 0000b: No limit 0001b: C1 0010b: C3 0011b: C6 0100b: C7 0101b: C7S 0110b: C8 0111b: C9 1000b: C10 Until C7S, these correspond to bits 7:4 of my readings of 0xe2 (with a missing C3 fitting in as 14008022). The high bits are the lock bit (bit 15) and presumably demote/undemote bits, though those are not documented for Goldmont. Is the BIOS programming the wrong nibble? 0x2 in the low nibble looks like it would be C3 and maybe the undemotion bit would cause the package to use C6? Maybe I'm off on a tangent? [1] https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-4.pdf
Created attachment 274153 [details] dmesg
Created attachment 274155 [details] lspci -vvnn
Created attachment 274157 [details] acpidump
(In reply to Matt Turner from comment #0) > Created attachment 274151 [details] > turbostat log > > powertop shows the CPUs in C10, core in C6, and package in C6. No residency > in PC10. turbostat agrees. > > From the Intel Software Developers' Manual, Vol. 4 [1], I know that there > are MSRs for core C1/C3/C6 and package C2/C3/C6/C10. Also confirmed by the > commit message of 5c10b048c37c ("perf/x86/intel: Enable C-state residency > events for Apollo Lake") > > In the attached turbostat log, I noticed something odd: > > cpu1: MSR_PKG_CST_CONFIG_CONTROL: 0x14008072 (... pkg-cstate-limit=2 ...) > > The BIOS allows you to limit the deepest cstate, with options disabled, C1, > C6, C7, C8, C9, C10. I tried each option and did a "rdmsr 0xe2" with each > set: > > BIOS limit powertop reports rdmsr 0xe2 > disabled no pkg residency 8000 > C1 PC2 14008012 > C6 PC6 14008032 > C7 PC6 14008042 > C8 PC6 14008052 > C9 PC6 14008062 > C10 PC6 14008072 > > According to ISD Vol 4 [1] Table 2-12, bits 3:0 of Goldmont's > MSR_PKG_CST_CONFIG_CONTROL are: > > 0000b: No limit > 0001b: C1 > 0010b: C3 > 0011b: C6 > 0100b: C7 > 0101b: C7S > 0110b: C8 > 0111b: C9 > 1000b: C10 > > Until C7S, these correspond to bits 7:4 of my readings of 0xe2 (with a > missing C3 fitting in as 14008022). The high bits are the lock bit (bit 15) > and presumably demote/undemote bits, though those are not documented for > Goldmont. > > Is the BIOS programming the wrong nibble? 0x2 in the low nibble looks like > it would be C3 and maybe the undemotion bit would cause the package to use > C6? > > Maybe I'm off on a tangent? > > [1] > https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol- > 4.pdf This is interesting, I'm not sure if what does undemotion mean, does it mean, the processor will auto convert C1/C3 request to C6, which is the opposite of demotion? I also saw there's bit16 in MSR_PKG_CST_CONFIG_CONTROL for Xeon Processors E5 v4 Family Based on the Broadwell SDM vol 4 2-233: Automatic C-State Conversion Enable (R/W) If 1, the processor will convert HALT or MWAT(C1) to MWAIT(C6) But anyway I think you are right, the BIOS might has scribbled the wrong bits on MSR_PKG_CST_CONFIG_CONTROL. @Len
The status was changed to NEEDINFO... but I don't see a request for any info. Is there something I can provide?
Six week ping
We'll talk to Len/Rafael about this issue in the meeting this week.
I contacted ASRock, and they responded with: """ In our ApolloLake and GeminiLake platform, they both can support up to C7 state. But C8, C9, C10 state are not supported. To enter C7 state: Please set "SATA Aggressive Link Power Management" to enabled. This option is set to disabled by default, so it could only enter C6. In these platform, we need to use S0ix to support C8 ~C10. Now, it is the hardware limitation, so it could only support up to C7. """ I think what they're saying is that S0ix is not supported (on this platform, on this board? I don't know) and it is required to get to PC8~PC10. Documentation/acpi/lpit.txt says """ The following attributes are added dynamically to the cpuidle sysfs attribute group: /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us The "low_power_idle_cpu_residency_us" attribute shows time spent by the CPU package in PKG C10 The "low_power_idle_system_residency_us" attribute shows SLP_S0 residency, or system time spent with the SLP_S0# signal asserted. This is the lowest possible system power state, achieved only when CPU is in PKG C10 and all functional blocks in PCH are in a low power state. """ The last paragraph seems to suggest the dependency between S0ix and PC10 is the other way around. In any case, my system does not have low_power_idle_system_residency_us in sysfs, so I suspect that indicates it does not support S0ix. Really not sure how any end users are supposed to decipher all of this. At least some confirmation from the experts that we've come to the right conclusion would be nice.
Can you provide the output of /sys/kernel/debug/telemetry/soc_states ? Which idle driver are they using on APL and GLK? intel_idle ?
I've built my kernel with CONFIG_INTEL_TELEMETRY, INTEL_PMC_IPC, INTEL_PUNIT_IPC, DEBUG_FS, etc, but I do not have /sys/kernel/debug/telemetry/ I am using intel_idle: # cat /sys/devices/system/cpu/cpuidle/current_driver intel_idle
Moving to Product "Power Management" so we will see this... The turbostat output shows the OS is requesting C10 most of the time, but never getting deeper than Core C6 (and thus, Package C6). The response from the vendor says that the system should be able to get into C7 and PC7. I recommend you follow their advice on ASPM to achieve that state. Also, just for grins, to see what Windows would get, boot in ACPI mode via "intel_idle.max_cstate=0". When you do so, please use the latest turbostat from the kernel tree. sudo turbostat -o ts.out sleep 10
Re: comment #11 I meant to type SATA ALPM (not PCI ASPM) Please enable SATA ALPM and run latest turbostat to show you can get into PC7, the deepest state the OEM claims to support on this platform. # echo min_power > /sys/class/scsi_host/*/link_power_management_policy or some such should be able to do it at run-time. Re: comment #8 yes, PC10 is required to reach S0ix. But sounds like the OEM supports only to PC7. Further, dmesg in comment #1 shows "ACPI: (supports S0 S3 S5)", so the BIOS claims to support S3, and that may be the default suspend state. You can tell by running the upstream kernel and looking at /sys/power/mem_sleep. For an S0ix box, it will default to "s2idle", and for an S3 box, it will be set to "deep". Re: Original Description I agree, the BIOS SETUP impact on MSR 0xE2 does not seem to match the table 2-12 in the SDM. Further turbostat.c doesn't match either: int bxt_pkg_cstate_limits[16] = {PCL__0, PCL__2, PCLUNL, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV};
Re: Original Description The SDM matches the APL EDS for the Package C-state limits, which are in bits 3:0. The BIOS SETUP appears to be erroneous. If it is claiming to limit *package* c-states, then you are right, it is programming the wrong nibble. It is changing 7:4 when it should be changing 3:0. Also, the values used don't quite match the SDM. This particular hardware did have a core-cstate limit in 7:4. I don't know if that feature ever shipped, but the values are totally different, and so it seems erroneous that the BIOS is changing 7:4 based on BIOS SETUP. But I'm puzzled by the value of 2 in 3:0. turbostat decodes that as unlimited, and you are observing actual pkg%c6, but the SDM table suggests the value 2 should limit the package to PC3.
(In reply to Len Brown from comment #12) > Re: comment #11 > > I meant to type SATA ALPM (not PCI ASPM) > Please enable SATA ALPM and run latest turbostat to show you can get into > PC7, > the deepest state the OEM claims to support on this platform. > > # echo min_power > /sys/class/scsi_host/*/link_power_management_policy In doing that with powertop I have occasionally seen PC7 residency (50% PC6/50% PC7), but not in any predictable manner. I've never gotten turbostat to report PC7 residency. I'm also somewhat confused where these tools are reading PC7 residency from, since the SDM says there are only MSRs for PC2/3/6/10 on BXT. > or some such should be able to do it at run-time. > > Re: comment #8 > > yes, PC10 is required to reach S0ix. > But sounds like the OEM supports only to PC7. As an aside, is it common for an OEM to limit the maximum C states like that? > Further, dmesg in comment #1 shows "ACPI: (supports S0 S3 S5)", > so the BIOS claims to support S3, and that may be the default suspend state. > > You can tell by running the upstream kernel and looking at > /sys/power/mem_sleep. For an S0ix box, it will default to "s2idle", > and for an S3 box, it will be set to "deep". Thanks. That's very good to know. mattst88@asrock-j3455m:~$ sudo cat /sys/power/mem_sleep s2idle [deep] So indeed the board does not support S0ix. > Re: Original Description > > I agree, the BIOS SETUP impact on MSR 0xE2 > does not seem to match the table 2-12 in the SDM. > Further turbostat.c doesn't match either: > > int bxt_pkg_cstate_limits[16] = {PCL__0, PCL__2, PCLUNL, PCLRSV, PCLRSV, > PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, > PCLRSV, PCLRSV}; I'd glad to get some confirmation that I'm not grossly confused :) (In reply to Len Brown from comment #13) > Re: Original Description > > The SDM matches the APL EDS for the Package C-state limits, which are in > bits 3:0. > > The BIOS SETUP appears to be erroneous. If it is claiming to limit > *package* c-states, then you are right, it is programming the wrong nibble. > It is changing 7:4 when it should be changing 3:0. Also, the values used > don't quite match the SDM. > > This particular hardware did have a core-cstate limit in 7:4. I don't know > if that feature ever shipped, but the values are totally different, and so > it seems erroneous that the BIOS is changing 7:4 based on BIOS SETUP. > > But I'm puzzled by the value of 2 in 3:0. turbostat decodes that as > unlimited, and you are observing actual pkg%c6, but the SDM table suggests > the value 2 should limit the package to PC3. Agreed. That's where my theory hits a snag. Could the demotion/undemotion bits cause the CPU to enter PC6 when the deepest allowed state is programmed to be PC3? IIRC demotion/undemotion (can't remember which does what) allows the CPU to pick PC6 instead of PC3 under some conditions. Not sure if that's a possible explanation. I really wish the lock bit wasn't set so I could actually test my theory. Why do we want to lock that register anyway? Maybe it's time to get out the SOIC-8 programmer and disassemble the BIOS.
Looks like on your device the IPC1 interface is also disabled, please check with BIOS vendor to enable that. Without this interface, Telemetry driver can not work and hence hinders debug.
ping...
We determined that the BIOS does not have the ability to enable IPC1, so the telemetry interface cannot be enabled. Without that, I think I understood from Raj that there wasn't much else that could be done. I guess feel free to close the bug.