Bug 198779

Summary: ASRock J3455M (Goldmont) does not reach PC7 (MSR_PKG_CST_CONFIG_CONTROL limit looks incorrect)
Product: Power Management Reporter: Matt Turner (mattst88)
Component: intel_idleAssignee: Len Brown (lenb)
Status: CLOSED WILL_NOT_FIX    
Severity: normal CC: lenb, rajneesh.bhardwaj, rjw, rui.zhang, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.15.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: turbostat log
dmesg
lspci -vvnn
acpidump

Description Matt Turner 2018-02-13 21:23:03 UTC
Created attachment 274151 [details]
turbostat log

powertop shows the CPUs in C10, core in C6, and package in C6. No residency in PC10. turbostat agrees.

From the Intel Software Developers' Manual, Vol. 4 [1], I know that there are MSRs for core C1/C3/C6 and package C2/C3/C6/C10. Also confirmed by the commit message of 5c10b048c37c ("perf/x86/intel: Enable C-state residency events for Apollo Lake")

In the attached turbostat log, I noticed something odd:

cpu1: MSR_PKG_CST_CONFIG_CONTROL: 0x14008072 (... pkg-cstate-limit=2 ...)

The BIOS allows you to limit the deepest cstate, with options disabled, C1, C6, C7, C8, C9, C10. I tried each option and did a "rdmsr 0xe2" with each set:

BIOS limit    powertop reports     rdmsr 0xe2
  disabled    no pkg residency           8000
        C1                 PC2       14008012
        C6                 PC6       14008032
        C7                 PC6       14008042
        C8                 PC6       14008052
        C9                 PC6       14008062
       C10                 PC6       14008072

According to ISD Vol 4 [1] Table 2-12, bits 3:0 of Goldmont's MSR_PKG_CST_CONFIG_CONTROL are:

0000b: No limit
0001b: C1
0010b: C3
0011b: C6
0100b: C7
0101b: C7S
0110b: C8
0111b: C9
1000b: C10

Until C7S, these correspond to bits 7:4 of my readings of 0xe2 (with a missing C3 fitting in as 14008022). The high bits are the lock bit (bit 15) and presumably demote/undemote bits, though those are not documented for Goldmont.

Is the BIOS programming the wrong nibble? 0x2 in the low nibble looks like it would be C3 and maybe the undemotion bit would cause the package to use C6?

Maybe I'm off on a tangent?

[1] https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-4.pdf
Comment 1 Matt Turner 2018-02-13 21:24:57 UTC
Created attachment 274153 [details]
dmesg
Comment 2 Matt Turner 2018-02-13 21:25:15 UTC
Created attachment 274155 [details]
lspci -vvnn
Comment 3 Matt Turner 2018-02-13 21:31:58 UTC
Created attachment 274157 [details]
acpidump
Comment 4 Chen Yu 2018-03-26 06:13:00 UTC
(In reply to Matt Turner from comment #0)
> Created attachment 274151 [details]
> turbostat log
> 
> powertop shows the CPUs in C10, core in C6, and package in C6. No residency
> in PC10. turbostat agrees.
> 
> From the Intel Software Developers' Manual, Vol. 4 [1], I know that there
> are MSRs for core C1/C3/C6 and package C2/C3/C6/C10. Also confirmed by the
> commit message of 5c10b048c37c ("perf/x86/intel: Enable C-state residency
> events for Apollo Lake")
> 
> In the attached turbostat log, I noticed something odd:
> 
> cpu1: MSR_PKG_CST_CONFIG_CONTROL: 0x14008072 (... pkg-cstate-limit=2 ...)
> 
> The BIOS allows you to limit the deepest cstate, with options disabled, C1,
> C6, C7, C8, C9, C10. I tried each option and did a "rdmsr 0xe2" with each
> set:
> 
> BIOS limit    powertop reports     rdmsr 0xe2
>   disabled    no pkg residency           8000
>         C1                 PC2       14008012
>         C6                 PC6       14008032
>         C7                 PC6       14008042
>         C8                 PC6       14008052
>         C9                 PC6       14008062
>        C10                 PC6       14008072
> 
> According to ISD Vol 4 [1] Table 2-12, bits 3:0 of Goldmont's
> MSR_PKG_CST_CONFIG_CONTROL are:
> 
> 0000b: No limit
> 0001b: C1
> 0010b: C3
> 0011b: C6
> 0100b: C7
> 0101b: C7S
> 0110b: C8
> 0111b: C9
> 1000b: C10
> 
> Until C7S, these correspond to bits 7:4 of my readings of 0xe2 (with a
> missing C3 fitting in as 14008022). The high bits are the lock bit (bit 15)
> and presumably demote/undemote bits, though those are not documented for
> Goldmont.
> 
> Is the BIOS programming the wrong nibble? 0x2 in the low nibble looks like
> it would be C3 and maybe the undemotion bit would cause the package to use
> C6?
> 
> Maybe I'm off on a tangent?
> 
> [1]
> https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-
> 4.pdf
This is interesting, I'm not sure if what does undemotion mean, 
does it mean, the processor will auto convert C1/C3 request to C6, which
is the opposite of demotion?
I also saw there's bit16 in MSR_PKG_CST_CONFIG_CONTROL for Xeon Processors E5 v4 Family Based on the Broadwell  SDM vol 4   2-233:
Automatic C-State Conversion Enable (R/W)
If 1, the processor will convert HALT or MWAT(C1) to MWAIT(C6)

But anyway I think you are right, the BIOS might has scribbled the  wrong bits on MSR_PKG_CST_CONFIG_CONTROL. @Len
Comment 5 Matt Turner 2018-04-04 19:06:54 UTC
The status was changed to NEEDINFO... but I don't see a request for any info. Is there something I can provide?
Comment 6 Matt Turner 2018-05-20 22:49:07 UTC
Six week ping
Comment 7 Chen Yu 2018-05-21 02:42:51 UTC
We'll talk to Len/Rafael about this issue in the meeting this week.
Comment 8 Matt Turner 2018-07-10 17:56:00 UTC
I contacted ASRock, and they responded with:

"""
In our ApolloLake and GeminiLake platform, they both can support up to C7 state. 
But C8, C9, C10 state are not supported.
To enter C7 state:
Please set "SATA Aggressive Link Power Management" to enabled.
This option is set to disabled by default, so it could only enter C6.

In these platform, we need to use S0ix to support C8 ~C10. 
Now, it is the hardware limitation, so it could only support up to C7.
"""

I think what they're saying is that S0ix is not supported (on this platform, on this board? I don't know) and it is required to get to PC8~PC10.

Documentation/acpi/lpit.txt says

"""
The following attributes are added dynamically to the cpuidle
sysfs attribute group:
        /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us
        /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us

The "low_power_idle_cpu_residency_us" attribute shows time spent
by the CPU package in PKG C10

The "low_power_idle_system_residency_us" attribute shows SLP_S0
residency, or system time spent with the SLP_S0# signal asserted.
This is the lowest possible system power state, achieved only when CPU is in
PKG C10 and all functional blocks in PCH are in a low power state.
"""

The last paragraph seems to suggest the dependency between S0ix and PC10 is the other way around.

In any case, my system does not have low_power_idle_system_residency_us in sysfs, so I suspect that indicates it does not support S0ix.

Really not sure how any end users are supposed to decipher all of this. At least some confirmation from the experts that we've come to the right conclusion would be nice.
Comment 9 Rajneesh Bhardwaj 2018-07-18 14:34:38 UTC
Can you provide the output of /sys/kernel/debug/telemetry/soc_states ?

Which idle driver are they using on APL and GLK? intel_idle ?
Comment 10 Matt Turner 2018-07-21 20:30:08 UTC
I've built my kernel with CONFIG_INTEL_TELEMETRY, INTEL_PMC_IPC, INTEL_PUNIT_IPC, DEBUG_FS, etc, but I do not have /sys/kernel/debug/telemetry/

I am using intel_idle:

# cat /sys/devices/system/cpu/cpuidle/current_driver 
intel_idle
Comment 11 Len Brown 2018-08-14 14:49:53 UTC
Moving to Product "Power Management" so we will see this...

The turbostat output shows the OS is requesting C10 most of the time, but never getting deeper than Core C6 (and thus, Package C6).

The response from the vendor says that the system should be able to get into C7 and PC7.  I recommend you follow their advice on ASPM to achieve that state.

Also, just for grins, to see what Windows would get, boot in ACPI mode via  "intel_idle.max_cstate=0".  When you do so, please use the latest turbostat from the kernel tree.

sudo turbostat -o ts.out sleep 10
Comment 12 Len Brown 2018-08-15 03:14:39 UTC
Re: comment #11

I meant to type SATA ALPM (not PCI ASPM)
Please enable SATA ALPM and run latest turbostat to show you can get into PC7,
the deepest state the OEM claims to support on this platform.

# echo min_power > /sys/class/scsi_host/*/link_power_management_policy

or some such should be able to do it at run-time.

Re: comment #8

yes, PC10 is required to reach S0ix.
But sounds like the OEM supports only to PC7.

Further, dmesg in comment #1 shows "ACPI: (supports S0 S3 S5)",
so the BIOS claims to support S3, and that may be the default suspend state.

You can tell by running the upstream kernel and looking at
/sys/power/mem_sleep.  For an S0ix box, it will default to "s2idle",
and for an S3 box, it will be set to "deep".

Re: Original Description

I agree, the BIOS SETUP impact on MSR 0xE2
does not seem to match the table 2-12 in the SDM.
Further turbostat.c doesn't match either:

int bxt_pkg_cstate_limits[16] = {PCL__0, PCL__2, PCLUNL, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV};
Comment 13 Len Brown 2018-08-15 04:14:31 UTC
Re: Original Description

The SDM matches the APL EDS for the Package C-state limits, which are in bits 3:0.

The BIOS SETUP appears to be erroneous.  If it is claiming to limit *package* c-states, then you are right, it is programming the wrong nibble.  It is changing 7:4 when it should be changing 3:0.  Also, the values used don't quite match the SDM.

This particular hardware did have a core-cstate limit in 7:4.  I don't know if that feature ever shipped, but the values are totally different, and so it seems erroneous that the BIOS is changing 7:4 based on BIOS SETUP.

But I'm puzzled by the value of 2 in 3:0.  turbostat decodes that as unlimited, and you are observing actual pkg%c6, but the SDM table suggests the value 2 should limit the package to PC3.
Comment 14 Matt Turner 2018-08-15 06:41:54 UTC
(In reply to Len Brown from comment #12)
> Re: comment #11
> 
> I meant to type SATA ALPM (not PCI ASPM)
> Please enable SATA ALPM and run latest turbostat to show you can get into
> PC7,
> the deepest state the OEM claims to support on this platform.
>
> # echo min_power > /sys/class/scsi_host/*/link_power_management_policy

In doing that with powertop I have occasionally seen PC7 residency (50% PC6/50% PC7), but not in any predictable manner. I've never gotten turbostat to report PC7 residency. I'm also somewhat confused where these tools are reading PC7 residency from, since the SDM says there are only MSRs for PC2/3/6/10 on BXT.

> or some such should be able to do it at run-time.
> 
> Re: comment #8
> 
> yes, PC10 is required to reach S0ix.
> But sounds like the OEM supports only to PC7.

As an aside, is it common for an OEM to limit the maximum C states like that?
 
> Further, dmesg in comment #1 shows "ACPI: (supports S0 S3 S5)",
> so the BIOS claims to support S3, and that may be the default suspend state.
> 
> You can tell by running the upstream kernel and looking at
> /sys/power/mem_sleep.  For an S0ix box, it will default to "s2idle",
> and for an S3 box, it will be set to "deep".

Thanks. That's very good to know.

mattst88@asrock-j3455m:~$ sudo cat /sys/power/mem_sleep
s2idle [deep]

So indeed the board does not support S0ix.

> Re: Original Description
> 
> I agree, the BIOS SETUP impact on MSR 0xE2
> does not seem to match the table 2-12 in the SDM.
> Further turbostat.c doesn't match either:
> 
> int bxt_pkg_cstate_limits[16] = {PCL__0, PCL__2, PCLUNL, PCLRSV, PCLRSV,
> PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV,
> PCLRSV, PCLRSV};

I'd glad to get some confirmation that I'm not grossly confused :)

(In reply to Len Brown from comment #13)
> Re: Original Description
> 
> The SDM matches the APL EDS for the Package C-state limits, which are in
> bits 3:0.
> 
> The BIOS SETUP appears to be erroneous.  If it is claiming to limit
> *package* c-states, then you are right, it is programming the wrong nibble. 
> It is changing 7:4 when it should be changing 3:0.  Also, the values used
> don't quite match the SDM.
> 
> This particular hardware did have a core-cstate limit in 7:4.  I don't know
> if that feature ever shipped, but the values are totally different, and so
> it seems erroneous that the BIOS is changing 7:4 based on BIOS SETUP.
> 
> But I'm puzzled by the value of 2 in 3:0.  turbostat decodes that as
> unlimited, and you are observing actual pkg%c6, but the SDM table suggests
> the value 2 should limit the package to PC3.

Agreed. That's where my theory hits a snag. Could the demotion/undemotion bits cause the CPU to enter PC6 when the deepest allowed state is programmed to be PC3? IIRC demotion/undemotion (can't remember which does what) allows the CPU to pick PC6 instead of PC3 under some conditions. Not sure if that's a possible explanation.

I really wish the lock bit wasn't set so I could actually test my theory. Why do we want to lock that register anyway? Maybe it's time to get out the SOIC-8 programmer and disassemble the BIOS.
Comment 15 Rajneesh Bhardwaj 2018-08-24 09:57:32 UTC
Looks like on your device the IPC1 interface is also disabled, please check with BIOS vendor to enable that. Without this interface, Telemetry driver can not work and hence hinders debug.
Comment 16 Zhang Rui 2018-09-30 06:20:24 UTC
ping...
Comment 17 Matt Turner 2018-10-06 17:22:36 UTC
We determined that the BIOS does not have the ability to enable IPC1, so the telemetry interface cannot be enabled. Without that, I think I understood from Raj that there wasn't much else that could be done.

I guess feel free to close the bug.