Bug 210741 - intel_pstate and intel_idle: HWP and C1E are incompatible.
Summary: intel_pstate and intel_idle: HWP and C1E are incompatible.
Status: NEW
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_pstate
Hardware: All Linux
Importance: P1 normal
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-12-17 01:13 UTC by Doug Smythies
Modified: 2021-03-29 22:56 UTC
3 users

See Also:
Kernel Version: all that have been tested (5.2 - 5.10-rc7)
Tree: Mainline
Regression: No


Attachments
Graph of load sweep up and down at 347 Hertz. (85.03 KB, image/png)
2020-12-17 01:13 UTC, Doug Smythies
Graph of an area of concern breaking down. (151.34 KB, image/png)
2020-12-17 01:23 UTC, Doug Smythies
Graph of overruns from the same experiment as the previous post (232.44 KB, image/png)
2020-12-17 01:29 UTC, Doug Smythies
inverse impulse test - short short sleep exit response (47.09 KB, image/png)
2020-12-17 01:47 UTC, Doug Smythies
inverse impulse response - multiple (like 1000) bad exits (18.77 KB, image/png)
2020-12-17 01:55 UTC, Doug Smythies
inverse impulse response - multiple (like 1000) bad exits - detail A (55.29 KB, image/png)
2020-12-17 02:01 UTC, Doug Smythies
inverse impulse response - multiple (like 1000) bad exits - detail B (62.45 KB, image/png)
2020-12-17 03:59 UTC, Doug Smythies
inverse impulse response - multiple (like 1000) bad exits - detail C (37.35 KB, image/png)
2020-12-17 04:02 UTC, Doug Smythies
inverse impulse response - i5-6200u multi all bad (13.10 KB, image/png)
2020-12-17 06:25 UTC, Doug Smythies
Just an example of inverse impulse versus some different EPPs (14.15 KB, image/png)
2020-12-17 17:18 UTC, Doug Smythies
Graph of load sweep at 200 Hertz for various idle states (38.04 KB, image/png)
2020-12-21 22:58 UTC, Doug Smythies
step function system response - overview (17.67 KB, image/png)
2020-12-29 19:35 UTC, Doug Smythies
step function system response - detail A (17.46 KB, image/png)
2020-12-29 19:36 UTC, Doug Smythies
step function system response - detail B (11.36 KB, image/png)
2020-12-29 19:37 UTC, Doug Smythies
step function system response - detail B-1 (8.48 KB, image/png)
2020-12-29 19:39 UTC, Doug Smythies
step function system response - detail B-2 (17.22 KB, image/png)
2020-12-29 19:40 UTC, Doug Smythies
step function system response - idle state 2 disabled (37.08 KB, image/png)
2021-01-02 17:23 UTC, Doug Smythies
a set of tools for an automated test (13.41 KB, application/x-tar)
2021-01-16 22:29 UTC, Doug Smythies
an example run of the quick test tools (8.18 KB, text/plain)
2021-01-16 22:39 UTC, Doug Smythies
An example idle trace capture of the issue (11.82 KB, image/png)
2021-02-09 07:39 UTC, Doug Smythies
Just for reference, a good example of some idle trace data (9.30 KB, image/png)
2021-02-09 07:54 UTC, Doug Smythies
graph of inverse impulse response measured versus theoretical failure probabilities. (26.52 KB, image/png)
2021-02-09 15:37 UTC, Doug Smythies
forgot to label my axes on the previous post (27.55 KB, image/png)
2021-02-09 15:42 UTC, Doug Smythies
take one point, 4500 uSec from the previous graph and add a couple of other configurations (24.29 KB, image/png)
2021-02-10 23:14 UTC, Doug Smythies
changing the MWAIT definition of C1E fixes the problem (729 bytes, text/plain)
2021-02-28 22:20 UTC, Doug Smythies
wult statistics for c1,c1e for stock and mwait modified kernels (32.87 KB, image/png)
2021-03-13 00:20 UTC, Doug Smythies
graph of wult test results (15.14 KB, image/png)
2021-03-13 00:26 UTC, Doug Smythies
wult statistics for c1,c1e for stock and mwait modified kernels - version 2 (32.60 KB, image/png)
2021-03-14 20:23 UTC, Doug Smythies
wult graph c1e for stock and mwait modified kernels - version 2 (15.37 KB, image/png)
2021-03-14 21:04 UTC, Doug Smythies

Description Doug Smythies 2020-12-17 01:13:11 UTC
Created attachment 294171 [details]
Graph of load sweep up and down at 347 Hertz.

Consider a steady state periodic single threaded workflow, with a work/sleep frequency of 347 Hertz and a load somewhere in the ~75% range at the steady state operating point.
For the intel-cpufreq CPU frequency scaling driver and powersave governor and hwp disabled, it goes indefinitely without any issues.
For the acpi-cpufreq CPU frequency scaling driver and ondemand governor, it goes indefinitely without any issues.
For the intel-cpufreq CPU frequency scaling driver and powersave governor and hwp enabled, it suffers from overruns.

Why?

For unknown reasons, HWP seems to incorrectly decide that the processor is idle and spins the PLL down to a very low frequency. Upon exit from the sleep portion of the periodic workflow, it takes a very long time to recover, on the order of 20 milliseconds (supporting data for that statement will be added in a later posting), resulting in the periodic job not being able to complete its work before the next interval, whereas it normally has plenty of time to do its work. Actually, typical worst case overruns are around 12 milliseconds, or several work/sleep periods (i.e. it takes a very long time to catch up).

The probability of this occurring is about 3%, but varies significantly. Obviously, the recovery time is also a function of EPP, but mostly this work has been done with the default EPP of 128. I believe this to be a sampling and anti-aliasing issue, but cannot prove it because HWP is a black box. My best GUESS is:

If the periodic load is busy on a jiffy boundary, such that the tick is on.
Then if it is sleeping at the next jiffy boundary, with a pending wake such that idle state 2 was used.
  Then if the rest of the system was idle such that HWP decides to spin down the PLL.
    Then it is highly probable that upon that idle state 2 exit, the PLL is too slow to ramp up and the task will overrun as a result.
Else everything will be fine.

For a 1000 Hz kernel the above suggests that a work/sleep frequency of 500 Hz should behave in a binary way, either lots of overruns or none. It does.
For a 1000 Hz kernel the above suggests that a work/sleep frequency of 333.333 Hz should behave in a binary way, either lots of overruns or none. It does.
Note: in all cases the sleep time has to be within the window of opportunity.

Now, I cannot actually prove whether the idle state 2 part is a cause or a consequence, but the issue never occurs with idle state 2 disabled, albeit at a significant power cost.

Another way this issue can manifest is as a seemingly extraordinary idle state exit latency, though it would be rather difficult to isolate as the cause.

processors tested:
Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz (mine)
Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz (not mine)

HWP has been around for years, why am I just reporting this now?

I never owned an HWP capable processor before. My older i7-2600K based test computer was getting a little old, so I built a new test computer. I noticed this issue the same day I first enabled HWP. That was months ago (notice the dates on the graphs that will eventually be added to this), and I tried, repeatedly, to get help from Intel via the linux-pm e-mail list.

Now, given the above system response issue, a new test was developed to focus specifically on this issue, dubbed the "Inverse Impulse Response" test. It examines in great detail the CPU frequency rise time after a brief, less than 1 millisecond, gap in an otherwise continuous workflow. I'll attach graphs and details in subsequent postings to this bug report.

While I believe this is an issue entirely within HWP, I have not been able to prove that there was nothing sent from the kernel somehow telling HWP to spin down.

Notes:

CPU affinity does not need to be forced, but sometimes is for data acquisition.

1000 hertz kernels were tested back to kernel 5.2, all failed.

Kernel 5.10-rc7 (I have yet to compile 5.10) also fails.

A 250 hertz kernel was tested, and it did not have this issue in this area. Perhaps elsewhere, I didn't look.

Both teo and menu idle governors were tested, and while both suffer from the unexpected CPU frequency drop, teo seems much worse. However failure points for both governors are repeatable.

The test computers were always checked for any throttling log sticky bits, and regardless were never anywhere even close to throttling.

Note, however, that every HWP capable computer I was able to acquire data from has at least one of those sticky bits set after boot, so they need to be reset before any test that might want to examine them afterwards.
Comment 1 Doug Smythies 2020-12-17 01:15:34 UTC
Intel: Kristen hasn't been the maintainer for years. Please update the auto-assignment.
Comment 2 Doug Smythies 2020-12-17 01:23:05 UTC
Created attachment 294173 [details]
Graph of an area of concern breaking down.

An experiment was done looking around the area initially found, at a 347 hertz work/sleep frequency of the periodic workflow and its load.
Comment 3 Doug Smythies 2020-12-17 01:29:27 UTC
Created attachment 294175 [details]
Graph of overruns from the same experiment as the previous post

There should not be overruns. (Sometimes there are 1 or 2 from initial start up.)
Comment 4 Doug Smythies 2020-12-17 01:47:08 UTC
Created attachment 294177 [details]
inverse impulse test - short short sleep exit response

Good and bad inverse impulse response exits all on one graph.

The graph mentions 5 milliseconds a lot. At that time I did not know that the frequency step times are a function of EPP. I have since mapped the entire EPP space, getting:

0 <= EPP <= 1 : unable to measure.
2 <= EPP <= 39 : 2 milliseconds between frequency steps
40 <= EPP <= 55 : 3 milliseconds between frequency steps
56 <= EPP <= 79 : 4 milliseconds between frequency steps
80 <= EPP <= 133 : 5 milliseconds between frequency steps
134 <= EPP <= 143 : 6 milliseconds between frequency steps
144 <= EPP <= 154 : 7 milliseconds between frequency steps
155 <= EPP <= 175 : 8 milliseconds between frequency steps
176 <= EPP <= 255 : 9 milliseconds between frequency steps
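For convenience, the measured mapping above can be encoded as a small lookup helper (hypothetical name `epp_step_ms`; the break points are the ones measured on the i5-9600K above, with 0 standing for "unable to measure"):

```c
/* Milliseconds between HWP frequency steps as a function of EPP,
 * per the measurements in this report (i5-9600K). Returns 0 for
 * the unmeasurable 0..1 range, -1 for an out-of-range EPP. */
int epp_step_ms(int epp)
{
	if (epp < 0 || epp > 255)
		return -1;	/* EPP is an 8-bit field */
	if (epp <= 1)
		return 0;	/* unable to measure */
	if (epp <= 39)
		return 2;
	if (epp <= 55)
		return 3;
	if (epp <= 79)
		return 4;
	if (epp <= 133)
		return 5;	/* default EPP of 128 lands here */
	if (epp <= 143)
		return 6;
	if (epp <= 154)
		return 7;
	if (epp <= 175)
		return 8;
	return 9;
}
```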
Comment 5 Doug Smythies 2020-12-17 01:55:16 UTC
Created attachment 294179 [details]
inverse impulse response - multiple (like 1000) bad exits

By capturing a great many bad exits, one can begin to observe the width of the timing race window (which I already knew from other work, but have not yet written down herein). The next few attachments will drill down into some details of this same data.
Comment 6 Doug Smythies 2020-12-17 02:01:56 UTC
Created attachment 294181 [details]
inverse impulse response - multiple (like 1000) bad exits - detail A


Just a zoomed-in graph of an area of interest, so I could verify that the window size was the same (close enough) as what I asked for. The important point is that the window is always exactly around the frequency step point.

Now, we already know that the frequency step points are an HWP thing, so this data supports the argument that HWP is doing this on its own.
Comment 7 Doug Smythies 2020-12-17 03:59:47 UTC
Created attachment 294185 [details]
inverse impulse response - multiple (like 1000) bad exits - detail B
Comment 8 Doug Smythies 2020-12-17 04:02:56 UTC
Created attachment 294187 [details]
inverse impulse response - multiple (like 1000) bad exits - detail C

The previous attachment and this one are details B and C, zoomed-in looks at another two spots, again calculating the window width.
Comment 9 Doug Smythies 2020-12-17 06:25:20 UTC
Created attachment 294189 [details]
inverse impulse response - i5-6200u multi all bad

This is the other computer. There are also detail graphs, if needed.
Comment 10 Doug Smythies 2020-12-17 17:18:05 UTC
Created attachment 294201 [details]
Just an example of inverse impulse versus some different EPPs

see also:

https://marc.info/?l=linux-pm&m=159354421400342&w=2

and on that old thread, I just added a link to this.
Comment 11 Doug Smythies 2020-12-19 16:56:45 UTC
> A 250 hertz kernel was tested, and it did not have this
> issue in this area. Perhaps elsewhere, I didn't look.

Correction: same thing for 250 Hertz kernel.

Some summary data for the periodic workflow manifestation of the issue. 347 hertz work/sleep frequency, fixed packet of work to do per cycle, 5 minutes, kernel 5.10, both 1000 Hz and 250 Hz, teo and menu idle governors, idle state 2 enabled and disabled.

1000 Hz, teo, idle state 2 enabled:
overruns 28399
maximum catch up 13334 uSec
Ave. work percent: 76.767
Power: ~14.5 watts

1000 Hz, menu, idle state 2 enabled:
overruns 835
maximum catch up 10934 uSec
Ave. work percent: 68.106
Power: ~16.3 watts

1000 Hz, teo, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 67.453
Power: ~16.8 watts (+2.3 watts)

1000 Hz, menu, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 67.849
Power: ~16.4 watts (and yes the 0.1 diff is relevant)

250 Hz, teo, idle state 2 enabled:
overruns 193
maximum catch up 10768 uSec
Ave. work percent: 68.618
Power: ~16.1 watts

250 Hz, menu, idle state 2 enabled:
overruns 22
maximum catch up 10818 uSec
Ave. work percent: 68.607
Power: ~16.1 watts

250 Hz, teo, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 68.550
Power: ~16.1 watts

250 Hz, menu, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 68.586
Power: ~16.1 watts

So, the reason I missed this on the 250 hertz kernel in my earlier work was that the probability was so much less. The probability is less because the operating point is so different between the teo and menu governors and between the 1000 and 250 Hz kernels, i.e. there is much more spin down margin for the menu case.

The operating point difference between the 250 Hz and 1000 Hz kernels for teo is worth a deeper look.
Comment 12 Doug Smythies 2020-12-20 17:02:18 UTC
Additionally, and all other things being equal, the use of idle state 2 is dramatically different between the 1000 Hz (0.66%) and 250 Hz (0.03%) kernels, resulting in differing probabilities of hitting the timing window while in idle state 2.

HWP does not work correctly in these scenarios.
Comment 13 Doug Smythies 2020-12-21 22:58:31 UTC
Created attachment 294275 [details]
Graph of load sweep at 200 Hertz for various idle states

> Now, actually I can not prove if the idle state 2 part
> is a cause or consequence, but it never happens with it
> disabled, but at the cost of significant power.

idle state 2, combined with the timing window, which is much much larger than previously known, is the cause.

The CPU load is increased to max, then decreased. As a side note, there is a staggering amount of hysteresis and very long time constants involved here.

If one just sits and watches turbostat with the system supposedly in steady state operation, HWP can be observed very gradually (10s of seconds) deciding that it can reduce the CPU frequency, thus saving power. Then it has one of these false frequency drops, HWP struggles to catch up, raising the CPU frequency as it does so, and the cycle repeats.
Comment 14 Doug Smythies 2020-12-29 19:35:07 UTC
Created attachment 294399 [details]
step function system response - overview

1514 step function tests were done.
The system response was monitored each time.
For 93% of the tests, the system response was as expected.
(Do not confuse "as expected" with "ideal" or "best".)
For 7% of the tests the system response was not as expected, being much, much too slow and taking far too long thereafter to completely come up to speed.

Note: The y-axis of these graphs is now "gap-time" instead of CPU frequency. This was not done to confuse the reader; the reverse calculation back to frequency was omitted on purpose. It is preferable to observe the data in units of time, without introducing frequency errors due to ISR and other latency gaps. Approximate CPU frequency conversions have been added.

While I will post about 5 graphs for this experiment, I have hundreds, and have done many different EPPs, and on and on...
Comment 15 Doug Smythies 2020-12-29 19:36:10 UTC
Created attachment 294401 [details]
step function system response - detail A
Comment 16 Doug Smythies 2020-12-29 19:37:26 UTC
Created attachment 294403 [details]
step function system response - detail B
Comment 17 Doug Smythies 2020-12-29 19:39:18 UTC
Created attachment 294405 [details]
step function system response - detail B-1
Comment 18 Doug Smythies 2020-12-29 19:40:49 UTC
Created attachment 294407 [details]
step function system response - detail B-2
Comment 19 Doug Smythies 2021-01-02 17:23:02 UTC
Created attachment 294469 [details]
step function system response - idle state 2 disabled

1552 test runs with idle state 2 disabled, no failures.
Comment 20 Doug Smythies 2021-01-16 22:29:58 UTC
Created attachment 294685 [details]
a set of tools for an automated test

At this point, I have provided 3 different methods that reveal the same HWP issue. Herein, tools are provided to perform an automated quick test to answer the question "does my processor have this HWP issue?"

The motivation for this automation is to make it easier to test other HWP capable Intel processors. Until now the other methods for manifesting the issue have required "tweaking", and have probabilities of occurrence even lower than 0.01%, requiring unbearably long testing times (many hours) in order to acquire enough data to be statistically valid. Typically, this test provides PASS/FAIL results in about 5 minutes.

The test changes idle state enabled/disabled status, requiring root rights to do so. The scale for the fixed-workpacket periodic workflow is both arbitrary and different between processors. The test runs in two steps: the first finds the operating point for the test (i.e. it does the "tweaking" automatically); the second does the actual tests, one without idle state 2 and one with only idle state 2 (recall that the issue is linked with the use of idle state 2). Forcing idle state 2 greatly increases the probability of the issue occurring. While this test has been created specifically for the intel_pstate CPU frequency scaling driver with HWP enabled and the powersave governor, it doesn't check for that configuration. Therefore one way to test the test is to try it with HWP disabled.

Note: the subject test computer must be able to run one CPU at 100% without needing to throttle (power or thermal or any other reason), including with only idle state 2 enabled.

Results so far: 3 of 3 processors FAIL; i5-9600k; i5-6200U; i7-10610U.

use this command:

./job-control-periodic 347 6 6 900 10

Legend:
347 hertz work/sleep frequency
6 seconds per iteration run.
6 seconds per test run.
try for approximately 900 uSec average sleep time.
10 test loops at that 6 seconds per test.

the test will take about 5 minutes.
Comment 21 Doug Smythies 2021-01-16 22:39:25 UTC
Created attachment 294687 [details]
an example run of the quick test tools

the example contains results for:
HWP disabled: PASS (as expected)
HWP enabled: FAIL (as expected)

but tests were also done with a 250 Hertz kernel, turbo disabled, EEO and RHO bits changed... all give FAIL for HWP enabled forcing idle state 2, and PASS for other conditions.
Comment 22 Doug Smythies 2021-02-09 00:29:13 UTC
some other results for the quick test:

i5-9600k (Doug): FAIL. (Ubuntu 20.04; kernel any)
i5-6200U (Alin): FAIL. (Debian.)
i7-7700HQ (Gunnar): FAIL (Ubuntu 20.10)
i7-10610U (Russell) : FAIL. (CentOS (RedHat 8), 4.18.0-240.10.1.el8_3.x86_64 #1 SMP).
another Skylake(Rick) still waiting to hear back.

So, 4 out of 4 so far (and I gave them no guidance at all, on purpose, as to any particular kernel to try).

I have been picking away at this thread (pun intended) for months, and I think it is finally starting to unravel. Somewhere above I said:

> For unknown reasons, HWP seems to incorrectly decide
> that the processor is idle and spins the PLL down to
> a very low frequency.

I now believe it to be something inside the processor, but maybe not part of HWP. I think that non-HWP processors, or ones with it disabled, also misdiagnose that the entire processor is idle. My evidence is neither very thorough nor currently in a presentable form, but this issue only ever occurs some short time or immediately after every core has been idle, with at least one in idle state 2. The huge difference between HWP and OS driven pstates is that the OS knows the system wasn't actually idle and HWP doesn't. Even though package C1E is disabled, the behaviour is, perhaps, similar to it being enabled.

There is some small timing window where this really screws up. Mostly it works fine, and either the CPU frequency doesn't ramp down at all, or it recovers quickly, within about 120 uSec.

And as far as I know, it exits the idle state O.K. but it takes an incredibly long time for HWP to ramp up the CPU frequency again. Meanwhile, any non-HWP approach doesn't drop the pstate request to minimum nor re-start any sluggish ramp up.

Now, this issue is rare and would be extremely difficult to diagnose, appearing as occasional glitches: a frame rate drop in a game, dropped data, unbelievably long latency if any kind of performance is required. I consider this issue to be of the utmost importance.
Comment 23 Doug Smythies 2021-02-09 07:39:46 UTC
Created attachment 295137 [details]
An example idle trace capture of the issue

These are very difficult to find.
Comment 24 Doug Smythies 2021-02-09 07:54:28 UTC
Created attachment 295139 [details]
Just for reference, a good example of some idle trace data
Comment 25 Doug Smythies 2021-02-09 15:37:42 UTC
Created attachment 295155 [details]
graph of inverse impulse response measured versus theoretical failure probabilities.

As recently as late yesterday, I was still attempting to refine the gap time definition from comment #1. Through this entire process, I just assumed the processor would require at least 2 samples before deciding the entire system was idle. Why? Because it was beyond my comprehension that it would be based on one instant in time. Well, that was wrong, and it is actually based on one sample only, at the HWP loop time (see attachment #294201 [details]), if idle state 2 is involved.

Oh, only idle state 2 was enabled for this. The reason I could not originally refine the gap definition was that I did not yet know enough. I have to force idle state 2 to increase the failure probabilities enough to find these limits without tests that would otherwise have run for days.
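Under the one-sample theory above, the measured failure probabilities should follow simple arithmetic: if HWP checks package idleness at a single instant every loop period, then a randomly placed gap covers that instant with probability gap/period. A sketch of that model (an assumption for illustration only; HWP internals are undocumented, and at the default EPP of 128 the loop period would be the 5 millisecond step time mapped earlier):

```c
/* Probability that a gap of gap_us microseconds, placed at random,
 * covers a single sampling instant taken every loop_us microseconds.
 * Pure illustration of the one-sample theory; HWP is a black box. */
double fail_probability(double gap_us, double loop_us)
{
	if (gap_us >= loop_us)
		return 1.0;	/* a gap longer than the loop always covers a sample */
	return gap_us / loop_us;
}
```

For example, a 4500 uSec gap against a 5000 uSec loop would predict a roughly 90% chance of covering the sample instant.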
Comment 26 Doug Smythies 2021-02-09 15:42:01 UTC
Created attachment 295159 [details]
forgot to label my axes on the previous post
Comment 27 Doug Smythies 2021-02-10 23:14:28 UTC
Created attachment 295211 [details]
take one point, 4500 uSec from the previous graph and add a couple of other configurations

Observe that the recovery time, which does not include the actual idle state exit latency, just the extra time needed to get to an adequate CPU frequency, is on average 87 times slower for HWP versus no-HWP, and 44 times slower than passive/ondemand/no-HWP.

Yes, there are a few interesting spikes on the passive/ondemand/no-HWP graph, but those things we can debug relatively easily (which I will not do).
Comment 28 Doug Smythies 2021-02-28 22:20:05 UTC
Created attachment 295533 [details]
changing the MWAIT definition of C1E fixes the problem

I only changed the one definition relevant to my test computer. The documentation on these bits is rather scant. Other potential fixes include getting rid of Idle state 2 (C1E) altogether. Or booting with it disabled: "intel_idle.states_off=4".

I observe that Rui fixed the "assigned" field. Thanks, not that it helps, as Srinivas has been aware of this for over half a year.
Comment 29 Srinivas Pandruvada 2021-03-01 06:26:31 UTC
I tried to reproduce with your scripts on CFL-S systems almost half a year back and didn't observe the same. Systems can be configured in different ways, which impacts the HWP algorithm. So it is possible that my lab system is configured differently.
Comment 30 Doug Smythies 2021-03-02 15:37:47 UTC
(In reply to Srinivas Pandruvada from comment #29)
> I tried to reproduce with your scripts on CFL-S systems almost half a year
> back and didn't observe the same. Systems can be configured in different
> ways, which impacts the HWP algorithm. So it is possible that my lab system
> is configured differently.

By "CFL-S" I assume you mean "Coffee Lake".

I wish you had reported back to me your findings, as we could have figured out the difference.

Anyway, try the automated quick test I posted in comment 20. Keep in mind that it needs to be HWP enabled, active, powersave, default epp=128. It is on purpose that the tool does not check for this configuration.
Comment 31 Doug Smythies 2021-03-04 15:43:26 UTC
(In reply to Doug Smythies from comment #28)
> Created attachment 295533 [details]
> changing the MWAIT definition of C1E fixes the problem

Conversely, I have tried to determine if other idle states can be broken by setting the least significant bit of their MWAIT hint.

I did idle state 3, C3, and could not detect any change in system response.

I did idle state 5, C7s, which already had the bit set, along with bit 1, so I set bit 1 to 0:

  .name = "C7s",
- .desc = "MWAIT 0x33",
- .flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED,
+ .desc = "MWAIT 0x31",
+ .flags = MWAIT2flg(0x31) | CPUIDLE_FLAG_TLB_FLUSHED,
  .exit_latency = 124,
  .target_residency = 800,
  .enter = &intel_idle,

I could not detect any change in system response.

I am also unable to detect any difference in system response between idle state 1, C1, and idle state 2, C1E, with this change. I do not know if the change merely makes idle state 2 = idle state 1.
Comment 32 Doug Smythies 2021-03-13 00:20:57 UTC
Created attachment 295827 [details]
wult statistics for c1,c1e for stock and mwait modified kernels

Attempting to measure exit latency using Artem Bityutskiy's wult tool, tdt method.
Kernel 5.12-rc2, stock and with the MWAIT change from 0x01 to 0x03.
Statistics.
Comment 33 Doug Smythies 2021-03-13 00:26:23 UTC
Created attachment 295829 [details]
graph of wult test results

graph of wult tdt method results.
If an I210 based NIC can be sourced, it will be tried, should pre-wake need to be eliminated. I do not know if it is needed or not.
Comment 34 Srinivas Pandruvada 2021-03-14 18:57:27 UTC
(In reply to Doug Smythies from comment #30)
> (In reply to Srinivas Pandruvada from comment #29)
> > I tried to reproduce with your scripts on CFL-S systems almost half a year
> > back and didn't observe the same. Systems can be configured in different
> > ways, which impacts the HWP algorithm. So it is possible that my lab system
> > is configured differently.
> 
> By "CFL-S" I assume you mean "Coffee Lake".
Yes, desktop part.

> 
> I wish you had reported back to me your findings, as we could have figured
> out the difference.
> 
I thought I had responded to you; I will have to search my emails. I had to specially arrange a system, but it had a 200 MHz higher turbo. You did share your scripts at that time.

These algorithms are tuned on a particular system, so small variations can have a bigger impact.

Let's see if Chen Yu has a system the same as yours.

> Anyway, try the automated quick test I posted in comment 20. Keep in mind
> that it needs to be HWP enabled, active, powersave, default epp=128. It is
> on purpose that the tool does not check for this configuration.
Comment 35 Doug Smythies 2021-03-14 20:23:59 UTC
Created attachment 295853 [details]
wult statistics for c1,c1e for stock and mwait modified kernels - version 2

Artem advised that I lock the CPU frequencies at some high value, in order to show some difference. Frequencies were locked at 4.6 GHz for this attempt.
Comment 36 Doug Smythies 2021-03-14 21:04:22 UTC
Created attachment 295855 [details]
wult graph c1e for stock and mwait modified kernels - version 2

As Artem advised, with locked CPU frequencies.

Other data (kernel 5.12-rc2):

Phoronix dbench 1.0.2 0 client count 1:

stock: 264.8 MB/S
stock, idle state 2 disabled: 311.3 MB/S (+18%)
stock, HWP boost: 417.9 MB/S (+58%)
stock, idle state 2 disabled & HWP boost: 434.3 MB/S (+64%)
stock, performance governor: 420 MB/S (+59%)
stock, performance governor & is2 disabled: 435 MB/S (+64%)

inverse impulse response, 847 uSec gap:
stock: 2302 tests 38 fails, 98.35% pass rate.
+ MWAIT change: 1072 tests, 0 fails, 100% pass rate.

@Srinivas: The whole point of the quick test stuff is that it self adjusts to the system under test.
Comment 37 Doug Smythies 2021-03-20 21:07:12 UTC
For this:
Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
The quicktest gives indeterminate results.
However, it is also not using any idle state that involves the least significant bit of MWAIT being set.

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH MWAIT 0x30
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH MWAIT 0x60

If there is a way to make idle work like all previous ways, i.e.:

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x01
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state4/desc:MWAIT 0x20

I have not been able to figure out how.
Comment 38 Doug Smythies 2021-03-24 04:30:20 UTC
I found this thread:
https://patchwork.kernel.org/project/linux-pm/patch/20200826120421.44356-1-guilhem@barpilot.io/

And somehow figured out that an i5-10600K is COMETLAKE, and so did the same as that link:

doug@s19:~/temp-k-git/linux$ git diff
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3273360f30f7..770660d777c4 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1155,6 +1155,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
        X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L,          &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE,            &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X,           &idle_cpu_skx),
+       X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE,           &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X,           &idle_cpu_icx),
        X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL,        &idle_cpu_knl),
        X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM,        &idle_cpu_knl),

And got back the original types of idle states.
I do not want to beat up my NVMe drive with dbench, so I installed an old Intel SSD I had lying around:

Phoronix dbench 1.0.2 0 client count 1: (MB/S)
Intel_pstate HWP enabled, active powersave:

Kernel 5.12-rc2 stock:
All idle states enabled: 416.5
Only Idle State 0: 400.1
Only Idle State 1: 294.2
Only idle State 2: 401.6
Only idle State 3: 403.0

Kernel 5.12-rc2 patched as above:
All idle states enabled: 396.8
Only Idle State 0: 400.4
Only Idle State 1: 294.4
Only idle State 2: 245.9
Only idle State 3: 405.3
Only idle State 4: 402.8
quick test: FAIL.

Intel_pstate HWP disabled, active powersave:
Kernel 5.12-rc2 patched as above:
All idle states enabled: 340.0
Only Idle State 0: 399.5
Only Idle State 1: 358.5
Only idle State 2: 353.1
Only idle State 3: 346.9
Only idle State 4: 344.2
quick test: PASS.
Comment 39 Doug Smythies 2021-03-29 22:55:28 UTC
It is true that the quick test should at least check that idle state 2 is indeed C1E.
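A check of that sort could read the cpuidle state name from sysfs before running. A minimal sketch (hypothetical helper, not part of the posted tools; for this case the path would be /sys/devices/system/cpu/cpu0/cpuidle/state2/name):

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the cpuidle state name file at `path` contains exactly
 * "C1E", 0 otherwise (including when the state does not exist). */
int state_is_c1e(const char *path)
{
	char name[32] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;	/* no such state: cannot be C1E */
	if (fgets(name, sizeof(name), f))
		name[strcspn(name, "\n")] = '\0';	/* strip newline */
	fclose(f);
	return strcmp(name, "C1E") == 0;
}
```

On the i5-10600K above, state2/name reads "C2_ACPI", so this check would correctly skip the C1E-specific test.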

I ran the inverse impulse response test. Kernel 5.12-rc2. Processor i5-10600K. inverse gap 842 nSec:

1034 tests 0 fails.

with the patch as per comment 38 above, i.e. with C1E:

1000 tests 16 fails. 98.40% pass 1.60% fail.

I ran just the generic periodic test at 347 hertz and light load, stock kernel, i.e. no C1E:

HWP disabled: active/powersave:
doug@s19:~/freq-scalers$ /home/doug/c/consume 32.0 347 300 1
consume: 32.0 347 300  PID: 1280
 - fixed workpacket method:  Elapsed: 300000158  Now: 1617030857155911
Total sleep: 169222343
Overruns: 0  Max ovr: 0
Loops: 104094  Ave. work percent: 43.592582

HWP enabled: active/powersave:
doug@s19:~$ /home/doug/c/consume 32.0 347 300 1
consume: 32.0 347 300  PID: 1293
 - fixed workpacket method:  Elapsed: 300000654  Now: 1617031529268276
Total sleep: 171458395
Overruns: 725  Max ovr: 1449
Loops: 104094  Ave. work percent: 42.847326

The above was NOT due to CPU migration:

doug@s19:~$ taskset -c 10 /home/doug/c/consume 32.0 347 3600 1
consume: 32.0 347 3600  PID: 1341
 - fixed workpacket method:  Elapsed: 3600002498  Now: 1617036391455519
Total sleep: 2086618739
Overruns: 3189  Max ovr: 1864
Loops: 1249133  Ave. work percent: 42.038409

Conclusion: there is still something very minor going on even without C1E being involved.

Notes:

I think HWPBOOST was, at least partially, programming around the C1E issue.

Notwithstanding the ultimate rejection of the patch in the thread referenced in comment 38, I think other processors should be rolled back to the same state. I have never been able to measure any energy consumption or performance difference for all of those deep idle states on my i5-9600K processor.

Call me dense, but I only figured out yesterday that HWP is called "Speed Shift" in other literature and BIOS.

It does not make sense that we spent so much effort a few years ago to make sure that we did not dwell in shallow idle states for long periods, only to have HWP set the requested pstate to minimum upon C1E use, albeit under some other conditions. By definition the system is NOT actually idle; if it were, we would have asked for a deep idle state.
