Bug 199227

Summary: Idle loop ordering issue
Product: Power Management    Reporter: Rafael J. Wysocki (rjw)
Component: cpuidle    Assignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX
Severity: normal    CC: dsmythies, lenb, riel, srinivas.pandruvada, thomas.ilsche
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.16-rc6    Subsystem:
Regression: No    Bisected commit-id:
Attachments: BDW EX 330Watt baseline idle measurement result
BDW EX 300Watt idle-loop-v7.3 measurement result
idle-loop-v8, kernel build with "make -j1"
shows what I did to try to disable idle state 1
Package power versus time for kernel compiles with j1
test code to obey idle disabled states
power consumption under different workloads idle-loop-v9 vs unmodified 4.16.0
utilization of different idle states with the 500 us workload
average time spent for each idle phase per idle state (variance among cores)
utilization of different idle states with the 500 us workload
average time spent for each idle phase per idle state (variance among cores)
utilization of different idle states with the 800 us workload
100% load on one CPU graph
iperf test results.
Performance during 1 core pipe test
Old - Shows excessive power consumption with intel-cpufreq and schedutil
Processor package power graph - intel-cpufreq schedutil test
load sweep test/experiment package power and overruns
Phoronix unpack-linux test results.
updated load sweep power graph
phoronix build-apache test power graph
system idle test
Power graph for Phoronix himeno test

Description Rafael J. Wysocki 2018-03-27 16:50:50 UTC
The problem is that if we stop the sched tick in
tick_nohz_idle_enter() and then the idle governor predicts short idle
duration, we lose regardless of whether or not it is right.

If it is right, we've lost already, because we stopped the tick
unnecessarily.  If it is not right, we'll lose going forward, because
the idle state selected by the governor is going to be too shallow and
we'll draw too much power (that has been reported recently to actually
happen often enough for people to care).
Comment 1 Rafael J. Wysocki 2018-03-27 16:55:08 UTC
There is a patch series to improve the situation under active discussion.

The idea is to make the decision whether or not to stop the tick deeper
in the idle loop and in particular after running the idle state selection
in the path where the idle governor is invoked.  This way the problem
can be avoided, because the idle duration predicted by the idle governor
can be used to decide whether or not to stop the tick so that the tick
is only stopped if that value is large enough (and, consequently, the
idle state selected by the governor is deep enough).

See:
 https://marc.info/?t=152115254600008&r=2&w=2
 https://marc.info/?t=152156445000008&r=1&w=2
 https://marc.info/?t=152103659800003&r=2&w=2
Comment 3 Len Brown 2018-03-27 17:58:41 UTC
Created attachment 274961 [details]
BDW EX 330Watt baseline idle measurement result

Attached is a baseline idle power measurement of a Broadwell EX
(4-socket server) running Linux 4.16-rc6.

The Y axis shows Watts, as measured by a Yokogawa WT310 AC power meter.
The X axis is time, with 1-second granularity.
Each second is marked with a blue dot showing the power, as integrated
by the power meter over the previous second.

The green line is the simple software average of the blue dots.

The lowest blue dots are the "power floor" -- this machine never
goes below 300 Watts.  Some dots trend up to about 420 Watts,
and there is plenty of "noise" in the results.  Eyeballing the green line,
one might say "this machine idles at about 330 Watts".
Comment 4 Len Brown 2018-03-27 18:04:34 UTC
Created attachment 274963 [details]
BDW EX 300Watt idle-loop-v7.3 measurement result

In contrast to the baseline result in comment #3,
this graph shows the idle measurement result
for Linux-4.16-rc6 + idle-loop-v7.3 patch series.

All the blue dots and the green average line sit right on the "power floor"
of 300 Watts.  I.e., the machine now idles at 300 Watts, down 30 Watts from the 330 Watt baseline.
Comment 5 Rafael J. Wysocki 2018-03-27 22:12:22 UTC
New results from Thomas Ilsche:
 https://marc.info/?l=linux-pm&m=152218749113965&w=2
Comment 6 Doug Smythies 2018-03-28 06:56:23 UTC
My computer is an older i7-2700K (4 cores, 8 CPUs)

Using iperf on the test computer running:

$ iperf -c s10 -M 536 --time 1200

With another computer (S10) acting as the iperf server, I tested most of the kernel variants recently discussed. The most dramatic results I got (~19 minute averages):

Idle state 0 = enabled
Idle state 1 = disabled
Idle state 2 = enabled
Idle state 3 = enabled
Idle state 4 = enabled
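
For reference, a minimal sketch of how these states can be toggled via sysfs (assuming the usual cpuidle sysfs layout; the state index and CPU are illustrative, and root is needed):

#!/bin/sh
# Disable idle state 1 on all CPUs (write 0 instead of 1 to re-enable).
for f in /sys/devices/system/cpu/cpu*/cpuidle/state1/disable; do
    echo 1 > "$f"
done
# Show each state's name alongside its current disable flag on cpu0.
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name \
       /sys/devices/system/cpu/cpu0/cpuidle/state*/disable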

Kernel 4.16-rc6 - baseline reference kernel: 
Average processor package power: 30.87 Watts
Throughput: 139/140 Mbits/Sec (needs to be double checked)

Kernel 4.16-rc6 + 2 Poll idle changes :
Average processor package power: 21.30 Watts
Power saved over baseline: 31%
Throughput: 135 Mbits/Sec (3% worse)

Kernel 4.16-rc6 + poll + v7.3 of the idle loop
Average processor package power: 16.74 Watts
Power saved over baseline: 45.8%
Throughput: 139 Mbits/Sec (~~ within noise)

Now Idle State 1 set to enabled:

Kernel 4.16-rc6 + poll + v7.3 of the idle loop
Average processor package power: 17.9 Watts
Throughput: 140 Mbits/Sec (~~ within noise)

Now idle state 0 = disabled.

(Note: it doesn't actually seem to be disabled; I get an average of
0.30 seconds of idle state 0 time per minute with it disabled, but only
with this workflow. Other workflows show zero idle state 0 time.)

Average processor package power: 18.51 Watts
Throughput: 141 Mbits/Sec (1% better)
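
A rough way to check whether a nominally disabled state is still being entered is to watch its sysfs counters over an interval (a sketch; cpu0 and state0 are illustrative):

#!/bin/sh
# Sample idle state 0 entries and residency on cpu0 over one minute.
d=/sys/devices/system/cpu/cpu0/cpuidle/state0
u0=$(cat $d/usage); t0=$(cat $d/time)    # "time" is in microseconds
sleep 60
u1=$(cat $d/usage); t1=$(cat $d/time)
echo "state0: $((u1 - u0)) entries, $((t1 - t0)) us residency in 60 s"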
Comment 7 Doug Smythies 2018-03-28 07:09:02 UTC
(In reply to Len Brown from comment #3)

> Some dots trend up to about 420 Watts,

I also observe up to 40% extra power consumption for the first few minutes (~3 or 4) after boot. Tests are always started after a short settling time.
Comment 9 Rafael J. Wysocki 2018-03-29 18:27:25 UTC
Created attachment 275001 [details]
idle-loop-v8, kernel build with "make -j1"

Test of the idle-loop-v8 git branch on a quad-core desktop Ivy Bridge processor with HT (8 logical CPUs).

The workload is a "make -j1" kernel build.  There is a ~10% improvement in power and a slightly shorter execution time.
Comment 10 Doug Smythies 2018-03-29 20:16:38 UTC
Created attachment 275003 [details]
shows what I did to try to disable idle state 1

For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle state 0 under some workloads. Kernel 4.16-rc7 seems to fully disable idle state 0 for the same workloads.

The attached shows what I have tried already, and some supporting test results.
Comment 11 Rafael J. Wysocki 2018-03-29 21:44:27 UTC
(In reply to Doug Smythies from comment #10)
> For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle state
> 0 under some workloads. Kernel 4.16-rc7 seems to fully disable idle state 0
> for the same workloads.

That may be an effect of the idle-poll patches.  Is this reproducible with the idle-poll branch alone (on top of 4.16-rc7)?
Comment 12 Rafael J. Wysocki 2018-03-29 21:51:51 UTC
Also it may come from the "cpuidle: menu: Refine idle state
selection for running tick" patch which will use state 0 as
the ultimate fallback regardless of its disable/enable
status.
Comment 13 Doug Smythies 2018-03-30 00:37:01 UTC
Created attachment 275007 [details]
Package power Verses time for kernel compiles with j1

The compile was incremental, not clean, and memory was flushed in between tests.
The command:

time make -j1 olddefconfig bindeb-pkg LOCALVERSION=-test

run 1 with V8 = rejected, operator error.
run 2 with V8 = v8_j1_2 on the graph
run 3 with V8 = v8_j1_3 on the graph
run 1 with rc7 = baseline reference = rc7_j1_1 on the graph
run 2 with rc7 = baseline reference = rc7_j1_2 on the graph

The times were all within +/- a second of each other: 11 minutes 35 seconds.
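
For anyone reproducing this, package power over the course of a compile can be sampled alongside the build with turbostat (a sketch; the column selection assumes a reasonably recent turbostat with RAPL support):

#!/bin/sh
# Log per-second package power to a file while the build runs.
sudo turbostat --quiet --Summary --show PkgWatt --interval 1 > pkg_power.log &
TS=$!
time make -j1 olddefconfig bindeb-pkg LOCALVERSION=-test
kill $TS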
Comment 14 Doug Smythies 2018-03-30 02:50:06 UTC
(In reply to Rafael J. Wysocki from comment #11)
> (In reply to Doug Smythies from comment #10)
> > For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle state
> > 0 under some workloads. Kernel 4.16-rc7 seems to fully disable idle state 0
> > for the same workloads.
> 
> That may be an effect of the idle-poll patches.  Is this reproducible with
> the idle-poll branch alone (on top of 4.16-rc7)?

Kernel:
~/temp-k-git/linux$ git log --oneline
f719850 cpuidle: poll_state: Avoid invoking local_clock() too often
c43fdde cpuidle: poll_state: Add time limit to poll_idle()
3eb2ce8 Linux 4.16-rc7

Idle state disabling works properly with this kernel.
Comment 15 Doug Smythies 2018-03-30 02:54:19 UTC
(In reply to Rafael J. Wysocki from comment #12)
> Also it may come from the "cpuidle: menu: Refine idle state
> selection for running tick" patch which will use state 0 as
> the ultimate fallback regardless of its disable/enable
> status.

I tried to allow idle state 1 in that case, if it was enabled, but maybe I made a mistake. See attachment 275007 [details].
Comment 16 Doug Smythies 2018-03-30 07:24:52 UTC
Compared to kernel 4.16-rc7, the baseline reference:
Idle: noisy at ~~ 3.7 Watts

with idle-poll branch alone:
Idle: steady at 3.7 watts.
100% load on one CPU: 17% less power, 1.2% faster.

with idle-poll and V8:
Idle: steady at 3.7 watts.
100% load on one CPU: 17% less power, 1.7% faster.
Comment 17 Rafael J. Wysocki 2018-03-30 08:50:38 UTC
(In reply to Doug Smythies from comment #15)
> (In reply to Rafael J. Wysocki from comment #12)
> > Also it may come from the "cpuidle: menu: Refine idle state
> > selection for running tick" patch which will use state 0 as
> > the ultimate fallback regardless of its disable/enable
> > status.
> 
> I tried to allow idle State 1 in that case. if it was enabled, but maybe I
> made a mistake. see attachment 275007 [details].

Well, why don't you just revert commits ec418a799ee7 and 8be5f55d60e8 from the idle-loop-v8 branch and test that?  Then you will know for sure. :-)

The other commits in that branch don't even touch the state selection algo.
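
In other words, something like this on the idle-loop-v8 branch (a sketch; it assumes the branch is checked out and the reverts apply cleanly):

$ git checkout idle-loop-v8
$ git revert --no-edit ec418a799ee7 8be5f55d60e8
$ # then rebuild and re-run the same workload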
Comment 18 Doug Smythies 2018-04-01 03:18:47 UTC
Created attachment 275037 [details]
test code to obey idle disabled states

I have been running this test code on top of V8, and it seems to disable idle state 0 (or more states) if desired, for any workflow thrown at it so far.

I want to add some automation to my testing process; it will take several days or even a week.
Comment 19 Rafael J. Wysocki 2018-04-03 14:38:57 UTC
(In reply to Doug Smythies from comment #18)
> Created attachment 275037 [details]
> test code to obey idle disabled states

It will do that, but the resulting state may not satisfy all of the selection conditions.  That may not be a problem, though.
Comment 21 Thomas Ilsche 2018-04-06 07:56:31 UTC
Created attachment 275117 [details]
power consumption under different workloads idle-loop-v9 vs unmodified 4.16.0

Idle and the "classic Powernightmare trigger" look good with v9. There is a regression in power consumption for a workload where all cores sleep for 500 us in a loop.
Comment 22 Thomas Ilsche 2018-04-06 07:58:47 UTC
Created attachment 275119 [details]
utilization of different idle states with the 500 us workload

v9 chooses the correct C state. This confirms the fix of a previous issue with v7 and certain HR timers.
Comment 23 Thomas Ilsche 2018-04-06 08:09:08 UTC
Created attachment 275121 [details]
average time spent for each idle phase per idle state (variance among cores)

For the 500 us loop workload the governor doesn't go into the deepest sleep state, and it decides not to disable the sched tick. This causes more intermediate wakeups, hence the higher power consumption (note: 1000 Hz kernel).

I see two ways that would improve that behavior:

1) Disable the sched tick based on the knowledge that the upcoming timer gave us reliable information about the time to wake up (rather than an unreliable heuristic).

2) Don't just disable/enable the sched tick, but move it such that it will occur *after* the expected wake-up time, not before. This is what I meant previously with
> modify the next upcoming sched tick to be better suitable as fallback timer
Comment 24 Thomas Ilsche 2018-04-06 08:11:38 UTC
Created attachment 275123 [details]
utilization of different idle states with the 500 us workload
Comment 25 Thomas Ilsche 2018-04-06 08:12:35 UTC
Created attachment 275125 [details]
average time spent for each idle phase per idle state (variance among cores)

(sorry, uploaded the wrong chart previously)
Comment 26 Thomas Ilsche 2018-04-06 08:17:57 UTC
Created attachment 275127 [details]
utilization of different idle states with the 800 us workload

There is also a change in behavior for an 800 us sleep loop workload. With the patch, the system does not consistently go to C6, as opposed to 4.16.0. I'm not sure yet why.

Note: 800 us equals the C6 target residency on that system.
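
One way to see whether C6 is actually being entered during the workload is to watch the per-state residency counters (a sketch; the sysfs state index for C6 varies by machine, so match on the name first):

#!/bin/sh
# Find C6's sysfs index on cpu0 by name, then sample its residency.
# (intel_idle may name the state e.g. "C6" or "C6-BDW"; adjust the match.)
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    case "$(cat $d/name)" in C6*) c6=$d ;; esac
done
t0=$(cat $c6/time)                       # microseconds
sleep 10
t1=$(cat $c6/time)
echo "C6 residency on cpu0: $((t1 - t0)) us over 10 s"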
Comment 27 Rafael J. Wysocki 2018-04-06 08:30:57 UTC
Thanks for the information!

From all of the information available to me at this point, the changes in v9 overall are a clear improvement over what happens in 4.16.0 (especially as far as idle power is concerned) and the remaining issues can be dealt with on top of them.
Comment 28 Doug Smythies 2018-04-11 07:07:55 UTC
Created attachment 275287 [details]
100% load on one CPU graph

I'm going back to basics. This is the 100% load on one CPU test.
Legend:
"k416" unmodified kernel 4.16
"idle" kernel 4.16 + 10 V9 idle loop re-work patches
"poll" kernel 4.16 + 2 poll idle patches
"V9" Kernel 4.16 + 10 V9 idle loop re-work + 2 poll idle patches

Performance:
k416 = baseline reference
idle < 0.1% degradation
poll ~0.8% improvement
V9 ~0.55% improvement

Note: For only a 100 minute test, the difference between "poll" and "V9" performance improvement is within experimental error.

@Thomas: Since your posts a few days ago, I have been playing with my understanding of your load types: synchronized; staggered; and staggered while letting the system decide which CPU to use (I use 16 x NUM_CPU load threads, but at a lower period each). I have an overrun detector in one of the programs I use to create the load/sleep cycle. For the high frequency case (poller-omp 500), there are always a few overruns, even at very light load per cycle (with either kernel 4.16 or 4.16+V9).

For this:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_250_Hz_power.png
(I do not have the green kernel with "delta")

On my 1000 Hz kernel I also observe significant power differences between
poller-sync 3000 and poller-stagger 3000 (and in my case, with no taskset, I see a further increase in power used).
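
For concreteness, a crude shell approximation of a staggered sleep-loop load (a rough sketch only; shell fork and sleep overhead make the real period much coarser than a dedicated poller program such as poller-omp):

#!/bin/sh
# Launch N background loops, each sleeping ~500 us per cycle, with
# staggered start times.  Stop with Ctrl-C / kill.
N=8
i=0
while [ $i -lt $N ]; do
    ( sleep $(awk "BEGIN{print $i * 0.0001}")    # stagger the starts
      while :; do sleep 0.0005; done ) &
    i=$((i + 1))
done
wait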
Comment 29 Doug Smythies 2018-04-12 07:09:00 UTC
Created attachment 275333 [details]
iperf test results.

Method: be an iperf client to 3 servers at once. Packets are small on purpose; we want the highest frequency of packets, not the fastest payload delivery.
script used:

doug@s15:~/idle$ cat iperf_once
#!/bin/dash

iperf -c cyd-hp2 --parallel 1 -M 100 --time 6100 &
iperf -c rpi01 --parallel 1 -M 100 --time 6100 &
iperf -c doug-xps2 --parallel 1 -M 100 --time 6100

Performance:
k416 = 66.0 + 49.4 + 15.7 = 131.1 Mbits/Sec.
idle = 49.4 + 66.1 + 14.2 = 129.7 Mbits/Sec.
poll = 49.4 + 65.4 + 13.4 = 128.2 Mbits/Sec.
V9 = 49.4 + 66.6 + 13.2 = 129.2 Mbits/Sec.

Notes: Sometimes there was other traffic on the LAN, and it seemed to have an effect on the test. For example, the slight drop in power from 74 to 79 minutes into the poll test was certainly due to other network traffic.

The rest of the test data as graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-iperf.htm

The rest of the test data graphs for the 100% load on one CPU test:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-1-load.htm

In progress is a test running 40 staggered threads, each looping through 10 cycles of a 4 usec sleep followed by a 0.999 second sleep. So far it shows a 12% power improvement between kernel 4.16 and 4.16+V9.
Comment 30 Doug Smythies 2018-04-13 06:41:28 UTC
(In reply to Doug Smythies from comment #29)

> 
> In progress is a test running 40 staggered threads, each looping through 10
> cycles of a 4 usec sleep followed by a 0.999 second sleep. So far it shows a
> 12% power improvement between kernel 4.16 and 4.16+V9.

The test result graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm
Comment 31 Doug Smythies 2018-04-13 14:31:28 UTC
Created attachment 275355 [details]
Performance during 1 core pipe test

The 1 core, 2 cpu, pipe test.
V9 over kernel 4.16: Power is 11% better and performance is 17% better.
The rest of the graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm
Comment 32 Doug Smythies 2018-04-13 15:20:59 UTC
Created attachment 275357 [details]
Old - Shows excessive power consumption with intel-cpufreq and schedutil

All my previous posts used the intel_pstate CPU frequency scaling driver and the powersave governor.

This post is about the intel_pstate driver in passive mode, which exposes the intel-cpufreq CPU frequency scaling driver; here I set the schedutil governor.

Refer to this old e-mail from January 6th:
https://marc.info/?l=linux-pm&m=151525518828905&w=2

which had a follow up off-list e-mail with the attached graph.

Further investigation was done but never reported back. A 1281 Hertz work/sleep frequency was chosen as a reference condition within the problem area. If idle state 0 was disabled, the excessive power consumption was eliminated.

Now, fast forward to kernel 4.16 and this patch set: it also eliminates the excessive power consumption, without needing to disable idle state 0. I did not catch the worst excessive power consumption during the test (not finished yet), but have manually observed 13 Watts (with kernel 4.16) versus what should be about 6.6 Watts (with kernel 4.16 + V9). More typically, 10 Watts versus 6.6 is observed.
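
For reference, the configuration used here can be set up roughly like this (a sketch; it assumes intel_pstate is built in, schedutil is available, and root access):

#!/bin/sh
# Put intel_pstate into passive mode (exposes the intel-cpufreq driver),
# then select the schedutil governor on every CPU.
echo passive > /sys/devices/system/cpu/intel_pstate/status
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo schedutil > "$g"
done
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver    # expect: intel_cpufreq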
Comment 33 Doug Smythies 2018-04-14 05:14:14 UTC
Created attachment 275363 [details]
Processor package power graph - intel-cpufreq schedutil test

For my previous post, this is the power graph from the completed test at a 1281 Hz work/sleep frequency using the intel_pstate driver in passive mode and with the schedutil governor.

Kernel 4.16 uses 52% more power than 4.16 + all the patches.

All of the graphs for the test are here:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-1281ps.htm
Comment 34 Doug Smythies 2018-04-14 06:43:26 UTC
(In reply to Doug Smythies from comment #31)
> Created attachment 275355 [details]
> Performance during 1 core pipe test
> 
> The 1 core, 2 cpu, pipe test.
> V9 over kernel 4.16: Power is 11% better and performance is 17% better.
> The rest of the graphs:
> http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm

Correct link:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-pipe-1core.htm
Comment 35 Doug Smythies 2018-04-15 16:53:22 UTC
Created attachment 275383 [details]
load sweep test/experiment package power and overruns

This test/experiment was done to have a look at higher overall loads and events per second, without forcing any CPU affinity (i.e. no taskset).

The work is periodic, at a 128.1 Hertz work/sleep frequency with 40 threads for this test (128.1 * 40 = 5124 Hz overall). An overrun is counted if the calculated sleep interval for the current cycle is negative. The way this particular test workflow works, the system can, and does, catch up from an overrun. At the point where overruns do not catch up, all CPUs are going flat out, and that transition point should be the same for all the kernels under test herein. The upper bound for this test was set to just below the point where overruns do not catch up.

While V9 uses more power at some loads, it has a considerably lower number of overruns.
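
A minimal sketch of the overrun-counting idea described above (illustrative only; the real test program is a compiled poller, but the principle is the same: a fixed-period schedule where a negative computed sleep counts as an overrun; GNU date with %N is assumed):

#!/bin/sh
# Fixed-period loop: the next deadline advances by a constant period, so
# a late cycle (negative remaining sleep) is an overrun and the loop can
# catch up on later cycles.
period_ns=7806000                        # ~128.1 Hz
work() { i=0; while [ $i -lt 5000 ]; do i=$((i + 1)); done; }   # dummy load
overruns=0
next=$(( $(date +%s%N) + period_ns ))
while :; do
    work
    slack=$(( next - $(date +%s%N) ))
    if [ $slack -lt 0 ]; then
        overruns=$((overruns + 1)); echo "overruns: $overruns"
    else
        sleep $(awk "BEGIN{print $slack / 1e9}")
    fi
    next=$((next + period_ns))
done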
Comment 36 Doug Smythies 2018-04-15 17:21:18 UTC
In the previous post, the title block of the processor package power graph says:

"load per thread varies from ~1% to ~10% at nominal (non-turbo) CPU frequency"

The calibration of the utility used was done for single-threaded, low-load applications, and it is way off for this application.
The important point is that the high end (which was 10% in the script) was more like 95% per CPU overall (which, if the calibration were correct, would have been 19% in the script for my 8 CPU system).
Comment 37 Doug Smythies 2018-04-17 15:44:54 UTC
Created attachment 275425 [details]
Phoronix unpack-linux test results.

From doing short manual tests, it was not clear what was going on for this Phoronix test, so a longer test was run. It turns out about 3% power is saved when using V9 here.
Comment 38 Thomas Ilsche 2018-04-17 17:54:10 UTC
I was surprised to see the Phoronix results. Is it just a single-threaded benchmark? So your 100% on one CPU is a proxy for that?
Comment 39 Doug Smythies 2018-04-17 18:32:13 UTC
(In reply to Thomas Ilsche from comment #38)
> I was surprised to see the Phoronix results. Is it just a single-threaded
> benchmark? So your 100% on one CPU is a proxy for that?

No, this one used a few threads. Some single-threaded Phoronix tests use considerably less power with either the poll fix or V9 (poll + idle). I'm specifically looking for higher-load, multi-threaded Phoronix tests for now. I'm running build-apache right now (~43 Watts average power).

Yes, my 100% on one CPU test is a proxy for single threaded applications.

The best single threaded Phoronix test result I recall was with V5 and was:

From 2018.03.18:

> A couple of Phoronix tests (I didn't get very far yet):
> himeno: 11% performance improvement; 14% less power.

I'll re-do that test soon and post the results here.
Comment 40 Doug Smythies 2018-04-17 18:34:42 UTC
Created attachment 275427 [details]
updated load sweep power graph

A re-post of a previous graph with added lines using the intel_pstate CPU frequency scaling driver and powersave governor.
Comment 41 Doug Smythies 2018-04-17 21:09:08 UTC
Created attachment 275429 [details]
phoronix build-apache test power graph

This test was chosen because it was higher power and multi-threaded. While there is no improvement under these conditions, there is no regression either, so I am happy with the result.
Comment 42 Doug Smythies 2018-04-18 06:30:48 UTC
Created attachment 275435 [details]
system idle test

Just a basic idle system test on my server with extra services disabled. No surprises. I did not suffer from (much) higher power consumption at idle the way these big server systems seem to.
Comment 43 Doug Smythies 2018-04-18 18:26:20 UTC
Created attachment 275439 [details]
Power graph for Phoronix himeno test

The Phoronix himeno test is single-threaded and CPU-intensive. These results should be, and are, similar to my 100% load on one CPU test from comment 28.
Comment 44 Rafael J. Wysocki 2018-04-23 10:25:50 UTC
Thanks for all of the input!

Since the (equivalent of) v9 is in the mainline kernel now (as of 4.17-rc2), I'm closing this bug entry.

If there are any further issues in this area to track, let's open a new bug entry for that.