The problem is that if we stop the sched tick in tick_nohz_idle_enter() and then the idle governor predicts a short idle duration, we lose regardless of whether or not that prediction is right. If it is right, we have already lost, because we stopped the tick unnecessarily. If it is wrong, we lose going forward, because the idle state selected by the governor is going to be too shallow, and we will draw too much power (that has been reported recently to actually happen often enough for people to care).
A patch series to improve the situation is under active discussion. The idea is to move the decision whether or not to stop the tick deeper into the idle loop, in particular to after the idle state selection in the path where the idle governor is invoked. That avoids the problem, because the idle duration predicted by the idle governor can then be used to decide whether or not to stop the tick, so that the tick is only stopped if that value is large enough (and, consequently, the idle state selected by the governor is deep enough). A rough standalone sketch of the reordered decision follows below. See:
https://marc.info/?t=152115254600008&r=2&w=2
https://marc.info/?t=152156445000008&r=1&w=2
https://marc.info/?t=152103659800003&r=2&w=2
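As a minimal standalone sketch of that reordering (illustrative only: the state table, the select_state() helper, and the threshold are mine, not the kernel's), the key point is that the tick-stop decision is made after, and based on, the governor's duration prediction:

/*
 * Sketch of the reordered idle loop decision.  Not kernel code; a
 * standalone simulation of the idea described above.
 */
#include <stdbool.h>
#include <stdio.h>

#define TICK_PERIOD_US 1000    /* 1000 Hz kernel assumed */

struct state { const char *name; unsigned int target_residency_us; };

/* Stand-in for the governor: pick the deepest state whose target
 * residency fits the predicted idle duration. */
static int select_state(const struct state *s, int n, unsigned int predicted_us)
{
    int idx = 0;
    for (int i = 0; i < n; i++)
        if (s[i].target_residency_us <= predicted_us)
            idx = i;
    return idx;
}

int main(void)
{
    struct state states[] = { {"POLL", 0}, {"C1", 2}, {"C6", 600} };
    unsigned int predicted_us = 150;   /* short idle predicted */

    int idx = select_state(states, 3, predicted_us);
    /* Old flow: the tick was already stopped before this point, so we
     * lost either way.  New flow: stop it only when the prediction is
     * long enough to justify it. */
    bool stop_tick = predicted_us > TICK_PERIOD_US;

    printf("state %s, %s the tick\n", states[idx].name,
           stop_tick ? "stop" : "keep");
    return 0;
}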
Also see:
https://marc.info/?l=linux-pm&m=150116085925208&w=2
https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Created attachment 274961 [details]
BDW EX 330Watt baseline idle measurement result

Attached is a baseline idle power measurement of a Broadwell EX (4-socket server) running Linux 4.16-rc6.

The Y axis shows Watts, as measured by a Yokogawa WT310 AC power meter. The X axis is time, with 1-second granularity; each blue dot shows the power, as integrated by the power meter, over the previous 1 second. The green line is the simple software average of the blue dots.

The lowest blue dots are the "power floor" -- this machine never goes below 300 Watts. Power trends up to about 420 Watts at times, and there is plenty of "noise" in the results. Eyeballing the green line, one might say "this machine idles at about 330 Watts".
Created attachment 274963 [details]
BDW EX 300Watt idle-loop-v7.3 measurement result

In contrast to the baseline result in comment #3, this graph shows the idle measurement result for Linux 4.16-rc6 + the idle-loop-v7.3 patch series.

All the blue dots and the green average line sit right on the "power floor" of 300 Watts. I.e., the machine now idles at 300 Watts, down 30 Watts from the 330 Watt baseline.
New results from Thomas Ilsche: https://marc.info/?l=linux-pm&m=152218749113965&w=2
My computer is an older i7-2700K (4 cores, 8 CPUs).

Using iperf on the test computer, running:

$ iperf -c s10 -M 536 --time 1200

with another computer (s10) acting as the iperf server, I tested most of the kernel variants recently discussed. The most dramatic results I got (~19-minute averages):

Idle state 0 = enabled
Idle state 1 = disabled
Idle state 2 = enabled
Idle state 3 = enabled
Idle state 4 = enabled

Kernel 4.16-rc6 - baseline reference kernel:
Average processor package power: 30.87 Watts
Throughput: 139/140 Mbits/sec (needs to be double checked)

Kernel 4.16-rc6 + 2 poll idle changes:
Average processor package power: 21.30 Watts
Power saved over baseline: 31%
Throughput: 135 Mbits/sec (3% worse)

Kernel 4.16-rc6 + poll + v7.3 of the idle loop:
Average processor package power: 16.74 Watts
Power saved over baseline: 45.8%
Throughput: 139 Mbits/sec (within noise)

Now with idle state 1 set to enabled:

Kernel 4.16-rc6 + poll + v7.3 of the idle loop:
Average processor package power: 17.9 Watts
Throughput: 140 Mbits/sec (within noise)

Now with idle state 0 disabled (note: it doesn't actually seem to disable; I get an average of 0.30 seconds of idle state 0 time per minute with it disabled, but only with this workflow. Other workflows show 0 idle state 0 time):
Average processor package power: 18.51 Watts
Throughput: 141 Mbits/sec (1% better)
(In reply to Len Brown from comment #3)
> Power trends up to about 420 Watts at times,

I also observe up to 40% extra power consumption for the first few minutes (~3 or 4) after boot. Tests are always started after a short settling time.
New version of the patch series posted:
https://marc.info/?l=linux-pm&m=152232622719003&w=2

Patchwork links:
https://patchwork.kernel.org/patch/10315353/
https://patchwork.kernel.org/patch/10315357/
https://patchwork.kernel.org/patch/10315343/
https://patchwork.kernel.org/patch/10315345/
https://patchwork.kernel.org/patch/10315361/
https://patchwork.kernel.org/patch/10315369/
https://patchwork.kernel.org/patch/10315363/
https://patchwork.kernel.org/patch/10315373/
https://patchwork.kernel.org/patch/10315335/
https://patchwork.kernel.org/patch/10315333/

git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
idle-loop-v8
Created attachment 275001 [details] idle-loop-v8, kernel build with "make -j1" Test of the idle-loop-v8 git branch on a quad core desktop Ivy Bridge processor with HT (8 logical CPUs). The workload is a "make -j1" kernel build. ~10% improvement in power and slightly shorter execution time.
Created attachment 275003 [details]
shows what I did to try to disable idle state 1

For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle state 0 under some workloads. Kernel 4.16-rc7 alone seems to fully disable idle state 0 for the same workloads. The attachment shows what I have tried so far, and some supporting test results.
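For reference, the standard sysfs interface for this is /sys/devices/system/cpu/cpuN/cpuidle/stateM/disable (write 1 to disable, 0 to re-enable). A minimal standalone sketch of disabling a given state on all CPUs (my own illustration, not the attached script; must run as root):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int state = argc > 1 ? atoi(argv[1]) : 0;   /* idle state to disable */
    char path[128];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
                 cpu, state);
        FILE *f = fopen(path, "w");
        if (!f)
            break;                 /* no more CPUs (or no cpuidle support) */
        fputs("1", f);             /* 1 = disable, 0 = re-enable */
        fclose(f);
    }
    return 0;
}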
(In reply to Doug Smythies from comment #10)
> For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle
> state 0 under some workloads. Kernel 4.16-rc7 alone seems to fully disable
> idle state 0 for the same workloads.

That may be an effect of the idle-poll patches. Is this reproducible with the idle-poll branch alone (on top of 4.16-rc7)?
Also it may come from the "cpuidle: menu: Refine idle state selection for running tick" patch, which will use state 0 as the ultimate fallback regardless of its disable/enable status.
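For illustration, a standalone sketch (not the actual menu governor code) of the alternative being discussed here: fall back to the shallowest *enabled* state rather than unconditionally to state 0:

#include <stdbool.h>
#include <stdio.h>

struct idle_state {
    const char *name;
    bool disabled;   /* mirrors .../cpuidle/stateN/disable in sysfs */
};

/* Fallback that honors disabled states: return the shallowest state
 * that is still enabled, instead of unconditionally returning 0. */
static int shallowest_enabled(const struct idle_state *s, int n)
{
    for (int i = 0; i < n; i++)
        if (!s[i].disabled)
            return i;
    return 0;   /* nothing enabled at all: last resort */
}

int main(void)
{
    struct idle_state states[] = {
        { "POLL", true },    /* state 0 disabled by the user */
        { "C1",   false },
        { "C6",   false },
    };
    printf("fallback -> %s\n",
           states[shallowest_enabled(states, 3)].name);   /* prints C1 */
    return 0;
}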
Created attachment 275007 [details]
Package power versus time for kernel compiles with j1

The compile was incremental, not clean, and memory was flushed between tests.

The command:
time make -j1 olddefconfig bindeb-pkg LOCALVERSION=-test

run 1 with V8 = rejected, operator error
run 2 with V8 = v8_j1_2 on the graph
run 3 with V8 = v8_j1_3 on the graph
run 1 with rc7 = baseline reference = rc7_j1_1 on the graph
run 2 with rc7 = baseline reference = rc7_j1_2 on the graph

The times were all within +/- a second of each other: 11 minutes 35 seconds.
(In reply to Rafael J. Wysocki from comment #11)
> (In reply to Doug Smythies from comment #10)
> > For Kernel 4.16-rc7 + idle-loop-v8, I am unable to fully disable idle
> > state 0 under some workloads. Kernel 4.16-rc7 alone seems to fully
> > disable idle state 0 for the same workloads.
>
> That may be an effect of the idle-poll patches. Is this reproducible with
> the idle-poll branch alone (on top of 4.16-rc7)?

Kernel:

~/temp-k-git/linux$ git log --oneline
f719850 cpuidle: poll_state: Avoid invoking local_clock() too often
c43fdde cpuidle: poll_state: Add time limit to poll_idle()
3eb2ce8 Linux 4.16-rc7

Idle state 0 disables properly with this kernel.
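For context, those two poll_state commits cap the total time spent in poll_idle() and avoid calling local_clock() on every spin. A standalone simulation of that pattern (the constants and the stop_condition() stand-in are my assumptions, not the kernel's values):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define POLL_TIME_LIMIT_NS  (200 * 1000)   /* assumed cap on polling time */
#define RELAX_COUNT         200            /* spins between clock checks */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

static bool stop_condition(void)
{
    return false;   /* stand-in for need_resched() */
}

void poll_idle_sim(void)
{
    uint64_t start = now_ns();
    unsigned int count = 0;

    while (!stop_condition()) {
        __asm__ __volatile__("pause");      /* like cpu_relax() on x86 */
        if (++count < RELAX_COUNT)
            continue;                       /* check the clock only
                                               every RELAX_COUNT spins */
        count = 0;
        if (now_ns() - start > POLL_TIME_LIMIT_NS)
            break;                          /* time limit: stop polling so
                                               a deeper state can be used */
    }
}

int main(void) { poll_idle_sim(); return 0; }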
(In reply to Rafael J. Wysocki from comment #12)
> Also it may come from the "cpuidle: menu: Refine idle state
> selection for running tick" patch, which will use state 0 as
> the ultimate fallback regardless of its disable/enable
> status.

I tried to allow idle state 1 in that case, if it was enabled, but maybe I made a mistake. See attachment 275007 [details].
Compared to kernel 4.16-rc7, the baseline reference:
Idle: noisy, at ~3.7 Watts

With the idle-poll branch alone:
Idle: steady at 3.7 Watts
100% load on one CPU: 17% less power, 1.2% faster

With idle-poll and V8:
Idle: steady at 3.7 Watts
100% load on one CPU: 17% less power, 1.7% faster
(In reply to Doug Smythies from comment #15)
> (In reply to Rafael J. Wysocki from comment #12)
> > Also it may come from the "cpuidle: menu: Refine idle state
> > selection for running tick" patch, which will use state 0 as
> > the ultimate fallback regardless of its disable/enable
> > status.
>
> I tried to allow idle state 1 in that case, if it was enabled, but maybe
> I made a mistake. See attachment 275007 [details].

Well, why don't you just revert commits ec418a799ee7 and 8be5f55d60e8 from the idle-loop-v8 branch and try that? Then you will know for sure. :-)

The other commits in that branch don't even touch the state selection algorithm.
Created attachment 275037 [details]
test code to obey idle disabled states

I have been running this test code on top of V8, and so far it disables idle state 0 (or more) when desired, for any workflow thrown at it. I want to add some automation to my testing process; it will take several days or even a week.
(In reply to Doug Smythies from comment #18) > Created attachment 275037 [details] > test code to obey idle disabled states It will do that, but the resulting state may not satisfy all of the selection conditions. That may not be a problem, though.
New version of the patch series posted:
https://marc.info/?l=linux-pm&m=152283202832217&w=2

Patchwork links:
https://patchwork.kernel.org/patch/10322237/
https://patchwork.kernel.org/patch/10322233/
https://patchwork.kernel.org/patch/10322249/
https://patchwork.kernel.org/patch/10322247/
https://patchwork.kernel.org/patch/10322261/
https://patchwork.kernel.org/patch/10322257/
https://patchwork.kernel.org/patch/10322271/
https://patchwork.kernel.org/patch/10322269/
https://patchwork.kernel.org/patch/10322263/
https://patchwork.kernel.org/patch/10322227/

git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
idle-loop-v9
Created attachment 275117 [details] power consumption under different workloads idle-loop-v9 vs unmodified 4.16.0 Idle and the "classic Powernightmare trigger" look good with v9. There is a regression in power consumption for a workload where all cores sleep for 500 us in a loop.
Created attachment 275119 [details]
utilization of different idle states with the 500 us workload

v9 chooses the correct C state. This shows that a previous issue with v7 and certain hrtimers has been fixed.
Created attachment 275121 [details]
average time spent for each idle phase per idle state (variance among cores)

For the 500 us loop workload, the governor doesn't go into the deepest sleep state because it decides not to stop the sched tick. This causes more intermediate wakeups, hence the higher power consumption (note: 1000 Hz kernel).

I see two ways to improve that behavior (a rough sketch of option 2 follows below):

1) Stop the sched tick based on the knowledge that the upcoming timer gave us reliable information about the time to wakeup (rather than an unreliable heuristic).

2) Don't just disable/enable the sched tick, but move it such that it will occur *after* the expected wake-up time, not before. This is what I meant previously with

> modify the next upcoming sched tick to be better suitable as fallback timer
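To make option 2 concrete, here is a rough standalone sketch (my illustration, not a proposed implementation; TICK_PERIOD_NS and SLACK_NS are assumed values): instead of a binary keep/stop choice, the next tick is programmed to the first tick boundary at or after the predicted wake-up, so it still serves as a fallback timer without cutting the sleep short.

#include <stdint.h>
#include <stdio.h>

#define TICK_PERIOD_NS  1000000u    /* 1000 Hz kernel */
#define SLACK_NS         50000u     /* assumed margin past the prediction */

/* Returns the delay, from "now", at which to program the next tick. */
static uint64_t fallback_tick_delay(uint64_t predicted_sleep_ns)
{
    uint64_t earliest = predicted_sleep_ns + SLACK_NS;

    /* Round up to the next tick boundary at or after the prediction,
     * so periodic bookkeeping still happens on the usual grid. */
    return ((earliest + TICK_PERIOD_NS - 1) / TICK_PERIOD_NS) * TICK_PERIOD_NS;
}

int main(void)
{
    /* For a 500 us predicted sleep, the tick lands at 1 ms: after the
     * expected wake-up, rather than firing mid-sleep or being
     * disabled outright. */
    printf("tick in %llu ns\n",
           (unsigned long long)fallback_tick_delay(500000));
    return 0;
}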
Created attachment 275123 [details] utilization of different idle states with the 500 us workload
Created attachment 275125 [details]
average time spent for each idle phase per idle state (variance among cores)

(sorry, I uploaded the wrong chart previously)
Created attachment 275127 [details]
utilization of different idle states with the 800 us workload

There is also a change in behavior for an 800 us sleep loop workload. With the patch, the governor does not consistently go to C6, as opposed to 4.16.0. I'm not sure yet why. Note: 800 us = the C6 residency on that system.
Thanks for the information! From all of the information available to me at this point, the changes in v9 overall are a clear improvement over what happens in 4.16.0 (especially as far as idle power is concerned) and the remaining issues can be dealt with on top of them.
Created attachment 275287 [details]
100% load on one CPU graph

I'm going back to basics. This is the 100% load on one CPU test.

Legend:
"k416" unmodified kernel 4.16
"idle" kernel 4.16 + 10 V9 idle loop re-work patches
"poll" kernel 4.16 + 2 poll idle patches
"V9"   kernel 4.16 + 10 V9 idle loop re-work + 2 poll idle patches

Performance:
k416 = baseline reference
idle < 0.1% degradation
poll ~0.8% improvement
V9   ~0.55% improvement

Note: for only a 100 minute test, the difference between the "poll" and "V9" performance improvements is within experimental error.

@Thomas: Since your posts a few days ago, I have been playing with my understanding of your load types: synchronized; staggered; and staggered letting the system decide which CPU to use (I use 16 x NUM_CPUS load threads, but at a lower period each).

I have an overrun detector in one of the programs I use to create the load/sleep cycle. For the high frequency case (poller-omp 500), there are always a few overruns, even at very light load per cycle (with either kernel 4.16 or 4.16+V9). For this:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_250_Hz_power.png
(I do not have the green kernel with "delta".)

On my 1000 Hz kernel I also observe significant power differences between poller-sync 3000 and poller-stagger 3000 (and in my case, with no taskset, I see another increase in power used).
Created attachment 275333 [details]
iperf test results

Method: be an iperf client to 3 servers at once. Packets are small on purpose; we want the highest frequency of packets, not the fastest payload delivery.

Script used:

doug@s15:~/idle$ cat iperf_once
#!/bin/dash
iperf -c cyd-hp2 --parallel 1 -M 100 --time 6100 &
iperf -c rpi01 --parallel 1 -M 100 --time 6100 &
iperf -c doug-xps2 --parallel 1 -M 100 --time 6100

Performance:
k416 = 66.0 + 49.4 + 15.7 = 131.1 Mbits/sec
idle = 49.4 + 66.1 + 14.2 = 129.7 Mbits/sec
poll = 49.4 + 65.4 + 13.4 = 128.2 Mbits/sec
V9   = 49.4 + 66.6 + 13.2 = 129.2 Mbits/sec

Notes: Sometimes there was other traffic on the LAN, and it seemed to have an effect on the test. For example, the slight drop in power from 74 to 79 minutes of the poll test was certainly due to other network traffic.

The rest of the test data as graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-iperf.htm

The rest of the test data graphs for the 100% load on one CPU test:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-1-load.htm

In progress is a test running 40 staggered threads, each looping at 10 cycles of 4 usec sleep followed by a 0.999 second long sleep. So far it shows a 12% power improvement between kernel 4.16 and 4.16+V9.
(In reply to Doug Smythies from comment #29)
> In progress is a test running 40 staggered threads, each looping at 10
> cycles of 4 usec sleep followed by a 0.999 second long sleep. So far it
> shows a 12% power improvement between kernel 4.16 and 4.16+V9.

The test result graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm
Created attachment 275355 [details]
Performance during 1 core pipe test

The 1 core, 2 CPU, pipe test.
V9 over kernel 4.16: power is 11% better and performance is 17% better.
The rest of the graphs:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm
Created attachment 275357 [details]
Old - shows excessive power consumption with intel-cpufreq and schedutil

All my previous posts used the intel_pstate CPU frequency scaling driver with the powersave governor. This post is about the intel_pstate driver in passive mode, which gives the intel-cpufreq CPU frequency scaling driver, with the schedutil governor.

Refer to this old e-mail from January 6th:
https://marc.info/?l=linux-pm&m=151525518828905&w=2
which had a follow-up off-list e-mail with the attached graph. Further investigation was done, but never reported back. A 1281 Hertz work/sleep frequency was chosen as a reference condition within the problem area. If idle state 0 was disabled, the excessive power consumption was eliminated.

Now, fast forward to kernel 4.16 and this patch set: it also eliminates the excessive power consumption, without needing to disable idle state 0. I did not catch the worst excessive power consumption during the test (not finished yet), but have manually observed 13 Watts (with kernel 4.16) versus what should be about 6.6 Watts (with kernel 4.16 + V9). More typically, 10 Watts versus 6.6 is observed.
Created attachment 275363 [details]
Processor package power graph - intel-cpufreq schedutil test

For my previous post, this is the power graph from the completed test at a 1281 Hz work/sleep frequency, using the intel_pstate driver in passive mode with the schedutil governor. Kernel 4.16 uses 52% more power than 4.16 + all the patches.

All of the graphs for the test are here:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-1281ps.htm
(In reply to Doug Smythies from comment #31)
> Created attachment 275355 [details]
> Performance during 1 core pipe test
>
> The 1 core, 2 CPU, pipe test.
> V9 over kernel 4.16: power is 11% better and performance is 17% better.
> The rest of the graphs:
> http://www.smythies.com/~doug/linux/not_idle/k416/k416-4usec.htm

Correct link:
http://www.smythies.com/~doug/linux/not_idle/k416/k416-pipe-1core.htm
Created attachment 275383 [details]
load sweep test/experiment package power and overruns

This test/experiment was done to look at higher overall loads and events per second, without forcing any CPU affinity (i.e. no taskset). The work is periodic, at a 128.1 Hertz work/sleep frequency with 40 threads for this test (128.1 * 40 = 5124 Hz overall). An overrun is counted if the calculated sleep interval for the current cycle is negative; a sketch of this accounting follows below. The way this particular test workflow works, the system can, and does, catch up after an overrun. At the point where overruns no longer catch up, all CPUs are going flat out, and that transition point should be the same for all the kernels under test here. The upper bound for this test was set to just below the point where overruns no longer catch up.

While V9 uses more power at some loads, it has a considerably lower number of overruns.
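For reference, a standalone sketch of that overrun accounting (my illustration, not the actual load generator; PERIOD_NS and do_work() are stand-ins): the loop keeps a fixed time grid, counts a cycle as an overrun when the computed sleep interval is negative, and naturally catches up on later cycles.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_NS 7806401LL   /* ~128.1 Hz work/sleep frequency */

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void do_work(void) { /* placeholder for the calibrated load */ }

int main(void)
{
    long long next = now_ns();
    unsigned long overruns = 0;

    for (int cycle = 0; cycle < 10000; cycle++) {
        do_work();
        next += PERIOD_NS;                  /* fixed grid: allows catch-up */
        long long sleep_ns = next - now_ns();
        if (sleep_ns < 0)
            overruns++;                     /* negative interval = overrun */
        else
            usleep(sleep_ns / 1000);
    }
    printf("overruns: %lu\n", overruns);
    return 0;
}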
In the previous post, the title block of the processor package power graph says: "load per thread varies from ~1% to ~10% at nominal (non-turbo) CPU frequency". The calibration of the utility used was done for single-threaded, low-load applications, and it is way off for this application. The important point is that the high end (which was 10% in the script) was more like 95% per CPU overall (which, if the calibration were correct, would have been 19% in the script, for my 8 CPU system).
Created attachment 275425 [details] Phoronix unpack-linux test results. From doing short manual tests, it was not clear what was going on for this Phoronix test, so a longer test was run. It turns out about 3% power is saved when using V9 here.
I was surprised to see the Phoronix results. Is it just a single-threaded benchmark? So your 100% on one CPU test is a proxy for that?
(In reply to Thomas Ilsche from comment #38)
> I was surprised to see the Phoronix results. Is it just a single-threaded
> benchmark? So your 100% on one CPU test is a proxy for that?

No, this one ran a few threads. Some single-threaded Phoronix tests use considerably less power with either the poll fix or V9 (poll + idle). I'm specifically looking for higher-load, multi-threaded Phoronix tests for now. I'm running apache-build right now (~43 Watts average power).

Yes, my 100% load on one CPU test is a proxy for single-threaded applications. The best single-threaded Phoronix test result I recall was with V5, from 2018.03.18:

> A couple of Phoronix tests (I didn't get very far yet):
> himeno: 11% performance improvement; 14% less power.

I'll re-do that test soon and post the results here.
Created attachment 275427 [details] updated load sweep power graph A re-post of a previous graph with added lines using the intel_pstate CPU frequency scaling driver and powersave governor.
Created attachment 275429 [details] phoronix build-apache test power graph This test was chosen because it was higher power and multi-threaded. While there is no improvement under these conditions, there is no regression either, so I am happy with the result.
Created attachment 275435 [details]
system idle test

Just a basic idle system test, on my server with extra services disabled. No surprises. I did not suffer from the (much) higher power consumption at idle that these big server systems seem to.
Created attachment 275439 [details]
Power graph for Phoronix himeno test

The Phoronix himeno test is single threaded and CPU intensive. These results should be, and are, similar to my 100% load on one CPU test from comment 28.
Thanks for all of the input! Since the (equivalent of) v9 is in the mainline kernel now (as of 4.17-rc2), I'm closing this bug entry. If there are any further issues in this area to track, let's open a new bug entry for that.