Jörg Otte reports: in v4.5 and earlier intel_pstate downscaled idle processors (load 0.1-0.2%) to the minimum frequency, in my case 800 MHz. In v4.6-rc1 the behavior has changed dramatically: in idle the processor frequency stays within a few MHz of 2500 MHz, which is the maximum non-turbo frequency. There is no difference between the powersave and performance governors. I currently use acpi_cpufreq, which works as usual. Processor: Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz (family: 0x6, model: 0x3c, stepping: 0x3) Reference: http://marc.info/?l=linux-acpi&m=145927229609595&w=2
Created attachment 211401 [details] Revert of commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Jörg reports that the attached revert of commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) makes the problem go away for him.
Suspicion arises that the problem may be related to an initialization issue, so Jörg is requested to test the following two patches: [1] https://patchwork.kernel.org/patch/8714401/ [2] https://patchwork.kernel.org/patch/8723461/ ([2] depends on [1]), but these patches don't make any difference for him.
Created attachment 211411 [details] pstate_sample trace from the affected system - working case Trace from the pstate_sample trace point collected by Jörg with the revert patch attachment #1 [details] applied.
Created attachment 211421 [details] pstate_sample trace from the affected system - failing case Trace from the pstate_sample trace point collected by Jörg with the patches from comment #2 applied (but without the revert).
Created attachment 211431 [details] Kernel config file from the affected system
@Jörg: After a bit of reconsideration, I think that full output from the most recent turbostat would be useful here in both the failing and working cases. You can get that by running "turbostat --debug". I'm mostly interested in the C-state residencies (the CPU%cx and Pkg%pcx columns) and the PkgWatt and CorWatt numbers.
Created attachment 211511 [details] turbo-debug-good_case
Created attachment 211521 [details] turbo-debug-failing_case
Thanks! Evidently, in the failing case the processors spend more time in C1 and the package PC2 residency is much less than in the good case. That translates to increased energy usage. It looks like the elimination of intel_pstate timers leads to increased activity somewhere else in your system.
Please attach /proc/interrupts for both the good and failing cases. Also please generate powertop reports with "powertop --html --time=30" for both the good and failing cases and attach them too.
(In reply to Rafael J. Wysocki from comment #9) > Evidently, in the failing case the processors spend more time in C1 and the > package PC2 residency is much less than in the good case. That translates > to increased energy usage. While working on another theory (now disproved), from the trace data it was determined (very crudely) that CPU 3 services some event at ~125 Hz (possibly aliased) in the failing case and ~40 Hz in the good case. I haven't gone into this level of detail for the other CPUs (particularly CPU 2) yet. I have been trying, without success, to re-create Jörg's failure scenario.
Created attachment 211531 [details] powertop good case
Created attachment 211541 [details] powertop fail case
Created attachment 211551 [details] interrupts good case
Created attachment 211561 [details] interrupts fail case
I forgot that the latest turbostat has irq counts. Very useful. I tried a simulated low load and a higher interrupt rate (than normal, for me) on my computer. I did: CPU 0 520 Hz; CPU 1 515 Hz; CPU 2 75 Hz; CPU 3 390 Hz. Still, the resulting CPU frequencies were fine.
I tried on three systems with the same CPU model and very similar kernel configs, with minor changes. I couldn't reproduce it. I am generating one periodic interrupt and triggering a soft irq where I do small processing to bump up frequencies. I am able to reach higher frequencies, but not in the turbo range.
@Jörg: Thanks for the attachments!

The /proc/interrupts files contain an indication of what may be the source of the problem: you get an insanely large number of ACPI interrupts in both the good and the failing cases. That number is two or three orders of magnitude too high, which usually means that something really odd is happening. And that's where there's a difference:

Good:
  9:       3156      34845       2061      47444   IO-APIC   9-fasteoi   acpi
Failing:
  9:       4938      17921       3161      72687   IO-APIC   9-fasteoi   acpi

In the failing case you got around 10000 more of those interrupts.

Now, if you look at the numbers of wakeups as reported by powertop, there's a difference too. In the failing case, the primary source of wakeups is hrtimer_wakeup, which also is highly unusual AFAICS. It looks like the elimination of intel_pstate timers induces more usage of highres timers by something. The question is by what and why, but to answer that we need to take the ACPI noise out of the picture first.
@Jörg: Please attach the output of acpidump from your system. Also please do

# cd /sys/firmware/acpi/interrupts/
# for f in *; do echo $f; cat $f; done > /tmp/acpi-interrupts.txt

and attach the /tmp/acpi-interrupts.txt file.
Created attachment 211611 [details] excerpt from Jorg reverted trace data I still think this is the same as or similar to bug 93521. I have asked on that bug for them to try kernel 4.6-rc1. There is a difference in the number of passes through the intel_pstate driver between the non reverted and reverted trace data. One reason, is that in the reverted case, the driver is not called if the CPU is idle on a jiffy boundary. When it does happen that the driver is active on a jiffy boundary, then often there is a long duration involved, resulting in driving down the target pstate, whereas otherwise it would have stayed higher (as in the non-reverted case). The attached screenshot of the reverted trace data details a clear area where this is occurring with CPU 3. Note: Other sometimes active CPUs have higher target pstates (not shown). Notice very low load on CPU 3. CPU 3 jumps way up in target pstate, due to the higher active frequency due to another CPU, line 855. Now, CPU requests pstate 26 for several more passes, until the long duration scale adjustment kicks in at line 860. Based on aperf clock cycles, it seems about 5 events actually took place in that time period. The point is that the target pstate got driven down by long durations in the reverted case. Under these conditions there will not be long durations in the non-reverted case (which, in general, is a good thing).
@Rafael:
> /sys/firmware/acpi/interrupts/

This reminds me of a big problem I had a year ago after installing Linux on the notebook: I had a kworker thread consuming 80% CPU time. I found a workaround for that problem in an Ubuntu forum. The workaround was to disable /sys/firmware/acpi/interrupts/gpe13 (echo "disable" > gpe13).

To disable gpe13 during boot as early as possible I created an upstart job to disable gpe13 as soon as the /sys filesystem has been mounted. This worked so well for a year that I had completely forgotten about this problem, until today.

--
start on mounted MOUNTPOINT=/sys

script
echo "disable" > /sys/firmware/acpi/interrupts/gpe13
exit 0
end script
--
Created attachment 211621 [details] acpi interrupts
Created attachment 211631 [details] acpidump
(In reply to jrg.otte from comment #21)
> @Rafael:
> > /sys/firmware/acpi/interrupts/
>
> This reminds me of a big problem I had a year ago after installing Linux on
> the notebook: I had a kworker thread consuming 80% CPU time. I found a
> workaround for that problem in an Ubuntu forum. The workaround was to
> disable /sys/firmware/acpi/interrupts/gpe13 (echo "disable" > gpe13).
>
> To disable gpe13 during boot as early as possible I created an upstart job
> to disable gpe13 as soon as the /sys filesystem has been mounted. This
> worked so well for a year that I had completely forgotten about this
> problem, until today.
>
> --
> start on mounted MOUNTPOINT=/sys
>
> script
> echo "disable" > /sys/firmware/acpi/interrupts/gpe13
> exit 0
> end script
> --

That's good information.

This means that the ACPI interrupts you get probably almost all happen before you disable GPE 0x13. To confirm that, please check if the number of ACPI interrupts in /proc/interrupts changes significantly after you've disabled GPE 0x13.
(In reply to Doug Smythies from comment #20) > Created attachment 211611 [details] > excerpt from Jorg reverted trace data > > I still think this is the same as or similar to bug 93521. I have asked on > that bug for them to try kernel 4.6-rc1. > > There is a difference in the number of passes through the intel_pstate > driver between the non reverted and reverted trace data. One reason, is that > in the reverted case, the driver is not called if the CPU is idle on a jiffy > boundary. This is correct, but I still would like to explain the increased number of hrtimer_wakeup wakeups in the powertop report for the failing case.
@Jörg: Actually, in both the good and failing cases please save /proc/interrupts at one point and then again after around 30 s, produce the diff between the two, and attach the result.
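A sketch of one way to produce that (the file names are just examples):

# cat /proc/interrupts > /tmp/interrupts.before
# sleep 30
# cat /proc/interrupts > /tmp/interrupts.after
# diff -u /tmp/interrupts.before /tmp/interrupts.after > /tmp/interrupts-diff.txt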
I think timer stats might help provide insight. As root:

1.) clear anything that might be there already
# echo "0" > /proc/timer_stats
2.) enable:
# echo "1" > /proc/timer_stats
3.) wait a few seconds, perhaps the 30 that Rafael suggested.
4.) extract a snapshot, redirecting to a file
# cat /proc/timer_stats > timer_stats_not_reverted
5.) disable (holds at last snapshot, i.e. steps 4 and 5 can be reversed).
# echo "0" > /proc/timer_stats
@Doug: Agreed.

@Jörg: In addition to that, since you have all of the tracing goodies enabled in the kernel config already, please do as follows:

# cd /sys/kernel/debug/tracing/    (or wherever your tracing directory is located)
# echo 0 > tracing_on
# echo hrtimer_init_sleeper > set_ftrace_filter
# echo function > current_tracer
# echo 1 > options/func_stack_trace
# echo 1 > tracing_on

and save the "trace" file after around 10 s in both the good and the failing cases. Please attach the results.
@Rafael:
> please check if the number of ACPI interrupts in /proc/interrupts
No ACPI interrupts within 15 minutes.

@Doug:
> /proc/timer_stats
I'm missing timer_stats; there is only /proc/timer_list. Do I need a special configuration option? Which?
Created attachment 211651 [details] diff-proc-interrupts-fail-case
Created attachment 211661 [details] diff-proc-interrupts-good-case
(In reply to jrg.otte from comment #29) > @Doug > > /proc/timer_stats > I'm missing timer_stats there is only /proc/timer_list. Do I need a special > configuration option? Which? Well, I guess so (Sorry, it did not occur to me that you might not have it). I am guessing at this one, which I have and you do not: CONFIG_TIMER_STATS=y
Created attachment 211671 [details] hrtimer-tracer-fail-case
Created attachment 211681 [details] hrtimer-tracer-good-case
Created attachment 211691 [details] timer_stats_not_reverted
I only see timer_stats for the bad case; do you also have it for the good case?
@Jörg: Thanks for all the info, but the picture is still rather murky (to me at least). Please also collect traces from the events/power/cpu_idle trace point for both the good and the failing cases and attach them.
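A minimal sketch of one way to collect them (same tracing directory as in the earlier ftrace instructions; the 30 s wait is just an example):

# cd /sys/kernel/debug/tracing/
# echo 0 > tracing_on
# echo > trace
# echo 1 > events/power/cpu_idle/enable
# echo 1 > tracing_on
# sleep 30
# echo 0 > tracing_on
# cp trace /tmp/cpu-idle-trace.txt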
Interestingly enough, the average number of interrupts per time unit is roughly the same in both the good and the failing cases. This means that the average number of times the CPUs are taken out of idle states is roughly the same in both cases too. Nevertheless, a significant difference is visible in the turbostat and powertop reports, which show that C1 and polling are used in the failing case (roughly 16% of the time) while in the good one the CPUs seem to be entering C7s all the time. To me, there are two possible explanations of that. One (A) is that the distribution of timer events has changed, so they tend to clump together in the failing case and we get timer interrupts really close to each other on the one hand and others that are farther apart on the other hand. The other (B) is that for some obscure reason the tick_nohz_get_sleep_length() computation is confused somehow and returns incorrect results. The traces from the cpu_idle trace point should help us to find out which of them really happens.
It also would be good to check if changing HZ from 300 to 1000 changes the picture in any way.
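(In config terms - an assumption based on the stock Kconfig choices - that means rebuilding with CONFIG_HZ_1000=y / CONFIG_HZ=1000 in place of CONFIG_HZ_300=y / CONFIG_HZ=300.)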
Rafael: I agree with your comments. My comment #16 is wrong, as I had thought the turbostat sample time was 1 second instead of 5. Everything agrees, roughly 300 IRQs / second across all CPUs. I am not familiar with cpu_idle trace stuff, but just for completeness would still like to see timer_stats_reverted.
(In reply to Doug Smythies from comment #40) > Rafael: I agree with your comments. My comment #16 is wrong, as I had > thought the turbostat sample time was 1 second instead of 5. Everything > agrees, roughly 300 IRQs / second across all CPUs. I am not familiar with > cpu_idle trace stuff, but just for completeness would still like to see > timer_stats_reverted. Agreed. However, I misread the powertop reports, because I was only looking at CPU0. If you look at all of them, things start to look different, especially for CPU2 in the good case (where the percentages don't even add up to 100). I trust turbostat more. :-)
Does this still happen if you boot in single-user-mode? How about if you boot with "maxcpus=1"?
Created attachment 211751 [details] cpu-idle-tracer-fail-case
Created attachment 211761 [details] cpu-idle-tracer-good-case
Created attachment 211791 [details] timer-stats-good-case
Tried the good and fail cases with 1000 Hz. For me, no visible change.
maxcpus=1: no visible change for me.

single-user-mode: unfortunately my system does not boot to single-user mode. With the Ubuntu kernel I get a black screen with the backlight on. With my kernel I see it running up to starting udev, but then I get no console prompt.
Created attachment 211801 [details] Graph 1 of 2 - CPU Frequencies - fail case

I have continued to attempt to re-create Jörg's problem on my computer. I have a repeatable (small sample space, so far) way to show 9% extra energy consumption and an average increase in CPU frequency of around 5 p-states. I have not been able to get the patch in comment 1 to apply, so I have been using kernel 4.5-rc7 as my "good" reference control sample. The "fail" kernel is 4.6-rc1 at 300 Hz with the scheduler set to noop.

The test work flow is 1.5% load (at nominal CPU frequency) on each CPU: CPUs 0 and 4 at 104.73 Hz; CPUs 1 and 3 at 33.67 Hz; CPUs 2 and 6 at 112.43 Hz; CPUs 3 and 7 at 71.93 Hz. The work flow times between CPUs on the same core are NOT staggered. Staggering by approximately 1/2 period was tested, and NOT staggered seemed a better test condition (meaning worse for higher CPU frequency), but not convincingly so. The above CPU frequencies were extracted from Jörg's attached test data.
Created attachment 211811 [details] Graph 2 of 2 - CPU Frequencies - good case
Doug, does something like this help in your case:

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4b64452..54654af 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -114,6 +114,7 @@ struct cpudata {
 	u64	prev_mperf;
 	u64	prev_tsc;
 	u64	prev_cummulative_iowait;
+	u64	low_cpu_util_cnt;
 	struct sample sample;
 };
 
@@ -957,6 +958,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
 	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
+	int32_t core_util;
 	u64 duration_ns;
 
 	intel_pstate_calc_busy(cpu);
@@ -984,6 +986,15 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	 * enough period of time to adjust our busyness.
 	 */
 	duration_ns = cpu->sample.time - cpu->last_sample_time;
+	core_util = div64_u64(int_tofp(100) * cpu->sample.mperf, cpu->sample.tsc);
+	if (fp_toint(core_util) < 2) {
+		cpu->low_cpu_util_cnt++;
+		if (cpu->low_cpu_util_cnt > 2) {
+			duration_ns = cpu->sample.time * cpu->low_cpu_util_cnt;
+		}
+	} else {
+		cpu->low_cpu_util_cnt = 0;
+	}
 	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
 	    && cpu->last_sample_time > 0) {
 		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
So in the good case we are requesting deeper states more often than in the fail case, with the exception of state = 5. Looks like the menu governor now interprets our idleness differently with the scheduler callback than with the deferred timer.

Time: 369969 uS  Fail case
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=1 " cpu-idle-tracer-fail-case
38
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=2 " cpu-idle-tracer-fail-case
10
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=3 " cpu-idle-tracer-fail-case
8
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4 " cpu-idle-tracer-fail-case
2
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=5 " cpu-idle-tracer-fail-case
184
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=6 " cpu-idle-tracer-fail-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=7 " cpu-idle-tracer-fail-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4294967295 " cpu-idle-tracer-fail-case
244
Total = 486

Time: 375667 uS  Good case
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=1 " cpu-idle-tracer-good-case
27
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=2 " cpu-idle-tracer-good-case
23
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=3 " cpu-idle-tracer-good-case
27
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4 " cpu-idle-tracer-good-case
3
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=5 " cpu-idle-tracer-good-case
164
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=6 " cpu-idle-tracer-good-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4294967295 " cpu-idle-tracer-good-case
244
Total = 488
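(For the record, the same per-state counts can be produced in one go with a small loop; a sketch using the fail-case file name from the attachment:)

$ for s in 1 2 3 4 5 6 7 4294967295; do echo -n "state=$s: "; grep -c "state=$s " cpu-idle-tracer-fail-case; done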
Created attachment 211821 [details] cpu_idle trace point good case The trace from comment #44 unzipped.
Created attachment 211831 [details] cpu_idle trace point failing case The trace from comment #45 unzipped.
Doug, my suggested code was wrong. I am testing with something like this, with my simulated short load.

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4b64452..7a8a60d 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -114,6 +114,7 @@ struct cpudata {
 	u64	prev_mperf;
 	u64	prev_tsc;
 	u64	prev_cummulative_iowait;
+	u64	low_cpu_util_cnt;
 	struct sample sample;
 };
 
@@ -957,6 +958,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
 	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
+	int32_t core_util;
 	u64 duration_ns;
 
 	intel_pstate_calc_busy(cpu);
@@ -984,6 +986,19 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 	 * enough period of time to adjust our busyness.
 	 */
 	duration_ns = cpu->sample.time - cpu->last_sample_time;
+	core_util = div64_u64(int_tofp(100) * cpu->sample.mperf, cpu->sample.tsc);
+	if (fp_toint(core_util) < 2) {
+		cpu->low_cpu_util_cnt++;
+		if (cpu->low_cpu_util_cnt > 3) {
+			u64 _duration_ns;
+
+			_duration_ns = pid_params.sample_rate_ns * cpu->low_cpu_util_cnt;
+			if (_duration_ns > duration_ns)
+				duration_ns = _duration_ns;
+		}
+	} else {
+		cpu->low_cpu_util_cnt = 0;
+	}
 	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
 	    && cpu->last_sample_time > 0) {
 		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
@Jörg: Thanks for the info!

Note: Generally, it is better to attach plain text or HTML here, because people can easily look at that from mobile devices too. And please always remember to set the appropriate MIME type for your attachments.

The comparison of timer_stats from the good and failing cases is interesting. One thing noticeable to me is that in the failing case the vast majority of timer events seem to come from schedule_hrtimeout_range_clock() or tick_nohz_restart() (ie. highres timers), while in the good case there is a number of intel_pstate_timer_func() invocations (ie. timer wheel). I'm not sure if/how relevant that is, though.

It would take some time to process the cpu_idle traces I think, but maybe it's not necessary. For a quick comparison here's the number of different idle states requested by the kernel (state / times requested - failing case / times requested - good case):

Poll / 3   / 1
C1   / 38  / 27
C1E  / 10  / 23
C3   / 8   / 27
C6   / 2   / 3
C7s  / 184 / 164

That means that in the failing case we got slightly more timer events with shorter timeouts (poll, C1) but overall there were more events in the C3 target residency range in the good case. In turn, in the failing case we requested C5 more often. That indicates that we only spend 50% of the time in C1 in the failing case because the processor demotes us and not because we ask for C1.
Created attachment 211881 [details] intel_pstate: Use %busy-based algorithm on Core Since cpuidle appears to be working correctly, let's focus on intel_pstate. Jörg: Please check this patch instead of the revert from comment #1 and report back.
(In reply to Rafael J. Wysocki from comment #56) > That means that in the failing case we got slightly more timer events with > shorter timeouts (poll, C1) but overall there were more events in the C3 > target residency range in the good case. In turn, in the failing case we > requested C5 more often. That should be: "in the failing case we requested C7s more often".
(In reply to Srinivas Pandruvada from comment #55)
> Doug, my suggested code was wrong. I am testing with something like this,
> with my simulated short load.

I would do something else.

> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 4b64452..7a8a60d 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -114,6 +114,7 @@ struct cpudata {
>  	u64	prev_mperf;
>  	u64	prev_tsc;
>  	u64	prev_cummulative_iowait;
> +	u64	low_cpu_util_cnt;
>  	struct sample sample;
>  };
>
> @@ -957,6 +958,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
>  static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>  {
>  	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
> +	int32_t core_util;
>  	u64 duration_ns;
>
>  	intel_pstate_calc_busy(cpu);
> @@ -984,6 +986,19 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>  	 * enough period of time to adjust our busyness.
>  	 */
>  	duration_ns = cpu->sample.time - cpu->last_sample_time;
> +	core_util = div64_u64(int_tofp(100) * cpu->sample.mperf, cpu->sample.tsc);

I would do this only if the duration_ns check below returns false.

> +	if (fp_toint(core_util) < 2) {

Here I'd simply multiply core_busy by core_util.

> +		cpu->low_cpu_util_cnt++;
> +		if (cpu->low_cpu_util_cnt > 3) {
> +			u64 _duration_ns;
> +
> +			_duration_ns = pid_params.sample_rate_ns * cpu->low_cpu_util_cnt;
> +			if (_duration_ns > duration_ns)
> +				duration_ns = _duration_ns;
> +		}
> +	} else {
> +		cpu->low_cpu_util_cnt = 0;
> +	}
>  	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
>  	    && cpu->last_sample_time > 0) {
>  		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),

I'll attach that patch later.
I was trying to have the least impact, not radically change the algorithm. I was trying to see if we can treat the case of core util < 2 as as good as idle after 3 consecutive time periods, very similar to what is done now. I did test on a server and a client, and no visible difference in performance was observed.
Created attachment 211891 [details] intel_pstate: Scale core_busy by %busy if super-lightly loaded Another patch to try if someone is brave enough. It is a bit more aggressive than the last Srinivas' one, but simpler.
(In reply to Srinivas Pandruvada from comment #60)
> I was trying to have the least impact, not radically change the algorithm.
> I was trying to see if we can treat the case of core util < 2 as as good as
> idle after 3 consecutive time periods, very similar to what is done now.
> I did test on a server and a client, and no visible difference in
> performance was observed.

OK, I see. It could be simpler still IMO.

That said, please check if the one I attached above has any significant performance impact.
I will give it a try. If you reduce after one time period, we may have a performance impact compared to what we have now (I think).
(In reply to Rafael J. Wysocki from comment #57) > Created attachment 211881 [details] > intel_pstate: Use %busy-based algorithm on Core > > Since cpuidle appears to be working correctly, let's focus on intel_pstate. > > Jörg: Please check this patch instead of the revert from comment #1 and > report back. I prefer a load based algorithm. I have plenty of test data from a few users (albeit with a different load based algorithm) collected over the last year. Almost two years ago, a group of us spent a couple of months on various hybrid algorithms. However, the timing of the driver is different now.
I ran SpecPower and 40+ Phoronix benchmarks on Rafael's change "intel_pstate: Scale core_busy by %busy if super-lightly loaded". I didn't see any deterioration in performance. I usually run several rounds, but first I want to confirm that this fixes the issue. @Jörg, please try the latest attachment: "intel_pstate: Scale core_busy by %busy if super-lightly loaded"
I tried "Use %busy-based algorithm on Core" patch. - Idle is back to 800Mhz. - Frequency scaling is less jumpy than usual. - Kernel compiling is as fast as usual. For me a very good approach.
I tried "Scale core_busy by %busy if super-lightly loaded" patch. - Idle is back to 800Mhz. - Tends to higher frequencies than V4.5 - Tends to higher frequencies than "Use %busy-based algorithm on Core" patch. - "Use %busy-based algorithm on Core" tends to lower frequencies than V4.5 I'd prefer "Use %busy-based algorithm on Core". I had to modify the patch because the old content did not match: diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 4b64452..75e1865 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -989,6 +989,10 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu) sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns), int_tofp(duration_ns)); core_busy = mul_fp(core_busy, sample_ratio); + } else { + sample_ratio = div_fp(cpu->sample.mperf * 100, cpu->sample.tsc); + if (sample_ratio < int_tofp(1)) + core_busy = mul_fp(core_busy, sample_ratio); } cpu->sample.busy_scaled = core_busy;
(In reply to jrg.otte from comment #67) Thanks for the testing! > I tried "Scale core_busy by %busy if super-lightly loaded" patch. > > - Idle is back to 800Mhz. Good. > - Tends to higher frequencies than V4.5 > - Tends to higher frequencies than "Use %busy-based algorithm on Core" patch. > > - "Use %busy-based algorithm on Core" tends to lower frequencies than V4.5 It will do that, but there's a bunch of people who won't like it. > I'd prefer "Use %busy-based algorithm on Core". Well, first of all, we need a minimum fix for v4.6, so I'd like to use the "Scale core_busy by %busy if super-lightly loaded" one for that if it is good enough. Second, get_target_pstate_use_cpu_load() has its disadvantages too, like the get_cpu_iowait_time_us() called by it which involves doing a ktime_get() which may be quite costly on some setups (from the perspective of the scheduler paths invoking intel_pstate at least). We will need to use some kind of %busy-based algorithm on Core going forward, but that will need some serious consideration. If you can help to test patches, please let me know. For now, please attach the output of "turbostat --debug" for both the "Use %busy-based algorithm on Core" and "Scale core_busy by %busy if super-lightly loaded" patches.
(In reply to Rafael J. Wysocki from comment #61) > Created attachment 211891 [details] > intel_pstate: Scale core_busy by %busy if super-lightly loaded > > Another patch to try if someone is brave enough. > > It is a bit more aggressive than the last Srinivas' one, but simpler. For my test case, this patch does not work. Why Not? The threshold is too low. With a threshold type solution the question is what if the problem scenario is just above the threshold.
(In reply to Doug Smythies from comment #69) > (In reply to Rafael J. Wysocki from comment #61) > > Created attachment 211891 [details] > > intel_pstate: Scale core_busy by %busy if super-lightly loaded > > > > Another patch to try if someone is brave enough. > > > > It is a bit more aggressive than the last Srinivas' one, but simpler. > > For my test case, this patch does not work. Why Not? The threshold is too > low. With a threshold type solution the question is what if the problem > scenario is just above the threshold. Right, but that applies to all threshold values, doesn't it? What threshold value would work for you, then? Or is there any?
(In reply to Rafael J. Wysocki from comment #70) > (In reply to Doug Smythies from comment #69) > > (In reply to Rafael J. Wysocki from comment #61) > > > Created attachment 211891 [details] > > > intel_pstate: Scale core_busy by %busy if super-lightly loaded > > > > > > Another patch to try if someone is brave enough. > > > > > > It is a bit more aggressive than the last Srinivas' one, but simpler. > > > > For my test case, this patch does not work. Why Not? The threshold is too > > low. With a threshold type solution the question is what if the problem > > scenario is just above the threshold. > > Right, but that applies to all threshold values, doesn't it? Yes exactly, which is what we struggled with 2 years ago when trying these type of solutions. > > What threshold value would work for you, then? Or is there any? 5% would work for the test scenario I have right now, but then I could just change the test case to use 6%. I think there is/was an energy regression with the current load based algorithm on SpecPower (not sure why I think that). During the last cycle (around mid to late February) I found that the load algorithm I had been proposing did well on my SpecPower simulated test.
Created attachment 211961 [details] turbostat-debug-busy-based-algorithm-on-Core
Created attachment 211971 [details] urbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded
> If you can help to test patches, please let me know. Sure, I'm up for it. Thanks to all debuggers and contributors! Jörg
Created attachment 211991 [details] intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2) I'm still not entirely fond of the turbostat results with the previous version of this patch, so please apply this one, run "turbostat --debug" with it, and attach the result.
(In reply to Doug Smythies from comment #71) > (In reply to Rafael J. Wysocki from comment #70) > > (In reply to Doug Smythies from comment #69) > > > (In reply to Rafael J. Wysocki from comment #61) > > > > Created attachment 211891 [details] > > > > intel_pstate: Scale core_busy by %busy if super-lightly loaded > > > > > > > > Another patch to try if someone is brave enough. > > > > > > > > It is a bit more aggressive than the last Srinivas' one, but simpler. > > > > > > For my test case, this patch does not work. Why Not? The threshold is too > > > low. With a threshold type solution the question is what if the problem > > > scenario is just above the threshold. > > > > Right, but that applies to all threshold values, doesn't it? > > Yes exactly, which is what we struggled with 2 years ago when trying these > type of solutions. I'm not looking for a final solution at this point, though. The current algorithm is clearly unstable as a relatively small timing change has apparently caused it to produce results that are way off. It is too late in the cycle for us to modify the algorithm significantly, however. We can either go back to using timers with intel_pstate, which I really wouldn't like to do for a couple of reasons (one of them being that it really wouldn't fix the algorithm, but rather hide the problem with it on this particular machine), or we can do something to stabilize it somewhat which is what I'm trying to do.
Note: I wrote this a few hours ago, but it collided with Rafael's edit and I didn't notice.

Based on the turbostat results, I don't think the "super lightly loaded" method threshold was high enough for Jörg's case either. (And I see Rafael has increased the threshold.)

Srinivas: Is there any chance you would be willing to test my version of a load based algorithm on your SpecPower test-bed? I never got the results from Kristen's tests, and suspect it wasn't done with the correct code or "knob" settings anyway. I have not rebased the patch set since kernel 4.3, and so it has to be applied to a 4.3 start point.

References:
http://www.smythies.com/~doug/linux/intel_pstate/build58/0000-cover.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0001-intel_pstate-Use-C0-time-for-busy-calculations-again.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0002-intel_pstate-Calculate-target-pstate-directly.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0003-intel_pstate-Compensate-for-intermediate-durations-v.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0004-intel_pstate-move-utilization-calculation-to-a-separ.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0005-intel_pstate-delete-unused-PID-controller.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0006-intel_pstate-Change-utilization-calculation-to-a-leg.txt
http://www.smythies.com/~doug/linux/intel_pstate/build58/0001-intel_pstate-Doug-patch-set-clamp-scaled-busy.txt

All of the above links are also available in the references section here:
http://www.smythies.com/~doug/linux/intel_pstate/build58/index.html

There are also some notes and graphs comparing the current threshold (setpoint) based load algorithm to the current performance based algorithm and mine at:
http://www.smythies.com/~doug/linux/intel_pstate/philippe_longepe/index.html
(In reply to Rafael J. Wysocki from comment #76) > (In reply to Doug Smythies from comment #71) > > I'm not looking for a final solution at this point, though. Fair enough. > The current algorithm is clearly unstable as a relatively small timing > change has apparently caused it to produce results that are way off. Yes, the current algorithm has problems and always has. This is what I have been complaining about for almost two years now.
Doug, Rebase to the latest and send me.
Doug, FYI, Using scheduler utility I have already tried y = mx + c, type algorithm. I already have this data. Will send offline. The performance/watt number regress by 3 to 4% for specpower.
(In reply to Srinivas Pandruvada from comment #79) > Doug, Rebase to the latest and send me. Srinivas: Due to limited time, I can not right now. I could rebase starting maybe around May 10th. (In reply to Srinivas Pandruvada from comment #80) > Doug, FYI, Using scheduler utility I have already tried y = mx + c, type > algorithm. I already have this data. Will send offline. The performance/watt > number regress by 3 to 4% for specpower. O.K. thanks. Note that, at least in my testing, the values for m and c were very important. The default response curve I use is the result of considerable work.
@Jörg: Did you ever try the Ubuntu mainline PPA kernel?
http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc2-wily/

At some point you mentioned your system was somewhat Ubuntu 14.04 based. Would it be possible to try kernel 4.6-rc2, but based on a 16.04 system? Perhaps via live CD or something. While I still believe there is, and always has been, an issue with the control algorithm, I also think there can be odd things during 14.04 initialization.
@Doug:
> Did you ever try the Ubuntu mainline PPA kernel?
No, but I can give it a try.

> Would it be possible to try kernel 4.6-rc2, but based on a 16.04 system?
> Perhaps via live CD
Don't know how. I have bad experiences upgrading too early. I always wait for the first point release.
@Jörg: I hope you haven't missed comment #75. :-)
@Rafael: Sorry, overlooked! Will come soon.
Created attachment 212001 [details] turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2
(In reply to jrg.otte from comment #86) > Created attachment 212001 [details] > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2 Thanks! So this one seems to be keeping the P-states in check while in idle.
(In reply to Rafael J. Wysocki from comment #87) > (In reply to jrg.otte from comment #86) > > Created attachment 212001 [details] > > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2 > > Thanks! > > So this one seems to be keeping the P-states in check while in idle. Agreed, but the first turbostat sample CPU frequencies seem a little high. It may be a waste of time, but I'd like to work through a trace sample with this kernel. It would help if acquired using "sudo perf record -a --event=power:pstate_sample sleep 100" as then post processing of the resulting perf.data would take 2 minutes instead of hours. I listed 100 seconds above, but would really like 300 seconds. Jörg, maybe your odd kernel configuration is such that you can not use "perf record". I compile perf myself from the tools/perf directory and then put it in my user bin directory. So the command I actually use is: sudo ~/bin/perf record -a --event=power:pstate_sample sleep 300
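For what it's worth, the post-processing just starts from a plain-text dump of those samples; a minimal sketch (the output file name is only an example):

sudo ~/bin/perf script -i perf.data > pstate_samples.txt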
(In reply to Doug Smythies from comment #88) > (In reply to Rafael J. Wysocki from comment #87) > > (In reply to jrg.otte from comment #86) > > > Created attachment 212001 [details] > > > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2 > > > > Thanks! > > > > So this one seems to be keeping the P-states in check while in idle. > > Agreed, but the first turbostat sample CPU frequencies seem a little high. The first sample has higher frequencies in all of the Jörg's turbostat reports, even in the good case. [cut] > Jörg, maybe your odd kernel configuration is such that you can not use "perf > record". I compile perf myself from the tools/perf directory and then put it > in my user bin directory. So the command I actually use is: > > sudo ~/bin/perf record -a --event=power:pstate_sample sleep 300 Well, I guess why not if that can help. In any case, the plan for v4.6 is to go with the patch from comment #75. Going forward, we'll need a new (and %busy-based) algorithm for Core processors.
(In reply to Rafael J. Wysocki from comment #89) > (In reply to Doug Smythies from comment #88) >> (In reply to Rafael J. Wysocki from comment #87) >>> So this one seems to be keeping the P-states in check while in idle. >> >> Agreed, but the first turbostat sample CPU frequencies seem a little high. > > The first sample has higher frequencies in all of the Jörg's turbostat > reports, even in the good case. Disagree. attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU frequencies. attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU frequencies.
(In reply to Doug Smythies from comment #90) > (In reply to Rafael J. Wysocki from comment #89) > > (In reply to Doug Smythies from comment #88) > >> (In reply to Rafael J. Wysocki from comment #87) > > >>> So this one seems to be keeping the P-states in check while in idle. > >> > >> Agreed, but the first turbostat sample CPU frequencies seem a little high. > > > > The first sample has higher frequencies in all of the Jörg's turbostat > > reports, even in the good case. > > Disagree. > attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU > frequencies. > attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU > frequencies. Higher != the highest :-) But the point essentially is that frequencies are not uniform in each report. They vary between intervals, sometimes quite a bit. I'm not sure if it matters that the first interval is higher and the second one is lower frequency etc, because it depends on when the measurement started too.
(In reply to Rafael J. Wysocki from comment #91) > (In reply to Doug Smythies from comment #90) >>>(In reply to Rafael J. Wysocki from comment #89) >>> (In reply to Doug Smythies from comment #88) >>>> (In reply to Rafael J. Wysocki from comment #87) >> >>>>> So this one seems to be keeping the P-states in check while in idle. >>>> >>>> Agreed, but the first turbostat sample CPU frequencies seem a little high. >>> >>> The first sample has higher frequencies in all of the Jörg's turbostat >>> reports, even in the good case. >> >> Disagree. >> attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU >> frequencies. >> attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU >> frequencies. > > Higher != the highest :-) > > But the point essentially is that frequencies are not uniform in each > report. They vary between intervals, sometimes quite a bit. Agreed. > I'm not sure if it matters that the first interval is higher and the second > one is lower frequency etc, because it depends on when the measurement > started too. Agreed. I think this is a reason to acquire 5 minutes worth of trace data to examine.
Created attachment 212081 [details] Screen shots 1 of 2: From trace data. Not normalized case.

(In reply to Rafael J. Wysocki from comment #75)
> Created attachment 211991 [details]
> intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2)
>
> I'm still not entirely fond of the turbostat results with the previous
> version of this patch, so please apply this one, run "turbostat --debug"
> with it, and attach the result.

Rafael: When you doubled the threshold to 2%, shouldn't you have also normalized the adjustment to be 1 at 2%?

I'm suggesting that this:

	if (sample_ratio < int_tofp(2))
		core_busy = mul_fp(core_busy, sample_ratio);

Should be something like this:

	if (sample_ratio < int_tofp(2))
		core_busy = mul_fp(core_busy, sample_ratio / 2);
Created attachment 212091 [details] Screen shots 2 of 2: From trace data. Normalized case.
(In reply to Doug Smythies from comment #93) > Created attachment 212081 [details] > Screen shots 1 of 2: From trace data. Not normalized case. > > (In reply to Rafael J. Wysocki from comment #75) > > Created attachment 211991 [details] > > intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2) > > > > I'm still not entirely fond of the turbostat results with the previous > > version of this patch, so please apply this one run "turbostat --debug" > with > > it and attach the result. > > Rafael: When you doubled the threshold to 2%, shouldn't you have also > normalized the adjustment to be 1 at 2%? > > I'm suggesting that this: > > if (sample_ratio < int_tofp(2)) > core_busy = mul_fp(core_busy, sample_ratio); > > Should be something like this: > > if (sample_ratio < int_tofp(2)) > core_busy = mul_fp(core_busy, sample_ratio / 2); I made a mistake, but I didn't mean the above. I meant to multiply core_busy by the fraction and not by the percentage.
(In reply to Doug Smythies from comment #94) > Created attachment 212091 [details] > Screen shots 2 of 2: From trace data. Normalized case. It would be good to explain what's illustrated by it, even if it's obvious to you.
Created attachment 212101 [details] intel_pstate: Scale core_busy by %busy if super-lightly loaded (v3) Previous versions of this were incorrect, as the scaling should be by the fraction and not by percentage. @Jörg: Please test this one and attach the output of "turbostat --debug", but please let it run for more time (8 intervals say).
(In reply to Rafael J. Wysocki from comment #97)
> Created attachment 212101 [details]
> intel_pstate: Scale core_busy by %busy if super-lightly loaded (v3)
>
> Previous versions of this were incorrect, as the scaling should be by the
> fraction and not by percentage.
>
> @Jörg: Please test this one and attach the output of "turbostat --debug",
> but please let it run for more time (8 intervals say).

Rafael: O.K. That is actually what I thought you were wanting to do originally.

Because there are only 8 fractional bits, it gets a little coarse, but is fine. (the actual threshold is about 2.34%.) Myself, I would have just done this:

	} else {
		sample_ratio = div_fp(cpu->sample.mperf, cpu->sample.tsc);
		if (100 * sample_ratio < int_tofp(2))
			core_busy = 0;
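(For reference, a worked example of where the 2.34% comes from, assuming the driver's FRAC_BITS = 8 fixed point: div_fp(mperf, tsc) is quantized to multiples of 1/256, and 100 * sample_ratio < int_tofp(2) = 512 is satisfied for fixed-point values of sample_ratio up to 5, i.e. for mperf/tsc anywhere below 6/256, which is approximately 2.34%.)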
@Jörg: if possible, I would like to get trace data, as per my comment #88, for the test of Rafael's comment #97.
Created attachment 212111 [details] v3_spec_pwr_hsw_ep0
The SpecPower results with the patch are comparable to the baseline 4.6-rc2. Other client tests are still going on. A notable one is an I/O benchmark regression:

AIO-Stress 0.21
Test: Random Write
MB/s
baseline 4.6-rc2 .. 575.88 |===============================================
+v3patch .. 521.42 |===========================================

They are in MB/s, so more is better.
Created attachment 212121 [details] turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V3
perf record -a --event=power:pstate_sample sleep 300 was successful. So do I need to do any queries on it? Which ones?
(In reply to Doug Smythies from comment #98)
> Because there are only 8 fractional bits, it gets a little coarse, but is
> fine. (the actual threshold is about 2.34%.) Myself, I would have just done
> this:
>
> 	} else {
> 		sample_ratio = div_fp(cpu->sample.mperf, cpu->sample.tsc);
> 		if (100 * sample_ratio < int_tofp(2))
> 			core_busy = 0;

Yes, that is an option, of course, and actually preferred due to fewer computations.

I wonder if lowering the threshold would make a difference, though.
Created attachment 212151 [details] intel_pstate: Scale core_busy by %busy if super-lightly loaded (v4) Simply reset core_busy if %busy is below the threshold and lower the threshold at the same time. @Jörg: Please try this one and attach the "turbostat --debug" output (like before). @Srinivas: Please test this one.
Created attachment 212161 [details] intel_pstate: Scale core_busy by %busy if super-lightly loaded (v5) Cosmetic changes. @Jörg, Srinivas: Please test as per the previous comment.
(In reply to jrg.otte from comment #103)
> perf record -a --event=power:pstate_sample sleep 300 was successful.
> So do I need to do any queries on it? Which ones?

In the same directory that you ran the command from, there will be a file called perf.data. I crudely estimate it will be about 7 megabytes, but it should compress by a factor of roughly 7. Compress that file and attach it, or if there is a size limit, send it to me directly.

At this point, a V5 trace would be preferred.
Created attachment 212211 [details] turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V5
(In reply to Srinivas Pandruvada from comment #101)
> A notable one is an I/O benchmark regression:
>
> AIO-Stress 0.21
> Test: Random Write
> MB/s
> baseline 4.6-rc2 .. 575.88 |===============================================
> +v3patch .. 521.42 |===========================================
>
> They are in MB/s, so more is better.

Srinivas: On my computer I get (sorry, still using rc1; due to high variability I increased the number of runs from the default of 3 to 10):

4.6-rc1     2667.86 MB/s   6.37% standard deviation
4.6-rc1+v3  2750.67 MB/s   0.64% standard deviation
Doug, Yes. These tests show lots of variability. Looks like you are running on a server (Haswell?). I am rerunning tests.
(In reply to jrg.otte from comment #108) > Created attachment 212211 [details] > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V5 OK, that looks reasonable, thanks! Now all depends on the Srinivas' tests results.
Created attachment 212261 [details] IO disk tests Tested the v5 version of the patch attached here. No regression on IO tests. Trying one more system (a Haswell ultrabook).
(In reply to Srinivas Pandruvada from comment #110) > Looks like you are running on a server (Haswell?). I have an older Sandy Bridge i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz) on an Asus P8 Z68-M PRO motherboard.
I tested on a Haswell ultrabook also; the results with patch version v5 are comparable to the baseline. Doug: You get pretty good IO performance.
Cool, thanks! I'll send the patch from comment #106 to the mailing lists then and queue it up as a 4.6 fix if there are no objections against that.
(In reply to Rafael J. Wysocki from comment #115) > I'll send the patch from comment #106 to the mailing lists then and queue it > up as a 4.6 fix if there are no objections against that. No Objection.
(In reply to Doug Smythies from comment #20) > I still think this is the same as or similar to bug 93521. I have asked on > that bug for them to try kernel 4.6-rc1. We have some results from bug 93521. The issue is the same or similar, and patch V5 (the one Rafael also sent to the mailing lists) gives similarly good results.
V5 also helped on a Skylake 6600U system, thank you!
The fix has been merged as http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ffb810563c0c049872a504978e06c8892104fb6c