Bug 115771 - [intel-pstate driver regression] processor frequency very high even if in idle
Summary: [intel-pstate driver regression] processor frequency very high even if in idle
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_pstate
Hardware: Intel Linux
Importance: P1 high
Assignee: Rafael J. Wysocki
URL: http://marc.info/?t=145927241200005&r...
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-01 20:03 UTC by Rafael J. Wysocki
Modified: 2017-01-04 22:13 UTC
CC List: 10 users

See Also:
Kernel Version: 4.6-rc1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Revert of commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) (4.63 KB, patch)
2016-04-01 20:07 UTC, Rafael J. Wysocki
Details | Diff
pstate_sample trace from the affected system - working case (147.44 KB, text/plain)
2016-04-01 20:17 UTC, Rafael J. Wysocki
Details
pstate_sample trace from the affected system - failing case (72.05 KB, text/plain)
2016-04-01 20:20 UTC, Rafael J. Wysocki
Details
Kernel config file from the affected system (83.25 KB, text/plain)
2016-04-01 21:19 UTC, Rafael J. Wysocki
Details
turbo-debug-good_case (4.78 KB, text/plain)
2016-04-03 07:47 UTC, jrg.otte
Details
turbo-debug-failing_case (4.79 KB, text/plain)
2016-04-03 07:49 UTC, jrg.otte
Details
powertop good case (74.16 KB, text/html)
2016-04-03 15:03 UTC, jrg.otte
Details
powertop fail case (69.59 KB, text/html)
2016-04-03 15:04 UTC, jrg.otte
Details
interrupts good case (1.99 KB, text/plain)
2016-04-03 15:04 UTC, jrg.otte
Details
interrupts fail case (1.99 KB, text/plain)
2016-04-03 15:05 UTC, jrg.otte
Details
excerpt from Jorg reverted trace data (45.76 KB, image/png)
2016-04-04 06:45 UTC, Doug Smythies
Details
acpi interrupts (1.77 KB, text/plain)
2016-04-04 08:01 UTC, jrg.otte
Details
acpidump (69.98 KB, application/octet-stream)
2016-04-04 08:02 UTC, jrg.otte
Details
diff-proc-interrupts-fail-case (1.17 KB, text/plain)
2016-04-04 15:25 UTC, jrg.otte
Details
diff-proc-interrupts-good-case (3.63 KB, text/plain)
2016-04-04 15:25 UTC, jrg.otte
Details
hrtimer-tracer-fail-case (1.08 KB, text/plain)
2016-04-04 17:01 UTC, jrg.otte
Details
hrtimer-tracer-good-case (437 bytes, text/plain)
2016-04-04 17:01 UTC, jrg.otte
Details
timer_stats_not_reverted (6.27 KB, text/plain)
2016-04-04 17:02 UTC, jrg.otte
Details
cpu-idle-tracer-fail-case (2.48 KB, application/octet-stream)
2016-04-05 08:14 UTC, jrg.otte
Details
cpu-idle-tracer-good-case (2.47 KB, application/octet-stream)
2016-04-05 08:15 UTC, jrg.otte
Details
timer-stats-good-case (5.66 KB, text/plain)
2016-04-05 14:54 UTC, jrg.otte
Details
Graph 1 of 2 - CPU Frequencies - fail case (33.48 KB, image/png)
2016-04-05 17:45 UTC, Doug Smythies
Details
Graph 2 of 2 - CPU Frequencies - good case (28.74 KB, image/png)
2016-04-05 17:47 UTC, Doug Smythies
Details
cpu_idle trace point good case (38.39 KB, text/plain)
2016-04-05 19:28 UTC, Rafael J. Wysocki
Details
cpu_idle trace point failing case (38.39 KB, text/plain)
2016-04-05 19:29 UTC, Rafael J. Wysocki
Details
intel_pstate: Use %busy-based algorithm on Core (592 bytes, patch)
2016-04-06 00:14 UTC, Rafael J. Wysocki
Details | Diff
intel_pstate: Scale core_busy by %busy if super-lightly loaded (732 bytes, patch)
2016-04-06 00:50 UTC, Rafael J. Wysocki
Details | Diff
turbostat-debug-busy-based-algorithm-on-Core (7.33 KB, text/plain)
2016-04-06 16:36 UTC, jrg.otte
Details
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded (7.17 KB, text/plain)
2016-04-06 16:36 UTC, jrg.otte
Details
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2) (705 bytes, patch)
2016-04-06 19:05 UTC, Rafael J. Wysocki
Details | Diff
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2 (6.17 KB, text/plain)
2016-04-07 14:45 UTC, jrg.otte
Details
Screen shots 1 of 2: From trace data. Not normalized case. (46.42 KB, image/png)
2016-04-08 00:42 UTC, Doug Smythies
Details
Screen shots 2 of 2: From trace data. Normalized case. (85.49 KB, image/png)
2016-04-08 00:43 UTC, Doug Smythies
Details
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v3) (732 bytes, patch)
2016-04-08 02:02 UTC, Rafael J. Wysocki
Details | Diff
v3_spec_pwr_hsw_ep0 (3.15 KB, text/plain)
2016-04-08 06:13 UTC, Srinivas Pandruvada
Details
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V3 (10.62 KB, text/plain)
2016-04-08 08:23 UTC, jrg.otte
Details
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v4) (702 bytes, patch)
2016-04-08 12:30 UTC, Rafael J. Wysocki
Details | Diff
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v5) (702 bytes, patch)
2016-04-08 12:34 UTC, Rafael J. Wysocki
Details | Diff
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V5 (11.05 KB, text/plain)
2016-04-08 14:24 UTC, jrg.otte
Details
IO disk tests (172.49 KB, application/pdf)
2016-04-08 22:06 UTC, Srinivas Pandruvada
Details

Description Rafael J. Wysocki 2016-04-01 20:03:50 UTC
Jörg Otte reports:

in v4.5 and earlier intel-pstate downscaled idle processors (load
0.1-0.2%) to the minimum frequency, in my case 800 MHz.

Now in v4.6-rc1 the behavior has changed dramatically. When idle,
the processor frequency is more or less a few MHz around 2500 MHz.
This is the maximum non-turbo frequency.

No difference between the powersave and performance governors.

I currently use acpi_cpufreq which works as usual.

Processor:
Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz (family: 0x6, model: 0x3c,
stepping: 0x3)

Reference: http://marc.info/?l=linux-acpi&m=145927229609595&w=2
Comment 1 Rafael J. Wysocki 2016-04-01 20:07:15 UTC
Created attachment 211401 [details]
Revert of commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)

Jörg reports that the attached revert of commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) makes the problem go away for him.
Comment 2 Rafael J. Wysocki 2016-04-01 20:10:06 UTC
Suspicion arises that the problem may be related to an initialization issue, so Jörg is requested to test the following two patches:

[1] https://patchwork.kernel.org/patch/8714401/
[2] https://patchwork.kernel.org/patch/8723461/

([2] depends on [1]), but these patches don't make any difference for him.
Comment 3 Rafael J. Wysocki 2016-04-01 20:17:54 UTC
Created attachment 211411 [details]
pstate_sample trace from the affected system - working case

Trace from the pstate_sample trace point collected by Jörg with the revert patch attachment #1 [details] applied.
Comment 4 Rafael J. Wysocki 2016-04-01 20:20:52 UTC
Created attachment 211421 [details]
pstate_sample trace from the affected system - failing case

Trace from the pstate_sample trace point collected by Jörg with the patches from comment #2 applied (but without the revert).
Comment 5 Rafael J. Wysocki 2016-04-01 21:19:07 UTC
Created attachment 211431 [details]
Kernel config file from the affected system
Comment 6 Rafael J. Wysocki 2016-04-02 14:53:46 UTC
@Jörg: After a bit of reconsideration, I think that full output from the most recent turbostat would be useful here in both the failing and working cases.

You can get that by running "turbostat --debug".

I'm mostly interested in the C-state residencies (the CPU%cx and Pkg%pcx columns) and the PkgWatt and CorWatt numbers.
Comment 7 jrg.otte 2016-04-03 07:47:44 UTC
Created attachment 211511 [details]
turbo-debug-good_case
Comment 8 jrg.otte 2016-04-03 07:49:53 UTC
Created attachment 211521 [details]
turbo-debug-failing_case
Comment 9 Rafael J. Wysocki 2016-04-03 13:16:16 UTC
Thanks!

Evidently, in the failing case the processors spend more time in C1 and the package PC2 residency is much less than in the good case.  That translates to increased energy usage.

It looks like the elimination of intel_pstate timers leads to increased activity somewhere else in your system.
Comment 10 Rafael J. Wysocki 2016-04-03 13:28:26 UTC
Please attach /proc/interrupts for both the good and failing cases.

Also please generate powertop reports with "powertop --html --time=30" for both the good and failing cases and attach them too.
Comment 11 Doug Smythies 2016-04-03 14:28:37 UTC
(In reply to Rafael J. Wysocki from comment #9)
> Evidently, in the failing case the processors spend more time in C1 and the
> package PC2 residency is much less that in the good case.  That translates
> to increased energy usage.

While working on another theory (now disproved), from the trace data it was determined (very crudely) that CPU 3 services some event at ~125 Hz (possibly aliased) in the failing case and ~40 Hz in the good case. (I haven't gone into this level of detail for the other CPUs, particularly CPU 2, yet.)

I have been trying, without success, to re-create Jörg's failure scenario.
Comment 12 jrg.otte 2016-04-03 15:03:21 UTC
Created attachment 211531 [details]
powertop good case
Comment 13 jrg.otte 2016-04-03 15:04:02 UTC
Created attachment 211541 [details]
powertop fail case
Comment 14 jrg.otte 2016-04-03 15:04:51 UTC
Created attachment 211551 [details]
interrupts good case
Comment 15 jrg.otte 2016-04-03 15:05:22 UTC
Created attachment 211561 [details]
interrupts fail case
Comment 16 Doug Smythies 2016-04-03 16:00:42 UTC
I forgot that the latest turbostat has irq counts. Very useful.
I tried a simulated low load and a higher interrupt rate (than normal, for me) on my computer. I did: CPU 0 520 Hz; CPU 1 515 Hz; CPU 2 75 Hz; CPU 3 390 Hz. Still, the resulting CPU frequencies were fine.
Comment 17 Srinivas Pandruvada 2016-04-03 18:51:12 UTC
I tried on three systems with the same CPU model and very similar kernel configs, with minor changes. I couldn't reproduce it.
I am generating one periodic interrupt and triggering a softirq where I do small processing to bump up frequencies. I am able to reach higher frequencies, but not in the turbo range.
Comment 18 Rafael J. Wysocki 2016-04-04 00:51:10 UTC
@Jörg: Thanks for the attachments!

The /proc/interrupts files contain an indication of what may be the source of the problem: you get an insanely large number of ACPI interrupts in both the good and the failing cases.  That number is two or three orders of magnitude too high, which usually means that something really odd is happening.

And that's where there's a difference:

Good:
  9:       3156      34845       2061      47444   IO-APIC   9-fasteoi   acpi

Failing:
  9:       4938      17921       3161      72687   IO-APIC   9-fasteoi   acpi

as in the failing case you got around 10000 more of those interrupts.

Now, if you look at the numbers of wakeups as reported by powertop, there's a difference too.  In the failing case, the primary source of wakeups is hrtimer_wakeup, which also is highly unusual AFAICS.

It looks like the elimination of intel_pstate timers induces more usage of highres timers by something.  Question is by what and why, but to answer that we need to take the ACPI noise out of the picture first.
Comment 19 Rafael J. Wysocki 2016-04-04 00:58:02 UTC
@Jörg: Please attach the output of acpidump from your system.

Also please do

# cd /sys/firmware/acpi/interrupts/
# for f in *; do echo $f; cat $f; done > /tmp/acpi-interrupts.txt

and attach the /tmp/acpi-interrupts.txt file.
Comment 20 Doug Smythies 2016-04-04 06:45:09 UTC
Created attachment 211611 [details]
excerpt from Jorg reverted trace data

I still think this is the same as or similar to bug 93521. I have asked on that bug for them to try kernel 4.6-rc1.

There is a difference in the number of passes through the intel_pstate driver between the non-reverted and reverted trace data. One reason is that in the reverted case, the driver is not called if the CPU is idle on a jiffy boundary. When it does happen that the driver is active on a jiffy boundary, there is often a long duration involved, which drives down the target pstate, whereas otherwise it would have stayed higher (as in the non-reverted case).

The attached screenshot of the reverted trace data details a clear area where this is occurring with CPU 3. Note: other sometimes-active CPUs have higher target pstates (not shown). Notice the very low load on CPU 3. CPU 3 jumps way up in target pstate due to the higher active frequency caused by another CPU, line 855. Now the CPU requests pstate 26 for several more passes, until the long duration scale adjustment kicks in at line 860. Based on aperf clock cycles, it seems about 5 events actually took place in that time period. The point is that the target pstate got driven down by long durations in the reverted case. Under these conditions there will not be long durations in the non-reverted case (which, in general, is a good thing).
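(For reference, the "long duration scale adjustment" mentioned above is the existing logic in get_target_pstate_use_performance(); it is also visible as context in the diffs quoted later in this thread. A paraphrased excerpt, with comments: when a sample spans more than three sample periods, core_busy is scaled down proportionally.)

	/* time since the previous sample for this CPU */
	duration_ns = cpu->sample.time - cpu->last_sample_time;
	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
	    && cpu->last_sample_time > 0) {
		/* sample is "stale": scale busyness down by expected/actual duration */
		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
				      int_tofp(duration_ns));
		core_busy = mul_fp(core_busy, sample_ratio);
	}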
Comment 21 jrg.otte 2016-04-04 07:58:11 UTC
@Rafael: 
> /sys/firmware/acpi/interrupts/

This reminds me of a big problem I had a year ago after installing Linux on the notebook:

I had a kworker thread consuming 80% CPU time. I found a workaround for that problem in an Ubuntu forum. The workaround was to disable /sys/firmware/acpi/interrupts/gpe13 (echo "disable" > gpe13).

To disable gpe13 during boot as early as possible I created an upstart job to disable gpe13 as soon as the /sys filesystem has been mounted. This worked so well for a year that I had completely forgotten about this problem until today.

--
start on mounted MOUNTPOINT=/sys

script
echo "disable" > /sys/firmware/acpi/interrupts/gpe13
exit 0

end script
--
Comment 22 jrg.otte 2016-04-04 08:01:14 UTC
Created attachment 211621 [details]
acpi interrupts
Comment 23 jrg.otte 2016-04-04 08:02:43 UTC
Created attachment 211631 [details]
acpidump
Comment 24 Rafael J. Wysocki 2016-04-04 10:36:15 UTC
(In reply to jrg.otte from comment #21)
> @Rafael: 
> > /sys/firmware/acpi/interrupts/
> 
> This reminds me on a big problem I had a year ago after installing Linux on
> the notebook:
> 
> I had a kworker-thread consuming 80% cpu time. I found in an Ubuntu forum a
> workaround for that problem. The workaround was to disable
> /sys/firmware/acpi/interrupts/gpe13 (echo "disable" > gpe13).
> 
> To disable gpe13 during boot as early as possible I created an upstart job
> to disable gpe13 as soon as the /sys filesystem has been mounted. This
> worked so well for a year that I completely forgotten this problem, till
> today.
> 
> --
> start on mounted MOUNTPOINT=/sys
> 
> script
> echo "disable" > /sys/firmware/acpi/interrupts/gpe13
> exit 0
> 
> end script
> --

That's good information.  This means that the ACPI interrupts you get probably almost all happen before you disable GPE 0x13.

To confirm that, please check if the number of ACPI interrupts in /proc/interrupts changes significantly after you've disabled GPE 0x13.
Comment 25 Rafael J. Wysocki 2016-04-04 10:39:43 UTC
(In reply to Doug Smythies from comment #20)
> Created attachment 211611 [details]
> excerpt from Jorg reverted trace data
> 
> I still think this is the same as or similar to bug 93521. I have asked on
> that bug for them to try kernel 4.6-rc1.
> 
> There is a difference in the number of passes through the intel_pstate
> driver between the non reverted and reverted trace data. One reason, is that
> in the reverted case, the driver is not called if the CPU is idle on a jiffy
> boundary.

This is correct, but I still would like to explain the increased number of hrtimer_wakeup wakeups in the powertop report for the failing case.
Comment 26 Rafael J. Wysocki 2016-04-04 10:45:30 UTC
@Jörg: Actually, in both the good and failing cases please save /proc/interrupts at one point and then again after around 30 s, produce the diff between the two and attach the result.
Comment 27 Doug Smythies 2016-04-04 14:07:45 UTC
I think timer stats might help provide insight.
as root:

1.) clear anything that might be there already
# echo "0" > /proc/timer_stats

2.) enable:
# echo "1" > /proc/timer_stats

3.) wait a few seconds, perhaps the 30 that Rafael suggested.

4.) extract a snapshot, redirecting to a file
# cat /proc/timer_stats >timer_stats_not_reverted

5.) disable (holds at last snapshot i.e. steps 4 and 5 can be reversed).
# echo "0" > /proc/timer_stats
Comment 28 Rafael J. Wysocki 2016-04-04 14:53:40 UTC
@Doug: Agreed.

@Jörg: In addition to that, since you have all of the tracing goodies enabled in the kernel config already, please do as follows:

# cd /sys/kernel/debug/tracing/
(or wherever your tracing directory is located)
# echo 0 > tracing_on
# echo hrtimer_init_sleeper > set_ftrace_filter
# echo function > current_tracer
# echo 1 > options/func_stack_trace
# echo 1 > tracing_on

and save the "trace" file after around 10 s in both the good and the failing cases.

Please attach the results.
Comment 29 jrg.otte 2016-04-04 15:24:06 UTC
@Rafael:
> please check if the number of ACPI interrupts in /proc/interrupts
no ACPI interrupts within 15 minutes.

@Doug
> /proc/timer_stats
I'm missing timer_stats there is only /proc/timer_list. Do I need a special
configuration option? Which?
Comment 30 jrg.otte 2016-04-04 15:25:11 UTC
Created attachment 211651 [details]
diff-proc-interrupts-fail-case
Comment 31 jrg.otte 2016-04-04 15:25:42 UTC
Created attachment 211661 [details]
diff-proc-interrupts-good-case
Comment 32 Doug Smythies 2016-04-04 15:32:40 UTC
(In reply to jrg.otte from comment #29)
> @Doug
> > /proc/timer_stats
> I'm missing timer_stats there is only /proc/timer_list. Do I need a special
> configuration option? Which?

Well, I guess so (Sorry, it did not occur to me that you might not have it).
I am guessing at this one, which I have and you do not:

CONFIG_TIMER_STATS=y
Comment 33 jrg.otte 2016-04-04 17:01:15 UTC
Created attachment 211671 [details]
hrtimer-tracer-fail-case
Comment 34 jrg.otte 2016-04-04 17:01:36 UTC
Created attachment 211681 [details]
hrtimer-tracer-good-case
Comment 35 jrg.otte 2016-04-04 17:02:04 UTC
Created attachment 211691 [details]
timer_stats_not_reverted
Comment 36 Srinivas Pandruvada 2016-04-04 17:57:10 UTC
I only see timer_stats for the bad case; do you also have it for the good case?
Comment 37 Rafael J. Wysocki 2016-04-04 21:23:57 UTC
@Jörg: Thanks for all the info, but the picture is still rather murky (to me at least).

Please also collect traces from the events/power/cpu_idle trace point for both the good and the failing cases and attach them.
Comment 38 Rafael J. Wysocki 2016-04-04 22:03:31 UTC
Interestingly enough, the average number of interrupts per time unit is roughly the same in both the good and the failing cases.  This means that the average number of times the CPUs are taken out of idle states is roughly the same in both cases too.

Nevertheless, a significant difference is visible in the turbostat and powertop reports which show that C1 and polling are used in the failing case (roughly 16% of the time) while in the good one the CPUs seem to be entering C7s all the time.

To me, there are two possible explanations of that.  One (A) is that the distribution of timer events has changed, so they tend to clump together in the failing case and we get timer interrupts really close to each other on the one hand and ones that are farther apart on the other.  The other (B) is that for some obscure reason the tick_nohz_get_sleep_length() computation is confused somehow and returns incorrect results.

The traces from the cpu_idle trace point should help us to find out which of them really happens.
Comment 39 Rafael J. Wysocki 2016-04-04 22:13:40 UTC
It also would be good to check if changing HZ from 300 to 1000 changes the picture in any way.
Comment 40 Doug Smythies 2016-04-04 22:18:41 UTC
Rafael: I agree with your comments. My comment #16 is wrong, as I had thought the turbostat sample time was 1 second instead of 5. Everything agrees: roughly 300 IRQs/second across all CPUs. I am not familiar with the cpu_idle trace stuff, but just for completeness I would still like to see timer_stats_reverted.
Comment 41 Rafael J. Wysocki 2016-04-04 22:47:23 UTC
(In reply to Doug Smythies from comment #40)
> Rafael: I agree with your comments. My comment #16 is wrong, as I had
> thought the turbostat sample time was 1 second instead of 5. Everything
> agrees, roughly 300 IRQs / second across all CPUs. I am not familiar with
> cpu_idle trace stuff, but just for completeness would still like to see
> timer_stats_reverted.

Agreed.

However, I misread the powertop reports, because I was only looking at CPU0.

If you look at all of them, things start to look different, especially for CPU2 in the good case (where the percentages don't even add up to 100).

I trust turbostat more. :-)
Comment 42 Len Brown 2016-04-05 00:58:07 UTC
Does this still happen if you boot in single-user-mode?

How about if you boot with "maxcpus=1"
Comment 43 jrg.otte 2016-04-05 08:14:50 UTC
Created attachment 211751 [details]
cpu-idle-tracer-fail-case
Comment 44 jrg.otte 2016-04-05 08:15:14 UTC
Created attachment 211761 [details]
cpu-idle-tracer-good-case
Comment 45 jrg.otte 2016-04-05 14:54:25 UTC
Created attachment 211791 [details]
timer-stats-good-case
Comment 46 jrg.otte 2016-04-05 14:56:00 UTC
Tried the good and fail cases with 1000 Hz.
For me, no visible change.
Comment 47 Len Brown 2016-04-05 16:09:28 UTC
Does this still happen if you boot in single-user-mode?

How about if you boot with "maxcpus=1"
Comment 48 jrg.otte 2016-04-05 17:22:20 UTC
maxcpus=1 : no visible change for me.

single-user-mode : unfortunately my system does not boot to single-user mode.
With the Ubuntu kernel I get a black screen with the backlight on. With my kernel I see it run up to starting udev, but then I get no console prompt.
Comment 49 Doug Smythies 2016-04-05 17:45:42 UTC
Created attachment 211801 [details]
Graph 1 of 2 - CPU Frequencies - fail case

I have continued to attempt to re-create Jörg's problem on my computer.
I have a repeatable (small sample space, so far) way to show 9% extra energy consumption and an average increase in CPU frequency of around 5 p-states.

I have not been able to get the patch in comment 1 to apply, so I have been using kernel 4.5-rc7 as my "good" reference control sample.

The "fail" kernel is 4.6-rc1 at 300Hz and scheduler set to noop.
The test work flow is 1.5% load (at nominal CPU frequency) on each CPU.
CPUs 0 and 4 at 104.73 Hz.
CPUs 1 and 3 at 33.67 Hz.
CPUs 2 and 6 at 112.43 Hz.
CPUs 3 and 7 at 71.93 Hz.
The work flow times between CPUs on the same core are NOT staggered.
Staggering by approximately 1/2 period was tested, and NOT staggered seemed a better test condition (meaning worse for higher CPU frequency), but not convincingly so.
The above frequencies were extracted from Jörg's attached test data.
Comment 50 Doug Smythies 2016-04-05 17:47:26 UTC
Created attachment 211811 [details]
Graph 2 of 2 - CPU Frequencies - good case
Comment 51 Srinivas Pandruvada 2016-04-05 18:02:47 UTC
Doug, 
Does something like this help in your case:

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4b64452..54654af 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -114,6 +114,7 @@ struct cpudata {
        u64     prev_mperf;
        u64     prev_tsc;
        u64     prev_cummulative_iowait;
+       u64     low_cpu_util_cnt;
        struct sample sample;
 };
 
@@ -957,6 +958,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
        int32_t core_busy, max_pstate, current_pstate, sample_ratio;
+       int32_t core_util;
        u64 duration_ns;
 
        intel_pstate_calc_busy(cpu);
@@ -984,6 +986,15 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
         * enough period of time to adjust our busyness.
         */
        duration_ns = cpu->sample.time - cpu->last_sample_time;
+       core_util = div64_u64(int_tofp(100) * cpu->sample.mperf, cpu->sample.tsc);
+       if (fp_toint(core_util) < 2) {
+               cpu->low_cpu_util_cnt++;
+               if (cpu->low_cpu_util_cnt > 2) {
+                       duration_ns = cpu->sample.time * cpu->low_cpu_util_cnt;
+               }
+       } else {
+               cpu->low_cpu_util_cnt = 0;
+       }
        if ((s64)duration_ns > pid_params.sample_rate_ns * 3
            && cpu->last_sample_time > 0) {
                sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
Comment 52 Srinivas Pandruvada 2016-04-05 18:45:54 UTC
So in the good case we are requesting deeper states more often than in the fail case, with the exception of state = 5. It looks like the menu governor now interprets our idleness differently with the scheduler callbacks than with the deferrable timer.

Time: 369969 uS 
Fail case 
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=1 " cpu-idle-tracer-fail-case
38
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=2 " cpu-idle-tracer-fail-case
10
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=3 " cpu-idle-tracer-fail-case
8
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4 " cpu-idle-tracer-fail-case
2
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=5 " cpu-idle-tracer-fail-case
184
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=6 " cpu-idle-tracer-fail-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=7 " cpu-idle-tracer-fail-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4294967295 " cpu-idle-tracer-fail-case
244

Total = 486 

Time: 375667 uS
Good case
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=1 " cpu-idle-tracer-good-case
27
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=2 " cpu-idle-tracer-good-case
23
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=3 " cpu-idle-tracer-good-case
27
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4 " cpu-idle-tracer-good-case
3
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=5 " cpu-idle-tracer-good-case
164
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=6 " cpu-idle-tracer-good-case
0
[spandruv@spandruv-desk3 Downloads]$ grep -c "state=4294967295 " cpu-idle-tracer-good-case
244

Total = 488
Comment 53 Rafael J. Wysocki 2016-04-05 19:28:20 UTC
Created attachment 211821 [details]
cpu_idle trace point good case

The trace from comment #44 unzipped.
Comment 54 Rafael J. Wysocki 2016-04-05 19:29:17 UTC
Created attachment 211831 [details]
cpu_idle trace point failing case

The trace from comment #45 unzipped.
Comment 55 Srinivas Pandruvada 2016-04-05 20:30:51 UTC
Doug, my suggested code was wrong. I am testing with something like this, with my simulated short load.

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4b64452..7a8a60d 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -114,6 +114,7 @@ struct cpudata {
        u64     prev_mperf;
        u64     prev_tsc;
        u64     prev_cummulative_iowait;
+       u64     low_cpu_util_cnt;
        struct sample sample;
 };
 
@@ -957,6 +958,7 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
        int32_t core_busy, max_pstate, current_pstate, sample_ratio;
+       int32_t core_util;
        u64 duration_ns;
 
        intel_pstate_calc_busy(cpu);
@@ -984,6 +986,19 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
         * enough period of time to adjust our busyness.
         */
        duration_ns = cpu->sample.time - cpu->last_sample_time;
+       core_util = div64_u64(int_tofp(100) * cpu->sample.mperf, cpu->sample.tsc);
+       if (fp_toint(core_util) < 2) {
+               cpu->low_cpu_util_cnt++;
+               if (cpu->low_cpu_util_cnt > 3) {
+                       u64 _duration_ns;
+
+                       _duration_ns = pid_params.sample_rate_ns * cpu->low_cpu_util_cnt;
+                       if (_duration_ns > duration_ns)
+                               duration_ns = _duration_ns;
+               }
+       } else {
+               cpu->low_cpu_util_cnt = 0;
+       }
        if ((s64)duration_ns > pid_params.sample_rate_ns * 3
            && cpu->last_sample_time > 0) {
                sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
Comment 56 Rafael J. Wysocki 2016-04-05 21:42:23 UTC
@Jörg: Thanks for the info!

Note: Generally, it is better to attach plain text or HTML here, because people can easily look at that from mobile devices too.  And please always remember to set the appropriate MIME type for your attachments.

The comparison of timer_stats from the good and failing cases is interesting.  One thing noticeable to me is that in the failing case the vast majority of timer events seem to come from schedule_hrtimeout_range_clock() or tick_nohz_restart() (i.e. highres timers), while in the good case there are a number of intel_pstate_timer_func() invocations (i.e. the timer wheel).  I'm not sure if/how relevant that is, though.

It would take some time to process the cpu_idle traces I think, but maybe it's not necessary.  For a quick comparison here's the number of different idle states requested by the kernel (state / times requested - failing case / times requested - good case):

Poll /   3 /   1
C1   /  38 /  27
C1E  /  10 /  23
C3   /   8 /  27
C6   /   2 /   3
C7s  / 184 / 164

That means that in the failing case we got slightly more timer events with shorter timeouts (poll, C1) but overall there were more events in the C3 target residency range in the good case.  In turn, in the failing case we requested C5 more often.

That indicates that we only spend 50% of the time in C1 in the failing case because the processor demotes us and not because we ask for C1.
Comment 57 Rafael J. Wysocki 2016-04-06 00:14:40 UTC
Created attachment 211881 [details]
intel_pstate: Use %busy-based algorithm on Core

Since cpuidle appears to be working correctly, let's focus on intel_pstate.

Jörg: Please check this patch instead of the revert from comment #1 and report back.
Comment 58 Rafael J. Wysocki 2016-04-06 00:16:40 UTC
(In reply to Rafael J. Wysocki from comment #56)

> That means that in the failing case we got slightly more timer events with
> shorter timeouts (poll, C1) but overall there were more events in the C3
> target residency range in the good case.  In turn, in the failing case we
> requested C5 more often.

That should be: "in the failing case we requested C7s more often".
Comment 59 Rafael J. Wysocki 2016-04-06 00:23:14 UTC
(In reply to Srinivas Pandruvada from comment #55)
> Doug, My suggested code was wrong. I testing with something like this, with
> my simulated short load. 

I would do something else.

> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 4b64452..7a8a60d 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -114,6 +114,7 @@ struct cpudata {
>         u64     prev_mperf;
>         u64     prev_tsc;
>         u64     prev_cummulative_iowait;
> +       u64     low_cpu_util_cnt;
>         struct sample sample;
>  };
>  
> @@ -957,6 +958,7 @@ static inline int32_t
> get_target_pstate_use_cpu_load(struct cpudata *cpu)
>  static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>  {
>         int32_t core_busy, max_pstate, current_pstate, sample_ratio;
> +       int32_t core_util;
>         u64 duration_ns;
>  
>         intel_pstate_calc_busy(cpu);
> @@ -984,6 +986,19 @@ static inline int32_t
> get_target_pstate_use_performance(struct cpudata *cpu)
>          * enough period of time to adjust our busyness.
>          */
>         duration_ns = cpu->sample.time - cpu->last_sample_time;
> +       core_util = div64_u64(int_tofp(100) * cpu->sample.mperf,
> cpu->sample.tsc);

I would do this only if the duration_ns check below returns false.

> +       if (fp_toint(core_util) < 2) {

Here I'd simply multiply core_busy by core_util.

> +               cpu->low_cpu_util_cnt++;
> +               if (cpu->low_cpu_util_cnt > 3) {
> +                       u64 _duration_ns;
> +
> +                       _duration_ns = pid_params.sample_rate_ns *
> cpu->low_cpu_util_cnt;
> +                       if (_duration_ns > duration_ns)
> +                               duration_ns = _duration_ns;
> +               }
> +       } else {
> +               cpu->low_cpu_util_cnt = 0;
> +       }
>         if ((s64)duration_ns > pid_params.sample_rate_ns * 3
>             && cpu->last_sample_time > 0) {
>                 sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),

I'll attach that patch later.
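(A minimal sketch of the restructuring described above, using the same fixed-point helpers as the diff being reviewed; this is essentially the shape of the attachment that Jörg later quotes in comment #67.)

	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
	    && cpu->last_sample_time > 0) {
		/* existing long-duration scaling, unchanged */
		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
				      int_tofp(duration_ns));
		core_busy = mul_fp(core_busy, sample_ratio);
	} else {
		/* only computed when the duration check above is false */
		sample_ratio = div_fp(cpu->sample.mperf * 100, cpu->sample.tsc);
		/* super-lightly loaded (< ~1% busy): damp core_busy by the ratio */
		if (sample_ratio < int_tofp(1))
			core_busy = mul_fp(core_busy, sample_ratio);
	}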
Comment 60 Srinivas Pandruvada 2016-04-06 00:31:14 UTC
I was trying to have the least impact, not radically change the algorithm. I was trying to treat the case of core util < 2 as being as good as idle after 3 consecutive time periods, very similar to what is done now.
I did test on a server and a client, and no visible difference in performance was observed.
Comment 61 Rafael J. Wysocki 2016-04-06 00:50:15 UTC
Created attachment 211891 [details]
intel_pstate: Scale core_busy by %busy if super-lightly loaded

Another patch to try if someone is brave enough.

It is a bit more aggressive than the last Srinivas' one, but simpler.
Comment 62 Rafael J. Wysocki 2016-04-06 00:54:57 UTC
(In reply to Srinivas Pandruvada from comment #60)
> I was trying to have least impact not radically change algorithm. I was
> trying if we treat case of core util < 2 is as good as idle after 3
> consecutive time periods, very similar to what is done now.
> I did test on server and client, and no visible difference in performance
> was observed.

OK, I see.  It could be simpler still IMO.  That said, please check if the one I attached above has any significant performance impact.
Comment 63 Srinivas Pandruvada 2016-04-06 01:11:03 UTC
I will give it a try. If you reduce after one time period, we may have a performance impact compared to what we have now (I think). Let me give it a try.
Comment 64 Doug Smythies 2016-04-06 04:29:36 UTC
(In reply to Rafael J. Wysocki from comment #57)
> Created attachment 211881 [details]
> intel_pstate: Use %busy-based algorithm on Core
> 
> Since cpuidle appears to be working correctly, let's focus on intel_pstate.
> 
> Jörg: Please check this patch instead of the revert from comment #1 and
> report back.

I prefer a load based algorithm. I have plenty of test data from a few users (albeit with a different load based algorithm) collected over the last year.

Almost two years ago, a group of us spent a couple of months on various hybrid algorithms. However, the timing of the driver is different now.
Comment 65 Srinivas Pandruvada 2016-04-06 05:50:45 UTC
I did spec power and ran 40+ Phoronix benchmarks on Rafael's change "intel_pstate: Scale core_busy by %busy if super-lightly loaded". I didn't see any deterioration in performance. I usually run several rounds, but I want to confirm that this fixes the issue.

@Jörg, Please try the latest attachment:
"intel_pstate: Scale core_busy by %busy if super-lightly loaded"
Comment 66 jrg.otte 2016-04-06 10:09:55 UTC
I tried the "Use %busy-based algorithm on Core" patch.

- Idle is back to 800 MHz.
- Frequency scaling is less jumpy than usual.
- Kernel compiling is as fast as usual.

For me a very good approach.
Comment 67 jrg.otte 2016-04-06 12:06:20 UTC
I tried the "Scale core_busy by %busy if super-lightly loaded" patch.

- Idle is back to 800 MHz.
- Tends toward higher frequencies than v4.5.
- Tends toward higher frequencies than the "Use %busy-based algorithm on Core" patch.

- The "Use %busy-based algorithm on Core" patch tends toward lower frequencies than v4.5.

I'd prefer "Use %busy-based algorithm on Core".


I had to modify the patch because the old content did not match:

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 4b64452..75e1865 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -989,6 +989,10 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
 				      int_tofp(duration_ns));
 		core_busy = mul_fp(core_busy, sample_ratio);
+	} else {
+		sample_ratio = div_fp(cpu->sample.mperf * 100, cpu->sample.tsc);
+		if (sample_ratio < int_tofp(1))
+			core_busy = mul_fp(core_busy, sample_ratio);
 	}
 
 	cpu->sample.busy_scaled = core_busy;
Comment 68 Rafael J. Wysocki 2016-04-06 12:49:15 UTC
(In reply to jrg.otte from comment #67)

Thanks for the testing!

> I tried  "Scale core_busy by %busy if super-lightly loaded" patch.
> 
> - Idle is back to 800Mhz.

Good.

> - Tends to higher frequencies than V4.5
> - Tends to higher frequencies than "Use %busy-based algorithm on Core" patch.
> 
> - "Use %busy-based algorithm on Core" tends to lower frequencies than V4.5

It will do that, but there's a bunch of people who won't like it.

> I'd prefer "Use %busy-based algorithm on Core".

Well, first of all, we need a minimum fix for v4.6, so I'd like to use the "Scale core_busy by %busy if super-lightly loaded" one for that if it is good enough.

Second, get_target_pstate_use_cpu_load() has its disadvantages too, like the get_cpu_iowait_time_us() call in it, which involves doing a ktime_get() that may be quite costly on some setups (from the perspective of the scheduler paths invoking intel_pstate, at least).

We will need to use some kind of %busy-based algorithm on Core going forward, but that will need some serious consideration.  If you can help to test patches, please let me know.

For now, please attach the output of "turbostat --debug" for both the "Use %busy-based algorithm on Core" and "Scale core_busy by %busy if super-lightly loaded" patches.
Comment 69 Doug Smythies 2016-04-06 15:07:01 UTC
(In reply to Rafael J. Wysocki from comment #61)
> Created attachment 211891 [details]
> intel_pstate: Scale core_busy by %busy if super-lightly loaded
> 
> Another patch to try if someone is brave enough.
> 
> It is a bit more aggressive than the last Srinivas' one, but simpler.

For my test case, this patch does not work. Why not? The threshold is too low. With a threshold-type solution the question is what happens if the problem scenario is just above the threshold.
Comment 70 Rafael J. Wysocki 2016-04-06 15:12:48 UTC
(In reply to Doug Smythies from comment #69)
> (In reply to Rafael J. Wysocki from comment #61)
> > Created attachment 211891 [details]
> > intel_pstate: Scale core_busy by %busy if super-lightly loaded
> > 
> > Another patch to try if someone is brave enough.
> > 
> > It is a bit more aggressive than the last Srinivas' one, but simpler.
> 
> For my test case, this patch does not work. Why Not? The threshold is too
> low. With a threshold type solution the question is what if the problem
> scenario is just above the threshold.

Right, but that applies to all threshold values, doesn't it?

What threshold value would work for you, then?  Or is there any?
Comment 71 Doug Smythies 2016-04-06 15:36:32 UTC
(In reply to Rafael J. Wysocki from comment #70)
> (In reply to Doug Smythies from comment #69)
> > (In reply to Rafael J. Wysocki from comment #61)
> > > Created attachment 211891 [details]
> > > intel_pstate: Scale core_busy by %busy if super-lightly loaded
> > > 
> > > Another patch to try if someone is brave enough.
> > > 
> > > It is a bit more aggressive than the last Srinivas' one, but simpler.
> > 
> > For my test case, this patch does not work. Why Not? The threshold is too
> > low. With a threshold type solution the question is what if the problem
> > scenario is just above the threshold.
> 
> Right, but that applies to all threshold values, doesn't it?

Yes exactly, which is what we struggled with 2 years ago when trying these type of solutions.

> 
> What threshold value would work for you, then?  Or is there any?

5% would work for the test scenario I have right now, but then I could just change the test case to use 6%.

I think there is/was an energy regression with the current load based algorithm on SpecPower (not sure why I think that). During the last cycle (around mid to late February) I found that the load algorithm I had been proposing did well on my SpecPower simulated test.
Comment 72 jrg.otte 2016-04-06 16:36:09 UTC
Created attachment 211961 [details]
turbostat-debug-busy-based-algorithm-on-Core
Comment 73 jrg.otte 2016-04-06 16:36:35 UTC
Created attachment 211971 [details]
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded
Comment 74 jrg.otte 2016-04-06 16:39:35 UTC
> If you can help to test patches, please let me know.
Sure, I'm up for it.

Thanks to all debuggers and contributors!
Jörg
Comment 75 Rafael J. Wysocki 2016-04-06 19:05:58 UTC
Created attachment 211991 [details]
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2)

I'm still not entirely fond of the turbostat results with the previous version of this patch, so please apply this one, run "turbostat --debug" with it, and attach the result.
Comment 76 Rafael J. Wysocki 2016-04-06 19:13:17 UTC
(In reply to Doug Smythies from comment #71)
> (In reply to Rafael J. Wysocki from comment #70)
> > (In reply to Doug Smythies from comment #69)
> > > (In reply to Rafael J. Wysocki from comment #61)
> > > > Created attachment 211891 [details]
> > > > intel_pstate: Scale core_busy by %busy if super-lightly loaded
> > > > 
> > > > Another patch to try if someone is brave enough.
> > > > 
> > > > It is a bit more aggressive than the last Srinivas' one, but simpler.
> > > 
> > > For my test case, this patch does not work. Why Not? The threshold is too
> > > low. With a threshold type solution the question is what if the problem
> > > scenario is just above the threshold.
> > 
> > Right, but that applies to all threshold values, doesn't it?
> 
> Yes exactly, which is what we struggled with 2 years ago when trying these
> type of solutions.

I'm not looking for a final solution at this point, though.

The current algorithm is clearly unstable as a relatively small timing change has apparently caused it to produce results that are way off.

It is too late in the cycle for us to modify the algorithm significantly, however.  We can either go back to using timers with intel_pstate, which I really wouldn't like to do for a couple of reasons (one of them being that it really wouldn't fix the algorithm, but rather hide the problem with it on this particular machine), or we can do something to stabilize it somewhat which is what I'm trying to do.
Comment 77 Doug Smythies 2016-04-06 21:32:21 UTC
Note: I wrote this a few hours ago, but it collided with Rafael's edit and I didn't notice.

Based on the turbostat results, I don't think the "super lightly loaded" method threshold was high enough for Jörg's case either. (And I see Rafael has increased the threshold.)

Srinivas: Is there any chance you would be willing to test my version of a load based algorithm on your SpecPower test-bed? I never got the results from Kristen's tests, and suspect it wasn't done with the correct code or "knob" settings anyway.
I have not rebased the patch set since kernel 4.3, and so it has to be applied to a 4.3 start point.

References:

http://www.smythies.com/~doug/linux/intel_pstate/build58/0000-cover.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0001-intel_pstate-Use-C0-time-for-busy-calculations-again.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0002-intel_pstate-Calculate-target-pstate-directly.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0003-intel_pstate-Compensate-for-intermediate-durations-v.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0004-intel_pstate-move-utilization-calculation-to-a-separ.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0005-intel_pstate-delete-unused-PID-controller.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0006-intel_pstate-Change-utilization-calculation-to-a-leg.txt

http://www.smythies.com/~doug/linux/intel_pstate/build58/0001-intel_pstate-Doug-patch-set-clamp-scaled-busy.txt

All of the above links are also available in the references section here:
http://www.smythies.com/~doug/linux/intel_pstate/build58/index.html

There are also some notes and graphs comparing the current threshold (setpoint) based load algorithm to the current performance based algorithm and mine at:
http://www.smythies.com/~doug/linux/intel_pstate/philippe_longepe/index.html
Comment 78 Doug Smythies 2016-04-06 21:37:41 UTC
(In reply to Rafael J. Wysocki from comment #76)
> (In reply to Doug Smythies from comment #71)
>
> I'm not looking for a final solution at this point, though.

Fair enough.
 
> The current algorithm is clearly unstable as a relatively small timing
> change has apparently caused it to produce results that are way off.

Yes, the current algorithm has problems and always has. This is what I have been complaining about for almost two years now.
Comment 79 Srinivas Pandruvada 2016-04-06 21:57:00 UTC
Doug, Rebase to the latest and send me.
Comment 80 Srinivas Pandruvada 2016-04-06 22:31:14 UTC
Doug, FYI, using the scheduler utility I have already tried a y = mx + c type algorithm. I already have this data; will send offline. The performance/watt numbers regress by 3 to 4% for specpower.
Comment 81 Doug Smythies 2016-04-06 23:32:23 UTC
(In reply to Srinivas Pandruvada from comment #79)
> Doug, Rebase to the latest and send me.

Srinivas: Due to limited time, I can not right now. I could rebase starting maybe around May 10th.

(In reply to Srinivas Pandruvada from comment #80)
> Doug, FYI, Using scheduler utility I have already tried y = mx + c, type
> algorithm. I already have this data. Will send offline. The performance/watt
> number regress by 3 to 4% for specpower.

O.K. thanks. Note that, at least in my testing, the values for m and c were very important. The default response curve I use is the result of considerable work.
Comment 82 Doug Smythies 2016-04-07 07:07:08 UTC
@Jörg: Did you ever try the Ubuntu mainline PPA kernel?
http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc2-wily/

At some point you mentioned your system was somewhat Ubuntu 14.04 based. Would it be possible to try kernel 4.6-rc2, but based on a 16.04 system? Perhaps via a live CD or something. While I still believe there is, and always has been, an issue with the control algorithm, I also think there can be odd things during 14.04 initialization.
Comment 83 jrg.otte 2016-04-07 09:59:30 UTC
@Doug:
> Did you ever try the Ubuntu mainline PPA kernel?
No, but I can give it a try.

> Would it be possible to try kernel 4.6-rc2, but based on a 16.04 system.
> Perhaps via live CD 
Don't know how.

I have had bad experiences upgrading too early. I always wait for the first
point release.
Comment 84 Rafael J. Wysocki 2016-04-07 11:24:40 UTC
@Jörg: I hope you haven't missed comment #75. :-)
Comment 85 jrg.otte 2016-04-07 14:09:04 UTC
@Rafael:
Sorry, overlooked! Will come soon.
Comment 86 jrg.otte 2016-04-07 14:45:53 UTC
Created attachment 212001 [details]
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2
Comment 87 Rafael J. Wysocki 2016-04-07 15:17:39 UTC
(In reply to jrg.otte from comment #86)
> Created attachment 212001 [details]
> turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2

Thanks!

So this one seems to be keeping the P-states in check while in idle.
Comment 88 Doug Smythies 2016-04-07 15:37:23 UTC
(In reply to Rafael J. Wysocki from comment #87)
> (In reply to jrg.otte from comment #86)
> > Created attachment 212001 [details]
> > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2
> 
> Thanks!
> 
> So this one seems to be keeping the P-states in check while in idle.

Agreed, but the first turbostat sample CPU frequencies seem a little high.

It may be a waste of time, but I'd like to work through a trace sample with this kernel.
It would help if acquired using "sudo perf record -a --event=power:pstate_sample sleep 100" as then post processing of the resulting perf.data would take 2 minutes instead of hours. I listed 100 seconds above, but would really like 300 seconds. 

Jörg, maybe your odd kernel configuration is such that you can not use "perf record". I compile perf myself from the tools/perf directory and then put it in my user bin directory. So the command I actually use is:

sudo ~/bin/perf record -a --event=power:pstate_sample sleep 300
Comment 89 Rafael J. Wysocki 2016-04-07 16:48:33 UTC
(In reply to Doug Smythies from comment #88)
> (In reply to Rafael J. Wysocki from comment #87)
> > (In reply to jrg.otte from comment #86)
> > > Created attachment 212001 [details]
> > > turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V2
> > 
> > Thanks!
> > 
> > So this one seems to be keeping the P-states in check while in idle.
> 
> Agreed, but the first turbostat sample CPU frequencies seem a little high.

The first sample has higher frequencies in all of Jörg's turbostat reports, even in the good case.

[cut]

> Jörg, maybe your odd kernel configuration is such that you can not use "perf
> record". I compile perf myself from the tools/perf directory and then put it
> in my user bin directory. So the command I actually use is:
> 
> sudo ~/bin/perf record -a --event=power:pstate_sample sleep 300

Well, I guess why not if that can help.

In any case, the plan for v4.6 is to go with the patch from comment #75.

Going forward, we'll need a new (and %busy-based) algorithm for Core processors.
Comment 90 Doug Smythies 2016-04-07 18:12:00 UTC
(In reply to Rafael J. Wysocki from comment #89)
> (In reply to Doug Smythies from comment #88)
>> (In reply to Rafael J. Wysocki from comment #87)

>>> So this one seems to be keeping the P-states in check while in idle.
>> 
>> Agreed, but the first turbostat sample CPU frequencies seem a little high.
> 
> The first sample has higher frequencies in all of the Jörg's turbostat
> reports, even in the good case.

Disagree.
attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU frequencies.
attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU frequencies.
Comment 91 Rafael J. Wysocki 2016-04-07 19:37:34 UTC
(In reply to Doug Smythies from comment #90)
> (In reply to Rafael J. Wysocki from comment #89)
> > (In reply to Doug Smythies from comment #88)
> >> (In reply to Rafael J. Wysocki from comment #87)
> 
> >>> So this one seems to be keeping the P-states in check while in idle.
> >> 
> >> Agreed, but the first turbostat sample CPU frequencies seem a little high.
> > 
> > The first sample has higher frequencies in all of the Jörg's turbostat
> > reports, even in the good case.
> 
> Disagree.
> attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU
> frequencies.
> attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU
> frequencies.

Higher != the highest :-)

But the point essentially is that frequencies are not uniform in each report.  They vary between intervals, sometimes quite a bit.

I'm not sure if it matters that the first interval has higher frequencies and the second one lower, etc., because it depends on when the measurement started too.
Comment 92 Doug Smythies 2016-04-07 20:27:07 UTC
(In reply to Rafael J. Wysocki from comment #91)
> (In reply to Doug Smythies from comment #90)
>>>(In reply to Rafael J. Wysocki from comment #89)
>>> (In reply to Doug Smythies from comment #88)
>>>> (In reply to Rafael J. Wysocki from comment #87)
>> 
>>>>> So this one seems to be keeping the P-states in check while in idle.
>>>> 
>>>> Agreed, but the first turbostat sample CPU frequencies seem a little high.
>>> 
>>> The first sample has higher frequencies in all of the Jörg's turbostat
>>> reports, even in the good case.
>> 
>> Disagree.
>> attachment 211511 [details] (Good) 3rd sample of 3 has highest CPU
>> frequencies.
>> attachment 211521 [details] (Fail) 3rd sample of 3 has highest CPU
>> frequencies.
> 
> Higher != the highest :-)
> 
> But the point essentially is that frequencies are not uniform in each
> report.  They vary between intervals, sometimes quite a bit.

Agreed.

> I'm not sure if it matters that the first interval is higher and the second
> one is lower frequency etc, because it depends on when the measurement
> started too.

Agreed. 
I think this is a reason to acquire 5 minutes worth of trace data to examine.
Comment 93 Doug Smythies 2016-04-08 00:42:47 UTC
Created attachment 212081 [details]
Screen shots 1 of 2: From trace data. Not normalized case.

(In reply to Rafael J. Wysocki from comment #75)
> Created attachment 211991 [details]
> intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2)
> 
> I'm still not entirely fond of the turbostat results with the previous
> version of this patch, so please apply this one run "turbostat --debug" with
> it and attach the result.

Rafael: When you doubled the threshold to 2%, shouldn't you have also normalized the adjustment to be 1 at 2%?

I'm suggesting that this:

if (sample_ratio < int_tofp(2))
	core_busy = mul_fp(core_busy, sample_ratio);

Should be something like this:

if (sample_ratio < int_tofp(2))
	core_busy = mul_fp(core_busy, sample_ratio / 2);
Comment 94 Doug Smythies 2016-04-08 00:43:28 UTC
Created attachment 212091 [details]
Screen shots 2 of 2: From trace data. Normalized case.
Comment 95 Rafael J. Wysocki 2016-04-08 01:53:44 UTC
(In reply to Doug Smythies from comment #93)
> Created attachment 212081 [details]
> Screen shots 1 of 2: From trace data. Not normalized case.
> 
> (In reply to Rafael J. Wysocki from comment #75)
> > Created attachment 211991 [details]
> > intel_pstate: Scale core_busy by %busy if super-lightly loaded (v2)
> > 
> > I'm still not entirely fond of the turbostat results with the previous
> > version of this patch, so please apply this one run "turbostat --debug"
> with
> > it and attach the result.
> 
> Rafael: When you doubled the threshold to 2%, shouldn't you have also
> normalized the adjustment to be 1 at 2%?
> 
> I'm suggesting that this:
> 
> if (sample_ratio < int_tofp(2))
>       core_busy = mul_fp(core_busy, sample_ratio);
> 
> Should be something like this:
> 
> if (sample_ratio < int_tofp(2))
>       core_busy = mul_fp(core_busy, sample_ratio / 2);

I made a mistake, but I didn't mean the above.  I meant to multiply core_busy by the fraction and not by the percentage.
Comment 96 Rafael J. Wysocki 2016-04-08 01:55:27 UTC
(In reply to Doug Smythies from comment #94)
> Created attachment 212091 [details]
> Screen shots 2 of 2: From trace data. Normalized case.

It would be good to explain what's illustrated by it, even if it's obvious to you.
Comment 97 Rafael J. Wysocki 2016-04-08 02:02:27 UTC
Created attachment 212101 [details]
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v3)

Previous versions of this were incorrect, as the scaling should be by the fraction and not by percentage.

@Jörg: Please test this one and attach the output of "turbostat --debug", but please let it run for longer (say, 8 intervals).
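(The v3 attachment body is not quoted inline here. Judging from the description above and Doug's reading of it in the next comment, the change presumably computes the mperf/tsc ratio as a plain fixed-point fraction and scales core_busy by it when the CPU is below roughly 2% busy, along the lines of:)

	} else {
		/* mperf/tsc as a fraction this time, not a percentage */
		sample_ratio = div_fp(cpu->sample.mperf, cpu->sample.tsc);
		if (100 * sample_ratio < int_tofp(2))
			core_busy = mul_fp(core_busy, sample_ratio);
	}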
Comment 98 Doug Smythies 2016-04-08 04:26:34 UTC
(In reply to Rafael J. Wysocki from comment #97)
> Created attachment 212101 [details]
> intel_pstate: Scale core_busy by %busy if super-lightly loaded (v3)
> 
> Previous versions of this were incorrect, as the scaling should be by the
> fraction and not by percentage.
> 
> @Jörg: Please test this one and attach the output of "turbostat --debug",
> but please let it run for more time (8 intervals say).

Rafael: O.K. That is actually what I thought you were wanting to do originally.

Because there are only 8 fractional bits, it gets a little coarse, but is fine. (The actual threshold is about 2.34%.) Myself, I would have just done this:

} else {
  sample_ratio = div_fp(cpu->sample.mperf, cpu->sample.tsc);
  if (100 * sample_ratio < int_tofp(2))
    core_busy = 0;
}
Comment 99 Doug Smythies 2016-04-08 04:47:33 UTC
@Jörg: if possible, I would like to get trace data, as per my comment #88, for the test of Rafael's comment #97.
Comment 100 Srinivas Pandruvada 2016-04-08 06:13:34 UTC
Created attachment 212111 [details]
v3_spec_pwr_hsw_ep0
Comment 101 Srinivas Pandruvada 2016-04-08 06:33:32 UTC
The spec power results with the patch are comparable to the baseline 4.6-rc2.

Other client tests are still going on. The notable one is the IO benchmark regression.

AIO-Stress 0.21
Test: Random Write
  MB/s
    baseline 4.6-rc2  .. 575.88 |===============================================
    +v3patch          .. 521.42 |===========================================

They are in MB/s, so more is better.
Comment 102 jrg.otte 2016-04-08 08:23:00 UTC
Created attachment 212121 [details]
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V3
Comment 103 jrg.otte 2016-04-08 10:32:55 UTC
perf record -a --event=power:pstate_sample sleep 300
was successful.
So I need to do queries. Which?
Comment 104 Rafael J. Wysocki 2016-04-08 12:27:18 UTC
(In reply to Doug Smythies from comment #98)

> Because there are only 8 fractional bits, it gets a little coarse, but is
> fine. (the actual threshold is about 2.34%.) Myself, I would have just done
> this:
> 
> } else {
>   sample_ratio = div_fp(cpu->sample.mperf, cpu->sample.tsc);
>   if (100 * sample_ratio < int_tofp(2))
>     core_busy = 0;

Yes, that is an option, of course, and actually preferred due to fewer computations.

I wonder if lowering the threshold would make a difference, though.
Comment 105 Rafael J. Wysocki 2016-04-08 12:30:40 UTC
Created attachment 212151 [details]
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v4)

Simply reset core_busy if %busy is below the threshold and lower the threshold at the same time.

@Jörg: Please try this one and attach the "turbostat --debug" output (like before).

@Srinivas: Please test this one.
Comment 106 Rafael J. Wysocki 2016-04-08 12:34:34 UTC
Created attachment 212161 [details]
intel_pstate: Scale core_busy by %busy if super-lightly loaded (v5)

Cosmetic changes.

@Jörg, Srinivas: Please test as per the previous comment.
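(The v4/v5 attachment bodies are likewise not quoted inline here. Per the descriptions above, that is, reset core_busy when %busy is below the threshold and lower the threshold, they presumably boil down to something like the sketch below, following the same reset-to-zero idea Doug proposed in comment #98.)

	} else {
		sample_ratio = div_fp(100 * cpu->sample.mperf, cpu->sample.tsc);
		/* CPU was busy for less than ~1% of the sample period: treat as idle */
		if (sample_ratio < int_tofp(1))
			core_busy = 0;
	}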
Comment 107 Doug Smythies 2016-04-08 13:49:27 UTC
(In reply to jrg.otte from comment #103)
> perf record -a --event=power:pstate_sample sleep 300
> was successful.
> So I need to do queries. Which?

In the same directory that you ran the command from, there will be a file called perf.data. I crudely estimate it will be about 7 megabytes, but it should compress by a factor of roughly 7. Compress that file and attach it, or if there is a size limit, send it to me directly. At this point, a V5 trace would be preferred.
Comment 108 jrg.otte 2016-04-08 14:24:08 UTC
Created attachment 212211 [details]
turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V5
Comment 109 Doug Smythies 2016-04-08 18:04:54 UTC
(In reply to Srinivas Pandruvada from comment #101)

> Notable one is the IO bench mark regress.
> 
> AIO-Stress 0.21
> Test: Random Write
>   MB/s
>     baseline 4.6-rc2  .. 575.88
> |===============================================
>     +v3patch          .. 521.42 |===========================================
> 
> They are in MB/S, so more is better.

Srinivas: On my computer I get (sorry, still using rc1; due to high variability I increased the number of runs from the default of 3 to 10):

4.6-rc1 2667.86 MB/S 6.37% Standard Deviation
4.6-rc1+v3 2750.67 MB/S 0.64% Standard Deviation
Comment 110 Srinivas Pandruvada 2016-04-08 19:15:00 UTC
Doug, Yes. These tests show lots of variability. Looks like you are running on a server (Haswell?). I am rerunning tests.
Comment 111 Rafael J. Wysocki 2016-04-08 19:34:11 UTC
(In reply to jrg.otte from comment #108)
> Created attachment 212211 [details]
> turbostat-debug-Scale-core-busy-by-%busy-if-super-lightly-loaded-V5

OK, that looks reasonable, thanks!

Now it all depends on Srinivas' test results.
Comment 112 Srinivas Pandruvada 2016-04-08 22:06:36 UTC
Created attachment 212261 [details]
IO disk tests

Tested the v5 version of the patch attached here. No regression on IO tests. Trying one more system (Haswell ultrabook).
Comment 113 Doug Smythies 2016-04-09 00:20:26 UTC
(In reply to Srinivas Pandruvada from comment #110)
> Looks like you are running on a server (Haswell?).

I have an older Sandy Bridge i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz) on an Asus P8 Z68-M PRO motherboard.
Comment 114 Srinivas Pandruvada 2016-04-09 00:59:56 UTC
I tested on the Haswell ultrabook also; the results with patch version v5 are comparable to the baseline.

Doug: You get pretty good IO performance.
Comment 115 Rafael J. Wysocki 2016-04-09 01:16:44 UTC
Cool, thanks!

I'll send the patch from comment #106 to the mailing lists then and queue it up as a 4.6 fix if there are no objections against that.
Comment 116 Doug Smythies 2016-04-09 06:51:33 UTC
(In reply to Rafael J. Wysocki from comment #115)
> I'll send the patch from comment #106 to the mailing lists then and queue it
> up as a 4.6 fix if there are no objections against that.

No Objection.
Comment 117 Doug Smythies 2016-04-11 22:49:22 UTC
(In reply to Doug Smythies from comment #20)

> I still think this is the same as or similar to bug 93521. I have asked on
> that bug for them to try kernel 4.6-rc1.

We have some results from bug 93521. The issue is the same or similar, and patch V5 (the one Rafael also sent to the mailing lists) gives similarly good results.
Comment 118 Yaroslav Isakov 2016-04-17 15:56:50 UTC
V5 also helped on a Skylake 6600U system, thank you!
