Bug 93521

Summary: Haswell intel_pstate rarely drops CPU freq below 4.40 GHz on i7-4790K
Product: Power Management
Component: intel_pstate
Status: CLOSED INVALID
Severity: normal
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 3.18.7 - 4.10.13
Regression: No
Reporter: da_audiophile
Assignee: Chen Yu (yu.c.chen)
CC: anonymous135813, anthomas8, artafinde, arup.chowdhury, da_audiophile, dev.rindeal+kernel.org, doug16k, dsmythies, jan.claussen10, johan.reitan, john.ettedgui+kernel, jp.vanriel, lenb, max.bra.gtalk, rui.zhang, szg00000, trolldev
Subsystem:
Bisected commit-id:
Attachments: i7z log under X
i7z log in a TTY
histograms
excerpt from a post processed "perf record"
Performance data sampled from patched Arch kernel 3.18.6-1
Post-processed perf data manuel01
John i7z example
manuel data - example of CPU frequency influence
john 300 Hz kernel example
Doug 250 Hz kernel example
From the John C3 data sample
Program to apply load at specified work / sleep rate
John consume data sample
Suggested kernel config file
Demonstrates failure to raise target pstate under heavy load. 1 of 2.
Demonstrates failure to raise target pstate under heavy load. 2 of 2.
Compare some response curves - fixed load method
Compare some response curves - fixed work packet method
CPU frequencies during game - with patch set - default settings
CPU frequencies during game - with patch set - adjusted settings
Phoronix ffmpeg test compare
An example of the target pstates going to maximum with virtually no load
Why does this system not go to 500MHz?
An example where CPU 0 does not give up its vote into the PLL
CPU Frequencies for Anon Y
CPU loads for Anon Y
turbostat under 4.5.0
turbostat under 4.6rc3
histogram comparing percent time in c7 sleep while idle
mean values based on turbostat
turbostat under 4.5.2
turbostat under 4.6rc4 using Doug's patch

Description da_audiophile 2015-02-19 20:28:58 UTC
Upon querying i7z (git), I noticed my Z97-based system w/ Haswell i7-4790K rarely drops the CPU frequency below the turbo maximum of 4.40 GHz.  This is while the desktop is idle (xfce4) or when I drop to a tty and no X is running.  I have confirmed in htop that nothing is really tapping the CPU, and i7z itself shows me that the processor is in the C7 state most of the time.

In contrast, the CPU freq of an older Ivy Bridge (i7-3770K) system idles @ 1.60 GHz when using the intel_pstate driver and scales up to the full freq only under a load.

Is this Haswell high frequency normal when managed by intel_pstate?

% cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
powersave
% cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver 
intel_pstate

% sudo i7z
Cpu speed from cpuinfo 3999.00Mhz
cpuinfo might be wrong if cpufreq is enabled. To guess correctly try estimating via tsc
Linux's inbuilt cpu_khz code emulated now
True Frequency (without accounting Turbo) 4000 MHz
  CPU Multiplier 40x || Bus clock frequency (BCLK) 100.00 MHz

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
  TURBO ENABLED on 4 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 4100.00 MHz (100.00 x [41])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is  44x/44x/44x/44x
  Real Current Frequency 4399.90 MHz [100.00 x 44.00] (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %   C7 %  Temp      VCore
        Core 1 [0]:       4399.08 (43.99x)         1    0.621      1       1    96.8    24      1.2190
        Core 2 [1]:       4399.46 (43.99x)         1    0.107      0       0    99.7    25      1.2213
        Core 3 [2]:       4399.90 (44.00x)         1       0       0       1    99.5    25      1.2177
        Core 4 [3]:       4399.86 (44.00x)         1       0       1       0    99.6    24      1.2197



C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
  Above values in table are in percentage over the last 1 sec
[core-id] refers to core-id number in /proc/cpuinfo
'Garbage Values' message printed when garbage values are read
  Ctrl+C to exit
Comment 1 da_audiophile 2015-02-19 20:43:22 UTC
Actually, I need to change my original report: the driver appears to be working when I am in a tty without lxdm running at all.  If I however keep lxdm running and open a tty, I do not see the frequency change much from the 4.40 GHz value.  I have attached the i7z logs under X and booting with lxdm.service disabled.

Under X = X-cpu_freq_log
Under TTY = tty-cpu_freq_log
Comment 2 da_audiophile 2015-02-19 20:43:39 UTC
Created attachment 167591 [details]
i7z log under X
Comment 3 da_audiophile 2015-02-19 20:43:57 UTC
Created attachment 167601 [details]
i7z log in a TTY
Comment 4 da_audiophile 2015-02-20 22:52:36 UTC
The attached png image was generated by booting into Xorg (shown in blue) or into a TTY (shown in green) and then using `i7z -w a` and plotting the resulting frequencies.  You can clearly see the bug I am reporting.  The top plot is the distribution on the Haswell CPU.  The mean freq at idle under Xorg is 3,401 MHz compared to 1,201 MHz in a text TTY without Xorg on the Haswell.

I also included the same experiment on an Ivy Bridge CPU.  Here the two mean values are the same.
Comment 5 da_audiophile 2015-02-20 22:52:53 UTC
Created attachment 167801 [details]
histograms
Comment 6 da_audiophile 2015-02-20 23:24:35 UTC
Same bug?
https://bugzilla.kernel.org/show_bug.cgi?id=65591
Comment 7 da_audiophile 2015-02-21 10:42:22 UTC
I just realized that the i7-4790k is technically a "Haswell Refresh" and after collecting the same data on an older "Haswell" CPU, this bug seems to only affect the "Haswell Refresh" CPUs: the i3-4130T shows nearly the same mean value at idle under either Xorg or a text TTY (mean Xorg=1,089 MHz and mean TTY=1,029 MHz).
Comment 8 da_audiophile 2015-02-21 10:50:03 UTC
Here is an independent report from another i7-4790k user confirming this behavior: https://bbs.archlinux.org/viewtopic.php?id=184817
Comment 9 da_audiophile 2015-02-24 02:05:41 UTC
How crazy is this: tick rates of either 250 Hz or 1000 Hz give idle frequencies comparable to what I see under Ubuntu or Fedora. A setting of 300 Hz is to blame for the odd behavior.  Nothing else.

Original Arch config (300 Hz) median value on idle in Xorg = 3,316 MHz

Changing only the tickrate from the original Arch config...
To 250 Hz tick rate median value on idle in Xorg = 889 MHz
To 1000 Hz tick rate median value on idle in Xorg = 1,012 MHz

What's going on with the 300 Hz value and this particular processor?
Comment 10 Doug Smythies 2015-02-26 16:24:47 UTC
Created attachment 168311 [details]
excerpt from a post processed "perf record"

John (da_audiophile@yahoo.com) has discovered a very interesting use case:

His distribution (Arch) uses a 300 Hertz kernel. Due to integer math, the default sample time of 10 milliseconds turns into 13.33333 milliseconds for this case. For whatever reason, it turns out that this particular sample interval and the video frame rate of 59.95 Hertz interact with the xorg desktop stuff in such a way as to dramatically increase the manifestation of the pre-existing condition of driving up the target pstate under very light loads. If the sample time is changed, the magnitude of the issue is greatly reduced.
Attached is a small excerpt from a "perf record" session John did on this otherwise "idle" computer. (Note: he had to apply the patch we have been using for many months so that our post processing tools would work.) Notice how the smallest perturbation grows such that the target pstate ends up at the turbo maximum, and the various CPUs basically chase each other around, all with the load being less than 1 percent.
There are a few issues with the intel_pstate driver. One is this tendency to drive up the target pstate for no good reason (in most manifestations the target pstate doesn't get driven up too far and also comes back down fairly quickly). Another is the tendency to drive down the target pstate when it should be higher, due to excessive deferral of the deferrable timers, thus incorrectly engaging the duration reduction code.
Before Dirk moved on, the little working group was working towards eliminating the duration method and re-introducing some C0 time inclusion to the calculation of the target pstate. Indeed, we had some very nice CPU frequency versus load response curves, with what we considered to be excellent tradeoffs between performance, energy savings, and flexibility. While the group agreed on the big picture, and its urgency, we disagreed on some of the exact implementation details. For my part of it, I'll try to revive and test my patch set, but it'll probably take a couple of weeks.
In the meantime, the proposed workaround is to set the sample rate to 19 milliseconds for these 300 Hz kernels, which will result in an actual sample rate of 20 milliseconds. (Note that the working group was tending towards a longer default sample rate anyway.)
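The intervals quoted above line up with whole numbers of jiffies on a 300 Hz kernel. Below is a minimal sketch of that arithmetic only. It does not reproduce the kernel's own millisecond-to-jiffy rounding, and the jiffy counts (4 and 6) are inferred here from the 13.33333 and 20 millisecond figures rather than taken from the driver source.

```python
def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Length of a whole number of jiffies in milliseconds for a given tick rate."""
    return jiffies * 1000.0 / hz

# One jiffy at the tick rates discussed in this report.
for hz in (250, 300, 1000):
    print(f"{hz} Hz: 1 jiffy = {jiffies_to_ms(1, hz):.3f} ms")

# Assumption: the reported 13.33333 ms interval corresponds to 4 jiffies at 300 Hz,
# and the 20 ms workaround interval corresponds to 6 jiffies.
print(f"4 jiffies @ 300 Hz = {jiffies_to_ms(4, 300):.3f} ms")
print(f"6 jiffies @ 300 Hz = {jiffies_to_ms(6, 300):.3f} ms")
```

At 300 Hz a jiffy is 3.333 ms, so any timer interval ends up a multiple of that, which is why a 10 millisecond request cannot be honoured exactly.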
Comment 11 manuel.bua 2015-02-27 10:48:51 UTC
@Doug, i've the same problem on my ArchLinux/64, i7-4790K, Asus H97-Plus. i tried your suggestion and tweaked my sample rate with the command below; note my kernel is the stock Arch kernel configured with a tick of 300 Hz:

echo 19 | sudo tee /sys/kernel/debug/pstate_snb/sample_rate_ms

By looking at the c-states with i7z-git i can see how the CPU is staying in C7 most of the time but the frequency is modulated in the upper 4-4.4 GHz range if Turbo Mode is enabled, while it just sits at 3.9/4 GHz without turbo.
I can see some spikes toward 2.9 GHz, *for one single core at a time*, but that's it, not really going down for more than one second.

I also tried different settings for the sample rate, such as 5, 20, 21, 22, 25, 50 and 100, but the result stays the same.
Comment 12 Doug Smythies 2015-02-27 14:58:42 UTC
@manuel: Very interesting. Your computer is otherwise "idle" right? What kernel version are you using? Are you adept at compiling the kernel? It would be great to acquire some "perf record" data, but to do so requires the addition of a yet to be released patch.
Comment 13 manuel.bua 2015-02-27 15:17:23 UTC
I'm using the latest stock kernel, 3.18.6-1-ARCH #1 SMP PREEMPT x86_64 GNU/Linux.
It was completely idle: at some point i also killed Xorg but didn't make any difference; i'll retry it this evening and see if i notice any major changes.
Comment 14 manuel.bua 2015-02-27 15:20:18 UTC
I may be able to recompile it and give it a try, but i've lot of work to finish and i'm not sure i'll have the time, but please let me know what to tweak in case i can do it.
Comment 15 Doug Smythies 2015-02-27 15:58:41 UTC
"at some point i also killed Xorg but didn't make any difference"
Then your use case is different than John's use case.
It would be better to use kernel 3.19 or 4.0RC1 for tests. However, I do have the older version of the patches for kernel 3.18.

With your computer otherwise idle, could you post the output from:
sudo turbostat sleep 300
Comment 16 manuel.bua 2015-02-27 19:20:46 UTC
Ok, i was just idling in the Gnome desktop, and even though some minor activity is there the c7 state was clearly the winner. Also i'm running an nVidia GTX970 on 346.35, so GFX wattage will be null:

    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      38    0.96    3959    4000       0    3.01    0.22    0.08   95.73      33      35    0.00    0.00    0.00    0.00   12.54    1.06    0.00
       0       0      16    0.40    3997    4000       0    4.45    0.38    0.04   94.72      31      35    0.00    0.00    0.00    0.00   12.54    1.06    0.00
       0       4      74    1.91    3859    4000       0    2.95
       1       1      18    0.46    3994    4000       0    3.06    0.14    0.05   96.29      32
       1       5      57    1.42    3997    4000       0    2.10
       2       2      19    0.48    3992    4000       0    2.13    0.16    0.03   97.20      32
       2       6      27    0.68    3993    4000       0    1.93
       3       3      29    0.73    3988    4000       0    4.15    0.20    0.20   94.72      33
       3       7      63    1.57    3989    4000       0    3.31
300.000566 sec
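For anyone who wants to compare runs like this one without reading the table by eye, here is a rough sketch that pulls the package-wide summary row (the row whose Core and CPU columns are "-") out of turbostat text output. The function name is made up for illustration, and the parsing assumes the whitespace-separated column layout shown above; other turbostat versions print different columns.

```python
def parse_turbostat_summary(text: str) -> dict:
    """Return the summary row (Core == '-', CPU == '-') as {column name: value}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = next(cols for cols in rows if cols[:2] == ["Core", "CPU"])
    summary = next(cols for cols in rows if cols[:2] == ["-", "-"])
    # Skip the two leading Core/CPU columns, which hold "-" on the summary row.
    return {name: float(value) for name, value in zip(header[2:], summary[2:])}

# Header and summary row copied from the output above.
SAMPLE = """\
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      38    0.96    3959    4000       0    3.01    0.22    0.08   95.73      33      35    0.00    0.00    0.00    0.00   12.54    1.06    0.00
"""

row = parse_turbostat_summary(SAMPLE)
print(f"busy {row['%Busy']}%, C7 {row['CPU%c7']}%, Bzy_MHz {row['Bzy_MHz']:.0f}")
```

The point of interest is the combination: under 1% busy and over 95% in C7, yet the busy frequency sits near 4 GHz.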
Comment 17 manuel.bua 2015-02-27 19:46:38 UTC
The previous turbostat results were taken with a sample_rate_ms of 10 and while idling: the next bunch of results are taken with a sample_rate_ms of 19 and light activity, such as browsing (Chromium) and checking mail (Thunderbird):

    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -     230    5.75    3998    4000       0   18.57    4.50    1.71   69.47      34      37    0.00    0.00    0.00    0.00   18.72    7.02    0.00
       0       0     294    7.37    3997    4000       0   16.52    5.24    2.01   68.86      34      37    0.00    0.00    0.00    0.00   18.72    7.02    0.00
       0       4     208    5.21    3999    4000       0   18.68
       1       1     287    7.19    3997    4000       0   17.58    4.20    1.59   69.44      34
       1       5     200    4.99    3998    4000       0   19.78
       2       2     263    6.57    3997    4000       0   18.83    4.24    1.68   68.68      33
       2       6     190    4.76    3998    4000       0   20.64
       3       3     249    6.22    3997    4000       0   17.00    4.33    1.54   70.91      34
       3       7     148    3.70    3999    4000       0   19.52
300.000576 sec
Comment 18 Doug Smythies 2015-02-27 21:12:53 UTC
@manuel: Your CPU has a lot going on, even when the system is "idle". Could you do a test with xorg turned off and the system otherwise "idle"?
Comment 19 Doug Smythies 2015-02-27 23:30:35 UTC
@manuel: I e-mailed you the patches for both before 3.18RC1 and for 3.19RC1 onwards.

Your "idle" load is much higher than John's, however your CPU frequencies do seem high for the conditions. 

Was your computer ever put into suspend before these tests? If yes, that is a different bug report. Your intel_pstate governor is in powersave mode correct?
Comment 20 manuel.bua 2015-02-28 12:56:34 UTC
Thank you for the patches, yes i set the governor to powersave:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver 
intel_pstate

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
powersave
Comment 21 manuel.bua 2015-02-28 14:19:54 UTC
Created attachment 168431 [details]
Performance data sampled from patched Arch kernel 3.18.6-1

Sampled perf data from the ArchLinux/64 stock kernel 3.18.6-1 patched with "Add-tsc" and "Move-tracepoint".
Comment 22 manuel.bua 2015-02-28 14:21:48 UTC
@doug, i recompiled the kernel after applying both "Add-tsc" and "Move-tracepoint" patches and attached the sampled perf data to this bug report, noticed i missed some questions before:

> Was your computer ever put into suspend before these tests?

Nope, never suspended before the tests.

> Your intel_pstate governor is in powersave mode correct?

Yes, always set to powersave.
Comment 23 Doug Smythies 2015-02-28 16:22:48 UTC
@manuel: I am having trouble with the data. I get:

magic/endian check failed
incompatible file format

I will e-mail you the post processing tools, and maybe you could try.

@John: did you have to compile and install new versions of perf and trace when you did this?
Comment 24 manuel.bua 2015-02-28 17:26:05 UTC
Not sure why but even by compiling the perf tools i have the "incompatible file format" too..
Comment 25 manuel.bua 2015-02-28 17:31:42 UTC
Ouch! Got it, should use "tar Jxvf perf.data.xz" to decompress that, not the Gnome archive manager o_O

I'm now attaching the "manuel01.tar.gz" results/ folder.
Comment 26 manuel.bua 2015-02-28 17:33:10 UTC
Created attachment 168461 [details]
Post-processed perf data manuel01
Comment 27 Doug Smythies 2015-02-28 17:54:54 UTC
O.K. great, thanks. Turbo seems to be turned off, correct? There seems to be one bad data point, but it doesn't matter.

It will be hours before I can comment in more detail, but on quick glance, your computer has frequent bursts of very heavy CPU loads, and therefore I think it is behaving as expected. I'll comment more later today.
Comment 28 manuel.bua 2015-02-28 18:53:11 UTC
Nice, i'll keep an eye for updates here.

> Turbo seems to be turned off, correct?

Yes, i'm limiting the ratio to 40x.

> your computer has frequent bursts of very heavy CPU loads, and therefore I
> think it is behaving as expected

This is odd, at least, i stopped GDM and probably had mysqld/apache running, but with my old i7-920 (cpupower/ondemand), logged into the gnome desktop, i had the frequency scaling bursting to 2.6Ghz when needed, it just stayed at 800/1600 most of the time.
Comment 29 da_audiophile 2015-02-28 19:25:14 UTC
> did you have to compile and install new versions of perf and trace when you
> did this?

Nope, but I believe you already figured out the compression thing.
Comment 30 Doug Smythies 2015-02-28 22:44:13 UTC
As mentioned earlier, the Manuel data contains jumps to the maximum CPU frequency, for good reason. However, the frequency lingers at a high level for much, much longer than it should. In addition to the issues we already know about, it's as though CPUs that have gone into the C7 state are still casting their target p_state vote into the processor PLL decision stuff, whereas I thought those CPUs were supposed to lose their vote.

To gain additional insight, I need to write a program to re-process the normalize.csv file, adding some additional information and calculations.

I also need to read up some more about when the CPUs' votes count and when they don't, as I think I am just going by what Dirk (the previous intel_pstate maintainer) always told me.
Comment 31 Doug Smythies 2015-03-01 17:43:16 UTC
@John:
@manuel:

Would both of you do the following experiment? Its purpose is to determine if the issues are C7 state specific.

Note that my i7-2600K doesn't go into C7 anyway, but I did test syntax and such, by limiting it to C3.

Eliminate the system's ability to go into C7: Edit /etc/default/grub and change the GRUB_CMDLINE_LINUX_DEFAULT line:

GRUB_CMDLINE_LINUX_DEFAULT="intel_idle.max_cstate=6"

If there is already other stuff on the line, the new parameter becomes additional. Example:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 intel_pstate=enable intel_idle.max_cstate=3 crashkernel=384M-:128M"

Update: "sudo update-grub"

After a re-boot, observe the limit is hit (3 in my test case):

doug@s15:~/temp$ dmesg | grep -i " max_cstate "
[    1.004557] intel_idle: max_cstate 3 reached

doug@s15:~/temp$ grep -i " max_cstate " /var/log/kern.log
Mar  1 09:17:35 s15 kernel: [    1.004557] intel_idle: max_cstate 3 reached

Observe with turbostat:

doug@s15:~/temp$ sudo ./turbostat sleep 20
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      61    3.80    1605    3411       0    3.80   92.39    0.00    0.00      29      29    0.18   69.39    0.00    0.00    6.28    2.63    0.23
       0       0     140    8.69    1605    3411       0    3.81   87.50    0.00    0.00      29      29    0.18   69.39    0.00    0.00    6.28    2.63    0.23
       0       4      61    3.79    1605    3411       0    8.72
       1       1       1    0.03    1606    3411       0    0.03   99.93    0.00    0.00      29
       1       5       0    0.00    1608    3411       0    0.06
       2       2     129    8.05    1605    3411       0    6.97   84.98    0.00    0.00      29
       2       6     111    6.94    1605    3411       0    8.08
       3       3      29    1.80    1605    3411       0    1.04   97.17    0.00    0.00      29
       3       7      18    1.10    1605    3411       0    1.73
20.001256 sec

On an otherwise "idle" system, do the "perf record" as previously:

sudo perf record -a --event=power:pstate_sample sleep 600

and post back here the resulting perf.data file. No need for you to post process yourself, other than for general interest, as I am now using a modified normalize.py script.
Comment 32 da_audiophile 2015-03-01 19:38:55 UTC
OK, dmesg confirms and Arch doesn't log to /var/log/kern.log

% dmesg -t | grep -i " max_cstate "
intel_idle: max_cstate 6 reached

From grub.cfg:
linux /boot/vmlinuz-linux root=/dev/sdb3 rw intel_idle.max_cstate=6

Problem is both turbostat and i7z are showing C7 most of the time.  I emailed you the perf.data anyway.

Note the kernel I am using is the default Arch with 300 Hz tickrate; the only exception is that your patch has been applied.
Comment 33 Doug Smythies 2015-03-02 06:48:39 UTC
> it's as though CPUs that have gone into the C7 state are still
> casting their target p_state vote into the processor PLL
> decision stuff, whereas I thought those CPUs were supposed
> to lose their vote.

I should have said "gone out of the C0 state" rather than "gone into the C7 state", as that is all I can deduce so far, and even that isn't conclusive yet.

I have looked sample by sample over a lot of data now. Manuel's data shows this issue more clearly than John's data, but John's does have it.

As a sanity test I got some data from my i7-2600K, and it is O.K., meaning that CPUs that are not in the C0 state do lose their ability to influence what clock frequency the PLL will target (at least as best as I can determine).

I am unable to process John's data from earlier today, as it appears to be from an unpatched kernel instead of a patched kernel.
Comment 34 da_audiophile 2015-03-02 20:20:00 UTC
Sorry about that, Doug.  I am recompiling now and will run this test ASAP.  I will attempt to limit to c6 and if that fails as I reported, I will try c3.
Comment 35 Doug Smythies 2015-03-05 03:44:56 UTC
Created attachment 169101 [details]
John i7z example

For the John i7z data the attached details an example where the CPU frequency is being held too high by a CPU that is not in the C0 state.

Notice that CPU 5 has a target pstate of 41, but it is not in C0, and when the intel_pstate driver finally does run on CPU 5 it has been 3.290 seconds since the last time the driver was run. Meanwhile CPU 1 has done several passes through the driver.

Starting with line 1430, observe that CPU 1 exits that pass with a target pstate of 8, and no other CPU has an elevated target pstate other than CPU 5, which I think is not in the C0 state.

Now, observe the frequency calculation for CPU 1 for its next pass (line 1431): it is 4.1 GHz. Why? It should be around 800 MHz. Due to the known issue where C0 weighting is needed, a target pstate of 44 is calculated for a very low load situation.

Jump ahead to line 1434 where CPU 1 sorts itself out again, exiting with a target pstate of 8. And 16.68 milliseconds (5 jiffies) later, on line 1436 the frequency is calculated as 4.10 GHz, whereas, and again, it should have been 800 MHz because CPU 5 should not be casting a vote in the PLL decision making.
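To make the "vote" reasoning above concrete, here is a toy model. It is only a sketch of the hypothesis under discussion: the highest-vote-wins rule and the function name are simplifying assumptions for illustration, not documented hardware behavior. The pstate values (8 and 41) are the ones from this example, and the 100 MHz bus clock is the BCLK reported by i7z in the description.

```python
BCLK_MHZ = 100  # bus clock as reported by i7z in the original description

def package_pstate(targets, in_c0, idle_cpus_vote):
    """Toy model: the package runs at the highest target pstate among voting CPUs.

    targets        -- {cpu: target pstate}
    in_c0          -- set of CPUs currently in the C0 state
    idle_cpus_vote -- if True, CPUs outside C0 keep their vote (the suspected fault)
    """
    votes = [p for cpu, p in targets.items() if idle_cpus_vote or cpu in in_c0]
    return max(votes) if votes else None

targets = {1: 8, 5: 41}   # CPU 1 just exited the driver at 8, CPU 5 is parked at 41
in_c0 = {1}               # CPU 5 is believed not to be in C0

expected = package_pstate(targets, in_c0, idle_cpus_vote=False)
suspected = package_pstate(targets, in_c0, idle_cpus_vote=True)
print(expected * BCLK_MHZ, "MHz expected if idle CPUs lose their vote")
print(suspected * BCLK_MHZ, "MHz if CPU 5 keeps voting")
```

With CPU 5 excluded the model gives 800 MHz; with CPU 5 still voting it gives 4100 MHz, which is the figure seen in the data.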
Comment 36 Doug Smythies 2015-03-05 04:20:13 UTC
Created attachment 169111 [details]
manuel data - example of CPU frequency influence

This example is from the manuel data.

Start with line 1264, where CPU 7 does a pass through the intel_pstate driver, having not done one for 2.63 seconds. It drops its target pstate from 40 to 21. Now only CPU 5 has a target pstate of 40, and the next highest target is 21. CPU 5 is likely not in C0, and eventually goes into C0 and does a pass through the driver on line 1267, after 2.79 seconds. Meanwhile CPU 0 has done a couple of passes through the driver, and on line 1268 calculates a frequency of 3.58 GHz. In this case, and mainly because I made a mistake in which screen to take a shot of, we cannot deduce anything. It is possible that CPU 5 was in C0 at 4 GHz long enough to affect CPU 0.

However, observe CPU 6 in the lines above. It calculates 4 GHz, while it appears that both CPUs 5 and 7 are not in C0 and should not be influencing the PLL decisions.
Comment 37 Doug Smythies 2015-03-05 04:44:13 UTC
Created attachment 169121 [details]
john 300 Hz kernel example

This example is from John's 300 Hertz Kernel capture of several days ago.

In the area of concern, CPU 3 has the highest target pstate of 25. When CPU 3 does finally pass through the driver again, it has been 1.34 seconds since its last trip through. Note that there have been 1.25 million clock cycles in that time, so this is not a great example. Why not? Because we don't really know if CPU 3 woke up and went into C0 several times in that 1.25 million clock cycles. All it has to do to avoid a pass through the intel_pstate driver is to not be in the C0 state when the jiffy boundary occurs. Anyway...

Notice on lines 2128, 2129 and 2131 CPU 0 is at 2.5 GHz, as influenced by CPU 3.
Comment 38 Doug Smythies 2015-03-05 05:19:25 UTC
Created attachment 169131 [details]
Doug 250 Hz kernel example

From Doug's 250 Hz kernel with two tasks running: one at 7 Hertz work / sleep frequency and a load of 17.5% (25 milliseconds per work period 7 times per second); and one at 3 Hertz work / sleep frequency and a load of 7.5% (25 milliseconds per work period 3 times per second).
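The two load figures follow directly from the work period and the work / sleep frequency. A quick check of that arithmetic (the function name is just for illustration):

```python
def duty_cycle_percent(work_ms: float, periods_per_second: float) -> float:
    """Busy milliseconds per 1000 ms of wall time, expressed as a percentage."""
    busy_ms_per_second = work_ms * periods_per_second
    return busy_ms_per_second / 10.0  # (busy_ms / 1000 ms) * 100 %

print(duty_cycle_percent(25, 7))  # task 1: 25 ms of work, 7 times per second
print(duty_cycle_percent(25, 3))  # task 2: 25 ms of work, 3 times per second
```

This gives 17.5% and 7.5%, so the two tasks together keep the system busy only a quarter of the time.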

Notice how the frequency of CPU 1 drops when CPU 3 and other CPUs are not in C0, as expected, even though some of their target pstates are higher than that of CPU 1.
Comment 39 Doug Smythies 2015-03-05 07:12:48 UTC
Created attachment 169151 [details]
From the John C3 data sample

This one shows CPU 0 frequency being influenced by CPUs 4 and 3, when they appear to not be in C0, and are in a long gap between runs of the driver, 3.3 and 4.0 seconds.

Notice in line 1750 how everything is finally O.K.
Comment 40 da_audiophile 2015-03-05 08:12:58 UTC
Thanks for all the work you are putting into this, Doug and for publishing your observations with data and formulating a hypothesis.

Kristen - What do you think about these recent data?
Comment 41 manuel.bua 2015-03-05 09:06:45 UTC
Sorry for not having time to follow this much better, but thanks for all the hard work you are putting into solving this problem.
Comment 42 Doug Smythies 2015-03-05 15:18:46 UTC
Where are we and how do we move forward?

. I believe the 300 Hz versus 250 Hz versus 1000 Hz to be a red herring. It just so happens that the 300 Hz kernel has a higher probability of interacting with the xorg stuff in such a way as to increase the manifestation of the issues.

. The i7z program itself exacerbates the issues.

. Unduly increased CPU frequency is largely due to the known issue where C0 weight needs to be re-introduced into the driver.

. On the particular processor in question (i7-4790K) the issue is exacerbated because, apparently, CPUs not in the C0 state do not lose their vote contribution to the PLL decision making. While its magnitude would be greatly reduced, unwarranted higher CPU frequencies from this issue would still occur even after adding the C0 weighting inclusion fix to the driver. (Think of CPU N going out of C0 for a period of seconds (never seen more than 4.00 seconds on any processor) while at a target pstate of 44.)

. The only thing within the processor I can think of is the IA-32_ENERGY_PERF_BIAS register (section 14.3.2 of Vol 3B of the IA-32 SDM: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf), which doesn't mean there are not other possibilities. I don't know how to investigate this further.

. Suggested experiment: Purpose: To give more definitive, easier to analyze data. Disable all GUI stuff and non essential processes, and do a more controlled load test using the consume program (will be attached later).
 
. Suggested experiment: Purpose: To determine if the PLL vote issue is somehow due to Arch kernel configuration: John and / or Manuel should compile a kernel but using an Ubuntu config file. I can attach my kernel 4.0RC2 config file (basically Ubuntu), if John is willing to compile a 4.0RC2 test kernel with it. I no longer have my 3.19 config file, but I could come up with one.
Comment 43 Doug Smythies 2015-03-05 16:25:54 UTC
Created attachment 169251 [details]
Program to apply load at specified work / sleep rate

Please use the attached program to do an experiment.
Don't worry about the calibration, as we are not using fixed work packets mode.

Disable any xorg or whatever GUI stuff.
Disable anything else you can think of that isn't needed.
The goal is to have a system as "idle" as possible.

run:
./consume 17.5 7.0 1000 &
./consume 7.5 3.0 1000 &

and run perf record as normal (requires the patched kernel):
sudo perf record -a --event=power:pstate_sample sleep 600

Send me the resulting perf.data file.
This data should be easier to analyze and more definitive.
Comment 44 Doug Smythies 2015-03-06 00:06:50 UTC
Created attachment 169381 [details]
John consume data sample

Even though I screwed up the operating conditions for John's test using the consume program, I was able to extract a clear example that shows the influence of a CPU not in the C0 state on CPUs that are.

Notice line 941 where CPU 2 exits the driver with a target pstate of 25.
Now to fit everything in one screen shot, I deleted some lines.
Notice the next pass of CPU 2 through the driver is 4 seconds later at line 960.

Now observe CPU 0, highlighted in yellow. In line 951 the frequency is 2.5 GHz, but CPU 2 should not be influencing it. At worst, there might be some influence from CPU 1, but it has finished its work chunk and gone out of C0 also.

Now observe lines 954 and 955, not highlighted, CPU 0 has finished its work chunk and will have gone out of C0 by then. CPU 1 is still influenced by the CPU 0 target pstate. It shouldn't be.

O.K. At this point, I think I have enough proof about this part.
Comment 45 Doug Smythies 2015-03-06 01:57:57 UTC
Created attachment 169441 [details]
Suggested kernel config file

The suggested kernel config file for a 4.0RC2 kernel.
For the experiment to compare Arch kernel config with Ubuntu config with regards to this issue.
Comment 46 Doug Smythies 2015-03-07 00:01:34 UTC
From my comment 42 above:

. The only thing within the processor I can think of is the IA-32_ENERGY_PERF_BIAS register (section 14.3.2 of Vol 3B of the IA-32 SDM: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf), which doesn't mean there are not other possibilities. I don't know how to investigate this further.

It turns out that turbostat reads the register and lists its contents in verbose mode. I.E.

sudo turbostat -v sleep 1

gives (edited):

cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
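For reference, a small sketch of how a raw value like the one above can be turned into a label. The mapping (0 = performance, 6 = balanced, 15 = powersave, anything else "custom") is my understanding of the labels turbostat prints and should be treated as an assumption; only the low 4 bits of the register carry the hint.

```python
# Assumed label mapping, modelled on what turbostat prints in verbose mode.
EPB_LABELS = {0: "performance", 6: "balanced", 15: "powersave"}

def decode_epb(msr_value: int) -> str:
    """Label the energy/performance bias hint held in the low 4 bits."""
    hint = msr_value & 0xF
    return EPB_LABELS.get(hint, "custom")

print(decode_epb(0x00000006))
```

For the value shown above this prints "balanced", matching the turbostat line.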

John and / or manuel: what do you get?
Comment 47 da_audiophile 2015-03-07 00:09:56 UTC
% sudo turbostat -v sleep 1
turbostat v3.7 Feb 6, 2014 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 13 CPUID levels; family:model:stepping 0x6:3c:3 (6:60:3)
CPUID(6): APERF, DTS, PTM
RAPL: 2979 sec. Joule Counter Range, at 88 Watts
cpu0: MSR_NHM_PLATFORM_INFO: 0x80838f3012800
8 * 100 = 800 MHz max efficiency
40 * 100 = 4000 MHz TSC frequency
cpu0: MSR_IA32_POWER_CTL: 0x0000005d (C1E auto-promotion: DISabled)
cpu0: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008400 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, locked: pkg-cstate-limit=0: pc0)
cpu0: MSR_NHM_TURBO_RATIO_LIMIT: 0x2c2c2c2c
44 * 100 = 4400 MHz max turbo 4 active cores
44 * 100 = 4400 MHz max turbo 3 active cores
44 * 100 = 4400 MHz max turbo 2 active cores
44 * 100 = 4400 MHz max turbo 1 active cores
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x000002c0 (88 W TDP, RAPL 0 - 0 W, 0.000000 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x42ffff001dffff (UNlocked)
cpu0: PKG Limit #1: ENabled (4095.875000 Watts, 16.000000 sec, clamp ENabled)
cpu0: PKG Limit #2: ENabled (4095.875000 Watts, 0.002441* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00641400 (100 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88430800 (33 C)
cpu0: MSR_IA32_THERM_STATUS: 0x884a0800 (26 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88490800 (27 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88480800 (28 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x884a0800 (26 C +/- 1)
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -       6    0.14    4020    4000       0    1.03    0.14    0.08   98.61      27      31    0.00    0.00    0.00    0.00    5.05    0.23    0.03
       0       0      16    0.39    4043    4000       0    1.11    0.00    0.00   98.50      27      31    0.00    0.00    0.00    0.00    5.05    0.23    0.03
       0       4       4    0.10    3862    4000       0    1.40
       1       1      10    0.25    4027    4000       0    1.62    0.00    0.00   98.13      27
       1       5       2    0.06    4354    4000       0    1.81
       2       2       6    0.17    3892    4000       0    0.16    0.43    0.00   99.25      26
       2       6       2    0.04    4345    4000       0    0.29
       3       3       5    0.12    3968    4000       0    0.86    0.12    0.32   98.57      24
       3       7       1    0.02    4078    4000       0    0.96
1.000814 sec
Comment 48 Doug Smythies 2015-03-07 00:38:01 UTC
So, I guess John doesn't have the register. Not all processors have it.
Comment 49 manuel.bua 2015-03-07 12:30:01 UTC
$ sudo turbostat -v sleep 1
[sudo] password for manuel: 
turbostat v3.7 Feb 6, 2014 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 13 CPUID levels; family:model:stepping 0x6:3c:3 (6:60:3)
CPUID(6): APERF, DTS, PTM, EPB
RAPL: 2979 sec. Joule Counter Range, at 88 Watts
cpu0: MSR_NHM_PLATFORM_INFO: 0x80838f3012800
8 * 100 = 800 MHz max efficiency
40 * 100 = 4000 MHz TSC frequency
cpu0: MSR_IA32_POWER_CTL: 0x0104005d (C1E auto-promotion: DISabled)
cpu0: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e000400 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, UNlocked: pkg-cstate-limit=0: pc0)
cpu0: MSR_NHM_TURBO_RATIO_LIMIT: 0x2a2b2c2c
42 * 100 = 4200 MHz max turbo 4 active cores
43 * 100 = 4300 MHz max turbo 3 active cores
44 * 100 = 4400 MHz max turbo 2 active cores
44 * 100 = 4400 MHz max turbo 1 active cores
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x000002c0 (88 W TDP, RAPL 0 - 0 W, 0.000000 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x428370001a82c0 (UNlocked)
cpu0: PKG Limit #1: ENabled (88.000000 Watts, 8.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: ENabled (110.000000 Watts, 0.002441* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00641400 (100 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88440800 (32 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88490000 (27 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x884b0000 (25 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88480000 (28 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88470000 (29 C +/- 1)
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      38    0.94    4095    3991       0    8.49    0.24    0.23   90.10      28      31    0.00    0.00    0.00    0.00    8.24    1.38    0.00
       0       0      34    0.85    3974    3990       0    5.12    0.11    0.28   93.64      26      31    0.00    0.00    0.00    0.00    8.24    1.38    0.00
       0       4       2    0.05    3996    3990       0    5.92
       1       1      98    2.31    4268    3990       0    2.65    0.04    0.07   94.93      27
       1       5       1    0.03    4002    3991       0    4.93
       2       2      54    1.33    4050    3991       0    1.46    0.51    0.08   96.62      28
       2       6       1    0.01    4019    3991       0    2.79
       3       3     116    2.89    4018    3991       0   21.08    0.29    0.51   75.23      28
       3       7       1    0.03    3946    3991       0   23.94
1.000750 sec
Comment 50 Doug Smythies 2015-03-07 14:17:33 UTC
Both John and Manuel have the exact same processor and are using the exact same version of turbostat, yet only one has the readout for the MSR_IA32_ENERGY_PERF_BIAS register. Weird.

The only other thing I notice is this: "pkg-cstate-limit=0: pc0" and the related 0% entries in the package C-state columns, whereas on my system it is "pkg-cstate-limit=3: pc6" and the related 97.8% entry in the pc6 column. I don't know if this is significant or not.
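
The two readouts also show different MSR_NHM_TURBO_RATIO_LIMIT values (0x2c2c2c2c for John, 0x2a2b2c2c for Manuel); whether that matters here is not known. A rough Python sketch of how the "max turbo N active cores" lines can be derived, assuming one ratio byte per active-core count (lowest byte = 1 active core) and a 100 MHz bus clock, as the turbostat output suggests. The function name is hypothetical:

```python
def turbo_limits_mhz(msr_value, cores=4, bclk_mhz=100):
    """Map active-core count -> max turbo frequency in MHz."""
    limits = {}
    for n in range(1, cores + 1):
        # assumption: byte (n - 1) holds the ratio for n active cores
        ratio = (msr_value >> (8 * (n - 1))) & 0xFF
        limits[n] = ratio * bclk_mhz
    return limits

print(turbo_limits_mhz(0x2c2c2c2c))  # John's value
print(turbo_limits_mhz(0x2a2b2c2c))  # Manuel's value
```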
Comment 51 Doug Smythies 2015-03-08 17:14:49 UTC
Created attachment 169871 [details]
Demonstrates failure to raise target pstate under heavy load. 1 of 2.

While not directly related to this bug report, this attachment and the next one demonstrate what is called "the excessive duration effect". The load on CPU 5 is just under 90%, yet the target pstate is not raised because it just so happens that the CPU is generally not in the C0 state on the jiffy boundary, and so the interval (duration) between passes through the intel_pstate driver is very long. The result is that the duration code is triggered incorrectly, forcing the target pstate downwards.

This demonstration was rigged for dramatic effect. Under real-life conditions, and very dependent on the use case, the excessive duration effect occurs anywhere from a few to a few thousand times per hour.
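
To make the effect concrete, here is a simplified floating-point model in Python of a long-duration adjustment. It assumes a 10 ms nominal sample time and that the busy figure is scaled by sample_time / duration once the duration exceeds three sample times. Those constants and the function name are assumptions for illustration, not copied from the driver:

```python
def scaled_busy(core_busy_pct, duration_ms, sample_time_ms=10.0):
    """Scale the busy figure down when the gap since the last sample is long."""
    if duration_ms > 3 * sample_time_ms:
        return core_busy_pct * sample_time_ms / duration_ms
    return core_busy_pct

print(scaled_busy(90.0, 10.0))   # normal interval: busy figure unchanged
print(scaled_busy(90.0, 300.0))  # long interval: treated as nearly idle
```

In this model, a CPU that was about 90% busy but is only sampled after a 300 ms gap is seen as about 3% busy, which pulls the target pstate down.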
Comment 52 Doug Smythies 2015-03-08 17:16:00 UTC
Created attachment 169881 [details]
Demonstrates failure to raise target pstate under heavy load. 2 of 2.

Attachment 2 of 2. See previous comment.
Comment 53 Doug Smythies 2015-03-08 17:29:36 UTC
(In reply to Doug Smythies from comment #42)
> Where are we and how do we move forward?
 
> . Suggested experiment: Purpose: To determine if the PLL vote issue is
> somehow due to Arch kernel configuration: John and / or Manuel should
> compile a kernel but using an ubuntu config file. I can attach my kernel
> 4.0RC2 config file (basically Ubuntu), if John is willing to compile a
> 4.0RC2 test kernel with it. I do not still have my 3.19 config file, but I
> could come up with one.

John created a kernel using an Ubuntu config file and supplied the perf record data. The results were the same. Therefore the conclusion is that this is not a kernel configuration issue.

I have not attached a screen shot of my spreadsheet, but I could if needed.

I am moving on to reviving the test intel_pstate driver code from June / July. It will take a while.
Comment 54 Doug Smythies 2015-03-08 23:59:52 UTC
By the way, on attachment 169881 [details] (2 of 2 above) observe lines 8377 through 8380 where the load is 88.6 percent but the target pstate remains locked at 26, even though scaled busy is 99. The target pstate really should increase under the load.
Note that on those lines the CPU is in the C0 state on jiffy boundaries.

This is another known issue with the intel_pstate driver, the incredibly finicky tradeoff between integer math and gain factors and underdamped and overdamped servo system response. The test codes we were using in June /July dealt with this scenario, but some work does remain to be done.

With what we have discovered herein, where some processors never lose their vote into the PLL decision as to what CPU frequency to generate, and this known potential lock up scenario, this might be a contributing factor with bug 90421.
Comment 55 Doug Smythies 2015-03-31 15:13:17 UTC
O.K. it is taking me longer than originally expected to revive my code from June / July. However I am re-developing the algorithms and fixing some things that we wanted to change, but were too busy at the time, as I go.

Meanwhile, the very important issue uncovered herein is the issue where it seems some processors never lose their vote into the PLL decision as to what CPU frequency to generate. I do not know what to do about that part of the issue. However, it does reveal an issue that existed anyhow; it was just more rarely manifested with processors that behave properly in giving up their vote. This code:

        /*
         * core_busy is the ratio of actual performance to max
         * max_pstate is the max non turbo pstate available
         * current_pstate was the pstate that was requested during
         *      the last sample period.
         *
         * We normalize core_busy, which was our actual percent
         * performance to what we requested during the last sample
         * period. The result will be a percentage of busy at a
         * specified pstate.
         */
        core_busy = cpu->sample.core_pct_busy;
        max_pstate = int_tofp(cpu->pstate.max_pstate);
        current_pstate = int_tofp(cpu->pstate.current_pstate);
        core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));

is incorrect.

Why? Because core_busy is being calculated based on what was asked for and not what was actually done. With processors that properly lose their vote, the code mostly works because mostly one gets what they asked for. With processors that do not lose their vote, there is a dramatic increase in the number of occurrences of the actual operating pstate not being the requested current_pstate for this cpu. Thus ridiculous core_busy numbers are calculated unduly driving up the target pstate. For example see attachment 169101 [details]

I am saying that this:

core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));

should be something like this (not actually real code):

core_busy = mul_fp(core_busy, div_fp(max_pstate * 100MHz , measured_frequency));
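
A small numeric illustration of the difference, as a Python sketch using plain floats instead of the driver's fixed-point helpers. The numbers (max_pstate 40, a requested pstate of 8, a measured 4000 MHz, a 100 MHz bus clock) and the function names are invented for the example:

```python
BCLK_MHZ = 100.0  # assumed bus clock, so pstate 40 corresponds to 4000 MHz

def busy_by_request(core_busy, max_pstate, current_pstate):
    """Existing normalization: uses the pstate that was asked for."""
    return core_busy * max_pstate / current_pstate

def busy_by_measurement(core_busy, max_pstate, measured_mhz):
    """Suggested normalization: uses the frequency actually delivered."""
    return core_busy * (max_pstate * BCLK_MHZ) / measured_mhz

# This CPU asked for pstate 8, but another CPU's vote kept the clock at 4000 MHz.
print(busy_by_request(100.0, 40, 8))            # 500.0
print(busy_by_measurement(100.0, 40, 4000.0))   # 100.0
```

Normalizing by the request inflates the figure fivefold in this made-up case; normalizing by the measured frequency does not.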
Comment 56 Doug Smythies 2015-04-12 17:16:18 UTC
I have submitted my patch set for review.
I have also asked John and Manuel to try the patch set.

I have feedback from one GUI type (Ubuntu, 250 Hz kernel) user with an i5-4690K processor. On an otherwise idle system, the average CPU frequency goes from 2.6 GHz without my patch set to 1.1 GHz with it. There is also a package energy savings of about 0.52 watts.
Comment 57 Doug Smythies 2015-04-15 23:31:33 UTC
Created attachment 174141 [details]
Compare some response curves - fixed load method

As we start to get some feedback from John and Manuel, there seem to be some use cases where the default response curve of my patch set might be holding the target pstate down a little too hard. For example, Manuel seems to have a situation where his game (Dying Light) seems to think his computer does not have enough CPU power, so it drops the FPS (Frames Per Second) rate. Meanwhile, the patched intel_pstate driver thinks there isn't enough demand on the CPUs to warrant raising the target pstate. The game seems to switch CPUs in an odd manner, perhaps similar to the Phoronix ffmpeg test.

The attached graph shows a few CPU frequency versus load (fixed load method) response curves. The patched response has been pushed way over, such that the Phoronix ffmpeg test average time is the same as for the unpatched kernel and the acpi_cpufreq driver.
Comment 58 Doug Smythies 2015-04-15 23:38:22 UTC
Created attachment 174151 [details]
Compare some response curves - fixed work packet method

The attached graph compares CPU frequency versus load response curves for the fixed amount of work method (more like a real life scenario). The load is roughly normalized to the max non-turbo clock frequency. While it would be annoying, we may have to consider moving the default response curve some, so as to cover some odd use cases.
Comment 59 Doug Smythies 2015-04-16 21:53:36 UTC
Created attachment 174211 [details]
CPU frequencies during game - with patch set - default settings

Graph 1 of 2.
From Manuel: CPU frequencies during game (Dying Light) play with the patch set at default settings.

Note: this game is a real challenge, because it changes CPUs at a high rate.
Comment 60 Doug Smythies 2015-04-16 21:57:57 UTC
Created attachment 174221 [details]
CPU frequencies during game - with patch set - adjusted settings

Graph 2 of 2

From Manuel: CPU frequencies during game play with the default setting adjusted as per the shifted response graphs. They are way higher and Manuel reports the FPS (Frames Per Second) is back to normal.

However, the "idle" frequencies are up some.
Comment 61 Jan Claußen 2015-04-27 21:38:54 UTC
I have the same problem. My c7 runs constantly between 70-95%. I tried adding "intel_idle.max_cstate=6" to my GRUB_CMD_LINE. It runs a little cooler now. Just a little though. Did you come to a solution? I read all the posts, but it is not really clear to me.
Comment 62 Jan Claußen 2015-04-27 21:52:14 UTC
If I can help somehow I'd be glad to do so. I need this laptop to study and this heat and noise are driving me nuts. I'm an experienced Linux user. But my coding skills cover just basic C-programming.
Comment 63 Doug Smythies 2015-04-28 03:49:58 UTC
Status:
While "idle" (which isn't really idle on a desktop) frequencies are up a little with the patch set and with the shifted curve, after a number of tests, there has been no measurable increase in processor power consumption at "idle". This is a good thing. In general, for "idle" and for some processors, there is about 1/2 a watt of package power saved versus the un-patched version of the intel_pstate driver. For other processors, there seems to be no difference in package power, at "idle".

Just yesterday, I even managed to obtain some test results from a very low power device: Intel(R) Pentium(R) CPU  N3530  @ 2.16GHz

With such a low IIR filter gain of 5%, the rise time is considerably slower than with 10%. While I am convinced there must be an operational scenario where this would be detrimental, I have yet to find it.
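
As a rough feel for the rise time difference, here is a Python sketch that counts how many updates a simple single-pole IIR filter needs to reach 90% of a step input. The filter form (output moves `gain` of the way toward the input on each update) is an assumption for illustration and may not match the patch set exactly; the function name is hypothetical:

```python
def samples_to_reach(gain, fraction=0.9):
    """Count filter updates needed for a unit step to reach `fraction`."""
    y, n = 0.0, 0
    while y < fraction:
        y += gain * (1.0 - y)  # single-pole IIR: move `gain` of the way to the input
        n += 1
    return n

print(samples_to_reach(0.10))  # gain of 10%
print(samples_to_reach(0.05))  # gain of 5%
```

Under that assumption the 5% filter needs about twice as many updates as the 10% filter (45 versus 22) to get to 90% of the step.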

Of course, Intel (Kristen) has to do a bunch of testing of various workflows on various platforms.

My patch set is now out of sync with Kernel 4.1RC1, and I am just attempting to re-base it now (I am not very git savvy).

Jan: Please provide your exact processor model number and the linux distribution you use. Are you comfortable compiling the kernel? I am looking for test results from anyone willing to try, and I have made kernels for one person (Ubuntu kernel configuration, but kernel.org source + my patch set). John and Manuel are able to compile their own kernels.

Jan: your comments about C7 time and "intel_idle.max_cstate=6" don't really make sense to me. Regardless we have moved past that idea.

Jan: as a temporary workaround, you might want to try going back to the acpi-cpufreq driver. Just add "intel_pstate=disable" to your GRUB_CMDLINE_LINUX_DEFAULT line.
Comment 64 Jan Claußen 2015-04-28 16:30:15 UTC
I have a Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz. 

Possible that they don't make sense. I'm not that much of a pro yet. But working on it. What I meant is that I used the i7z tool to monitor my CPU. I don't know what the c1-c7 mean. I guessed they were cores. But I also don't think that I've got 7 of them. Anyway, when I run i7z the c7 column is around 80% and the others are under 10-20%. 

I'm running Antergos Linux, which is based on Arch. This is the first time that I use this system due to the ease of reinstalling. I was experiencing the EXACT SAME PROBLEMS under Arch.

Since the kernel is the same it would be possible to try out your compilation. I don't assume that it could break my laptop, could it? I have never compiled a kernel before. What do I need to do?

I tried disabling intel_pstate, but it didn't change anything, like all the other solutions on the web. Plus disabling c7 made my laptop a little cooler, so I'm going to keep it until there is a more sophisticated solution.
Comment 65 Doug Smythies 2015-04-28 18:58:14 UTC
@Jan: Thanks for reporting back. Your problem is something different than this particular bug report is covering.

Please also note that i7z is garbage, and it prints CPU frequencies that are not possible.

Various C states are different levels of CPU idle. C7 is the deepest, lowest power (i.e. zero power) state, and the more time spent there the better.
Comment 66 da_audiophile 2015-06-20 15:22:04 UTC
Doug - Can you update us on the status of this work?

Median idle freq under Arch in Xorg (300 Hz tick rate): 1,237 MHz
Median idle freq under Ubuntu 15.04 live CD: 886 MHz
Median idle freq under Arch in Xorg (1000 Hz tick rate): 956 MHz
Median idle freq under Arch without Xorg running (1000 Hz tick rate): 831 MHz
Comment 67 Doug Smythies 2015-06-22 15:34:28 UTC
(In reply to da_audiophile from comment #66)
> Doug - Can you update us on the status of this work?
> 
> Median idle freq under Arch in Xorg (300 Hz tick rate): 1,237 MHz
> Median idle freq under Ubuntu 15.04 live CD: 886 MHz
> Median idle freq under Arch in Xorg (1000 Hz tick rate): 956 MHz
> Median idle freq under Arch without Xorg running (1000 Hz tick rate): 831 MHz

Hi John,

Everything is on hold and will not make it into kernel 4.2RC1.
The maintainer is working on a method for calculating C0 time.
While the maintainer is in favor of eliminating the pid controller, what she is not sure about yet is whether she likes my new algorithm.

Your median idle frequencies all look O.K. to me.
Comment 68 manuel.bua 2015-07-18 11:02:10 UTC
So.. can the maintainer update us on this?
Comment 69 Doug Smythies 2015-07-18 15:09:52 UTC
Created attachment 183041 [details]
Phoronix ffmpeg test comare

Hi Manuel,

Just this morning, I was going to ask if you would be willing to try the most recent version of my patch set? It is the same as before, with one small, yet significant, change to deal with comment 55 above. With the change I was able to both back off the response curve somewhat and increase the IIR filter gain, and still maintain equal or better performance compared to the acpi-cpufreq driver on the Phoronix ffmpeg test, regardless of the number of CPUs allocated. (I'll attach a graph) Recall that the ffmpeg test represents a particularly annoying challenge to frequency scaling drivers. The patch set is based on the Kernel 4.2 release candidate series.

What I want you to try is your Dying Light game with the patch set with the operating parameters set to: iir_gain 10%; c0_floor 20%; c0_ceiling 65%. Furthermore, I would like you to try +/- N% to both the floor and the ceiling, keeping 45% difference between them, to find the point where your Dying Light game starts to reduce the frame rate. I.E. I want to find how much margin, if any, we would have with the proposed parameter settings.

I know I am asking you for a lot of work, but I do not know of a way to simulate the Dying Light game situation. If you are agreeable, we will go to e-mails to proceed and then report findings back here.
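
The +/- N% sweep could be tabulated ahead of time. A minimal Python sketch, assuming a shift range of +/- 10% in steps of 5% (the range and step are not specified above and are picked only for illustration); the function name is hypothetical:

```python
def sweep_settings(base_floor=20, base_ceiling=65, max_shift=10, step=5):
    """Return (shift, floor, ceiling) tuples in percent, keeping the gap fixed."""
    settings = []
    for shift in range(-max_shift, max_shift + 1, step):
        settings.append((shift, base_floor + shift, base_ceiling + shift))
    return settings

for shift, floor, ceiling in sweep_settings():
    print(f"{shift:+d}%: c0_floor {floor}%, c0_ceiling {ceiling}%")
```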
Comment 70 Doug Smythies 2015-07-19 04:58:28 UTC
@ Manuel: never mind. I had some disappointing test results today, and will have to re-think my attempt to deal with comment 55.
Comment 71 Doug Smythies 2015-07-20 14:50:02 UTC
@Manuel: I would still like to determine some things with your Dying Light game: First, whether it works O.K. with iir_gain 10%, c0_floor 15%, c0_ceiling 50%; and if that works, then second, at what point the frames per second begin to drop (modifying c0_ceiling upwards from there).

The Kernel 4.3RC series patch set can be found at:
double u double u double dot smythies dot com /~doug/linux/intel_pstate/build22/

Example for setting the parameters:

doug@s15:~/temp$ sudo su
root@s15:/home/doug/temp# echo "150" > /sys/kernel/debug/pstate_snb/c0_floor
root@s15:/home/doug/temp# echo "500" > /sys/kernel/debug/pstate_snb/c0_ceiling
root@s15:/home/doug/temp# echo "10" > /sys/kernel/debug/pstate_snb/iir_gain_pct
root@s15:/home/doug/temp# cat /sys/kernel/debug/pstate_snb/c0_floor
150
root@s15:/home/doug/temp# cat /sys/kernel/debug/pstate_snb/c0_ceiling
500
root@s15:/home/doug/temp# cat /sys/kernel/debug/pstate_snb/iir_gain_pct
10
root@s15:/home/doug/temp# exit
exit
doug@s15:~/temp$
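
Judging from the transcript above, c0_floor and c0_ceiling appear to be written in tenths of a percent (15% becomes 150, 50% becomes 500) while iir_gain_pct is in whole percent. A small Python sketch that builds the equivalent echo commands under that assumption; it only prints the commands and does not touch debugfs, and the function name is made up:

```python
DEBUGFS_DIR = "/sys/kernel/debug/pstate_snb"  # path taken from the transcript above

def debugfs_writes(floor_pct, ceiling_pct, iir_gain_pct):
    """Return (path, value) pairs; floor/ceiling assumed to be tenths of a percent."""
    return [
        (f"{DEBUGFS_DIR}/c0_floor", str(round(floor_pct * 10))),
        (f"{DEBUGFS_DIR}/c0_ceiling", str(round(ceiling_pct * 10))),
        (f"{DEBUGFS_DIR}/iir_gain_pct", str(round(iir_gain_pct))),
    ]

for path, value in debugfs_writes(15, 50, 10):
    print(f'echo "{value}" > {path}')
```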
Comment 72 Doug Smythies 2015-07-28 16:59:00 UTC
(In reply to manuel.bua from comment #68)
> So.. can the maintainer update us on this?

Manuel: The maintainer asked me to re-base my patch set to the kernel 4.2RC series. She is working on an alternate method for calculating C0 time or utilization. I merely brought back the method previously used, but it might not work for all architectures. I am not aware of a timeline for this work.

Meanwhile I was hoping you could do those Dying Light tests, as I hope to be able to adjust the proposed default operating parameters a little.

I did go back and reviewed some previous Dying Light trace data. It rotates through CPUs at various loads very quickly.
Comment 73 Doug Smythies 2015-08-25 19:19:23 UTC
(In reply to Doug Smythies from comment #72)

> Meanwhile I was hoping you could do those Dying Light tests, as I hope to be
> able to adjust the proposed default operating parameters a little.

Just F.Y.I.

In the absence of any additional information from Manuel with his Dying Light game, the next time I have to re-base or update the patch set for the maintainer, in addition to fixing a couple of typos, I will be setting the following recommended default parameters:

c0_floor: 15%
c0_ceiling: 58%
iir_gain: 10%
Comment 74 Doug Smythies 2015-09-14 23:57:02 UTC
Recall our findings where it appeared as though sometimes when the CPUs were in C states above, I think it is, 1, their target pstate vote into the PLL was not being dropped as it should be. There is another case of exactly the same thing. This time the processor is an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz. There is an added twist in that something else is defining the minimum pstate, meaning that if all 4 CPUs set a target pstate of 5, the actual CPU frequency will be higher. 800 MHz in the particular trace example that I have, but it isn't always the same.
Comment 75 Doug Smythies 2015-09-16 23:40:49 UTC
Created attachment 187781 [details]
An example of the target pstates going to maximum with virtually no load

The extract is from a "perf record" session for this computer where, for whatever reason, the CPU frequency never goes down to the minimum for the processor.

It demonstrates how the current control algorithm has troubles when the CPU frequency is not what it thinks it should be.

The main issues are highlighted in red.

The issue starts on line 1354, where CPU 3 has a short duration pass through the driver, even though there is very little load (0.71%). Now, because the CPU frequency was not 500 MHz, the math goes mental and it ends up asking for a target pstate of 18. More typically, low load situations result in long durations and so the long duration adjustment to scaled_busy kicks in, but not this time.

Thus begins a saga where, with virtually no load, the system drives target pstates up for no reason.

Eventually the system does settle down, due to the long duration thing kicking in over all CPUs.
Comment 76 Doug Smythies 2015-09-16 23:49:48 UTC
Created attachment 187791 [details]
Why does this system not go to 500MHz?

An extract detailing a quiet period where all target pstates are 5, and the system should settle to a CPU frequency in the 500 MHz range. However, it doesn't; it settles to the 800 MHz range. Why?

Note the blue highlighted cells. In that case the higher CPU frequency is O.K. because the previous pass of CPU 1 was so long ago (745 milliseconds), that the sample has some content from when frequencies were supposed to be higher.

For the cells highlighted in yellow, the CPU frequency should be in the 500 MHz range, but is not.
Comment 77 Doug Smythies 2015-09-17 00:00:04 UTC
Created attachment 187801 [details]
An example where CPU 0 does not give up it vote into the PLL

In this extract from the trace data, it appears as though CPU 0 does not give up its influence on the PLL when it is in a high C state. The duration is 985 milliseconds and there are very few clock cycles on CPU 0 during that so very long time (134470), so it is hard to imagine that it did not go above C2 in that time.

The suggestion is that all the light red highlighted cells should have had a much lower CPU frequency.

The cells that are not highlighted, but within the duration window are indeterminate due to partial sample times outside of the duration window.
Comment 78 Anon Y. 2015-09-20 04:04:23 UTC
@Doug Smythies, 

I believe I am experiencing the same issue described in this bug report. 

My CPU on idle constantly sits at max frequency around 3.9Ghz. 

"sudo cpupower frequency-info" reports that the available governors are powersave and performance, the available frequencies are 800 MHz - 3.4 GHz, and the current governor is powersave.

"sudo i7z" shows 4 cores are entering C1,C2,C3 state about 1% of the time and C7 around 98/99% of the time. Almost never in C6. 

I have tried "intel_pstate=disable" in /etc/default/grub and "sudo update-grub". Successfully goes back to acpi_cpufreq driver. CPU sits on idle at 800Mhz and VCore around 0.6968v. But acpi_cpufreq driver has no Turbo, and max frequency on 100% load is only 3.4Ghz.

I have tried stress testing the CPU with mprime (from the Arch AUR), with test type=2. On 100% load, on stock heatsink, with intel_pstate the CPU reaches 100 degrees C at around the 10 min mark. Thermal throttle kicks in and reduces the CPU frequency to as low as 2.2 GHz on the powersave governor and 2.9 GHz on performance. But it keeps the VCore voltage unnecessarily high, around 1.1210v. 

intel_pstate almost never reduces VCore even on idle according to i7z. But keeps the idle frequency at highest!

Observed with CoreTemp, win7 with intel chipset driver can manage Haswell much better. Run the same prime95 test type=2, on 100% load with turbo 3.9Ghz. In 10 min reaches 100 deg C. Thermal Throttle pulls back the VCore and Frequency at the same time, lowering the temp more quickly. Also keeps the 4 Core at constant 3.5Ghz at 100 deg C from that point onward.

Manjaro, being Arch-like, has a tick rate of 300 Hz. Above you have already established that setting 250 Hz or 1000 Hz tick rates will only mask the underlying problem and not solve it. Are there any tests or experiments I can run that might help you to further deduce the cause of this issue?


My H/W, S/W config as follows:

Hardware:
Intel Core i7 4770 (Haswell)
Revision: C0
Base Frequency: 3.4Ghz
Turbo: 3.9Ghz
4 Cores / 8 Thread (aka Hyper-threading)

Kernel: 
Linux laptop1-manjaro 4.1.6-3-MANJARO #1 SMP PREEMPT Sat Sep 5 10:57:06 UTC 2015 x86_64 GNU/Linux

Distribution:
LSB Version:	n/a
Distributor ID:	ManjaroLinux
Description:	Manjaro Linux
Release:	15.09-rc2
Codename:	Bellatrix

Desktop Environment:
default Manjaro xorg, openbox, xfce4 environment.
Comment 79 Doug Smythies 2015-09-20 16:00:41 UTC
(In reply to Anon Y. from comment #78)

> My CPU on idle constantly sits at max frequency around 3.9Ghz. 

What is your definition of "idle"? I ask because "idle" is not really idle on a desktop with a GUI and such. On a server, "idle" can be much more idle.

Anyway, it would be good to understand why your CPU frequency is so high, as from your C state data listed below it seems it should be lower. I can only think of acquiring some trace data to acquire better detail as to what is going on. I think your kernel is recent enough.

> I have tried "intel_pstate=disable" in /etc/default/grub and "sudo
> update-grub". Successfully goes back to acpi_cpufreq driver. CPU sits on
> idle at 800Mhz and VCore around 0.6968v. But acpi_cpufreq driver has no
> Turbo, and max frequency on 100% load is only 3.4Ghz.

? acpi-cpufreq supports turbo. Under 100% load do "grep MHz /proc/cpuinfo"; if it lists 3401, then it is in turbo mode. Note that the acpi-cpufreq scaling driver lists what CPU frequency was asked for and not what the CPU frequency actually is.
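
The same check can be scripted. A minimal Python sketch that pulls the "cpu MHz" values out of /proc/cpuinfo-style text and flags the base + 1 MHz value described above. The sample text is made up, the 3400 MHz base is the i7-4770 figure from comment 78, and the function names are hypothetical:

```python
def cpu_mhz_values(cpuinfo_text):
    """Extract the 'cpu MHz' values from /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            values.append(float(line.split(":", 1)[1]))
    return values

def looks_like_turbo(mhz, base_mhz=3400):
    """True when the listed value is base + 1 MHz (e.g. 3401), as noted above."""
    return round(mhz) == base_mhz + 1

# Made-up sample; on a real system, read the text from /proc/cpuinfo instead.
sample = "processor\t: 0\ncpu MHz\t\t: 3401.000\nprocessor\t: 1\ncpu MHz\t\t: 800.000\n"
print([(mhz, looks_like_turbo(mhz)) for mhz in cpu_mhz_values(sample)])
```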
Comment 80 Doug Smythies 2015-09-21 20:59:57 UTC
@Anon Y. The trace data you sent me is very consistent with what John was getting at the beginning of this bug report. I can not say much more because I had to use older manual methods to post process the data, as I was wrong and your kernel does not have the added trace information needed by our post processing tools. The older manual methods are simply too time consuming, so I have only post processed a couple of CPUs data. I can not say conclusively if your system is exhibiting this failure to give up PLL votes issue or not, but it seems likely.

Keep in mind that CPU frequency alone does not tell the whole story. One has to consider overall C states and such for an overall view of things.
Comment 81 Doug Smythies 2015-09-22 21:13:49 UTC
Created attachment 188101 [details]
CPU Frequencies for Anon Y

Just adding an overview CPU frequencies graph from Anon Y's second trace data set. The actual CPU loads are almost always very low. In this case, it is less clear from the spreadsheet data that the CPUs should be giving up their votes into the PLL. However, the cores are spending about 95% of their time in the C7 state. Also, there does seem to be some high frequency stuff going on, so maybe there is enough activity that the PLL doesn't have time to drop down in frequency in between (just speculating).
Comment 82 Doug Smythies 2015-09-23 06:29:09 UTC
Created attachment 188141 [details]
CPU loads for Anon Y

Just adding the corresponding CPU loads graph for the 2nd trace data from Anon Y.
Comment 83 Doug Smythies 2015-10-08 21:59:52 UTC
Originally I never posted it here, but for the Anon Y case, and based on not a big sample space, the power cost of this stuff was about 1/2 a watt.
Comment 84 Doug Smythies 2016-04-02 15:05:36 UTC
Readers: Please try kernel 4.6-rc1 (or more recent, once available), and report back. (use intel_pstate CPU frequency scaling driver and the "powersave" frequency governor.)
Comment 85 da_audiophile 2016-04-11 19:44:47 UTC
@Doug - Using my i7-4790K for testing, I compared idle frequencies under 4.6-rc3 and 4.5.0 and found that 4.6-rc3 has a much higher median frequency and higher idle temps than 4.5.0.

Script I used to generate the statistics: https://gist.github.com/graysky2/7d0447e04dcf2ac638cb

% stats.sh ~/4.5.0.txt
median : 1004.181763
mean   : 1324.87
min    : 787.353760
max    : 4403.496582
count  : 327

% stats.sh ~/4.6rc3.txt 
median : 4153.16
mean   : 3969.87
min    : 968.414856
max    : 4682.961914
count  : 452
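
The linked script is not reproduced here. As a rough stand-in, a Python sketch that computes the same five summary figures from a list of frequency samples; the sample values are invented and the function name is hypothetical:

```python
import statistics

def summarize(freqs_mhz):
    """Return median, mean, min, max and count for a list of MHz samples."""
    return {
        "median": statistics.median(freqs_mhz),
        "mean": statistics.mean(freqs_mhz),
        "min": min(freqs_mhz),
        "max": max(freqs_mhz),
        "count": len(freqs_mhz),
    }

# Made-up samples, not data from this report.
for name, value in summarize([800.0, 900.0, 1000.0, 4400.0]).items():
    print(f"{name:<7}: {value}")
```

A single high sample moves the mean far more than the median, which is one reason to look at both figures.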
Comment 86 Doug Smythies 2016-04-11 21:37:13 UTC
@John (da_audiophile@yahoo.com):

Thanks very much (oh, and nice to hear from you after so long). Your results are what was expected. For reference see bug 115771.

If you are willing (please be willing), please try the patch in the bug report (version 5 from comment 106). Alternatively, you could wait for kernel 4.6-rc4, as I think that patch will be included (not sure yet). However, it would be much better if you could try now, as in addition to just a test, I am wondering if the current threshold will be good enough for the sufferers of this thread's issue.

... Doug
Comment 87 da_audiophile 2016-04-11 21:59:18 UTC
OK.  I patched rc3 per your suggestion.  The results here are in line with those acquired under 4.5.0; however, as I was watching i7z collect the data, I noticed that the chip stayed in C1 state for much longer than under 4.5.0 just sitting idle under X.  Sometimes a single core would be in C1 for a high percentage; other times, all 4 would be for some small percentage. 

% stats.sh cpu_freq_log.txt
median : 1081.34
mean   : 1214.47
min    : 775.959229
max    : 4355.350098
count  : 296
Comment 88 Doug Smythies 2016-04-11 22:45:39 UTC
@John: Thanks very much. Your findings are consistent with bug 115771. If it were just up to me, I think I might have the threshold a little higher, but it seems to be good enough.
Comment 89 Doug Smythies 2016-04-12 05:41:24 UTC
@John: if you are willing:

On the kernel you used in comment 87, please run "turbostat --debug".
Let it run for about 10 intervals.
Attach the output back here.

Note 1: Please use a very recent turbostat, so that IRQ information will be acquired. I am using "turbostat version 4.11 27 Feb 2016".

Note 2: Please do not run i7z while acquiring the test data. Otherwise have your system the same as for the comment 87 test.
Comment 90 da_audiophile 2016-04-12 18:53:25 UTC
Created attachment 212541 [details]
turbostat under 4.5.0
Comment 91 da_audiophile 2016-04-12 18:54:14 UTC
Created attachment 212551 [details]
turbostat under 4.6rc3
Comment 92 da_audiophile 2016-04-12 18:54:47 UTC
@Doug -  Attached in #90 and #91.  What do you make of these data?
Comment 93 da_audiophile 2016-04-12 19:09:41 UTC
Created attachment 212561 [details]
histogram comparing percent time in c7 sleep while idle
Comment 94 da_audiophile 2016-04-12 19:10:07 UTC
If I plot the distribution of %C7 state (attachment 212561 [details]) I can see the same effect as I reported from the i7z dataset: the 4.6rc3 kernel spends less idle time in C7 state than the 4.5.0 kernel does.
Comment 95 da_audiophile 2016-04-12 19:16:16 UTC
Created attachment 212571 [details]
mean values based on turbostat
Comment 96 da_audiophile 2016-04-12 19:18:08 UTC
You'll have to give me a direction with regard to understanding the output of these.  Attachment 212571 [details] just provides the mean values for each parameter for each kernel.  Again, 4.6rc3 spends less time in C7 and more in C1 (seemingly less power efficient).  If I understand the PkgWatt column, 4.6rc3 is consuming more power than 4.5.0 so it's all consistent.
Comment 97 Doug Smythies 2016-04-14 03:40:34 UTC
@John: Thanks very much for the turbostat data you provided. It is more or less what was expected. I think, as also mentioned in comment 42, that running i7z affects the system we are attempting to measure. The power difference is a concern.
Comment 98 Doug Smythies 2016-04-20 22:13:29 UTC
The patch that I said in comment 86 would be in kernel 4.6-rc4 isn't in it. Maybe it will be in kernel 4.6-rc5.

@John: I wonder if we could look at energy again, over a longer sample time. I am thinking something like:

sudo turbostat -J --debug sleep 2000

with the system left alone for the 2000 seconds (33 minutes and 20 seconds).

Recall that, in general, sufferers of the issues on this thread also seemed to consume, on average, about an extra 1/2 watt of power. It is not clear, to me at least, what is going on with energy from John's short samples.
Comment 99 da_audiophile 2016-04-21 13:38:47 UTC
Created attachment 213531 [details]
turbustat under 4.5.2
Comment 100 da_audiophile 2016-04-21 13:39:16 UTC
Created attachment 213541 [details]
turbostat under 4.6rc4 using Doug's patch
Comment 101 da_audiophile 2016-04-21 13:39:31 UTC
@Doug - Attached per your request.
Comment 102 Doug Smythies 2016-04-21 21:06:26 UTC
(In reply to da_audiophile from comment #100)
> Created attachment 213541 [details]
> turbostat under 4.6rc4 using Doug's patch

You mean Rafael's patch version 5, right? (from bug 115771)

Anyway, 2% energy increase, or 31 milliwatts, between 4.5.2 and 4.6rc4+patch at idle. O.K.
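
The arithmetic behind figures like these is simple enough to sketch. With -J, turbostat reports energy in joules, so average power is energy divided by elapsed time. The helper names below are hypothetical, and the only numbers taken from this thread are the 2% and 31 milliwatts quoted above; if both refer to the same quantity, they imply a baseline of roughly 1.55 W at idle on 4.5.2.

```python
def average_watts(joules, seconds):
    """Average power over an interval: energy divided by elapsed time."""
    return joules / seconds

def implied_baseline_watts(delta_watts, delta_fraction):
    """Baseline power implied by an absolute and a relative increase."""
    return delta_watts / delta_fraction

# 31 mW being 2% of the baseline is the only input taken from the thread.
baseline = implied_baseline_watts(0.031, 0.02)
print(f"implied idle baseline: {baseline:.2f} W")
```

The same `average_watts` helper applies to a 2000 second turbostat run: divide the reported joules by 2000.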
Comment 103 da_audiophile 2016-04-21 22:16:42 UTC
@Doug - It's the same patch you mentioned from comment #86 that modifies drivers/cpufreq/intel_pstate.c adding 4 lines.  Also, both kernels on my system are using the desktop 1000 HZ tickrate if that matters.
Comment 104 Doug Smythies 2016-04-21 23:03:51 UTC
(In reply to da_audiophile from comment #103)
> @Doug - It's the same patch you mentioned from comment #86 that modifies
> drivers/cpufreq/intel_pstate.c adding 4 lines.  Also, both kernels on my
> system are using the desktop 1000 HZ tickrate if that matters.

Yes, that is the correct patch.
1000Hz is fine. Originally we found that 300Hz on an Arch system seemed to exemplify the issue best, but I wouldn't re-do the work.
Comment 105 The Troll 2016-04-22 16:15:28 UTC
(In reply to Doug Smythies from comment #84)
> Readers: Please try kernel 4.6-rc1 (or more recent, once available), and
> report back. (use intel_pstate CPU frequency scaling driver and the
> "powersave" frequency governor.)

Hi Doug,

For many reasons, I need to stay on Linux kernel 4.1.17.
Do you think the related changes from 4.6 could be easily backported there?

Thanks
Comment 106 Doug Smythies 2016-04-22 21:32:16 UTC
(In reply to The Troll from comment #105)
> (In reply to Doug Smythies from comment #84)
> > Readers: Please try kernel 4.6-rc1 (or more recent, once available), and
> > report back. (use intel_pstate CPU frequency scaling driver and the
> > "powersave" frequency governor.)
> 
> Hi Doug,
> 
> For many reasons, I need to stay on Linux kernel 4.1.17.
> Do you think the related changes from 4.6 could be easily backported there?
> 
> Thanks

That was a trick question. The expectation was that the results would be bad, and was later confirmed by John's tests.

Anyway, Rafael's patch, that at least partially addresses the issue, will (for sure) be in kernel 4.6-rc5. As to how and when that might be backported to a series 4.1 kernel, I don't know. It is a trivial patch, you could add it yourself.

However, note that it was the change from timers to utilization in kernel 4.6 that significantly exacerbated the original issue of this thread, making the root issue much more obvious. You may or may not notice significant changes on a 4.1 series kernel.
Comment 107 The Troll 2016-04-22 21:38:32 UTC
Hi,

thanks for getting back so quickly!
The patch from 

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ffb810563c0c049872a504978e06c8892104fb6c

uses cpu->sample.tsc but tsc is not a member of sample in 4.1...

Is there a way to work around that? Or do we need a larger backport?

thx
Comment 108 Doug Smythies 2016-04-22 23:10:35 UTC
(In reply to The Troll from comment #107)
> Hi,
> 
> thanks for getting back so quickly!
> The patch from 
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> ?id=ffb810563c0c049872a504978e06c8892104fb6c
> 
> uses cpu->sample.tsc but tsc is not a member of sample in 4.1...
> 
> Is there a way to work around that? Or do we need a larger backport?
> 
> thx

Oh! I see the tsc sample stuff was only added as of kernel 4.2-rc1 (see commit 4055fad34086dcf5229c43846e0a3cf0fb3692e3). I am not an expert on backporting and such, and don't know what to recommend.
Comment 109 The Troll 2016-04-23 06:45:50 UTC
Hi,

For now, I switched back to acpi_cpufreq. At least it throttles the CPU clock :)
However:
- it doesn't go over 4 GHz (instead of 4.4 GHz)
- it has no real impact on CPU temp
Comment 110 Arup 2016-05-19 13:47:34 UTC
I have Arch installed on my Intel Haswell i7-4790 (non-K edition). Under Arch it has always run near turbo and hotter than under the Ubuntu flavors, LTS and non-LTS, and this has been the case since the day Arch introduced intel_pstate in its kernel. Initially, after testing by Colin King of Canonical found higher power consumption, Ubuntu kept intel_pstate off until its 3.19 kernel. Thankfully, the temps and frequency scaled for Haswell.

A current, interesting observation on this: I recently installed Ubuntu 16.04 with kernel 4.4, and the frequencies of my Haswell now scale down far less than with the 4.2 kernel in Ubuntu 14.04.4 LTS, where some cores would touch close to 800 MHz at idle. Even so, temps and voltage are well under control compared to my Arch installation, which runs 3-4 degrees higher with a steady 1.13 V Vcore at idle versus Ubuntu's 0.88 V.
Comment 111 Doug Gale 2016-09-28 04:16:45 UTC
I have the same problem on my "Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz" (which turbos up to 3.5GHz almost constantly). It's a gaming laptop. I fear this is prematurely wearing it out. The thought of Windows doing this better is frightening. I have been using Ubuntu full time for years now.

The following commands seemed to cause my CPU to throttle back a lot more often:

$ sudo -Hi
# echo 95 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
# echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
# exit
$

It still stays throttled up too much, but I've seen it throttle all the way down to 800MHz, and it throttles down quite a bit more often. I had never seen that before. I know that this shouldn't make any difference. It's hard to tell whether this is my imagination or it really did something. Immediately after the write, it seems to really throttle back properly for a while. Worth a try. Perhaps something touched by the code path that handles those writes provokes it to throttle properly again. With luck this might lead to a fix.
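
To check what the driver limits actually are before and after writes like these, the sysfs files can be read back. The following is a minimal sketch, assuming the intel_pstate driver is active; the function name `read_pstate_limits` is hypothetical, and reading (unlike writing) does not need root.

```python
from pathlib import Path

PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")

def read_pstate_limits(base=PSTATE_DIR):
    """Return the integer values of the intel_pstate limit files under base.

    Files that are missing (for example when another scaling driver is
    in use) are skipped rather than treated as errors.
    """
    limits = {}
    for name in ("min_perf_pct", "max_perf_pct", "no_turbo"):
        path = Path(base) / name
        if path.is_file():
            limits[name] = int(path.read_text().strip())
    return limits
```

Calling `read_pstate_limits()` right after the two echo commands should show max_perf_pct back at 100, which would confirm that the limits themselves were left unchanged by the round trip.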

It has been months since any news on this bug. :(
Comment 112 Doug Smythies 2016-09-28 15:48:57 UTC
@Doug Gale: From your description, I am having a bit of trouble understanding your exact scenario. It sounds as though you have significant load all the time on your computer and you have thermal throttling issues, which would be beyond this bug report's scope. I'd be happy to post process and analyze a trace from your computer, if you want (see comment #31 and comment #43, and do it on an otherwise idle system). The root issue of this bug report would not lead to premature wear out of your computer, because the bottom line is not much extra overall power consumption (the extra active power is offset by extra sleep time).

As for progress on this issue: There was a RFC/RCT (Request For Change / Request For Test) for a patch set that would solve the root issue of this bug report (the clock modulation issue also), as part of the kernel 4.8rc series. I think there are 3 of us working on it, and a great deal of effort has been expended on it. However, there are issues, and any particular timeline remains unclear.

For anyone following this: I never specifically stated, but hope it is now clear to all: 

When I mentioned things like "its as though CPUs that have gone into the C7 state are still casting their target p_state vote into the processor PLL decision stuff, whereas I thought those CPU were supposed to lose their vote"

It turns out what was really happening is that there were several extremely short interrupts on those CPUs, such that their vote was still correctly included in the PLL decisions.
Comment 113 Doug Gale 2016-09-29 14:09:10 UTC
@Doug Smythies. I'm pretty sure you are correct.

I did some further digging, using `sudo perf top -g` and found that an excessive amount of CPU time was being spent reading the HPET coming from XOrg on an otherwise completely idle system sitting at the desktop (I forget the exact kernel symbol, but it was obviously HPET access).

Rebooting resolved it. I haven't seen my CPU sitting at 3.5GHz nonstop at all since yesterday. It hovers around 2.5 to 2.8GHz when idle.

When operating normally, the top function in perf top is about 3% in _raw_spin_lock_irqsave, which seems reasonable. I am using the non-free nvidia driver from ubuntu repository, nvidia-352-updates-dev.

Sorry for posting misleading information.
Comment 114 Zhang Rui 2016-12-22 01:13:42 UTC
Hi, Doug
(In reply to Doug Gale from comment #113)
> @Doug Smythies. I'm pretty sure you are correct.
> 
> I did some further digging, using `sudo perf top -g` and found that an
> excessive amount of CPU time was being spent reading the HPET coming from
> XOrg on an otherwise completely idle system sitting at the desktop (I forget
> the exact kernel symbol, but it was obviously HPET access).
> 
I have not read the whole thread.
So the real problem is that bogus HPET reads from Xorg result in a high CPU frequency, right?

We have a couple of different bug reporters in this thread, and I'm not sure they all have exactly the same problem.

da_audiophile@yahoo.com
please confirm if the problem still exists in the latest upstream kernel.

Doug, please make sure you're seeing the same problem as da_audiophile@yahoo.com; if not, please open a new bug report.
Comment 115 Doug Gale 2016-12-22 07:04:57 UTC
(In reply to Zhang Rui from comment #114)
> Doug, please make sure you're seeing the same problem as
> da_audiophile@yahoo.com; if not, please open a new bug report.

I thought it was the same issue but later figured out that XOrg was at fault and the system was (correctly) keeping the CPU throttled up. Sorry for adding noise.
Comment 116 Doug Smythies 2016-12-29 07:35:07 UTC
(In reply to Zhang Rui from comment #114)

> da_audiophile@yahoo.com
> please confirm if the problem still exists in the latest upstream kernel.
> 

@Zhang: The problem still exists in the latest kernel.
However, and since my computer doesn't really suffer from the problem (I can sort of create it), I do not have good numbers for how bad the problem is. It never really was very expensive in terms of extra power consumption, at least from the data I got from others over the couple of years of this saga.

See also bug 115771, where the issue is partially corrected by the patch from that bug report.
Comment 117 Chen Yu 2017-04-12 05:56:40 UTC
For the original report from da_audiophile@yahoo.com, I found a related commit 

commit ffb810563c0c049872a504978e06c8892104fb6c 

intel_pstate: Avoid getting stuck in high P-states when idle
Which is described here:
https://bugzilla.kernel.org/show_bug.cgi?id=115771
I wonder whether this commit has fixed the problem, or whether we are talking about other issues.
Comment 118 Chen Yu 2017-04-12 06:00:45 UTC
(In reply to Doug Smythies from comment #116)
> (In reply to Zhang Rui from comment #114)
> 
> > da_audiophile@yahoo.com
> > please confirm if the problem still exists in the latest upstream kernel.
> > 
> 
> @Zhang: The problem still exists in the latest kernel.
> However, and since my computer doesn't really suffer from the problem (I can
> sort of create it), I do not have good numbers for how bad the problem is.
> It never really was very expensive in terms of extra power consumption, at
> least from the data I got from others over the couple of years of this saga.
> 
> See also bug 115771, where the issue is partially corrected by the patch
> from that bug report.
Ah, yes, it should be this one. I wonder whether we see this issue on the same platform that da_audiophile@yahoo.com reported; if not, I think we can open a new bug report for it, because as the thread gets longer and longer I sometimes lose track of the clues :(.
Comment 119 Chen Yu 2017-04-28 07:26:50 UTC
@da_audiophile@yahoo.com  are you still looking at this thread?
Comment 120 da_audiophile 2017-04-28 19:30:53 UTC
@Chen - Sorry, it fell off my radar.  I will review and post if needed shortly.
Comment 121 da_audiophile 2017-04-28 20:18:28 UTC
Yes, the problem still exists for this hardware under the 4.10.13 kernel.  I am measuring the CPU frequency using this bash script[1] that reads once per second from /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq and writes it out to a file.  It then plots a nice histogram and some stats of the data.

I sampled the CPU frequency under 2 conditions:
1) With X running (lxdm and xfce4 sitting at the desktop otherwise idle).
2) Without X running, just sitting at the prompt idle.

I still see a significantly higher median idle frequency under X:
Logged into X, median frequency is 3,948 MHz
Logged into a TTY, median frequency is 2,500 MHz

Would others please test in a like fashion (script linked below) and post the results here?  Again, my hardware is technically a Haswell Refresh, if that matters. As noted in comment #7, doing the same two experiments (under X and from a TTY) on an i3-4130T (Haswell) results in nearly identical median values for CPU frequency (median 813 MHz under X and median 800 MHz under a TTY).

Full results for the Haswell Refresh CPU first under X logged in with lxdm at idle:
# NumSamples = 120; Min = 799.80; Max = 4401.30
# Mean = 3246.355000; Variance = 1653274.818975; SD = 1285.797348; Median 3948.050000
# each ∎ represents a count of 1
  799.8000 -  1159.9500 [    17]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (14.17%)
 1159.9500 -  1520.1000 [     5]: ∎∎∎∎∎ (4.17%)
 1520.1000 -  1880.2500 [     1]: ∎ (0.83%)
 1880.2500 -  2240.4000 [     3]: ∎∎∎ (2.50%)
 2240.4000 -  2600.5500 [    10]: ∎∎∎∎∎∎∎∎∎∎ (8.33%)
 2600.5500 -  2960.7000 [    11]: ∎∎∎∎∎∎∎∎∎∎∎ (9.17%)
 2960.7000 -  3320.8500 [     7]: ∎∎∎∎∎∎∎ (5.83%)
 3320.8500 -  3681.0000 [     2]: ∎∎ (1.67%)
 3681.0000 -  4041.1500 [     6]: ∎∎∎∎∎∎ (5.00%)
 4041.1500 -  4401.3000 [    58]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (48.33%)

Results logged in to a TTY at idle (no X running):
# NumSamples = 120; Min = 799.80; Max = 4405.70
# Mean = 2521.885833; Variance = 1611401.717383; SD = 1269.409988; Median 2500.000000
# each ∎ represents a count of 1
  799.8000 -  1160.3900 [    25]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (20.83%)
 1160.3900 -  1520.9800 [    11]: ∎∎∎∎∎∎∎∎∎∎∎ (9.17%)
 1520.9800 -  1881.5700 [    10]: ∎∎∎∎∎∎∎∎∎∎ (8.33%)
 1881.5700 -  2242.1600 [     4]: ∎∎∎∎ (3.33%)
 2242.1600 -  2602.7500 [    22]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (18.33%)
 2602.7500 -  2963.3400 [     8]: ∎∎∎∎∎∎∎∎ (6.67%)
 2963.3400 -  3323.9300 [     5]: ∎∎∎∎∎ (4.17%)
 3323.9300 -  3684.5200 [     7]: ∎∎∎∎∎∎∎ (5.83%)
 3684.5200 -  4045.1100 [     1]: ∎ (0.83%)
 4045.1100 -  4405.7000 [    27]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (22.50%)

1. https://github.com/graysky2/bin/blob/master/cpufreq_histogram.sh
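
The measurement method described above can be sketched in Python for anyone who prefers not to run the bash script. This is an illustrative approximation, not the linked cpufreq_histogram.sh: the function names are hypothetical, and the ten equal-width buckets between the minimum and maximum are an assumption chosen to match the bucket edges shown in the output above.

```python
import time
from pathlib import Path

CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")

def sample_mhz(count, interval=1.0, path=CUR_FREQ):
    """Read the current frequency `count` times, `interval` seconds apart."""
    samples = []
    for i in range(count):
        # scaling_cur_freq reports kHz; convert to MHz
        samples.append(int(Path(path).read_text()) / 1000.0)
        if i < count - 1:
            time.sleep(interval)
    return samples

def histogram(values, buckets=10):
    """Count values in equal-width buckets spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / buckets or 1.0  # avoid a zero width if all values match
    counts = [0] * buckets
    for value in values:
        # the maximum value lands on the top edge, so clamp it into the last bucket
        index = min(int((value - low) / width), buckets - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(buckets)]
```

For example, `histogram(sample_mhz(120))` gives 120 one-second samples grouped into ten buckets. Note that this only looks at cpu0, as the description above does.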
Comment 122 Doug Smythies 2017-04-29 01:54:40 UTC
Hi John,

Your data seems a bit inconsistent with the turbostat stuff you posted (turbostat-4.5.2-ref.log and turbostat-4.6rc4.log, and even the two before those). We can never just look at CPU frequencies in isolation, we also need to look at load.

The use of the load based code path within the intel_pstate driver will be greatly expanded as of kernel 4.12-rc1 (in a few weeks) so this whole saga might become academic.

Anyway, for my main test computer (i7-2600K, min pstate = 16, max pstate = 38, kernel 4.11-rc7) the data using your method is:

$ ./cpufreq_histogram.sh 300
Collecting data for 300 seconds...
# NumSamples = 300; Min = 1500.00; Max = 3900.00
# Mean = 1599.786333; Variance = 0.261980; SD = 0.511840; Median 1599.700000
# each ∎ represents a count of 3
 1500.0000 -  1600.0000 [   296]: ∎∎∎∎...∎∎∎∎ (98.67%)
 1600.0000 -  1700.0000 [     4]: ∎ (1.33%)
 1700.0000 -  1800.0000 [     0]:  (0.00%)
 1800.0000 -  1900.0000 [     0]:  (0.00%)
 1900.0000 -  2000.0000 [     0]:  (0.00%)
 2000.0000 -  2100.0000 [     0]:  (0.00%)
 2100.0000 -  2200.0000 [     0]:  (0.00%)
 2200.0000 -  2300.0000 [     0]:  (0.00%)
 2300.0000 -  2400.0000 [     0]:  (0.00%)
 2400.0000 -  2500.0000 [     0]:  (0.00%)
 2500.0000 -  2600.0000 [     0]:  (0.00%)
 2600.0000 -  2700.0000 [     0]:  (0.00%)
 2700.0000 -  2800.0000 [     0]:  (0.00%)
 2800.0000 -  2900.0000 [     0]:  (0.00%)
 2900.0000 -  3000.0000 [     0]:  (0.00%)
 3000.0000 -  3100.0000 [     0]:  (0.00%)
 3100.0000 -  3200.0000 [     0]:  (0.00%)
 3200.0000 -  3300.0000 [     0]:  (0.00%)
 3300.0000 -  3400.0000 [     0]:  (0.00%)
 3400.0000 -  3500.0000 [     0]:  (0.00%)
 3500.0000 -  3600.0000 [     0]:  (0.00%)
 3600.0000 -  3700.0000 [     0]:  (0.00%)
 3700.0000 -  3800.0000 [     0]:  (0.00%)
 3800.0000 -  3900.0000 [     0]:  (0.00%)
Comment 123 da_audiophile 2017-04-29 10:34:49 UTC
@Doug - Yes, Sandy Bridge (i7-2600K) seems unaffected.
Comment 124 Doug Smythies 2017-04-29 15:51:33 UTC
Hi John,

I think the difference between our two results has more to do with differences in "idle" than the processor itself. When we communicated via e-mail on November 13th, I said a trace was needed to really determine what was going on on your system. That is still true, however the preferred way to acquire and post process trace data has changed. From your previous turbostat postings (the ones where IRQ totals were working), your "idle" system seems to have about twice as many IRQs per second as mine.

Moving forward, I would suggest to wait for, and then try, kernel 4.12-rc1. Then your system will likely, but not for certain, use the load based code path through the intel_pstate driver. (At least I think that is the plan.)
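
For comparing how "idle" two systems really are without a full trace, one rough option is to diff two snapshots of /proc/interrupts. The sketch below is an assumption about how such a comparison could be done, not the method used for the turbostat numbers mentioned above; the function names are hypothetical.

```python
def total_interrupts(text):
    """Sum every per-CPU counter in the contents of /proc/interrupts."""
    lines = text.splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        fields = line.split()[1:]  # drop the "NN:" or "NMI:" label
        for field in fields[:cpu_count]:
            if not field.isdigit():
                break  # reached the controller/description columns
            total += int(field)
    return total

def irqs_per_second(before, after, seconds):
    """Interrupt rate between two snapshots taken `seconds` apart."""
    return (total_interrupts(after) - total_interrupts(before)) / seconds
```

Reading the file twice, some seconds apart, and passing both strings to `irqs_per_second` gives a system-wide rate that can be compared between machines.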
Comment 125 da_audiophile 2017-04-29 22:41:55 UTC
I can't explain this result as it is inconsistent with my testing under the older kernel versions, but with 4.10.13, if I drop the tickrate from 1000 to 300, I get much reduced idle frequencies:

# NumSamples = 180; Min = 799.80; Max = 4401.60
# Mean = 1889.847222; Variance = 1381154.325937; SD = 1175.225223; Median 1326.400000
# each ∎ represents a count of 1
  799.8000 -  1159.9800 [    73]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎...∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (40.56%)
 1159.9800 -  1520.1600 [    26]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (14.44%)
 1520.1600 -  1880.3400 [    16]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.89%)
 1880.3400 -  2240.5200 [    10]: ∎∎∎∎∎∎∎∎∎∎ (5.56%)
 2240.5200 -  2600.7000 [    11]: ∎∎∎∎∎∎∎∎∎∎∎ (6.11%)
 2600.7000 -  2960.8800 [     6]: ∎∎∎∎∎∎ (3.33%)
 2960.8800 -  3321.0600 [     7]: ∎∎∎∎∎∎∎ (3.89%)
 3321.0600 -  3681.2400 [     9]: ∎∎∎∎∎∎∎∎∎ (5.00%)
 3681.2400 -  4041.4200 [     4]: ∎∎∎∎ (2.22%)
 4041.4200 -  4401.6000 [    18]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (10.00%)
Comment 126 Chen Yu 2017-09-18 12:06:50 UTC
It looks like the issue was due to too many periodic ticks during 'idle'? I also wonder whether the system is really idle (if it is, there should be no big frequency increases anymore, because intel_pstate is now controlled by the utilization provided by the scheduler). I think the intel_pstate driver has also changed its frequency prediction recently (it dropped the PID algorithm), so I suggest checking the latest upstream 4.14-rc1, using the latest turbostat with -i 10, to see if the problem still exists.
thanks.
Comment 127 Chen Yu 2017-12-11 08:41:36 UTC
I'm closing this as there's no response for some time. Please feel free to reopen it if you'd like me to further track this issue.