Bug 49231 - Single CPU bound process results in non-optimal turbo boost configuration
Summary: Single CPU bound process results in non-optimal turbo boost configuration
Status: CLOSED INVALID
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: All Linux
Importance: P1 normal
Assignee: cpufreq
Reported: 2012-10-21 23:56 UTC by Roger Scott
Modified: 2013-01-28 23:57 UTC

Kernel Version: 3.4.4+
Regression: No

Attachments
3.4.13 config file (80.07 KB, text/plain)
2012-10-21 23:56 UTC, Roger Scott

Description Roger Scott 2012-10-21 23:56:46 UTC
Created attachment 84251
3.4.13 config file

Firstly I hope I've classified this issue correctly as it's a bit difficult to nail down exactly what's at fault.  

I have a new i7-3930k CPU on an Intel DX79SR motherboard which doesn't appear to be handling the turbo boost function correctly under Linux.  This is a 6 core CPU which is nominally 3.2GHz but can boost itself to somewhere between 3.5GHz and 3.8GHz, depending on how many cores are active.  With 6 processes running it'll run at 3.5GHz which is fine.  The problem is that when only a single process is active it'll still only run at 3.5GHz despite being able to go to 3.8GHz.  It appears that a single job pulls all the cores out of C7 state and into at least C1 state, and once there they are all considered active, so the fully utilised core is unable to go beyond 3.5GHz.

I am unsure if the problem is with the scheduler, ACPI, or something in my configuration.  I can confirm that a 3.4.4 kernel configuration which works correctly on a Xeon X5660 doesn't work correctly with the i7-3930k.  I've tried various kernels between 3.4.4 and 3.6.2 but all result in the same behaviour.  With gritted teeth I installed Windows 7 and it does appear to work correctly there so I don't believe it's anything to do with BIOS settings.

Idle turbostat results:
core CPU   %c0   GHz  TSC   %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6   %pc7
 
           2.73 1.20 3.20   7.06   0.00   0.02  90.19  20.20   0.00  68.20  0.00
   0   0   2.68 1.20 3.20   6.92   0.00   0.00  90.40  20.20   0.00  68.20  0.00
   1   1   2.75 1.20 3.20   6.92   0.00   0.00  90.33  20.20   0.00  68.20  0.00
   2   2   2.78 1.20 3.20   6.99   0.00   0.00  90.23  20.20   0.00  68.20  0.00
   3   3   2.71 1.20 3.20   7.10   0.00   0.00  90.19  20.20   0.00  68.20  0.00
   4   4   2.71 1.20 3.20   7.14   0.00   0.07  90.07  20.20   0.00  68.20  0.00
   5   5   2.76 1.20 3.20   7.27   0.00   0.07  89.90  20.20   0.00  68.20  0.00

Single job turbostat results:
core CPU   %c0   GHz  TSC   %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6   %pc7
 
          18.46 3.50 3.20  81.54   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   0   0   2.02 3.50 3.20  97.98   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   1   1   2.19 3.50 3.20  97.81   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   2   2   2.20 3.50 3.20  97.80   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   3   3   2.17 3.50 3.20  97.83   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   4   4 100.00 3.50 3.20   0.00   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   5   5   2.18 3.50 3.20  97.82   0.00   0.00   0.00   0.00   0.00   0.00  0.00
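
For anyone wanting to check a turbostat capture for the same pattern, here is a minimal Python sketch. It assumes the column layout shown above; the function names and the 50% busy threshold are made up for illustration and are not part of turbostat. It lists the CPUs that are mostly idle yet show no C3/C6/C7 residency at all.

```python
def parse_turbostat(text):
    """Parse per-CPU rows of turbostat output into dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        # The summary row lacks the leading "core" and "CPU" fields; skip it.
        if len(fields) != len(header):
            continue
        row = {"core": int(fields[0]), "CPU": int(fields[1])}
        row.update({k: float(v) for k, v in zip(header[2:], fields[2:])})
        rows.append(row)
    return rows


def shallow_idle_cpus(rows, busy_threshold=50.0):
    """Return CPUs that are mostly idle yet never reach C3, C6 or C7."""
    return [
        r["CPU"]
        for r in rows
        if r["%c0"] < busy_threshold and r["%c3"] + r["%c6"] + r["%c7"] == 0.0
    ]


# Three per-CPU rows (plus the summary row) copied from the single job run above.
SAMPLE = """\
core CPU   %c0   GHz  TSC   %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6   %pc7
          18.46 3.50 3.20  81.54   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   0   0   2.02 3.50 3.20  97.98   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   4   4 100.00 3.50 3.20   0.00   0.00   0.00   0.00   0.00   0.00   0.00  0.00
   5   5   2.18 3.50 3.20  97.82   0.00   0.00   0.00   0.00   0.00   0.00  0.00
"""

rows = parse_turbostat(SAMPLE)
print(shallow_idle_cpus(rows))
```

On the sample rows this flags CPUs 0 and 5, i.e. the idle cores that sit in C1 instead of C7, while the busy CPU 4 is left out.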
Comment 1 Thomas Renninger 2012-10-22 13:44:44 UTC
Can you use "cpupower monitor" (tools/power/cpupower) instead of turbostat?
Then you will also see what idle states the kernel requested.
Does this single job (or something else) produce interrupts frequently (maybe powertop helps in this respect)?
This would explain why C1 and no deeper sleep states are entered.
If properly configured, "cpupower top" should invoke the powertop tool.
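
If powertop is not at hand, the per-CPU local timer interrupt rate can also be estimated from two snapshots of /proc/interrupts. The Python sketch below is only an illustration under that assumption: the helper names are hypothetical and the sample numbers are invented, not taken from this system.

```python
def timer_interrupts(text):
    """Return per-CPU local timer interrupt counts from /proc/interrupts text."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for ln in lines[1:]:
        fields = ln.split()
        if fields and fields[0] == "LOC:":
            return [int(v) for v in fields[1:1 + ncpus]]
    raise ValueError("no LOC: line found")


def ticks_per_second(before, after, seconds):
    """Per-CPU timer interrupt rate between two snapshots."""
    return [(b - a) / seconds for a, b in zip(before, after)]


# Invented two-CPU snapshots taken one second apart (illustration only).
BEFORE = """\
           CPU0       CPU1
  0:         40          0   IO-APIC-edge      timer
LOC:       1000       2000   Local timer interrupts
"""
AFTER = """\
           CPU0       CPU1
  0:         40          0   IO-APIC-edge      timer
LOC:       2000       3000   Local timer interrupts
"""

print(ticks_per_second(timer_interrupts(BEFORE), timer_interrupts(AFTER), 1.0))
```

On a real system the two snapshots would come from reading /proc/interrupts twice with a known delay in between. A rate close to the kernel tick rate on otherwise idle CPUs would point at the periodic tick.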
Comment 2 Thomas Renninger 2012-10-22 13:57:09 UTC
Another idea:
There is an Intel specific CPU configuration register (perf-bias) which can be set to values between 0-15.
It tells the CPU to favour either energy savings or performance (not much more documentation exists afaik).
You can read or set this register via:
cpupower set -b X
cpupower info -b
Maybe this changes C-state entering behavior?
Comment 3 Roger Scott 2012-10-22 23:02:24 UTC
Hi Thomas,

Thanks for the ideas.  I'll do some more testing when I get home this evening (Timezone=UTC+10).  The problem was showing up with jobs which were possibly hammering the interrupts so I used the simplest test I could think of which was a never-ending-do-nothing-for-loop written in C.  Unfortunately (or fortunately depending on your perspective) this resulted in the same behaviour.
Comment 4 Len Brown 2012-10-23 02:09:34 UTC
Even the "idle" case doesn't look right -- it is 2.73% busy.
Try using top to find out what is running.

core CPU   %c0   GHz  TSC   %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6
           2.73 1.20 3.20   7.06   0.00   0.02  90.19  20.20   0.00  68.20
Comment 5 Thomas Renninger 2012-10-23 10:20:27 UTC
> so I used the simplest test I could think of which was a never-ending-
> do-nothing-for-loop written in C
JFI, I use:
cat /dev/zero >/dev/null &
to load one core at 100% CPU.
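
A bounded variant of the same idea, as a rough Python sketch: it spins for a fixed time and can optionally pin itself to one CPU. The function name and parameters are hypothetical, and the pinning only happens where os.sched_setaffinity is available (Linux).

```python
import os
import time


def burn(seconds, cpu=None):
    """Spin for `seconds` on one CPU; return the number of loop iterations."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # Linux-only: pin this process to `cpu`
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


# Short run for demonstration; use a longer time while watching the monitor.
print(burn(0.1) > 0)
```

Running it with, say, burn(60, cpu=4) in one terminal while cpupower monitor runs in another should reproduce the single job case without leaving a stray process behind.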
Comment 6 Roger Scott 2012-10-24 04:21:12 UTC
I've managed to do some more testing.  Output from cpupower-monitor:

    |Nehalem                    || SandyBridge        || Mperf              || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || POLL | C1-S | C3-S | C6-S | C7-S
   0|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||100.00|  0.00|  3433||  0.00|  0.00|  0.00|  0.00|  0.00
   1|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  2.22| 97.78|  3427||  0.00|  0.00|  0.00|  0.16| 97.86
   2|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  2.17| 97.83|  3425||  0.00|  0.00|  0.00|  0.31| 97.75
   3|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  3.71| 96.29|  3433||  0.00|  0.00|  0.04|  1.06| 95.43
   4|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  2.04| 97.96|  3427||  0.00|  0.00|  0.00|  0.00| 98.21
   5|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  4.56| 95.44|  3434||  0.00|  0.00|  0.00|  1.19| 94.50

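I'm guessing this shows that the kernel is requesting C7 for the idle cores (the C7-S column is around 95-98%) but that for some reason the hardware isn't actually entering it (the SandyBridge C7 column stays at 0.00).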

I fiddled around with the perf-bias register but it didn't seem to make any real difference.  It might have made the cores go from C1 to C7 quicker once the job was stopped but that's just my subjective opinion and I didn't do any timing tests.

When running powertop I was getting 1000 wakeups-from-idle per second (ie the kernel tick rate).  These were all from the swapper threads/processes, one per core.  So I thought I'd try running with the NO_HZ setting and interestingly the idle cores now stay in C7.  Output from cpupower monitor with NO_HZ set:

    |Nehalem                    || SandyBridge        || Mperf              || Idle_Stats
CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || POLL | C1-S | C3-S | C6-S | C7-S
   0|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00|| 99.81|  0.19|  3799||  0.00|  0.00|  0.00|  0.00|  0.00
   1|  1.17|  0.00|  0.00|  0.00|| 98.58|  0.00|  0.00||  0.11| 99.89|  3719||  0.00|  0.00|  0.00|  0.00| 99.88
   2|  0.03|  0.00|  0.00|  0.00|| 95.76|  0.00|  0.00||  3.98| 96.02|  3796||  0.00|  0.03|  0.00|  0.00| 95.93
   3|  1.12|  0.00|  0.00|  0.00|| 98.82|  0.00|  0.00||  0.03| 99.97|  3715||  0.00|  0.00|  0.00|  0.00| 99.96
   4|  0.01|  0.00|  0.00|  0.00|| 98.76|  0.00|  0.00||  0.05| 99.95|  3755||  0.00|  0.00|  0.00|  0.00| 99.94
   5|  0.00|  0.00|  0.00|  0.00|| 99.74|  0.00|  0.00||  0.21| 99.79|  3666||  0.00|  0.01|  0.00|  0.00| 99.75

Naturally the idle cores spend less time in C0 than on a ticked system, but my previous 2.7% is still less than the 4.5% on the Xeon, which manages to turbo boost itself properly.  Just for fun I ran a job which should have simulated 1000 wakes/sec (i.e. a for loop with usleep(1000)).  Interestingly, despite spending more time in C0 than originally, the cores still spent a reasonable amount of time in C7 and were still boosted beyond 3.5GHz.

    |Nehalem                    || SandyBridge        || Mperf              || Idle_Stats                       
CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || POLL | C1-S | C3-S | C6-S | C7-S 
   0|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00|| 98.43|  1.57|  3654||  0.00|  0.00|  0.00|  0.00|  0.00
   1|  1.36|  0.00|  0.00|  0.00|| 33.03|  0.00|  0.00||  3.78| 96.22|  3606||  0.00| 12.98| 22.92|  0.00| 60.35
   2|  8.65|  0.16|  0.00|  0.00|| 55.79|  0.00|  0.00||  4.23| 95.77|  3586||  0.00| 10.44|  8.28|  0.17| 76.83
   3|  4.42|  0.00|  0.00|  0.00|| 72.01|  0.00|  0.00||  4.47| 95.53|  3563||  0.00|  8.98|  2.78|  0.00| 83.75
   4|  1.47|  0.00|  0.00|  0.00|| 47.46|  0.00|  0.00||  3.21| 96.79|  3603||  0.00| 11.16| 14.78|  0.00| 70.87
   5|  6.18|  0.00|  0.00|  0.00|| 23.11|  0.00|  0.00||  2.15| 97.85|  3619||  0.00| 25.62| 11.99|  0.00| 60.30

I still think there's something a bit funny happening with the ticked system which inhibits idle cores from entering C7 while one of their siblings is busy, and I'd be grateful if someone who knows more could explain why.
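
For reference, the usleep(1000) style wakeup test described above can be approximated in Python as follows. This is only a sketch of the idea, not the C program that was actually run; the function name is made up, and the real wakeup rate will be somewhat below 1000/sec because each sleep takes at least the requested interval.

```python
import time


def periodic_wakeups(duration, interval=0.001):
    """Sleep `interval` seconds repeatedly for about `duration` seconds.

    Returns the number of wakeups that happened.
    """
    deadline = time.monotonic() + duration
    wakeups = 0
    while time.monotonic() < deadline:
        time.sleep(interval)
        wakeups += 1
    return wakeups


# Short run for demonstration; run for longer while watching cpupower monitor.
print(periodic_wakeups(0.1) >= 1)
```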
Comment 7 Thomas Renninger 2012-10-24 11:44:03 UTC
> So I thought I'd try running with the NO_HZ setting and interestingly
> the idle cores now stay in C7
Ok, a tickless timer configuration is a must for this processor to enter the deepest sleep states and thus enter boosting mode.
How/when processors enter deep sleep states (even if requested by the kernel) is very HW specific. That the two CPUs behave differently (one enters deeper sleep states without NO_HZ, probably not very efficiently, and the other does not) may be interesting, but it looks like it works as designed.
Hm, not sure whether to close this as invalid or documented -> going for invalid as nothing seems to be wrong.
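
For anyone hitting the same symptom, a quick way to check a kernel config for the tick settings is sketched below in Python. It assumes the option names used by kernels of this era (CONFIG_NO_HZ and CONFIG_HZ); the function name is hypothetical and the sample text is illustrative, not the attached 3.4.13 config.

```python
def tick_settings(config_text):
    """Extract tick-related options from kernel .config text."""
    wanted = ("CONFIG_NO_HZ", "CONFIG_HZ")
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(" is not set"):
            # Disabled options appear as "# CONFIG_FOO is not set".
            name = line[2:-len(" is not set")]
            value = "n"
        elif "=" in line and not line.startswith("#"):
            name, value = line.split("=", 1)
        else:
            continue
        if name in wanted:
            settings[name] = value
    return settings


# Illustrative fragment of a ticked 1000 Hz configuration.
SAMPLE_CONFIG = """\
# CONFIG_NO_HZ is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
"""

print(tick_settings(SAMPLE_CONFIG))
```

A result with CONFIG_NO_HZ reported as "n" corresponds to the ticked setup that kept the idle cores out of C7 in this report.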
Comment 8 Len Brown 2013-01-28 23:57:30 UTC
> nothing seems to be wrong.

Agreed.
Turbo depends on idle,
and idle depends on tickless.

If you tick 1000/sec on each thread, your system will simultaneously
have poor power and poor performance.  There have been several proposals
to remove the tickful option from the kernel, and sightings like this are why.
