Bug 36182 - Erroneous package power limit notification since kernel 2.6.39
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Thermal
Hardware: All Linux
Importance: P1 normal
Assignee: Fenghua Yu
URL:
Keywords:
Depends on:
Blocks: 32012
Reported: 2011-05-29 19:52 UTC by Maciej Rutecki
Modified: 2014-03-07 16:56 UTC

See Also:
Kernel Version: 3.3.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
[1/2] x86 thermal: Disable power limit notification interrupt by default (2.21 KB, patch)
2013-05-21 16:19 UTC, Len Brown
[PATCH 2/2] x86 thermal: Disable power limit notification interrupt (5.27 KB, patch)
2013-05-21 16:21 UTC, Len Brown

Description Maciej Rutecki 2011-05-29 19:52:35 UTC
Subject    : Erroneous package power limit notification since kernel 2.6.39
Submitter  : Olaf Freyer <aaron667@gmx.net>
Date       : 2011-05-22 13:01
Message-ID : 4DD9092A.4080507@gmx.net
References : http://marc.info/?l=linux-kernel&m=130606930631131&w=2

This entry is being used for tracking a regression from 2.6.38. Please don't
close it until the problem is fixed in the mainline.
Comment 1 Waldo 2011-06-30 19:36:31 UTC
I am still seeing power limit notifications in kernel 3.0rc5 (ubuntu) on a Thinkpad w520 (bios 1.24).  They look like this:

Jun 30 12:24:30 laptop kernel: [33700.013916] CPU6: Package power limit notification (total events = 5929)

Followed shortly thereafter with:

Jun 30 12:24:30 laptop kernel: [33700.014719] CPU6: Package power limit normal

On earlier kernels the error appeared as a hardware error ("THERMAL").  These errors show up when the CPU, in my case an i7-2820QM, starts to warm up, such as when compiling.  The temps never exceed 75C, which is well within safe specs.

Don't know if it's related, but I also see this:

Jun 30 12:03:42 laptop kernel: [32453.647044] thinkpad_acpi: THERMAL ALERT: unknown thermal alarm received
Jun 30 12:03:42 laptop kernel: [32453.647059] thinkpad_acpi: unhandled HKEY event 0x6040
Jun 30 12:03:42 laptop kernel: [32453.647066] thinkpad_acpi: please report the conditions when this event happened to ibm-acpi-devel@lists.sourceforge.net
Jun 30 12:03:42 laptop kernel: [32453.648022] thinkpad_acpi: EC reports that Thermal Table has changed

I may try to reproduce and send to the ibm list, but did want to confirm this issue persists into 3.0 linux as of now.
Comment 2 Rafael J. Wysocki 2011-07-10 20:19:55 UTC
On Sunday, July 10, 2011, Olaf Freyer wrote:
> This issue is still worth listing, as it isn't resolved yet.
> 
> Jesse Barnes already tracked down the cause to commit
> ccab5c82759e2ace74b2e84f82d1e0eedd932571 -
> but what to do now remains unclear to me.
> 
> With the patch reverted, I get down to 1 warning per workday -
> without reverting it I'm at 70-90k warnings per workday.
> 
> 
> On 10.07.2011 12:58, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.38 and 2.6.39.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.38 and 2.6.39.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=36182
> > Subject             : Erroneous package power limit notification since kernel 2.6.39
> > Submitter   : Olaf Freyer <aaron667@gmx.net>
> > Date                : 2011-05-22 13:01 (50 days old)
Comment 3 Rafael J. Wysocki 2011-07-10 20:20:28 UTC
First-Bad-Commit : ccab5c82759e2ace74b2e84f82d1e0eedd932571

commit ccab5c82759e2ace74b2e84f82d1e0eedd932571
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Tue Jan 18 15:49:25 2011 -0800

    drm/i915: tune Sandy Bridge DRPS constants
    
    These make us increase our frequency much more readily, and decrease
    them only after significant idle time, resulting in a 20% performance
    increase for nexuiz.
    
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 4 Thomas Spura 2011-09-08 09:10:51 UTC
I also see the unhandled thinkpad_acpi events from comment #1 when plugging in or unplugging the power cord.

Without a battery installed, I also have the "THERMAL EVENT" issues from comment #0, with a battery installed, they are gone.
Comment 5 Zhang Rui 2012-01-18 05:18:15 UTC
It's great that the kernel bugzilla is back.

Can you please verify if the problem still exists in the latest upstream
kernel?
Comment 6 aaron667 2012-01-21 19:27:23 UTC
The issue still persists for me in kernel 3.2.1. The very first message still appears on startup of the xserver, like this example:
[   41.158389] CPU6: Package power limit notification (total events = 1)
[   41.158393] CPU1: Package power limit notification (total events = 1)
[   41.158397] CPU4: Package power limit notification (total events = 1)
[   41.158401] CPU3: Package power limit notification (total events = 1)
[   41.158404] CPU5: Package power limit notification (total events = 1)
[   41.158407] CPU7: Package power limit notification (total events = 1)
[   41.158411] CPU2: Package power limit notification (total events = 1)
[   41.158414] CPU0: Package power limit notification (total events = 1)
[   41.169230] CPU6: Package power limit normal
[   41.169233] CPU4: Package power limit normal
[   41.169236] CPU1: Package power limit normal
[   41.169239] CPU3: Package power limit normal
[   41.169242] CPU7: Package power limit normal
[   41.169245] CPU5: Package power limit normal
[   41.169247] CPU2: Package power limit normal
[   41.169250] CPU0: Package power limit normal

After running nexuiz for about 5-6 minutes this is what I found in my log:
[  350.536528] CPU6: Package power limit notification (total events = 4)
[  350.536532] CPU2: Package power limit notification (total events = 4)
[  350.536536] CPU1: Package power limit notification (total events = 4)
[  350.536539] CPU3: Package power limit notification (total events = 4)
[  350.536543] CPU7: Package power limit notification (total events = 4)
[  350.536546] CPU0: Package power limit notification (total events = 4)
[  350.536553] CPU4: Package power limit notification (total events = 4)
[  350.536555] CPU5: Package power limit notification (total events = 4)
[  350.547453] CPU1: Package power limit normal
[  350.547458] CPU0: Package power limit normal
[  350.547465] CPU2: Package power limit normal
[  350.547470] CPU3: Package power limit normal
[  350.547477] CPU5: Package power limit normal
[  350.547483] CPU4: Package power limit normal
[  350.547490] CPU7: Package power limit normal
[  350.547495] CPU6: Package power limit normal
So the situation has improved somewhat compared to before.

Doing something CPU intensive (like compiling chromium) has a much worse result on the system. Initially it still looks just fine:
[  650.270344] CPU1: Package power limit notification (total events = 1835)
[  650.270349] CPU4: Package power limit notification (total events = 1835)
[  650.270354] CPU7: Package power limit notification (total events = 1835)
[  650.270358] CPU5: Package power limit notification (total events = 1835)
[  650.270362] CPU6: Package power limit notification (total events = 1835)
[  650.270367] CPU2: Package power limit notification (total events = 1835)
[  650.270371] CPU3: Package power limit notification (total events = 1835)
[  650.270375] CPU0: Package power limit notification (total events = 1835)
[  650.281316] CPU1: Package power limit normal
[  650.281320] CPU7: Package power limit normal
[  650.281324] CPU2: Package power limit normal
[  650.281327] CPU6: Package power limit normal
[  650.281331] CPU3: Package power limit normal
[  650.281335] CPU5: Package power limit normal
[  650.281338] CPU4: Package power limit normal
[  650.281341] CPU0: Package power limit normal

But the event counts (and the CPU temperature) start increasing to a scary level:
[  952.896326] CPU5: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896331] CPU4: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896337] CPU0: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896343] CPU3: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896349] CPU1: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896354] CPU2: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896359] CPU6: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.896364] CPU7: Package temperature above threshold, cpu clock throttled (total events = 101649)
[  952.897330] CPU1: Package temperature/speed normal
[  952.897335] CPU7: Package temperature/speed normal
[  952.897339] CPU5: Package temperature/speed normal
[  952.897343] CPU6: Package temperature/speed normal
[  952.897346] CPU4: Package temperature/speed normal
[  952.897350] CPU2: Package temperature/speed normal
[  952.897354] CPU3: Package temperature/speed normal
[  952.897357] CPU0: Package temperature/speed normal
[ 1252.565089] CPU1: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565095] CPU7: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565101] CPU5: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565107] CPU6: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565113] CPU3: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565118] CPU2: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565124] CPU4: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.565129] CPU0: Package temperature above threshold, cpu clock throttled (total events = 203131)
[ 1252.566076] CPU1: Package temperature/speed normal
[ 1252.566080] CPU4: Package temperature/speed normal
[ 1252.566083] CPU5: Package temperature/speed normal
[ 1252.566087] CPU2: Package temperature/speed normal
[ 1252.566091] CPU7: Package temperature/speed normal
[ 1252.566095] CPU3: Package temperature/speed normal
[ 1252.566098] CPU6: Package temperature/speed normal
[ 1252.566101] CPU0: Package temperature/speed normal
[ 1253.163807] CPU1: Core temperature above threshold, cpu clock throttled (total events = 111560)
[ 1253.163811] CPU0: Core temperature above threshold, cpu clock throttled (total events = 111560)
[ 1253.164796] CPU1: Core temperature/speed normal
[ 1253.164799] CPU0: Core temperature/speed normal
[ 1552.235864] CPU1: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235870] CPU2: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235877] CPU5: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235883] CPU3: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235889] CPU4: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235895] CPU7: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235900] CPU6: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.235906] CPU0: Package temperature above threshold, cpu clock throttled (total events = 311608)
[ 1552.236852] CPU1: Package temperature/speed normal
[ 1552.236857] CPU6: Package temperature/speed normal
[ 1552.236861] CPU7: Package temperature/speed normal
[ 1552.236865] CPU3: Package temperature/speed normal
[ 1552.236869] CPU2: Package temperature/speed normal
[ 1552.236873] CPU5: Package temperature/speed normal
[ 1552.236877] CPU4: Package temperature/speed normal
[ 1552.236880] CPU0: Package temperature/speed normal
[ 1851.905623] CPU0: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905628] CPU1: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905633] CPU4: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905640] CPU3: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905646] CPU5: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905652] CPU6: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905658] CPU7: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1851.905664] CPU2: Package temperature above threshold, cpu clock throttled (total events = 414672)
[ 1852.745619] CPU5: Core temperature/speed normal
[ 1852.745623] CPU4: Core temperature/speed normal
[ 1854.324459] CPU7: Core temperature above threshold, cpu clock throttled (total events = 42368)
[ 1854.324463] CPU6: Core temperature above threshold, cpu clock throttled (total events = 42368)
[ 1854.325446] CPU7: Core temperature/speed normal
[ 1854.325449] CPU6: Core temperature/speed normal
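The growth of the counters in the log above can be put into numbers. The following is a minimal sketch, assuming the dmesg line format shown in this comment; the names `LINE_RE`, `parse_events` and `event_rates` are made up for this illustration and are not part of any kernel tool.

```python
import re

# Matches lines such as:
# [  952.896337] CPU0: Package temperature above threshold, cpu clock throttled (total events = 101649)
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+):\s+(?P<msg>.+?)\s+"
    r"\(total events = (?P<total>\d+)\)"
)

def parse_events(lines):
    """Yield (timestamp, cpu, message, total_events) for lines that carry a counter."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield float(m["ts"]), int(m["cpu"]), m["msg"], int(m["total"])

def event_rates(lines, cpu=0):
    """Events per second between successive samples of each message type on one CPU."""
    last = {}    # message -> (timestamp, total) of the previous sample
    rates = []   # (message, timestamp, events_per_second)
    for ts, c, msg, total in parse_events(lines):
        if c != cpu:
            continue
        if msg in last and ts > last[msg][0]:
            prev_ts, prev_total = last[msg]
            rates.append((msg, ts, (total - prev_total) / (ts - prev_ts)))
        last[msg] = (ts, total)
    return rates

if __name__ == "__main__":
    import sys
    # Example: dmesg | python3 this_script.py
    for msg, ts, rate in event_rates(sys.stdin):
        print(f"{ts:12.3f}  {rate:10.1f} events/s  {msg}")
```

Only one CPU is sampled because, as the log shows, all CPUs of a package print the same package-level count.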
Comment 7 aaron667 2012-01-21 20:18:41 UTC
An interesting side note:
The above scenario was compiling chromium in the Konsole Terminal Emulator on KDE 4.7.4. The CPU temperature spiked to almost 100°C and (under such conditions) the throttling warnings were to be expected and can be considered a good and healthy decision.

Out of curiosity I now started another run of compiling chromium on the system console - and all of a sudden CPU temperature doesn't exceed 95°C anymore. Additionally there aren't any warnings about Package temperature or Core temperature - and only about 20 Package power limit notifications.

So maybe extensive logging on the Konsole Terminal Emulator is the culprit?
Comment 8 Waldo 2012-01-21 21:36:09 UTC
Well FWIW, I'll confirm-- I still see similar messages on my Thinkpad w520.  Almost identical to the above.

ubuntu's 3.2.0 RC7

I get those warnings mostly when compiling and not even exceeding temps over 90 (never hit 100), but sometimes I'll see them when it's at a cool 60 or so and not doing much at all.

It doesn't seem to happen every time I build-- I'm compiling a kernel now and don't see them.  So it's weird.
Comment 9 Pierre Ossman 2012-03-07 06:52:53 UTC
Got these on my Thinkpad 420s:

[11036.192232] CPU1: Package power limit notification (total events = 304)
[11036.192237] CPU3: Package power limit notification (total events = 304)
[11036.192240] CPU2: Package power limit notification (total events = 304)
[11036.192243] CPU0: Package power limit notification (total events = 304)
[11036.192368] CPU2: Package power limit normal
[11036.192370] CPU3: Package power limit normal
[11036.192372] CPU1: Package power limit normal
[11036.192374] CPU0: Package power limit normal
[11117.102093] [Hardware Error]: Machine check events logged
[11344.643133] CPU1: Package power limit notification (total events = 354)
[11344.643138] CPU3: Package power limit notification (total events = 354)
[11344.643141] CPU2: Package power limit notification (total events = 354)
[11344.643144] CPU0: Package power limit notification (total events = 354)
[11344.643537] CPU2: Package power limit normal
[11344.643539] CPU3: Package power limit normal
[11344.643542] CPU1: Package power limit normal
[11344.643544] CPU0: Package power limit normal
[11416.428190] [Hardware Error]: Machine check events logged
Comment 10 James Ettle 2012-05-05 13:24:18 UTC
(Is this bug still being attended to?) Just seen this on a new notebook with an i7-2760QM processor, while doing make -j8 on an ffmpeg build, kernel 3.3.4-3.fc17.x86_64.


[ 5363.846647] CPU2: Package power limit notification (total events = 1)
[ 5363.846651] CPU6: Package power limit notification (total events = 1)
[ 5363.846655] CPU4: Package power limit notification (total events = 1)
[ 5363.846660] CPU1: Package power limit notification (total events = 1)
[ 5363.846664] CPU7: Package power limit notification (total events = 1)
[ 5363.846668] CPU0: Package power limit notification (total events = 1)
[ 5363.846672] CPU5: Package power limit notification (total events = 1)
[ 5363.846675] CPU3: Package power limit notification (total events = 1)
[ 5363.847031] CPU3: Package power limit normal
[ 5363.847034] CPU5: Package power limit normal
[ 5363.847037] CPU7: Package power limit normal
[ 5363.847040] CPU0: Package power limit normal
[ 5363.847043] CPU1: Package power limit normal
[ 5363.847046] CPU6: Package power limit normal
[ 5363.847049] CPU2: Package power limit normal
[ 5363.847052] CPU4: Package power limit normal


Machine doesn't seem any worse off for it, and no MCEs got logged. Perhaps the performance governors need some tuning? Would attaching my machine's acpidump help here?
Comment 11 Alan 2012-06-14 17:57:40 UTC
Several of the very thin laptops seem to report these. Should we be logging these so loudly in the non-machine-check case?
Comment 12 Fenghua Yu 2012-06-14 18:11:23 UTC
This issue is fixed in upstream in
commit 29e9bf1841e4f9df13b4992a716fece7087dd237
Date:   Fri Nov 4 13:31:23 2011 -0700
    x86, mce, therm_throt: Don't report power limit and package level thermal throttle events in
    
    Thermal throttle and power limit events are not defined as MCE errors in x86
    architecture and should not generate MCE errors in mcelog.
    
    Current kernel generates fake software defined MCE errors for these events.
    This may confuse users because they may think the machine has real MCE errors
    while actually only thermal throttle or power limit events happen.
    
    To make it worse, buggy firmware on some platforms may falsely generate
    the events. Therefore, kernel reports MCE errors which users think as real
    hardware errors. Although the firmware bugs should be fixed, on the other hand,
    kernel should not report MCE errors either.
    
    So mcelog is not a good mechanism to report these events. To report the events, we count them
    in counters (package_power_limit_count, core_throttle_count, and package_throttle_count) in
    /sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters
    for each event on each CPU. Please note that all CPUs on one package report
    duplicate counters. It's the user application's responsibility to retrieve a package
    level counter for one package.
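
As a rough illustration of the last point of the commit message, the per-CPU counters could be collapsed to one value per package along the following lines. This is only a sketch: it assumes the counter files named in the commit message exist on the running kernel, it groups CPUs by `topology/physical_package_id`, and the function name `package_counters` is made up for this example.

```python
import glob
import os
import re

def package_counters(sysfs_root="/sys/devices/system/cpu",
                     counter="package_power_limit_count"):
    """Return {package_id: count}, keeping one value per physical package."""
    result = {}
    for cpu_dir in sorted(glob.glob(os.path.join(sysfs_root, "cpu[0-9]*"))):
        # Skip anything that is not a plain cpuN directory.
        if not re.fullmatch(r"cpu\d+", os.path.basename(cpu_dir)):
            continue
        pkg_file = os.path.join(cpu_dir, "topology", "physical_package_id")
        cnt_file = os.path.join(cpu_dir, "thermal_throttle", counter)
        try:
            with open(pkg_file) as f:
                pkg = int(f.read().strip())
            with open(cnt_file) as f:
                count = int(f.read().strip())
        except (OSError, ValueError):
            # Counter not present (older kernel, offline CPU, other hardware).
            continue
        # CPUs of one package should report duplicates; keep the largest
        # value in case they differ by a few events.
        result[pkg] = max(result.get(pkg, 0), count)
    return result

if __name__ == "__main__":
    for pkg, count in sorted(package_counters().items()):
        print(f"package {pkg}: {count}")
```

Taking the maximum per package is a design choice for this sketch; logs elsewhere in this report show per-CPU totals that differ by one or two events within the same package.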
Comment 13 Steffen Weber 2012-09-17 06:02:39 UTC
I'm getting tons of these messages on a Xeon E5-2430 (cpu family 6, model 45, stepping 7). Last Friday the "total events" counter got close to 2^32 and the whole system came to a halt. Upgrading from Linux 3.4.11 to 3.5.4 did not help.

Someone mentioned buggy firmware above. Might a microcode update help?
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=21385
Comment 14 Len Brown 2012-10-30 02:04:17 UTC
Fenghua,
the commit in comment #12 shipped in Linux-3.3,
but the complaint in comment #13 includes Linux-3.5.4.

This is still an open issue?
Comment 15 Steffen Weber 2012-10-30 08:31:40 UTC
Yes, I'm still seeing this issue on Linux 3.6.2.
Comment 16 Markus Rathgeb 2012-11-23 13:11:13 UTC
Is there any news or progress?

Nov 23 11:00:04 pc88 kernel: CPU3: Package power limit notification (total events = 4332)
Nov 23 11:00:04 pc88 kernel: CPU1: Package power limit notification (total events = 4333)
Nov 23 11:00:04 pc88 kernel: CPU0: Package power limit notification (total events = 4334)
Nov 23 11:00:04 pc88 kernel: CPU2: Package power limit notification (total events = 4334)
Nov 23 11:00:04 pc88 kernel: CPU2: Package power limit normal
Nov 23 11:00:04 pc88 kernel: CPU1: Package power limit normal
Nov 23 11:00:04 pc88 kernel: CPU0: Package power limit normal
Nov 23 11:00:04 pc88 kernel: CPU3: Package power limit normal
Nov 23 11:13:50 pc88 kernel: CPU3: Package power limit notification (total events = 4349)
Nov 23 11:13:50 pc88 kernel: CPU0: Package power limit notification (total events = 4351)
Nov 23 11:13:50 pc88 kernel: CPU1: Package power limit notification (total events = 4350)
Nov 23 11:13:50 pc88 kernel: CPU2: Package power limit notification (total events = 4351)
Nov 23 11:13:50 pc88 kernel: CPU3: Package power limit normal
Nov 23 11:13:50 pc88 kernel: CPU2: Package power limit normal
Nov 23 11:13:50 pc88 kernel: CPU1: Package power limit normal
Nov 23 11:13:50 pc88 kernel: CPU0: Package power limit normal
Nov 23 11:20:48 pc88 kernel: CPU0: Package power limit notification (total events = 4355)
Nov 23 11:20:48 pc88 kernel: CPU1: Package power limit notification (total events = 4354)
Nov 23 11:20:48 pc88 kernel: CPU2: Package power limit notification (total events = 4355)
Nov 23 11:20:48 pc88 kernel: CPU3: Package power limit notification (total events = 4353)
Nov 23 11:20:48 pc88 kernel: CPU2: Package power limit normal
Nov 23 11:20:48 pc88 kernel: CPU0: Package power limit normal
Nov 23 11:20:48 pc88 kernel: CPU3: Package power limit normal
Nov 23 11:20:48 pc88 kernel: CPU1: Package power limit normal
Nov 23 11:52:15 pc88 kernel: CPU2: Package power limit notification (total events = 4397)
Nov 23 11:52:15 pc88 kernel: CPU1: Package power limit notification (total events = 4396)
Nov 23 11:52:15 pc88 kernel: CPU3: Package power limit notification (total events = 4395)
Nov 23 11:52:15 pc88 kernel: CPU0: Package power limit notification (total events = 4397)
Nov 23 11:52:15 pc88 kernel: CPU1: Package power limit normal
Nov 23 11:52:15 pc88 kernel: CPU0: Package power limit normal
Nov 23 11:52:15 pc88 kernel: CPU3: Package power limit normal
Nov 23 11:52:15 pc88 kernel: CPU2: Package power limit normal
Nov 23 11:57:39 pc88 kernel: CPU0: Package power limit notification (total events = 4429)
Nov 23 11:57:39 pc88 kernel: CPU2: Package power limit notification (total events = 4429)
Nov 23 11:57:39 pc88 kernel: CPU3: Package power limit notification (total events = 4427)
Nov 23 11:57:39 pc88 kernel: CPU1: Package power limit notification (total events = 4428)
Nov 23 11:57:39 pc88 kernel: CPU0: Package power limit normal
Nov 23 11:57:39 pc88 kernel: CPU2: Package power limit normal
Nov 23 11:57:39 pc88 kernel: CPU3: Package power limit normal
Nov 23 11:57:39 pc88 kernel: CPU1: Package power limit normal
Nov 23 12:07:44 pc88 kernel: CPU1: Package power limit notification (total events = 4566)
Nov 23 12:07:44 pc88 kernel: CPU0: Package power limit notification (total events = 4567)
Nov 23 12:07:44 pc88 kernel: CPU3: Package power limit notification (total events = 4565)
Nov 23 12:07:44 pc88 kernel: CPU2: Package power limit notification (total events = 4567)
Nov 23 12:07:44 pc88 kernel: CPU2: Package power limit normal
Nov 23 12:07:44 pc88 kernel: CPU3: Package power limit normal
Nov 23 12:07:44 pc88 kernel: CPU0: Package power limit normal
Nov 23 12:07:44 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:21:40 pc88 kernel: CPU0: Package power limit notification (total events = 4569)
Nov 23 13:21:40 pc88 kernel: CPU2: Package power limit notification (total events = 4569)
Nov 23 13:21:40 pc88 kernel: CPU1: Package power limit notification (total events = 4568)
Nov 23 13:21:40 pc88 kernel: CPU3: Package power limit notification (total events = 4567)
Nov 23 13:21:40 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:21:40 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:21:40 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:21:40 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:26:41 pc88 kernel: CPU3: Package power limit notification (total events = 4584)
Nov 23 13:26:41 pc88 kernel: CPU2: Package power limit notification (total events = 4586)
Nov 23 13:26:41 pc88 kernel: CPU1: Package power limit notification (total events = 4585)
Nov 23 13:26:41 pc88 kernel: CPU0: Package power limit notification (total events = 4586)
Nov 23 13:26:41 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:26:41 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:26:41 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:26:41 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:31:48 pc88 kernel: CPU0: Package power limit notification (total events = 4587)
Nov 23 13:31:48 pc88 kernel: CPU1: Package power limit notification (total events = 4586)
Nov 23 13:31:48 pc88 kernel: CPU2: Package power limit notification (total events = 4587)
Nov 23 13:31:48 pc88 kernel: CPU3: Package power limit notification (total events = 4585)
Nov 23 13:31:48 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:31:48 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:31:48 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:31:48 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:36:48 pc88 kernel: CPU3: Package power limit notification (total events = 4731)
Nov 23 13:36:48 pc88 kernel: CPU1: Package power limit notification (total events = 4732)
Nov 23 13:36:48 pc88 kernel: CPU0: Package power limit notification (total events = 4733)
Nov 23 13:36:48 pc88 kernel: CPU2: Package power limit notification (total events = 4733)
Nov 23 13:36:48 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:36:48 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:36:48 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:36:48 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:42:02 pc88 kernel: CPU2: Package power limit notification (total events = 4764)
Nov 23 13:42:02 pc88 kernel: CPU3: Package power limit notification (total events = 4762)
Nov 23 13:42:02 pc88 kernel: CPU0: Package power limit notification (total events = 4764)
Nov 23 13:42:02 pc88 kernel: CPU1: Package power limit notification (total events = 4763)
Nov 23 13:42:02 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:42:02 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:42:02 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:42:02 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:47:55 pc88 kernel: CPU3: Package power limit notification (total events = 4875)
Nov 23 13:47:55 pc88 kernel: CPU0: Package power limit notification (total events = 4877)
Nov 23 13:47:55 pc88 kernel: CPU2: Package power limit notification (total events = 4877)
Nov 23 13:47:55 pc88 kernel: CPU1: Package power limit notification (total events = 4876)
Nov 23 13:47:55 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:47:55 pc88 kernel: CPU2: Package power limit normal
Nov 23 13:47:55 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:47:55 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:53:49 pc88 kernel: CPU1: Package power limit notification (total events = 4888)
Nov 23 13:53:49 pc88 kernel: CPU3: Package power limit notification (total events = 4887)
Nov 23 13:53:49 pc88 kernel: CPU2: Package power limit notification (total events = 4889)
Nov 23 13:53:49 pc88 kernel: CPU0: Package power limit notification (total events = 4889)
Nov 23 13:53:49 pc88 kernel: CPU1: Package power limit normal
Nov 23 13:53:49 pc88 kernel: CPU0: Package power limit normal
Nov 23 13:53:49 pc88 kernel: CPU3: Package power limit normal
Nov 23 13:53:49 pc88 kernel: CPU2: Package power limit normal
Nov 23 14:05:44 pc88 kernel: CPU2: Package power limit notification (total events = 4995)
Nov 23 14:05:44 pc88 kernel: CPU0: Package power limit notification (total events = 4996)
Nov 23 14:05:44 pc88 kernel: CPU3: Package power limit notification (total events = 4993)
Nov 23 14:05:44 pc88 kernel: CPU1: Package power limit notification (total events = 4995)
Nov 23 14:05:44 pc88 kernel: CPU0: Package power limit normal
Nov 23 14:05:44 pc88 kernel: CPU3: Package power limit normal
Nov 23 14:05:44 pc88 kernel: CPU1: Package power limit normal
Nov 23 14:05:44 pc88 kernel: CPU2: Package power limit normal

# uname -a
Linux pc88 3.6.6-gentoo #13 SMP Wed Nov 21 16:41:09 CET 2012 x86_64 Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz GenuineIntel GNU/Linux
Comment 17 Lance Grover 2012-12-05 14:59:26 UTC
I was seeing these same kernel log events on a Dell PowerEdge R320 with an Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz.  It was causing me issues running some java applications.

I went into the BIOS and discovered that CPU power stepping was turned on. I changed that to Performance so the system would not do dynamic CPU power stepping, and the errors went away and the java applications no longer had issues.

I hope that information helps, could someone else try that same setting to see if the issue goes away for them as well?
Comment 18 Steffen Weber 2012-12-05 17:40:57 UTC
Lance, now your power consumption is probably much higher. Maybe you can check this in the DRAC web interface?
Comment 19 RJ 2012-12-13 00:42:44 UTC
Hi everybody.  I think I might be having the same problem with a brand new Dell R520 which has Xeon E5 2440 2.4GHz processors.  Is there anything I can do to help contribute to fixing this?  It's brand new hardware, can't imagine there being anything wrong with it already.
Comment 20 Zhang Rui 2013-04-15 07:20:23 UTC
(In reply to comment #14)
> Fenghua,
> the commit in comment #12 shipped in Linux-3.3,
> but the complaint in comment #13 includes Linux-3.5.4.
> 
> This is still an open issue?

Hi, Fenghua,

any update on this?
Comment 21 Steffen Weber 2013-04-15 07:23:46 UTC
Issue still exists for me in Linux 3.8.5.
Comment 22 Bob 2013-05-03 19:36:16 UTC
Comment #17 fits our situation exactly, with the exception that we are running a Dell R420 with 2x E5-2430 CPUs.
Comment 23 Jan 2013-05-07 16:47:50 UTC
Same here on a brand new R620, ubuntu 12.10 (Linux 3.5.0-27)

Is there a way to at least filter these from the logs so they don't crowd out other messages?
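
One possible way to hide these lines when reading a log is a small filter such as the sketch below. It assumes the two message formats quoted in this report ("Package power limit notification (total events = N)" and "Package power limit normal"); the names `NOISE_RE` and `drop_power_limit_lines` are made up for this example. It only filters what is displayed and does not stop the kernel from logging the events.

```python
import re
import sys

# The two message shapes quoted in this report, anchored at end of line.
NOISE_RE = re.compile(
    r"CPU\d+: Package power limit "
    r"(notification \(total events = \d+\)|normal)\s*$"
)

def drop_power_limit_lines(lines):
    """Return the input lines without package power limit messages."""
    return [line for line in lines if not NOISE_RE.search(line)]

if __name__ == "__main__":
    # Read a log on stdin and write the filtered log to stdout.
    sys.stdout.writelines(drop_power_limit_lines(sys.stdin))
```

For example, piping `dmesg` output through the script would leave temperature, thinkpad_acpi and machine check lines untouched.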
Comment 24 Bobby 2013-05-08 17:17:00 UTC
I am having the same problem with a brand new R420 with 2x E5-2430's running RHEL 6.2

I don't see any issues using the iDRAC card that would cause this. Is there no way to turn these alerts off?
Comment 25 Jan 2013-05-08 18:43:10 UTC
I found a solution in another thread to edit /etc/modprobe.d/blacklist.conf and add "blacklist sb_edac" and "blacklist i7core_edac" and that made the errors go away. I'm not sure if that has any adverse side effects, but so far it seems to work.
Comment 26 Fenghua Yu 2013-05-08 18:57:51 UTC
RHEL6.4, 3.6.6 haven't picked up the upstream fix: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29e9bf1841e4f9df13b4992a716fece7087dd237

Please check if your kernel has the fix. If possible, you can apply the fix to your kernel.

I'm working with RH and stable kernel maintainers to put the fix in their kernels.

In the meantime, you can work around the issue with the kernel parameter "clearcpuid=229", which disables the Power Limit Notification (PLN) feature in the cpu flags.

Disabling sb_edac or i7core_edac is too intrusive and may cause other unexpected behaviors.
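
To check whether the workaround took effect, something like the sketch below may help. It rests on two assumptions that should be verified against the cpufeature header of the kernel in use: that feature numbers passed to clearcpuid are encoded as word * 32 + bit, and that the PLN feature appears as the flag `pln` in /proc/cpuinfo. The function names are made up for this example.

```python
def decode_feature(number):
    """Split a feature number into (word, bit), assuming number = word * 32 + bit."""
    return divmod(number, 32)

def cpu_has_flag(cpuinfo_text, flag):
    """Return True if the first 'flags' line of a /proc/cpuinfo dump lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return flag in value.split()
    return False

if __name__ == "__main__":
    print("229 decodes to (word, bit):", decode_feature(229))
    try:
        with open("/proc/cpuinfo") as f:
            print("pln flag present:", cpu_has_flag(f.read(), "pln"))
    except OSError as exc:
        print("could not read /proc/cpuinfo:", exc)
```

Under the stated encoding, 229 corresponds to word 7, bit 5. After booting with the parameter, the `pln` flag would be expected to disappear from the flags line.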
Comment 27 Bob 2013-05-08 20:58:52 UTC
Hi Fenghua (and all):

For the record, my personal recommendation would be to disregard or blacklist these log entries only if you're certain you're not experiencing any performance degradation, or if you are, that it is acceptable to you.

In the case of running java applications, more than one person has encountered severe performance problems, which were only actually solved by disabling a variety of EIST functionality in BIOS. 

To be clear: without these notice messages showing up in the logs, there would have been virtually no way to diagnose this issue. 

My feeling is, while these messages are innocuous (ie. there is no error condition, they are merely logging CPU stepping activity), seeing these log entries is very necessary so that you can discover the existence of CPU stepping. That is a crucial troubleshooting component to diagnosing the root cause of java application performance degradation when running this family of hardware.

Rather than working with RH to silence the notifications, it would be much better to find and fix the root cause of the performance problems (in the case of java), and then add some sort of normal logging parameter that would allow for a graceful silencing of these messages when performance problems are not involved.

That said, java may simply fall into Dell's "High Performance Computing" category, where even they suggest disabling CPU stepping (see pages 13 & 14):
http://www.dellhpcsolutions.com/dellhpcsolutions/assets/Optimal_BIOS_HPC_Dell_12G.v1.0.pdf
Comment 28 Fenghua Yu 2013-05-08 21:25:35 UTC
> To be clear: without these notice messages showing up in the logs, there
> would
> have been virtually no way to diagnose this issue. 

The fix patch in
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29e9bf1841e4f9df13b4992a716fece7087dd237
does record the events in /sys/devices/system/cpu/cpu#/thermal_throttle/core_power_limit_count or package_power_limit_count

Tools can monitor the counters to know the power limit notification events.

What the fix patch no longer shows is the messages in log_buf and the mce log, which are annoying according to users' feedback.
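
As a rough illustration of how a tool might read these counters, here is a minimal sketch. It assumes the sysfs layout named above; the helper name show_pln_counts is hypothetical, and the root directory is a parameter only so the function can be tried against a copy of the tree.

```shell
#!/bin/sh
# Hypothetical helper: print every core/package power limit counter
# found under a sysfs-style CPU root (default: /sys/devices/system/cpu).
show_pln_counts() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/thermal_throttle/*_power_limit_count; do
        # Skip the unexpanded glob and any unreadable file.
        [ -r "$f" ] || continue
        # Print the path relative to the root, followed by the count.
        printf '%s %s\n' "${f#"$root"/}" "$(cat "$f")"
    done
}

show_pln_counts
```

On a machine without these files the function prints nothing.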
Comment 29 Bob 2013-05-08 21:54:33 UTC
Good to know. Thank you for that information. 

From a troubleshooting perspective, those messages are often the only initial clues to investigating the CPU/BIOS configuration as a potential cause of performance problems. This is why I argue that they should remain visible in logs unless the user makes a conscious decision to mute them. 

It's highly valuable that the events are still getting recorded somewhere else, but without these log entries, it's not likely an administrator would necessarily know to go find and review those events in connection with seemingly unconnected problems. 

After all, performance problems like these are so often the cause of something else up in higher layers. Without log entries to indicate otherwise, prematurely silencing them to solve an annoyance problem may further hamper troubleshooting and lead to misdiagnoses. 

Perhaps this might be helped by decreasing verbosity of these events by default? (rather than an all-on or all-off approach). This would still trigger the user to investigate the messages, without all the day-to-day sifting of clutter.

Of course, having a method for increasing or decreasing the verbosity of power limit event log entries as needed would also be valuable. 

I totally understand the annoyance factor, but since I just got done relying on them to lead me to a proper diagnosis, I also see their value.
Comment 30 Zhang Rui 2013-05-09 02:03:16 UTC
Hi, Fenghua,
can you please help propose a fix patch for the warning messages?
Comment 31 Jan 2013-05-09 08:16:26 UTC
@Fenghua: Thanks! Adding clearcpuid=229 as a boot option has indeed solved the issue.

@Bob: I don't see how syslog filling up with garbage on an idle server with a clean Linux install is in any way conducive to troubleshooting. It may actually make you miss important messages.
Comment 32 Michal Petrucha 2013-05-09 08:36:56 UTC
Jan: I think everyone agrees that the rate at which the messages appear is just too high, that's why Bob suggested reducing the verbosity (which I understood as reducing the number of messages appearing in the syslog). The main concern raised by Bob is valid, IMO -- completely silencing the messages by default for everyone might not be the best idea.
Comment 33 Jan 2013-05-09 09:36:36 UTC
Michal: OK, but I'm not sure there is a half-way house here. If I understand power stepping correctly then this will naturally occur each time the load on the system ramps up or down, so even if it is less verbose you would see lots of messages. 

By the way, I'm also not sure I understand how these messages would help in diagnosing a problem: power stepping, when operating correctly, should not have adverse effects. But apparently Bob's experience with java is different (although, as the Dell white paper shows, switching off power stepping will make power consumption jump from 150W to 300W so switching it off is not really a solution for most people anyway).
Comment 34 Bob 2013-05-09 17:26:37 UTC
It's certainly not an optimal solution for us either, but it's the only one we have in lieu of some kind of added latency tolerance from java, I'd guess.

I completely agree, that on a stock-installed OS with no performance problems, those log entries are pure clutter that should be suppressed. But that should be an active decision because of how necessary they end up being during troubleshooting.

I know my idea of increasing/decreasing verbosity on stepping actions is likely not possible because of the nature of the activity. Perhaps something like a single line: "CPU power was adjusted X times in the past hour." Then at least you're only dealing with a max of 24 such lines per day. Still clutter, but far fewer than what we have now, without completely silencing them. 

Just thinking out loud, so I'm certain there's probably a better way.
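
The single-line summary idea could be prototyped in user space from the sysfs counter mentioned in comment 28. This is only a sketch: it samples cpu0's package_power_limit_count, and the helper names read_pln_count, summarize_pln and watch_pln are hypothetical. It is not how the kernel logs these events.

```shell
#!/bin/sh
# Hypothetical helper: read one power limit counter, or 0 if it is absent.
read_pln_count() {
    f="${1:-/sys/devices/system/cpu/cpu0/thermal_throttle/package_power_limit_count}"
    if [ -r "$f" ]; then cat "$f"; else echo 0; fi
}

# Hypothetical helper: format one summary line from two counter samples.
summarize_pln() {
    before="$1"; after="$2"; period="$3"
    echo "CPU power was adjusted $((after - before)) times in the past $period."
}

# Sample, wait, sample again, and print one line per interval (seconds).
# Defined here but not started; call e.g. "watch_pln 3600" to run it.
watch_pln() {
    interval="${1:-3600}"
    while :; do
        a=$(read_pln_count)
        sleep "$interval"
        b=$(read_pln_count)
        summarize_pln "$a" "$b" "${interval}s"
    done
}
```

Piping the output of watch_pln into the system logger would give at most one line per interval instead of one line per event.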
Comment 35 Len Brown 2013-05-21 16:19:08 UTC
Created attachment 102161 [details]
[1/2] x86 thermal: Disable power limit notification interrupt by default


Power Limit Notification (X86_FEATURE_PLN) was added in Sandy Bridge
to give the OS the option of knowing when the package has reached
a configured power threshold.

Linux-2.6.36 enabled this feature:
0199114c31798af5b83841b21759b64171060d9b
(x86, hwmon: Package Level Thermal/Power: power limit)

It enabled the interrupt, and the interrupt handler
added to the MCE log and printed to the console:

printk(KERN_CRIT "CPU%d: %s power limit notification (total events = %lu)
printk(KERN_INFO "CPU%d: %s power limit normal\n"

However, these events are quite routine on some systems under some conditions,
alarming customers and provoking unnecessary customer support calls.

So the MCE log entry was deleted in Linux-3.3:

29e9bf1841e4f9df13b4992a716fece7087dd237
(x86, mce, therm_throt: Don't report power limit and package level thermal throttle events in mcelog)

Here we delete the corresponding kernel console messages,
and then we disable the interrupt by default -- allowing it
to be enabled by cmdline for diagnosis purposes.

https://bugzilla.kernel.org/show_bug.cgi?id=36182

This pair of patches applies cleanly back to Linux-3.3.

Documentation/kernel-parameters.txt      |  2 ++
arch/x86/kernel/cpu/mcheck/therm_throt.c | 43 ++++++++++++++++++++++++++-----------------
 2 files changed, 28 insertions(+), 17 deletions(-)

[1/2] x86 thermal: Disable power limit notification interrupt by default
[2/2] x86 thermal: Delete power-limit-notification console messages
Comment 36 Len Brown 2013-05-21 16:21:55 UTC
Created attachment 102171 [details]
[PATCH 2/2] x86 thermal: Disable power limit notification interrupt
Comment 37 Len Brown 2013-05-21 17:00:12 UTC
a clarification...
comment #35: [PATCH 1/2] x86 thermal: Delete power-limit-notification console messages

plus

comment #36: [PATCH 2/2] x86 thermal: Disable power limit notification interrupt
 by default

are what we are proposing for upstream (and -stable).

The attachments above are correct, but the text describing the attachments in comment #35 erroneously exchanged their 1-line summaries.
Comment 38 Bob 2013-05-21 19:13:43 UTC
So just to clarify, the decision was made to disable these event notifications by default, and allow admins to manually enable them for diagnostic purposes. I understand that this addresses the largest contingent of people (those who see the messages but are not experiencing any corresponding performance problems). But I stand by my warning that suppressing these messages by default will lead others to misdiagnose a potentially wide variety of problems that would otherwise be resolved by discovering these messages and being led to properly diagnose CPU power misconfiguration.

These event notices are one of the only highly-visible indicators, which makes them very important to those trying to track down this specific problem. I feel the suppression of these notices is merely a band-aid approach to appease those who complain about log clutter. I do sympathize, as I hate log clutter too, but not at the expense of obfuscating problem indicators.
Comment 39 Martin Mokrejs 2013-08-26 21:23:17 UTC
I landed at this thread because I got this message on 3.10.9. However, I realized during reading this thread it is probably about too many such lines in syslogs, which is NOT my case. However, as it seems the code was almost dropped from the kernel, I would like to add my comments.

Seems several reports here are about laptops, actually SandyBridge-based laptops. I have yet another, i7-2640M. I realized that if I disable my hyper-threaded cores:

echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

that my single-threaded applications run faster (I run two instances just to fill the physical cores). According to the i7z tool, I was reaching temperatures of 95-98 °C, while with all 4 cores enabled it never reached those temperatures and processing speed/throughput was lower. Disabling the HT cores also had one other effect: the physical cores could run at higher boosted speeds, which of course heated up the CPU more. I would say that was the right way to test my cooling.
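
Which logical CPUs are the hyper-threaded siblings differs between machines, so it may be worth checking before taking any offline. A minimal sketch, assuming the kernel exposes topology/thread_siblings_list for each online CPU; the helper name list_secondary_threads is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the number of every logical CPU that is not
# the first entry in its own thread_siblings_list (i.e. a secondary thread).
list_secondary_threads() {
    root="${1:-/sys/devices/system/cpu}"
    for d in "$root"/cpu[0-9]*; do
        f="$d/topology/thread_siblings_list"
        # Offline CPUs and non-CPU directories have no such file.
        [ -r "$f" ] || continue
        n="${d##*/cpu}"               # "…/cpu2" -> "2"
        siblings=$(cat "$f")          # e.g. "0,2" or "0-1"
        first="${siblings%%[,-]*}"    # lowest-numbered sibling
        [ "$n" = "$first" ] || echo "$n"
    done
}

list_secondary_threads
```

On a machine where the sibling lists are "0,2" and "1,3", this prints 2 and 3, which are the CPUs taken offline above.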

I was glad kernel reported:

[ 1092.103952] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1092.103954] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 1092.103957] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1092.104931] CPU1: Core temperature/speed normal
[ 1092.104933] CPU0: Package temperature/speed normal
[ 1092.104936] CPU1: Package temperature/speed normal
[ 1201.614297] mce: [Hardware Error]: Machine check events logged
[ 1395.598163] CPU1: Core temperature above threshold, cpu clock throttled (total events = 21680)
[ 1395.598169] CPU1: Package temperature above threshold, cpu clock throttled (total events = 22191)
[ 1395.598190] CPU0: Package temperature above threshold, cpu clock throttled (total events = 22191)
[ 1395.599169] CPU1: Core temperature/speed normal
[ 1395.599171] CPU1: Package temperature/speed normal
[ 1395.599176] CPU0: Package temperature/speed normal
[ 1502.016525] mce: [Hardware Error]: Machine check events logged
[ 1698.841500] CPU1: Core temperature above threshold, cpu clock throttled (total events = 46139)
[ 1698.841503] CPU0: Package temperature above threshold, cpu clock throttled (total events = 47504)
[ 1698.841506] CPU1: Package temperature above threshold, cpu clock throttled (total events = 47504)
[ 1698.842526] CPU0: Package temperature/speed normal
[ 1698.842528] CPU1: Core temperature/speed normal
[ 1698.842529] CPU1: Package temperature/speed normal
[ 1952.545072] mce: [Hardware Error]: Machine check events logged
[ 1999.213823] CPU0: Package temperature/speed normal
[ 1999.213826] CPU1: Core temperature/speed normal
[ 1999.213829] CPU1: Package temperature/speed normal
[ 2102.731048] mce: [Hardware Error]: Machine check events logged

[15125.078769] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1015693)
[15125.078771] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1048803)
[15125.078776] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1048803)
[15125.079794] CPU1: Core temperature/speed normal
[15125.079796] CPU0: Package temperature/speed normal
[15125.079798] CPU1: Package temperature/speed normal
[15425.600586] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1041101)
[15425.600588] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1075009)
[15425.600593] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1075009)
[15425.601591] CPU1: Core temperature/speed normal
[15425.601593] CPU0: Package temperature/speed normal
[15425.601596] CPU1: Package temperature/speed normal
[15725.995979] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1064631)
[15725.995983] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1099299)
[15725.995986] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1099299)
[15725.996987] CPU1: Core temperature/speed normal
[15725.996989] CPU0: Package temperature/speed normal
[15725.996991] CPU1: Package temperature/speed normal
[16301.492089] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1066448)
[16301.492091] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1101154)
[16301.492096] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1101154)
[16301.493096] CPU1: Core temperature/speed normal
[16301.493098] CPU0: Package temperature/speed normal
[16301.493098] CPU1: Package temperature/speed normal
[16607.731994] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1069217)
[16607.731999] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1104055)
[16607.732006] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1104055)
[16607.732958] CPU1: Core temperature/speed normal
[16607.732959] CPU1: Package temperature/speed normal
[16607.732982] CPU0: Package temperature/speed normal
[21761.864712] r8169 0000:05:00.0 enp5s0: link down
[21763.550884] r8169 0000:05:00.0 enp5s0: link up
[25099.840780] conftest[4957]: segfault at 0 ip 0000000000400570 sp 00007fff7a800450 error 4 in conftest[400000+1000]
[25100.156282] conftest[4981]: segfault at 0 ip 00007f674705bef6 sp 00007fffbec472d8 error 4 in libc-2.17.so[7f6746f39000+1a2000]
[25187.711268] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1071206)
[25187.711270] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1106064)
[25187.711275] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1106064)
[25187.712288] CPU1: Core temperature/speed normal
[25187.712290] CPU0: Package temperature/speed normal
[25187.712293] CPU1: Package temperature/speed normal
[25208.899908] conftest[15966]: segfault at 0 ip 0000000000400570 sp 00007fff169ea510 error 4 in conftest[400000+1000]
[25209.230326] conftest[15990]: segfault at 0 ip 00007f22fc360ef6 sp 00007fff8918c5d8 error 4 in libc-2.17.so[7f22fc23e000+1a2000]
[25288.933545] conftest[27177]: segfault at 0 ip 0000000000400570 sp 00007fffcb660da0 error 4 in conftest[400000+1000]
[25289.180398] conftest[27209]: segfault at 0 ip 00007f213482aef6 sp 00007fffd657a1e8 error 4 in libc-2.17.so[7f2134708000+1a2000]
[25488.084595] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1087585)
[25488.084598] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1123071)
[25488.084602] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1123071)
[25488.097613] CPU1: Core temperature/speed normal
[25488.097615] CPU0: Package temperature/speed normal
[25488.097621] CPU1: Package temperature/speed normal
[25788.465891] CPU1: Core temperature/speed normal
[25788.465893] CPU0: Package temperature/speed normal
[25788.465898] CPU1: Package temperature/speed normal
[26088.838199] CPU1: Core temperature/speed normal


mcelog said:
Hardware event. This is not a software error.
MCE 5
CPU 1 THERMAL EVENT TSC 1b06d3491f5a 
TIME 1375546009 Sat Aug  3 18:06:49 2013
Processor 1 below trip temperature. Throttling disabled
STATUS 8801028a MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
Hardware event. This is not a software error.
MCE 6
CPU 1 THERMAL EVENT TSC 1c36a76eacb8 
TIME 1375546476 Sat Aug  3 18:14:36 2013
Processor 1 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 880003cb MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42



My external LCD, connected via a DVI cable to the i7 processor with its built-in graphics chip, was sometimes blinking 4-5x during a 1-minute window. I believe that was caused by the CPU being throttled or stepped back into higher speeds. Thanks to these messages Dell replaced my CPU cooler and motherboard. The technician also placed more thermal glue onto the CPU.


Now, with only the 2 physical cores enabled, I reach temperatures of 64-76 °C and have fewer segmentation faults and no external LCD blinking at all. Messages about CPU throttling are gone but, finally getting to the subject of this thread, I get:

[98101.254002] CPU1: Package power limit notification (total events = 2)
[98101.254004] CPU0: Package power limit notification (total events = 2)
[98101.255084] CPU1: Package power limit normal
[98101.255085] CPU0: Package power limit normal
[111450.779762] binaryurpReader[30277]: segfault at a0 ip 00007fc9d6aa20c7 sp 00007fc9efffe280 error 4 in libfwllo.so[7fc9d6a6b000+7e000]
[133993.045662] soffice.bin[7553]: segfault at 18 ip 00007f935f72641a sp 00007fff8d249ae0 error 4 in libvclplug_gtklo.so[7f935f6bb000+c1000]
[208645.410300] CPU1: Package power limit notification (total events = 7)
[208645.410302] CPU0: Package power limit notification (total events = 7)
[208645.421172] CPU0: Package power limit normal
[208645.421175] CPU1: Package power limit normal


You see, although my CPU is not overheating too much (new cooler and more thermal glue) I still have some issues. Or maybe it just tells me that I am really squeezing maximum CPU power? So, a good sign in my case? Or would you say that my CPU has bad silicon and is still heating too much over the spec? Aren't we all with SandyBridge laptops and CPUs at high frequency having a cooling issue?

Either way, I am not disturbed by these messages but wonder why I never previously saw the "Package power limit notification" with definitely faulty cooling. The messages are helpful to study what is going on under the hood.
Comment 40 Stefano Cherchi 2013-09-03 13:47:31 UTC
I think the title of the page is misleading. We're experiencing the issue with Red Hat 6.4, kernel version 2.6.32, on several Dell Poweredge r620 with Sandybridge family processors (various models).

No notification with Red Hat 5, kernel version 2.6.18, on the same hardware.

So, as far as I can see, the bug has been introduced somewhere between the two versions above. 

Some additional info:

The servers come from Dell with the power management policy set to "Performance Per Watt Optimized (DAPC)" by default. 

We managed to "fix" the kernel notifications issue by setting the BIOS to "Performance" instead. 

If you don't mind power consumption it is a quick way to get rid of the problem.
Comment 41 Martin Mokrejs 2014-03-07 16:56:48 UTC
(In reply to Martin Mokrejs from comment #39)

Just to update my experience.

A few months later even my CPU was replaced, because it baked one of my two RAM modules. I did not realize that in August, when just the cooler was exchanged and more glue applied. Now, 6 months later, I can conclude I just had bad luck with a bad mobile CPU. It was using the CPU up to its maximum limits and, as it was heating a lot, sometimes it tripped over the power limits. Before the cooler replacement it tripped over the temperature limits instead. However, I should have worried a lot more about the segfaults. Luckily, the CPU was doing worse and worse and the external HDMI output just stopped giving an output signal. With the new CPU, they are just gone.

Under 3.10.12 which I use currently I got it just once during a long-term CPU intensive job (lasting days). Only these 4 lines in total were logged when CPU was taking a lot of power while still under thermal threshold. A good sign my cooler is doing well.

[66841.449507] CPU1: Package power limit notification (total events = 1)
[66841.449509] CPU0: Package power limit notification (total events = 1)
[66841.450084] CPU0: Package power limit normal
[66841.450086] CPU1: Package power limit normal



# cat /etc/local.d/baselayout1.start

echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done

sysctl kernel.sched_child_runs_first=1

cpupower --cpu 0 set -b 0
cpupower --cpu 1 set -b 0

#
