Bug 96611 - "temperature above threshold" errors last 1ms
Summary: "temperature above threshold" errors last 1ms
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Srinivas Pandruvada
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-14 00:59 UTC by Konstantin Svist
Modified: 2016-12-21 01:54 UTC (History)
3 users (show)

See Also:
Kernel Version: 3.19.3-200.fc21.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Konstantin Svist 2015-04-14 00:59:19 UTC
My laptop keeps reporting "temperature above threshold" errors that always last 1ms and cannot be reported with ABRT.

I'm pretty sure this is not a real temperature error condition, since it always gets "fixed" in 1ms. Things having to do with temperature always take a lot longer than that.

Here's dmesg snippet:


[93120.116714] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[93300.544896] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[93828.253704] CPU1: Core temperature above threshold, cpu clock throttled (total events = 65)
[93828.253705] CPU5: Core temperature above threshold, cpu clock throttled (total events = 65)
[93828.253717] CPU5: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253720] CPU2: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253721] CPU4: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253722] CPU0: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253723] CPU7: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253724] CPU6: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253725] CPU3: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.253728] CPU1: Package temperature above threshold, cpu clock throttled (total events = 462)
[93828.254736] CPU1: Core temperature/speed normal
[93828.254738] CPU5: Core temperature/speed normal
[93828.254744] CPU2: Package temperature/speed normal
[93828.254747] CPU6: Package temperature/speed normal
[93828.254749] CPU5: Package temperature/speed normal
[93828.254755] CPU7: Package temperature/speed normal
[93828.254756] CPU3: Package temperature/speed normal
[93828.254761] CPU0: Package temperature/speed normal
[93828.254764] CPU4: Package temperature/speed normal
[93828.254776] CPU1: Package temperature/speed normal
[93858.439817] mce: [Hardware Error]: Machine check events logged
[94151.554335] CPU5: Core temperature above threshold, cpu clock throttled (total events = 1214)
[94151.554346] CPU2: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554348] CPU4: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554348] CPU0: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554349] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1214)
[94151.554350] CPU6: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554352] CPU7: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554353] CPU3: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554355] CPU1: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.554361] CPU5: Package temperature above threshold, cpu clock throttled (total events = 3285)
[94151.555359] CPU5: Core temperature/speed normal
[94151.555363] CPU2: Package temperature/speed normal
[94151.555366] CPU1: Core temperature/speed normal
[94151.555369] CPU0: Package temperature/speed normal
[94151.555372] CPU4: Package temperature/speed normal
[94151.555375] CPU6: Package temperature/speed normal
[94151.555377] CPU1: Package temperature/speed normal
[94151.555389] CPU7: Package temperature/speed normal
[94151.555392] CPU3: Package temperature/speed normal
[94151.555404] CPU5: Package temperature/speed normal
[94308.961272] mce: [Hardware Error]: Machine check events logged


Version-Release number of selected component (if applicable):
[    0.000000] Linux version 3.19.0-1.fc22.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 5.0.0 20150206 (Red Hat 5.0.0-0.8) (GCC) ) #1 SMP Mon Feb 9 08:42:43 UTC 2015

Running this kernel on Fedora 21 -- but even when running properly matching kernel (from fedora-updates), I saw same errors.


How reproducible:
Every day, ~once a day


Steps to Reproduce:
1. Keep laptop on


Additional info:

* does not seem to coincide with heavy CPU activity
* original bug reported on 3.19.0; now same behavior on 3.19.3
Comment 1 Zhang Rui 2015-04-15 02:35:25 UTC
what is the model of your platform? please attach the output of "cat /proc/cpuinfo"
Comment 2 Konstantin Svist 2015-04-15 08:14:07 UTC
Here's the 1st of 8 entries (I believe rest are nearly identical):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
stepping	: 3
microcode	: 0x1c
cpu MHz		: 3679.156
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs		:
bogomips	: 5587.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:
Comment 3 Srinivas Pandruvada 2015-05-12 18:52:15 UTC
These messages are generated by thermal threshold interrupts generated by the BIOS. So it does think that you have reached a temperature where this needs immediate control. BIOS use aggressive method, so temperature will become below threshold immediately.
So I suggest these steps to check:
1- go to 
cd /sys/class/hwmon/hwmonX (where X is from 0-n, where attribute name = "coretemp")
You will something like this, if you do grep . *
name:coretemp
temp1_crit:105000
temp1_crit_alarm:0
temp1_input:41000
temp1_label:Physical id 0
temp1_max:105000
temp2_crit:105000
temp2_crit_alarm:0
temp2_input:39000
temp2_label:Core 0
temp2_max:105000
temp3_crit:105000
temp3_crit_alarm:0
temp3_input:38000
temp3_label:Core 1
temp3_max:105000

Usually the BIOS will control and send interrupt to OS, when
tempX_input > tempX_max

So you can do something like this in this folder

 while true; do cat temp1_input; sleep 1; done

I think you do reach close to tempX_max temperature.
Comment 4 Konstantin Svist 2015-05-13 18:00:19 UTC
Soon (<1min) after a temperature event:

# cd /sys/class/hwmon/hwmon1/

# grep . * --exclude-dir='*'
name:coretemp
temp1_crit:100000
temp1_crit_alarm:0
temp1_input:65000
temp1_label:Physical id 0
temp1_max:84000
temp2_crit:100000
temp2_crit_alarm:0
temp2_input:62000
temp2_label:Core 0
temp2_max:84000
temp3_crit:100000
temp3_crit_alarm:0
temp3_input:62000
temp3_label:Core 1
temp3_max:84000
temp4_crit:100000
temp4_crit_alarm:0
temp4_input:61000
temp4_label:Core 2
temp4_max:84000
temp5_crit:100000
temp5_crit_alarm:0
temp5_input:63000
temp5_label:Core 3
temp5_max:84000


# dmesg |tail -n###
[37845.134740] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[39467.310323] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[39480.882088] CPU5: Core temperature above threshold, cpu clock throttled (total events = 1)
[39480.882100] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882101] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[39480.882102] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882105] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882106] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882107] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882108] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882109] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.882113] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[39480.883111] CPU5: Core temperature/speed normal
[39480.883116] CPU3: Package temperature/speed normal
[39480.883120] CPU4: Package temperature/speed normal
[39480.883124] CPU0: Package temperature/speed normal
[39480.883128] CPU2: Package temperature/speed normal
[39480.883131] CPU7: Package temperature/speed normal
[39480.883134] CPU1: Core temperature/speed normal
[39480.883137] CPU6: Package temperature/speed normal
[39480.883142] CPU1: Package temperature/speed normal
[39480.883162] CPU5: Package temperature/speed normal
[39526.852167] mce: [Hardware Error]: Machine check events logged



Is _max a limit setting for the alarm or is it historical max reading?


The only other neighbor symlink is hwmon0, and it has:
# grep . * --exclude-dir='*'
name:acpitz
temp1_crit:120000
temp1_input:52000
Comment 5 Srinivas Pandruvada 2015-05-13 18:49:38 UTC
max limit is a setting where BIOS/firmware should activate cooling. This is a predefined limit set by the manufacturer.  Your temperature readings (tempX_input) is quite close the max, so it is likely that temperature reached max and came down as a result of action BIOS. So this displayed this message.

If you are not running anything and your cpu temperature is 65C, then you may have some cooling issue on your platform (Is FAN running?). Better to run turbostat or system monitor and see utilization of your CPUs. You may have some daemon going crazy.

For your hwmon0 question, this follows closely with coretemp.

So suggest in your hwmon1 folder

$ while true; do cat temp1_input; sleep 1; done

You will see that your temperature sometime reaching close to max.
Comment 6 Zhang Rui 2015-06-26 02:30:57 UTC
ping...
Comment 7 Konstantin Svist 2015-06-26 17:51:31 UTC
dmesg:


[  238.815752] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead.
[ 1613.070169] CPU5: Core temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070170] CPU1: Core temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070183] CPU1: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070186] CPU5: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070188] CPU2: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070189] CPU4: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070190] CPU6: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070191] CPU0: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070192] CPU7: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.070193] CPU3: Package temperature above threshold, cpu clock throttled (total events = 120)
[ 1613.071190] CPU1: Core temperature/speed normal
[ 1613.071204] CPU5: Core temperature/speed normal
[ 1613.071206] CPU1: Package temperature/speed normal
[ 1613.071216] CPU4: Package temperature/speed normal
[ 1613.071221] CPU7: Package temperature/speed normal
[ 1613.071224] CPU3: Package temperature/speed normal
[ 1613.071227] CPU0: Package temperature/speed normal
[ 1613.071231] CPU2: Package temperature/speed normal
[ 1613.071234] CPU6: Package temperature/speed normal
[ 1613.071236] CPU5: Package temperature/speed normal


$ while true; do echo "$(date) $(dmesg |tail -n1|cut -d] -f1)] $(cat temp1_input) $(cat temp2_input) $(cat temp3_input) $(cat temp4_input) $(cat temp5_input)"; sleep 1; done
Fri Jun 26 10:37:23 PDT 2015 [  238.815752] 78000 73000 77000 72000 72000
Fri Jun 26 10:37:24 PDT 2015 [  238.815752] 92000 75000 92000 80000 75000
Fri Jun 26 10:37:25 PDT 2015 [  238.815752] 77000 73000 76000 72000 72000
Fri Jun 26 10:37:26 PDT 2015 [  238.815752] 89000 75000 89000 78000 71000
Fri Jun 26 10:37:27 PDT 2015 [  238.815752] 90000 74000 91000 79000 73000
Fri Jun 26 10:37:28 PDT 2015 [  238.815752] 77000 71000 76000 71000 71000
Fri Jun 26 10:37:29 PDT 2015 [  238.815752] 74000 71000 73000 72000 71000
Fri Jun 26 10:37:30 PDT 2015 [  238.815752] 81000 73000 80000 74000 71000
Fri Jun 26 10:37:31 PDT 2015 [  238.815752] 81000 74000 81000 76000 71000
Fri Jun 26 10:37:32 PDT 2015 [  238.815752] 78000 72000 77000 74000 71000
Fri Jun 26 10:37:33 PDT 2015 [  238.815752] 79000 75000 81000 72000 72000
Fri Jun 26 10:37:34 PDT 2015 [  238.815752] 80000 74000 79000 73000 72000
Fri Jun 26 10:37:35 PDT 2015 [  238.815752] 83000 74000 81000 75000 72000
Fri Jun 26 10:37:36 PDT 2015 [  238.815752] 91000 75000 92000 79000 73000
Fri Jun 26 10:37:37 PDT 2015 [  238.815752] 94000 78000 93000 79000 72000
Fri Jun 26 10:37:38 PDT 2015 [  238.815752] 94000 79000 94000 79000 72000
Fri Jun 26 10:37:39 PDT 2015 [  238.815752] 83000 74000 81000 75000 71000
Fri Jun 26 10:37:40 PDT 2015 [  238.815752] 82000 78000 82000 76000 72000
Fri Jun 26 10:37:41 PDT 2015 [  238.815752] 76000 74000 76000 74000 71000
Fri Jun 26 10:37:42 PDT 2015 [  238.815752] 77000 74000 75000 74000 70000
Fri Jun 26 10:37:43 PDT 2015 [ 1613.071236] 87000 77000 87000 80000 72000
Fri Jun 26 10:37:44 PDT 2015 [ 1613.071236] 77000 76000 75000 75000 70000
Fri Jun 26 10:37:45 PDT 2015 [ 1613.071236] 81000 77000 80000 77000 72000
Fri Jun 26 10:37:46 PDT 2015 [ 1613.071236] 78000 74000 79000 73000 72000
Fri Jun 26 10:37:47 PDT 2015 [ 1613.071236] 74000 73000 74000 72000 73000
Fri Jun 26 10:37:48 PDT 2015 [ 1613.071236] 91000 77000 91000 77000 74000
Fri Jun 26 10:37:49 PDT 2015 [ 1613.071236] 75000 71000 74000 72000 70000
Fri Jun 26 10:37:50 PDT 2015 [ 1613.071236] 91000 78000 92000 79000 71000
Fri Jun 26 10:37:51 PDT 2015 [ 1613.071236] 75000 72000 74000 73000 71000
Fri Jun 26 10:37:52 PDT 2015 [ 1613.071236] 73000 72000 72000 71000 72000
Fri Jun 26 10:37:53 PDT 2015 [ 1613.071236] 75000 73000 74000 73000 73000
Fri Jun 26 10:37:54 PDT 2015 [ 1613.071236] 71000 70000 70000 70000 70000
Fri Jun 26 10:37:55 PDT 2015 [ 1613.071236] 75000 73000 74000 72000 74000
Fri Jun 26 10:37:56 PDT 2015 [ 1613.071236] 90000 75000 88000 81000 73000
Fri Jun 26 10:37:57 PDT 2015 [ 1613.071236] 91000 75000 90000 79000 75000
Fri Jun 26 10:37:58 PDT 2015 [ 1613.071236] 91000 75000 91000 78000 72000
Fri Jun 26 10:37:59 PDT 2015 [ 1613.071236] 91000 75000 92000 78000 74000
Fri Jun 26 10:38:00 PDT 2015 [ 1613.071236] 91000 75000 91000 78000 71000
Fri Jun 26 10:38:01 PDT 2015 [ 1613.071236] 83000 80000 78000 82000 83000
Fri Jun 26 10:38:02 PDT 2015 [ 1613.071236] 75000 74000 73000 72000 70000
Fri Jun 26 10:38:03 PDT 2015 [ 1613.071236] 75000 75000 74000 70000 72000
Fri Jun 26 10:38:04 PDT 2015 [ 1613.071236] 79000 73000 80000 75000 70000
Fri Jun 26 10:38:05 PDT 2015 [ 1613.071236] 86000 77000 85000 74000 73000
Fri Jun 26 10:38:06 PDT 2015 [ 1613.071236] 79000 72000 79000 73000 73000
Fri Jun 26 10:38:07 PDT 2015 [ 1613.071236] 71000 70000 72000 70000 69000
Fri Jun 26 10:38:08 PDT 2015 [ 1613.071236] 69000 68000 70000 69000 68000
Fri Jun 26 10:38:09 PDT 2015 [ 1613.071236] 72000 70000 71000 70000 68000
Fri Jun 26 10:38:10 PDT 2015 [ 1613.071236] 72000 69000 70000 69000 69000
Fri Jun 26 10:38:11 PDT 2015 [ 1613.071236] 72000 70000 72000 70000 67000
Fri Jun 26 10:38:12 PDT 2015 [ 1613.071236] 76000 71000 76000 71000 69000
Comment 8 ByteEnable 2015-07-13 04:18:53 UTC
I started seeing the same after a kernel update in Fedora 20.  It stayed even with a Fedora 21 upgrade.  Went away for a few weeks but is back with the latest kernel update.

Linux version 4.0.7-200.fc21.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #1 SMP Mon Jun 29 22:11:52 UTC 2015

[87365.887038] CPU5: Core temperature above threshold, cpu clock throttled (total events = 1)
[87365.887039] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[87365.887041] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887042] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887044] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887046] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887047] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887048] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887049] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.887061] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[87365.889032] CPU5: Core temperature/speed normal
[87365.889033] CPU1: Core temperature/speed normal
[87365.889035] CPU6: Package temperature/speed normal
[87365.889036] CPU4: Package temperature/speed normal
[87365.889037] CPU3: Package temperature/speed normal
[87365.889038] CPU0: Package temperature/speed normal
[87365.889039] CPU2: Package temperature/speed normal
[87365.889039] CPU7: Package temperature/speed normal
[87365.889040] CPU1: Package temperature/speed normal
[87365.889048] CPU5: Package temperature/speed normal
[87374.345270] mce: [Hardware Error]: Machine check events logged
[87666.172496] CPU5: Core temperature above threshold, cpu clock throttled (total events = 1356)
[87666.172497] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1356)
[87666.172499] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172500] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172500] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172519] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172529] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172530] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172532] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.172533] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1393)
[87666.174531] CPU5: Core temperature/speed normal
Comment 9 Srinivas Pandruvada 2015-07-17 17:16:46 UTC
Your temperature log shows temperature around 90C, which is high enough for some devices to be throttled by BIOS. This is a real temperature condition.
Comment 10 Srinivas Pandruvada 2015-07-17 17:22:26 UTC
This is not a thermal bug as system is acting to compensate for high temperature. If kernel upgrade caused this then run turbostat and top to compare numbers and cause of the driver/process which is causing this.
Comment 11 ByteEnable 2015-07-17 19:32:43 UTC
I haven't seen any more events since upgrading to Linux version 4.0.7-300.fc22.x86_64.  I used turbostat to monitor and I heard the fan rpm kick-up once the temp reached around 80.  The fan ran high for about a minute then cut the rpms again as temp dropped.  Temp seems to hover between 65 - 69.

I would get the machine check events in the past out of the blue. No high rpm fan cut in, no warning, just a machine check.

Will debug if it happens again.
Comment 12 Zhang Rui 2015-09-01 05:39:31 UTC
any updates? does the machine check event happen again?
Comment 13 Konstantin Svist 2015-09-01 09:19:03 UTC
Actually I haven't seen those since almost a month ago, ~August 8th (according to btrfs backups), with kernel 4.0.8
According to yum log, kernel 4.1.3 was installed around that time (7th) -- that probably fixed the problem
Comment 14 Srinivas Pandruvada 2015-09-01 10:37:18 UTC
Since this issue doesn't occur, Can we close this issue?
Comment 15 Dan Yasny 2016-12-21 01:54:39 UTC
Just stumbled across this BZ in my own search. I've been seeing the same errors on two different laptops across several kernel versions, in Fedora 22-23-24-25 and just for fun - CentOS7. 

Please let me know what information I need to provide to help move the fix along

Should I open a new BZ or can we reopen this one?

Note You need to log in before you can comment on or make changes to this bug.