Bug 75071

Summary: Coretemp reports too high temperature
Product: Drivers
Reporter: Rafal Kupiec (belliash)
Component: Hardware Monitoring
Assignee: Jean Delvare (jdelvare)
Status: RESOLVED CODE_FIX
Severity: high
CC: linux
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.14.2
Subsystem:
Regression: No
Bisected commit-id:

Description Rafal Kupiec 2014-04-29 16:05:18 UTC
I recently upgraded the kernel on my server from 3.13.10 to 3.14.2, and lm_sensors, using coretemp, has begun to report CPU temperatures about twice as high as before. On the old kernel the CPU temperature is about 30-35°C when idle, while 3.14 reports 60-70°C.

Affected CPU is Intel(R) Xeon(R) CPU L5630 @ 2.13GHz with SuperMicro X8STi-F mainboard.

This is a high-priority bug for me, as it breaks monitoring. I am receiving a lot of notifications telling me the processor is overheating when it actually is not. I have reverted the change and switched back to the 3.13 line, and I look forward to getting this fixed.
Comment 1 Jean Delvare 2014-04-30 06:46:14 UTC
On what basis do you claim that the driver is wrong and your system is not actually overheating?

The coretemp driver did not see many changes in kernel 3.14; it is generally a simple and stable piece of code. So my first suspect would be broken power management, rather than broken monitoring.

Please provide the full output of "sensors" on both kernels.
Comment 2 Rafal Kupiec 2014-04-30 07:16:40 UTC
I'm pretty sure the CPU is not overheating, because there is nearly no load on this server. Also IPMI is reporting LOW temperature. Finally I have physical access to this server and I can check if it's warm/hot or not.

The below output comes from kernel 3.13:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +30.0°C  (high = +61.0°C, crit = +71.0°C)
Core 1:       +28.0°C  (high = +61.0°C, crit = +71.0°C)
Core 9:       +28.0°C  (high = +61.0°C, crit = +71.0°C)
Core 10:      +26.0°C  (high = +61.0°C, crit = +71.0°C)

Also, don't forget this is an L-series CPU with a TDP of 40 W. According to the Intel specification, Case Temperature is the maximum temperature allowed at the processor Integrated Heat Spreader (IHS), and for the L5630 the maximum allowed is 63.1°C.


After switching to 3.14, the reported temperatures were at least twice that. High was reported as about 80°C and critical as 105°C, if I remember correctly. I cannot reboot the machine to check now. It is a production server, so I'm staying with 3.13 until this gets fixed.
Comment 3 Jean Delvare 2014-04-30 16:33:49 UTC
If the high and critical temperatures changed as well, that means that the driver now gets tjmax wrong. There were a couple of changes related to that recently. That would indeed be a monitoring issue, and your hardware is behaving correctly. That being said, it is nothing to really be worried about. The hardware reports a thermal margin rather than an absolute temperature, so getting tjmax wrong causes temperatures to be reported wrong as well, but your hardware is still safe.
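
To make the arithmetic concrete, here is a minimal sketch, assuming the reported value is simply tjmax minus the thermal margin and that the critical limit shown on 3.13 (71°C) equals the real tjmax. The function name and the wrong tjmax of 100°C are illustrative assumptions, not values confirmed in this report.

```python
# Hypothetical helper; the name is illustrative and not taken from the driver.
def reported_temp(tjmax_c, margin_c):
    """Temperature shown to the user: TjMax minus the hardware thermal margin."""
    return tjmax_c - margin_c

# On 3.13, Core 0 read 30 C with crit = 71 C. Assuming crit reflects TjMax,
# the hardware was reporting a margin of 41 C.
margin = 71 - 30

correct = reported_temp(71, margin)   # TjMax as seen on 3.13
wrong = reported_temp(100, margin)    # assumed wrong TjMax of 100 C

print(correct, wrong)  # prints: 30 59
```

The margin read from the hardware is the same in both cases; only the base it is subtracted from differs, which is why the displayed value shifts while the silicon itself is no hotter.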

Please attach the output of "cat /proc/cpuinfo".

Please also attach the output of "/sbin/lspci -nn".

I have a standalone coretemp driver at:
http://jdelvare.nerim.net/devel/lm-sensors/drivers/coretemp/

You could give it a try temporarily on your 3.13 kernel, to confirm that the problem comes from the most recent coretemp driver changes.
Comment 4 Rafal Kupiec 2014-04-30 16:53:36 UTC
# cat /proc/cpuinfo   # output limited to one of the 8 entries shown
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           L5630  @ 2.13GHz
stepping        : 2
microcode       : 0xc
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 10
cpu cores       : 4
apicid          : 21
initial apicid  : 21
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4255.81
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port [8086:3405] (rev 13)
00:01.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 [8086:3408] (rev 13)
00:02.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 [8086:3409] (rev 13)
00:03.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 [8086:340a] (rev 13)
00:07.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 [8086:340e] (rev 13)
00:09.0 PCI bridge [0604]: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 [8086:3410] (rev 13)
00:14.0 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers [8086:342e] (rev 13)
00:14.1 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers [8086:3422] (rev 13)
00:14.2 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers [8086:3423] (rev 13)
00:14.3 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers [8086:3438] (rev 13)
00:16.0 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430] (rev 13)
00:16.1 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3431] (rev 13)
00:16.2 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432] (rev 13)
00:16.3 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433] (rev 13)
00:16.4 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429] (rev 13)
00:16.5 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a] (rev 13)
00:16.6 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b] (rev 13)
00:16.7 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c] (rev 13)
00:1a.0 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 [8086:3a37]
00:1a.1 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 [8086:3a38]
00:1a.2 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 [8086:3a39]
00:1a.7 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 [8086:3a3c]
00:1d.0 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 [8086:3a34]
00:1d.1 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 [8086:3a35]
00:1d.2 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 [8086:3a36]
00:1d.7 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 [8086:3a3a]
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 90)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller [8086:3a16]
00:1f.2 SATA controller [0106]: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller [8086:3a22]
00:1f.3 SMBus [0c05]: Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
01:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
06:04.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 [102b:0532] (rev 0a)
ff:00.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70] (rev 02)
ff:00.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81] (rev 02)
ff:02.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90] (rev 02)
ff:02.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91] (rev 02)
ff:02.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92] (rev 02)
ff:02.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93] (rev 02)
ff:02.4 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94] (rev 02)
ff:02.5 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95] (rev 02)
ff:03.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98] (rev 02)
ff:03.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99] (rev 02)
ff:03.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a] (rev 02)
ff:03.4 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c] (rev 02)
ff:04.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0] (rev 02)
ff:04.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1] (rev 02)
ff:04.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2] (rev 02)
ff:04.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3] (rev 02)
ff:05.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8] (rev 02)
ff:05.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9] (rev 02)
ff:05.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa] (rev 02)
ff:05.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab] (rev 02)
ff:06.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0] (rev 02)
ff:06.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1] (rev 02)
ff:06.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2] (rev 02)
ff:06.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3] (rev 02)


I know that the hardware is safe and not overheating, but this does not solve the problem at all. The driver should report real temperatures. I also use Zabbix to monitor this server; that's how I got to know about this bug. If I change the warning and critical levels, there is a risk I won't notice a real problem once the kernel gets fixed and the CPU gets really hot, so I would prefer not to modify anything in Zabbix. On the other hand, I receive a lot of notifications about an overheating CPU with the new kernel. So the only real solution for me right now is to wait until this gets fixed and upgrade to a newer, problem-free kernel. It is also hard for me to reboot the machine a few times a day in order to provide you with more information, as this is a production server. Anyway, I still believe you can find a solution.

I could of course try the provided coretemp module, but unfortunately I have built coretemp into the kernel, so I would need to reboot again, at least twice, which is hard on a production machine. I doubt it can be unloaded?
Comment 5 Guenter Roeck 2014-04-30 18:04:08 UTC
Not necessarily helpful, but this works fine with L5238.

Tcase is 63 degrees C, but that is different to Tjmax and typically lower.
Unfortunately I seem to be unable to find the documented value for Tjmax
on the affected CPU. Tjmax on the L5238 is reported as 100 degrees C,
with Tcase specified as 71 degrees C.

Anyway, I suspect the culprit to be
    9fb6c9c hwmon: (coretemp) Refine TjMax detection

That will be a problem if Tjmax is really below 85 degrees C.
Maybe we should reduce the accepted low limit to, say, 50 degrees C
or take the additional temperature check out entirely (and maybe
I should not have trusted the turbostat program as much as I did ;-).
Jean, any comments/thoughts ?

Guenter
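
A minimal sketch of the failure mode described above, assuming the check accepts the MSR value only when it is at least 85 and otherwise falls back to a default. The function names and the 100°C fallback are assumptions made for illustration; the 71°C input is the critical limit that 3.13 reported on the affected CPU.

```python
DEFAULT_TJMAX_C = 100  # assumed fallback value, not taken from this report

def tjmax_with_limit(msr_value_c, low_limit_c=85):
    """Accept the MSR-provided TjMax only if it is at or above low_limit_c."""
    if msr_value_c >= low_limit_c:
        return msr_value_c
    return DEFAULT_TJMAX_C

def tjmax_without_limit(msr_value_c):
    """Accept any nonzero MSR-provided TjMax; fall back only on zero."""
    return msr_value_c if msr_value_c else DEFAULT_TJMAX_C

print(tjmax_with_limit(71))        # rejected by the 85 C limit -> 100
print(tjmax_with_limit(71, 50))    # accepted with a 50 C limit -> 71
print(tjmax_without_limit(71))     # accepted with no limit     -> 71
```

Both options raised above (lowering the limit to 50 or dropping the check) would let a TjMax of 71°C through in this sketch.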
Comment 6 Jean Delvare 2014-04-30 18:58:21 UTC
Guenter, I agree with your analysis; actually, I came to exactly the same conclusion while walking back home. The arbitrary limit check is wrong, so let's remove it.

In fact, I would revert commit 9fb6c9c entirely. I have no idea what turbostat is up to, but the Intel IA32 System Programming document doesn't mention any of the arbitrary checks you added. It says that the temperature target is in bits 23:16 of MSR 0x1a2 (MSR_TEMPERATURE_TARGET), and that's 8 bits, not 7. There is no mention of which values should be considered valid, so I'd say all of them are until someone reports evidence that some aren't.
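
The field layout mentioned above can be sketched as follows. This is only an illustration of the bit extraction, not driver code: the function name and the sample register values are made up, and the value passed in stands for the low 32 bits of the MSR.

```python
MSR_TEMPERATURE_TARGET = 0x1A2  # MSR address named in the comment above

def temperature_target(msr_low32, width_bits=8):
    """Extract the temperature target field that starts at bit 16."""
    mask = (1 << width_bits) - 1
    return (msr_low32 >> 16) & mask

# Made-up register contents: a field of 71 fits in 7 bits, so both widths agree.
raw_71 = 71 << 16
print(temperature_target(raw_71), temperature_target(raw_71, 7))    # 71 71

# A made-up field of 200 needs the eighth bit; a 7-bit mask silently drops it.
raw_200 = 200 << 16
print(temperature_target(raw_200), temperature_target(raw_200, 7))  # 200 72
```

For a value such as 71 the mask width makes no difference, so in this sketch the narrower mask alone would not explain the wrong readings in this report; the range check is the part that rejects it.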
Comment 7 Guenter Roeck 2014-04-30 21:15:39 UTC
Ok, let's do that. I submitted the revert a minute ago.
Comment 8 Guenter Roeck 2014-05-02 15:30:34 UTC
Fixed with upstream commit c0940e9 (Revert "hwmon: (coretemp) Refine TjMax detection").
Comment 9 Rafal Kupiec 2014-05-07 09:34:52 UTC
Is this fixed with 3.14.3 or 3.15?
Comment 10 Jean Delvare 2014-05-07 11:19:45 UTC
It is fixed in 3.15-rc4+. The fix missed 3.14.3 but it should be in 3.14.4.