Bug 75071
Summary: | Coretemp reports too high temperature | ||
---|---|---|---|
Product: | Drivers | Reporter: | Rafal Kupiec (belliash) |
Component: | Hardware Monitoring | Assignee: | Jean Delvare (jdelvare) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | linux |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.14.2 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Rafal Kupiec
2014-04-29 16:05:18 UTC
On what basis do you claim that the driver is wrong and your system is not actually overheating? The coretemp driver did not see many changes in kernel 3.14; this is generally a simple and stable piece of code. So my first suspect would be broken power management, rather than broken monitoring. Please provide the full output of "sensors" on both kernels.

I'm pretty sure the CPU is not overheating, because there is nearly no load on this server. Also, IPMI is reporting a LOW temperature. Finally, I have physical access to this server and I can check whether it's warm/hot or not. The output below comes from kernel 3.13:

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +30.0°C (high = +61.0°C, crit = +71.0°C)
Core 1: +28.0°C (high = +61.0°C, crit = +71.0°C)
Core 9: +28.0°C (high = +61.0°C, crit = +71.0°C)
Core 10: +26.0°C (high = +61.0°C, crit = +71.0°C)

Also, don't forget this is an L-series CPU with a TDP of 40 W. According to the Intel specification, Case Temperature is the maximum temperature allowed at the processor Integrated Heat Spreader (IHS), and for the L5630 the maximum allowed is 63.1°C. After switching to 3.14, the reported temperatures were at least twice that. High was reported as about 80°C and critical as 105°C, if I remember correctly. I cannot reboot it to check now. It is a production server, so I'm staying with 3.13 until this gets fixed.

If the high and critical temperatures changed as well, that means that the driver now gets tjmax wrong. There were a couple of changes related to that recently. That would indeed be a monitoring issue and your hardware is doing right. That being said, that's nothing to really be worried about. The hardware reports a thermal margin rather than an absolute temperature, so getting tjmax wrong causes temperatures to be reported wrong as well, but your hardware is still safe.

Please attach the output of "cat /proc/cpuinfo". Please also attach the output of "/sbin/lspci -nn".
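To illustrate the thermal-margin point above, here is a minimal Python sketch of how a wrong tjmax shifts every reported temperature. It is not the driver's code: the function name and the margin value are hypothetical, and it assumes the "crit" value printed by sensors on 3.13 (+71.0°C) corresponds to tjmax.

```python
def absolute_temp_c(tjmax_c: int, margin_c: int) -> int:
    """Convert a thermal margin (degrees below tjmax) into an absolute temperature."""
    return tjmax_c - margin_c


# Hypothetical reading: the core sits 41 degrees below tjmax.
margin = 41

# Assumption: tjmax = 71, taken from "crit = +71.0°C" in the 3.13 output.
print(absolute_temp_c(71, margin))   # 30

# Same hardware reading, but with an assumed wrong tjmax of 100.
print(absolute_temp_c(100, margin))  # 59
```

The hardware reading is identical in both calls; only the assumed tjmax differs, and the reported temperature moves by exactly the tjmax error.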
I have a standalone coretemp driver at: http://jdelvare.nerim.net/devel/lm-sensors/drivers/coretemp/ You could give it a try temporarily on your 3.13 kernel, to confirm that the problem comes from the most recent coretemp driver changes.

# cat /proc/cpuinfo
#limited to just one - 8 showed:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU L5630 @ 2.13GHz
stepping : 2
microcode : 0xc
cpu MHz : 1600.000
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 10
cpu cores : 4
apicid : 21
initial apicid : 21
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4255.81
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port [8086:3405] (rev 13)
00:01.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 [8086:3408] (rev 13)
00:02.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 [8086:3409] (rev 13)
00:03.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 [8086:340a] (rev 13)
00:07.0 PCI bridge [0604]: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 [8086:340e] (rev 13)
00:09.0 PCI bridge [0604]: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 [8086:3410] (rev 13)
00:14.0 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers [8086:342e] (rev 13)
00:14.1 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers [8086:3422] (rev 13)
00:14.2 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers [8086:3423] (rev 13)
00:14.3 PIC [0800]: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers [8086:3438] (rev 13)
00:16.0 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430] (rev 13)
00:16.1 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3431] (rev 13)
00:16.2 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432] (rev 13)
00:16.3 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433] (rev 13)
00:16.4 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429] (rev 13)
00:16.5 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a] (rev 13)
00:16.6 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b] (rev 13)
00:16.7 System peripheral [0880]: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c] (rev 13)
00:1a.0 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 [8086:3a37]
00:1a.1 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 [8086:3a38]
00:1a.2 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 [8086:3a39]
00:1a.7 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 [8086:3a3c]
00:1d.0 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 [8086:3a34]
00:1d.1 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 [8086:3a35]
00:1d.2 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 [8086:3a36]
00:1d.7 USB controller [0c03]: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 [8086:3a3a]
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 90)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller [8086:3a16]
00:1f.2 SATA controller [0106]: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller [8086:3a22]
00:1f.3 SMBus [0c05]: Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
01:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
06:04.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 [102b:0532] (rev 0a)
ff:00.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70] (rev 02)
ff:00.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81] (rev 02)
ff:02.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90] (rev 02)
ff:02.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91] (rev 02)
ff:02.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92] (rev 02)
ff:02.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93] (rev 02)
ff:02.4 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94] (rev 02)
ff:02.5 Host bridge [0600]: Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95] (rev 02)
ff:03.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98] (rev 02)
ff:03.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99] (rev 02)
ff:03.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a] (rev 02)
ff:03.4 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c] (rev 02)
ff:04.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0] (rev 02)
ff:04.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1] (rev 02)
ff:04.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2] (rev 02)
ff:04.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3] (rev 02)
ff:05.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8] (rev 02)
ff:05.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9] (rev 02)
ff:05.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa] (rev 02)
ff:05.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab] (rev 02)
ff:06.0 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0] (rev 02)
ff:06.1 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1] (rev 02)
ff:06.2 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2] (rev 02)
ff:06.3 Host bridge [0600]: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3] (rev 02)

I know that hardware is safe and not overheating. But this does not solve the problem at all. Driver should report real temperatures.
Also, I use Zabbix to monitor this server; that's how I got to know about this bug. If I change the warning and critical levels, there is a risk I won't notice a real problem once the kernel gets fixed and the CPU gets really hot, so I would prefer not to modify anything in Zabbix. On the other hand, I receive a lot of notifications about an overheating CPU with the new kernel. So the only real solution for me right now is to wait until this gets fixed and then upgrade to a newer, problem-free kernel. Also, it is hard for me to reboot it a few times every day in order to provide you with more information, as this is a production server. Anyway, I still believe you can find a solution.

I could of course try the provided coretemp module, but unfortunately I have built the driver into the kernel, so I would need to reboot again, at least twice, which is hard on a production machine. I doubt it can be unloaded?

Not necessarily helpful, but this works fine with L5238. Tcase is 63 degrees C, but that is different from Tjmax and typically lower. Unfortunately I seem to be unable to find the documented value for Tjmax on the affected CPU. Tjmax on the L5238 is reported as 100 degrees C, with Tcase specified as 71 degrees C.

Anyway, I suspect the culprit to be

9fb6c9c hwmon: (coretemp) Refine TjMax detection

That will be a problem if Tjmax is really below 85 degrees C. Maybe we should reduce the accepted low limit to, say, 50 degrees C or take the additional temperature check out entirely (and maybe I should not have trusted the turbostat program as much as I did ;-).

Jean, any comments/thoughts?

Guenter

Guenter, I agree with your analysis; actually I came to exactly the same conclusion while walking back home. The arbitrary limit check is wrong, let's remove it. In fact I would revert commit 9fb6c9c entirely. I have no idea what turbostat is up to, but the Intel IA32 System Programming document doesn't mention any of the arbitrary checks you added. It says that the temperature target is in bits 23:16 of MSR 0x1a2 (MSR_TEMPERATURE_TARGET) and that's 8 bits, not 7. There is no mention of which values should be considered valid, so I'd say all of them are until someone reports evidence that some aren't.

Ok, let's do that. I submitted the revert a minute ago.

Fixed with upstream commit c0940e9 (Revert "hwmon: (coretemp) Refine TjMax detection").

Is this fixed with 3.14.3 or 3.15?

It is fixed in 3.15-rc4+. The fix missed 3.14.3 but it should be in 3.14.4.
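For readers following the TjMax discussion in this thread, the field layout can be sketched in Python as below. This is an illustration only, not the kernel code: the function names are hypothetical, the 7-bit mask and the 85 degree floor are taken from the comments above, and the fallback of 100 is an assumption about what happens when the value is rejected.

```python
MSR_TEMPERATURE_TARGET = 0x1A2  # MSR address named in the thread


def tjmax_full_field(msr_value: int) -> int:
    """Return bits 23:16 of the MSR value as an 8-bit field, with no validity check."""
    return (msr_value >> 16) & 0xFF


def tjmax_with_floor(msr_value: int, low_limit: int = 85, fallback: int = 100) -> int:
    """Illustrative restrictive variant: 7-bit mask plus a plausibility floor.

    low_limit reflects the 85 degree limit discussed in the thread;
    fallback is an assumed default, not a value confirmed by the thread.
    """
    value = (msr_value >> 16) & 0x7F
    return value if value >= low_limit else fallback


# Hypothetical raw MSR value encoding a temperature target of 71.
raw = 71 << 16

print(tjmax_full_field(raw))  # 71
print(tjmax_with_floor(raw))  # 100
```

With a target of 71, the unrestricted read returns the hardware value, while the restrictive variant discards it and substitutes the assumed default, which is the kind of shift the reporter observed in the high and critical limits.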