Bug 179981 - coretemp stops reporting new temperatures - Avoton - Intel(R) Atom(TM) CPU C2758
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Power Management
Classification: Unclassified
Component: Thermal
Hardware: Intel Linux
Importance: P1 normal
Assignee: Srinivas Pandruvada
Reported: 2016-10-22 17:26 UTC by Alex Forencich
Modified: 2017-10-10 06:57 UTC

See Also:
Kernel Version: 4.8.2
Subsystem:
Regression: No
Bisected commit-id:


Description Alex Forencich 2016-10-22 17:26:43 UTC
I have a Supermicro motherboard with an Intel Atom C2758 processor where coretemp does not seem to be working correctly.  The core temperature sensor appears to be detected correctly with sensors-detect and shows up as expected in the output of running 'sensors' as coretemp-isa-0000.  However, the temperatures reported only update for a short period of time after the system boots.  After that, the reported temperatures get 'stuck' and do not change.  I can get the overall CPU temperature from the IPMI interface with ipmitool, and this value does change over time even after the values reported by coretemp stop updating.  I don't see any interesting messages in dmesg.  The entries in /sys/devices/platform/coretemp.0 also do not change after the output of sensors gets stuck, so the problem is definitely either a hardware issue or a driver issue, not an issue with lm-sensors.  Reloading the coretemp module also does not help.  The values reported by sensors and sysfs are the same before and after reloading the driver.

cpuinfo (1 core out of 8):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 77
model name      : Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz
stepping        : 8
microcode       : 0x127
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm arat
bugs            :
bogomips        : 4802.11
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

output of sensors:

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +47.0°C  (high = +98.0°C, crit = +98.0°C)
Core 1:       +47.0°C  (high = +98.0°C, crit = +98.0°C)
Core 2:       +47.0°C  (high = +98.0°C, crit = +98.0°C)
Core 3:       +47.0°C  (high = +98.0°C, crit = +98.0°C)
Core 4:       +46.0°C  (high = +98.0°C, crit = +98.0°C)
Core 5:       +46.0°C  (high = +98.0°C, crit = +98.0°C)
Core 6:       +47.0°C  (high = +98.0°C, crit = +98.0°C)
Core 7:       +46.0°C  (high = +98.0°C, crit = +98.0°C)

sysfs entries:

$ cat /sys/devices/platform/coretemp.0/hwmon/hwmon0/temp*_input
47000
47000
47000
47000
46000
46000
47000
46000
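
A minimal sketch of how the "stuck" readings could be confirmed from sysfs, assuming the hwmon path shown above. The helper names `read_temps` and `stuck_sensors` are hypothetical and are not part of lm-sensors or the kernel. The script samples every `temp*_input` file several times and lists the ones whose value never changed:

```python
import glob
import time


def read_temps(pattern):
    """Return {path: millidegrees C} for every hwmon temp input matching pattern."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            temps[path] = int(f.read().strip())
    return temps


def stuck_sensors(pattern, samples=5, interval=1.0):
    """Sample the sensors repeatedly and return the paths whose value never changed."""
    first = read_temps(pattern)
    changed = set()
    for _ in range(samples - 1):
        time.sleep(interval)
        for path, value in read_temps(pattern).items():
            if first.get(path) != value:
                changed.add(path)
    return sorted(set(first) - changed)
```

For example, `stuck_sensors("/sys/devices/platform/coretemp.0/hwmon/hwmon*/temp*_input", samples=60)` could be run while the CPU load is being varied. An idle system can legitimately hold a steady temperature, so the result only means something when the load changes during sampling.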
Comment 1 Alex Forencich 2017-03-11 01:22:37 UTC
Just updated to kernel version 4.10.1; no change.
Comment 2 Zhang Rui 2017-06-17 07:23:04 UTC
The temperatures reported by the coretemp driver are read directly from an MSR, so this sounds like a hardware issue to me.

Please attach the turbostat output as well, captured while the problem is reproduced.
Comment 3 Alex Forencich 2017-06-20 23:58:05 UTC
sensors output:

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +39.0°C  (high = +98.0°C, crit = +98.0°C)
Core 1:       +39.0°C  (high = +98.0°C, crit = +98.0°C)
Core 2:       +38.0°C  (high = +98.0°C, crit = +98.0°C)
Core 3:       +38.0°C  (high = +98.0°C, crit = +98.0°C)
Core 4:       +38.0°C  (high = +98.0°C, crit = +98.0°C)
Core 5:       +38.0°C  (high = +98.0°C, crit = +98.0°C)
Core 6:       +36.0°C  (high = +98.0°C, crit = +98.0°C)
Core 7:       +36.0°C  (high = +98.0°C, crit = +98.0°C)

turbostat output, with turbostat.c edited to force no_MSR_MISC_PWR_MGMT to 1 to avoid an I/O error while reading msr 0x1aa:

$ sudo ./turbostat --debug
turbostat version 17.04.12 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:4d:8 (6:77:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, No-TURBO, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu5: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
SLM BCLK: 100.0 Mhz
RAPL: 2185 sec. Joule Counter Range, at 30 Watts
cpu5: MSR_PLATFORM_INFO: 0xc0080001800
12 * 100.0 = 1200.0 MHz max efficiency frequency
24 * 100.0 = 2400.0 MHz base frequency
cpu5: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu5: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu5: MSR_PKG_CST_CONFIG_CONTROL: 0x0000840e (locked: pkg-cstate-limit=14: pc6)
cpu5: POLL: CPUIDLE CORE POLL IDLE
cpu5: C1: MWAIT 0x00
cpu5: C6: MWAIT 0x51
cpu5: cpufreq driver: acpi-cpufreq
cpu5: cpufreq governor: schedutil
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000004 (custom)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x468bb8005b89c4 (UNlocked)
cpu0: PKG Limit #1: ENabled (312.500000 Watts, 10.000000 sec, clamp ENabled)
cpu0: PKG Limit #2: ENabled (375.000000 Watts, 0.009766* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00020000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.001953 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00620000 (98 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883b0000 (39 C +/- 1)
cpu0: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu1: MSR_IA32_THERM_STATUS: 0x883b0000 (39 C +/- 1)
cpu1: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu2: MSR_IA32_THERM_STATUS: 0x883c0000 (38 C +/- 1)
cpu2: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu3: MSR_IA32_THERM_STATUS: 0x883c0000 (38 C +/- 1)
cpu3: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu4: MSR_IA32_THERM_STATUS: 0x883c0000 (38 C +/- 1)
cpu4: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu5: MSR_IA32_THERM_STATUS: 0x883c0000 (38 C +/- 1)
cpu5: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu6: MSR_IA32_THERM_STATUS: 0x883e0000 (36 C +/- 1)
cpu6: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
cpu7: MSR_IA32_THERM_STATUS: 0x883e0000 (36 C +/- 1)
cpu7: MSR_IA32_THERM_INTERRUPT: 0x000a0507 (88 C, 93 C)
Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     C1      C6     C1%      C6%     CPU%c1  CPU%c6  CoreTmp Pkg%pc3 Pkg%pc6 PkgWatt CorWatt
-       -       33      1.74    1900    2400    7395    0       223     7435   0.10     98.20   0.38    97.88   39      0.00    0.00    0.00    0.00
0       0       24      1.25    1900    2400    719     0       23      710    0.21     98.57   0.44    98.31   39      0.00    0.00    0.00    0.00
1       1       29      1.54    1900    2400    1025    0       18      811    0.05     98.44   0.29    98.16   39
2       2       33      1.76    1900    2400    1010    0       15      1191   0.02     98.27   0.34    97.90   38
3       3       22      1.14    1900    2400    979     0       21      957    0.06     98.84   0.32    98.54   38
4       4       27      1.40    1900    2400    629     0       36      838    0.02     98.62   0.28    98.33   38
5       5       28      1.46    1900    2400    749     0       36      1010   0.02     98.57   0.32    98.22   38
6       6       72      3.78    1900    2400    1233    0       30      925    0.26     96.00   0.57    95.65   36
7       7       30      1.58    1900    2400    1051    0       44      993    0.16     98.31   0.49    97.93   36
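
For reference, the raw MSR values in the debug output above can be decoded by hand. This is a sketch based on the documented Intel bit layout, not code taken from turbostat or coretemp, and the function names are hypothetical. Bits 23:16 of MSR_IA32_TEMPERATURE_TARGET hold TjMax, and bits 22:16 of MSR_IA32_THERM_STATUS hold the number of degrees below TjMax:

```python
def decode_tjmax(temperature_target):
    """TjMax in degrees C: bits 23:16 of MSR_IA32_TEMPERATURE_TARGET."""
    return (temperature_target >> 16) & 0xFF


def decode_therm_status(therm_status, tjmax):
    """Decode IA32_THERM_STATUS into (valid, degrees C, resolution in degrees C)."""
    valid = bool((therm_status >> 31) & 0x1)   # bit 31: reading valid
    resolution = (therm_status >> 27) & 0xF    # bits 30:27: resolution
    readout = (therm_status >> 16) & 0x7F      # bits 22:16: degrees below TjMax
    return valid, tjmax - readout, resolution


tjmax = decode_tjmax(0x00620000)
print(tjmax)                                   # 98
print(decode_therm_status(0x883B0000, tjmax))  # (True, 39, 1)
```

The decoded values match the "(98 C)" and "(39 C +/- 1)" annotations that turbostat printed for cpu0.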
Comment 4 Zhang Rui 2017-06-21 00:42:44 UTC
(In reply to Alex Forencich from comment #0)
> 
> cpuinfo (1 core out of 8):
> 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 77
> model name      : Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz

#define INTEL_FAM6_ATOM_SILVERMONT2     0x4D /* Avaton/Rangely */

This is an Avoton platform.
Comment 5 Zhang Rui 2017-08-28 05:03:05 UTC
The turbostat output is consistent with the coretemp driver output.

It seems that the real problem is that the MSR stops updating...
Comment 6 Zhang Rui 2017-08-29 02:16:13 UTC
Please:
1. Run "turbostat --debug --out turbostat.log"
2. Stress the CPU to make sure the temperature rises
3. Quit turbostat and attach turbostat.log here

We can then check whether the other MSRs are updated properly.
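
As a complement to the turbostat log, a raw MSR could also be read directly before and after the stress run. This is a rough sketch that assumes the `msr` kernel module is loaded and the script runs as root. The helper name `read_msr` is hypothetical. The msr character device uses the MSR address as the file offset:

```python
import os
import struct

IA32_THERM_STATUS = 0x19C  # per-core thermal status MSR


def read_msr(path, msr):
    """Read one 64-bit MSR from an msr device node such as /dev/cpu/0/msr."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # The MSR address is used as the byte offset; the value is 8 bytes, little-endian.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

Calling `read_msr("/dev/cpu/0/msr", IA32_THERM_STATUS)` once at idle and once under load, and comparing the two values, would show whether the raw register changes at all.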
Comment 7 Zhang Rui 2017-10-10 06:57:17 UTC
Bug closed because there has been no response from the bug reporter.
Please feel free to reopen it if you can provide the information requested in comment #6.
