Bug 203335 - Lenovo T580 always complains it's overheating
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: ACPICA-Core
Hardware: All Linux
Importance: P1 normal
Assignee: acpi_acpica-core@kernel-bugs.osdl.org
Reported: 2019-04-16 12:04 UTC by Laurent Bigonville
Modified: 2020-06-29 07:59 UTC

See Also:
Kernel Version: 5.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci output (19.09 KB, text/plain)
2019-04-18 08:09 UTC, Laurent Bigonville
dmesg 5.3-rc5 (77.26 KB, text/plain)
2019-09-07 09:50 UTC, Laurent Bigonville
thermal.txt (12.19 KB, text/plain)
2019-09-09 11:11 UTC, Laurent Bigonville

Description Laurent Bigonville 2019-04-16 12:04:09 UTC
Hello,

On my Lenovo T580 I very often get messages in dmesg about the CPU temperature being too high:

[mar avr 16 14:02:19 2019] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 52)
[mar avr 16 14:02:19 2019] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 52)
[mar avr 16 14:02:19 2019] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU4: Core temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU0: Core temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU1: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU3: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU6: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU2: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU0: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU7: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU5: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU4: Package temperature/speed normal

Looking at the temperature with lm-sensors, it seems perfectly in range:

iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +41.0°C  

thinkpad-isa-0000
Adapter: ISA adapter
fan1:           0 RPM

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +43.0°C  (crit = +98.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +41.5°C

Any idea what might be the problem?

This might be related to a general issue with ACPI on this model (see bug #203199)
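
To compare the throttle messages against the sensor readings, it can help to tally them per CPU. The following is a minimal sketch, not part of the original report: it assumes the message wording shown in the dmesg excerpt above, and the function name summarize_throttle_events is hypothetical.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 52)
# The wording is taken from the excerpt in this report; other kernel versions may differ.
THROTTLE_RE = re.compile(
    r"mce: CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttle_events(log_text):
    """Return {(scope, cpu): highest 'total events' value seen} for a dmesg dump."""
    summary = defaultdict(int)
    for line in log_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match:
            key = (match["scope"], int(match["cpu"]))
            summary[key] = max(summary[key], int(match["total"]))
    return dict(summary)
```

For example, calling summarize_throttle_events() on the text of a saved dmesg dump gives one counter per (scope, CPU) pair, which makes it easier to see how quickly the totals grow between two dumps.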
Comment 1 Zhang Rui 2019-04-18 07:16:12 UTC
Please attach the output of lspci -vx.
Comment 2 Laurent Bigonville 2019-04-18 08:09:12 UTC
Created attachment 282379
lspci output
Comment 3 Zhang Rui 2019-09-03 07:21:11 UTC
Please check whether the problem can be reproduced with the latest upstream kernel.
BTW, what distribution are you using? Please make sure thermald is running.
Comment 4 Laurent Bigonville 2019-09-03 10:01:43 UTC
I still see this today, using kernel 5.2 from Debian unstable.

thermald was not running; I have installed it now. Let's see if this works.
Comment 5 Laurent Bigonville 2019-09-06 08:58:14 UTC
It's still complaining even with thermald
Comment 6 Zhang Rui 2019-09-07 06:07:21 UTC
Oh, please confirm that CONFIG_PROC_THERMAL_MMIO_RAPL is set in the kernel you test with.
Comment 7 Laurent Bigonville 2019-09-07 09:02:47 UTC
Looks like CONFIG_PROC_THERMAL_MMIO_RAPL doesn't exist in 5.2 (was added in 5.3)
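
One way to check whether an option such as CONFIG_PROC_THERMAL_MMIO_RAPL is set is to look it up in the running kernel's config file. This is a minimal sketch under assumptions: the helper names config_value and read_kernel_config are hypothetical, /boot/config-<release> is where Debian-style distributions keep the config, and /proc/config.gz only exists when the kernel was built with CONFIG_IKCONFIG_PROC.

```python
import gzip
import os
import platform


def config_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...) or None if unset or absent."""
    prefix = option + "="
    for line in config_text.splitlines():
        # Unset options appear as "# CONFIG_FOO is not set" and never match the prefix.
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def read_kernel_config():
    """Read the running kernel's config from the usual, distribution-dependent locations."""
    boot_path = f"/boot/config-{platform.release()}"
    if os.path.exists(boot_path):
        with open(boot_path, encoding="utf-8") as handle:
            return handle.read()
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt", encoding="utf-8") as handle:
            return handle.read()
    return None  # neither location is available on this system
```

A result of None from config_value() covers both cases seen in this thread: the option being disabled and the option not existing in that kernel version at all.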
Comment 8 Laurent Bigonville 2019-09-07 09:50:17 UTC
Created attachment 284875
dmesg 5.3-rc5

I updated to 5.3-rc5 (from Debian) and can confirm that CONFIG_PROC_THERMAL_MMIO_RAPL is enabled, but I'm still seeing the same messages:

[  510.202054] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 92)
[  510.202055] mce: CPU6: Core temperature above threshold, cpu clock throttled (total events = 92)
[  510.202056] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202057] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202058] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202059] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202060] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202061] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202090] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202091] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 216)
[  510.202941] mce: CPU2: Core temperature/speed normal
[  510.202942] mce: CPU6: Core temperature/speed normal
[  510.202943] mce: CPU6: Package temperature/speed normal
[  510.202943] mce: CPU2: Package temperature/speed normal
[  510.202998] mce: CPU0: Package temperature/speed normal
[  510.202999] mce: CPU3: Package temperature/speed normal
[  510.203000] mce: CPU5: Package temperature/speed normal
[  510.203000] mce: CPU7: Package temperature/speed normal
[  510.203001] mce: CPU1: Package temperature/speed normal
[  510.203002] mce: CPU4: Package temperature/speed normal

Looking in dmesg I see: 

[   62.127412] thermal thermal_zone8: failed to read out thermal zone (-61)

Is that expected?
Comment 9 Zhang Rui 2019-09-09 04:40:15 UTC
(In reply to Laurent Bigonville from comment #8)
> Created attachment 284875 [details]
> dmesg 5.3-rc5
> 
> I updated to 5.3-rc5 (from debian) and I see that
> CONFIG_PROC_THERMAL_MMIO_RAPL is enabled, I'm still seeing the same messages:
> 
> [  510.202054] mce: CPU2: Core temperature above threshold, cpu clock
> throttled (total events = 92)
> [  510.202055] mce: CPU6: Core temperature above threshold, cpu clock
> throttled (total events = 92)
> [  510.202056] mce: CPU5: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202057] mce: CPU7: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202058] mce: CPU3: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202059] mce: CPU1: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202060] mce: CPU6: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202061] mce: CPU2: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202090] mce: CPU0: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202091] mce: CPU4: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [  510.202941] mce: CPU2: Core temperature/speed normal
> [  510.202942] mce: CPU6: Core temperature/speed normal
> [  510.202943] mce: CPU6: Package temperature/speed normal
> [  510.202943] mce: CPU2: Package temperature/speed normal
> [  510.202998] mce: CPU0: Package temperature/speed normal
> [  510.202999] mce: CPU3: Package temperature/speed normal
> [  510.203000] mce: CPU5: Package temperature/speed normal
> [  510.203000] mce: CPU7: Package temperature/speed normal
> [  510.203001] mce: CPU1: Package temperature/speed normal
> [  510.203002] mce: CPU4: Package temperature/speed normal
> 
I'm curious whether the system is really overheating when these messages are generated.
How often do you get these errors?

> Looking in dmesg I see: 
> 
> [   62.127412] thermal thermal_zone8: failed to read out thermal zone (-61)
> 
> Is that expected?

That should be okay.

Please attach the output of "grep . /sys/class/thermal/thermal*/*".
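
For a more compact view than the grep output, the same sysfs files can be read programmatically. The sketch below is an illustration only: read_thermal_zones is a hypothetical helper, and it relies on the standard sysfs layout in which each thermal_zone* directory has a "type" file and a "temp" file reporting millidegrees Celsius.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Collect the type and temperature of each thermal_zone* directory under base."""
    zones = []
    for zone_dir in sorted(Path(base).glob("thermal_zone*")):
        entry = {"name": zone_dir.name, "type": None, "temp_c": None}
        try:
            entry["type"] = (zone_dir / "type").read_text().strip()
            # The kernel reports temperatures in millidegrees Celsius.
            entry["temp_c"] = int((zone_dir / "temp").read_text()) / 1000.0
        except (OSError, ValueError) as exc:
            # Some zones cannot be read; keep the zone and record the error instead.
            entry["error"] = str(exc)
        zones.append(entry)
    return zones
```

Zones whose temperature cannot be read are kept in the result with an "error" entry rather than dropped, so a failing zone stays visible next to the ones that report normally.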
Comment 10 Laurent Bigonville 2019-09-09 11:11:27 UTC
Created attachment 284897
thermal.txt

I don't know. It also happened during the night, when the laptop was idle and unattended.

Temperatures look OK:

$ sensors
iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +49.0°C  

BAT0-acpi-0
Adapter: ACPI interface
in0:         +16.49 V  

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +43.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +45.0°C  (crit = +98.0°C)

BAT1-acpi-0
Adapter: ACPI interface
in0:         +12.68 V  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1:           0 RPM
temp1:        +45.0°C  
temp2:            N/A  
temp3:         +0.0°C  
temp4:         +0.0°C  
temp5:         +0.0°C  
temp6:         +0.0°C  
temp7:         +0.0°C  
temp8:         +0.0°C  
temp9:         +0.0°C  
temp10:        +0.0°C  
temp11:       +66.0°C  
temp12:        +0.0°C  
temp13:        +0.0°C  
temp14:        +0.0°C  
temp15:        +0.0°C  
temp16:        +0.0°C
Comment 11 Laurent Bigonville 2019-09-09 11:12:34 UTC
But to answer your question, it's happening multiple times a day
Comment 12 Zhang Rui 2019-12-31 04:21:39 UTC
Please run a kernel later than 5.4-rc2, locate the tcc_offset_degree_celsius file with "find /sys/ | grep tcc_offset_degree_celsius", and then post the content of this file.
Comment 13 Laurent Bigonville 2020-01-06 10:32:24 UTC
bigon@edoras:~$ sudo find /sys/ | grep tcc_offset_degree_celsius
/sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius
bigon@edoras:~$ cat '/sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius'
24


bigon@edoras:~$ uname -a
Linux edoras 5.4.0-1-amd64 #1 SMP Debian 5.4.6-1 (2019-12-27) x86_64 GNU/Linux
Comment 14 Zhang Rui 2020-01-07 02:52:23 UTC
I think you can set it to a smaller value to get rid of the overheating messages.
For example, run "echo 14 > /sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius"
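
The effect of this change can be worked out with simple arithmetic. This is a sketch under two assumptions that the thread does not state explicitly: that throttling starts at the maximum junction temperature (TjMax) minus the TCC offset, and that the +100.0°C "crit" value in the coretemp output above corresponds to TjMax on this machine. The function name is hypothetical.

```python
def throttle_threshold_c(tjmax_c, tcc_offset_c):
    """Temperature at which throttling would start, assuming threshold = TjMax - offset."""
    if tcc_offset_c < 0:
        raise ValueError("TCC offset must be non-negative")
    return tjmax_c - tcc_offset_c


# Assumption: coretemp's "crit = +100.0°C" from the sensors output is TjMax.
TJMAX_C = 100

print(throttle_threshold_c(TJMAX_C, 24))  # offset reported in comment 13 -> 76
print(throttle_threshold_c(TJMAX_C, 14))  # offset suggested in comment 14 -> 86
```

Under these assumptions, lowering the offset from 24 to 14 would raise the throttle point from 76°C to 86°C, which is why a smaller value should produce fewer threshold messages.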
Comment 15 Laurent Bigonville 2020-01-07 13:33:10 UTC
FTR, I found a long thread on the Lenovo forums that looks related: https://forums.lenovo.com/t5/Other-Linux-Discussions/X1C6-T480s-low-cTDP-and-trip-temperature-in-Linux/td-p/4028489
Comment 16 Zhang Rui 2020-06-29 07:59:27 UTC
That is really a long thread.
I think that can be fixed with thermald running, right?
We have made kernel changes for thermald to improve this.
Bug closed. Please feel free to reopen it if you still have any questions.
