Hello,

On my Lenovo T580 I very often get messages in dmesg about the CPU temperature being too high:

[mar avr 16 14:02:19 2019] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 52)
[mar avr 16 14:02:19 2019] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 52)
[mar avr 16 14:02:19 2019] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 77)
[mar avr 16 14:02:19 2019] mce: CPU4: Core temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU0: Core temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU1: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU3: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU6: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU2: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU0: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU7: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU5: Package temperature/speed normal
[mar avr 16 14:02:19 2019] mce: CPU4: Package temperature/speed normal

Looking at the temperature with lm-sensors, it seems perfectly in range:

iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +41.0°C

thinkpad-isa-0000
Adapter: ISA adapter
fan1:           0 RPM

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +43.0°C  (crit = +98.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +41.5°C

Any idea what might be the problem?

This might be related to a general issue with ACPI on this model (see bug #203199).
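A quick way to see how the counters in these messages grow is to reduce the log to the highest "total events" value per scope. The following is only a sketch: the function name summarize_throttle_events is made up for illustration, and it assumes the message format shown in this report.

```shell
# Summarize mce throttle messages read from stdin.
# Prints one line per scope (Core or Package) with the highest
# "total events" value seen for that scope.
# The function name is hypothetical; the message format is assumed
# to match the dmesg lines quoted in this report.
summarize_throttle_events() {
    awk '
        /temperature above threshold/ {
            # The scope is the word right after "CPUn:".
            scope = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^CPU[0-9]+:$/) { scope = $(i + 1) }
            }
            # The counter is the number after "total events = ".
            n = $0
            sub(/.*total events = /, "", n)
            sub(/[^0-9].*/, "", n)
            if (scope != "" && n + 0 > max[scope]) { max[scope] = n + 0 }
        }
        END {
            for (s in max) { printf "%s %d\n", s, max[s] }
        }
    ' | sort
}
```

It can be fed from the kernel log, for example with "dmesg | summarize_throttle_events" (reading dmesg may need root on some systems). Running it twice a few hours apart shows how fast the counters increase.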
Please attach the output of "lspci -vx".
Created attachment 282379 [details] lspci output
Please check whether the problem can be reproduced with the latest upstream kernel. BTW, which distribution are you using? Please also make sure thermald is running.
I still see this today with kernel 5.2 from Debian unstable. thermald was not running; I have installed it now. Let's see if it helps.
It's still complaining even with thermald
Oh, please confirm CONFIG_PROC_THERMAL_MMIO_RAPL is set with your test.
Looks like CONFIG_PROC_THERMAL_MMIO_RAPL doesn't exist in 5.2 (was added in 5.3)
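For anyone repeating this check, the option can be looked up in the config file shipped with the running kernel. This is a minimal sketch: kconfig_state is a hypothetical helper name, and it assumes a plain-text config file such as the /boot/config-<version> files that Debian installs.

```shell
# Report the state of a kernel config option.
# $1: option name, $2: path to a plain-text kernel config file.
# Prints the value after "=" (for example y or m), or "unset" when
# the option has no assignment in the file.
# The helper name is hypothetical.
kconfig_state() {
    opt=$1
    cfg=$2
    line=$(grep -E "^${opt}=" "$cfg" 2>/dev/null | head -n 1)
    if [ -n "$line" ]; then
        printf '%s\n' "${line#*=}"
    else
        printf 'unset\n'
    fi
}
```

Example call: kconfig_state CONFIG_PROC_THERMAL_MMIO_RAPL "/boot/config-$(uname -r)". On a 5.2 kernel this would be expected to print "unset", since the option was only added in 5.3.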
Created attachment 284875 [details]
dmesg 5.3-rc5

I updated to 5.3-rc5 (from Debian) and I see that CONFIG_PROC_THERMAL_MMIO_RAPL is enabled, but I'm still seeing the same messages:

[ 510.202054] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 92)
[ 510.202055] mce: CPU6: Core temperature above threshold, cpu clock throttled (total events = 92)
[ 510.202056] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202057] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202058] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202059] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202060] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202061] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202090] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202091] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 216)
[ 510.202941] mce: CPU2: Core temperature/speed normal
[ 510.202942] mce: CPU6: Core temperature/speed normal
[ 510.202943] mce: CPU6: Package temperature/speed normal
[ 510.202943] mce: CPU2: Package temperature/speed normal
[ 510.202998] mce: CPU0: Package temperature/speed normal
[ 510.202999] mce: CPU3: Package temperature/speed normal
[ 510.203000] mce: CPU5: Package temperature/speed normal
[ 510.203000] mce: CPU7: Package temperature/speed normal
[ 510.203001] mce: CPU1: Package temperature/speed normal
[ 510.203002] mce: CPU4: Package temperature/speed normal

Looking in dmesg I also see:

[ 62.127412] thermal thermal_zone8: failed to read out thermal zone (-61)

Is that expected?
(In reply to Laurent Bigonville from comment #8)
> Created attachment 284875 [details]
> dmesg 5.3-rc5
>
> I updated to 5.3-rc5 (from debian) and I see that
> CONFIG_PROC_THERMAL_MMIO_RAPL is enabled, I'm still seeing the same messages:
>
> [ 510.202054] mce: CPU2: Core temperature above threshold, cpu clock
> throttled (total events = 92)
> [ 510.202055] mce: CPU6: Core temperature above threshold, cpu clock
> throttled (total events = 92)
> [ 510.202056] mce: CPU5: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202057] mce: CPU7: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202058] mce: CPU3: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202059] mce: CPU1: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202060] mce: CPU6: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202061] mce: CPU2: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202090] mce: CPU0: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202091] mce: CPU4: Package temperature above threshold, cpu clock
> throttled (total events = 216)
> [ 510.202941] mce: CPU2: Core temperature/speed normal
> [ 510.202942] mce: CPU6: Core temperature/speed normal
> [ 510.202943] mce: CPU6: Package temperature/speed normal
> [ 510.202943] mce: CPU2: Package temperature/speed normal
> [ 510.202998] mce: CPU0: Package temperature/speed normal
> [ 510.202999] mce: CPU3: Package temperature/speed normal
> [ 510.203000] mce: CPU5: Package temperature/speed normal
> [ 510.203000] mce: CPU7: Package temperature/speed normal
> [ 510.203001] mce: CPU1: Package temperature/speed normal
> [ 510.203002] mce: CPU4: Package temperature/speed normal
>

I'm curious whether the system is really overheating when these messages are generated.
How often do you get these errors?

> Looking in dmesg I see:
>
> [ 62.127412] thermal thermal_zone8: failed to read out thermal zone (-61)
>
> Is that expected?

That should be okay.
Please attach the output of "grep . /sys/class/thermal/thermal*/*".
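The grep above dumps every attribute of every zone. For a shorter overview, something like the following sketch could be used. The function name list_thermal_zones is made up for illustration; it assumes the usual sysfs layout where each thermal_zone* directory has a "type" file and a "temp" file in millidegrees Celsius.

```shell
# Print "zone type temperature" for each thermal zone.
# $1: base directory (defaults to /sys/class/thermal).
# The function name is hypothetical; the layout (type and temp files,
# temp in millidegrees Celsius) is the usual sysfs one.
list_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null) || ztype=unknown
        # A zone whose temp cannot be read (as with the -61 error for
        # thermal_zone8 in this report) is shown as n/a.
        if millic=$(cat "$zone/temp" 2>/dev/null) && [ -n "$millic" ]; then
            temp=$((millic / 1000))
        else
            temp=n/a
        fi
        printf '%s %s %s\n' "${zone##*/}" "$ztype" "$temp"
    done
}
```

Called without arguments it reads /sys/class/thermal and prints whole degrees Celsius. The optional directory argument is only there so the function can be tried against a copy of the files.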
Created attachment 284897 [details]
thermal.txt

I don't know. It also happened during the night, when the laptop was not in use and left unattended.

Temperatures look OK:

$ sensors
iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +49.0°C

BAT0-acpi-0
Adapter: ACPI interface
in0:          +16.49 V

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +43.0°C

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +45.0°C  (crit = +98.0°C)

BAT1-acpi-0
Adapter: ACPI interface
in0:          +12.68 V

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +44.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1:           0 RPM
temp1:        +45.0°C
temp2:            N/A
temp3:         +0.0°C
temp4:         +0.0°C
temp5:         +0.0°C
temp6:         +0.0°C
temp7:         +0.0°C
temp8:         +0.0°C
temp9:         +0.0°C
temp10:        +0.0°C
temp11:       +66.0°C
temp12:        +0.0°C
temp13:        +0.0°C
temp14:        +0.0°C
temp15:        +0.0°C
temp16:        +0.0°C
But to answer your question, it's happening multiple times a day
Please run a kernel later than 5.4-rc2, locate the tcc_offset_degree_celsius file with "find /sys/ | grep tcc_offset_degree_celsius", and then post the content of this file.
bigon@edoras:~$ sudo find /sys/ | grep tcc_offset_degree_celsius
/sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius
bigon@edoras:~$ cat '/sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius'
24
bigon@edoras:~$ uname -a
Linux edoras 5.4.0-1-amd64 #1 SMP Debian 5.4.6-1 (2019-12-27) x86_64 GNU/Linux
I think you can set it to a smaller value to get rid of the overheating messages. For example, run "echo 14 > /sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius".
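To make the effect of the offset concrete: the TCC offset is subtracted from the CPU's maximum junction temperature (TjMax) to get the temperature at which throttling starts. Assuming TjMax is the 100°C that sensors reports as "crit" on this machine (an assumption, not something confirmed in this report), an offset of 24 would start throttling at about 76°C and an offset of 14 at about 86°C. The sketch below shows that arithmetic and a guarded write; both function names are hypothetical.

```shell
# Temperature at which throttling would start, in degrees Celsius.
# $1: TjMax, $2: TCC offset, both in whole degrees.
# Assumes activation temperature = TjMax - offset.
tcc_activation_temp() {
    echo $(( $1 - $2 ))
}

# Write a new offset to the sysfs attribute (needs root on a real system).
# $1: path to tcc_offset_degree_celsius, $2: new offset.
# Rejects anything that is not a non-negative integer before writing.
set_tcc_offset() {
    case $2 in
        ''|*[!0-9]*)
            echo "offset must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$2" > "$1"
}
```

Note that with sudo the redirection in a plain "sudo echo 14 > file" runs as the calling user, so the write has to happen inside a root shell, for example: sudo sh -c 'echo 14 > /sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius'.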
FTR, I found a long thread on the Lenovo forums that looks related: https://forums.lenovo.com/t5/Other-Linux-Discussions/X1C6-T480s-low-cTDP-and-trip-temperature-in-Linux/td-p/4028489
That is really a long thread. I think that can be fixed with thermald running, right? We have made kernel changes for thermald to improve this. Bug closed. Please feel free to reopen it if you still have any questions.