Dell Precision 7740 mobile notebook, manually configured kernel.

Thermal zone 0 is the acpi thermal zone. Reading it (e.g. with tmon) constantly shows 25.000 C, independent of load and real temperature, even if the notebook heats up to the mce throttling temperature. All other thermal zones work fine and deliver changing, realistic values (except for the iwlwifi thermal zone, which is a known bug).

From dmesg:
thermal LNXTHERM:00: registered as thermal_zone0
ACPI: Thermal Zone [THM] (25 C)

I found nothing else in dmesg that I could relate to thermal zone 0 or acpi thermal.

Is this a bios problem, a linux kernel problem, or did I configure something wrong?
Please attach the output of
1. grep . /sys/class/thermal/thermal*/*
2. cat /sys/class/thermal/thermal_zone0/device/path
3. acpidump
Created attachment 286493 [details] /sys/class/thermal/thermal_zone0/device/path
Created attachment 286495 [details] acpidump
Created attachment 286497 [details] grep . /sys/class/thermal/thermal*/*
You have enabled the INT340X thermal drivers on this platform, and the ACPI thermal zone is actually disabled when the INT340X thermal drivers are working. IMO, the real problem is that we should remove/disable the acpi thermal zone when the INT340X thermal driver is enabled. Could you please confirm whether the problem still exists when you rebuild your kernel with CONFIG_INT340X_THERMAL cleared?
The acpi thermal zone constantly reports 25 C, even if I clear CONFIG_INT340X_THERMAL, even if I disable all other thermals (X86_PKG_TEMP_THERMAL and INTEL_PCH_THERMAL). P.S.: Could you please tell your coworkers at Intel IWLWIFI that their thermal zone should be controllable by a CONFIG (it is always present if IWLWIFI is enabled, there is no way to disable it), especially as long as their thermal driver is broken (gives an error message at every boot and constantly reports 0 C).
(In reply to Klaus Kusche from comment #6)
> The acpi thermal zone constantly reports 25 C,
> even if I clear CONFIG_INT340X_THERMAL,
> even if I disable all other thermals
> (X86_PKG_TEMP_THERMAL and INTEL_PCH_THERMAL).

Is there any BIOS option related to DPTF? If yes, could you please disable it and re-check?

> P.S.: Could you please tell your coworkers at Intel IWLWIFI
> that their thermal zone should be controllable by a CONFIG
> (it is always present if IWLWIFI is enabled,
> there is no way to disable it),
> especially as long as their thermal driver is broken
> (gives an error message at every boot and constantly reports 0 C).

Hmm, okay, I will try to work out a fix for this.
I found nothing in the BIOS related to thermal sensors or management. Temperature values shown in the BIOS self test look reasonable. Would it help to attach my kernel config or dmesg? Maybe related: According to dmesg, my system runs into thermal throttling several times per day (per hour under heavy load), on all cores, just for some milliseconds.
Some more points:

* Changing CONFIG_DPTF_POWER does not make any difference. Should it be on or off?

* Because you mentioned CONFIG_INT340X_THERMAL above: With that on, there is a thermal zone INT340 (and some others). While the other zones controlled by that CONFIG give correct and changing values, INT340 seems to be broken, too: It constantly shows 20 C.

* About the acpi thermal zone: Of course it would be nice to have it working. But I could also live without it, because using x86_pkg_temp instead would be fine for me.

However, most userland tools (e.g. panel plugins or graphical system monitors) default either to thermal zone 0 or to "acpitz" and hence give no useful values. And disabling "acpitz" would not help, because then the iwlwifi thermal zone would be zone 0, and that can't be disabled and is also totally useless! If both could be disabled, x86_pkg_temp would be zone 0, which would be ok.
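As a userland workaround for the zone-numbering issue described above, a tool can select a thermal zone by its type string instead of by its index. The following is only a minimal sketch, assuming the usual sysfs layout (a `type` and a `temp` file per zone, with `temp` in millidegrees Celsius). The helper names `read_zones` and `temp_by_type` are made up for this illustration.

```python
from pathlib import Path


def read_zones(root="/sys/class/thermal"):
    """Return {zone_name: (type, temp_in_celsius_or_None)} for each thermal zone."""
    zones = {}
    for zone in Path(root).glob("thermal_zone*"):
        try:
            zone_type = (zone / "type").read_text().strip()
        except OSError:
            continue  # not a readable thermal zone directory
        try:
            # Assumption: sysfs reports the temperature in millidegrees Celsius.
            temp = int((zone / "temp").read_text().strip()) / 1000
        except (OSError, ValueError):
            temp = None  # some zones return an error instead of a value
        zones[zone.name] = (zone_type, temp)
    return zones


def temp_by_type(wanted="x86_pkg_temp", root="/sys/class/thermal"):
    """Return the temperature of the first zone whose type matches, else None."""
    for zone_type, temp in read_zones(root).values():
        if zone_type == wanted:
            return temp
    return None


if __name__ == "__main__":
    for name, (zone_type, temp) in sorted(read_zones().items()):
        print(f"{name}: {zone_type} = {temp}")
```

With this approach it does not matter which zone happens to be registered as zone 0.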
(In reply to Klaus Kusche from comment #9)
> Some more points:
>
> * Changing CONFIG_DPTF_POWER does not make any difference.
> Should it be on or off?

Please leave it on.

> * Because you mentioned CONFIG_INT340X_THERMAL above:
> With that on, there is a thermal zone INT340 (and some others).
> While the other zones controlled by that CONFIG give correct and changing
> values,
> INT340 seems to be broken, too: It constantly shows 20 C.

That's okay. int3400 is a fake thermal zone, but the others should work properly, or else that is a bug.

> * About the acpi thermal zone:
> Of course it would be nice to have it working.
> But I could also live without it,
> because using x86_pkg_temp instead would be fine for me.

Yes and no. Yes, because there is no cooling device associated with this ACPI thermal zone. No, because we lose the critical shutdown protection when the system really overheats.

> However, most userland tools (e.g. panel plugins or graphical system
> monitors)
> default either to thermal zone 0 or to "acpitz" and hence give no useful
> values.
> And disabling "acpitz" would not help, because then the iwlwifi thermal zone
> would be zone 0, and that can't be disabled and is also totally useless!
> If both could be disabled, x86_pkg_temp would be zone 0, which would be ok.

Right.
Let's get back to the original problem, i.e., why acpitz returns 25C.

    Method (_TMP, 0, NotSerialized)  // _TMP: Temperature
    {
        Local0 = GENS (0x16, Zero, Zero)
        If ((Local0 < 0x0BA6))
        {
            Local0 = 0x0BA6
        }

        Return (Local0)
    }

_TMP returns 0x0BA6 whenever Local0 is lower than 0x0BA6. 0x0BA6 is 2982 in decimal, which is (2982-2732)/10 = 25C. Local0 is the return value of GENS (0x16, Zero, Zero). So, in our case, the real question is why GENS (0x16, Zero, Zero) returns a value lower than 0x0BA6.

Now let's check GENS:

    Method (GENS, 3, NotSerialized)
    {
        Acquire (SMIX, 0xFFFF)
        Local0 = Arg1
        If ((ObjectType (Arg1) == One))
        {
            Local0 = SMBI (Arg0, Arg1)
        }

        If ((ObjectType (Arg1) == 0x03))
        {
            Local0 = SMBF (Arg0, Arg1, Arg2)
        }

        Release (SMIX)
        Return (Local0)
    }

ObjectType of Arg1, which is "Zero" in our case, is "1", which stands for Integer. So the system traps into SMM, in SMBI, to get the temperature. But we cannot debug further, as SMM is transparent to the OS.

For this issue, I cannot explain why the ACPI thermal zone returns a static value when DPTF is not actually enabled. I will get back to this issue if I have something new.
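For reference, the arithmetic in the analysis above can be sketched in Python. This is only an illustration of the clamp in _TMP and of the unit conversion. The function names are invented here, and the 2732 offset is taken from the calculation in the comment, not from any kernel source.

```python
def acpi_tmp_to_celsius(raw):
    """Convert an ACPI _TMP value (tenths of a kelvin) to degrees Celsius.

    Uses the same 2732 offset as the calculation in the comment.
    """
    return (raw - 2732) / 10


def clamped_tmp(gens_value, floor=0x0BA6):
    """Mimic the clamp in _TMP: never return less than `floor`."""
    return max(gens_value, floor)


if __name__ == "__main__":
    # Every GENS result below 0x0BA6 ends up being reported as the floor value.
    for gens_value in (0, 0x0B00, 0x0BA6, 0x0C00):
        raw = clamped_tmp(gens_value)
        print(f"GENS={gens_value:#06x} -> _TMP={raw:#06x} "
              f"-> {acpi_tmp_to_celsius(raw):.1f} C")
```

So a constant reading of exactly 25 C is consistent with GENS returning any value below 0x0BA6, including zero.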
Klaus, please check if there is any BIOS option related to DPTF, and confirm whether changing those options helps or not. Hi Mario, can you please help check if this is a BIOS issue?
(In reply to Zhang Rui from comment #12)
> Klaus,
> please check if there is any BIOS option related with DPTF, and confirm if
> changing those options helps or not.

As far as I can tell, there's nothing related in the BIOS. The only BIOS setting I think could be related to temp & power & speed is "Speed Step Shift", which is "on".

Perhaps related: I still observe MCE thermal throttling now and then, on all cores simultaneously, for just a few milliseconds. As far as I can tell, this mainly happens for short single-core or dual-core peak loads on an otherwise idle system, less frequently also for continuous all-core full load.
Please run a kernel later than 5.4-rc2, check the location of the file tcc_offset_degree_celsius with "find /sys/ | grep tcc_offset_degree_celsius", and then get the content of this file.
Kernel: 5.4.7-gentoo

cat /sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius
0
Try "echo 5 > /sys/devices/pci0000:00/0000:00:04.0/tcc_offset_degree_celsius" and see if the MCE errors go away.
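The find/cat/echo steps above can also be scripted. The following is a minimal sketch with made-up helper names. It assumes the file holds a plain integer, as in the output posted above, and writing to the real sysfs file needs root.

```python
import os

TCC_FILE = "tcc_offset_degree_celsius"


def find_tcc_offset_files(root="/sys/devices"):
    """Walk `root` and return the paths of all tcc_offset_degree_celsius files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if TCC_FILE in filenames:
            hits.append(os.path.join(dirpath, TCC_FILE))
    return sorted(hits)


def read_tcc_offset(path):
    """Read the current offset as an integer (assumed format: one number)."""
    with open(path) as f:
        return int(f.read().strip())


def write_tcc_offset(path, degrees):
    """Equivalent of `echo <degrees> > path`; needs root on the real file."""
    with open(path, "w") as f:
        f.write(f"{degrees}\n")
```

For example, `write_tcc_offset(find_tcc_offset_files()[0], 5)` would correspond to the echo command above, provided the file exists on the running kernel.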
I have not seen any MCE thermal throttling messages in dmesg for several kernel versions. But this does not imply that there is no MCE thermal throttling. I think a patch which suppresses all short-time thermal throttling messages was included in the linux kernel shortly after I opened this bug report, and all my throttling has been very short-time (milliseconds). So I can't test if the command you sent makes any difference. However, throttling is not the problem here (especially if it doesn't annoy me with messages), the temperature being reported constantly as 25 degrees is the problem.
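If the dmesg messages are suppressed, throttling can possibly still be observed through per-CPU counters in sysfs. The sketch below assumes that this kernel exposes `thermal_throttle/core_throttle_count` and `thermal_throttle/package_throttle_count` under each `/sys/devices/system/cpu/cpuN` directory; whether it does depends on the architecture and kernel config. The function names are invented for this example.

```python
from pathlib import Path

COUNTERS = ("core_throttle_count", "package_throttle_count")


def read_throttle_counts(root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: value}} from thermal_throttle dirs."""
    counts = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue  # counters not exposed for this CPU
        per_cpu = {}
        for name in COUNTERS:
            try:
                per_cpu[name] = int((throttle_dir / name).read_text().strip())
            except (OSError, ValueError):
                pass  # counter missing or unreadable
        counts[cpu_dir.name] = per_cpu
    return counts


def diff_counts(before, after):
    """Return only the counters that increased between two snapshots."""
    changed = {}
    for cpu, counters in after.items():
        for name, value in counters.items():
            delta = value - before.get(cpu, {}).get(name, 0)
            if delta > 0:
                changed.setdefault(cpu, {})[name] = delta
    return changed
```

Taking one snapshot before and one after a load test, with and without the changed tcc offset, and comparing them with `diff_counts` would show whether the setting makes a difference.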
Klaus, as I mentioned earlier, it seems that the system traps into SMM, and 25C is returned based on what happens there. There is no way for software to know what happens in SMM, as it is totally transparent to the OS. I strongly suspect this is BIOS related, but I don't have solid evidence for this. I'd prefer to close this bug as I have run out of ideas, but if you prefer, we can leave it open and see if anyone jumps in and helps us.
If Mario.Limonciello@dell.com also has no idea, you may close that part of the bug. But the iwlwifi thermal zone problem (see comment 6) still persists: It permanently reports 0, it gives an error message when booting, and it can't be turned off with a config. You promised to have a look at it in comment 7?
There is some upstream work on this. But unfortunately, we failed to push it upstream without an ACK from the wifi driver experts. https://patchwork.kernel.org/project/linux-pm/list/?series=279873&state=* Let me think about this.
For the wifi issue, it is tracked in another thread. https://bugzilla.kernel.org/show_bug.cgi?id=201761 So let's close this one and track there.