Bug 218586 - No ACPI Thermal Zones after Kernel 6.8
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: AMD Linux
Importance: P3 low
Assignee: acpi_power-thermal
URL:
Keywords:
Duplicates: 218652
Depends on:
Blocks:
 
Reported: 2024-03-11 12:12 UTC by Stephen Horvath
Modified: 2024-04-05 03:47 UTC
CC List: 5 users

See Also:
Kernel Version: 6.8.0
Subsystem:
Regression: Yes
Bisected commit-id: 9c8647224e9fabb765019193aa43c054a638f808


Attachments
Output of 'dmesg' on Kernel 6.8 (99.29 KB, text/plain)
2024-03-11 12:12 UTC, Stephen Horvath
Raise the max temperature of ACPI trip temps to 488K (215°C) (1.19 KB, patch)
2024-03-18 07:05 UTC, Stephen Horvath
Raise the max temperature of ACPI trip temps to 488K (215°C) (2.08 KB, patch)
2024-03-18 07:19 UTC, Stephen Horvath
Test patch for 6.8.x (432 bytes, patch)
2024-03-31 01:28 UTC, Quentin Smith

Description Stephen Horvath 2024-03-11 12:12:15 UTC
Created attachment 305976
Output of 'dmesg' on Kernel 6.8

Hi, I have a Framework 13 AMD, which had 4 thermal zones detected through ACPI, but starting with kernel 6.8 they no longer appear and the following gets printed in dmesg:
```
[    0.630727] ACPI: thermal: [Firmware Bug]: Invalid critical threshold (-274000)
[    0.630738] ACPI: thermal: [Firmware Bug]: No valid trip points!
[    0.630819] ACPI: thermal: [Firmware Bug]: Invalid critical threshold (-274000)
[    0.630828] ACPI: thermal: [Firmware Bug]: No valid trip points!
[    0.630904] ACPI: thermal: [Firmware Bug]: Invalid critical threshold (-274000)
[    0.630913] ACPI: thermal: [Firmware Bug]: No valid trip points!
[    0.630991] ACPI: thermal: [Firmware Bug]: Invalid critical threshold (-274000)
[    0.631000] ACPI: thermal: [Firmware Bug]: No valid trip points!
```

For comparison, this is 6.7.9:
```
[    0.632366] ACPI: thermal: Thermal Zone [TZ00] (42 C)
[    0.632593] ACPI: thermal: Thermal Zone [TZ01] (41 C)
[    0.632773] ACPI: thermal: Thermal Zone [TZ02] (39 C)
[    0.632867] ACPI: thermal: Thermal Zone [TZ03] (57 C)
```

Thanks,
Steve
Comment 1 Artem S. Tashkinov 2024-03-12 14:57:10 UTC
Would be great if you tried to bisect: https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 2 Stephen Horvath 2024-03-18 07:03:46 UTC
Hi, sorry I had a busy week.

Here's the output after bisecting:
```
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[9c8647224e9fabb765019193aa43c054a638f808] ACPI: thermal: Use library functions to obtain trip point temperature values
```

After some debugging, it seems my device reports trip temps of 483.2K (210°C), but the kernel only accepts the range 218K (-55°C) to 448K (175°C), so it treats them as invalid.
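
The range check described above can be sketched in standalone C. This is only an illustration under stated assumptions, not the kernel's code: the helper names are hypothetical, the bounds are the 218K and 448K figures quoted in this comment expressed in tenths of a Kelvin (the unit ACPI thermal objects use), and the sentinel is the -274000 value visible in the dmesg output, assumed to be in millidegrees Celsius.

```c
#include <stdbool.h>

/* Bounds quoted in this comment, in tenths of a Kelvin (deci-Kelvin). */
#define TEMP_MIN_DECIK 2180LL /* 218 K, about -55 C */
#define TEMP_MAX_DECIK 4480LL /* 448 K, about 175 C */

/* Sentinel matching the -274000 in the dmesg output (assumed millidegrees C). */
#define TEMP_INVALID_MILLIC (-274000L)

/* Hypothetical helper: true if a firmware-reported value is inside the window. */
static bool trip_temp_in_range(long long deci_kelvin)
{
	return deci_kelvin >= TEMP_MIN_DECIK && deci_kelvin <= TEMP_MAX_DECIK;
}

/* Hypothetical helper: convert to millidegrees C, or return the sentinel. */
static long trip_temp_to_millic(long long deci_kelvin)
{
	if (!trip_temp_in_range(deci_kelvin))
		return TEMP_INVALID_MILLIC;
	/* 1 dK = 100 mK, and 0 C = 273150 mK */
	return (long)(deci_kelvin * 100 - 273150);
}
```

Under these assumptions, a reported value of 4832 (483.2K) falls outside the window and maps to the sentinel, whereas 3152 (315.2K) would convert to 42050 millidegrees C.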

I'll attach a diff to raise the max to 488K (215°C), although I was wondering whether that's enough or whether a value like 3276K (3003°C) would be better, since it's just below the signed 16-bit limit, which seems like it could be used as an invalid value on some devices.
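
To make the trade-off between the candidate limits concrete, here is a small C sketch. It is hypothetical throughout: the macro and function names are invented for illustration, values are assumed to be in tenths of a Kelvin, and the idea that some firmware uses the signed 16-bit maximum as an "invalid" marker is only the speculation voiced above, not a confirmed behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Candidate upper bounds from this comment, in deci-Kelvin. */
#define MAX_CURRENT_DECIK   4480 /* 448 K, the existing limit */
#define MAX_PATCHED_DECIK   4880 /* 488 K, the attached diff */
#define MAX_GENEROUS_DECIK 32760 /* 3276 K, just under INT16_MAX */

/* The generous bound must still exclude INT16_MAX (32767) itself. */
_Static_assert(MAX_GENEROUS_DECIK < INT16_MAX, "bound must stay below INT16_MAX");

/* Hypothetical helper: range check with a caller-supplied upper bound. */
static bool trip_temp_ok(long deci_kelvin, long max_deci_kelvin)
{
	return deci_kelvin >= 2180 && deci_kelvin <= max_deci_kelvin;
}
```

With these definitions, the 4832 value reported by this device fails against `MAX_CURRENT_DECIK` but passes against either raised bound, while a raw `INT16_MAX` would still be rejected by `MAX_GENEROUS_DECIK`.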

Thanks,
Steve
Comment 3 Stephen Horvath 2024-03-18 07:05:38 UTC
Created attachment 306004
Raise the max temperature of ACPI trip temps to 488K (215°C)
Comment 4 Stephen Horvath 2024-03-18 07:19:30 UTC
Created attachment 306005
Raise the max temperature of ACPI trip temps to 488K (215°C)

Sorry, I thought I might make it a proper patch rather than a diff.
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-29 08:51:48 UTC
Let me add this to the regression tracker to ensure it does not fall through the cracks:

#regzbot introduced: 9c8647224e9fabb765019193a
#regzbot title: ACPI: thermal_lib: no ACPI Thermal Zones anymore
#regzbot fix: ACPI: thermal_lib: Continue registering thermal zones even if trip points fail validation
#regzbot monitor: https://lore.kernel.org/all/SY4P282MB3063EE2CC37BD0EF2318B746C5362@SY4P282MB3063.AUSP282.PROD.OUTLOOK.COM/
Comment 6 Mario Limonciello (AMD) 2024-03-30 14:16:08 UTC
*** Bug 218652 has been marked as a duplicate of this bug. ***
Comment 7 Quentin Smith 2024-03-31 01:28:16 UTC
Created attachment 306063
Test patch for 6.8.x

I tried Mario's patch from bug 218652; unfortunately it doesn't cleanly apply to 6.8.x (it looks like this code changed on master only a week ago).

I made a corresponding patch for 6.8.x, and with this patch the sensors are back (though obviously the temperature thresholds are still bogus).

Probably Mario's patch will work correctly on master, but I didn't want to run a true bleeding edge kernel in case other things are broken differently.
Comment 8 Mario Limonciello (AMD) 2024-04-01 12:56:58 UTC
Quentin, can you try Stephen's suggestion above in comment 4 instead?

I think that's more desirable if that works instead.
Comment 10 Quentin Smith 2024-04-04 19:34:13 UTC
(In reply to Mario Limonciello (AMD) from comment #8)
> Quentin, can you try Stephen's suggestion above in comment 4 instead?
> 
> I think that's more desirable if that works instead.

To raise the limit a bit? That might fix the Framework 16, but the same problem would exist on any other laptop that has invalid limits. Why is that more desirable?

In any event, it looks like Rafael's patch is queued up for 6.9.
