Bug 201761

Summary: "failed to read out thermal zone" for wifi thermal zone
Product: Power Management
Component: Thermal
Reporter: 00oo00 (chenalias)
Assignee: Zhang Rui (rui.zhang)
Status: NEEDINFO
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.18.0-10-generic
Subsystem:
Regression: No
Bisected commit-id:
CC: bugzilla.kernel.org, daniel.lezcano, fkrueger, florinlipan, irherder, jay+bko, johannespfrang+kernel, kernelorg, klaus.kusche, marcel, Maurice.Smulders, mluppov, navarro.ime, nikuito, nrndda, oneuptingera, oscar.priegov, oskar.grindemyr, rm+bko, rui.zhang, serg, shalev.tomer, smirandac, spiderx, stanley.king, ulf.norberg, whenov
Attachments: dmesg
grep output
dmesg from 5.0-rc1 built from https://cgit.freedesktop.org/~agd5f/linux/tree/?h=drm-next-5.1-wip
grep . /sys/class/thermal/*/*
dmesg output showing -61 error

Description 00oo00 2018-11-22 06:26:02 UTC
Created attachment 279597 [details]
dmesg

thermal thermal_zone0: failed to read out thermal zone (-61)
Comment 1 Jean Delvare 2018-11-28 09:06:39 UTC
Please provide a description of your system and a clear statement of your problem from a functional perspective. Also let us know if it used to work before, and if so, which is the last known working version and the first known non-working version.

Also note that thermal is a different subsystem from hwmon, so this bug was filed in the wrong component.
Comment 2 Zhang Rui 2018-11-29 02:46:09 UTC
Hi, Jean, thanks for forwarding. Reassign to thermal component.

I suspect thermal_zone0 is from the wifi driver.
Please attach the output of "grep . /sys/class/thermal/*/*"
Comment 3 Zhang Rui 2018-12-27 15:55:00 UTC
ping...
Comment 4 Andriy Perevortkin 2019-02-10 13:39:24 UTC
I'm not the original reporter but I'm affected too.

Motherboard is an MSI B450 GAMING PLUS AC with
1c:00.0 Network controller: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] (rev 10)

which is the device to which thermal zone supposedly belongs.

lm-sensors (git) reports this as:
iwlwifi-virtual-0
Adapter: Virtual device
temp1:            N/A

Attached the output of grep . /sys/class/thermal/*/* as requested, and my dmesg.
Comment 5 Andriy Perevortkin 2019-02-10 13:41:48 UTC
Created attachment 281089 [details]
grep output
Comment 6 Andriy Perevortkin 2019-02-10 13:43:13 UTC
Created attachment 281091 [details]
dmesg from 5.0-rc1 built from https://cgit.freedesktop.org/~agd5f/linux/tree/?h=drm-next-5.1-wip
Comment 7 Zhang Rui 2019-03-26 08:52:05 UTC
/sys/class/thermal/thermal_zone0/type:iwlwifi

Yes, it is the wifi driver.
The problem is that we read the temperature when a thermal zone is registered, but for this particular device, we're not able to get the temp because wifi firmware is not loaded at that time.

This is on my TODO list. I will propose the capability to register a disabled thermal zone, so that the thermal framework will not try to read the temperature at registration. I will paste the patch here when it is done.
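
The sequence described above can be sketched as a small user-space model. This is not kernel code: `ToyZone`, `register_zone`, `make_wifi_sensor` and the `firmware_running` flag are hypothetical names invented for illustration, and the temperature value is made up. Only the -61 (ENODATA) result and the idea of registering a disabled zone come from this thread.

```python
import errno


class ToyZone:
    """Toy stand-in for a thermal zone; not the kernel's data structure."""

    def __init__(self, name, get_temp, enabled=True):
        self.name = name
        self.get_temp = get_temp
        self.enabled = enabled
        self.log = []

    def update(self):
        """Read the sensor once, logging a failure like the kernel message."""
        if not self.enabled:
            return None  # a disabled zone is never read
        ret, temp = self.get_temp()
        if ret < 0:
            self.log.append(
                f"thermal {self.name}: failed to read out thermal zone ({ret})"
            )
            return None
        return temp


def make_wifi_sensor(state):
    """Return a get_temp callback that fails until the firmware flag is set."""
    def get_temp():
        if not state["firmware_running"]:
            return -errno.ENODATA, None  # -61 on Linux
        return 0, 42000  # millidegrees Celsius, made-up value
    return get_temp


def register_zone(name, get_temp, enabled=True):
    """Registering an enabled zone triggers an immediate first read."""
    zone = ToyZone(name, get_temp, enabled)
    zone.update()
    return zone


if __name__ == "__main__":
    state = {"firmware_running": False}

    # Current behaviour: the first read happens before the firmware is up.
    eager = register_zone("thermal_zone0", make_wifi_sensor(state))
    print(eager.log)

    # Proposed behaviour: register disabled, enable once the firmware runs.
    lazy = register_zone("thermal_zone0", make_wifi_sensor(state), enabled=False)
    state["firmware_running"] = True
    lazy.enabled = True
    print(lazy.update(), lazy.log)
```

In this model the eagerly registered zone logs one failure, while the zone that starts disabled logs nothing and returns a temperature once it is enabled.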
Comment 8 Stan King 2019-07-28 15:24:17 UTC
I too see this on all my systems that use the iwlwifi driver.

By the way, the message is posted about 0.2 to 0.3 seconds after the message saying that the firmware is loaded.  This is dmesg output from a system running Fedora 30.  Other systems have similar timing.

[    7.207904] iwlwifi 0000:01:00.0: loaded firmware version 29.1044073957.0 op_mode iwlmvm
[    7.263526] input: HDA Intel PCH Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input9
[    7.263631] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[    7.263731] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[    7.263807] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[    7.263882] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[    7.263955] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
[    7.368003] intel_rapl: Found RAPL domain package
[    7.370868] intel_rapl: Found RAPL domain core
[    7.373444] intel_rapl: Found RAPL domain uncore
[    7.376064] intel_rapl: Found RAPL domain dram
[    7.412647] Bluetooth: hci0: unexpected event for opcode 0xfc2f
[    7.415702] iwlwifi 0000:01:00.0: Detected Intel(R) Dual Band Wireless AC 7265, REV=0x210
[    7.432655] Bluetooth: hci0: Intel firmware patch completed and activated
[    7.441377] iwlwifi 0000:01:00.0: base HW address: 10:02:b5:2a:43:1f
[    7.525907] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[    7.526792] thermal thermal_zone9: failed to read out thermal zone (-61)
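
The gap between the two messages can also be computed rather than read by eye. The following is a minimal sketch, assuming the bracketed-seconds timestamp format shown in the excerpt above; the function name `firmware_to_failure_gap` and the regular expression are illustrative only, and the sample is three lines copied from that excerpt.

```python
import re

# Matches the "[    7.207904] message" prefix printed by dmesg.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def firmware_to_failure_gap(dmesg_text):
    """Return the seconds between the first 'loaded firmware version' line and
    the first 'failed to read out thermal zone' line after it, or None."""
    loaded_at = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if loaded_at is None and "loaded firmware version" in message:
            loaded_at = stamp
        elif loaded_at is not None and "failed to read out thermal zone" in message:
            return stamp - loaded_at
    return None


SAMPLE = """\
[    7.207904] iwlwifi 0000:01:00.0: loaded firmware version 29.1044073957.0 op_mode iwlmvm
[    7.415702] iwlwifi 0000:01:00.0: Detected Intel(R) Dual Band Wireless AC 7265, REV=0x210
[    7.526792] thermal thermal_zone9: failed to read out thermal zone (-61)
"""

if __name__ == "__main__":
    print(f"{firmware_to_failure_gap(SAMPLE):.6f} s")
```

For the excerpt above this comes to roughly 0.32 seconds. Feeding it the full output of dmesg from another system should work the same way, as long as timestamps are enabled.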
Comment 9 Klaus Kusche 2019-10-31 11:16:02 UTC
Same problem here.

thermal thermal_zone1: failed to read out thermal zone (-61)

comes five lines after

iwlwifi 0000:6f:00.0: loaded firmware version 48.4fa0041f.0 op_mode iwlmvm
Comment 10 Andriy Perevortkin 2020-02-03 23:31:52 UTC
It has been over a year now and we got 7 kernel releases after the initial bug report.
Can we hope this is going to be fixed?
Comment 11 Sergio Miranda 2020-02-24 23:09:21 UTC
Hello, I'm using Ubuntu Mate 18.04, and I'm getting this message or bug on /var/log/syslog.
And I guess it's the reason my laptop freezes.

Here's laptop and Ubuntu data:

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic

uname -a
Linux sergio-AsusLaptop 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
root@sergio-AsusLaptop:~# 



Log:

Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: The canary thread is apparently starving. Taking action.
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Demoting known real-time threads.
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 14491 of process 26854 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 12546 of process 12446 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3272 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3271 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3262 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Demoted 5 threads.
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.689160] thermal thermal_zone1: failed to read out thermal zone (-61)
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.712771] PM: suspend exit
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.712905] PM: suspend entry (s2idle)
Comment 12 shalev.tomer 2020-03-13 05:31:30 UTC
Same problem here

$ uname -a
Linux 8VT19Y2 4.15.0-1073-oem #83-Ubuntu SMP Mon Feb 17 11:21:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic

Log:

Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.632872] usb 1-1.3.2.2: new full-speed USB device number 18 using xhci_hcd
...
Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.713229] thermal thermal_zone8: failed to read out thermal zone (-61)
Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.713241] usb 1-1.3.2.2: device descriptor read/64, error -32
Comment 13 shalev.tomer 2020-03-13 05:32:56 UTC
Created attachment 287905 [details]
grep . /sys/class/thermal/*/*

Output of:
$ grep . /sys/class/thermal/*/* 2> /dev/null > sys_class_thermal.txt
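
For anyone who prefers a script, here is a rough Python equivalent of the command above. It is only a sketch: `dump_thermal` is a made-up helper name, and unlike the shell command (which hides errors with 2> /dev/null) it reports unreadable files by errno name, so a zone whose temp cannot be read shows up as ENODATA instead of vanishing from the output.

```python
import errno
import glob
import os


def dump_thermal(root="/sys/class/thermal"):
    """Collect 'path:value' lines for every file one level below each entry
    in root, similar to `grep . root/*/*`. Files that cannot be read are
    listed with their errno name instead of being dropped."""
    lines = []
    for path in sorted(glob.glob(os.path.join(root, "*", "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories and dangling links
        try:
            with open(path, "r", errors="replace") as handle:
                content = handle.read()
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            lines.append(f"{path}: <{name}>")
            continue
        for row in content.splitlines():
            if row:  # `grep .` prints only non-empty lines
                lines.append(f"{path}:{row}")
    return lines


if __name__ == "__main__":
    print("\n".join(dump_thermal()))
```

The `root` parameter exists only so the helper can be pointed at a copy of the directory tree; on a live system the default path is the one used in this report.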
Comment 14 Klaus Kusche 2020-04-22 14:32:14 UTC
Why is this still in status "NEEDINFO"?
What info is still missing?
Comment 15 Zhang Rui 2020-04-27 02:52:47 UTC
Hi, guys,

sorry for the late response.

I can reproduce the issue locally, and a debug message shows that the iwl_mvm_firmware_running() check fails when the temperature is read during registration. I'm not sure whether we are all hitting the same issue, but it's better to fix the firmware load issue first and then see whether that solves all the problems.

Recently, two patches were posted which make it possible to register the wifi driver as a "disabled" thermal zone and enable it later, after the firmware has loaded, so I will propose a patch to fix this issue soon.
Comment 16 Stan King 2020-04-30 20:10:04 UTC
I still don't understand why it fails to read out the thermal zone even after the firmware is reported to be loaded.  Does the firmware require more time to initialize before the thermal zone can be successfully read, so there's just some handshaking missing?  If so, does that explain why lm_sensors can successfully read the temperature later?

Here's what it looks like on my Fedora 30 system, kernel 5.6.7-100.fc30.x86_64:

$ dmesg|egrep 'thermal_zone4|iwl'
[    5.503305] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[    5.513497] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[    5.517088] iwlwifi 0000:03:00.0: Found debug configuration: 0
[    5.517222] iwlwifi 0000:03:00.0: loaded firmware version 29.1654887522.0 7265D-29.ucode op_mode iwlmvm
[    5.683009] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 3165, REV=0x210
[    5.701454] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[    5.702520] iwlwifi 0000:03:00.0: Allocated 0x00400000 bytes for firmware monitor.
[    5.708150] iwlwifi 0000:03:00.0: base HW address: 48:a4:72:8a:a4:f3
[    5.768788] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[    5.769174] thermal thermal_zone4: failed to read out thermal zone (-61)
[    5.772690] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
[    9.369676] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[    9.445652] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[    9.446745] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[    9.475696] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[    9.551639] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[    9.552723] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
Comment 17 Zhang Rui 2020-05-01 02:09:06 UTC
I'm not sure either.

For any of you who can try my current solution,
please pull the thermal/linux-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git

and then apply the following patches
https://patchwork.kernel.org/patch/11506053/
https://patchwork.kernel.org/patch/11506065/
https://patchwork.kernel.org/patch/11519215/
https://patchwork.kernel.org/patch/11519219/
https://patchwork.kernel.org/patch/11519223/
https://patchwork.kernel.org/patch/11519227/
https://patchwork.kernel.org/patch/11519231/
https://patchwork.kernel.org/patch/11519235/

and see if the problem goes away.

One thing to note is that the fix patch in this series (PATCH 6/6) is just a prototype, as I'm not familiar with the iwlwifi driver. I expect a wireless expert to help improve the patch or come up with a better solution that can be targeted for upstream.
Comment 18 Zhang Rui 2020-05-01 02:10:58 UTC
@Stan, you can try the solution above to see whether the problem goes away. Maybe it is the same issue.
Comment 19 Stan King 2020-05-03 18:37:06 UTC
Zhang,

I have to admit that I'm on the edge of being hopelessly lost trying to apply your patches.  I haven't compiled a kernel for at least two years, and I've never applied git patches before.

I've done a "git clone" with the git URL that you supplied.  That seems to have populated a complete source tree.

I downloaded the .diff file from your first patch URL, the one that ends in 11506053/.  Is there a way to use the URL directly, without a file download?

I tried to apply that patch with "git apply".  A few of the diffs inside were matched successfully, but it failed on what would be Hunk #8 of drivers/thermal/imx_thermal.c.  I think the failure was that it was looking for a block of text starting with "static int __maybe_unused imx_thermal_suspend(struct device *dev)", yet the string "__maybe_unused" does not appear in imx_thermal.c in the definition of imx_thermal_suspend.

Do the patches need to be applied in a particular order or sequence?

Assuming that I am able to successfully apply the patches, any help for how to proceed after that would be appreciated too, as the various help pages on the Internet all target slightly different situations than this.  Even knowing how much file space will be consumed will be helpful.  I'm using Fedora 31. Thanks.
Comment 20 Zhang Rui 2020-05-04 01:06:15 UTC
Here are the commands that we can use:
1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git
2. git checkout -b test thermal/linux-next
3. download the first two patches as mbox
4. git am patch_dir/*.patch
5. remove the two patches and download the next six patches as mbox
6. git am patch_dir/*.patch

that should work without any conflict.
I think you probably missed step 2.
Comment 21 Stan King 2020-05-04 13:37:51 UTC
Thank you, Zhang.  I think I'm getting closer.

Step 2 initially failed because it wasn't a git repository, but presumably I should have cd'ed to directory linux, so I did that.  In that directory, step 2 gets the following error:

  fatal: 'thermal/linux-next' is not a commit and a branch 'test' cannot be created from it

I don't know how to recover from this, so I'd appreciate any advice you could offer.
Comment 22 Zhang Rui 2020-05-05 04:28:44 UTC
for step 2,
try
git checkout -b test origin/thermal/linux-next
instead.
Comment 23 Stan King 2020-05-06 21:50:49 UTC
Zhang,

That did the trick.  I have some results to report.  First the good news:

In two environments,

Intel(R) Core(TM) i7-8550U CPU + Intel Wireless 3165 (rev 81), and
AMD PRO A8-8600B R6 CPU + Intel Wireless 7265 (rev 61)

it no longer complains about the thermal zone upon start-up.  The iwlwifi zone is still visible among /sys/class/thermal/*/type, and lm_sensors still claims to read out the temperature successfully.

In an environment with an Intel Core i5-2400 CPU and no Wireless, it seems to run without difficulty.

Is there any additional information you'd like me to gather from these cases that seem to work?

Now the bad news:

In a laptop with an Intel Core i5-540M CPU and Intel Centrino Advanced-N 6205 [Taylor Peak] for wireless, it halts quite early on in the boot process, I think long before it has a chance to set up thermal.  I was able to capture a video by using a boot_delay=10 kernel parameter.  The first sign of a problem is "divide error: 0000 [#1] SMP PTI".  The next line starts with "CPU: 0 PID" and ends with the kernel version and #1.  Next is a line with "RIP: 0010:arch_scale_freq_tick+0x67/0x7f".  This is followed by what looks like register dumps and call traces.  The first two lines of the trace are <IRQ> and schedule_tick+0x34/0x120.  There's a different RIP right after that, but you get the idea.  Is it possible your kernel had an experimental feature enabled that doesn't work on this old hardware?
Comment 24 Zhang Rui 2020-05-07 02:36:32 UTC
(In reply to Stan King from comment #23)
> Zhang,
> 
> That did the trick.  I have some results to report.  First the good news:
> 
> In two environments,
> 
> Intel(R) Core(TM) i7-8550U CPU + Intel Wireless 3165 (rev 81), and
> AMD PRO A8-8600B R6 CPU + Intel Wireless 7265 (rev 61)
> 
> it no longer complains about the thermal zone upon start-up.  The iwlwifi
> zone is still visible among /sys/class/thermal/*/type, and lm_sensors still
> claims to read out the temperature successfully.
> 
> In an environment with an Intel Core i5-2400 CPU and no Wireless, it seems
> to run without difficulty.
> 
> Is there any additional information you'd like me to gather from these cases
> that seem to work?

Good to know that it works. But the patch itself is a little tricky, and I'm still waiting for the wireless experts to give comments about the patch.

> 
> Now the bad news:
> 
> In a laptop with an Intel Core i5-540M CPU and Intel Centrino Advanced-N
> 6205 [Taylor Peak] for wireless, it halts quite early on in the boot
> process, I think long before it has a chance to set up thermal.  I was able
> to capture a video by using a boot_delay=10 kernel parameter.  The first
> sign of a problem is "divide error: 0000 [#1] SMP PTI".  The next line
> starts with "CPU: 0 PID" and ends with the kernel version and #1.  Next is a
> line with "RIP: 0010:arch_scale_freq_tick+0x67/0x7f".  This is followed by
> what looks like register dumps and call traces.  The first two lines of the
> trace are <IRQ> and schedule_tick+0x34/0x120.  There's a different RIP right
> after that, but you get the idea.  Is it possible your kernel had an
> experimental feature enabled that doesn't work on this old hardware?

I don't think it is caused by the patches.
A simple way to verify this is to rebuild the kernel with steps 1 and 2 only, without applying the patches, and see if you get the same problem.
Comment 25 Stan King 2020-05-07 18:05:20 UTC
Zhang, indeed it failed on the i5-540M processor even without your patches.
Comment 26 Zhang Rui 2020-05-08 03:07:40 UTC
I see.
So let's wait for the reply from the wifi experts and see how we can have an upstream solution.
Comment 27 Thiago Navarro 2020-08-13 12:00:34 UTC
Hi everyone,

I have faced the same problem.
 
Some information about my system.

$ uname -a
Linux chivunk 5.4.52-1-MANJARO #1 SMP PREEMPT Thu Jul 16 16:07:11 UTC 2020 x86_64 GNU/Linux

$ sudo dmesg|egrep 'thermal_zone4|iwl' 
[    2.818658] iwlwifi 0000:00:14.3: enabling device (0000 -> 0002)
[    2.826456] iwlwifi 0000:00:14.3: TLV_FW_FSEQ_VERSION: FSEQ Version: 58.3.35.22
[    2.826461] iwlwifi 0000:00:14.3: Found debug destination: EXTERNAL_DRAM
[    2.826463] iwlwifi 0000:00:14.3: Found debug configuration: 0
[    2.826812] iwlwifi 0000:00:14.3: loaded firmware version 50.3e391d3e.0 op_mode iwlmvm
[    3.077614] iwlwifi 0000:00:14.3: Detected Intel(R) Wi-Fi 6 AX201 160MHz, REV=0x354
[    3.083978] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[    3.084350] iwlwifi 0000:00:14.3: Allocated 0x00400000 bytes for firmware monitor.
[    3.249901] iwlwifi 0000:00:14.3: base HW address: 40:74:e0:62:8d:60
[    3.261793] thermal thermal_zone4: failed to read out thermal zone (-61)
[    3.490892] iwlwifi 0000:00:14.3 wlp0s20f3: renamed from wlan0
[    4.855150] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[    5.021074] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[    5.052032] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[    5.217533] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[  106.829111] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[  106.994883] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
Comment 28 oneuptingera 2020-09-25 09:11:03 UTC
Same problem.

$ uname -a
Linux hulk 5.8.0-2-amd64 #1 SMP Debian 5.8.10-1 (2020-09-19) x86_64 GNU/Linux


$ sudo dmesg|egrep 'thermal_zone4|iwl'
[    2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[    2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-cc-a0-56.ucode
[    2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported by driver
[    2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.22
[    2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[    2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
[    2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0 cc-a0-56.ucode op_mode iwlmvm
[    2.897474] iwlwifi 0000:03:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[    3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
[    3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
Comment 29 oneuptingera 2020-09-25 09:13:17 UTC
(In reply to oneuptingera from comment #28)
> Same problem.
> 
> $ uname -a
> Linux hulk 5.8.0-2-amd64 #1 SMP Debian 5.8.10-1 (2020-09-19) x86_64 GNU/Linux
> 
> 
> $ sudo dmesg|egrep 'thermal_zone4|iwl'
> [    2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
> [    2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware
> iwlwifi-cc-a0-56.ucode
> [    2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported
> by driver
> [    2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version:
> 89.3.35.22
> [    2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
> [    2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
> [    2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0
> cc-a0-56.ucode op_mode iwlmvm
> [    2.897474] iwlwifi 0000:03:00.0: firmware: failed to load
> iwl-debug-yoyo.bin (-2)
> [    3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz,
> REV=0x340
> [    3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
> [    3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0

$ sudo dmesg|egrep 'thermal_zone0|iwl'
[    2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[    2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-cc-a0-56.ucode
[    2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported by driver
[    2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.22
[    2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[    2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
[    2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0 cc-a0-56.ucode op_mode iwlmvm
[    2.897474] iwlwifi 0000:03:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[    3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
[    3.247759] thermal thermal_zone0: failed to read out thermal zone (-61)
[    3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
Comment 30 Frank Kruger 2020-09-25 19:44:09 UTC
Same issue and error message with kernel 5.8.11, which, however, is not surprising since the bug has been open for almost 2 years.
Comment 31 Max 2021-04-27 18:45:43 UTC
Same problem here.

Linux 5.11.16-gentoo 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz

Message clearly comes AFTER firmware is loaded and device initialized.

[    4.754547] iwlwifi 0000:6f:00.0: enabling device (0000 -> 0002)
[    4.772998] iwlwifi 0000:6f:00.0: api flags index 2 larger than supported by driver
[    4.773011] iwlwifi 0000:6f:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 93.8.63.28
[    4.773212] iwlwifi 0000:6f:00.0: loaded firmware version 59.601f3a66.0 ty-a0-gf-a0-59.ucode op_mode iwlmvm
[    4.810088] iwlwifi 0000:6f:00.0: Detected Intel(R) Wi-Fi 6 AX210 160MHz, REV=0x420
[    5.020872] iwlwifi 0000:6f:00.0: base HW address: e8:f4:08:dc:bf:4a
[    5.035333] thermal thermal_zone2: failed to read out thermal zone (-61)


Looks like it's been more than two years since it was originally reported. Oh my...
Comment 32 Frank Kruger 2021-04-27 20:50:17 UTC
(In reply to Zhang Rui from comment #26)
> I see.
> So let's wait for the reply from the wifi experts and see how we can have an
> upstream solution.

After almost one year of waiting for an upstream solution, do you see any chance that this issue will be fixed? Thx.
Comment 33 Daniel Lezcano 2021-04-27 22:28:28 UTC
Hi,

I'm trying to figure out what is the issue.

The firmware is missing, so reading the thermal zone temperature returns an error. Does that cause any problem other than an annoying message?

What would be the expected behavior?

 1. No thermal zone until the firmware status is running?

 2. A disabled thermal zone until the firmware status is running?
    (more code to differentiate manual disabling and initial mode)
 
 3. No trace because it is scary / annoying?

I'll try to help fixing any of this but the only wifi card I have is in my laptop and can't test on it. Does anyone know a wifi USB dongle compatible with the iwlwifi driver ?
Comment 34 Andriy Perevortkin 2021-04-27 22:43:17 UTC
Even with firmware available the same error is logged and thermal zone is missing.
I tried bundling intel firmware into the kernel via CONFIG_EXTRA_FIRMWARE=  same errors and same missing functionality.

And it has been this way for years. I upgraded the motherboard (for a different reason), same shit.

dmesg | grep -E 'iwl|thermal'
[    0.262277] thermal_sys: Registered thermal governor 'fair_share'
[    0.262277] thermal_sys: Registered thermal governor 'bang_bang'
[    0.262277] thermal_sys: Registered thermal governor 'step_wise'
[    0.262277] thermal_sys: Registered thermal governor 'user_space'
[    2.654368] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
[    2.665876] iwlwifi 0000:04:00.0: Direct firmware loaded: iwlwifi-9260-th-b0-jf-b0-46.ucode
[    2.665904] iwlwifi 0000:04:00.0: WRT: Overriding region id 0
[    2.665906] iwlwifi 0000:04:00.0: WRT: Overriding region id 1
[    2.665908] iwlwifi 0000:04:00.0: WRT: Overriding region id 2
[    2.665910] iwlwifi 0000:04:00.0: WRT: Overriding region id 3
[    2.665911] iwlwifi 0000:04:00.0: WRT: Overriding region id 4
[    2.665913] iwlwifi 0000:04:00.0: WRT: Overriding region id 6
[    2.665914] iwlwifi 0000:04:00.0: WRT: Overriding region id 8
[    2.665915] iwlwifi 0000:04:00.0: WRT: Overriding region id 9
[    2.665917] iwlwifi 0000:04:00.0: WRT: Overriding region id 10
[    2.665918] iwlwifi 0000:04:00.0: WRT: Overriding region id 11
[    2.665919] iwlwifi 0000:04:00.0: WRT: Overriding region id 15
[    2.665920] iwlwifi 0000:04:00.0: WRT: Overriding region id 16
[    2.665922] iwlwifi 0000:04:00.0: WRT: Overriding region id 18
[    2.665923] iwlwifi 0000:04:00.0: WRT: Overriding region id 19
[    2.665924] iwlwifi 0000:04:00.0: WRT: Overriding region id 20
[    2.665926] iwlwifi 0000:04:00.0: WRT: Overriding region id 21
[    2.665927] iwlwifi 0000:04:00.0: WRT: Overriding region id 28
[    2.666223] iwlwifi 0000:04:00.0: loaded firmware version 46.6f9f215c.0 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
[    2.687959] iwlwifi 0000:04:00.0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x324
[    2.694318] thermal thermal_zone0: failed to read out thermal zone (-61)
[    2.735469] iwlwifi 0000:04:00.0: base HW address: 84:fd:d1:5c:06:48
[    2.802970] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[    2.805706] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
Comment 35 Daniel Lezcano 2021-04-27 22:46:58 UTC
On 28/04/2021 00:43, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201761
> 
> --- Comment #34 from Andriy Perevortkin (irherder@gmail.com) ---
> Even with firmware available the same error is logged and thermal zone is
> missing.
> I tried bundling intel firmware into the kernel via CONFIG_EXTRA_FIRMWARE= 
> same errors and same missing functionality.
> 
> And it has been this way for years. I upgraded the motherboard (for a
> different
> reason), same shit.
> 
> dmesg | grep -E 'iwl|thermal'
> [    0.262277] thermal_sys: Registered thermal governor 'fair_share'
> [    0.262277] thermal_sys: Registered thermal governor 'bang_bang'
> [    0.262277] thermal_sys: Registered thermal governor 'step_wise'
> [    0.262277] thermal_sys: Registered thermal governor 'user_space'
> [    2.654368] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
> [    2.665876] iwlwifi 0000:04:00.0: Direct firmware loaded:
> iwlwifi-9260-th-b0-jf-b0-46.ucode
> [    2.665904] iwlwifi 0000:04:00.0: WRT: Overriding region id 0
> [    2.665906] iwlwifi 0000:04:00.0: WRT: Overriding region id 1
> [    2.665908] iwlwifi 0000:04:00.0: WRT: Overriding region id 2
> [    2.665910] iwlwifi 0000:04:00.0: WRT: Overriding region id 3
> [    2.665911] iwlwifi 0000:04:00.0: WRT: Overriding region id 4
> [    2.665913] iwlwifi 0000:04:00.0: WRT: Overriding region id 6
> [    2.665914] iwlwifi 0000:04:00.0: WRT: Overriding region id 8
> [    2.665915] iwlwifi 0000:04:00.0: WRT: Overriding region id 9
> [    2.665917] iwlwifi 0000:04:00.0: WRT: Overriding region id 10
> [    2.665918] iwlwifi 0000:04:00.0: WRT: Overriding region id 11
> [    2.665919] iwlwifi 0000:04:00.0: WRT: Overriding region id 15
> [    2.665920] iwlwifi 0000:04:00.0: WRT: Overriding region id 16
> [    2.665922] iwlwifi 0000:04:00.0: WRT: Overriding region id 18
> [    2.665923] iwlwifi 0000:04:00.0: WRT: Overriding region id 19
> [    2.665924] iwlwifi 0000:04:00.0: WRT: Overriding region id 20
> [    2.665926] iwlwifi 0000:04:00.0: WRT: Overriding region id 21
> [    2.665927] iwlwifi 0000:04:00.0: WRT: Overriding region id 28
> [    2.666223] iwlwifi 0000:04:00.0: loaded firmware version 46.6f9f215c.0
> 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
> [    2.687959] iwlwifi 0000:04:00.0: Detected Intel(R) Wireless-AC 9260
> 160MHz,
> REV=0x324
> [    2.694318] thermal thermal_zone0: failed to read out thermal zone (-61)

That appears at boot time, but if you read the content of
/sys/class/thermal/thermal_zone0/temp on the command line, does it give
the same error or the temperature?


> [    2.735469] iwlwifi 0000:04:00.0: base HW address: 84:fd:d1:5c:06:48
> [    2.802970] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
> [    2.805706] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
>
Comment 36 Andriy Perevortkin 2021-04-27 22:58:02 UTC
cat /sys/class/thermal/thermal_zone0/temp
cat: /sys/class/thermal/thermal_zone0/temp: No data available

sensors | grep iwlw -A 2
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A
Comment 37 Daniel Lezcano 2021-04-27 23:33:26 UTC
On 28/04/2021 00:58, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201761
> 
> --- Comment #36 from Andriy Perevortkin (irherder@gmail.com) ---
> cat /sys/class/thermal/thermal_zone0/temp
> cat: /sys/class/thermal/thermal_zone0/temp: No data available
> 
> sensors | grep iwlw -A 2
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1:            N/A
> 

Thanks for the extra information.

I definitely need some hardware to investigate the problem.
Comment 38 Stan King 2021-04-27 23:56:02 UTC
No change with this behavior on my hardware.  In fact, I recently got an Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem with recent kernels: error during start-up, but reads the temperature OK later with lm_sensors.

As you can see from comments 24 and 26, Zhang Rui had worked out a possible solution, but needed to check it with some other people.
Comment 39 Max 2021-04-28 05:08:43 UTC
$ cat /sys/class/thermal/thermal_zone0/temp
27800

$ sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A
Comment 40 Daniel Lezcano 2021-04-28 09:00:06 UTC

(In reply to Max from comment #39)
> $ cat /sys/class/thermal/thermal_zone0/temp
> 27800
> 
> $ sensors
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1:            N/A

Are you sure both are referring to the same sensor ?
Comment 41 Daniel Lezcano 2021-04-28 09:11:00 UTC
(In reply to Stan King from comment #38)
> No change with this behavior on my hardware.  In fact, I recently got an
> Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem
> with recent kernels: error during start-up, but reads the temperature OK
> later with lm_sensors.
> 
> As you can see from comments 24 and 26, Zhang Rui had worked out a possible
> solution, but needed to check it with some other people.

Yes, I saw the thread. But actually, we could create the thermal zone only after the firmware has successfully finished its setup. That would guarantee the sensor is operational for the thermal framework, without a parallel initialization leading to this message.

On the other hand, there is a REGULAR_UCODE property for the driver which depends on the device_family. If it is not set, then get_temp will also fail.

In the first case, the message appears because of an asynchronous initialization of the firmware.

In the second case, the device does not belong to a device family with the REGULAR_UCODE property.

The code snippet:

static int iwl_mvm_tzone_get_temp(...)
{

   [ ... ]

        if (!iwl_mvm_firmware_running(mvm) ||
            mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
                ret = -ENODATA;
                goto out;
        }

  [ ... ]
}
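
To make the two failure cases concrete, here is a toy Python model of the guard in the snippet above. It is only a sketch: the function name `tzone_get_temp`, the string constants and the `sensor_millicelsius` argument are made up for illustration and are not the driver's API.

```python
import errno

# Stand-ins for the driver's firmware image types (illustrative values only).
IWL_UCODE_REGULAR = "regular"
IWL_UCODE_INIT = "init"


def tzone_get_temp(firmware_running, cur_fw_img, sensor_millicelsius):
    """Toy model of the guard: return (ret, temperature)."""
    # Case 1: firmware not (yet) running. Case 2: not the regular image.
    if not firmware_running or cur_fw_img != IWL_UCODE_REGULAR:
        return -errno.ENODATA, None
    return 0, sensor_millicelsius


# Firmware still loading: the read fails with -ENODATA (-61 on Linux).
print(tzone_get_temp(False, IWL_UCODE_REGULAR, 55000))
# Firmware running, but not the regular image: same failure.
print(tzone_get_temp(True, IWL_UCODE_INIT, 55000))
# Firmware running with the regular image: the read succeeds.
print(tzone_get_temp(True, IWL_UCODE_REGULAR, 55000))
```

Either condition alone is enough to produce the -ENODATA result, which is why the boot-time message and a permanent "No data available" look identical from userspace.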
Comment 42 Max 2021-04-28 09:33:57 UTC
(In reply to Daniel Lezcano from comment #40)
> 
> (In reply to Max from comment #39)
> > $ cat /sys/class/thermal/thermal_zone0/temp
> > 27800
> > 
> > $ sensors
> > iwlwifi_1-virtual-0
> > Adapter: Virtual device
> > temp1:            N/A
> 
> Are you sure both are referring to the same sensor ?

You know what? You're right. My bad. This is the correct zone.

[    3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone2/temp
cat: /sys/class/thermal/thermal_zone2/temp: No data available
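
Since the zone numbering differs from machine to machine, a small script can help pick the right zone before comparing against `sensors`. This is a minimal sketch, assuming the wifi zone's `type` file starts with `iwlwifi` (as the `sensors` adapter name suggests); the function name and the `root` parameter are hypothetical.

```python
from pathlib import Path


def find_zones_by_type(prefix="iwlwifi", root="/sys/class/thermal"):
    """Return {zone_name: type_string} for zones whose type starts with prefix."""
    matches = {}
    for type_file in sorted(Path(root).glob("thermal_zone*/type")):
        try:
            zone_type = type_file.read_text().strip()
        except OSError:
            # The zone disappeared or is unreadable; skip it.
            continue
        if zone_type.startswith(prefix):
            matches[type_file.parent.name] = zone_type
    return matches
```

Calling `find_zones_by_type()` with the defaults scans the live sysfs tree; passing a different `root` makes it easy to try against a copied directory.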
Comment 43 Zhang Rui 2021-04-28 13:13:18 UTC
(In reply to Daniel Lezcano from comment #41)
> (In reply to Stan King from comment #38)
> > No change with this behavior on my hardware.  In fact, I recently got an
> > Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem
> > with recent kernels: error during start-up, but reads the temperature OK
> > later with lm_sensors.
> > 
> > As you can see from comments 24 and 26, Zhang Rui had worked out a possible
> > solution, but needed to check it with some other people.
> 
> Yes, I saw the thread. But actually, we may create the thermal zone after
> the firmware did successfully finish its setup. So there will be the
> guarantee the sensor is operational for the thermal framework without a
> parallel initialization leading to this message.
> 
> On the other side, there is a REGULAR_UCODE property for the driver which
> depends on the device_family. If it is not set, then the get_temp will fail
> also.
> 
> For the first case, the message appears because of an asynchronous
> initialization o the firmware.
> 
> For the second case, the device does not belong to the device family with
> the REGULAR_UCODE family.
> 
> The code snippet:
> 
> static int iwl_mvm_tzone_get_temp(...)
> {
> 
>    [ ... ]
> 
>         if (!iwl_mvm_firmware_running(mvm) ||
>             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
>                 ret = -ENODATA;
>                 goto out;
>         }
> 
>   [ ... ]
> }

My previous impression was that it still fails even at runtime, because iwl_mvm_firmware_running(mvm) returns false when the firmware has been unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
I have a machine that can reproduce the error.
Let me check what the problem is and if I can refresh my patch based on the findings.
Comment 44 Daniel Lezcano 2021-04-28 14:26:04 UTC
(In reply to Zhang Rui from comment #43)
> (In reply to Daniel Lezcano from comment #41)
> > (In reply to Stan King from comment #38)
> > > No change with this behavior on my hardware.  In fact, I recently got an
> > > Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same
> problem
> > > with recent kernels: error during start-up, but reads the temperature OK
> > > later with lm_sensors.
> > > 
> > > As you can see from comments 24 and 26, Zhang Rui had worked out a
> possible
> > > solution, but needed to check it with some other people.
> > 
> > Yes, I saw the thread. But actually, we may create the thermal zone after
> > the firmware did successfully finish its setup. So there will be the
> > guarantee the sensor is operational for the thermal framework without a
> > parallel initialization leading to this message.
> > 
> > On the other side, there is a REGULAR_UCODE property for the driver which
> > depends on the device_family. If it is not set, then the get_temp will fail
> > also.
> > 
> > For the first case, the message appears because of an asynchronous
> > initialization o the firmware.
> > 
> > For the second case, the device does not belong to the device family with
> > the REGULAR_UCODE family.
> > 
> > The code snippet:
> > 
> > static int iwl_mvm_tzone_get_temp(...)
> > {
> > 
> >    [ ... ]
> > 
> >         if (!iwl_mvm_firmware_running(mvm) ||
> >             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> >                 ret = -ENODATA;
> >                 goto out;
> >         }
> > 
> >   [ ... ]
> > }
> 
> My previous impression is that even at runtime, it still fails because
> iwl_mvm_firmware_running(mvm) returns false because the firmware can be
> unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
> I have a machine that can reproduce the error.
> Let me check what the problem is and if I can refresh my patch based on the
> findings.

Wouldn't it make sense to call iwl_mvm_thermal_initialize() in the firmware callback after the setup is complete (perhaps at the end of the iwl_mvm_load_ucode_wait_alive() function?) instead of creating the thermal zone unconditionally?
Comment 45 Daniel Lezcano 2021-04-28 19:53:25 UTC
(In reply to Max from comment #42)
> (In reply to Daniel Lezcano from comment #40)
> > 
> > (In reply to Max from comment #39)
> > > $ cat /sys/class/thermal/thermal_zone0/temp
> > > 27800
> > > 
> > > $ sensors
> > > iwlwifi_1-virtual-0
> > > Adapter: Virtual device
> > > temp1:            N/A
> > 
> > Are you sure both are referring to the same sensor ?
> 
> You know what? You're right. My bad. This is the correct zone.
> 
> [    3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)
> 
> $ cat /sys/class/thermal/thermal_zone2/temp
> cat: /sys/class/thermal/thermal_zone2/temp: No data available

Thanks for confirming
Comment 46 Stan King 2021-04-28 22:53:27 UTC
(In reply to Daniel Lezcano from comment #45)
> (In reply to Max from comment #42)
> > (In reply to Daniel Lezcano from comment #40)
> > > 
> > > (In reply to Max from comment #39)
> > > > $ cat /sys/class/thermal/thermal_zone0/temp
> > > > 27800
> > > > 
> > > > $ sensors
> > > > iwlwifi_1-virtual-0
> > > > Adapter: Virtual device
> > > > temp1:            N/A
> > > 
> > > Are you sure both are referring to the same sensor ?
> > 
> > You know what? You're right. My bad. This is the correct zone.
> > 
> > [    3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)
> > 
> > $ cat /sys/class/thermal/thermal_zone2/temp
> > cat: /sys/class/thermal/thermal_zone2/temp: No data available
> 
> Thanks for confirming

On the other hand, in my situation:

[    7.298689] thermal thermal_zone4: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone4/temp
55000

So I am getting a successful read after start-up.  55C sounds about right, as it's a fanless system.
Comment 47 Zhang Rui 2021-04-29 03:29:21 UTC
(In reply to Daniel Lezcano from comment #44)
> (In reply to Zhang Rui from comment #43)
> > (In reply to Daniel Lezcano from comment #41)
> > > (In reply to Stan King from comment #38)
> > My previous impression is that even at runtime, it still fails because
> > iwl_mvm_firmware_running(mvm) returns false because the firmware can be
> > unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
> > I have a machine that can reproduce the error.
> > Let me check what the problem is and if I can refresh my patch based on the
> > findings.
> 
> Would'nt make sense to call iwl_mvm_thermal_initialize() in the firmware
> callback after the setup is complete (perhaps at the enf of the
> iwl_mvm_load_ucode_wait_alive() function?) instead of creating
> uncondionnally the thermal zone ?

I think that is because the firmware may get loaded/unloaded at runtime, at least one code path I can confirm is
__iwl_mvm_suspend -> iwl_mvm_netdetect_config -> iwl_mvm_switch_to_d3 ->iwl_mvm_stop_device -> clear_bit(IWL_MVM_STATUS_FIRMWARE_RUNNING, &mvm->status)

iwl_mvm_firmware_running() just checks the IWL_MVM_STATUS_FIRMWARE_RUNNING bit, and this results in the failure of iwl_mvm_tzone_get_temp(), which is the thermal .get_temp callback.

I don't know how often the firmware loading/unloading happens, but registering/unregistering the thermal zone upon firmware load/unload is relatively expensive, and its thermal zone device node becomes inconsistent.
That is why I preferred to disable/enable the thermal zone instead of register/unregister in the original proposal.
Comment 48 Zhang Rui 2021-04-30 04:24:54 UTC
Hi, Daniel,

is there any specific reason we can not register a "disabled" thermal zone?
Comment 49 Daniel Lezcano 2021-04-30 07:49:50 UTC
Hi Rui,

IMO, we can register a disabled thermal zone if its initialization is complete and we can set it enabled.

Here the initialization failed (and AFAICT there is nothing preventing us from setting it 'enabled').

The function thermal_zone_device_register() registers a sensor but in the iwlwifi case there is no guarantee such sensor exists because of the firmware.

That is why I think the driver is not doing the right thing: it should take care of registering/unregistering the thermal zone when the sensor (aka firmware code) can operate, otherwise we create an empty sensor device, which is wrong. The error makes this clear: thermal_zone_device_register() calls thermal_zone_device_update() at its end, and that update fails because the thermal zone is registered before the sensor is initialized.

In addition, userspace programs may not be aware of the thermal zone mode and continue reading the temp file with the same ENODATA error. Especially when they read the temp from /sys/class/hwmon where the disabled state is not available.
Comment 50 Zhang Rui 2021-05-06 07:20:54 UTC
(In reply to Daniel Lezcano from comment #49)
> This is clearly spotted by the error happening at the
> end of the thermal_zone_device_register() function which calls
> thermal_zone_device_update() at the end: the thermal zone is registered
> before the sensor is initialized.

thermal_zone_device_update() can handle a disabled thermal zone now. So if we flag the thermal zone as disabled during registration, thermal_zone_device_update() is a no-op.

This just means it is doable technically.
But let's understand why the current thermal APIs (register/unregister) can not fit the current problem first.

> That is the reason why I think the driver is not doing the right thing and
> should take care of registering/unregistering the thermal zone when the
> sensor (aka firmware code) can operate, otherwise we create an empty sensor
> device which is wrong.

Okay, I found the previous conversation with Luciano Coelho, the iwlwifi maintainer.

"This issue has been known by us for a while now and we also had users complain about it, but at the time there was nothing we could do. The reason for registering before we can actually provide the temperature is because the wifi interface may go up and down many times and we didn't want the userspace to keep having to set values again."

Luca,
can you give more details about what the userspace does for the iwlwifi thermal zone, and how often the iwlwifi interface becomes available/unavailable?

We want to fully understand the drawbacks of doing thermal register/unregister.

> 
> In addition, userspace programs may not be aware of the thermal zone mode
> and continue reading the temp file with the same ENODATA error. Especially
> when they read the temp from /sys/class/hwmon where the disabled state is
> not available.

I agree. We can only prevent access to a disabled thermal zone from within the kernel; accessing it via sysfs can still trigger this error.

But the kernel failure is what this bug report mainly complains about.
And if users read the temp while the wifi interface is down, they will get an error, which I don't think is a problem.
Plus, we can add the tz->mode check in sysfs attribute callbacks, and give an extra warning of "accessing-to-a-disabled-thermal-zone".
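
The trade-off discussed here (register the zone disabled versus register it only once the firmware is up) can be modelled roughly in userspace. This is a toy Python sketch, not kernel code: `ToyThermalZone`, `ToyFirmware` and the mode strings are invented names. The only behaviour it takes from this thread is that registration ends with an update, that an update on a disabled zone is a no-op, and that get_temp returns -ENODATA while the firmware is not running.

```python
import errno

ENABLED, DISABLED = "enabled", "disabled"


class ToyFirmware:
    """Stands in for the wifi firmware that backs the sensor."""

    def __init__(self):
        self.running = False
        self.temp = 55000  # millidegrees Celsius, arbitrary sample value

    def get_temp(self):
        if not self.running:
            return -errno.ENODATA, None
        return 0, self.temp


class ToyThermalZone:
    """Userspace toy model, not the kernel's thermal zone device."""

    def __init__(self, name, get_temp, mode=ENABLED):
        self.name = name
        self.get_temp = get_temp  # callback returning (ret, millicelsius)
        self.mode = mode
        self.log = []             # stands in for dmesg
        self.update()             # registration ends with an update

    def update(self):
        if self.mode == DISABLED:
            return                # no-op for a disabled zone
        ret, _ = self.get_temp()
        if ret < 0:
            self.log.append(
                f"thermal {self.name}: failed to read out thermal zone ({ret})")

    def read_temp_attr(self):
        """Model of a userspace read of the temp attribute (no mode check)."""
        ret, temp = self.get_temp()
        if ret < 0:
            raise OSError(-ret, "sensor not ready")
        return temp


fw = ToyFirmware()
early = ToyThermalZone("thermal_zone4", fw.get_temp)            # logs the error
quiet = ToyThermalZone("thermal_zone4", fw.get_temp, DISABLED)  # logs nothing
fw.running = True
quiet.mode = ENABLED
quiet.update()
print(early.log, quiet.log, quiet.read_temp_attr())
```

In this model the disabled registration silences the boot-time message, but a read through `read_temp_attr()` while the firmware is down still fails, which is the userspace concern raised above.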
Comment 51 Daniel Lezcano 2021-05-31 08:28:57 UTC
(In reply to Zhang Rui from comment #50)

[ ... ]

> > That is the reason why I think the driver is not doing the right thing and
> > should take care of registering/unregistering the thermal zone when the
> > sensor (aka firmware code) can operate, otherwise we create an empty sensor
> > device which is wrong.
> 
> Okay, I found the previous conversation with Luciano Coelho, the iwlwifi
> maintainer.
> 
> "This issue has been know by us for a while now and we also had users
> complain about it, but at the time there was nothing we could do. The reason
> for registering before we can actually provide the temperature is because
> the wifi interface may go up and down many times and we didn't want the
> userspace to keep having to set values again."

That is not a kernel problem IMHO. The ifup / ifdown scripts can cleanly handle the configuration, right?

On the other side, the get_temp may never operate as mentioned in a previous comment and the thermal zone is there.

From my POV, the bug falls under the wireless umbrella (Drivers/Intel wireless network drivers). The driver must register when the sensor exists (firmware loaded or whatever).

This bug has been open for a long time and people are complaining that we don't fix it.

Does it make sense to move it under the 'Drivers/Intel wireless network' component and change the status to 'CONFIRMED'?
Comment 52 korg 2021-07-25 02:27:18 UTC
For what it is worth, this is affecting Lenovo laptops based on the Tiger Lake platform.

Attempting to suspend a machine that is running Debian Bullseye (LTS kernel 5.10.46) results in a forced shutdown due to incorrect temperature detection:

Before an attempted suspend:

```
$ sudo dmesg | grep 'failed to read out thermal'
kern  :warn  : [  +0.000066] thermal thermal_zone5: failed to read out thermal zone (-5)
kern  :warn  : [  +0.012993] thermal thermal_zone8: failed to read out thermal zone (-61)
kern  :warn  : [  +0.000010] thermal thermal_zone5: failed to read out thermal zone (-5)
$ sudo cat /sys/class/thermal/thermal_zone5/temp 
cat: /sys/class/thermal/thermal_zone5/temp: Input/output error
```

After suspend, I am seeing the following, albeit briefly:

```
[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C), shutting down

message from syslog@laptop at Jul 24 00:00:00
kernel:[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C), shutting down
```

This happens while the highest temperature reported by lm-sensors is about 30 °C. I highly doubt that the 128 °C reading above is accurate.
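
The two codes in the log above are different errors, so it can help to decode them before assuming they share a cause. A minimal sketch, assuming the usual Linux errno numbering (5 is EIO, 61 is ENODATA); the function name and the small lookup table are only for illustration.

```python
import re

# Linux errno values that appear in this report, with the text `cat` prints.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    61: ("ENODATA", "No data available"),
}

_LINE = re.compile(
    r"thermal (thermal_zone\d+): failed to read out thermal zone \((-?\d+)\)")


def decode_thermal_errors(dmesg_text):
    """Return a list of (zone, errno_name, message), one per matching log line."""
    results = []
    for match in _LINE.finditer(dmesg_text):
        zone = match.group(1)
        code = abs(int(match.group(2)))
        name, message = LINUX_ERRNO.get(code, (f"errno {code}", "unknown"))
        results.append((zone, name, message))
    return results
```

Fed the lines above, this reports EIO for thermal_zone5 and ENODATA for thermal_zone8, so only the zone8 message matches the -61 pattern this report is about.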
Comment 53 Daniel Lezcano 2021-08-13 20:21:21 UTC
(In reply to korg from comment #52)
> For what it is worth, this is affecting Lenovo laptops based on TigerLake
> platform.
> 
> Attempting to suspend a machine that is running Debian Bullseye (LTS kernel
> 5.10.46) results in a forced shutdown due to incorrect temperature detection:
> 
> Before an attempted suspend:
> 
> ```
> $ sudo dmesg | grep 'failed to read out thermal'
> kern  :warn  : [  +0.000066] thermal thermal_zone5: failed to read out
> thermal zone (-5)
> kern  :warn  : [  +0.012993] thermal thermal_zone8: failed to read out
> thermal zone (-61)
> kern  :warn  : [  +0.000010] thermal thermal_zone5: failed to read out
> thermal zone (-5)
> $ sudo cat /sys/class/thermal/thermal_zone5/temp 
> cat: /sys/class/thermal/thermal_zone5/temp: Input/output error
> ```

What are the different thermal zones (4,5 and 8) ?


> After suspend, I am seeing the following, albeit briefly:
> 
> ```
> [ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C),
> shutting down
> 
> message from syslog@laptop at Jul 24 00:00:00
> kernel:[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached
> (128 C), shutting down
> ```
> 
> This is when highest temperature registered by lm-sensors is about 30c. I
> highly doubt that 128c reading above is accurate.
Comment 54 Marcel Ziswiler 2021-11-17 01:34:46 UTC
I still get this on a first-generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO 4750U based) running 5.14.17-301.fc35.x86_64:

Nov 16 15:16:57 fedora kernel: thermal thermal_zone0: failed to read out thermal zone (-61)
Comment 55 Tom 2022-03-24 12:42:07 UTC
I'm having the same issue. Can I provide any logs to help debug further?

Lenovo Thinkpad X1 Carbon 7th Gen (Type 20QDCT01WW)


$ uname -a
Linux pop-os 5.16.11-76051611-generic #202202230823~1646248261~21.10~2b22243 SMP PREEMPT Wed Mar 2 20: x86_64 x86_64 x86_64 GNU/Linux


$ sensors
...
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A  
...


/var/log/kern.log
...
Mar 23 21:49:04 pop-os kernel: [135133.290276] thermal thermal_zone5: failed to read out thermal zone (-61)
Mar 23 21:49:04 pop-os kernel: [135133.325120] PM: suspend exit
...
Comment 56 Tom 2022-03-24 16:25:41 UTC
I should've mentioned - I experience this issue regularly, often multiple times a day. All userspace applications are terminated on wake from sleep.
Comment 57 Daniel Lezcano 2022-03-24 21:26:19 UTC
(In reply to Tom from comment #55)

Thanks for reporting.

I'm really willing to fix this issue but I don't have the hardware to reproduce it.

What is this *virtual* wifi driver iwlwifi_1-virtual-0 ?

> I'm having the same issue. Can I provide any logs to help debug further?
> 
> Lenovo Thinkpad X1 Carbon 7th Gen (Type 20QDCT01WW)
> 
> 
> $ uname -a
> Linux pop-os 5.16.11-76051611-generic #202202230823~1646248261~21.10~2b22243
> SMP PREEMPT Wed Mar 2 20: x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> $ sensors
> ...
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1:            N/A  
> ...
> 
> 
> /var/log/kern.log
> ...
> Mar 23 21:49:04 pop-os kernel: [135133.290276] thermal thermal_zone5: failed
> to read out thermal zone (-61)
> Mar 23 21:49:04 pop-os kernel: [135133.325120] PM: suspend exit
> ...
Comment 58 Tom 2022-03-25 07:58:28 UTC
I'm not sure - I had a look in /etc/sensors3.conf and /etc/sensors.d to try and learn more but couldn't find anything. If you can suggest how I can get more info about it, I'm happy to report back.
Comment 59 Zhang Rui 2022-06-21 02:00:08 UTC
(In reply to Marcel Ziswiler from comment #54)
> I still get this on a first-generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO
> 4750U based) running 5.14.17-301.fc35.x86_64:
> 
> Nov 16 15:16:57 fedora kernel: thermal thermal_zone0: failed to read out
> thermal zone (-61)

What is the output of cat /sys/class/thermal/thermal_zone0/type?
Can you please attach the full dmesg output after boot?
Comment 60 Stan King 2022-06-23 00:05:51 UTC
Zhang, although I'm not Marcel Ziswiler, I'm getting this on three different systems, and in each case, the contents of the relevant thermal_zoneN/type is iwlwifi_1.

I'll try to attach the dmesg output from the boot of the system with an Intel 3165 WiFi circuit.  It looks like that will end up as a separate comment.  I apologize for my unfamiliarity with this forum environment.

The other systems have Intel 7265 and AX200 WiFi, so let me know if anything from those systems would be useful to your debug efforts.
Comment 61 Stan King 2022-06-23 00:09:07 UTC
Created attachment 301261 [details]
dmesg output showing -61 error

This file was from a Fedora 35 system, kernel 5.18.5-100.fc35.x86_64.

The WiFi circuit was Intel 3165.

The error was "thermal thermal_zone4: failed to read out thermal zone (-61)".

The output of "cat /sys/class/thermal/thermal_zone4/type" was iwlwifi_1
Comment 62 Erik Quaeghebeur 2022-06-23 07:45:36 UTC
Lenovo T14 Gen 1 AMD (Ryzen 4750U) here with kernel 5.15.41.

From log:

kernel: thermal thermal_zone1: failed to read out thermal zone (-61)

The asked output:

$ cat /sys/class/thermal/thermal_zone1/type
iwlwifi_1

Wifi card:

$ lspci | grep Wi-Fi
03:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
Comment 63 Zhang Rui 2022-06-24 01:10:10 UTC
I want to confirm whether it is caused by the same device/reason on AMD platforms.

This is a known issue: the kernel reads the wifi thermal zone temperature while the wifi device firmware is not ready.
It is harmless but annoying, and I have proposed a solution but failed to get it upstream.

Let me bring this for discussion again in the community.
Comment 64 Daniel Lezcano 2022-06-24 13:37:46 UTC
(In reply to Zhang Rui from comment #63)
> I want to confirm that if it is caused by the same device/reason on AMD
> platforms.
> 
> This is a known issue that kernel is reading the wifi thermal zone
> temperature while the wifi device firmware is not ready.
> It is harmless but annoying, and I have proposed some solution but failed to
> push it upstream.
> 
> Let me bring this for discussion again in the community.

From my POV the problem is in the wifi driver, not in the thermal framework. The thermal zone should be registered when the firmware is loaded.
Comment 65 Oscar Priego 2022-08-19 08:41:57 UTC
I agree with Daniel: the problem lies in the wifi driver. It could be caused by a race condition around the thermal zone registration while the firmware is being loaded, and also by whoever wrote the driver not considering the full table of memory addresses for the thermal zones in the actual hardware.
Comment 66 Nicolás Rotunno 2022-08-24 01:04:40 UTC
Thinkpad E15, Gen2 with Intel i5
Redhat 9.0 Plow

$ dmesg 
[    9.614728] thermal thermal_zone6: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone6/type
iwlwifi_1

$ lspci | grep Wi
00:14.3 Network controller: Intel Corporation Wi-Fi 6 AX201 (rev 20)


$ uname -rv
5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022

$ sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +32.0°C  

$ cat /sys/class/thermal/thermal_zone6/temp
32000
Comment 67 Roman Mamedov 2022-11-03 21:48:33 UTC
For the record, I get the same message on boot:

  thermal thermal_zone2: failed to read out thermal zone (-61)

but the sensors output for iwlwifi is only N/A until the interface is brought up. 
After activating the WiFi interface, the temperature can be read successfully.
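
For scripts that poll the zone, it is probably worth treating "No data available" as "not ready yet" rather than as a failure. A minimal sketch, assuming the zone's `temp` file reports millidegrees Celsius as in the outputs posted here; the function name and the `opener` parameter (there only to make the function easy to try without real hardware) are hypothetical.

```python
import errno


def read_zone_celsius(temp_path, opener=open):
    """Return the temperature in degrees C, or None while the sensor has no data."""
    try:
        with opener(temp_path) as handle:
            raw = handle.read()
    except OSError as exc:
        if exc.errno == errno.ENODATA:
            # Sensor not ready, e.g. the interface is still down.
            return None
        raise
    return int(raw.strip()) / 1000.0  # sysfs value is in millidegrees Celsius
```

A caller can then retry later when it gets `None`, for example after the interface has been brought up, while any other error still propagates.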
Comment 68 Florin Lipan 2023-03-07 18:03:09 UTC
A temporary fix that worked for me was to unload the iwlwifi module before entering sleep mode and load it again when waking up. You can do this easily with systemd hooks.

First confirm that the problem originates with the iwlwifi module:

```
# This should print a type starting with iwlwifi (e.g. iwlwifi_1)
cat /sys/class/thermal/thermal_zone[NUMBER]/type
```

Then create a file under `/usr/lib/systemd/system-sleep/iwlwifi-thermal-issue` with the following contents:

```
#!/bin/sh

# Disables the wifi module before going into sleep mode to
# prevent the error `failed to read out thermal zone`.
# See https://bugzilla.kernel.org/show_bug.cgi?id=201761

# $1 is 'pre' (going to sleep) or 'post' (waking up)
# $2 is 'suspend', 'hibernate' or 'hybrid-sleep'
case "$1/$2" in
  pre/*)
    if lsmod | grep -q iwlmvm; then
      rmmod iwlmvm 
    fi
    if lsmod | grep -q iwlwifi; then
      rmmod iwlwifi
    fi
    ;;
  post/*)
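    # modprobe takes a single module name (further words are treated as
    # module parameters); loading iwlmvm pulls in iwlwifi as a dependency.
    modprobe iwlmvm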
    ;;
esac
```

Make the file executable:

```
chmod +x /usr/lib/systemd/system-sleep/iwlwifi-thermal-issue
```
Comment 69 Florin Lipan 2023-03-08 09:25:30 UTC
Actually I take that back, the only thing that works consistently is removing the wifi module explicitly before entering sleep mode:

```
sudo rmmod iwlmvm iwlwifi
```

The systemd hook only works sometimes and I haven't figured out why yet.
Comment 70 Oskar Grindemyr 2023-04-19 20:57:14 UTC
Tuxedo InfinityBook S15 - Gen6
Mint 21.1

dmesg
[    8.166878] iwlwifi 0000:38:00.0: enabling device (0000 -> 0002)
[    8.176798] iwlwifi 0000:38:00.0: api flags index 2 larger than supported by driver
[    8.176811] iwlwifi 0000:38:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[    8.177173] iwlwifi 0000:38:00.0: loaded firmware version 72.daa05125.0 cc-a0-72.ucode op_mode iwlmvm
[    8.320337] iwlwifi 0000:38:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    8.320431] thermal thermal_zone4: failed to read out thermal zone (-61)

cat /sys/class/thermal/thermal_zone4/type 
iwlwifi_1

cat /sys/class/thermal/thermal_zone4/temp 
49000

uname
5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC

sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +49.0°C
Comment 71 Frank Kruger 2023-04-19 21:04:09 UTC
(In reply to Zhang Rui from comment #63)
> I want to confirm that if it is caused by the same device/reason on AMD
> platforms.
> 
> This is a known issue that kernel is reading the wifi thermal zone
> temperature while the wifi device firmware is not ready.
> It is harmless but annoying, and I have proposed some solution but failed to
> push it upstream.
> 
> Let me bring this for discussion again in the community.

Is there any news from your side?