Bug 201761
Summary: | "failed to read out thermal zone" for wifi thermal zone | ||
---|---|---|---|
Product: | Power Management | Reporter: | 00oo00 (chenalias) |
Component: | Thermal | Assignee: | Rafael J. Wysocki (rjw) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | andreas.thalhammer, bugzilla.kernel.org, daniel.lezcano, fkrueger, irherder, jay+bko, johannespfrang+kernel, kernelorg, klaus.kusche, marcel, Maurice.Smulders, mluppov, navarro.ime, nikuito, nrndda, oneuptingera, oscar.priegov, oskar.grindemyr, rafael, rjw, rm+bko, rui.zhang, rwarsow, serg, shalev.tomer, smirandac, spiderx, stanley.king, ulf.norberg, whenov |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.18.0-10-generic | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg
grep output
dmesg from 5.0-rc1 built from https://cgit.freedesktop.org/~agd5f/linux/tree/?h=drm-next-5.1-wip
grep . /sys/class/thermal/*/*
dmesg output showing -61 error |
Please provide a description of your system and a clear statement of your problem from a functional perspective. Also let us know if it used to work before, and if so, which is the last known working version and the first known non-working version.

Also note that thermal is a different subsystem from hwmon, so this bug was filed in the wrong component.

Hi, Jean, thanks for forwarding. Reassigning to the thermal component.
I suspect thermal_zone0 is from the wifi driver.
Please attach the output of "grep . /sys/class/thermal/*/*".

ping...

I'm not the original reporter but I'm affected too. The motherboard is an MSI B450 GAMING PLUS AC with
1c:00.0 Network controller: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] (rev 10)
which is the device to which the thermal zone supposedly belongs. lm-sensors (git) reports this as:
iwlwifi-virtual-0
Adapter: Virtual device
temp1: N/A
Attached output of grep . /sys/class/thermal/*/* as requested and my dmesg.

Created attachment 281089 [details]
grep output
Created attachment 281091 [details]
dmesg from 5.0-rc1 built from https://cgit.freedesktop.org/~agd5f/linux/tree/?h=drm-next-5.1-wip

/sys/class/thermal/thermal_zone0/type:iwlwifi

Yes, it is the wifi driver.
The problem is that we read the temperature when a thermal zone is registered, but for this particular device we are not able to get the temperature because the wifi firmware is not loaded at that time.
AFAICS, this is on my TODO list, and I will propose the capability to register a disabled thermal zone so that the thermal framework will not try to read the temperature.
I will paste the patch here when it is done.

I too see this on all my systems that use the iwlwifi driver. By the way, the message is posted about 0.2 to 0.3 seconds after the message saying that the firmware is loaded. This is dmesg output from a system running Fedora 30. Other systems have similar timing.

[ 7.207904] iwlwifi 0000:01:00.0: loaded firmware version 29.1044073957.0 op_mode iwlmvm
[ 7.263526] input: HDA Intel PCH Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input9
[ 7.263631] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[ 7.263731] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[ 7.263807] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[ 7.263882] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[ 7.263955] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
[ 7.368003] intel_rapl: Found RAPL domain package
[ 7.370868] intel_rapl: Found RAPL domain core
[ 7.373444] intel_rapl: Found RAPL domain uncore
[ 7.376064] intel_rapl: Found RAPL domain dram
[ 7.412647] Bluetooth: hci0: unexpected event for opcode 0xfc2f
[ 7.415702] iwlwifi 0000:01:00.0: Detected Intel(R) Dual Band Wireless AC 7265, REV=0x210
[ 7.432655] Bluetooth: hci0: Intel firmware patch completed and activated
[ 7.441377] iwlwifi 0000:01:00.0: base HW address: 10:02:b5:2a:43:1f
[ 7.525907] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[ 7.526792] thermal thermal_zone9: failed to read out thermal zone (-61)

Same problem here.
thermal thermal_zone1: failed to read out thermal zone (-61)
comes five lines after
iwlwifi 0000:6f:00.0: loaded firmware version 48.4fa0041f.0 op_mode iwlmvm

It has been over a year now and we got 7 kernel releases after the initial bug report. Can we hope this is going to be fixed?

Hello, I'm using Ubuntu Mate 18.04, and I'm getting this message in /var/log/syslog. I guess it's the reason my laptop freezes.

Here's the laptop and Ubuntu data:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

uname -a
Linux sergio-AsusLaptop 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
root@sergio-AsusLaptop:~#

Log:
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: The canary thread is apparently starving. Taking action.
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Demoting known real-time threads.
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 14491 of process 26854 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 12546 of process 12446 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3272 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3271 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Successfully demoted thread 3262 of process 3262 (n/a).
Feb 24 15:29:19 sergio-AsusLaptop rtkit-daemon[3263]: Demoted 5 threads.
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.689160] thermal thermal_zone1: failed to read out thermal zone (-61)
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.712771] PM: suspend exit
Feb 24 15:29:19 sergio-AsusLaptop kernel: [11779.712905] PM: suspend entry (s2idle)

Same problem here

$ uname -a
Linux 8VT19Y2 4.15.0-1073-oem #83-Ubuntu SMP Mon Feb 17 11:21:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

Log:
Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.632872] usb 1-1.3.2.2: new full-speed USB device number 18 using xhci_hcd
...
Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.713229] thermal thermal_zone8: failed to read out thermal zone (-61)
Mar 13 06:49:02 8VT19Y2 kernel: [ 1218.713241] usb 1-1.3.2.2: device descriptor read/64, error -32

Created attachment 287905 [details]
grep . /sys/class/thermal/*/*
Output of:
$ grep . /sys/class/thermal/*/* 2> /dev/null > sys_class_thermal.txt
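For anyone who would rather collect this per zone than read through the grep output, here is a minimal sketch. It is not part of any attachment: the function name `read_thermal_zones` and its `root` parameter are made up for illustration. It assumes the usual sysfs layout, where each thermal_zone* directory contains a `type` file and a `temp` file reporting millidegrees Celsius. A zone whose read fails is listed with the errno name instead of a temperature, so one failing zone does not abort the scan.

```python
import errno
import glob
import os


def read_thermal_zones(root="/sys/class/thermal"):
    """Return a list of (zone, type, temp_or_error) tuples.

    temp_or_error is an int (millidegrees Celsius) when the read succeeds,
    or the symbolic errno name (for example "ENODATA") when it fails.
    """
    results = []
    for zone_dir in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        zone = os.path.basename(zone_dir)
        try:
            with open(os.path.join(zone_dir, "type")) as f:
                zone_type = f.read().strip()
        except OSError:
            zone_type = "?"
        try:
            with open(os.path.join(zone_dir, "temp")) as f:
                temp = int(f.read().strip())
        except OSError as exc:
            # Keep the errno name so a failing zone is still listed.
            temp = errno.errorcode.get(exc.errno, str(exc.errno))
        results.append((zone, zone_type, temp))
    return results


if __name__ == "__main__":
    for zone, zone_type, temp in read_thermal_zones():
        print(f"{zone} ({zone_type}): {temp}")
```

The `root` argument exists only so the function can be pointed at a copy of the directory tree for testing; on a real system the default path should be what you want.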
Why is this still in status "NEEDINFO"? What info is still missing?

Hi, guys, sorry for the late response.
I can reproduce the issue locally, and the debug message shows that iwl_mvm_firmware_running() fails when reading the temperature during registration.
I'm not sure whether we are all running into the same issue, but it's better to have the firmware load issue fixed first and see whether it solves all the problems.
Recently, two patches were posted which make it possible to register the wifi driver as a "disabled" thermal zone and enable it later, after the firmware is loaded, so I will propose a patch to fix this issue soon.

I still don't understand why it fails to read out the thermal zone even after the firmware is reported to be loaded. Does the firmware require more time to initialize before the thermal zone can be successfully read, so there's just some handshaking missing? If so, does that explain why lm_sensors can successfully read the temperature later?

Here's what it looks like on my Fedora 30 system, kernel 5.6.7-100.fc30.x86_64:

$ dmesg|egrep 'thermal_zone4|iwl'
[ 5.503305] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[ 5.513497] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[ 5.517088] iwlwifi 0000:03:00.0: Found debug configuration: 0
[ 5.517222] iwlwifi 0000:03:00.0: loaded firmware version 29.1654887522.0 7265D-29.ucode op_mode iwlmvm
[ 5.683009] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 3165, REV=0x210
[ 5.701454] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[ 5.702520] iwlwifi 0000:03:00.0: Allocated 0x00400000 bytes for firmware monitor.
[ 5.708150] iwlwifi 0000:03:00.0: base HW address: 48:a4:72:8a:a4:f3
[ 5.768788] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[ 5.769174] thermal thermal_zone4: failed to read out thermal zone (-61)
[ 5.772690] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
[ 9.369676] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[ 9.445652] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[ 9.446745] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[ 9.475696] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[ 9.551639] iwlwifi 0000:03:00.0: Applying debug destination EXTERNAL_DRAM
[ 9.552723] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring

I'm not sure either.
For any of you who can try my current solution, please pull the thermal/linux-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git
and then apply the following patches
https://patchwork.kernel.org/patch/11506053/
https://patchwork.kernel.org/patch/11506065/
https://patchwork.kernel.org/patch/11519215/
https://patchwork.kernel.org/patch/11519219/
https://patchwork.kernel.org/patch/11519223/
https://patchwork.kernel.org/patch/11519227/
https://patchwork.kernel.org/patch/11519231/
https://patchwork.kernel.org/patch/11519235/
and see if the problem goes away.

One thing to note is that the fix patch in this series (PATCH 6/6) is just a prototype, as I'm not familiar with the iwlwifi driver. I expect some wireless expert will help improve the patch or come up with a better solution that can be targeted for upstream.

@Stan, you can try the solution above to see if the problem goes away or not. Maybe it is the same issue.

Zhang, I have to admit that I'm on the edge of being hopelessly lost trying to apply your patches. I haven't compiled a kernel for at least two years, and I've never applied git patches before.

I've done a "git clone" with the git URL that you supplied.
That seems to have populated a complete source tree.

I downloaded the .diff file from your first patch URL, the one that ends in 11506053/. Is there a way to use the URL directly, without a file download?

I tried to apply that patch with "git apply". A few of the diffs inside were matched successfully, but it failed on what would be Hunk #8 of drivers/thermal/imx_thermal.c. I think the failure was that it was looking for a block of text starting with "static int __maybe_unused imx_thermal_suspend(struct device *dev)", yet the string "__maybe_unused" does not appear in imx_thermal.c in the definition of imx_thermal_suspend.

Do the patches need to be applied in a particular order or sequence?

Assuming that I am able to successfully apply the patches, any help for how to proceed after that would be appreciated too, as the various help pages on the Internet all target slightly different situations than this. Even knowing how much file space will be consumed will be helpful. I'm using Fedora 31.

Thanks.

Here are the commands that we can use:
1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git
2. git checkout -b test thermal/linux-next
3. download the first two patches as mbox
4. git am patch_dir/*.patch
5. remove the two patches and download the next six patches as mbox
6. git am patch_dir/*.patch
That should work without any conflict.
I think you probably missed step 2.

Thank you, Zhang. I think I'm getting closer. Step 2 initially failed because it wasn't a git repository, but presumably I should have cd'ed to directory linux, so I did that. In that directory, step 2 gets the following error:

fatal: 'thermal/linux-next' is not a commit and a branch 'test' cannot be created from it

I don't know how to recover from this, so I'd appreciate any advice you could offer.

For step 2, try
git checkout -b test origin/thermal/linux-next
instead.

Zhang,

That did the trick. I have some results to report.
First the good news:

In two environments,

Intel(R) Core(TM) i7-8550U CPU + Intel Wireless 3165 (rev 81), and
AMD PRO A8-8600B R6 CPU + Intel Wireless 7265 (rev 61)

it no longer complains about the thermal zone upon start-up. The iwlwifi zone is still visible among /sys/class/thermal/*/type, and lm_sensors still claims to read out the temperature successfully.

In an environment with an Intel Core i5-2400 CPU and no Wireless, it seems to run without difficulty.

Is there any additional information you'd like me to gather from these cases that seem to work?

Now the bad news:

In a laptop with an Intel Core i5-540M CPU and Intel Centrino Advanced-N 6205 [Taylor Peak] for wireless, it halts quite early on in the boot process, I think long before it has a chance to set up thermal. I was able to capture a video by using a boot_delay=10 kernel parameter. The first sign of a problem is "divide error: 0000 [#1] SMP PTI". The next line starts with "CPU: 0 PID" and ends with the kernel version and #1. Next is a line with "RIP: 0010:arch_scale_freq_tick+0x67/0x7f". This is followed by what looks like register dumps and call traces. The first two lines of the trace are <IRQ> and schedule_tick+0x34/0x120. There's a different RIP right after that, but you get the idea. Is it possible your kernel had an experimental feature enabled that doesn't work on this old hardware?

(In reply to Stan King from comment #23)
> Zhang,
>
> That did the trick. I have some results to report. First the good news:
>
> In two environments,
>
> Intel(R) Core(TM) i7-8550U CPU + Intel Wireless 3165 (rev 81), and
> AMD PRO A8-8600B R6 CPU + Intel Wireless 7265 (rev 61)
>
> it no longer complains about the thermal zone upon start-up. The iwlwifi
> zone is still visible among /sys/class/thermal/*/type, and lm_sensors still
> claims to read out the temperature successfully.
>
> In an environment with an Intel Core i5-2400 CPU and no Wireless, it seems
> to run without difficulty.
>
> Is there any additional information you'd like me to gather from these cases
> that seem to work?

Good to know that it works. But the patch itself is a little tricky, and I'm still waiting for the wireless experts to comment on it.

>
> Now the bad news:
>
> In a laptop with an Intel Core i5-540M CPU and Intel Centrino Advanced-N
> 6205 [Taylor Peak] for wireless, it halts quite early on in the boot
> process, I think long before it has a chance to set up thermal. I was able
> to capture a video by using a boot_delay=10 kernel parameter. The first
> sign of a problem is "divide error: 0000 [#1] SMP PTI". The next line
> starts with "CPU: 0 PID" and ends with the kernel version and #1. Next is a
> line with "RIP: 0010:arch_scale_freq_tick+0x67/0x7f". This is followed by
> what looks like register dumps and call traces. The first two lines of the
> trace are <IRQ> and schedule_tick+0x34/0x120. There's a different RIP right
> after that, but you get the idea. Is it possible your kernel had an
> experimental feature enabled that doesn't work on this old hardware?

I don't think it is caused by the patches.
A simple way to verify this is to rebuild the kernel with steps 1 and 2 only, without applying the patches, and see if you hit the same problem.

Zhang, indeed it failed on the i5-540M processor even without your patches.

I see.
So let's wait for the reply from the wifi experts and see how we can have an upstream solution.

Hi everyone, I have faced the same problem. Some information about my system.
$ uname -a
Linux chivunk 5.4.52-1-MANJARO #1 SMP PREEMPT Thu Jul 16 16:07:11 UTC 2020 x86_64 GNU/Linux

$ sudo dmesg|egrep 'thermal_zone4|iwl'
[ 2.818658] iwlwifi 0000:00:14.3: enabling device (0000 -> 0002)
[ 2.826456] iwlwifi 0000:00:14.3: TLV_FW_FSEQ_VERSION: FSEQ Version: 58.3.35.22
[ 2.826461] iwlwifi 0000:00:14.3: Found debug destination: EXTERNAL_DRAM
[ 2.826463] iwlwifi 0000:00:14.3: Found debug configuration: 0
[ 2.826812] iwlwifi 0000:00:14.3: loaded firmware version 50.3e391d3e.0 op_mode iwlmvm
[ 3.077614] iwlwifi 0000:00:14.3: Detected Intel(R) Wi-Fi 6 AX201 160MHz, REV=0x354
[ 3.083978] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[ 3.084350] iwlwifi 0000:00:14.3: Allocated 0x00400000 bytes for firmware monitor.
[ 3.249901] iwlwifi 0000:00:14.3: base HW address: 40:74:e0:62:8d:60
[ 3.261793] thermal thermal_zone4: failed to read out thermal zone (-61)
[ 3.490892] iwlwifi 0000:00:14.3 wlp0s20f3: renamed from wlan0
[ 4.855150] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[ 5.021074] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[ 5.052032] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[ 5.217533] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
[ 106.829111] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
[ 106.994883] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring

Same problem.

$ uname -a
Linux hulk 5.8.0-2-amd64 #1 SMP Debian 5.8.10-1 (2020-09-19) x86_64 GNU/Linux

$ sudo dmesg|egrep 'thermal_zone4|iwl'
[ 2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[ 2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-cc-a0-56.ucode
[ 2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported by driver
[ 2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.22
[ 2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[ 2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
[ 2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0 cc-a0-56.ucode op_mode iwlmvm
[ 2.897474] iwlwifi 0000:03:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[ 3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
[ 3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0

(In reply to oneuptingera from comment #28)
> Same problem.
>
> $ uname -a
> Linux hulk 5.8.0-2-amd64 #1 SMP Debian 5.8.10-1 (2020-09-19) x86_64 GNU/Linux
>
>
> $ sudo dmesg|egrep 'thermal_zone4|iwl'
> [ 2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
> [ 2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware
> iwlwifi-cc-a0-56.ucode
> [ 2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported
> by driver
> [ 2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version:
> 89.3.35.22
> [ 2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
> [ 2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
> [ 2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0
> cc-a0-56.ucode op_mode iwlmvm
> [ 2.897474] iwlwifi 0000:03:00.0: firmware: failed to load
> iwl-debug-yoyo.bin (-2)
> [ 3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz,
> REV=0x340
> [ 3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
> [ 3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0

$ sudo dmesg|egrep 'thermal_zone0|iwl'
[ 2.890162] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[ 2.897199] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-cc-a0-56.ucode
[ 2.897205] iwlwifi 0000:03:00.0: api flags index 2 larger than supported by driver
[ 2.897214] iwlwifi 0000:03:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.22
[ 2.897216] iwlwifi 0000:03:00.0: Found debug destination: EXTERNAL_DRAM
[ 2.897217] iwlwifi 0000:03:00.0: Found debug configuration: 0
[ 2.897463] iwlwifi 0000:03:00.0: loaded firmware version 55.d9698065.0 cc-a0-56.ucode op_mode iwlmvm
[ 2.897474] iwlwifi 0000:03:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[ 3.058913] iwlwifi 0000:03:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 3.235913] iwlwifi 0000:03:00.0: base HW address: xx:xx:xx:xx:xx:xx
[ 3.247759] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 3.249489] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0

Same issue and error message with kernel 5.8.11, which, however, is not surprising since the bug has been open for almost 2 years.

Same problem here.

Linux 5.11.16-gentoo
11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz

The message clearly comes AFTER the firmware is loaded and the device is initialized.

[ 4.754547] iwlwifi 0000:6f:00.0: enabling device (0000 -> 0002)
[ 4.772998] iwlwifi 0000:6f:00.0: api flags index 2 larger than supported by driver
[ 4.773011] iwlwifi 0000:6f:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 93.8.63.28
[ 4.773212] iwlwifi 0000:6f:00.0: loaded firmware version 59.601f3a66.0 ty-a0-gf-a0-59.ucode op_mode iwlmvm
[ 4.810088] iwlwifi 0000:6f:00.0: Detected Intel(R) Wi-Fi 6 AX210 160MHz, REV=0x420
[ 5.020872] iwlwifi 0000:6f:00.0: base HW address: e8:f4:08:dc:bf:4a
[ 5.035333] thermal thermal_zone2: failed to read out thermal zone (-61)

Looks like it's been more than two years since it was originally reported. Oh my...

(In reply to Zhang Rui from comment #26)
> I see.
> So let's wait for the reply from the wifi experts and see how we can have an
> upstream solution.

After almost one year of waiting for an upstream solution, do you see any chance that this issue will be fixed? Thx.

Hi,

I'm trying to figure out what the issue is.

The firmware is missing, so reading the thermal zone temperature returns an error.

In what way does that cause an issue, apart from an annoying message?

What would be the expected behavior?

1. No thermal zone until the firmware has the status running?

2. A disabled thermal zone until the firmware has the status running? (more code to differentiate manual disabling and initial mode)

3. No trace because it is scary / annoying?

I'll try to help fix any of this, but the only wifi card I have is in my laptop and I can't test on it. Does anyone know a wifi USB dongle compatible with the iwlwifi driver?

Even with firmware available the same error is logged and thermal zone is missing.
I tried bundling intel firmware into the kernel via CONFIG_EXTRA_FIRMWARE= with the same errors and the same missing functionality.

And it has been this way for years. I upgraded the motherboard (for a different reason), same shit.

dmesg | grep -E 'iwl|thermal'
[ 0.262277] thermal_sys: Registered thermal governor 'fair_share'
[ 0.262277] thermal_sys: Registered thermal governor 'bang_bang'
[ 0.262277] thermal_sys: Registered thermal governor 'step_wise'
[ 0.262277] thermal_sys: Registered thermal governor 'user_space'
[ 2.654368] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
[ 2.665876] iwlwifi 0000:04:00.0: Direct firmware loaded: iwlwifi-9260-th-b0-jf-b0-46.ucode
[ 2.665904] iwlwifi 0000:04:00.0: WRT: Overriding region id 0
[ 2.665906] iwlwifi 0000:04:00.0: WRT: Overriding region id 1
[ 2.665908] iwlwifi 0000:04:00.0: WRT: Overriding region id 2
[ 2.665910] iwlwifi 0000:04:00.0: WRT: Overriding region id 3
[ 2.665911] iwlwifi 0000:04:00.0: WRT: Overriding region id 4
[ 2.665913] iwlwifi 0000:04:00.0: WRT: Overriding region id 6
[ 2.665914] iwlwifi 0000:04:00.0: WRT: Overriding region id 8
[ 2.665915] iwlwifi 0000:04:00.0: WRT: Overriding region id 9
[ 2.665917] iwlwifi 0000:04:00.0: WRT: Overriding region id 10
[ 2.665918] iwlwifi 0000:04:00.0: WRT: Overriding region id 11
[ 2.665919] iwlwifi 0000:04:00.0: WRT: Overriding region id 15
[ 2.665920] iwlwifi 0000:04:00.0: WRT: Overriding region id 16
[ 2.665922] iwlwifi 0000:04:00.0: WRT: Overriding region id 18
[ 2.665923] iwlwifi 0000:04:00.0: WRT: Overriding region id 19
[ 2.665924] iwlwifi 0000:04:00.0: WRT: Overriding region id 20
[ 2.665926] iwlwifi 0000:04:00.0: WRT: Overriding region id 21
[ 2.665927] iwlwifi 0000:04:00.0: WRT: Overriding region id 28
[ 2.666223] iwlwifi 0000:04:00.0: loaded firmware version 46.6f9f215c.0 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
[ 2.687959] iwlwifi 0000:04:00.0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x324
[ 2.694318] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 2.735469] iwlwifi 0000:04:00.0: base HW address: 84:fd:d1:5c:06:48
[ 2.802970] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[ 2.805706] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0

On 28/04/2021 00:43, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201761
>
> --- Comment #34 from Andriy Perevortkin (irherder@gmail.com) ---
> Even with firmware available the same error is logged and thermal zone is
> missing.
> I tried bundling intel firmware into the kernel via CONFIG_EXTRA_FIRMWARE=
> same errors and same missing functionality.
>
> And it has been this way for years. I upgraded the motherboard (for a
> different
> reason), same shit.
>
> dmesg | grep -E 'iwl|thermal'
> [ 0.262277] thermal_sys: Registered thermal governor 'fair_share'
> [ 0.262277] thermal_sys: Registered thermal governor 'bang_bang'
> [ 0.262277] thermal_sys: Registered thermal governor 'step_wise'
> [ 0.262277] thermal_sys: Registered thermal governor 'user_space'
> [ 2.654368] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
> [ 2.665876] iwlwifi 0000:04:00.0: Direct firmware loaded:
> iwlwifi-9260-th-b0-jf-b0-46.ucode
> [ 2.665904] iwlwifi 0000:04:00.0: WRT: Overriding region id 0
> [ 2.665906] iwlwifi 0000:04:00.0: WRT: Overriding region id 1
> [ 2.665908] iwlwifi 0000:04:00.0: WRT: Overriding region id 2
> [ 2.665910] iwlwifi 0000:04:00.0: WRT: Overriding region id 3
> [ 2.665911] iwlwifi 0000:04:00.0: WRT: Overriding region id 4
> [ 2.665913] iwlwifi 0000:04:00.0: WRT: Overriding region id 6
> [ 2.665914] iwlwifi 0000:04:00.0: WRT: Overriding region id 8
> [ 2.665915] iwlwifi 0000:04:00.0: WRT: Overriding region id 9
> [ 2.665917] iwlwifi 0000:04:00.0: WRT: Overriding region id 10
> [ 2.665918] iwlwifi 0000:04:00.0: WRT: Overriding region id 11
> [ 2.665919] iwlwifi 0000:04:00.0: WRT: Overriding region id 15
> [ 2.665920] iwlwifi 0000:04:00.0: WRT: Overriding region id 16
> [ 2.665922] iwlwifi 0000:04:00.0: WRT: Overriding region id 18
> [ 2.665923] iwlwifi 0000:04:00.0: WRT: Overriding region id 19
> [ 2.665924] iwlwifi 0000:04:00.0: WRT: Overriding region id 20
> [ 2.665926] iwlwifi 0000:04:00.0: WRT: Overriding region id 21
> [ 2.665927] iwlwifi 0000:04:00.0: WRT: Overriding region id 28
> [ 2.666223] iwlwifi 0000:04:00.0: loaded firmware version 46.6f9f215c.0
> 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
> [ 2.687959] iwlwifi 0000:04:00.0: Detected Intel(R) Wireless-AC 9260
> 160MHz,
> REV=0x324
> [ 2.694318] thermal thermal_zone0: failed to read out thermal zone (-61)

That appears at boot time, but if you read the content of /sys/class/thermal/thermal_zone0/temp on the command line, does it give the same error or the temperature?

> [ 2.735469] iwlwifi 0000:04:00.0: base HW address: 84:fd:d1:5c:06:48
> [ 2.802970] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
> [ 2.805706] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
>

cat /sys/class/thermal/thermal_zone0/temp
cat: /sys/class/thermal/thermal_zone0/temp: No data available

sensors | grep iwlw -A 2
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A

On 28/04/2021 00:58, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201761
>
> --- Comment #36 from Andriy Perevortkin (irherder@gmail.com) ---
> cat /sys/class/thermal/thermal_zone0/temp
> cat: /sys/class/thermal/thermal_zone0/temp: No data available
>
> sensors | grep iwlw -A 2
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1: N/A
>

Thanks for the extra information. I definitely need some hardware to investigate the problem.

No change with this behavior on my hardware. In fact, I recently got an Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem with recent kernels: error during start-up, but reads the temperature OK later with lm_sensors.

As you can see from comments 24 and 26, Zhang Rui had worked out a possible solution, but needed to check it with some other people.
$ cat /sys/class/thermal/thermal_zone0/temp
27800

$ sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A

(In reply to Max from comment #39)
> $ cat /sys/class/thermal/thermal_zone0/temp
> 27800
>
> $ sensors
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1: N/A

Are you sure both are referring to the same sensor?

(In reply to Stan King from comment #38)
> No change with this behavior on my hardware. In fact, I recently got an
> Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem
> with recent kernels: error during start-up, but reads the temperature OK
> later with lm_sensors.
>
> As you can see from comments 24 and 26, Zhang Rui had worked out a possible
> solution, but needed to check it with some other people.

Yes, I saw the thread. But actually, we may create the thermal zone after the firmware has successfully finished its setup. That would guarantee the sensor is operational for the thermal framework, without a parallel initialization leading to this message.

On the other side, there is a REGULAR_UCODE property for the driver which depends on the device_family. If it is not set, then get_temp will also fail.

For the first case, the message appears because of an asynchronous initialization of the firmware.

For the second case, the device does not belong to a device family that has the REGULAR_UCODE property.

The code snippet:

static int iwl_mvm_tzone_get_temp(...)
{

	[ ... ]

	if (!iwl_mvm_firmware_running(mvm) ||
	    mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
		ret = -ENODATA;
		goto out;
	}

	[ ... ]
}

(In reply to Daniel Lezcano from comment #40)
>
> (In reply to Max from comment #39)
> > $ cat /sys/class/thermal/thermal_zone0/temp
> > 27800
> >
> > $ sensors
> > iwlwifi_1-virtual-0
> > Adapter: Virtual device
> > temp1: N/A
>
> Are you sure both are referring to the same sensor ?

You know what? You're right. My bad. This is the correct zone.

[ 3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone2/temp
cat: /sys/class/thermal/thermal_zone2/temp: No data available

(In reply to Daniel Lezcano from comment #41)
> (In reply to Stan King from comment #38)
> > No change with this behavior on my hardware. In fact, I recently got an
> > Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same problem
> > with recent kernels: error during start-up, but reads the temperature OK
> > later with lm_sensors.
> >
> > As you can see from comments 24 and 26, Zhang Rui had worked out a possible
> > solution, but needed to check it with some other people.
>
> Yes, I saw the thread. But actually, we may create the thermal zone after
> the firmware did successfully finish its setup. So there will be the
> guarantee the sensor is operational for the thermal framework without a
> parallel initialization leading to this message.
>
> On the other side, there is a REGULAR_UCODE property for the driver which
> depends on the device_family. If it is not set, then the get_temp will fail
> also.
>
> For the first case, the message appears because of an asynchronous
> initialization of the firmware.
>
> For the second case, the device does not belong to the device family with
> the REGULAR_UCODE family.
>
> The code snippet:
>
> static int iwl_mvm_tzone_get_temp(...)
> {
>
> [ ... ]
>
> if (!iwl_mvm_firmware_running(mvm) ||
> mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> ret = -ENODATA;
> goto out;
> }
>
> [ ... ]
> }

My previous impression was that, even at runtime, it still fails because iwl_mvm_firmware_running(mvm) returns false, since the firmware can be unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
I have a machine that can reproduce the error.
Let me check what the problem is and whether I can refresh my patch based on the findings.
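As a side note for readers matching up the symptoms: the (-61) in the boot message, the "No data available" from cat, and the -ENODATA in the snippet quoted above are the same error, since ENODATA is errno 61 on Linux. The lookup below is only an illustrative sketch; the helper name `describe_kernel_error` is invented here, and the message text comes from the local C library, so it may differ between systems.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    err = -code
    name = errno.errorcode.get(err, "UNKNOWN")
    return name, os.strerror(err)


if __name__ == "__main__":
    # -61 is the value printed in "failed to read out thermal zone (-61)".
    print(describe_kernel_error(-61))
```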
(In reply to Zhang Rui from comment #43)
> (In reply to Daniel Lezcano from comment #41)
> > (In reply to Stan King from comment #38)
> > > No change with this behavior on my hardware. In fact, I recently got an
> > > Intel kit with an AX200 160 MHz WiFi card, and it exhibits the same
> > > problem with recent kernels: error during start-up, but reads the
> > > temperature OK later with lm_sensors.
> > >
> > > As you can see from comments 24 and 26, Zhang Rui had worked out a
> > > possible solution, but needed to check it with some other people.
> >
> > Yes, I saw the thread. But actually, we may create the thermal zone after
> > the firmware has successfully finished its setup. So there will be a
> > guarantee that the sensor is operational for the thermal framework,
> > without a parallel initialization leading to this message.
> >
> > On the other side, there is a REGULAR_UCODE property for the driver which
> > depends on the device_family. If it is not set, then get_temp will also
> > fail.
> >
> > For the first case, the message appears because of an asynchronous
> > initialization of the firmware.
> >
> > For the second case, the device does not belong to a device family with
> > the REGULAR_UCODE property.
> >
> > The code snippet:
> >
> > static int iwl_mvm_tzone_get_temp(...)
> > {
> >
> >         [ ... ]
> >
> >         if (!iwl_mvm_firmware_running(mvm) ||
> >             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> >                 ret = -ENODATA;
> >                 goto out;
> >         }
> >
> >         [ ... ]
> > }
>
> My previous impression is that even at runtime, it still fails because
> iwl_mvm_firmware_running(mvm) returns false because the firmware can be
> unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
> I have a machine that can reproduce the error.
> Let me check what the problem is and if I can refresh my patch based on the
> findings.
Wouldn't it make sense to call iwl_mvm_thermal_initialize() in the firmware callback after the setup is complete (perhaps at the end of the iwl_mvm_load_ucode_wait_alive() function?) instead of unconditionally creating the thermal zone?

(In reply to Max from comment #42)
> (In reply to Daniel Lezcano from comment #40)
> >
> > (In reply to Max from comment #39)
> > > $ cat /sys/class/thermal/thermal_zone0/temp
> > > 27800
> > >
> > > $ sensors
> > > iwlwifi_1-virtual-0
> > > Adapter: Virtual device
> > > temp1: N/A
> >
> > Are you sure both are referring to the same sensor?
>
> You know what? You're right. My bad. This is the correct zone.
>
> [ 3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)
>
> $ cat /sys/class/thermal/thermal_zone2/temp
> cat: /sys/class/thermal/thermal_zone2/temp: No data available

Thanks for confirming

(In reply to Daniel Lezcano from comment #45)
> (In reply to Max from comment #42)
> > (In reply to Daniel Lezcano from comment #40)
> > >
> > > (In reply to Max from comment #39)
> > > > $ cat /sys/class/thermal/thermal_zone0/temp
> > > > 27800
> > > >
> > > > $ sensors
> > > > iwlwifi_1-virtual-0
> > > > Adapter: Virtual device
> > > > temp1: N/A
> > >
> > > Are you sure both are referring to the same sensor?
> >
> > You know what? You're right. My bad. This is the correct zone.
> >
> > [ 3.216572] thermal thermal_zone2: failed to read out thermal zone (-61)
> >
> > $ cat /sys/class/thermal/thermal_zone2/temp
> > cat: /sys/class/thermal/thermal_zone2/temp: No data available
>
> Thanks for confirming

On the other hand, in my situation:

[ 7.298689] thermal thermal_zone4: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone4/temp
55000

So I am getting a successful read after start-up. 55C sounds about right, as it's a fanless system.
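Because the zone number differs from machine to machine, it is easy to compare the wrong zone against the lm-sensors output, as happened above. Selecting the zone by the contents of its type file avoids that. Below is a minimal sketch of this. zones_of_type is a made-up name, and the optional second argument exists only so the function can be pointed at a test directory instead of /sys/class/thermal.

```sh
# Hypothetical helper: list thermal zones whose type starts with a prefix,
# with their temperature, or "N/A" when the temp file cannot be read
# (which is what happens while the driver returns -ENODATA).
zones_of_type() {
    prefix=$1
    root=${2:-/sys/class/thermal}
    for dir in "$root"/thermal_zone*; do
        [ -r "$dir/type" ] || continue
        ztype=$(cat "$dir/type")
        case "$ztype" in
            "$prefix"*) ;;          # matching zone: fall through and report it
            *) continue ;;          # some other sensor: skip
        esac
        if temp=$(cat "$dir/temp" 2>/dev/null); then
            printf '%s %s %s\n' "${dir##*/}" "$ztype" "$temp"
        else
            printf '%s %s N/A\n' "${dir##*/}" "$ztype"
        fi
    done
}
```

Running `zones_of_type iwlwifi` prints one line per matching zone, so the zone name can be checked against the one in the kernel message.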
(In reply to Daniel Lezcano from comment #44)
> (In reply to Zhang Rui from comment #43)
> > (In reply to Daniel Lezcano from comment #41)
> > > (In reply to Stan King from comment #38)
> > My previous impression is that even at runtime, it still fails because
> > iwl_mvm_firmware_running(mvm) returns false because the firmware can be
> > unloaded, but I was not aware of the IWL_UCODE_REGULAR condition.
> > I have a machine that can reproduce the error.
> > Let me check what the problem is and if I can refresh my patch based on the
> > findings.
>
> Wouldn't it make sense to call iwl_mvm_thermal_initialize() in the firmware
> callback after the setup is complete (perhaps at the end of the
> iwl_mvm_load_ucode_wait_alive() function?) instead of unconditionally
> creating the thermal zone?

I think that is because the firmware may get loaded/unloaded at runtime. At least one code path I can confirm is

__iwl_mvm_suspend
 -> iwl_mvm_netdetect_config
 -> iwl_mvm_switch_to_d3
 -> iwl_mvm_stop_device
 -> clear_bit(IWL_MVM_STATUS_FIRMWARE_RUNNING, &mvm->status)

iwl_mvm_firmware_running() just checks the IWL_MVM_STATUS_FIRMWARE_RUNNING bit, and this results in the failure of iwl_mvm_tzone_get_temp(), which is the thermal .get_temp callback.

I don't know how often the firmware loading/unloading happens, but registering/unregistering the thermal zone upon firmware load/unload is relatively expensive, and its thermal zone device node becomes inconsistent. That is why I preferred to disable/enable the thermal zone instead of register/unregister in the original proposal.

Hi, Daniel,
is there any specific reason we can not register a "disabled" thermal zone?

Hi Rui,

IMO, we can register a disabled thermal zone if its initialization is complete and we can set it enabled. Here the initialization failed (and AFAICT there is nothing preventing us from setting it 'enabled').
The function thermal_zone_device_register() registers a sensor, but in the iwlwifi case there is no guarantee that such a sensor exists, because of the firmware.

That is the reason why I think the driver is not doing the right thing and should take care of registering/unregistering the thermal zone when the sensor (aka firmware code) can operate, otherwise we create an empty sensor device which is wrong. This is clearly spotted by the error happening at the end of the thermal_zone_device_register() function which calls thermal_zone_device_update() at the end: the thermal zone is registered before the sensor is initialized.

In addition, userspace programs may not be aware of the thermal zone mode and continue reading the temp file with the same ENODATA error. Especially when they read the temp from /sys/class/hwmon where the disabled state is not available.

(In reply to Daniel Lezcano from comment #49)
> This is clearly spotted by the error happening at the
> end of the thermal_zone_device_register() function which calls
> thermal_zone_device_update() at the end: the thermal zone is registered
> before the sensor is initialized.

thermal_zone_device_update() can handle a disabled thermal zone now. So if we flag the thermal zone as disabled during registration, thermal_zone_device_update() is a no-op.
This just means it is doable technically. But let's first understand why the current thermal APIs (register/unregister) can not fit the current problem.

> That is the reason why I think the driver is not doing the right thing and
> should take care of registering/unregistering the thermal zone when the
> sensor (aka firmware code) can operate, otherwise we create an empty sensor
> device which is wrong.

Okay, I found the previous conversation with Luciano Coelho, the iwlwifi maintainer.

"This issue has been known by us for a while now and we also had users complain about it, but at the time there was nothing we could do.
The reason for registering before we can actually provide the temperature is because the wifi interface may go up and down many times and we didn't want the userspace to keep having to set values again."

Luca, can you give more details about what the userspace does for the iwlwifi thermal zone, and how often the iwlwifi interface becomes available/unavailable?
We want to fully understand the drawbacks of doing thermal register/unregister.

>
> In addition, userspace programs may not be aware of the thermal zone mode
> and continue reading the temp file with the same ENODATA error. Especially
> when they read the temp from /sys/class/hwmon where the disabled state is
> not available.

I agree. We can only prevent access to a disabled thermal zone from the kernel, but accessing it via sysfs can still trigger this error.
But the kernel failure is what this bug report mainly complains about. And if users read the temp when the wifi interface is down, they will get an error; I don't think that is a problem.
Plus, we can add the tz->mode check in sysfs attribute callbacks, and give an extra warning of "accessing-to-a-disabled-thermal-zone".

(In reply to Zhang Rui from comment #50)

[ ... ]

> > That is the reason why I think the driver is not doing the right thing and
> > should take care of registering/unregistering the thermal zone when the
> > sensor (aka firmware code) can operate, otherwise we create an empty sensor
> > device which is wrong.
>
> Okay, I found the previous conversation with Luciano Coelho, the iwlwifi
> maintainer.
>
> "This issue has been known by us for a while now and we also had users
> complain about it, but at the time there was nothing we could do. The reason
> for registering before we can actually provide the temperature is because
> the wifi interface may go up and down many times and we didn't want the
> userspace to keep having to set values again."

That is not a kernel problem IMHO. The ifup / ifdown scripts can cleanly handle the configuration, right?
On the other side, the get_temp may never operate, as mentioned in a previous comment, and the thermal zone is still there.

From my POV, the bug falls under the wireless umbrella (Drivers/Intel wireless network drivers). The driver must register when the sensor exists (firmware loaded or whatever).

This bug has been open for a long time and people are complaining that we don't fix it.

Does it make sense to move it under the 'Drivers/Intel wireless network' component and change the status to 'CONFIRMED'?

For what it is worth, this is affecting Lenovo laptops based on TigerLake platform.

Attempting to suspend a machine that is running Debian Bullseye (LTS kernel 5.10.46) results in a forced shutdown due to incorrect temperature detection:

Before an attempted suspend:

```
$ sudo dmesg | grep 'failed to read out thermal'
kern :warn : [ +0.000066] thermal thermal_zone5: failed to read out thermal zone (-5)
kern :warn : [ +0.012993] thermal thermal_zone8: failed to read out thermal zone (-61)
kern :warn : [ +0.000010] thermal thermal_zone5: failed to read out thermal zone (-5)
$ sudo cat /sys/class/thermal/thermal_zone5/temp
cat: /sys/class/thermal/thermal_zone5/temp: Input/output error
```

After suspend, I am seeing the following, albeit briefly:

```
[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C), shutting down

message from syslog@laptop at Jul 24 00:00:00
kernel:[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C), shutting down
```

This is when the highest temperature registered by lm-sensors is about 30 C. I highly doubt that the 128 C reading above is accurate.

(In reply to korg from comment #52)
> For what it is worth, this is affecting Lenovo laptops based on TigerLake
> platform.
>
> Attempting to suspend a machine that is running Debian Bullseye (LTS kernel
> 5.10.46) results in a forced shutdown due to incorrect temperature detection:
>
> Before an attempted suspend:
>
> ```
> $ sudo dmesg | grep 'failed to read out thermal'
> kern :warn : [ +0.000066] thermal thermal_zone5: failed to read out
> thermal zone (-5)
> kern :warn : [ +0.012993] thermal thermal_zone8: failed to read out
> thermal zone (-61)
> kern :warn : [ +0.000010] thermal thermal_zone5: failed to read out
> thermal zone (-5)
> $ sudo cat /sys/class/thermal/thermal_zone5/temp
> cat: /sys/class/thermal/thermal_zone5/temp: Input/output error
> ```

What are the different thermal zones (4, 5 and 8)?

> After suspend, I am seeing the following, albeit briefly:
>
> ```
> [ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached (128 C),
> shutting down
>
> message from syslog@laptop at Jul 24 00:00:00
> kernel:[ xxxx.xxxxxx] thermal thermal_zone4: critical temperature reached
> (128 C), shutting down
> ```
>
> This is when the highest temperature registered by lm-sensors is about 30 C.
> I highly doubt that the 128 C reading above is accurate.

I still get this on a first-generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO 4750U based) running 5.14.17-301.fc35.x86_64:

Nov 16 15:16:57 fedora kernel: thermal thermal_zone0: failed to read out thermal zone (-61)

I'm having the same issue. Can I provide any logs to help debug further?

Lenovo Thinkpad X1 Carbon 7th Gen (Type 20QDCT01WW)

$ uname -a
Linux pop-os 5.16.11-76051611-generic #202202230823~1646248261~21.10~2b22243 SMP PREEMPT Wed Mar 2 20: x86_64 x86_64 x86_64 GNU/Linux

$ sensors
...
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A
...

/var/log/kern.log
...
Mar 23 21:49:04 pop-os kernel: [135133.290276] thermal thermal_zone5: failed to read out thermal zone (-61)
Mar 23 21:49:04 pop-os kernel: [135133.325120] PM: suspend exit
...

I should've mentioned - I experience this issue regularly, often multiple times a day.
All userspace applications are ended on wake from sleep.

(In reply to Tom from comment #55)

Thanks for reporting.

I'm really willing to fix this issue but I don't have the hardware to reproduce it.

What is this *virtual* wifi driver iwlwifi_1-virtual-0?

> I'm having the same issue. Can I provide any logs to help debug further?
>
> Lenovo Thinkpad X1 Carbon 7th Gen (Type 20QDCT01WW)
>
>
> $ uname -a
> Linux pop-os 5.16.11-76051611-generic #202202230823~1646248261~21.10~2b22243
> SMP PREEMPT Wed Mar 2 20: x86_64 x86_64 x86_64 GNU/Linux
>
>
> $ sensors
> ...
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1: N/A
> ...
>
>
> /var/log/kern.log
> ...
> Mar 23 21:49:04 pop-os kernel: [135133.290276] thermal thermal_zone5: failed
> to read out thermal zone (-61)
> Mar 23 21:49:04 pop-os kernel: [135133.325120] PM: suspend exit
> ...

I'm not sure - I had a look in /etc/sensors3.conf and /etc/sensors.d to try and learn more but couldn't find anything. If you can suggest how I can get more info about it, I'm happy to report back.

(In reply to Marcel Ziswiler from comment #54)
> I still get this on a first-generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO
> 4750U based) running 5.14.17-301.fc35.x86_64:
>
> Nov 16 15:16:57 fedora kernel: thermal thermal_zone0: failed to read out
> thermal zone (-61)

What is the output of cat /sys/class/thermal/thermal_zone0/type?
Can you please attach the full dmesg output after boot?

Zhang, although I'm not Marcel Ziswiler, I'm getting this on three different systems, and in each case, the content of the relevant thermal_zoneN/type is iwlwifi_1.

I'll try to attach the dmesg output from the boot of the system with an Intel 3165 WiFi circuit. It looks like that will end up as a separate comment. I apologize for my unfamiliarity with this forum environment.

The other systems have Intel 7265 and AX200 WiFi, so let me know if anything from those systems would be useful to your debug efforts.

Created attachment 301261 [details]
dmesg output showing -61 error
This file was from a Fedora 35 system, kernel 5.18.5-100.fc35.x86_64.
The WiFi circuit was Intel 3165.
The error was "thermal thermal_zone4: failed to read out thermal zone (-61)".
The output of "cat /sys/class/thermal/thermal_zone4/type" was iwlwifi_1
Lenovo T14 Gen 1 AMD (Ryzen 4750U) here with kernel 5.15.41.

From log:
kernel: thermal thermal_zone1: failed to read out thermal zone (-61)

The requested output:
$ cat /sys/class/thermal/thermal_zone1/type
iwlwifi_1

Wifi card:
$ lspci | grep Wi-Fi
03:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)

I want to confirm whether it is caused by the same device/reason on AMD platforms.

This is a known issue: the kernel reads the wifi thermal zone temperature while the wifi device firmware is not ready.
It is harmless but annoying, and I have proposed a solution but failed to push it upstream.

Let me bring this up for discussion again in the community.

(In reply to Zhang Rui from comment #63)
> I want to confirm whether it is caused by the same device/reason on AMD
> platforms.
>
> This is a known issue: the kernel reads the wifi thermal zone temperature
> while the wifi device firmware is not ready.
> It is harmless but annoying, and I have proposed a solution but failed to
> push it upstream.
>
> Let me bring this up for discussion again in the community.

From my POV the problem is in the wifi driver, not in the thermal framework. The thermal zone should be registered when the firmware is loaded.

I agree with Daniel: the problem lies in the wifi driver. It could be caused by a race condition between registering the thermal zone and loading the firmware, and also by whoever wrote the driver not considering the full table of memory addresses for the thermal zones in the actual hardware.
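Until the driver side is sorted out, a userspace reader can treat the failure as "not ready yet" and simply try again later, since several reports here show the same zone becoming readable once the firmware is up. A minimal sketch, with a made-up function name and arbitrary default retry values:

```sh
# Hypothetical helper: read a thermal zone temp file, retrying while the
# read fails (for example with ENODATA before the wifi firmware is ready).
# Usage: read_temp_retry FILE [TRIES] [DELAY_SECONDS]
read_temp_retry() {
    file=$1
    tries=${2:-5}
    delay=${3:-1}
    while [ "$tries" -gt 0 ]; do
        if value=$(cat "$file" 2>/dev/null); then
            printf '%s\n' "$value"
            return 0
        fi
        tries=$((tries - 1))
        # Only wait if another attempt is going to follow.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

This only hides the symptom for the tool doing the reading. It does nothing about the message the kernel prints when the zone is registered.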
Thinkpad E15, Gen2 with Intel i5
Redhat 9.0 Plow

$ dmesg
[ 9.614728] thermal thermal_zone6: failed to read out thermal zone (-61)

$ cat /sys/class/thermal/thermal_zone6/type
iwlwifi_1

$ lspci | grep Wi
00:14.3 Network controller: Intel Corporation Wi-Fi 6 AX201 (rev 20)

$ uname -rv
5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022

$ sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +32.0°C

$ cat /sys/class/thermal/thermal_zone6/temp
32000

For the record, I get the same message on boot:

thermal thermal_zone2: failed to read out thermal zone (-61)

but the sensors output for iwlwifi is only N/A until the interface is brought up. After activating the WiFi interface, the temperature can be read successfully.

A temporary fix that worked for me was to unload the iwlwifi module before entering sleep mode and load it again when waking up. You can do this easily with systemd hooks.

First confirm that the problem originates with the iwlwifi module:

```
# This should print an iwlwifi zone type such as iwlwifi_1
cat /sys/class/thermal/thermal_zone[NUMBER]/type
```

Then create a file under `/usr/lib/systemd/system-sleep/iwlwifi-thermal-issue` with the following contents:

```
#!/bin/sh

# Unloads the wifi modules before going into sleep mode to
# prevent the error `failed to read out thermal zone`.
# See https://bugzilla.kernel.org/show_bug.cgi?id=201761

# $1 is 'pre' (going to sleep) or 'post' (waking up)
# $2 is 'suspend', 'hibernate' or 'hybrid-sleep'
case "$1/$2" in
  pre/*)
    if lsmod | grep -q iwlmvm; then
      rmmod iwlmvm
    fi
    if lsmod | grep -q iwlwifi; then
      rmmod iwlwifi
    fi
    ;;
  post/*)
    # -a loads every listed module; without it, modprobe treats the
    # names after the first one as parameters for that module.
    modprobe -a iwlwifi iwlmvm
    ;;
esac
```

Make the file executable:

```
chmod +x /usr/lib/systemd/system-sleep/iwlwifi-thermal-issue
```

Actually I take that back, the only thing that works consistently is removing the wifi module explicitly before entering sleep mode:

```
sudo rmmod iwlmvm iwlwifi
```

The systemd hook only works sometimes and I haven't figured out why yet.
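One way to narrow down why such a hook misbehaves is to make its logic runnable without root and without actually suspending. The sketch below is only an illustration: run, unload_wifi_modules, reload_wifi_modules, sleep_hook and the DRY_RUN variable are all made-up names. With DRY_RUN set, the commands are printed instead of executed. Note that modprobe only loads several modules at once when given -a, since otherwise the names after the first are taken as module parameters.

```sh
# Hypothetical dry-run wrapper: print the command when DRY_RUN is set,
# execute it otherwise.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# iwlmvm depends on iwlwifi, so it has to be removed first.
unload_wifi_modules() {
    run rmmod iwlmvm iwlwifi
}

# -a tells modprobe that every argument is a module name.
reload_wifi_modules() {
    run modprobe -a iwlwifi iwlmvm
}

# $1 is 'pre' or 'post', as passed by systemd to system-sleep hooks.
sleep_hook() {
    case "$1" in
        pre)  unload_wifi_modules ;;
        post) reload_wifi_modules ;;
    esac
}
```

Calling `DRY_RUN=1 sleep_hook post` shows exactly what would be run on resume, which makes it easier to compare against what ends up in the journal after a real suspend cycle.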
Tuxedo InfinityBook S15 - Gen6
Mint 21.1

dmesg
[ 8.166878] iwlwifi 0000:38:00.0: enabling device (0000 -> 0002)
[ 8.176798] iwlwifi 0000:38:00.0: api flags index 2 larger than supported by driver
[ 8.176811] iwlwifi 0000:38:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[ 8.177173] iwlwifi 0000:38:00.0: loaded firmware version 72.daa05125.0 cc-a0-72.ucode op_mode iwlmvm
[ 8.320337] iwlwifi 0000:38:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 8.320431] thermal thermal_zone4: failed to read out thermal zone (-61)

cat /sys/class/thermal/thermal_zone4/type
iwlwifi_1

cat /sys/class/thermal/thermal_zone4/temp
49000

uname
5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC

sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +49.0°C

(In reply to Zhang Rui from comment #63)
> I want to confirm whether it is caused by the same device/reason on AMD
> platforms.
>
> This is a known issue: the kernel reads the wifi thermal zone temperature
> while the wifi device firmware is not ready.
> It is harmless but annoying, and I have proposed a solution but failed to
> push it upstream.
>
> Let me bring this up for discussion again in the community.

Is there any news from your side?

I see this as well, but it has gotten much worse with kernel 6.10.
Before kernel 6.10 I saw only one entry, "thermal thermal_zone2: failed to read out thermal zone (-61)".
Now with kernel 6.10 it continuously writes this message, until I temporarily activate the WiFi.

Same here.

For me, this is a blocker for kernel 6.10 and needs an immediate hotfix: this message floods the syslog/journal at a rate of 4 messages per second.

And on my notebook, wifi is controlled by security regulations and is forced to "off" when booting and by default, so there is no way to stop this message.

I can confirm the aforementioned observation for kernel 6.10. Since there have been two commits related to 'thermal_zone', I am cc'ing Rafael J. Wysocki, maybe he has an idea.
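To tell the single boot-time message apart from the flood described above, it helps to count the messages per zone. A small sketch that reads kernel log text on stdin follows. count_thermal_failures is a made-up name, and the input can come from whatever log source is at hand, for example dmesg.

```sh
# Hypothetical helper: count "failed to read out thermal zone" messages
# per zone and error code. Output lines look like: ZONE ERRNO COUNT
count_thermal_failures() {
    sed -n 's/.*thermal \(thermal_zone[0-9][0-9]*\): failed to read out thermal zone (\(-[0-9][0-9]*\)).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

For example, `dmesg | count_thermal_failures` gives one line per zone and error code. A count of one or two would match the pre-6.10 behaviour reported here, while a count that keeps growing between runs would match the 6.10 flood.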
Please see https://lore.kernel.org/linux-pm/CAJZ5v0gZ5611KXqfjSZOdjRi7v8num3P-vO82c7LGuS1Ak1=FQ@mail.gmail.com/

The patch attached to this message is reported to fix the problem.

(In reply to Rafael J. Wysocki from comment #75)
> Please see
> https://lore.kernel.org/linux-pm/CAJZ5v0gZ5611KXqfjSZOdjRi7v8num3P-vO82c7LGuS1Ak1=FQ@mail.gmail.com/
>
> The patch attached to this message is reported to fix the problem.

Thanks for the patch!

Laptop: Lenovo Legion 5 Pro 16ACH6H, Type 82JQ, Firmware GKCN65WW

I just wanted to report the same issue, on the same WiFi as Oskar Grindemyr from Comment #70, only I have a different iwlwifi firmware (microcode) version:

# dmesg | grep -i -e iwlwifi -e thermal
(selective copy&paste:)
[ 6.679123] iwlwifi 0000:04:00.0: enabling device (0000 -> 0002)
[ 6.683549] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[ 6.683583] iwlwifi 0000:04:00.0: PCI dev 2723/0080, rev=0x340, rfid=0x10a100
[ 6.684695] Loading firmware: iwlwifi-cc-a0-77.ucode
[ 6.684722] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[ 6.684983] iwlwifi 0000:04:00.0: loaded firmware version 77.c360c4b1.0 cc-a0-77.ucode op_mode iwlmvm
[ 6.685014] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 6.685083] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 6.809814] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[ 6.877056] iwlwifi 0000:04:00.0: base HW address: xx:xx:xx:xx:xx:xx
[ 6.943965] thermal thermal_zone0: failed to read out thermal zone (-61)

I wonder why a 5.19.0 system loads cc-a0-72.ucode, when mine with 6.10 and previous 6.x kernels loads cc-a0-77.ucode...

The message repeats quite a lot; most unfortunately, it does this on my LUKS password prompt (encrypted root partition).
And the message also continues to repeat after the WiFi is renamed.
[ 22.238221] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0

I did not see a message flood like this on my previous kernel 6.9.9:
[ 27.744247] thermal thermal_zone0: failed to read out thermal zone (-61)

# cat /sys/class/thermal/thermal_zone0/type
iwlwifi_1

# sensors | grep iwlw -A 2
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +47.0°C

# cat /sys/class/thermal/thermal_zone0/temp
47000

# uname -rv
6.10.0-gentoo-L5P #1 SMP Sat Jul 20 11:48:07 CEST 2024

I'll try the patch as suggested in Comment #75 next and report back...

Follow-up to Comment #77:

The patch:
# wget https://lore.kernel.org/linux-pm/CAJZ5v0gZ5611KXqfjSZOdjRi7v8num3P-vO82c7LGuS1Ak1=FQ@mail.gmail.com/2-thermal-core-recheck-temperature-v2.patch

The patch works! All "thermal thermal_zone0: failed to read out thermal zone (-61)" messages are gone.

# uname -rv
6.10.0-gentoo-L5P #1 SMP Mon Jul 22 08:50:22 CEST 2024

JFYI: Kernel 6.10.1 contains the fix.
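For anyone who wants to check whether a running 6.10.x kernel is at or past the release mentioned above, here is a minimal sketch of a version comparison. kernel_at_least is a made-up name. It compares only the leading MAJOR.MINOR.PATCH numbers of the release string, so it cannot tell whether a distribution kernel carries the patch as a backport, and per this report kernels before 6.10 did not show the message flood in the first place.

```sh
# Hypothetical helper: succeed if kernel release $2 (default: `uname -r`)
# is at or above version $1. Suffixes such as "-gentoo-L5P" are ignored.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    # Strip everything from the first character that is not a digit or dot.
    printf '%s %s\n' "$want" "${have%%[!0-9.]*}" | awk '{
        split($1, w, "."); split($2, h, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}
```

Usage would be along the lines of `kernel_at_least 6.10.1 && echo "at or above 6.10.1"`.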
Created attachment 279597 [details] dmesg thermal thermal_zone0: failed to read out thermal zone (-61)