I have never noticed iwlwifi temperature monitoring until now, but when WiFi is disabled using rfkill, where iwlwifi temperatures are supposed to be, sensors output shows this: iwlwifi-virtual-0 Adapter: Virtual device ERROR: Can't get value of subfeature temp1_input: I/O error temp1: N/A This seems to make userspace apps that use libsensors believe that sensors aren't detected and makes them misbehave. amdgpu seems to detect this situation properly. 03:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81) Subsystem: Intel Corporation Dual Band Wireless AC 3165 Flags: bus master, fast devsel, latency 0, IRQ 126 Memory at e2200000 (64-bit, non-prefetchable) [size=8K] Capabilities: [c8] Power Management version 3 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Capabilities: [40] Express Endpoint, MSI 00 Kernel driver in use: iwlwifi [ 3.729100] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002) [ 3.729720] iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-7265D-29.ucode failed with error -2 [ 3.729726] iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-7265D-28.ucode failed with error -2 [ 3.729953] iwlwifi 0000:03:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm [ 3.730224] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 3165, REV=0x210
The iwlwifi driver cannot provide the temperature reading when the interface is down (which is the case with rfkill). The temperature is read by the firmware, which is only loaded and running when the interface is up. We can't support this, so the userspace needs to be able to take care of it.
As I said, other hardware drivers support this scenario quite perfect: Don't try to read anything if the hardware is powered down. This is breaking userspace, I don't think Linus would be happy with it.
It can never have worked with the userspace as is, because this has never changed in the iwlwifi driver. This issue has been discussed before and this was the conclusion we came to. What do you want the iwlwifi driver to do? Lie about the temperature?
This is what amdgpu does: amdgpu-pci-0100 Adapter: PCI adapter fan1: N/A temp1: N/A (crit = +0.0°C, hyst = +0.0°C) As in, returns nothing. I/O error is problematic in the OP, as it tries to access a disabled device's registers/whatever. It already displays N/A.
So, amdgpu returns 0 as the temperature when the device is not on? 0 is a valid temperature (though unlikely) and there is nothing in the thermal framework indicating that the temperature should be set to 0 when it cannot be read. The iwlwifi driver registers a thermal zone and the only way to read the temperature from it is by calling the get_temperature op. This op can return a temperature and/or an error. There's no such thing as "returns nothing". Unfortunately the thermal framework doesn't seem to specify these cases.
I've finally found the amdgpu kernel code for reading temperature. In case of hardware being disabled, they return -EINVAL, which lm_sensors seems to parse correctly. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c#n821
I think that is incorrect. -EINVAL is returned when the user space sends an invalid argument in a request, which is not the case here. In any case, if the userspace understands -EINVAL as N/A, it should also treat -EIO as such. Maybe -EIO is a bit harsh in iwlwifi, but at most I would change it to -ENODATA. -EINVAL would be wrong.
-ENODATA sounds good and maybe even makes more sense.
And do you think the userspace you treat this error more gracefully than the current -EINVAL? The userspace code is libsensors, right?
Okay, I checked the libsensors code and it seems to treat *only* -EIO differently[1]: int sensors_read_sysfs_attr(const sensors_chip_name *name, const sensors_subfeature *subfeature, double *value) { [...] if (res == EOF) { if (errno == EIO) return -SENSORS_ERR_IO; else return -SENSORS_ERR_ACCESS_R; } [...] } I have no idea why it does this, though... I don't see anything related to that in the documentation[2]. [1] https://github.com/groeck/lm-sensors/blob/dfa4518962c81cf1a21471cef958aa9e67a63b68/lib/sysfs.c#L874 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/thermal/sysfs-api.txt
Maybe they treat -EIO as "device (driver) broken"?
Okay, I'm reopening and I'll provide this fix.
Okay, I made the change in our internal tree and it will eventually reach the mainline, following our normal upstreaming process.
The fix has been submitted upstream and will soon reach the mainline and stable releases.