Bug 196941 - [iwlwifi] Temperature sensor giving I/O error when WiFi is disabled with rfkill
Summary: [iwlwifi] Temperature sensor giving I/O error when WiFi is disabled with rfkill
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-14 09:46 UTC by Armin K.
Modified: 2017-09-30 08:14 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.13.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Armin K. 2017-09-14 09:46:06 UTC
I have never noticed iwlwifi temperature monitoring until now, but when WiFi is disabled using rfkill, where iwlwifi temperatures are supposed to be, sensors output shows this:

iwlwifi-virtual-0
Adapter: Virtual device
ERROR: Can't get value of subfeature temp1_input: I/O error
temp1:            N/A  

This seems to make userspace apps that use libsensors believe that sensors aren't detected and makes them misbehave.

amdgpu seems to detect this situation properly.

03:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
	Subsystem: Intel Corporation Dual Band Wireless AC 3165
	Flags: bus master, fast devsel, latency 0, IRQ 126
	Memory at e2200000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00
	Kernel driver in use: iwlwifi

[    3.729100] iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
[    3.729720] iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-7265D-29.ucode failed with error -2
[    3.729726] iwlwifi 0000:03:00.0: Direct firmware load for iwlwifi-7265D-28.ucode failed with error -2
[    3.729953] iwlwifi 0000:03:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm
[    3.730224] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 3165, REV=0x210
Comment 1 Luca Coelho 2017-09-16 16:34:36 UTC
The iwlwifi driver cannot provide the temperature reading when the interface is down (which is the case with rfkill).  The temperature is read by the firmware, which is only loaded and running when the interface is up.

We can't support this, so the userspace needs to be able to take care of it.
Comment 2 Armin K. 2017-09-16 16:57:04 UTC
As I said, other hardware drivers support this scenario quite perfect: Don't try to read anything if the hardware is powered down. This is breaking userspace, I don't think Linus would be happy with it.
Comment 3 Luca Coelho 2017-09-16 17:25:42 UTC
It can never have worked with the userspace as is, because this has never changed in the iwlwifi driver.  This issue has been discussed before and this was the conclusion we came to.

What do you want the iwlwifi driver to do? Lie about the temperature?
Comment 4 Armin K. 2017-09-16 17:29:09 UTC
This is what amdgpu does:

amdgpu-pci-0100
Adapter: PCI adapter
fan1:             N/A
temp1:            N/A  (crit =  +0.0°C, hyst =  +0.0°C)

As in, returns nothing. I/O error is problematic in the OP, as it tries to access a disabled device's registers/whatever. It already displays N/A.
Comment 5 Luca Coelho 2017-09-16 18:53:47 UTC
So, amdgpu returns 0 as the temperature when the device is not on? 0 is a valid temperature (though unlikely) and there is nothing in the thermal framework indicating that the temperature should be set to 0 when it cannot be read.

The iwlwifi driver registers a thermal zone and the only way to read the temperature from it is by calling the get_temperature op.  This op can return a temperature and/or an error.  There's no such thing as "returns nothing".

Unfortunately the thermal framework doesn't seem to specify these cases.
Comment 6 Armin K. 2017-09-16 19:46:54 UTC
I've finally found the amdgpu kernel code for reading temperature. In case of hardware being disabled, they return -EINVAL, which lm_sensors seems to parse correctly.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c#n821
Comment 7 Luca Coelho 2017-09-16 20:01:05 UTC
I think that is incorrect.  -EINVAL is returned when the user space sends an invalid argument in a request, which is not the case here.

In any case, if the userspace understands -EINVAL as N/A, it should also treat -EIO as such.

Maybe -EIO is a bit harsh in iwlwifi, but at most I would change it to -ENODATA.  -EINVAL would be wrong.
Comment 8 Armin K. 2017-09-18 10:54:55 UTC
-ENODATA sounds good and maybe even makes more sense.
Comment 9 Luca Coelho 2017-09-18 11:08:08 UTC
And do you think the userspace you treat this error more gracefully than the current -EINVAL?

The userspace code is libsensors, right?
Comment 10 Luca Coelho 2017-09-18 11:33:31 UTC
Okay, I checked the libsensors code and it seems to treat *only* -EIO differently[1]:

int sensors_read_sysfs_attr(const sensors_chip_name *name,
			    const sensors_subfeature *subfeature,
			    double *value)
{
[...]
		if (res == EOF) {
			if (errno == EIO)
				return -SENSORS_ERR_IO;
			else 
				return -SENSORS_ERR_ACCESS_R;
		}
[...]
}

I have no idea why it does this, though... I don't see anything related to that in the documentation[2].

[1] https://github.com/groeck/lm-sensors/blob/dfa4518962c81cf1a21471cef958aa9e67a63b68/lib/sysfs.c#L874
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/thermal/sysfs-api.txt
Comment 11 Armin K. 2017-09-18 11:35:11 UTC
Maybe they treat -EIO as "device (driver) broken"?
Comment 12 Luca Coelho 2017-09-18 11:48:35 UTC
Okay, I'm reopening and I'll provide this fix.
Comment 13 Luca Coelho 2017-09-19 08:52:34 UTC
Okay, I made the change in our internal tree and it will eventually reach the mainline, following our normal upstreaming process.
Comment 14 Luca Coelho 2017-09-30 08:14:57 UTC
The fix has been submitted upstream and will soon reach the mainline and stable releases.

Note You need to log in before you can comment on or make changes to this bug.