Bug 213059 - iwlwifi: AX201 doesn't support NL80211_EXT_FEATURE_CAN_REPLACE_PTK0, causes disconnections on rekeying
Summary: iwlwifi: AX201 doesn't support NL80211_EXT_FEATURE_CAN_REPLACE_PTK0, causes d...
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-14 11:03 UTC by wavexx
Modified: 2022-03-21 13:45 UTC (History)
3 users (show)

See Also:
Kernel Version: 5.10.28
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description wavexx 2021-05-14 11:03:14 UTC
Using a Wifi AX201 card on Kernel 5.10.28 while connecting to a WPA2-EAP network I noticed frequent and random disconnections, which turned out to be due to rekeying.

Using wpasupplicant with the default configuration (wpa_deny_ptk0_rekey=0) this causes the connection to timeout:

kernel: Rekeying PTK for STA 4c:a6:4d:54:32:6b but driver can't safely do that.
kernel: wlan0: disassociated from 4c:a6:4d:54:32:6b (Reason: 15=4WAY_HANDSHAKE_TIMEOUT)

With wpa_deny_ptk0_rekey=1, wpa_supplicant notices there's no support and just reconnects instead (which actually brings the connection up faster). The same is true with when using iwd.

Is this a limitation of the iwlwifi driver, of the AX201 or the AX2?? fw?

[    7.798293] iwlwifi 0000:00:14.3: firmware: direct-loading firmware iwlwifi-QuZ-a0-hr-b0-59.ucode
[    7.798303] iwlwifi 0000:00:14.3: api flags index 2 larger than supported by driver
[    7.798316] iwlwifi 0000:00:14.3: TLV_FW_FSEQ_VERSION: FSEQ Version: 65.3.35.22
[    7.798695] iwlwifi 0000:00:14.3: loaded firmware version 59.601f3a66.0 QuZ-a0-hr-b0-59.ucode op_mode iwlmvm
[    7.798710] iwlwifi 0000:00:14.3: firmware: failed to load iwl-debug-yoyo.bin (-2)
[    7.852819] iwlwifi 0000:00:14.3: Detected Intel(R) Wi-Fi 6 AX201 160MHz, REV=0x354

This is the latest kernel and firmware as available on debian unstable.

Lack of support for NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 is a huge issue for networks where rekeying is enabled.

It is possible to force SW crypto in the iwlwifi module? It seems that SW crypto always has support for NL80211_EXT_FEATURE_CAN_REPLACE_PTK0, so that would be a superior alternative to constant disconnections.
Comment 1 Johannes Berg 2021-06-09 10:42:26 UTC
It seems that NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 might not even matter. The driver should in fact support it, but the only thing that happens if it doesn't support it is that we have an extra queue flush.

You can try to add NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 in the driver? I suspect it won't actually make any difference.

And if it doesn't make any difference, I guess something is misbehaving between the AP and wpa_supplicant, or so. Or we're encrypted with the wrong key, or something. Are you able to capture air sniffer of this scenario with another machine?
Comment 2 Johannes Berg 2021-06-09 10:45:41 UTC
Actually, thinking about it *again*, NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 is probably not really true for the driver, at least according to the documented requirements. But the requirements might be wrong - if we switch an older packet *entirely* to a newer key (including PN generation), it should be fine.

In any case, it's probably unrelated and we need to look at the actual packet exchange over the air to see what's going on. Btw, to make sense of that you'll need to provide also the keys and the initial 4-way-HS of the first connection, so make sure you do this on a test network.

Also see

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#air_sniffing

and

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#privacy_aspects
Comment 3 wavexx 2021-06-10 14:00:50 UTC
> Actually, thinking about it *again*, NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 is
> probably not really true for the driver, at least according to the documented
> requirements.

What do you mean by this? Does the driver support this or not?
If it does, it clearly needs to be advertised as otherwise with the default behavior of wpasupplicant (wpa_deny_ptk0_rekey=0, "always rekey when  onfigured/instructed"), you get this on a current kernel:

[ 9430.044942] Rekeying PTK for STA 4c:a6:4d:54:09:43 but driver can't safely do that.
[ 9435.038366] wlan0: deauthenticated from 4c:a6:4d:54:09:43 (Reason: 7=CLASS3_FRAME_FROM_NONASSOC_STA)
[ 9435.177887] wlan0: authenticate with 4c:a6:4d:54:09:43
...
Comment 4 Johannes Berg 2021-06-11 07:11:30 UTC
(In reply to wavexx from comment #3)

> What do you mean by this? Does the driver support this or not?

Well, there are some very specific requirements for it that are documented in mac80211.h, and those are not met to the letter. I'm thinking that the driver could advertise it (and the documented requirements could be slightly relaxed.)

However, unless you set "wpa_deny_ptk0_rekey=1", the only differences between the driver advertising this or not are:
 1) there's the extra (admittedly somewhat scary) message printed
 2) a queue flush is done before changing the key

> If it does, it clearly needs to be advertised as otherwise with the default
> behavior of wpasupplicant (wpa_deny_ptk0_rekey=0, "always rekey when 
> onfigured/instructed"), you get this on a current kernel:
> 
> [ 9430.044942] Rekeying PTK for STA 4c:a6:4d:54:09:43 but driver can't
> safely do that.
> [ 9435.038366] wlan0: deauthenticated from 4c:a6:4d:54:09:43 (Reason:
> 7=CLASS3_FRAME_FROM_NONASSOC_STA)
> [ 9435.177887] wlan0: authenticate with 4c:a6:4d:54:09:43
> ...

I understand, but I don't think the reason for this is necessarily the lack of NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 feature advertisement. After all, like I say above, there are only two very small differences with and without the feature advertisement.

Since you're able to reproduce the issue and we're not (we actually have a few tests with PTK0 rekeying, I guess nobody noticed the message there), I think we're only going to make progress on this if you can provide more information:

a) try with NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 set - if that makes a difference then perhaps the flush there took 5 seconds for some reason

b) maybe try to collect tracing data while this happens, per https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing

c) maybe better yet try to collect air sniffer (but this is harder) per my previous comment
Comment 5 wavexx 2021-06-11 11:41:35 UTC
Afaik this problem is happening since the org I'm connecting to switched to some Cisco C9115AXI-E APs.

The disconnections always happen after the rekeying, apparently, but the time between those events can go anywhere between once an hour and 2-3 times per minute.

This is a problem for debugging, as I don't have control over the network, don't connect to it often enough to just let it sit and don't have some test account where I could perform some tests to just dump traffic freely.

I have some traces I collected while debugging this issue one level up using iwmon along with Denis Kenzior (from the iwd daemon), but I'd rather not share those publicly.
Comment 6 Johannes Berg 2021-06-14 08:42:30 UTC
The time between which events? Between the rekeying, or between the disconnects?
Comment 7 wavexx 2021-06-14 09:26:38 UTC
For both, since each disconnect is preceded by that message.
Comment 8 James Ettle 2021-07-06 19:32:19 UTC
Same behaviour observed also with Intel Wireless-AC 9260 (8086:2526) and a TP-Link EAP 225 WAP (fw 5.0.3) using EAP-TLS. An occational "Rekeying PTK for STA 4c:a6:4d:54:09:43 but driver can't safely do that.", followed after some hours by a total disconnect.
Comment 9 alexander 2021-09-24 22:14:51 UTC
It looks like you are claiming that you only get "Rekeying PTK for STA..." messages when you get disassociated?
That should only be possible when every rekey fails but then the interval of the disconnects should be constant: Simply whatever the PTK rekey interval is set to.

Also interesting is, that the two log fragments list two different disassoc messages after the rekey:
1) disassociated from 4c:a6:4d:54:32:6b (Reason: 15=4WAY_HANDSHAKE_TIMEOUT)
2) deauthenticated from 4c:a6:4d:54:09:43 (Reason: 7=CLASS3_FRAME_FROM_NONASSOC_STA)

Are there more? And how frequent are each of these?

Especially the initial log allows some conclusions in what area the error must be:
1) The PTK0 rekey warning is printed. So the STA must have completed the eapol handshake and is on the new key
2) wpa_supplicant receives a disassoc frame from the AP whether the AP tells the STA that the eapol handshake has run into a timeout and therefore still should be on the old key.
Which seems to explain why they can't talk to each other any longer and are unable to recover without a disconnect. But is not sheading any light how they could get into such a state...
Having some time stamps in the log could be useful too. Even better would be a full debug log from wpa_supplicant with time stamps...

You should be able to enable the highest debug logging for wpa_supplicant with:
sudo dbus-send --system --print-reply --dest=fi.w1.wpa_supplicant1 /fi/w1/wpa_supplicant1 org.freedesktop.DBus.Properties.Set string:fi.w1.wpa_supplicant1 string:DebugLevel variant:string:"msgdump"

Logs from the AP - the more verbose the better - would also be nice but it looks like this is happening:
1) STA sends out EAPOL#4 and installs the new key (triggering the next to be sure harmless rekey warning)
2) AP is not reviving or able to decrypt eapol#4 
3) some seconds (maybe some tenth of seconds?) later the AP gives up retransmitting eapol#3 to get an readable eapol#4 from the STA. There is no way to recover from that but sending a disconnect.

So somehow we got into a state where the AP never did see eapol#4 or has chosen to ignore it for some reason.
Ideally you would be able to provide the air sniffing capture Johannes linked some instructions for above. When you can do that and also provide wpa_supplicant debug logs including sensitive information we'll see exactly what's going on here. 
(For WPA-EAP the key information is comparable harmless, it will only allow to decode the frames intended for the sta we got the debug log from till the next rekey. Nevertheless you should either be sure that noting sensitive is in the encrypted frames or only hand over the debug logs to someone you trust with the data in the capture.)

Much simpler but still interesting would be disabling HW crypto for iwlwifi by setting swcrypto and check if that fixes the issue, e.g.:
 rmmod iwlmvm iwlwifi
 modprobe iwlwifi swcrypto=1
 modprobe iwlmvm
After that directly connect to the AP and start testing.

If it fixes the issue, it's strongly pointing to a bug in iwlwifi. If not it looks like the AP is the "better" candidate for causing the issue somehow.

Note You need to log in before you can comment on or make changes to this bug.