Bug 198357 - iwlwifi: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)
Summary: iwlwifi: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
Depends on:
Reported: 2018-01-05 03:06 UTC by zzpyynzjbszkbtipim
Modified: 2018-04-29 11:03 UTC (History)
7 users (show)

See Also:
Kernel Version: 4.15.0-rc6-00048-ge1915c8195b3
Tree: Mainline
Regression: Yes

Hardware/software info, logs from wpa_supplicant and the kernel (3.54 KB, text/plain)
2018-01-05 03:06 UTC, zzpyynzjbszkbtipim
Fix canditate (2.00 KB, patch)
2018-01-15 13:07 UTC, Emmanuel Grumbach
Details | Diff
Tweak of proposed candidate fix (1.25 KB, patch)
2018-01-16 00:50 UTC, zzpyynzjbszkbtipim
Details | Diff
Fix canditate (1.80 KB, patch)
2018-01-16 18:35 UTC, Emmanuel Grumbach
Details | Diff
dmesg 4.16rc1 (135.42 KB, text/plain)
2018-02-15 03:00 UTC, Chris Murphy
lspci -vvnn (30.08 KB, text/plain)
2018-02-15 03:00 UTC, Chris Murphy

Description zzpyynzjbszkbtipim 2018-01-05 03:06:56 UTC
Created attachment 273409 [details]
Hardware/software info, logs from wpa_supplicant and the kernel

The high level issue is that on disconnection from my Wi-Fi AP, the Group Temporal Key received by wpa_supplicant is not being successfully removed from the hardware. This causes the following error-level message to be printed by the kernel:

  [  +0.014262] wlp1s0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)

This bug is always reproducible (for example, on suspend). I labeled this as a regression because when git-bisecting the kernel, this symptom first happens in cd4d23c1ea9b. But I don't think that commit is necessarily to blame...

The attachment, debug_logs.txt, contains hardware/software info, and some filtered debug logs from wpa_supplicant and the kernel. The kernel log contains some IWLWIFI_DEBUG statements, and two dump_stack()s on key-removal I added to my kernel.

I do not observe any functionality issues, but the error has (reasonably) caused confusion/concern to users. For example:
  - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1720930
  - https://bugzilla.redhat.com/show_bug.cgi?id=1503337
  - https://bbs.archlinux.org/viewtopic.php?id=230867

--> My understanding of why the kernel fails to remove the key:

On attempted iwl_mvm_remove_sta_key() of the GTK in question, the kernel fails to find the associated sta, and thus returns -EINVAL.
It fails to find the associated sta because in ieee80211_mgd_deauth() the kernel first calls ieee80211_set_disassoc(), then ieee80211_report_disconnect().
But ieee80211_set_disassoc() destroys the AP station via ieee80211_bss_info_change_notify(), and only afterwards does ieee80211_report_disconnect() attempt to destroys the GTK in question.

--> Two small clarifications:

One might notice, ieee80211_set_disassoc() destroys sta->ptk and sta->gtk first, before calling ieee80211_bss_info_change_notify().
So what's the issue? The issue is, the GTK in question is not held in sta->gtk, it is only held in sdata->keys.
This is because wpa_supplicant does not (via nl80211) give the kernel a sta mac-addr that the GTK is for -- see wpa_driver_nl80211_set_key().
Thus, the kernel does not install it as an sta-specific GTK, but rather an interface key -- see ieee80211_add_key()

I also wondered, how is the GTK in question installed in the hardware as an sta key in the first place?
This happens because in iwl_mvm_set_sta_key(), the station is looked up using iwl_mvm_get_key_sta().
Which, as the comment there explains, the station is found using the interface's associated station.
(Again, unfortunately, this association is removed on destruction of the AP station).

--> Thoughts on a fix:

First, based on git-bisect of the kernel, the symptom of the error printing first started in cd4d23c1ea9b.
In that commit (though it is unclear why from the message) the failure to find a station turned from a return-0 case to a return-EINVAL case.
(If you look at the rest of the code in that function in that commit, no-station-found used to be a return-0 case).

So the first option would be to restore this return-0 behavior. I think this would be the best move if, when the hardware destroys the station, it also destroys the GTK in question somehow. But, it's not clear to me how to find out if this is happening or not.

If the key is not being removed from the hardware, then I am not sure of the best fix. Based on the dump_stack()s from the debug_log.txt attachment, I think it would look something like, at least conceptually, deleting the sta-specific sdata->keys in ieee80211_set_disassoc() instead of __cfg80211_disconnected(). Or, moving the ieee80211_bss_info_change_notify() from ieee80211_set_disassoc() to after ieee80211_report_disconnect() in ieee80211_mgd_deauth().
Comment 1 Emmanuel Grumbach 2018-01-15 13:07:32 UTC
Created attachment 273623 [details]
Fix canditate

please try the patch attached.
Comment 2 zzpyynzjbszkbtipim 2018-01-16 00:50:41 UTC
Created attachment 273641 [details]
Tweak of proposed candidate fix

Hi Emmanuel, thanks for the patch. With it, I tested booting a couple times, and suspending a couple times after that. The error message did not appear, so that is great.

However, I did note that on one of the boots, I got a failed ASSERT_RTNL() inside the new ieee80211_iter_keys() call while associating. So the kernel dump_stack()'d, then half a second later printed:

[    9.588473] iwlwifi 0000:01:00.0: No association and the time event is over already...
[    9.588562] wlp1s0: Connection to AP <redacted> lost

This happened a couple more times in succession. Only on the 4th time, it succeeded.

I looked at the patch, and it seems a likely culprit is the new ieee80211_iter_keys() call being done after the mvm->mutex is unlocked. So I moved it above the unlock -- into the pre-existing old_state/new_state checks. See attached. I have tested this and no failed ASSERT_RTNL so far (and still no failed-to-remove-key error). Thoughts?
Comment 3 Emmanuel Grumbach 2018-01-16 18:35:36 UTC
Created attachment 273651 [details]
Fix canditate

Fix that was merged in our internal tree.
Comment 4 Chris Murphy 2018-02-15 02:58:24 UTC
This is still showing up in 4.15.3 and 4.16rc1.
Comment 5 Chris Murphy 2018-02-15 03:00:30 UTC
Created attachment 274173 [details]
dmesg 4.16rc1
Comment 6 Chris Murphy 2018-02-15 03:00:48 UTC
Created attachment 274175 [details]
lspci -vvnn
Comment 7 Emmanuel Grumbach 2018-02-15 05:19:35 UTC
Code hasn't been sent yet.

We have a backlog of patches that need to be sent upstream.
Comment 8 zzpyynzjbszkbtipim 2018-03-25 16:02:40 UTC
For anyone following along, the commit is now in Linus's tree:     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4f13ad07823b . It looks like it was merged on 2018/03/22 by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c4f4d2f91772 .
Comment 9 Bjoern Franke 2018-04-01 22:44:20 UTC
I've patched the commit against 4.15.15 and still got the issue:

[   83.667043] wlp1s0: deauthenticating from e8:df:70:41:f2:43 by local choice (Reason: 3=DEAUTH_LEAVING)
[   83.679337] wlp1s0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)
Comment 10 Bjoern Franke 2018-04-01 23:10:18 UTC
Sorry, my issue was caused initially by another issue with the sata-controller which brought the system to the hanging state.
Comment 11 dd 2018-04-29 01:15:39 UTC
The message still appear in 4.15.0
Comment 12 Esokrarkose 2018-04-29 11:03:48 UTC
Still happening in 4.16.0, is the fix still not applied?

Note You need to log in before you can comment on or make changes to this bug.