Created attachment 273409 [details]
Hardware/software info, logs from wpa_supplicant and the kernel
The high level issue is that on disconnection from my Wi-Fi AP, the Group Temporal Key received by wpa_supplicant is not being successfully removed from the hardware. This causes the following error-level message to be printed by the kernel:
[ +0.014262] wlp1s0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)
This bug is always reproducible (for example, on suspend). I labeled this as a regression because when git-bisecting the kernel, this symptom first happens in cd4d23c1ea9b. But I don't think that commit is necessarily to blame...
The attachment, debug_logs.txt, contains hardware/software info, and some filtered debug logs from wpa_supplicant and the kernel. The kernel log contains some IWLWIFI_DEBUG statements, and two dump_stack()s on key-removal I added to my kernel.
I do not observe any functionality issues, but the error has (reasonably) caused confusion/concern to users. For example:
--> My understanding of why the kernel fails to remove the key:
On attempted iwl_mvm_remove_sta_key() of the GTK in question, the kernel fails to find the associated sta, and thus returns -EINVAL.
It fails to find the associated sta because in ieee80211_mgd_deauth() the kernel first calls ieee80211_set_disassoc(), then ieee80211_report_disconnect().
But ieee80211_set_disassoc() destroys the AP station via ieee80211_bss_info_change_notify(), and only afterwards does ieee80211_report_disconnect() attempt to destroys the GTK in question.
--> Two small clarifications:
One might notice, ieee80211_set_disassoc() destroys sta->ptk and sta->gtk first, before calling ieee80211_bss_info_change_notify().
So what's the issue? The issue is, the GTK in question is not held in sta->gtk, it is only held in sdata->keys.
This is because wpa_supplicant does not (via nl80211) give the kernel a sta mac-addr that the GTK is for -- see wpa_driver_nl80211_set_key().
Thus, the kernel does not install it as an sta-specific GTK, but rather an interface key -- see ieee80211_add_key()
I also wondered, how is the GTK in question installed in the hardware as an sta key in the first place?
This happens because in iwl_mvm_set_sta_key(), the station is looked up using iwl_mvm_get_key_sta().
Which, as the comment there explains, the station is found using the interface's associated station.
(Again, unfortunately, this association is removed on destruction of the AP station).
--> Thoughts on a fix:
First, based on git-bisect of the kernel, the symptom of the error printing first started in cd4d23c1ea9b.
In that commit (though it is unclear why from the message) the failure to find a station turned from a return-0 case to a return-EINVAL case.
(If you look at the rest of the code in that function in that commit, no-station-found used to be a return-0 case).
So the first option would be to restore this return-0 behavior. I think this would be the best move if, when the hardware destroys the station, it also destroys the GTK in question somehow. But, it's not clear to me how to find out if this is happening or not.
If the key is not being removed from the hardware, then I am not sure of the best fix. Based on the dump_stack()s from the debug_log.txt attachment, I think it would look something like, at least conceptually, deleting the sta-specific sdata->keys in ieee80211_set_disassoc() instead of __cfg80211_disconnected(). Or, moving the ieee80211_bss_info_change_notify() from ieee80211_set_disassoc() to after ieee80211_report_disconnect() in ieee80211_mgd_deauth().
Created attachment 273623 [details]
please try the patch attached.
Created attachment 273641 [details]
Tweak of proposed candidate fix
Hi Emmanuel, thanks for the patch. With it, I tested booting a couple times, and suspending a couple times after that. The error message did not appear, so that is great.
However, I did note that on one of the boots, I got a failed ASSERT_RTNL() inside the new ieee80211_iter_keys() call while associating. So the kernel dump_stack()'d, then half a second later printed:
[ 9.588473] iwlwifi 0000:01:00.0: No association and the time event is over already...
[ 9.588562] wlp1s0: Connection to AP <redacted> lost
This happened a couple more times in succession. Only on the 4th time, it succeeded.
I looked at the patch, and it seems a likely culprit is the new ieee80211_iter_keys() call being done after the mvm->mutex is unlocked. So I moved it above the unlock -- into the pre-existing old_state/new_state checks. See attached. I have tested this and no failed ASSERT_RTNL so far (and still no failed-to-remove-key error). Thoughts?
Created attachment 273651 [details]
Fix that was merged in our internal tree.
This is still showing up in 4.15.3 and 4.16rc1.
Created attachment 274173 [details]
Created attachment 274175 [details]
Code hasn't been sent yet.
We have a backlog of patches that need to be sent upstream.
For anyone following along, the commit is now in Linus's tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4f13ad07823b . It looks like it was merged on 2018/03/22 by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c4f4d2f91772 .
I've patched the commit against 4.15.15 and still got the issue:
[ 83.667043] wlp1s0: deauthenticating from e8:df:70:41:f2:43 by local choice (Reason: 3=DEAUTH_LEAVING)
[ 83.679337] wlp1s0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-22)
Sorry, my issue was caused initially by another issue with the sata-controller which brought the system to the hanging state.
The message still appear in 4.15.0
Still happening in 4.16.0, is the fix still not applied?