Summary: iwlwifi Microcode SW error during rescan while under network load
Product: Drivers
Component: network-wireless-intel
Severity: normal
Reporter: Andrew M (andrew)
Assignee: Golan Ben Ami (golan.ben.ami)
CC: andersonkw2, ayala.beker, fweimer, golan.ben.ami, leonard, me, tom
System journal logs
Fix - formatted patch.
iwlwifi patch to set scan timeout to 30 seconds
Description Andrew M 2021-10-11 13:36:03 UTC
Created attachment 299177 [details] System journal logs

WiFi client device: AX210
WiFi AP: Unifi AP AC Pro

Reproduction steps:
1. Start iperf to generate network load.
2. Perform `nmcli device wifi rescan`.
3. Expect: no device restart. Actual: device restarts (see logs) and network speeds become abysmally slow or non-responsive until disconnecting and reconnecting to WiFi network.

I can reliably reproduce. This occurs regardless of whether the connected network is 5 GHz only, 2.4 GHz only, or 2.4+5 GHz. The network is 802.11ac. However, when network is 2.4 GHz only, network speed reduction seems to be less severe.

Since NetworkManager performs rescans periodically, this bug occurs quite readily if performing continuous high network load.
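For anyone who wants to script the reproduction steps above, here is a minimal sketch. It assumes `iperf`, `nmcli` and `journalctl` are on the PATH and that an iperf server is reachable; the server address, the timings and the helper names are illustrative and not from the original report. The error marker is the substring from the bug summary, so adjust it if your kernel log wording differs.

```python
import subprocess
import time

# Substring taken from the bug summary; matching on it is an assumption.
ERROR_MARKER = "Microcode SW error"


def iperf_command(server, seconds):
    """Build an iperf client command. `server` is a hypothetical iperf server host."""
    return ["iperf", "-c", server, "-t", str(seconds)]


def rescan_command():
    """The rescan command from step 2 of the report."""
    return ["nmcli", "device", "wifi", "rescan"]


def count_microcode_errors(log_text):
    """Count log lines that contain the error marker."""
    return sum(1 for line in log_text.splitlines() if ERROR_MARKER in line)


def reproduce(server, seconds=60, settle=10):
    """Generate load, trigger a rescan, then count matching kernel log lines."""
    since = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    load = subprocess.Popen(iperf_command(server, seconds),
                            stdout=subprocess.DEVNULL)
    try:
        time.sleep(settle)  # let the load ramp up before rescanning
        subprocess.run(rescan_command(), check=False)
        time.sleep(settle)  # give the firmware time to log a restart
    finally:
        load.terminate()
        load.wait()
    log = subprocess.run(
        ["journalctl", "-k", "--since", since, "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    return count_microcode_errors(log)
```

Calling `reproduce("192.0.2.10")` (placeholder address) returns the number of matching kernel log lines seen since the test started; a non-zero result suggests the firmware restart was triggered.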
Comment 1 Andrew M 2021-11-18 13:19:45 UTC
Setting modprobe config `options iwlwifi disable_11ax=Y` appears to work around the issue (note that the original problem occurs when connected to non-ax networks), but of course comes with the caveat that 802.11ax is disabled.
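The option line is typically placed in a file under `/etc/modprobe.d/` and takes effect once the module is reloaded. A small sketch for checking whether the workaround is active follows; it assumes the driver exposes the parameter read-only under `/sys/module/iwlwifi/parameters/`, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the module parameter; not confirmed in this report.
PARAM_PATH = Path("/sys/module/iwlwifi/parameters/disable_11ax")


def parse_bool_param(text):
    """Kernel bool parameters read back as 'Y' or 'N'; also accept '1'."""
    return text.strip() in ("Y", "y", "1")


def is_11ax_disabled(path=PARAM_PATH):
    """Return True/False, or None if the parameter file cannot be read."""
    try:
        return parse_bool_param(path.read_text())
    except OSError:
        return None
```

A `None` result means the module is not loaded or the parameter is not exposed, so it should not be read as "workaround inactive".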
Comment 2 Golan Ben Ami 2021-11-22 12:58:56 UTC
Created attachment 299667 [details] Fix - formatted patch. Hey Andrew, Can you please check if this fix works for you?
Comment 3 Andrew M 2021-11-22 17:13:55 UTC
Hi Golan, thanks for the patch! I patched atop 5.14.18, removed the iwlwifi param override workaround, and ran some tests.

Results using my original reproduction steps:
1. First verified that the unpatched kernel still exhibits the issue: it does.
2. Ran the test once, which resulted in a microcode error (it appears broadly the same as the original). However, network performance recovers after a second (unlike before, where performance would tank indefinitely).
3. Repeated the test multiple times and no microcode error occurred.
4. Disconnected from the WiFi network and reconnected, then re-ran (2) and (3) with the same results.

So overall definitely a lot better, if imperfect. It exhibits a microcode error seemingly once per connection, and recovery from the error is much improved, without the need for manual intervention.
Comment 4 Golan Ben Ami 2021-11-22 18:21:29 UTC
Thanks for the extensive tests and update, Andrew! So, the patch I shared was intended to fix the slow tx after restart. Now, as you can see, we're back to high tx rates very shortly after the fw restart. As for the NMI/fw restart itself, we'll need to complete that investigation.
Comment 5 Kevin Anderson 2021-11-26 00:43:23 UTC
Hi Golan, It appears I am having the same issue as Andrew and I can confirm that the provided patch reduces the poor/no throughput after a firmware restart from the previous 1-5 minutes down to a few seconds reliably. I previously posted on the Intel support forum about this issue here https://community.intel.com/t5/Wireless/Intel-AX210-Firmware-Reset-under-Load/m-p/1328150. If I can provide any additional data points or logs regarding the NMI/fw restart please let me know! - Kevin
Comment 6 ayala.beker 2021-12-02 13:52:11 UTC
There is not much data in the attached log, but the scan timeout could be a duplicate of another bug. Can you please try the attached patch on top of your kernel and see if it solves the issue?
Comment 7 ayala.beker 2021-12-02 13:53:44 UTC
Created attachment 299841 [details] iwlwifi patch to set scan timeout to 30 seconds
Comment 8 Kevin Anderson 2021-12-05 16:57:39 UTC
Hi Ayala, I am not able to reproduce the SW error with the provided patch.
Comment 9 Golan Ben Ami 2021-12-05 16:58:02 UTC
Created attachment 299895 [details] attachment-26212-0.html Hi, Thank you for your email. I'm OOO Su-Tue (5-7/12/21). Available by WhatsApp for urgent issues. Thanks, Golan
Comment 10 K900 2022-05-29 17:48:26 UTC
Looks like both of the patches here are in 5.18, and I'm still getting lockups... Should I file a new issue?