Created attachment 300529 [details]
lspci and ethtool outputs on reproducing systems
- dense enterprise deployment, 10 lightweight aps (Aruba) on one office floor, up to 125 concurrent users total, up to 25 user per AP
- the wireless network supports 802.11n, 802.11ac and 802.11ax in 5 GHz band
- authentication is wpa2-psk
- client devices consists in a variety of endpoints (laptops, cell phones, tablets, smart devices), running various versions of Mac OSX, Linux, Windows, Android or IOS.
- certain clients supports only 20Mhz, HT protection kicks in and turns off on APs as those clients are moving around. Consequently ht_operation_mode fluctuates between 4 and 6 even when staying on the same AP.
- the issue affects various laptops with Intel AX200 or AX201 chipsets, running Debian or Ubuntu with a recent kernel >= 5.10
- see attached file devices.txt for detailed information on the different laptops we have reproduced the issue on
Steps to reproduce:
- appears sometimes, but not always, after the iwlwifi STA roams from one AP to another
- seen more often when ht_operation_mode changes between 4 and 6 (but not sufficient to trigger the issue)
- STA deassociates from current AP and associates to the new one successfully
- connectivity works on the new AP for a short period of time, usually between 30s and 1 minute
- then suddenly, the Rx path breaks. No more received frame visible on the STA wireless interface. AP reports that frames are retransmitted and not acknowledged by STA.
- the Tx path keeps working. Frames sent by STA to AP are received and visible on the network
- in this state each inbound frame appears to trigger iwl_pcie_rx_handle_rb with cmd BAR_FRAME_RELEASE (seqnum is always the same):
Mar 4 12:44:32 debian kernel: [15884.715812] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle Q 0: HW = 338, SW = 337
Mar 4 12:44:32 debian kernel: [15884.715819] iwlwifi 0000:00:14.3: iwl_pcie_get_rxb Got virtual RB ID 1348
Mar 4 12:44:32 debian kernel: [15884.715831] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Mar 4 12:44:32 debian kernel: [15884.715838] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 14, NSSN 169
Mar 4 12:44:32 debian kernel: [15884.715843] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
Mar 4 12:44:32 debian kernel: [15884.715852] iwlwifi 0000:00:14.3: iwl_pcie_restock_bd Assigned virtual RB ID 1348 to queue 0 index 334
- those events do not appear during normal operation (or very rarely)
- in most cases, the STA remains in this state until Wifi is restarted or until it roams to another AP
- while in that state, it may happens (rarely) that a few frame are received with very high latency, then the next ones are lost, for instance:
[1646398334.114200] From 10.200.2.67 icmp_seq=148 Destination Host Unreachable
[1646398334.114242] From 10.200.2.67 icmp_seq=149 Destination Host Unreachable
[1646398334.114251] From 10.200.2.67 icmp_seq=150 Destination Host Unreachable
[1646398336.365181] 64 bytes from 10.200.2.1: icmp_seq=151 ttl=64 time=2251 ms
[1646398336.365237] 64 bytes from 10.200.2.1: icmp_seq=152 ttl=64 time=1227 ms
[1646398336.365250] 64 bytes from 10.200.2.1: icmp_seq=153 ttl=64 time=203 ms
[1646398375.042236] From 10.200.2.67 icmp_seq=188 Destination Host Unreachable
[1646398375.042291] From 10.200.2.67 icmp_seq=189 Destination Host Unreachable
[1646398375.042303] From 10.200.2.67 icmp_seq=190 Destination Host Unreachable
- disable_11ax=1 prevents the problem from happening
- devices.txt: device details (ethtool/lspci) on 3 different laptops that reproduced the issue
- fail2.txt: dmesg with all iwlwifi debug flags when reproducing
- fail3.txt and trace3.dat.xz.gpg: dmesg with all iwlwifi debug flags and associated trace when reproducing
Created attachment 300530 [details]
dmesg with all debug flags while reproducing (1)
Created attachment 300531 [details]
dmesg with all debug flags while reproducing (2)
Created attachment 300532 [details]
tracing output while reproducing (fail3.txt)
Issue could not be reproduced with Ubuntu's 5.8.0-63-generic, but 5.10 and more recent (tested up to 5.15) were reproducing.
Did some further testing with vanilla kernel.
5.10.66 and older DO NOT reproduce the issue.
5.10.67 and newer DO reproduce.
I see the following changes according to changelog:
iwlwifi: mvm: Fix scan channel flags settings
iwlwifi: fw: correctly limit to monitor dump
iwlwifi: mvm: fix access to BSS elements
iwlwifi: mvm: avoid static queue number aliasing
iwlwifi: mvm: fix a memory leak in iwl_mvm_mac_ctxt_beacon_changed
iwlwifi: pcie: free RBs during configure
Suspecting the one related with queues but no strong opinion atm.
Ok so after some further testing, turned out that after commenting the following lines in file drivers/net/wireless/intel/iwlwifi/pcie/trans.c:
/* free all first - we might be reconfigured for a different size */
Which were introduced by the following commit:
iwlwifi: pcie: free RBs during configure
Then i'm no longer able to reproduce. Tested in vanilla 5.10.67, vanilla 5.10.88 and 5.10.92 with Debian patches.
Finally reproduced with this line commented so it is not that one. It seems to happen much less often with this disabled tho.
Attaching traces as fail5.txt, issue was observed at timestamp 1647264000.945354 (Mar 14, 2022, 13:19:59 UTC) (last replied ping). This did not follow a ht_operation_mode change. Recovered by itself after roaming to another AP.
Created attachment 300566 [details]
dmesg with all debug flags while reproducing (3)
Also reproduced after reverting https://lore.kernel.org/all/iwlwifi.20210802172232.2e47e623f9e2.I9b0830dafbb68ef35b7b8f0f46160abec02ac7d0@changeid/, should be something else.
Today i was able to reproduce with 5.10.66. Attaching debug logs as fail6.txt.
Can't identify a version where the problem started to appear for sure so far, but still doing tests.
Created attachment 300570 [details]
dmesg with all debug flags while reproducing (3) (5.10.66)
Managed to reproduce yesterday on 5.10.47, testing 5.10.46 now.
Not sure whether or not it is a regression. Take a quick look at traces but not seeing anything special.
So today i could reproduce on 5.10.46 after a few days of testing. It does not look like a regression.
As usual, debug logs show each frame received in the context of a BAR_FRAME_RELEASE notification message, but they are not passed to the TCP stack.
The problem is reliably reproduced in the following scenario:
- enable wifi or roam to an access point with HT Protection active: use_cts_prot=0, ht_operation_mode=6, iwl_mvm_mac_ctxt_set_ht_flags protection mode set to 2
- start ping, at the beginning it works but after 5 seconds reply packets are no longer received
- when the issue happens, each received frame triggers BAR_FRAME_RELEASE and the driver attempt to retrieve the frames using iwl_mvm_release_frames. NSSN is increased by 1 at each frame, but the skb_list inside iwl_mvm_release_frames is always empty (we get into the first while loop but never inside the second one that calls iwl_mvm_pass_packet_to_mac80211).
- following events are visible at each received frame:
Apr 5 15:57:08 debian kernel: [ 1659.651634] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Apr 5 15:57:08 debian kernel: [ 1659.651637] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 1, NSSN 46
Apr 5 15:57:08 debian kernel: [ 1659.651638] iwl_mvm_release_frames: start
Apr 5 15:57:08 debian kernel: [ 1659.651639] iwl_mvm_release_frames: processing list: ffff9f9ec8b20e20
Apr 5 15:57:08 debian kernel: [ 1659.651640] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
Apr 5 15:57:08 debian kernel: [ 1659.651642] iwlwifi 0000:00:14.3: iwl_pcie_restock_bd Assigned virtual RB ID 1593 to queue 0 index 476
Apr 5 15:57:08 debian kernel: [ 1659.835457] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle Q 0: HW = 486, SW = 485
Apr 5 15:57:08 debian kernel: [ 1659.835464] iwlwifi 0000:00:14.3: iwl_pcie_get_rxb Got virtual RB ID 1592
Apr 5 15:57:08 debian kernel: [ 1659.835471] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Apr 5 15:57:08 debian kernel: [ 1659.835477] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 1, NSSN 47
Apr 5 15:57:08 debian kernel: [ 1659.835479] iwl_mvm_release_frames: start
Apr 5 15:57:08 debian kernel: [ 1659.835482] iwl_mvm_release_frames: processing list: ffff9f9ec8b20e40
Apr 5 15:57:08 debian kernel: [ 1659.835486] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
(the "iwl_mvm_release_frames" are printk i added to see in which parts of iwl_mvm_release_frames we are going through)
- before the issue starts the frames are not delivered via iwl_mvm_release_frames_from_notif / iwl_mvm_release_frames path.