Bug 215660

Summary: rx lockups with BAR_FRAME_RELEASE on AX200/AX201 when using 802.11ax
Product: Drivers Reporter: Stephane Poignant (stephane.poignant)
Component: network-wireless-intelAssignee: Default virtual assignee for network-wireless-intel (drivers_network-wireless-intel)
Status: NEW ---    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.10.47 and later Subsystem:
Regression: No Bisected commit-id:
Attachments: lspci and ethtool outputs on reproducing systems
dmesg with all debug flags while reproducing (1)
dmesg with all debug flags while reproducing (2)
tracing output while reproducing (fail3.txt)
dmesg with all debug flags while reproducing (3)
dmesg with all debug flags while reproducing (3) (5.10.66)

Description Stephane Poignant 2022-03-04 17:24:49 UTC
Created attachment 300529 [details]
lspci and ethtool outputs on reproducing systems

Context:
- dense enterprise deployment, 10 lightweight aps (Aruba) on one office floor, up to 125 concurrent users total, up to 25 user per AP
- the wireless network supports 802.11n, 802.11ac and 802.11ax in 5 GHz band
- authentication is wpa2-psk
- client devices consists in a variety of endpoints (laptops, cell phones, tablets, smart devices), running various versions of Mac OSX, Linux, Windows, Android or IOS.
- certain clients supports only 20Mhz, HT protection kicks in and turns off on APs as those clients are moving around. Consequently ht_operation_mode fluctuates between 4 and 6 even when staying on the same AP.
- the issue affects various laptops with Intel AX200 or AX201 chipsets, running Debian or Ubuntu with a recent kernel >= 5.10
- see attached file devices.txt for detailed information on the different laptops we have reproduced the issue on


Steps to reproduce:
- appears sometimes, but not always, after the iwlwifi STA roams from one AP to another
- seen more often when ht_operation_mode changes between 4 and 6 (but not sufficient to trigger the issue)
- STA deassociates from current AP and associates to the new one successfully
- connectivity works on the new AP for a short period of time, usually between 30s and 1 minute
- then suddenly, the Rx path breaks. No more received frame visible on the STA wireless interface. AP reports that frames are retransmitted and not acknowledged by STA.
- the Tx path keeps working. Frames sent by STA to AP are received and visible on the network
- in this state each inbound frame appears to trigger iwl_pcie_rx_handle_rb with cmd BAR_FRAME_RELEASE (seqnum is always the same):

Mar  4 12:44:32 debian kernel: [15884.715812] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle Q 0: HW = 338, SW = 337
Mar  4 12:44:32 debian kernel: [15884.715819] iwlwifi 0000:00:14.3: iwl_pcie_get_rxb Got virtual RB ID 1348
Mar  4 12:44:32 debian kernel: [15884.715831] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Mar  4 12:44:32 debian kernel: [15884.715838] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 14, NSSN 169
Mar  4 12:44:32 debian kernel: [15884.715843] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
Mar  4 12:44:32 debian kernel: [15884.715852] iwlwifi 0000:00:14.3: iwl_pcie_restock_bd Assigned virtual RB ID 1348 to queue 0 index 334

- those events do not appear during normal operation (or very rarely)


Temporary resolution:
- in most cases, the STA remains in this state until Wifi is restarted or until it roams to another AP
- while in that state, it may happens (rarely) that a few frame are received with very high latency, then the next ones are lost, for instance:

[1646398334.114200] From 10.200.2.67 icmp_seq=148 Destination Host Unreachable
[1646398334.114242] From 10.200.2.67 icmp_seq=149 Destination Host Unreachable
[1646398334.114251] From 10.200.2.67 icmp_seq=150 Destination Host Unreachable
[1646398336.365181] 64 bytes from 10.200.2.1: icmp_seq=151 ttl=64 time=2251 ms
[1646398336.365237] 64 bytes from 10.200.2.1: icmp_seq=152 ttl=64 time=1227 ms
[1646398336.365250] 64 bytes from 10.200.2.1: icmp_seq=153 ttl=64 time=203 ms
[1646398375.042236] From 10.200.2.67 icmp_seq=188 Destination Host Unreachable
[1646398375.042291] From 10.200.2.67 icmp_seq=189 Destination Host Unreachable
[1646398375.042303] From 10.200.2.67 icmp_seq=190 Destination Host Unreachable


Workaround:
- disable_11ax=1 prevents the problem from happening


Case attachments:
- devices.txt: device details (ethtool/lspci) on 3 different laptops that reproduced the issue
- fail2.txt: dmesg with all iwlwifi debug flags when reproducing
- fail3.txt and trace3.dat.xz.gpg: dmesg with all iwlwifi debug flags and associated trace when reproducing
Comment 1 Stephane Poignant 2022-03-04 17:25:28 UTC
Created attachment 300530 [details]
dmesg with all debug flags while reproducing (1)
Comment 2 Stephane Poignant 2022-03-04 17:25:56 UTC
Created attachment 300531 [details]
dmesg with all debug flags while reproducing (2)
Comment 3 Stephane Poignant 2022-03-04 17:26:40 UTC
Created attachment 300532 [details]
tracing output while reproducing (fail3.txt)
Comment 4 Stephane Poignant 2022-03-07 17:49:40 UTC
Issue could not be reproduced with Ubuntu's 5.8.0-63-generic, but 5.10 and more recent (tested up to 5.15) were reproducing.
Comment 5 Stephane Poignant 2022-03-10 14:48:39 UTC
Did some further testing with vanilla kernel.
5.10.66 and older DO NOT reproduce the issue.
5.10.67 and newer DO reproduce.

I see the following changes according to changelog:
iwlwifi: mvm: Fix scan channel flags settings
iwlwifi: fw: correctly limit to monitor dump
iwlwifi: mvm: fix access to BSS elements
iwlwifi: mvm: avoid static queue number aliasing
iwlwifi: mvm: fix a memory leak in iwl_mvm_mac_ctxt_beacon_changed
iwlwifi: pcie: free RBs during configure

Suspecting the one related with queues but no strong opinion atm.
Comment 6 Stephane Poignant 2022-03-11 10:18:29 UTC
Ok so after some further testing, turned out that after commenting the following lines in file drivers/net/wireless/intel/iwlwifi/pcie/trans.c:

	/* free all first - we might be reconfigured for a different size */
	iwl_pcie_free_rbs_pool(trans);

Which were introduced by the following commit:
iwlwifi: pcie: free RBs during configure
https://lore.kernel.org/all/iwlwifi.20210802170640.42d7c93279c4.I07f74e65aab0e3d965a81206fcb289dc92d74878@changeid/

Then i'm no longer able to reproduce. Tested in vanilla 5.10.67, vanilla 5.10.88 and 5.10.92 with Debian patches.
Comment 7 Stephane Poignant 2022-03-14 13:42:41 UTC
Finally reproduced with this line commented so it is not that one. It seems to happen much less often with this disabled tho.

Attaching traces as fail5.txt, issue was observed at timestamp 1647264000.945354 (Mar 14, 2022, 13:19:59 UTC) (last replied ping). This did not follow a ht_operation_mode change. Recovered by itself after roaming to another AP.
Comment 8 Stephane Poignant 2022-03-14 13:45:03 UTC
Created attachment 300566 [details]
dmesg with all debug flags while reproducing (3)
Comment 9 Stephane Poignant 2022-03-14 16:33:56 UTC
Also reproduced after reverting https://lore.kernel.org/all/iwlwifi.20210802172232.2e47e623f9e2.I9b0830dafbb68ef35b7b8f0f46160abec02ac7d0@changeid/, should be something else.
Comment 10 Stephane Poignant 2022-03-15 12:31:44 UTC
Today i was able to reproduce with 5.10.66. Attaching debug logs as fail6.txt.
Can't identify a version where the problem started to appear for sure so far, but still doing tests.
Comment 11 Stephane Poignant 2022-03-15 12:32:26 UTC
Created attachment 300570 [details]
dmesg with all debug flags while reproducing (3) (5.10.66)
Comment 12 Stephane Poignant 2022-03-17 09:05:17 UTC
Managed to reproduce yesterday on 5.10.47, testing 5.10.46 now.
Not sure whether or not it is a regression. Take a quick look at traces but not seeing anything special.
Comment 13 Stephane Poignant 2022-03-18 13:17:14 UTC
So today i could reproduce on 5.10.46 after a few days of testing. It does not look like a regression.
As usual, debug logs show each frame received in the context of a BAR_FRAME_RELEASE notification message, but they are not passed to the TCP stack.
Comment 14 Stephane Poignant 2022-04-05 17:14:30 UTC
The problem is reliably reproduced in the following scenario:
- enable wifi or roam to an access point with HT Protection active: use_cts_prot=0, ht_operation_mode=6, iwl_mvm_mac_ctxt_set_ht_flags protection mode set to 2
- start ping, at the beginning it works but after 5 seconds reply packets are no longer received
- when the issue happens, each received frame triggers BAR_FRAME_RELEASE and the driver attempt to retrieve the frames using iwl_mvm_release_frames. NSSN is increased by 1 at each frame, but the skb_list inside iwl_mvm_release_frames is always empty (we get into the first while loop but never inside the second one that calls iwl_mvm_pass_packet_to_mac80211).
- following events are visible at each received frame:
Apr  5 15:57:08 debian kernel: [ 1659.651634] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Apr  5 15:57:08 debian kernel: [ 1659.651637] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 1, NSSN 46
Apr  5 15:57:08 debian kernel: [ 1659.651638] iwl_mvm_release_frames: start
Apr  5 15:57:08 debian kernel: [ 1659.651639] iwl_mvm_release_frames: processing list: ffff9f9ec8b20e20
Apr  5 15:57:08 debian kernel: [ 1659.651640] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
Apr  5 15:57:08 debian kernel: [ 1659.651642] iwlwifi 0000:00:14.3: iwl_pcie_restock_bd Assigned virtual RB ID 1593 to queue 0 index 476
Apr  5 15:57:08 debian kernel: [ 1659.835457] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle Q 0: HW = 486, SW = 485
Apr  5 15:57:08 debian kernel: [ 1659.835464] iwlwifi 0000:00:14.3: iwl_pcie_get_rxb Got virtual RB ID 1592
Apr  5 15:57:08 debian kernel: [ 1659.835471] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: cmd at offset 0: BAR_FRAME_RELEASE (00.c2, seq 0xbfff)
Apr  5 15:57:08 debian kernel: [ 1659.835477] iwlwifi 0000:00:14.3: iwl_mvm_release_frames_from_notif Frame release notification for BAID 1, NSSN 47
Apr  5 15:57:08 debian kernel: [ 1659.835479] iwl_mvm_release_frames: start
Apr  5 15:57:08 debian kernel: [ 1659.835482] iwl_mvm_release_frames: processing list: ffff9f9ec8b20e40
Apr  5 15:57:08 debian kernel: [ 1659.835486] iwlwifi 0000:00:14.3: iwl_pcie_rx_handle_rb Q 0: RB end marker at offset 64
...

(the "iwl_mvm_release_frames" are printk i added to see in which parts of iwl_mvm_release_frames we are going through)

- before the issue starts the frames are not delivered via iwl_mvm_release_frames_from_notif / iwl_mvm_release_frames path.