Created attachment 301070 [details]
ax200 stall on 5.18 (iwlwifi.debug=0xffffffff)

On every kernel version I've tested where the ax200 is supported (I've tested 5.3 through 5.18), there are rx stalls on the ax200. Since it affects even the oldest kernel where ax200 works for me (5.3), I cannot bisect the issue. All firmware versions appear to be affected, with 71.058653f6.0 (cc-a0-71.ucode) being the newest version that I've tested.

The stalls occur at random, but I've recently found a way to reproduce it consistently within about a minute:
`while true; do iperf3 -c $ACCESS_POINT_IP -u -b 1000M -P 16 -R; done`

Doing something similar to the above with the `ethr` utility replicates the issue as well, so it's unrelated to iperf3.

When a stall occurs, iperf3 gets stuck here:
```
[<0>] __skb_wait_for_more_packets+0x116/0x170
[<0>] __skb_recv_udp+0x1ef/0x310
[<0>] udp_recvmsg+0x8e/0x550
[<0>] inet_recvmsg+0x115/0x130
[<0>] __sys_recvfrom+0x14c/0x160
[<0>] __x64_sys_recvfrom+0x1b/0x20
[<0>] do_syscall_64+0x37/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
```

I tested the backport-iwlwifi driver and it's also affected.

I tested an ax210 and it's also affected.

I tested a 7265D (which uses iwlmvm too), and it *isn't* affected.

I tested an mt7921k (80 MHz 802.11ax card), and it *isn't* affected, so this doesn't seem like an issue with my AP.

I've attached a short dmesg of the stall with `/sys/module/iwlwifi/parameters/debug` set to `0xffffffff`. In `ax200-stall-dmesg.txt`, the last message emitted before the stall occurred is at 405.480747 seconds. Every message printed after that is with iperf3 stalled.

I can quickly test experimental kernel patches and provide more info if needed. I can also change hostapd settings on my AP to test anything if needed.
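For anyone trying to reproduce this, the loop from the report can be wrapped in a small watchdog that notices when an iperf3 run stops making progress and grabs the kernel stack before killing it. This is only a sketch: `watch_for_stall` is a hypothetical helper, the time limit is an arbitrary choice, and it assumes the trace in the report came from `/proc/<pid>/stack` (which is readable only by root).

```shell
# Sketch: run a command in the background and flag it as stalled if it is
# still alive after LIMIT seconds. watch_for_stall is a hypothetical helper,
# not part of iperf3 or iwlwifi.
watch_for_stall() {
    limit="$1"; shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "stalled"
            # Kernel stack of the blocked process; prints nothing unless root.
            cat "/proc/$pid/stack" 2>/dev/null
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "finished"
}
```

Something like `while watch_for_stall 60 iperf3 -c "$ACCESS_POINT_IP" -u -b 1000M -P 16 -R; do :; done` would then stop at the first run that exceeds the limit. The 60 seconds here is an assumption; pick a limit comfortably above the normal test duration.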
Any chance you could capture (with another device in monitor mode, per [1]) what happens over the air? Looks like you have AX200 and AX210, so if they're in different machines that shouldn't be too hard. Also, does the traffic ever recover? [1] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#air_sniffing
Created attachment 301073 [details]
ax200 stall capture

(In reply to Johannes Berg from comment #1)
> Any chance you could capture (with another device in monitor mode, per [1])
> what happens over the air? Looks like you have AX200 and AX210, so if
> they're in different machines that shouldn't be too hard.

FYI, I don't have the ax210 anymore (returned it after seeing that it had the same issue). I've got plenty of ax200s lying around though and attached a capture encrypted with the 3 keys at the bottom of your wiki link.

The capture was done on a quiet channel with only my ax200 machine and my AP transmitting. The last packet that was sniffed before the stall occurred is at 17:50:26.181350 (992830798us).

> Also, does the traffic ever recover?

No. There's either an error or a timeout, but never a recovery. If nothing stops the user program while it's stalled, then it'll remain stalled.
Created attachment 301166 [details] ax200 stall capture (with data frames)
Created attachment 301167 [details] ax200 firmware dump (pairs with ax200-stall-capture_20220613.pcap.gpg)
As discussed over IRC, I've uploaded a firmware dump triggered during a stall and a complete capture to go along with it (since the last capture was missing data frames). The capture is truncated to show the last ~5 MB of the full capture (which is ~400 MB). I can provide earlier chunks of the capture if needed. In the new capture, the last packet sniffed before the stall occurred is at 18:56:15.443046 (468333516us). This stall was reproduced using VHT with 20 MHz bandwidth.
Hello,

I work with Johannes on this one. I looked at the sniffer capture from comment#3 and it looks perfect: there is no packet loss at all. I do see something strange, though: some packets are encrypted with TKIP and others with CCMP?! If you don't mind, it would be useful to test with an OPEN connection to see whether that helps. Of course, please keep all the fixes that were shared with you until now.

Thanks.
I tried to reproduce with all the fixes: the fixes in the firmware and in the reorder buffer. My AP is configured to be on channel 36 in 20 MHz. I have a PC connected by LAN to the AP, on which I push traffic:

iperf3 -c 192.168.2.103 -u -b 1G -I16 -t 10000

Of course, I lose tons of packets on the receiving side: the pipe is just not big enough to carry all the downlink traffic. But I couldn't see any stall:

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.2.240, port 39794
[  6] local 192.168.2.103 port 5201 connected to 192.168.2.240 port 45065
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  6]   0.00-1.00   sec  11.8 MBytes  98.9 Mbits/sec  0.145 ms  45056/53591 (84%)
[  6]   1.00-2.00   sec  11.5 MBytes  96.5 Mbits/sec  0.154 ms  73444/81771 (90%)
[  6]   2.00-3.00   sec  11.5 MBytes  96.5 Mbits/sec  0.131 ms  75119/83450 (90%)
[  6]   3.00-4.00   sec  12.2 MBytes  102 Mbits/sec   0.187 ms  71255/80084 (89%)
[  6]   4.00-5.00   sec  14.1 MBytes  119 Mbits/sec   0.220 ms  76301/86531 (88%)
[  6]   5.00-6.00   sec  14.9 MBytes  125 Mbits/sec   0.153 ms  71517/82335 (87%)
[  6]   6.00-7.00   sec  14.9 MBytes  125 Mbits/sec   0.259 ms  70986/81787 (87%)
[  6]   7.00-8.00   sec  14.8 MBytes  124 Mbits/sec   0.207 ms  72394/83103 (87%)
[  6]   8.00-9.00   sec  14.7 MBytes  124 Mbits/sec   0.120 ms  71083/81748 (87%)
[  6]   9.00-10.00  sec  14.3 MBytes  120 Mbits/sec   0.138 ms  69391/79754 (87%)
[  6]  10.00-11.00  sec  14.5 MBytes  122 Mbits/sec   0.148 ms  74639/85138 (88%)
[  6]  11.00-12.00  sec  14.0 MBytes  117 Mbits/sec   0.109 ms  70613/80747 (87%)
[  6]  12.00-13.00  sec  11.1 MBytes  93.2 Mbits/sec  0.180 ms  72184/80230 (90%)
[  6]  13.00-14.00  sec  11.7 MBytes  98.1 Mbits/sec  0.383 ms  73222/81688 (90%)
[  6]  14.00-15.00  sec  10.8 MBytes  90.7 Mbits/sec  0.256 ms  73588/81417 (90%)
[  6]  15.00-16.00  sec  10.7 MBytes  89.4 Mbits/sec  0.187 ms  74626/82340 (91%)
[  6]  16.00-17.00  sec  10.8 MBytes  90.8 Mbits/sec  0.174 ms  75843/83685 (91%)
[  6]  17.00-18.00  sec  11.4 MBytes  95.9 Mbits/sec  0.165 ms  73750/82027 (90%)
[  6]  18.00-19.00  sec  11.7 MBytes  98.5 Mbits/sec  0.098 ms  77916/86418 (90%)
....

May I suggest that you push fewer packets into the pipe and that, instead of checking for stalls, I send you a patch for iperf that prints which packets it is missing? Then we can check whether those packets show up in the air sniffer (and possibly collect more data).
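To put a single number on the loss in a log like the one above, the Lost/Total column can be summed across interval lines. A minimal sketch, assuming the interval lines look exactly like the iperf3 output quoted in this comment; `summarize_loss` is a hypothetical helper name.

```shell
# Sketch: read iperf3 UDP interval lines on stdin and print
# "<lost> <total> <percent lost>" summed over all lines.
summarize_loss() {
    awk '
        {
            # The Lost/Total field is the only one of the form digits/digits.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+\/[0-9]+$/) {
                    split($i, parts, "/")
                    lost += parts[1]
                    total += parts[2]
                }
            }
        }
        END {
            if (total > 0)
                printf "%d %d %.1f\n", lost, total, 100 * lost / total
        }
    '
}
```

Fed the first two interval lines above, this prints `118500 135362 87.5`, which matches the per-interval figures: with 1G offered on a link that carries roughly 100 Mbit/s, most datagrams are dropped before they are ever received.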
On 20 MHz I can reliably push 130 Mbps and I have no issues.
I did see something interesting a few times. I run tcpdump on the wlan interface of the receiver and have iperf print which packet is missing. I also have a plugin for Wireshark that shows the iperf sequence number.
I could actually see iperf complaining about a missed packet that I do see in the tcpdump capture on the wlan interface, which is very strange.
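The cross-check described above amounts to a set intersection: sequence numbers that iperf reports as missing but that are present in the capture. A rough sketch, assuming both lists have already been exported to text files with one sequence number per line (the export step and the `missing_but_captured` name are assumptions; the output format of the patched iperf is not shown in this thread):

```shell
# Sketch: print sequence numbers that appear in both files.
# $1: numbers iperf reported missing, $2: numbers seen in the tcpdump capture.
missing_but_captured() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm needs both inputs sorted in the same (lexical) order.
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # -12 keeps only the lines common to both files.
    comm -12 "$a" "$b" | sort -n
    rm -f "$a" "$b"
}
```

Any number this prints is a datagram that reached the wlan interface yet was still counted as lost by iperf, which is the odd case described in this comment.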
Created attachment 304434 [details]
patch to fetch info from iperf

Patch to fetch info from iperf: which packet is missing.
Link for the wireshark dissector: https://github.com/geertn444/iperf3_dissector/blob/master/iperf3.lua You can copy it to: ~/.local/lib/wireshark/plugins/
Created attachment 304950 [details] fix Hi, This is a modified and enhanced version of your patch. Does it fix your issues?
(In reply to Emmanuel Grumbach from comment #11)
> Created attachment 304950 [details]
> fix
>
> Hi,
>
> This is a modified and enhanced version of your patch.
> Does it fix your issues?

Sorry for not getting back to you previously.

I've just tested your patch atop 6.4.11, and it doesn't fix the issue; iperf3 still stalls within about 30 seconds. This was on an ax211, with the following firmware version:
`[ 5.008802] iwlwifi 0000:00:14.3: loaded firmware version 78.3bfdc55f.0 so-a0-gf-a0-78.ucode op_mode iwlmvm`
Thanks

It stalls forever?
Or recovers?

Are you positive that the patch you sent back then fixed the issue?
My patch does pretty much the same as you did... I just cleaned up the code.
Created attachment 304951 [details]
confirmed fix for rx stalls

(In reply to Emmanuel Grumbach from comment #13)
> Thanks
>
> It stalls forever?
> Or recovers?

It recovers, but takes up to a few minutes to do so. The queues don't appear to be stuck in a bad state like before if I just ^C iperf3; previously, I would observe a cascade of iperf3 stalls after ^C'ing a stalled iperf3 instance, where iperf3 would continue to stall a few more times after that initial stall.

> Are you positive that the patch you sent back then fixed the issue?
> My patch does pretty much the same as you did... I just cleaned up the code.

Yes. The patch I sent back then withstood 2 hours of iperf3 torture and never stalled. I tested it again now without observing any stalls after several minutes.

I've attached the current version of the patch I use, which contains a small clean up you made back in June.
All this is really strange because my patch does pretty much what your patch does but I remove tons of code that is no longer needed. I'll try tomorrow to prepare a series of patches that will be my patch split in several sub-patches. I'd be glad if you'd be able to run tests at each step to see where things go wrong. My patch was tested internally of course..
(In reply to Emmanuel Grumbach from comment #15)
> All this is really strange because my patch does pretty much what your patch
> does but I remove tons of code that is no longer needed.
> I'll try tomorrow to prepare a series of patches that will be my patch split
> in several sub-patches. I'd be glad if you'd be able to run tests at each
> step to see where things go wrong.

Sure. It's still quick for me to reproduce (always happens in <1m), so I can bisect quickly.

> My patch was tested internally of course..

Were you able to reproduce the issue?
We had a customer which reported a problem of packets getting lost. We have several fixes for the firmware that address this.

But wait!
You don't have those fixes... And they are critical.

Can you check out our backport tree?
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/log/

I'll send you the most recent version of the firmware with all the fixes, but you'll need our master branch from our internal tree to load the newest firmware.
(In reply to Emmanuel Grumbach from comment #17)
> We had a customer which reported a problem of packets getting lost. We have
> several fixes for the firmware that address this.
>
> But wait!
> You don't have those fixes... And they are critical.
>
> Can you check out our backport tree?
> https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/log/
>
> I'll send you the most recent version of the firmware with all the fixes,
> but you'll need our master branch from our internal tree to load the newest
> firmware.

Ah, yeah Johannes mentioned such to me back in June. When I tested backport-iwlwifi + newer firmware back then, stalls didn't occur but iperf3 still reported out-of-order packet receipt. :)

Feel free to send the firmware to my email or we can talk over IRC if you'd like (my nick is kerneltoast on OFTC and Libera).
Created attachment 304952 [details]
core81 firmware

So I had this problem with your original patch + firmware fixes.

So now please take:
backport-iwlwifi master branch
+ the patch I attached here in comment#11
+ the firmware I attached here.

I'll also need to attach the PNVM that matches this .ucode file. Please back up your previous PNVM file so that you can roll back afterwards.
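One cautious way to do the backup mentioned above is to copy the existing file aside only if no earlier copy exists, so that repeating the step cannot overwrite the original. A minimal sketch; `backup_once` and the `.orig` suffix are hypothetical, and the firmware path in the usage note is an assumption about where your distribution installs it.

```shell
# Sketch: keep a one-time copy of a file before replacing it.
backup_once() {
    src="$1"
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    # Never clobber an earlier backup, so the original stays recoverable.
    [ -e "$src.orig" ] || cp -p "$src" "$src.orig"
}
```

For example, `backup_once /lib/firmware/iwlwifi-so-a0-gf-a0.pnvm` (path assumed; some distributions ship the file compressed under a different name), after which rolling back is a matter of copying the `.orig` file over the new one.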
Created attachment 304953 [details] pnvm
Tested all that with backport-iwlwifi @ 7a0a4e45dd7b1482c9964748d0ffb086552a9d1e and the stalls are indeed fixed. This issue is therefore resolved. Thanks!