Bug 216044 - ax200/ax210 rx stalls on all kernel versions [WIFI-222755]
Summary: ax200/ax210 rx stalls on all kernel versions [WIFI-222755]
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel
Hardware: All Linux
Importance: P1 normal
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-29 00:09 UTC by Sultan Alsawaf
Modified: 2023-08-28 03:39 UTC (History)

See Also:
Kernel Version: 5.3+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
ax200 stall on 5.18 (iwlwifi.debug=0xffffffff) (4.27 MB, text/plain)
2022-05-29 00:09 UTC, Sultan Alsawaf
ax200 stall capture (106.59 KB, application/pgp-encrypted)
2022-05-30 01:10 UTC, Sultan Alsawaf
ax200 stall capture (with data frames) (4.28 MB, application/pgp-encrypted)
2022-06-14 02:43 UTC, Sultan Alsawaf
ax200 firmware dump (pairs with ax200-stall-capture_20220613.pcap.gpg) (1.50 KB, application/pgp-encrypted)
2022-06-14 02:45 UTC, Sultan Alsawaf
patch to fetch info from iperf (587 bytes, patch)
2023-06-15 14:59 UTC, Emmanuel Grumbach
fix (21.26 KB, patch)
2023-08-27 17:14 UTC, Emmanuel Grumbach
confirmed fix for rx stalls (4.27 KB, patch)
2023-08-27 18:52 UTC, Sultan Alsawaf
core81 firmware (1.66 MB, application/octet-stream)
2023-08-27 19:52 UTC, Emmanuel Grumbach
pnvm (54.28 KB, application/octet-stream)
2023-08-27 19:52 UTC, Emmanuel Grumbach

Description Sultan Alsawaf 2022-05-29 00:09:03 UTC
Created attachment 301070 [details]
ax200 stall on 5.18 (iwlwifi.debug=0xffffffff)

On every kernel version I've tested where the ax200 is supported (I've tested 5.3 through 5.18), there are rx stalls on the ax200. Since it affects even the oldest kernel where ax200 works for me (5.3), I cannot bisect the issue. All firmware versions appear to be affected, with 71.058653f6.0 (cc-a0-71.ucode) being the newest version that I've tested.

The stalls occur at random, but I've recently found a way to reproduce it consistently within about a minute:
`while true; do iperf3 -c $ACCESS_POINT_IP -u -b 1000M -P 16 -R; done`

Doing something similar to the above with the `ethr` utility reproduces the issue as well, so it isn't specific to iperf3.

When a stall occurs, iperf3 gets stuck here:
```
[<0>] __skb_wait_for_more_packets+0x116/0x170
[<0>] __skb_recv_udp+0x1ef/0x310
[<0>] udp_recvmsg+0x8e/0x550
[<0>] inet_recvmsg+0x115/0x130
[<0>] __sys_recvfrom+0x14c/0x160
[<0>] __x64_sys_recvfrom+0x1b/0x20
[<0>] do_syscall_64+0x37/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
```
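As an illustration, a trace in this format can be checked automatically to confirm that a hung process is waiting in the UDP receive path. This is only a sketch: it assumes the trace was read from `/proc/<pid>/stack` (which normally requires root), and the helper names `parse_stack`, `is_blocked_in_udp_recv` and `read_stack` are made up for this example.

```python
import re

# Matches one frame line such as "[<0>] udp_recvmsg+0x8e/0x550".
FRAME_RE = re.compile(
    r"^\[<[0-9a-fx]+>\]\s+(?P<symbol>[\w.]+)"
    r"\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_stack(text):
    """Return a list of (symbol, offset, size) tuples, outermost frame last."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(
                (m["symbol"], int(m["offset"], 16), int(m["size"], 16))
            )
    return frames


def is_blocked_in_udp_recv(frames):
    """True if the task is sleeping while waiting for UDP datagrams."""
    symbols = [symbol for symbol, _, _ in frames]
    return (
        "__skb_wait_for_more_packets" in symbols
        and "udp_recvmsg" in symbols
    )


def read_stack(pid):
    """Read the kernel stack of a task (assumption: run as root)."""
    with open(f"/proc/{pid}/stack") as f:
        return f.read()
```

For example, `is_blocked_in_udp_recv(parse_stack(read_stack(pid)))` with the PID of the stalled iperf3 process should return `True` for the trace above.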

I tested the backport-iwlwifi driver and it's also affected.

I tested an ax210 and it's also affected.

I tested a 7265D (which uses iwlmvm too), and it *isn't* affected.

I tested an mt7921k (80 MHz 802.11ax card), and it *isn't* affected, so this doesn't seem like an issue with my AP.

I've attached a short dmesg of the stall with `/sys/module/iwlwifi/parameters/debug` set to `0xffffffff`. In `ax200-stall-dmesg.txt`, the last message emitted before the stall occurred is at 405.480747 seconds. Every message printed after that is with iperf3 stalled.
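To make it easier to look at only the messages on either side of the stall, the log can be split at that timestamp. The following is a rough sketch that assumes the usual `[  seconds.micros] message` dmesg line format; the function name `split_dmesg` is made up for this example.

```python
import re

# Leading dmesg timestamp, e.g. "[  405.480747] ...".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def split_dmesg(lines, stall_ts):
    """Split log lines into (up to and including stall_ts, after stall_ts).

    Lines without a leading timestamp are skipped.
    """
    before, after = [], []
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue
        if float(m.group(1)) <= stall_ts:
            before.append(line)
        else:
            after.append(line)
    return before, after
```

Usage would look like `before, after = split_dmesg(open("ax200-stall-dmesg.txt"), 405.480747)`, after which `after` holds only the messages logged while iperf3 was stalled.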

I can quickly test experimental kernel patches and provide more info if needed. I can also change hostapd settings on my AP to test anything if needed.
Comment 1 Johannes Berg 2022-05-29 14:28:53 UTC
Any chance you could capture (with another device in monitor mode, per [1]) what happens over the air? Looks like you have AX200 and AX210, so if they're in different machines that shouldn't be too hard.

Also, does the traffic ever recover?

[1] https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#air_sniffing
Comment 2 Sultan Alsawaf 2022-05-30 01:10:58 UTC
Created attachment 301073 [details]
ax200 stall capture

(In reply to Johannes Berg from comment #1)
> Any chance you could capture (with another device in monitor mode, per [1])
> what happens over the air? Looks like you have AX200 and AX210, so if
> they're in different machines that shouldn't be too hard.

FYI, I don't have the ax210 anymore (returned it after seeing that it had the same issue).

I've got plenty of ax200s lying around though and attached a capture encrypted with the 3 keys at the bottom of your wiki link. The capture was done on a quiet channel with only my ax200 machine and my AP transmitting. The last packet that was sniffed before the stall occurred is at 17:50:26.181350 (992830798us).

> Also, does the traffic ever recover?

No. There's either an error or a timeout, but never a recovery. If nothing stops the user program while it's stalled, then it'll remain stalled.
Comment 3 Sultan Alsawaf 2022-06-14 02:43:31 UTC
Created attachment 301166 [details]
ax200 stall capture (with data frames)
Comment 4 Sultan Alsawaf 2022-06-14 02:45:00 UTC
Created attachment 301167 [details]
ax200 firmware dump (pairs with ax200-stall-capture_20220613.pcap.gpg)
Comment 5 Sultan Alsawaf 2022-06-14 02:52:50 UTC
As discussed over IRC, I've uploaded a firmware dump triggered during a stall and a complete capture to go along with it (since the last capture was missing data frames).

The capture is truncated to show the last ~5 MB of the full capture (which is ~400 MB). I can provide earlier chunks of the capture if needed.

In the new capture, the last packet sniffed before the stall occurred is at 18:56:15.443046 (468333516us).

This stall was reproduced using VHT with 20 MHz bandwidth.
Comment 6 Emmanuel Grumbach 2023-06-15 07:38:09 UTC
Hello,

I work with Johannes on this one.
I looked at the sniffer capture from comment#3 and it looks perfect.
There is no packet loss at all.

I do see something strange though.
There are packets that are encrypted in TKIP and others in CCMP?!
If you don't mind, it'd be useful to see what happens on an OPEN (unencrypted) connection, to check whether that helps.
Of course, please keep all the fixes that were shared with you until now.
Thanks.
Comment 7 Emmanuel Grumbach 2023-06-15 12:31:19 UTC
I tried to reproduce with all the fixes: fixes in the firmware and in the reorder buffer.

My AP is configured to be on channel 36 with 20 MHz bandwidth.

I have a PC connected by LAN to the AP on which I push traffic:
iperf3 -c 192.168.2.103 -u -b 1G -I16 -t 10000

Of course, I lose tons of packets on the receiving side: the pipe is just not big enough to carry all the downlink traffic. But I couldn't see any stall:
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.2.240, port 39794
[  6] local 192.168.2.103 port 5201 connected to 192.168.2.240 port 45065
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  6]   0.00-1.00   sec  11.8 MBytes  98.9 Mbits/sec  0.145 ms  45056/53591 (84%)
[  6]   1.00-2.00   sec  11.5 MBytes  96.5 Mbits/sec  0.154 ms  73444/81771 (90%)
[  6]   2.00-3.00   sec  11.5 MBytes  96.5 Mbits/sec  0.131 ms  75119/83450 (90%)
[  6]   3.00-4.00   sec  12.2 MBytes   102 Mbits/sec  0.187 ms  71255/80084 (89%)
[  6]   4.00-5.00   sec  14.1 MBytes   119 Mbits/sec  0.220 ms  76301/86531 (88%)
[  6]   5.00-6.00   sec  14.9 MBytes   125 Mbits/sec  0.153 ms  71517/82335 (87%)
[  6]   6.00-7.00   sec  14.9 MBytes   125 Mbits/sec  0.259 ms  70986/81787 (87%)
[  6]   7.00-8.00   sec  14.8 MBytes   124 Mbits/sec  0.207 ms  72394/83103 (87%)
[  6]   8.00-9.00   sec  14.7 MBytes   124 Mbits/sec  0.120 ms  71083/81748 (87%)
[  6]   9.00-10.00  sec  14.3 MBytes   120 Mbits/sec  0.138 ms  69391/79754 (87%)
[  6]  10.00-11.00  sec  14.5 MBytes   122 Mbits/sec  0.148 ms  74639/85138 (88%)
[  6]  11.00-12.00  sec  14.0 MBytes   117 Mbits/sec  0.109 ms  70613/80747 (87%)
[  6]  12.00-13.00  sec  11.1 MBytes  93.2 Mbits/sec  0.180 ms  72184/80230 (90%)
[  6]  13.00-14.00  sec  11.7 MBytes  98.1 Mbits/sec  0.383 ms  73222/81688 (90%)
[  6]  14.00-15.00  sec  10.8 MBytes  90.7 Mbits/sec  0.256 ms  73588/81417 (90%)
[  6]  15.00-16.00  sec  10.7 MBytes  89.4 Mbits/sec  0.187 ms  74626/82340 (91%)
[  6]  16.00-17.00  sec  10.8 MBytes  90.8 Mbits/sec  0.174 ms  75843/83685 (91%)
[  6]  17.00-18.00  sec  11.4 MBytes  95.9 Mbits/sec  0.165 ms  73750/82027 (90%)
[  6]  18.00-19.00  sec  11.7 MBytes  98.5 Mbits/sec  0.098 ms  77916/86418 (90%)

....

May I suggest that you push fewer packets into the pipe? Instead of checking for stalls, I can send you a patch for iperf that prints which packets it is missing.
Then we can check whether those packets show up in the air sniffer (and possibly collect more data).
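For reference, interval lines like the ones in the server output above can be summarized with a short script. This is a sketch only: it assumes the UDP receiver line layout shown in this comment, the names `parse_intervals` and `stalled_intervals` are made up here, and an interval with zero received datagrams is treated as a possible stall, which is an assumption rather than something iperf3 itself reports.

```python
import re

# One UDP receiver interval line, e.g.
# "[  6]   0.00-1.00   sec  11.8 MBytes  98.9 Mbits/sec  0.145 ms  45056/53591 (84%)"
LINE_RE = re.compile(
    r"^\[\s*\d+\]\s+(?P<start>\d+\.\d+)-(?P<end>\d+\.\d+)\s+sec\s+"
    r"[\d.]+\s+\w+\s+(?P<rate>[\d.]+)\s+(?P<unit>\w+/sec)\s+"
    r"[\d.]+\s+ms\s+(?P<lost>\d+)/(?P<total>\d+)"
)


def parse_intervals(text):
    """Return one dict per interval line; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        lost, total = int(m["lost"]), int(m["total"])
        rows.append({
            "start": float(m["start"]),
            "end": float(m["end"]),
            "rate": float(m["rate"]),
            "unit": m["unit"],
            "lost": lost,
            "total": total,
            "received": total - lost,
            # Guard against an interval in which nothing was expected.
            "loss_pct": 100.0 * lost / total if total else 0.0,
        })
    return rows


def stalled_intervals(rows):
    """Intervals in which no datagram at all was received."""
    return [r for r in rows if r["received"] == 0]
```

In the output above every interval still received several thousand datagrams, so `stalled_intervals` would return an empty list, which matches the observation that no stall occurred.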
Comment 8 Emmanuel Grumbach 2023-06-15 14:56:34 UTC
On 20 MHz I can reliably push 130 Mbps and I have no issues.
I did see something interesting a few times.
I run tcpdump on the wlan interface of the receiver and I print which packet iperf reports as missing.
I also have a plugin for wireshark that shows the iperf sequence number in wireshark.
I could actually see iperf complaining that it missed a packet that I do see in the tcpdump on the wlan interface, which is very strange.
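The cross-check described here can be sketched roughly as follows. This is an illustration only: it assumes that each iperf3 UDP payload starts with three 32-bit big-endian fields (seconds, microseconds, sequence number), which does not hold if 64-bit counters are enabled, and the function names are made up for this example. Extracting the UDP payloads from the tcpdump capture is left out.

```python
import struct


def iperf3_seq(payload):
    """Extract the sequence number from an iperf3 UDP payload.

    Assumed layout: sec (u32), usec (u32), seq (u32), network byte order.
    """
    if len(payload) < 12:
        raise ValueError("payload too short for an iperf3 UDP header")
    _sec, _usec, seq = struct.unpack("!III", payload[:12])
    return seq


def delivered_but_reported_missing(reported_missing, captured_payloads):
    """Sequence numbers iperf reported missing that are in the capture.

    reported_missing: iterable of sequence numbers printed by iperf.
    captured_payloads: iterable of UDP payloads (bytes) from the capture
    taken on the wlan interface.
    """
    seen = {iperf3_seq(p) for p in captured_payloads if len(p) >= 12}
    return sorted(set(reported_missing) & seen)
```

A non-empty result means the packet reached the wlan interface but was still counted as lost by iperf, which is the strange case described above.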
Comment 9 Emmanuel Grumbach 2023-06-15 14:59:17 UTC
Created attachment 304434 [details]
patch to fetch info from iperf

patch to fetch info from iperf: what packet is missing
Comment 10 Emmanuel Grumbach 2023-06-15 15:00:17 UTC
Link for the wireshark dissector:
https://github.com/geertn444/iperf3_dissector/blob/master/iperf3.lua

You can copy it to:
~/.local/lib/wireshark/plugins/
Comment 11 Emmanuel Grumbach 2023-08-27 17:14:47 UTC
Created attachment 304950 [details]
fix

Hi,

This is a modified and enhanced version of your patch.
Does it fix your issues?
Comment 12 Sultan Alsawaf 2023-08-27 18:30:52 UTC
(In reply to Emmanuel Grumbach from comment #11)
> Created attachment 304950 [details]
> fix
> 
> Hi,
> 
> This is a modified and enhanced version of your patch.
> Does it fix your issues?

Sorry for not getting back to you previously.

I've just tested your patch atop 6.4.11, and it doesn't fix the issue; iperf3 still stalls within about 30 seconds.

This was on an ax211, with the following firmware version:
`[    5.008802] iwlwifi 0000:00:14.3: loaded firmware version 78.3bfdc55f.0 so-a0-gf-a0-78.ucode op_mode iwlmvm`
Comment 13 Emmanuel Grumbach 2023-08-27 18:39:20 UTC
Thanks

It stalls forever?
Or recovers?

Are you positive that the patch you sent back then fixed the issue?
My patch does pretty much the same as you did... I just cleaned up the code.
Comment 14 Sultan Alsawaf 2023-08-27 18:52:58 UTC
Created attachment 304951 [details]
confirmed fix for rx stalls

(In reply to Emmanuel Grumbach from comment #13)
> Thanks
> 
> It stalls forever?
> Or recovers?

It recovers, but takes up to a few minutes to do so. The queues don't appear to be stuck in a bad state like before if I just ^C iperf3; previously, I would observe a cascade of iperf3 stalls after ^C'ing a stalled iperf3 instance, where iperf3 would continue to stall a few more times after that initial stall.

> Are you positive that the patch you sent back then fixed the issue?
> My patch does pretty much the same as you did... I just cleaned up the code.

Yes. The patch I sent back then withstood 2 hours of iperf3 torture and never stalled. I tested it again now without observing any stalls after several minutes.

I've attached the current version of the patch I use, which contains a small clean up you made back in June.
Comment 15 Emmanuel Grumbach 2023-08-27 19:06:54 UTC
All this is really strange because my patch does pretty much what your patch does but I remove tons of code that is no longer needed.
I'll try tomorrow to prepare a series of patches that will be my patch split in several sub-patches. I'd be glad if you'd be able to run tests at each step to see where things go wrong.

My patch was tested internally of course..
Comment 16 Sultan Alsawaf 2023-08-27 19:10:28 UTC
(In reply to Emmanuel Grumbach from comment #15)
> All this is really strange because my patch does pretty much what your patch
> does but I remove tons of code that is no longer needed.
> I'll try tomorrow to prepare a series of patches that will be my patch split
> in several sub-patches. I'd be glad if you'd be able to run tests at each
> step to see where things go wrong.

Sure. It's still quick for me to reproduce (always happens in <1m), so I can bisect quickly.

> My patch was tested internally of course..

Were you able to reproduce the issue?
Comment 17 Emmanuel Grumbach 2023-08-27 19:34:59 UTC
We had a customer who reported a problem with packets getting lost. We have several firmware fixes that address this.

But wait!
You don't have those fixes... And they are critical.

Can you check out our backport tree?
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/log/

I'll send you the most recent version of the firmware with all the fixes, but you'll need our master branch from our internal tree to load the newest firmware.
Comment 18 Sultan Alsawaf 2023-08-27 19:41:12 UTC
(In reply to Emmanuel Grumbach from comment #17)
> We had a customer which reported a problem of packets getting lost. We have
> several fixes for the firmware that address this.
> 
> But wait!
> You don't have those fixes... And they are critical.
> 
> Can you check out our backport tree?
> https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/
> log/
> 
> I'll send you the most recent version of the firmware with all the fixes,
> but you'll need our master branch from our internal tree to load the newest
> firmware.

Ah, yeah Johannes mentioned such to me back in June. When I tested backport-iwlwifi + newer firmware back then, stalls didn't occur but iperf3 still reported out-of-order packet receipt. :)

Feel free to send the firmware to my email or we can talk over IRC if you'd like (my nick is kerneltoast on OFTC and Libera).
Comment 19 Emmanuel Grumbach 2023-08-27 19:52:09 UTC
Created attachment 304952 [details]
core81 firmware

So I had this problem with your original patch + firmware fixes.

So now please take:
backport-iwlwifi master branch + the patch I attached here in comment#11
the firmware I attached here.

I'll also need to attach the PNVM that matches this .ucode file.
Please back up your previous PNVM file so that you can roll back afterwards.
Comment 20 Emmanuel Grumbach 2023-08-27 19:52:28 UTC
Created attachment 304953 [details]
pnvm
Comment 21 Sultan Alsawaf 2023-08-27 20:47:25 UTC
Tested all that with backport-iwlwifi @ 7a0a4e45dd7b1482c9964748d0ffb086552a9d1e and the stalls are indeed fixed. This issue is therefore resolved. Thanks!
