Bug 217119

Summary: [Regression]: rt2800usb - Wifi performance issues and connection drops
Product: Networking
Reporter: Thomas Mann (rauchwolke)
Component: Wireless
Assignee: networking_wireless (networking_wireless)
Status: NEW
Severity: normal
CC: alexander, regressions
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 6.2.x
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
dmesg output
Potential Fix
linux 6.2.2 debug output
Debug patch
debug log with debug.patch
Debug Patch v2
dmesg v2 patch log
kernel config
Debug Patch v3
Potential Fix v2
Potential real Fix v1
Debug Patch v3 output
Debug Patch v3
Test patch
debug patch v4 log
Potential real Fix v2
Potential real Fix v3

Description Thomas Mann 2023-03-03 15:12:03 UTC
After updating Linux to 6.2.x, I get connection drops and bandwidth problems.

6.2.1 was completely unusable; 6.2.2 still has bandwidth problems but works a bit better.

The device in use is:

13d3:3273 IMC Networks 802.11 n/g/b Wireless LAN USB Mini-Card

Downgrading the kernel to 6.1.[14,15] fixes the problem: the wifi is stable again and the available bandwidth increases.

dmesg shows no errors.
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-04 05:45:33 UTC
Please attach dmesg [without it most people won't even know which driver is in use for your card]
Comment 2 Thomas Mann 2023-03-04 12:36:45 UTC
The driver in use is rt2800usb.
Comment 3 Thomas Mann 2023-03-04 12:38:01 UTC
Created attachment 303840 [details]
dmesg output
Comment 4 Thomas Mann 2023-03-05 15:23:33 UTC
I bisected and found the commit that introduced the regression:

# first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi: mac80211: Proper mark iTXQs for resumption
Comment 5 alexander 2023-03-05 21:30:49 UTC
Created attachment 303878 [details]
Potential Fix

Can you test whether this patch helps?
It should prevent one racy situation I'm aware of.

If not, we'll have to dig deeper and understand what's going on here.
Comment 6 alexander 2023-03-05 22:07:44 UTC
If it doesn't fix the issue, I would be interested in your iTXQ status.
Enable CONFIG_MAC80211_DEBUGFS, run this command while the connection is bad, and upload the resulting debug.out to Bugzilla:

k=1; while [ $k -lt 10 ]; do \
cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
k=$(($k+1)); done >> debug.out
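
A minimal sketch of the same sampling loop as a reusable function. The function name, the per-sample header, the pause between samples, and the overridable debugfs root are assumptions for illustration and are not part of the request above. Reading debugfs usually requires root and a mounted /sys/kernel/debug.

```shell
# Sketch: sample the per-station iTXQ (aqm) state several times.
# Assumed, not from the thread: function name, header line, sleep interval.
collect_aqm() {
    count=${1:-9}        # number of samples; the loop above takes 9
    interval=${2:-1}     # seconds to wait between samples (assumption)
    root=${3:-/sys/kernel/debug/ieee80211}   # debugfs root, overridable
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "=== sample $((i + 1)) at $(date +%T) ==="
        # The glob is left unquoted on purpose so it matches every station.
        cat "$root"/phy?/netdev:*/stations/*/aqm
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running "collect_aqm 9 1 >> debug.out" while the connection is bad would produce a file comparable to the one requested, with the samples spread over a few seconds.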
Comment 7 Thomas Mann 2023-03-05 22:57:01 UTC
(In reply to alexander from comment #5)
> Created attachment 303878 [details]
> Potential Fix
> 
> Can you test if this patch helps?
> It should prevent one racy situation I'm aware about.
> 
> If not we'll have to dig deeper and understands, what's going on here.

i applied the patch on linux-6.2.2: it didn't fix the problem
Comment 8 Thomas Mann 2023-03-05 22:57:49 UTC
Created attachment 303879 [details]
linux 6.2.2 debug output
Comment 9 alexander 2023-03-06 18:27:48 UTC
Created attachment 303883 [details]
Debug patch

The debug output confirms the suspicion that an iTXQ is Dirty and somehow missed its wake call:
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
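
A hedged sketch for scanning a collected debug.out for lines like the one above. It assumes the column order of the quoted header (tid, ac, backlog-bytes, backlog-packets, ..., flags); the function name is hypothetical. Treating "DIRTY with a non-empty backlog" as suspicious is an interpretation of the line above, not a documented rule, and a single hit does not prove a stall.

```shell
# Sketch: list TIDs flagged DIRTY that still hold queued packets.
# Assumption: field 1 = tid, 2 = ac, 3 = backlog-bytes, 4 = backlog-packets,
# as in the header quoted above. Reads aqm output on stdin.
find_dirty_txqs() {
    awk '$1 ~ /^[0-9]+$/ && /DIRTY/ && $4 > 0 {
        printf "tid %s ac %s: %s packets (%s bytes) queued, DIRTY\n", $1, $2, $4, $3
    }'
}
```

Usage would be "find_dirty_txqs < debug.out"; a TID that appears in every sample is the case worth a closer look.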

Can you apply this patch, so we get some more insights? (Use a clean 6.2 kernel)

Reproduce the issue and then upload dmesg again.

Can you also describe the behavior in more detail?
Does it, for example, fail from the start and then work in short bursts?
Maybe also share the ping output to your gateway.

That said, chances are that this is related to power save. Can you first run
  iw dev <wlan device> get power_save
and check the output?

I suspect it will be "Power save: on".

If it is indeed on, check whether
"iw dev <wlan device> set power_save off" mitigates the issue.
Comment 10 Thomas Mann 2023-03-06 18:42:45 UTC
Created attachment 303884 [details]
debug log with debug.patch


> Can you also describe the behavior with more details?
> Is it e.g. not working from start but then works in short burst or so?

Low-bandwidth traffic works better, but the main problem/test case is starting an 8-16 Mbit/s video stream, which sometimes runs for a few seconds and then stops, or doesn't start at all.

it seems powersave is off:

iw dev wlan0 get power_save
Power save: off
Comment 11 alexander 2023-03-06 22:25:11 UTC
Created attachment 303887 [details]
Debug Patch v2

It looks like the driver tells mac80211 to stop TX and never resumes it.
I've attached an updated, horribly verbose debug patch to hopefully catch a hint of what's going on here... The output will look scary and get really long due to the WARN_ON(1) I added to the driver. Please upload the output again.

But I also found a quite similar card to run some tests myself:
[17958.839634] usb 1-1.5: reset high-speed USB device number 3 using ehci-pci
[17959.000478] ieee80211 phy3: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
[17959.055255] ieee80211 phy3: rt2x00_set_rf: Info - RF chipset 0005 detected
[17959.055884] ieee80211 phy3: Selected rate control algorithm 'minstrel_ht'
[17959.056781] usbcore: registered new interface driver rt2800usb
[17959.061576] rt2800usb 1-1.5:1.0 wlp0s29u1u5: renamed from wlan0

The only difference seems to be that my card is rev 0200 instead of 0201.
It works quite fine for me with Linux 6.2.0 on a USB 2.0 port. (On a USB 3.0 port it fails with some USB error.)

Can you attach your kernel config, so that I can try it with a kernel close to yours? It would be much simpler to debug this if I could reproduce the problem.

If it's OK with you, I would also switch to communicating on the wireless mailing list. Maybe someone else on the list sees something I miss.
Comment 12 Thomas Mann 2023-03-06 23:14:53 UTC
Created attachment 303888 [details]
dmesg v2 patch log

The card is a half-size minipci card that exposes the wifi device as a USB device; at least it uses the USB driver.

what's the address of the kernel mailing list?
Comment 13 Thomas Mann 2023-03-06 23:15:33 UTC
Created attachment 303889 [details]
kernel config
Comment 14 alexander 2023-03-08 19:50:05 UTC
Created attachment 303904 [details]
Debug Patch v3

Here is the next debug patch. Please apply it to a clean 6.2 kernel and
reproduce the issue. Make sure to capture the full output, from connecting to
the WLAN until the connection stalls.

Running "dmesg -w > deboug.out" prior to connecting should do the trick.

I've also made one change which may fix the issue for you (the very first chunk of the patch). If that works, it considerably narrows down what's wrong. If not, the additional output will hopefully tell us more...
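
A small sketch of one way to keep a log follower running in the background while reproducing the stall, then stop it cleanly. The two function names and the ".pid" file convention are assumptions made for this example; only the "dmesg -w" command itself comes from the comment above.

```shell
# Sketch: run a log-following command in the background and stop it later.
# Assumed helpers (hypothetical names), not part of the thread.
start_log_capture() {
    out=$1; shift
    "$@" > "$out" 2>&1 &          # e.g. "dmesg -w", given as arguments
    echo $! > "$out.pid"          # remember the follower's PID
}

stop_log_capture() {
    out=$1
    kill "$(cat "$out.pid")" 2>/dev/null
    rm -f "$out.pid"
}

# Intended use (not executed here):
#   start_log_capture deboug.out dmesg -w
#   ... connect to the WLAN and reproduce the stall ...
#   stop_log_capture deboug.out
```

Starting the capture before connecting means the log covers the whole sequence from association to the stall.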
Comment 15 alexander 2023-03-08 20:55:48 UTC
Created attachment 303905 [details]
Potential Fix v2

This is basically a revert - or as close as we can get to one - of the commit you identified as culprit.

Does this still fix the issue?
Comment 16 alexander 2023-03-09 14:07:19 UTC
Created attachment 303908 [details]
Potential real Fix v1

I'm not familiar with rt2800usb, but the driver indeed seems to have a path to a deadlock when the Tx queues are full, which makes it a good candidate for our issue.
I can't trigger the queue-full condition with my card/setup, which also looks promising.

So here is a first draft of what could be the real fix for the issue.
If it doesn't work, please also try the other ones.

If it works, I would still like to see the output from "Debug Patch v3".
Comment 17 Thomas Mann 2023-03-09 17:27:24 UTC
Created attachment 303911 [details]
Debug Patch v3 output

none of the patches fix the bug.

But commit e66b7920aa5ac5b1a1997a454004ba9246a3c005 (the commit before 4444bc2116aecdcde87dce80373540adc8bd478b) works without a problem.
Comment 18 alexander 2023-03-10 20:37:30 UTC
Created attachment 303925 [details]
Debug Patch v3

Here is a patch which should prevent the "overlapping" TX operations.
I don't see how they can cause this error, but without the commit you identified as the culprit these overlapping TX operations should not happen.

Record the output again with "dmesg -w > deboug2.out", so I can verify the patch works as intended.
Comment 19 alexander 2023-03-10 20:50:48 UTC
Created attachment 303926 [details]
Test patch

I strongly suspect the patch here fixes the issue for you. It will break other drivers and can't be the real fix. But if it works we have at least a simple workaround for now...
Comment 20 Thomas Mann 2023-03-10 22:11:51 UTC
Created attachment 303927 [details]
debug patch v4 log

This workaround works: https://bugzilla.kernel.org/attachment.cgi?id=303926

I also double-checked: I can't reproduce the bug with this debug patch either (the workaround patch wasn't applied) - https://bugzilla.kernel.org/attachment.cgi?id=303925
Comment 21 alexander 2023-03-11 20:28:33 UTC
Created attachment 303932 [details]
Potential real Fix v2

Can you check whether this patch also fixes the issue?

This is one potential fix I would like to propose and discuss on the mailing list.
This version prints "XXXX drv_wake_tx_queue: SERIALIZATION" when mitigating the situation which seems to cause your problem.
Comment 22 Thomas Mann 2023-03-12 10:43:31 UTC
(In reply to alexander from comment #21)
> Created attachment 303932 [details]
> Potential real Fix v2
> 
> Can you check, if this patch also fixes the issue?
> 
> This is one potential fix I would like to propose and discuss on the mailing
> list. 
> This version here prints out "XXXX drv_wake_tx_queue: SERIALIZATION" when
> mitigation the situation which seems to cause your problem.

this patch fixes the bug, thanks
Comment 23 alexander 2023-03-12 18:00:53 UTC
Created attachment 303933 [details]
Potential real Fix v3

Thanks again for testing...

I simplified it and fixed a potential race. I can't promise this is the final version for official review, and the review itself may also request changes...
(I plan to do some unit testing tomorrow and, if I still like it, will post it for review.)

A big thank you for testing all my patches; that accelerated the bug hunting quite a bit...
Comment 24 Thomas Mann 2023-03-12 20:14:01 UTC
Potential real Fix v3 - works too