Bug 217119
Summary: | [Regression]: rt2800usb - Wifi performance issues and connection drops | ||
---|---|---|---|
Product: | Networking | Reporter: | Thomas Mann (rauchwolke) |
Component: | Wireless | Assignee: | networking_wireless (networking_wireless) |
Status: | NEW --- | ||
Severity: | normal | CC: | alexander, regressions |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 6.2.x | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg output
Potential Fix
linux 6.2.2 debug output
Debug patch
debug log with debug.patch
Debug Patch v2
dmesg v2 patch log
kernel config
Debug Patch v3
Potential Fix v2
Potential real Fix v1
Debug Patch v3 output
Debug Patch v3
Test patch
debug patch v4 log
Potential real Fix v2
Potential real Fix v3 |
Description
Thomas Mann
2023-03-03 15:12:03 UTC
Please attach dmesg [without it most people won't even know which driver is in use for your card]
The driver in use is rt2800usb.
Created attachment 303840 [details]
dmesg output
i bisected and found the commit that introduced the regression:
# first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi: mac80211: Proper mark iTXQs for resumption
Created attachment 303878 [details]
Potential Fix
Can you test if this patch helps?
It should prevent one racy situation I'm aware of.
If not, we'll have to dig deeper and understand what's going on here.
If it's not fixing the issue I would be interested in the output of your iTXQ status. Enable CONFIG_MAC80211_DEBUGFS, run this command while the connection is bad, and upload the resulting debug.out to bugzilla:
k=1; while [ $k -lt 10 ]; do \
  cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
  k=$(($k+1)); done >> debug.out
(In reply to alexander from comment #5)
> Created attachment 303878 [details]
> Potential Fix
>
> Can you test if this patch helps?
> It should prevent one racy situation I'm aware of.
>
> If not, we'll have to dig deeper and understand what's going on here.
i applied the patch on linux-6.2.2: it didn't fix the problem
Created attachment 303879 [details]
linux 6.2.2 debug output
Created attachment 303883 [details]
Debug patch
The debug output confirms the suspicion that an iTXQ is Dirty and somehow missed its wake call:
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
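For readers who want to check their own debug.out for the same pattern, here is a minimal sketch that parses one aqm row and decodes the flags field. The bit positions are an assumption inferred only from the sample line above, where 0xe is printed as "RUN AMPDU NO-AMSDU DIRTY"; the helper names (`decode_flags`, `parse_aqm_row`, `is_suspicious`) are made up for illustration and are not part of mac80211.

```python
# Column names follow the header line of the aqm debugfs output quoted above.
COLUMNS = [
    "tid", "ac", "backlog_bytes", "backlog_packets", "new_flows", "drops",
    "marks", "overlimit", "collisions", "tx_bytes", "tx_packets",
]

# ASSUMPTION: bit layout inferred from the sample (0xe -> RUN AMPDU NO-AMSDU DIRTY).
# Bit 0 is taken to mean "stopped"; when it is clear the output shows RUN.
FLAG_BITS = {1: "AMPDU", 2: "NO-AMSDU", 3: "DIRTY"}


def decode_flags(value):
    """Turn the numeric flags value into a list of flag names."""
    names = ["STOP" if value & 1 else "RUN"]
    for bit, name in FLAG_BITS.items():
        if value & (1 << bit):
            names.append(name)
    return names


def parse_aqm_row(line):
    """Parse one data row of the aqm output into a dict."""
    fields = line.split()
    row = dict(zip(COLUMNS, (int(f) for f in fields[:len(COLUMNS)])))
    flags_hex = fields[len(COLUMNS)].split("(", 1)[0]  # "0xe(RUN" -> "0xe"
    row["flags"] = decode_flags(int(flags_hex, 16))
    return row


def is_suspicious(row):
    """Heuristic only: packets are queued and the queue is marked DIRTY."""
    return row["backlog_packets"] > 0 and "DIRTY" in row["flags"]


sample = "0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)"
row = parse_aqm_row(sample)
print(row["flags"], is_suspicious(row))
```

A large backlog together with DIRTY in several consecutive samples is what points at a missed wake call; a single sample on its own proves little.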
Can you apply this patch, so we get some more insights? (Use a clean 6.2 kernel)
Reproduce the issue and then upload dmesg again.
Can you also describe the behavior in more detail?
For example, does it not work from the start but then work in short bursts?
Maybe also share the ping output to your GW.
That said, chances are that this is related to power save. Can you first run
iw dev <wlan device> get power_save
and check the output?
I suspect it will be "Power save: on".
If it is indeed on, check whether
"iw dev <wlan device> set power_save off" mitigates the issue.
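If the power save check needs to be repeated across several test runs, it can be scripted. The sketch below assumes only the output format quoted in this report ("Power save: on" or "Power save: off"); the function names are hypothetical, and `power_save_enabled` requires the `iw` tool to be installed.

```python
import subprocess


def parse_power_save(output):
    """Return True for 'Power save: on' and False for 'Power save: off'."""
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "power save":
            return value.strip().lower() == "on"
    raise ValueError("no 'Power save' line found")


def power_save_enabled(dev):
    """Run the command from the comment above for the given wlan device."""
    result = subprocess.run(
        ["iw", "dev", dev, "get", "power_save"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_save(result.stdout)


# Parsing only; this line does not touch any network device.
print(parse_power_save("Power save: off"))
```

Calling `power_save_enabled("wlan0")` before each reproduction attempt would record whether power save was active at the time of the stall.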
Created attachment 303884 [details]
debug log with debug.patch
> Can you also describe the behavior in more detail?
> For example, does it not work from the start but then work in short bursts?
low bandwidth stuff works better, but the main problem/test case is starting an 8-16 Mbit/s video stream, which sometimes runs for a few seconds and then stops, or doesn't start at all
it seems powersave is off:
iw dev wlan0 get power_save
Power save: off
Created attachment 303887 [details]
Debug Patch v2
It looks like the driver tells mac80211 to stop TX and never resumes it.
I've attached a horribly verbose updated debug patch to hopefully catch a hint of what's going on here... The output will look scary and get really long due to the WARN_ON(1) I added to the driver. Please upload the output again.
But I also found a quite similar card to run some tests myself:
[17958.839634] usb 1-1.5: reset high-speed USB device number 3 using ehci-pci
[17959.000478] ieee80211 phy3: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
[17959.055255] ieee80211 phy3: rt2x00_set_rf: Info - RF chipset 0005 detected
[17959.055884] ieee80211 phy3: Selected rate control algorithm 'minstrel_ht'
[17959.056781] usbcore: registered new interface driver rt2800usb
[17959.061576] rt2800usb 1-1.5:1.0 wlp0s29u1u5: renamed from wlan0
The only difference seems to be that my card is rev 0200 instead of 0201.
And that's working quite fine for me when using linux 6.2.0 and a USBv2 port. (USBv3 is failing with some USB error.)
Can you attach your kernel config, so that I can try it with a kernel close to yours? Would be much simpler to debug that when I can reproduce the problem.
If it's ok for you I would also switch to communicating on the wireless mailing list. Maybe someone else on the list sees something I miss.
Created attachment 303888 [details]
dmesg v2 patch log
the card is a half-size mini PCI card that exposes the wifi chip as a usb device; at least it uses the usb driver.
what's the address of the kernel mailing list?
Created attachment 303889 [details]
kernel config
Created attachment 303904 [details]
Debug Patch v3
Here is the next debug patch. Please apply it to a clean 6.2 kernel and
reproduce the issue. Make sure to catch the full output, from connecting to
the WLAN until the connection stalls.
Running "dmesg -w > deboug.out" prior to connecting should do the trick.
I've also made one change which may fix the issue for you (the very first chunk of the patch). If that works, it considerably narrows down what's wrong. If not, the additional output hopefully tells us more...
Created attachment 303905 [details]
Potential Fix v2
This is basically a revert - or as close as we can get to one - of the commit you identified as culprit.
Does this still fix the issue?
Created attachment 303908 [details]
Potential real Fix v1
Now I'm not familiar with the rt2800usb but the driver indeed seems to have a path to deadlock when the Tx queues are full. Which makes it a good candidate for our issue.
Can't trigger the queue full with my card/setup. Which also looks promising.
So here is a first draft of what could be the real fix for the issue.
If that's not working, please also try the other ones.
If it's working I still would like to see the output from "Debug Patch v3".
Created attachment 303911 [details]
Debug Patch v3 output
none of the patches fix the bug.
But commit e66b7920aa5ac5b1a1997a454004ba9246a3c005 (the commit before 4444bc2116aecdcde87dce80373540adc8bd478b) works without a problem.
Created attachment 303925 [details]
Debug Patch v3
Here is a patch which should prevent the "overlapping" TX operations.
I don't see how that can cause this error, but without the patch you identified as the culprit these overlapping TX operations should not happen.
Record the output again with "dmesg -w > deboug2.out", so I can verify the patch works as intended.
Created attachment 303926 [details]
Test patch
I strongly suspect the patch here fixes the issue for you. It will break other drivers and can't be the real fix. But if it works we have at least a simple workaround for now...
Created attachment 303927 [details]
debug patch v4 log
This workaround works: https://bugzilla.kernel.org/attachment.cgi?id=303926
I also double checked, but i can't reproduce the bug with this debug patch (the workaround patch wasn't applied) - https://bugzilla.kernel.org/attachment.cgi?id=303925
Created attachment 303932 [details]
Potential real Fix v2
Can you check if this patch also fixes the issue?
This is one potential fix I would like to propose and discuss on the mailing list.
This version prints out "XXXX drv_wake_tx_queue: SERIALIZATION" when mitigating the situation which seems to cause your problem.
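To illustrate what serializing the wake path means, here is a user-space sketch in Python. It is not the attached patch and does not use any kernel API: the class `TxPath`, its methods, and the lock placement are all hypothetical, chosen only to show how one lock around the handler keeps concurrent callers from overlapping.

```python
import threading
import time


class TxPath:
    """Toy model of a TX wake handler that several contexts may call at once."""

    def __init__(self, serialize):
        # Hypothetical: one lock around the whole handler when serialize=True.
        self._handler_lock = threading.Lock() if serialize else None
        self._stats_lock = threading.Lock()
        self.active = 0       # callers currently inside the handler body
        self.max_active = 0   # highest overlap observed

    def _push_frames(self):
        with self._stats_lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)      # stand-in for handing frames to the driver
        with self._stats_lock:
            self.active -= 1

    def wake_tx_queue(self):
        if self._handler_lock is None:
            self._push_frames()   # unserialized: callers may overlap
            return
        with self._handler_lock:  # serialized: one caller at a time
            self._push_frames()


def run(serialize, callers=4):
    """Start several concurrent callers and report the highest overlap seen."""
    path = TxPath(serialize)
    threads = [threading.Thread(target=path.wake_tx_queue) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return path.max_active


print("serialized max overlap:", run(serialize=True))
print("unserialized max overlap:", run(serialize=False))
```

With `serialize=True` the reported overlap is always 1; without it the value depends on thread scheduling and is usually higher.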
(In reply to alexander from comment #21)
> Created attachment 303932 [details]
> Potential real Fix v2
>
> Can you check if this patch also fixes the issue?
>
> This is one potential fix I would like to propose and discuss on the mailing
> list.
> This version prints out "XXXX drv_wake_tx_queue: SERIALIZATION" when
> mitigating the situation which seems to cause your problem.
this patch fixes the bug, thanks
Created attachment 303933 [details]
Potential real Fix v3
Thanks again for testing...
I simplified it and fixed a potential race. I can't promise this is the final version for official review, which itself may also request changes here...
(I plan to do some unit testing tomorrow, and if I still like it I will post it for review.)
A big thank you for testing all my patches; that accelerated the bug hunting quite a bit...
Potential real Fix v3 - works too |