Bug 42828

Summary: rt2800pci unstable - chokes after too much I/O
Product: Drivers Reporter: Avant-texte (avanttexte)
Component: network-wireless Assignee: Stanislaw Gruszka (stf_xl)
Status: CLOSED CODE_FIX    
Severity: normal CC: avanttexte, f.pinamartins, florian, gwingerde, helmut.schaa, IvDoorn, linville, stf_xl, x15
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: >=3.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: Requested debugfs file
dmesg output
rt2x00_revert_to_v3.4.patch
patches.tar.bz2
rt2x00-fix2.diff

Description Avant-texte 2012-02-27 07:22:02 UTC
I have observed this problem on a machine using Ralink's RT2800 802.11n PCI card. When too much data passes over the network, the interface chokes.

"Too much data": I can maintain a mostly stable connection (possibly for a few hours) if I throttle the network traffic to between 7 and 10 kb/s up and down; however, prolonged traffic at higher rates will cause the interface to choke. The higher the rate, the quicker it happens. For example, rates up to and over 50 kb/s are possible, but only for less than a minute. The exact amount of time seems to vary a bit. If the excessive network usage is brief enough, the interface seems to stay functional thereafter, but with a seemingly reduced threshold.

"Choke": Ifconfig and iwconfig will continue to report the interface as being associated to an AP and having an IP address, but any attempt to reach the wider Internet or LAN (even the local gateway) will result in a pause followed by an unknown host error for any application that tries. Ifconfig and iwconfig's output remains constant, as if there's no problem, throughout this whole process.

The reason I think this is a bug in the kernel and/or driver:
I found I can restore the interface's usefulness after choking by reloading the driver module. However, it must be unloaded and reloaded twice; once is not enough. These strange symptoms will continue, for as long as I try, until I perform a second unload/reload cycle. That is to say, I must run:
---[code]---
| # modprobe -r rt2800pci
| # modprobe rt2800pci
| # modprobe -r rt2800pci
| # modprobe rt2800pci
| # dhcpcd wlan0
---[/code]---
before my connection will work again. I can do this an unlimited number of times to keep my connection running, but I do notice it sometimes breaks faster after several unload/reload cycles that happened close together.
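
The workaround above could be wrapped in a small script. This is only a sketch under assumptions: the helper names `run` and `reload_wifi` and the `DRY_RUN` switch are made up for illustration, and the module and interface names are taken from this report.

```shell
#!/bin/sh
# Sketch of the double-reload workaround described in this report.
# MODULE and IFACE default to the names used in the report.
MODULE="${MODULE:-rt2800pci}"
IFACE="${IFACE:-wlan0}"

# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reload_wifi: unload and reload the driver twice, then request a DHCP lease.
reload_wifi() {
    for cycle in 1 2; do
        run modprobe -r "$MODULE" || return 1
        run modprobe "$MODULE" || return 1
    done
    run dhcpcd "$IFACE"
}
```

Setting `DRY_RUN=1` before calling `reload_wifi` prints the five commands without touching the driver; without it the function needs root.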

Additionally, I spot these errors via dmesg:
| phy0 -> rt2800pci_mcu_status: Error - MCU request failed, no response from hardware
| rt2800pci 0000:05:00.0: PCI INT A disabled 
| rt2800pci 0000:05:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
| Registered led device: rt2800pci-phy0::radio
| Registered led device: rt2800pci-phy0::assoc
| Registered led device: rt2800pci-phy0::quality
Comment 1 Helmut Schaa 2012-02-27 15:51:42 UTC
Please enable debugfs in rt2x00 and check whether the tx queues are stuck (or simply attach the contents of /sys/kernel/debug/ieee80211/phy0/rt2800pci/queues/queues or similar).

Thanks.
Comment 2 Avant-texte 2012-02-29 15:06:21 UTC
(In reply to comment #1)
> Please enable debugfs in rt2x00 and check whether the tx queues are stuck (or
> simply attach the contents of
> /sys/kernel/debug/ieee80211/phy0/rt2800pci/queues/queues or similar).
> 
> Thanks.

How do I do that? My understanding is that debugfs is only a utility for ext filesystems. http://linux.die.net/man/8/debugfs

Thank you for your prompt response.
Comment 3 Avant-texte 2012-02-29 15:12:56 UTC
UPDATE:
I have confirmed the problem is with 3.2 (at least up to 3.2.7; I haven't yet compiled 3.2.8). When I put a 3.0.22 kernel on the exact same system with no other changes, the network no longer chokes up. The MCU errors are still present, so that appears to be a separate issue. I did find a patch for that, though.
Comment 4 Avant-texte 2012-03-15 16:00:25 UTC
@Helmut Schaa
Ignore that last question. I have debugfs mounted, but /sys/kernel/debug/ieee80211/phy0/ is empty.

UPDATE:
Kernel versions 3.2.8 and 3.2.9 also still have the problem.
Comment 5 Helmut Schaa 2012-03-16 08:33:01 UTC
Did you compile rt2x00 and mac80211 with debugfs support?
Comment 6 Martin Schmidt 2012-03-21 02:48:39 UTC
For me (EeePC901, rt2860), the issue is the dropping of the connection to the access point, and failure to reconnect. The trigger seems to be multiple connections and/or high traffic, as for the OP.
Comment 7 Avant-texte 2012-03-21 10:04:28 UTC
From Helmut Schaa 2012-03-16 08:33:01
> Did you compile rt2x00 and mac80211 with debugfs support?
Eek, you're right. I must have missed that. Recompiling. Will update asap.

From Martin Schmidt
>For me (EeePC901, rt2860), the issue is the dropping of the connection to the
>access point, and failure to reconnect. The trigger seems to be multiple
>connections and/or high traffic, as for the OP.
When the connection drops, does iw/iwconfig still show the card as associated to the AP?
After the connection drops, does reloading the driver restore your network card's ability to associate, as it does for me?
Wondering if we have the same problem or similar ones.
Comment 8 Martin Schmidt 2012-03-22 01:31:11 UTC
Haven't had the chance to look at iw, but in the logs, it says it disconnects:

[ 1712.500144] ieee80211 phy0: wlan0: No probe response from AP 00:11:22:33:44:55 after 500ms, disconnecting.
[ 1713.030503] cfg80211: Calling CRDA to update world regulatory domain
[ 1713.040135] cfg80211: World regulatory domain updated:
[ 1713.040148] cfg80211:     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 1713.040158] cfg80211:     (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.040167] cfg80211:     (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 1713.040175] cfg80211:     (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[ 1713.040184] cfg80211:     (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.040193] cfg80211:     (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.040245] cfg80211: Calling CRDA for country: US
[ 1713.047158] cfg80211: Regulatory domain changed to country: US
[ 1713.047166] cfg80211:     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[ 1713.047174] cfg80211:     (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2700 mBm)
[ 1713.047181] cfg80211:     (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 1700 mBm)
[ 1713.047188] cfg80211:     (5250000 KHz - 5330000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.047195] cfg80211:     (5490000 KHz - 5600000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.047201] cfg80211:     (5650000 KHz - 5710000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[ 1713.047208] cfg80211:     (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 3000 mBm)
[ 1715.262174] wlan0: authenticate with 00:11:22:33:44:55 (try 1)
[ 1715.460112] wlan0: authenticate with 00:11:22:33:44:55 (try 2)
[ 1715.660079] wlan0: authenticate with 00:11:22:33:44:55 (try 3)
[ 1715.860122] wlan0: authentication with 00:11:22:33:44:55 timed out
[ 1738.190139] phy0 -> rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 0.

The last message is repeated a lot.
The other symptoms are the same: reloading the module makes the device usable for me as well. I have gone back to 3.0.24, and it has the same issue (using Arch Linux linux-lts 3.0.24).
Comment 9 Stanislaw Gruszka 2012-03-22 20:48:08 UTC
Does disabling power save (iwconfig wlan0 power off) help with that problem?
Comment 10 Martin Schmidt 2012-03-24 14:01:26 UTC
Hi.

After some longer testing, it appears I cannot trigger the bug if I turn off the power saving via
> iw dev wlan0 set power_save off
Interesting. I am curious how it got turned on now... :)
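
To see whether power saving is currently on before and after changing it, the state can be read back with `iw`. The sketch below assumes `iw dev wlan0 get power_save` prints a line of the form `Power save: on`; the helper name `power_save_state` is made up for illustration.

```shell
#!/bin/sh
# power_save_state: read `iw dev <iface> get power_save` output on stdin
# and print just the state ("on" or "off").
# Assumes the output line looks like "Power save: on".
power_save_state() {
    sed -n 's/^[[:space:]]*Power save:[[:space:]]*//p'
}

# Typical use on the affected machine (not executed here):
#   iw dev wlan0 get power_save | power_save_state
#   iw dev wlan0 set power_save off
```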
Comment 11 Avant-texte 2012-04-13 01:50:50 UTC
The problem seems to have gone away for me with version 3.3.1.
Comment 12 Avant-texte 2012-04-16 19:18:12 UTC
It seems I spoke too soon. The connection takes longer to drop (it needs a sustained transfer now) and comes back to life after a minute or so.

I will post the debug output momentarily.
Comment 13 Avant-texte 2012-04-16 21:01:38 UTC
(In reply to comment #12)
> It seems I spoke too soon. The connection takes longer to drop (it needs a
> sustained transfer now) and comes back to life after a minute or so.
>
> I will post the debug output momentarily.

It seems I was too optimistic on this too...

(In reply to comment #5)
> Did you compile rt2x00 and mac80211 with debugfs support?

Do you have any suggested patches? I've been banging my head on this for a bit, but I just seem to be spinning my wheels. My Google searches have just turned up some archived threads on various patches relating to debugfs and to rt2x00 drivers.
Comment 14 Avant-texte 2012-04-16 21:04:59 UTC
(In reply to comment #9)
> Does disabling power save (iwconfig wlan0 power off) help with that problem?

I tried. From what I can tell, nothing changes.
Comment 15 Avant-texte 2012-04-20 00:37:23 UTC
Updating listed kernel version to better reflect scope.
Comment 16 Avant-texte 2012-04-20 00:43:08 UTC
> Changes submitted for bug 42828
>     Email sent to:
>         bugsfx@gmail.com, linuxdev@lavabit.com, gwingerde@gmail.com,
>         linville@tuxdriver.com, IvDoorn@gmail.com,
>          stf_xl@wp.pl, helmut.schaa@googlemail.com, x15@gmx.net
>     Excluding:
>          avanttexte@aim.com,
>          drivers_network-wireless@kernel-bugs.osdl.org

If this is assigned to drivers_network-wireless@kernel-bugs.osdl.org, does that mean they're not getting new comments on this bug report?
Comment 17 John W. Linville 2012-04-20 14:15:32 UTC
drivers_network-wireless@kernel-bugs.osdl.org is an alias.  I get those emails, and I suspect that a few other people watch it as well.

Several of the other people in that list of addresses do significant work on rt2x00.  Hopefully one of them will be able to help you more soon.
Comment 18 Stanislaw Gruszka 2012-05-30 12:34:34 UTC
Does this problem still occur on 3.4?
Comment 19 Avant-texte 2012-05-30 21:51:01 UTC
> Does this problem still occur on 3.4?

Yes. I'm at a loss.

Sorry for the slow replies. I gave up on Linux/rt2800pci + my hardware. I can still troubleshoot if anyone has ideas to try.
Comment 20 Stanislaw Gruszka 2012-05-31 13:50:50 UTC
The first step is to provide the requested info. You need to compile the kernel with

CONFIG_MAC80211_DEBUGFS=y
CONFIG_RT2X00_LIB_DEBUGFS=y
CONFIG_RT2X00_DEBUG=y

and, when the problem happens, provide the dmesg output and the contents of the /sys/kernel/debug/ieee80211/phy0/rt2800pci/queue/queue file.

Please also check whether you have updated your AP firmware; if not, update it and check whether that solves the problem.
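
A quick way to confirm that a kernel config has the three options listed above is to grep for them. This is a minimal sketch: the helper name `missing_debug_options` is made up for illustration, and the config path in the usage comment depends on where the kernel was built. The debugfs directory name is assumed to follow the driver module (rt2800pci for the PCI card in this report).

```shell
#!/bin/sh
# The three options requested in this comment.
REQUIRED_OPTIONS="CONFIG_MAC80211_DEBUGFS CONFIG_RT2X00_LIB_DEBUGFS CONFIG_RT2X00_DEBUG"

# missing_debug_options CONFIG_FILE
# Print every required option that is not set to "y" in CONFIG_FILE.
missing_debug_options() {
    config_file="$1"
    for opt in $REQUIRED_OPTIONS; do
        grep -q "^${opt}=y\$" "$config_file" || echo "$opt"
    done
}

# Typical use (paths are assumptions, not executed here):
#   missing_debug_options /usr/src/linux/.config
#   mount -t debugfs none /sys/kernel/debug    # if debugfs is not mounted yet
#   cat /sys/kernel/debug/ieee80211/phy0/rt2800pci/queue/queue
```

No output from `missing_debug_options` means all three options are enabled in that config file.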
Comment 21 Stanislaw Gruszka 2012-07-22 09:13:20 UTC
Closing due to lack of needed information.
Comment 22 Francisco Pina Martins 2012-09-28 22:10:55 UTC
I would like to request the reopening of this bug.
I have the same issue as the OP (connection "chokes" on too much I/O, where too much is >= 100Kb/s for a few seconds). The connection does NOT get dropped, but all throughput stops for (quite!) a few seconds when this happens. It will happen again after I/O is "too high" again.
I do not get the MCU errors mentioned by the OP.
Unloading and reloading "rt2800pci" has no effect.
Issuing "iwconfig wlan0 power off" does not have any effect either.
I have compiled a kernel with the requested options and collected the information that was lacking. I am uploading it to Dropbox for now, since the bug is closed and I cannot upload it here, but I will attach it if the bug gets reopened.
Logs are here:

http://dl.dropbox.com/u/929646/debugfs.queue
http://dl.dropbox.com/u/929646/dmesg

I should also note that I don't see anything happening in dmesg when the problem occurs, but I am admittedly not very experienced with kernel debugging...
Also, this is especially noticeable when I am close to the AP, as my connection speed is supposed to be faster the closer I am. (In fact, this bug causes me to have a faster connection when I am farther away, since lower I/O causes fewer chokes...)

Thank you for considering this.
Comment 23 Francisco Pina Martins 2012-09-28 22:12:29 UTC
Forgot to mention:
uname -a:
Linux Nanolaptop 3.5.4-1-debug #1 SMP PREEMPT Wed Sep 26 10:12:58 WEST 2012 i686 GNU/Linux

It's a stock Arch Linux kernel with the requested changes, which I renamed "-debug".
Comment 24 Francisco Pina Martins 2012-09-28 22:29:19 UTC
Created attachment 81421
Requested debugfs file

The result of:
cat /sys/kernel/debug/ieee80211/phy0/rt2800pci/queue/queue > ~/debugfs.queue
Comment 25 Francisco Pina Martins 2012-09-28 22:33:34 UTC
Created attachment 81431
dmesg output

Results of:
dmesg > ~/dmesg
Comment 26 Francisco Pina Martins 2012-10-05 22:31:59 UTC
I have done some basic regression testing using the Arch Rollback Machine (this netbook takes the best part of the night to compile a kernel, so I tried precompiled ones).
Here is what I have tested:

Good versions:
3.0.43
3.1.6
3.2.1
3.2.8
3.3.4
3.4.4
3.4.9 - Last known good version

Bad Versions:
3.5.0
3.5.3
3.5.4

So it seems the bug was introduced with kernel 3.5.0.
This means this bug is obviously different from the one originally posted, despite showing the same symptoms. Should I open a new bug?

How can I do proper regression testing to find the offending commit with minimal compiling? From 3.4.9 to 3.5.0 there must have been hundreds of commits... Can I restrict them to the possibly offending components, such as rt2800pci only? Are there any good online docs for this?
Thanks!
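
One common way to do this kind of search is `git bisect`, which accepts a path after `--` so that only commits touching that path are considered. The sketch below makes several assumptions: it is run from the top of a clone of Linus' tree, the tags `v3.4` and `v3.5` stand in for the last good and first bad versions, and `drivers/net/wireless/rt2x00` is the driver directory in kernels of that era. The helper names are made up for illustration. A path-limited bisect can only find the culprit if the offending commit touched that path.

```shell
#!/bin/sh
# start_rt2x00_bisect: begin a bisect between an assumed bad and good tag,
# limited to commits that touch the rt2x00 driver directory.
start_rt2x00_bisect() {
    git bisect start v3.5 v3.4 -- drivers/net/wireless/rt2x00
}
# After each build and test, mark the result with
#   git bisect good    or    git bisect bad
# and finish with
#   git bisect reset

# bisect_rounds N
# Rough upper bound on build-and-test rounds for N candidate commits,
# computed as ceil(log2(N)) with integer arithmetic.
bisect_rounds() {
    n="$1"
    rounds=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}
```

For example, `bisect_rounds 600` prints 10, so even several hundred candidate commits need only about ten builds.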
Comment 27 Stanislaw Gruszka 2012-10-08 08:00:19 UTC
Let's track it here since we reopened it already.
Comment 28 Stanislaw Gruszka 2012-10-08 08:43:01 UTC
Created attachment 82651
rt2x00_revert_to_v3.4.patch

This patch reverts the rt2x00 driver on 3.5 to the code from 3.4. It is composed of the following commits:

ed206d0 Revert "rt2x00: increase led's name buffer length"
32b21d9 Revert "rt2x00: configure different txdesc parameters for non HT channel"
66ac3fa Revert "rt2x00: do not generate seqno in h/w if QOS is disabled"
edc1b8a Revert "rt2800: introduce wpdma_disable function"
568995b Revert "rt2800: add disabling of DMA before loading firmware"
3d6d805 Revert "rt2800: initialize queues before giving up due to DMA error"
bd7f268 Revert "rt2800: zero registers of unused TX rings"
7778e3e Revert "wireless: rt2x00: rt{2500,73}usb.c put back duplicate id"
ec1e4e4 Revert "wireless: rt2x00: rt2800pci add more RT539x ids"
2883905 Revert "rt2x00: Don't let mac80211 send a BAR when an AMPDU subframe fails"
41912a7 Revert "wireless: rt2x00: rt2800usb add more devices ids"
7242b7a Revert "wireless: rt2x00: rt2800usb more devices were identified"
ae98c69 Revert "rt2800: debugfs register access: BBP is 256 bytes big"
7416efd Revert "rt2x00: Use GFP_KERNEL for rx buffer allocation on USB devices"
7795745 Revert "rt2800: add chipset revision RT5390R support"
13917e6 Revert "rt2x00: debugfs support - allow a register to be empty"
878840d Revert "rt2x00: Add debugfs access for rfcsr register"
1c6d193 Revert "rt2x00:Add RT539b chipset support"
20ef272e Revert "rt2x00: use atomic variable for seqno"
c362612 Revert "rt2x00usb: fix indexes ordering on RX queue kick"

If this patch helps with the problem, this means the bug was introduced by one of the reverted commits. Otherwise the problem lies in some other subsystem, probably mac80211.
Comment 29 Francisco Pina Martins 2012-10-10 12:41:35 UTC
I am currently abroad and with limited internet access.
But I have downloaded your patch, and will apply it ASAP.
I should be able to let you know how it went in the next few days.
Thank you for your effort in fixing this!
Comment 30 Francisco Pina Martins 2012-10-13 22:19:38 UTC
I have returned home.
Here's what I've got:
The patch resolves the problem on kernel 3.5.
However, during my trip I used a few unencrypted networks where the problem is not present with the stock 3.5 kernel. This makes me think the problem might be related to the encryption.
My home network (where I can reproduce the problem) uses a WPA2 connection with AES encryption.
I hope this helps to narrow down the problem.
Comment 31 Stanislaw Gruszka 2012-10-15 13:07:35 UTC
Created attachment 83511
patches.tar.bz2

I cannot see any obvious encryption-related problem in the rt2x00 3.5 changes.

Could you please narrow this problem to one commit?

I'm attaching 20 individual patches. Please bisect to find the one that fixes the problem. That is, apply the first 10 patches and test. If they fix the problem, check the first 5 patches; if not, check the first 15 patches. And so on until you find the exact patch that fixes the issue.
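
The halving procedure described here can be sketched as a small shell function. It assumes the patches are applied in order as a prefix (the first N of them) and that once the decisive revert is included the problem stays gone for any larger N. The names `bisect_patches` and the test command passed to it are hypothetical; the test command stands for "build with the first N patches and report whether the problem is fixed".

```shell
#!/bin/sh
# bisect_patches TOTAL TEST_CMD
# TEST_CMD is called with a count N and must exit 0 when a kernel built
# with the first N patches applied no longer shows the problem.
# Prints the number of the patch whose addition makes the problem go away.
bisect_patches() {
    total="$1"
    test_cmd="$2"
    lo=0            # known: problem present with no patches applied
    hi="$total"     # known: problem gone with all patches applied
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$test_cmd" "$mid"; then
            hi="$mid"    # fixed with the first $mid patches: look lower
        else
            lo="$mid"    # still broken: the decisive patch comes later
        fi
    done
    echo "$hi"
}
```

With 20 patches the first test uses 10 of them, then 5 or 15, and so on, so at most five builds are needed. In practice the test command would apply patches 1 through N to a clean tree, rebuild the driver, and ask the tester whether the stall still occurs.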
Comment 32 Francisco Pina Martins 2012-10-15 14:20:24 UTC
Will do.
It might take a while due to the time it takes to compile a kernel on my netbook.
I will also look into cross-compiling the kernel on a 64-bit machine (my netbook has an Atom N230, which is 32-bit only) to try to speed things up.
Anyway, I will report back as soon as I know which commit is the culprit.
Comment 33 Stanislaw Gruszka 2012-10-15 14:55:39 UTC
I'm not sure why it takes so long. Are you rebuilding the whole kernel, making a package for it, and then installing the package? That is pretty inefficient; it is better to compile and install a vanilla kernel with "make", "make modules_install" and "make install".

Once you have compiled and installed the kernel, applying an rt2x00 patch causes only the rt2x00 driver to be recompiled, which is quite fast. Then "make modules_install" and "modprobe -r rt2800pci ; modprobe rt2800pci" will allow you to test the patch.
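
The per-patch loop described in this comment could look roughly like the following. It is a sketch under assumptions: it is run as root from the top of an already built kernel tree, the module name is the PCI driver from this report, and the helper names `run` and `retest_patch` and the `DRY_RUN` switch are made up for illustration.

```shell
#!/bin/sh
# Module to reload after the rebuild; rt2800pci is the driver in this report.
MODULE="${MODULE:-rt2800pci}"

# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# retest_patch PATCH_FILE
# Apply one patch, rebuild (only changed files are recompiled),
# install the modules and reload the driver.
retest_patch() {
    patch_file="$1"
    run patch -p1 -i "$patch_file" || return 1
    run make || return 1
    run make modules_install || return 1
    run modprobe -r "$MODULE" || return 1
    run modprobe "$MODULE"
}
```

With `DRY_RUN=1` set, `retest_patch some.patch` only prints the five commands, which is a cheap way to check the sequence before running it for real.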
Comment 34 Francisco Pina Martins 2012-10-17 22:47:11 UTC
Thank you for your tips; they really did speed up the process.
I have found the culprit!

0011-Revert-rt2x00-Don-t-let-mac80211-send-a-BAR-when-an-.patch

Simply reverting this commit will make the problem go away.
Comment 35 Stanislaw Gruszka 2012-10-23 08:30:51 UTC
Created attachment 84421
rt2x00-fix2.diff

This is a fix for this problem.

Discussion about the bug:
http://rt2x00.serialmonkey.com/pipermail/users_rt2x00.serialmonkey.com/2012-October/005349.html
Comment 36 Stanislaw Gruszka 2012-12-15 21:32:08 UTC
The patch was committed to Linus' tree.
Comment 37 Florian Mickler 2012-12-22 09:19:18 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit 5b632fe85ec82e5c43740b52e74c66df50a37db3
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Mon Dec 3 12:56:33 2012 +0100

    mac80211: introduce IEEE80211_HW_TEARDOWN_AGGR_ON_BAR_FAIL
Comment 38 Florian Mickler 2013-01-04 09:24:38 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit ab9d6e4ffe192427ce9e93d4f927b0faaa8a941e
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Mon Dec 3 12:59:04 2012 +0100

    Revert: "rt2x00: Don't let mac80211 send a BAR when an AMPDU subframe fails"
Comment 39 Florian Mickler 2013-03-05 01:01:46 UTC
A patch referencing a commit somehow associated to this bug report has been merged in Linux v3.9-rc1:

commit 8df6b7b11a5e4200484e9356073d288f08bdefb0
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Mon Jan 28 14:42:30 2013 +0100

    mac80211: remove IEEE80211_HW_TEARDOWN_AGGR_ON_BAR_FAIL
Comment 40 Francisco Pina Martins 2013-03-26 09:11:54 UTC
Just tested this with 3.8.4 and I can confirm the "fixed" status!
Thank you everyone who worked hard on this!