Bug 82751

Summary: ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Product: Drivers Reporter: qianguozheng (guozhengqian0825)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: blocking CC: alfonsogs60, cjcr.soft, corsaroangelo, daniel.santos, hacks+kernel, hentaiwushuang, jisakiel, linville, mmyz1234, nok.raven, roobsi93, russianneuromancer, sani, soprwa, stf_xl, szg00000
Priority: P1    
Hardware: Mips64   
OS: Linux   
Kernel Version: 3.10.49 Subsystem:
Regression: No Bisected commit-id:
Attachments: rt2800_flush_tx_timeouts.patch
rt2x00_queue_threshold.patch
dmesg output with debug patch applied
dmesg output with debug patch - phicomm psg1218 rev.a
dmesg output with latest patches - phicomm psg1218 rev.a
Patch to replace printk with netlink accounting

Description qianguozheng 2014-08-19 07:08:08 UTC
Using OpenWrt svn revision 41808 on an MT7620N: when the messages below occur, we can no longer connect to the AP.

----
[ 3702.380000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.390000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.400000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.410000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.420000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.440000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.450000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.460000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.470000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.480000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.120000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.130000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.140000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.150000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.160000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.680000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.690000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.700000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.710000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.720000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Comment 1 Lanek 2016-02-09 15:31:30 UTC
Still present; known occurrences with kernel 4.1.10 and later in current OpenWRT trunk builds.
This has been observed for a couple of years now (as seen at OpenWRT); I was advised to report it upstream.
Bug ticket https://dev.openwrt.org/ticket/12313 (this contains various additional information, like kernel traces, in chronological order)

Wireless connection dies after some time with this bug. In any case, transmitting a lot of data makes it stop pretty quickly (the connection is still displayed, but even already connected devices will not be able to transmit any data).
Comment 2 Angelo Corsaro 2016-06-07 08:22:51 UTC
Hi all,
I have the same problem in OpenWRT, but with a different kernel:
Linux ARV7510PW22 3.18.29 #15 Fri Jun 3 10:40:06 CEST 2016 mips GNU/Linux

the output from dmesg is the same :

[...]
[36356.952000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.956000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.968000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.976000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.984000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.992000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[...]
Comment 3 Stanislaw Gruszka 2017-02-18 13:57:19 UTC
This is an issue with the MT7620 OpenWRT patch, which is not (yet) in the upstream kernel.

Daniel Golle is working to improve the patch. You can support him here:

https://www.kickstarter.com/projects/1327597961/better-support-for-mt7620a-n-in-openwrt-lede
Comment 4 sani 2018-07-03 15:37:01 UTC
The latest OpenWrt build from today still has the same issue.
I tested 2 routers:
zbt8305 and PSG1218 (phicomm k2)

ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2

If I run a speed test, I can crash the router within 5 minutes at most on the 2.4 GHz band.
Strange that this bug is over 4 years old and still not resolved.
Comment 5 sani 2018-07-03 15:38:25 UTC
(In reply to sani from comment #4)
> Latest openwrt from today still same issue.
> I tested 2 routers.
> zbt8305 and PSG1218 (phicomm k2)
> 
> ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to
> full tx queue 2
> 
> If i run speed test i can crash the router in 5 minutes maximum at 2.4ghz
> band.
> Strange that this bug is from over 4 years and its still not resolved.

Some cool guru marked this as obsolete and closed it :)
It still crashes with the latest OpenWrt and kernel.
Comment 6 Stanislaw Gruszka 2018-07-04 15:26:47 UTC
FWIW, I have a Nexx WT3020H and I am not able to reproduce this problem. Perhaps the problem is specific to a particular hardware or software configuration.

Do you have a version which includes this change:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f4a639a3d7d40b4f63c431c2d554c479fbcc6b74
Comment 7 sani 2018-07-04 15:45:10 UTC
I have tested 3 different routers using the latest OpenWrt.
After compiling and flashing, I set the channel to 5, changed the SSID, set a WPA2 password and ran speedtest.net.
Routers I tested:
tp-link 840 version 4/5
ZBT-WR8305RT
Phicomm K2 PSG1218 (here the 5 GHz part is fine)
All these routers use the same Ralink chip, which crashes very quickly under heavy traffic.
All these routers work just fine when tested with PandoraBox and other binary drivers.
This is a driver or kernel problem for sure.
If you are interested in fixing the problem, I can set up remote access to these routers, and also a buildroot or anything else you need to compile.
Comment 8 Stanislaw Gruszka 2018-07-05 09:01:55 UTC
I'm interested in solving the problem, but not so eager to do remote debugging. I might have some patches to test, though.

Since it is reproducible on all your routers, this looks more like a config option, or perhaps the problem happens with the particular station devices that you have.

I just changed to channel 5 on my router and ran speedtest.net: no errors. However, I'm using this tree with the commit mentioned earlier:
https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary
Comment 9 sani 2018-07-05 09:24:06 UTC
Hello,
I tested already this git https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary

No effect. It still crashes very easily.
I have tested maybe 10 different versions and all of them crash; only the binary drivers are OK :(
If you have patches from after 14 March 2018, I can test them.
Comment 10 sani 2018-07-05 09:24:58 UTC
I forgot: Gargoyle crashes too.
Comment 11 Stanislaw Gruszka 2018-07-09 14:15:52 UTC
You can check the last 3 patches from
https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft


I didn't test them on OpenWrt, but they should be applicable and should not break things there, but who knows ;-)
Comment 13 T-Bone 2018-08-06 09:00:09 UTC
I am affected by this bug (on a Netgear WN3000RPv3 running OpenWRT 18.06), and I’ve tested your patches:

I have applied the 3 patches on top of OpenWRT 18.06 « backport-2017-11-01 »: it needed a bit of massaging as the rt2800mmio_interrupt() hunk in rt2x00mmio.c failed to apply so I manually edited the file to match your patch; and I tried to stress test the result for about 3 hours, exchanging ~20GB of data over wireless.

The good news is: I couldn’t crash the wireless connection (usually it would collapse in a matter of minutes as soon as a few Mbps of traffic was happening).
The bad news is: the error message is still randomly printed, along with one I don’t remember seeing before:

[ 1517.748008] ieee80211 phy0: rt2800mmio_txstatus_is_spurious: Warning - 4 spurious TX_FIFO_STATUS interrupt(s)

But the connection did survive these messages so far (2 days uptime now), which is an improvement over the previous situation :)

HTH
Comment 14 Stanislaw Gruszka 2018-08-06 11:52:10 UTC
Removing the patch
package/kernel/mac80211/patches/600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch
should make the 3 patches apply cleanly and should make the message "Warning - 4 spurious TX_FIFO_STATUS interrupt(s)" go away.

I'll prepare another patch which should help with interrupt stability and perhaps
get rid of this message:
ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Comment 15 Stanislaw Gruszka 2018-08-06 12:08:55 UTC
(In reply to Stanislaw Gruszka from comment #14)
> I'll prepare another patch which should help with interrupts stability and

Actually that change is already incorporated in the patch
"rt2800mmio: use txdone/txstatus routines from lib"

So perhaps it was not applied due to the conflict in rt2800mmio.c

Please remove 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch, then apply the 3 experimental patches and retest. If the messages still appear, please attach the full dmesg output as a txt file.
Comment 16 T-Bone 2018-08-06 19:10:04 UTC
(In reply to Stanislaw Gruszka from comment #15)
> (In reply to Stanislaw Gruszka from comment #14)
> > I'll prepare another patch which should help with interrupts stability and
> 
> Actually that change is already incorporated in patch  
> "rt2800mmio: use txdone/txstatus rutines from lib" 
> 
> So perhaps it was not applied due to conflict in rt2800mmio.c
> 
> Please remove
> 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch, then
> apply 3 experimental patches and retest. If messages still happen please
> attach full dmesg output as txt file.

I have done just that. All 3 patches now apply cleanly (only a few lines offset).

I am happy to report that after stressing the wireless link for a couple of hours and exchanging ~12GB, I wasn't able to trigger either message (dmesg remains clean), and the link stayed perfectly stable!

HTH
Comment 17 Stanislaw Gruszka 2018-08-07 08:29:31 UTC
Nice! I need to do a bit more work here, as the patches break the USB version of the driver (and I need to add some flush fixes). Hopefully I'll do this soon and provide patches to test.
Comment 18 sani 2018-08-09 07:45:36 UTC
I would like to test too, but I am having problems applying these patches.
Which tree do you use?
If these patches work, why can't you just commit them upstream?
Comment 19 sani 2018-08-09 09:03:44 UTC
OK, I manually applied the 3 patches to dangole's git. I also removed 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch.
So far the router does not crash, but the link is very unstable. I tested with an iPhone 7, an Android phone and a MacBook. Very often a device cannot connect or asks for the password.
There is much work to be done. With a LAN cable connected to the router it is fine.
The wifi driver is still unusable; you cannot count on it. I tested all possibilities with 20/40 MHz and all channels.
Comment 20 T-Bone 2018-08-09 11:50:05 UTC
(In reply to sani from comment #19)
> Ok i manually applied 3 patches to dangole git. Also i erased

I did not use dangole's git. I used a clean checkout of OpenWRT 18.06.

> So far router does not crash, but link is very unstable.

There might be something in dangole's git that causes these symptoms, or that prevented the patches from applying cleanly. Have you run make V=s to make sure that the patches are correctly applied?
Comment 21 Stanislaw Gruszka 2018-08-09 13:00:30 UTC
(In reply to sani from comment #18)
> If these patches works why can't you just commit global ?

As I wrote before they break USB rt2800usb driver.
Comment 22 Stanislaw Gruszka 2018-08-09 13:04:07 UTC
Created attachment 277789 [details]
rt2800_flush_tx_timeouts.patch

Here is an additional patch to test; it should be applied together with the previous patches. It might help with link stability.
Comment 23 sani 2018-08-09 13:12:07 UTC
Thanks, I will test that too. I do not need USB at all; I even disable it in the kernel.
Comment 24 Angelo Corsaro 2018-08-10 06:35:04 UTC
Hi all,
it may be a stupid question, but I cannot figure out which kernel version or OpenWRT version to use to test the patches. Can someone give me the version so that I can test the patches on an Astoria?

Cheers,
Angelo
Comment 25 sani 2018-08-10 13:06:53 UTC
I am applying the 4th patch for the stability/link issue.
It is failing here:

Hunk #1 FAILED at 738.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c.rej
Patch failed!  Please fix ./patches/600-27-patch4.patch!

--- drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
+++ drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
@@ -738,8 +738,12 @@ void rt2x00mac_flush(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
        if (!test_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags))
                return;

+       set_bit(DEVICE_STATE_FLUSHING, &rt2x00dev->flags);
+
        tx_queue_for_each(rt2x00dev, queue)
                rt2x00queue_flush_queue(queue, drop);
+
+       clear_bit(DEVICE_STATE_FLUSHING, &rt2x00dev->flags);
 }
 EXPORT_SYMBOL_GPL(rt2x00mac_flush);

I am using latest openwrt. Will try manually applying the patch and test again.
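
For readers following the hunk above: it sets a DEVICE_STATE_FLUSHING bit around the flush loop, and the patch names in this thread ("rt2800_flush_tx_timeouts.patch", "use different txstatus timeouts when flushing") suggest the bit is used to pick a different TX-status timeout while a flush is in progress. The following is only a minimal user-space sketch of that idea, not the driver code. All `sketch_*` names and both timeout values are hypothetical and chosen purely for illustration.

```c
#include <stdbool.h>

/* Hypothetical state bits, loosely mirroring DEVICE_STATE_* in the hunk above. */
enum sketch_state_bit {
	SKETCH_STATE_PRESENT = 0,
	SKETCH_STATE_FLUSHING = 1
};

/* Hypothetical stand-in for the device structure; only the flags word is modelled. */
struct sketch_dev {
	unsigned long flags;
};

/* Plain (non-atomic) helpers; the kernel's set_bit()/clear_bit() are atomic. */
static void sketch_set_bit(int bit, unsigned long *flags)
{
	*flags |= 1UL << bit;
}

static void sketch_clear_bit(int bit, unsigned long *flags)
{
	*flags &= ~(1UL << bit);
}

static bool sketch_test_bit(int bit, const unsigned long *flags)
{
	return ((*flags >> bit) & 1UL) != 0;
}

/* Assumed values: the real timeouts are not stated anywhere in this thread. */
#define SKETCH_TIMEOUT_NORMAL_MS 2000u
#define SKETCH_TIMEOUT_FLUSH_MS  50u

/* Pick a shorter TX-status timeout while a flush is in progress. */
static unsigned int sketch_txstatus_timeout_ms(const struct sketch_dev *dev)
{
	if (sketch_test_bit(SKETCH_STATE_FLUSHING, &dev->flags))
		return SKETCH_TIMEOUT_FLUSH_MS;
	return SKETCH_TIMEOUT_NORMAL_MS;
}

/*
 * Same shape as the patched flush function: bail out if the device is
 * absent, mark the flush, do the work, clear the mark. Returns the timeout
 * that was in effect during the flush (0 if the device is absent).
 */
static unsigned int sketch_flush(struct sketch_dev *dev)
{
	unsigned int timeout_used;

	if (!sketch_test_bit(SKETCH_STATE_PRESENT, &dev->flags))
		return 0;

	sketch_set_bit(SKETCH_STATE_FLUSHING, &dev->flags);
	timeout_used = sketch_txstatus_timeout_ms(dev); /* stands in for flushing each queue */
	sketch_clear_bit(SKETCH_STATE_FLUSHING, &dev->flags);

	return timeout_used;
}
```

The point of clearing the bit at the end is that normal operation goes back to the longer timeout as soon as the flush returns.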
Comment 26 sani 2018-08-10 13:56:04 UTC
OK, I compiled and flashed with the 4 patches, and removed the patch 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch before starting.
I just enabled wifi and left everything at the defaults: OpenWrt SSID, no password and channel 11.
I cannot even connect to the router. The iPhone 7 reports "unable to join".
On macOS there is again no success, or it connects but the internet is very slow or missing.
After many tries the phone connected.
Sorry to say, but I think the driver needs much more work :(
T-Bone, how did you have no problems at all? What kind of routers/hardware did you test?
I tested a tp-link 840 version 5 and an evaluation-board kind of router like this one:
openwrt-ramips-mt7628-mt7628-squashfs-sysupgrade.bin
If I can do or test something, I will do it with pleasure.
This router has been sitting around for over 2 years, and there is still nothing usable for it such as OpenWrt or LEDE.
Comment 27 T-Bone 2018-08-10 16:10:05 UTC
(In reply to sani from comment #26)

> T-Bone how you did not have problems at all ? What kind of routers/hardware
> you tested ?

I have tested a Netgear WN3000RPv3, as I said in my first comment.
It was non-functional before the patches and perfectly working after.
I'm currently away and can't test the 4th patch before next week.
I have a Netgear EX3700 somewhere which also has a 7620 radio for 2.4 which I plan to test too. Will report.

Note that the TL-WR840N v4/v5 have a different radio: a 7628 which uses the mt76 driver, not the rt2800 which this particular ticket is about.

Thus the patches submitted by sgruszka will have no effect on that device.

Among the devices you listed in comment #7, I think only the Phicomm K2 PSG1218 and ZBT-WR8305RT are concerned by these patches: they appear to be 7620.

HTH
Comment 28 Stanislaw Gruszka 2018-08-11 10:55:59 UTC
MT7620 chips come in different variants, which require different register programming by the driver. Moreover, even with the same chip variant, a board can have different external parts connected to the chip (oscillators, different antenna types, amplifiers, etc.), which also require different device programming by the driver. Those settings are usually encoded in the EEPROM, but sometimes the EEPROM in the device is not correctly burned and a correct EEPROM image needs to be provided to the driver by the OS. Another factor is temperature compensation code: it is needed on some devices and the rt2800 driver lacks it.

In summary, there are various reasons why the driver works on some routers and not on others. To make things work on sani's routers, someone will need to figure out what the vendor binary drivers do and implement that in the rt2800 driver. This is not a trivial task. If the sources of the vendor driver are available for those routers, the changes can be read from there. Otherwise the binary driver needs to be reverse engineered.
Comment 29 Stanislaw Gruszka 2018-08-11 11:03:10 UTC
(In reply to Angelo Corsaro from comment #24)
> the OpenWRT version to test the patches. Can someone give me the version in
> order to test the patches on an Astoria?

I'll ask dangole to include the patches in his staging tree, but first I want to do some rework on them.

Please do not test the 4th patch for now either; it needs some rework as well.
Comment 30 sani 2018-08-12 09:25:02 UTC
Yes, I am testing only the MediaTek 7628. Is there any chance we will get fixes similar to those for the 7620?
Comment 31 Stanislaw Gruszka 2018-08-12 09:33:42 UTC
(In reply to sani from comment #30)
> Yes i am testing only mediatek 7628. Any chance we have similar fixes like
> for 7620 ?

That was confusing. The mt76 driver needs different fixes. What about the two other routers that use MT7620?
Comment 32 sani 2018-08-12 10:12:18 UTC
I will try them next week. Thanks for your support.
Comment 33 Stanislaw Gruszka 2018-08-15 11:43:44 UTC
I have updated version of the patches here:
https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft-v2

The first 3 patches did not change. I tested them on rt2800usb and now somehow they work. It looks like in my previous tests I must have had some other patch applied that broke USB.

I asked Daniel to put the patches on his openwrt staging tree.
Comment 34 Stanislaw Gruszka 2018-08-17 11:15:33 UTC
FTR: the patches are available for testing in dangole's tree:
https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary
Comment 35 T-Bone 2018-08-17 15:03:45 UTC
Hi,

I have applied the 5 patches over a clean checkout of OpenWRT 18.06.0 (that's my reference point for "not working" state).

Patch 4 (feb8797) doesn't apply cleanly:
patching file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
Hunk #1 FAILED at 720.

I've manually fixed it to apply.

I built (no warning) and stress-tested the result on Netgear WN3000RPv3: as far as I can tell everything works but it appears the error messages are back:

ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2

I've slightly modified my test procedure (I now generate about symmetrical rx/tx bandwidth) so I'll retest with just the first 3 to see if they're also prone to showing the message under this scenario.

HTH
Comment 36 T-Bone 2018-08-17 17:17:38 UTC
Replying to myself: I've been able to get the error message with just the first 3 patches as well. It takes time, but it eventually happens.

The link nevertheless survives and remains operational.
Comment 37 Stanislaw Gruszka 2018-08-18 11:04:26 UTC
I would really like to know under what circumstances we get those errors. We basically stop the queue once the number of available entries becomes less than a particular threshold. We should not get any packets from the upper mac80211 layer at that point.
Let's try to increase the threshold...
Comment 38 Stanislaw Gruszka 2018-08-18 11:07:10 UTC
Created attachment 277927 [details]
rt2x00_queue_threshold.patch

Please test this patch. If the error messages still happen, try to increase the threshold even more:

queue->threshold = DIV_ROUND_UP(queue->limit, 4);
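
To make the threshold discussion easier to follow, here is a minimal user-space sketch of the rule described in comment 37: the queue is paused once the number of free entries drops below a threshold derived from the queue limit. It is only an illustration under stated assumptions, not the rt2x00 code. The `sketch_queue` type and its functions are hypothetical; the limit of 64 and the divisors 8 and 4 are the values mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Same rounding-up integer division as the kernel's DIV_ROUND_UP macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical, simplified stand-in for the driver's TX data queue. */
struct sketch_queue {
	unsigned int limit;     /* total number of TX entries */
	unsigned int length;    /* entries currently in use */
	unsigned int threshold; /* pause when fewer than this many entries are free */
	bool paused;            /* true while the upper layer is asked to stop sending */
};

static void sketch_queue_init(struct sketch_queue *q, unsigned int limit,
			      unsigned int divisor)
{
	assert(limit > 0 && divisor > 0);
	q->limit = limit;
	q->length = 0;
	q->threshold = DIV_ROUND_UP(limit, divisor);
	q->paused = false;
}

static unsigned int sketch_queue_available(const struct sketch_queue *q)
{
	return q->limit - q->length;
}

/*
 * Queue one frame. Returns false when the queue is already full, which is
 * the situation the "Dropping frame due to full tx queue" message reports.
 */
static bool sketch_queue_write(struct sketch_queue *q)
{
	if (q->length == q->limit)
		return false;

	q->length++;
	if (sketch_queue_available(q) < q->threshold)
		q->paused = true;
	return true;
}

/* One frame completed: free its entry and resume once enough room is back. */
static void sketch_queue_complete(struct sketch_queue *q)
{
	if (q->length > 0)
		q->length--;
	if (sketch_queue_available(q) >= q->threshold)
		q->paused = false;
}
```

With a limit of 64, a divisor of 8 gives a threshold of 8 and a divisor of 4 gives 16. A larger threshold pauses the queue earlier, leaving more headroom for frames that arrive after the pause was requested. In this model a full-queue drop is only possible if frames keep arriving while the queue is paused, which is the behaviour comment 37 says should not happen.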
Comment 39 T-Bone 2018-08-18 17:23:35 UTC
Reporting back:

I tried your latest "rt2x00_queue_threshold.patch" on top of a clean checkout of OpenWRT 18.06.0 plus your other 5 patches.

The new patch does not apply cleanly. In particular, the second hunk for rt2x00queue.c (line 719) doesn't apply at all: that test calling rt2x00queue_pause_queue() doesn't exist in the resulting OpenWRT output.

I have modified the patch to remove this offending hunk and adjust another one that would otherwise fail, and I've tested the result.

The error message returned after about 40-45 minutes of constant data throughput, and likewise after changing the queue->threshold value. Interestingly, the errors seem to show up in bursts of 5-6, but only once (at least during my ~2h test runs).

HTH
Comment 40 Stanislaw Gruszka 2018-08-20 11:01:54 UTC
I'm going to prepare a debug patch that prints the queue state when we get this error. For now you can check whether increasing queue->limit to 128 (while keeping the divisor of 4 when calculating the threshold) makes the errors go away:

--- a/drivers/net/wireless/ralink/rt2x00/rt2800mmio.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800mmio.c
@@ -673,7 +673,7 @@ void rt2800mmio_queue_init(struct data_queue *queue)
        case QID_AC_VI:
        case QID_AC_BE:
        case QID_AC_BK:
-               queue->limit = 64;
+               queue->limit = 128;
                queue->data_size = AGGREGATION_SIZE;
                queue->desc_size = TXD_DESC_SIZE;
                queue->winfo_size = txwi_size;
Comment 41 Stanislaw Gruszka 2018-08-20 12:11:52 UTC
Here is debug patch:

https://github.com/sgruszka/wireless-drivers-next/commit/525c50486e17446793b21ac7a8498cb48b3bb210.patch

Please test it without the threshold changes and provide the dmesg output as a txt file attachment. I would like to see whether we correctly stop the queue in mac80211.
Comment 42 T-Bone 2018-08-23 14:50:02 UTC
Created attachment 278047 [details]
dmesg output with debug patch applied

Debug output attached.
Triggering the message was much quicker and much more frequent this time.

As usual, clean checkout of OpenWRT 18.06.0 with all your patches applied. Threshold unmodified from your patches (DIV: 8).
Comment 43 Cjcr 2018-08-24 20:10:20 UTC
Hi, Stanislaw, here are more people doing test over Dangole's repo:

https://bugs.openwrt.org/index.php?do=details&task_id=896
Comment 44 Stanislaw Gruszka 2018-08-25 09:29:57 UTC
Hi Cjcr, so it looks like you have the same case as T-Bone: the patches made your router workable, but the error message still happens, correct?
Comment 45 Cjcr 2018-08-25 11:11:17 UTC
(In reply to Stanislaw Gruszka from comment #44)
> Hi Cjcr, so looks you have the same case as T-Bone , patches made your
> router workable , but error massage still happen, correct ?

That's right. It works, but the errors appear under high data transfer. Anyway, I'm very happy that now it at least doesn't die when that happens. Thank you for that, and thanks also to dangole for his collaboration.
Comment 46 Alfonso 2018-08-25 14:53:04 UTC
Good morning. I'm new here, but I decided to register to share my experience with this bug. I have lots of Alfa R36 routers with the Ralink rt2x00 driver, and I have been testing this router for over 3 years, experiencing the same bug (it is not present in the original Alfa firmware or in the dd-wrt firmware). I have also been following this saga for a while, waiting for a working patch, and testing every single patch since I found this thread.

I built the 18.06 branch with the patches provided here. My lack of knowledge did not help me resolve the failed hunks needed to apply the other 2 patches, so I was only able to include these 3 patches in the final build:

***txstatus-routines-to
***txstatus-routines-from-lib
***txstatus-timeout-every-time

The router has been working fantastically, with no more hangs, but as the other members (Cjcr & T-Bone) confirmed, the error message still appears. If somebody can give me instructions on how to fix the failed hunks for the 2 remaining patches, I would really appreciate it. Thank you Stanislaw and dangole for this amazing collaboration!!!
Comment 47 mmyz1234 2018-08-29 04:22:58 UTC
Hi. Thank you for trying to fix this bug.
I tried to compile OpenWrt 18.06.1, which is the current stable release, and patch it with your patches. But there is an error when applying 704-rt2x00-use-different-txstatus-timeouts-when-flushing.patch.

patching file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
Hunk #1 FAILED at 720.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c.rej
Patch failed!  Please fix ./patches/704-rt2x00-use-different-txstatus-timeouts-when-flushing.patch!

I would like to know how to fix it, because I am a novice.
Thanks.
Comment 48 Stanislaw Gruszka 2018-08-29 10:50:12 UTC
(In reply to mmyz1234 from comment #47)
> I want to know how to fix it because I am a tiro.

You can read https://elinux.org/Handling_Patch_Rejects to learn .

However, I think I will just add an OpenWrt repo to my GitHub page and apply the patches there, since there is interest in having them applied on top of the 18.06.1 release. I'm not sure why people don't want to use dangole's repo, though.
Comment 49 RussianNeuroMancer 2018-08-29 14:41:38 UTC
I am testing dangole's repo on an ASUS RT-N56U A1 with a Ralink RT3883, and so far it has been great.
Comment 50 mmyz1234 2018-08-29 15:57:26 UTC
Created attachment 278199 [details]
dmesg output with debug patch - phicomm psg1218 rev.a
Comment 51 mmyz1234 2018-08-29 16:16:29 UTC
(In reply to Stanislaw Gruszka from comment #48)
> (In reply to mmyz1234 from comment #47)
> > I want to know how to fix it because I am a tiro.
> 
> You can read https://elinux.org/Handling_Patch_Rejects to learn .
> 
> However I think I will just add openwrt repo to my github page and apply
> patches there since there is interest of having them applied on top of
> 18.06.1 release. Not sure why don't want to use dangole repo, though.

Thanks.
I have now compiled and flashed it successfully. dmesg shows the errors very quickly. It looks like the 2.4 GHz wireless still works fine. Maybe I need to observe it for a while.
Comment 52 T-Bone 2018-08-29 16:26:39 UTC
(In reply to Stanislaw Gruszka from comment #48)
> However I think I will just add openwrt repo to my github page and apply
> patches there since there is interest of having them applied on top of
> 18.06.1 release. Not sure why don't want to use dangole repo, though.

Besides tracking master (and not the 18.06 release), dangole's repo contains more than just your fixes. I personally think it's more efficient (and best practice) to only test code meant to fix a bug when debugging, vs trying a melting pot of other things whose (side) effects aren't known.

If we can get a well-defined patch for this bug, there's a chance to have it included in the current stable release, which would benefit all the users tracking releases. If we can't demonstrate that the patches you offer are sufficient on their own to fix the issue, we can't get that fix backported. QED :)

Besides, dangole's repo contains code that has very little (if any) chance of ever getting accepted in openwrt master, so no point in messing with that IMO.

Meanwhile, you didn't react after I sent debug data: was it useful?

My 2c.
Comment 53 Alfonso 2018-08-29 19:39:05 UTC
Hello, can somebody provide instructions and/or tell me whether it's possible to apply these patches to another OpenWrt version (specifically Chaos Calmer)? I really need that version in order to successfully run coova-chilli; it can't run on LEDE/18.06 because of memory restrictions. Thanks in advance...
Comment 54 Stanislaw Gruszka 2018-08-30 16:21:37 UTC
(In reply to T-Bone from comment #52)
OK, that makes sense. The debug data was useful, thanks, but I am still thinking about what to do next...
Comment 55 Stanislaw Gruszka 2018-09-06 12:52:00 UTC
I updated the openwrt rt2x00 branch (18.06 based) with the 5 previous patches plus 2 that increase queue->limit to 256 and change the threshold divisor to 4:

https://github.com/sgruszka/openwrt/commits/rt2x00

Please test. It would be especially interesting to know whether this works on older RT3***, RT5*** chips and whether it makes bufferbloat much worse. I.e. in addition to iperf testing, please also test with some tools that measure bufferbloat, see:

https://flent.org/
https://www.bufferbloat.net/projects/bloat/wiki/Tests_for_Bufferbloat/

I am considering posting the first 5 patches. Posting the latest 2 patches depends on the test results. I will maybe do the queue->limit increase only on MT7620, or increase it to 128 on all chips. Not sure yet.

I will also remove the message (make it debug level), since I discovered that printing the messages can by itself hang the CPU for a few seconds on my Nexx WT3020H router, which can make the wireless connection drop!

Anyway please test. Thanks in advance!
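
As an illustration of why demoting the message helps, here is a minimal user-space sketch of counting drops and reporting at most once per interval instead of printing one line per dropped frame. This is not the attached "replace printk with netlink accounting" patch, whose contents are not shown in this thread; the `drop_stats` type, the function name and the interval value are hypothetical. In the kernel itself, helpers such as `printk_ratelimited()` or `net_ratelimit()` serve a similar purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue drop accounting. */
struct drop_stats {
	uint64_t dropped;        /* total frames dropped so far */
	uint64_t next_report_ms; /* earliest time the next log line is allowed */
};

/*
 * Record one dropped frame. Returns true when the caller should emit a
 * single summary line, and false when logging should be skipped because a
 * report was already made within the last interval_ms milliseconds.
 */
static bool drop_stats_record(struct drop_stats *s, uint64_t now_ms,
			      uint64_t interval_ms)
{
	s->dropped++;

	if (now_ms < s->next_report_ms)
		return false;

	s->next_report_ms = now_ms + interval_ms;
	return true;
}
```

A burst of drops then costs one counter increment per frame plus at most one log line per interval, so a slow console cannot stall the CPU for the duration of the burst.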
Comment 56 Cjcr 2018-09-06 16:53:00 UTC
Hello @Stanislaw

I will test it ASAP and I will report the results here. Thank you!
Comment 57 RussianNeuroMancer 2018-09-14 05:21:43 UTC
After 12 days of uptime, the build from dangole's repo started to throw the same error on the ASUS RT-N56U A1, and WiFi also stopped working. Now I am testing a build from https://github.com/sgruszka/openwrt/commits/rt2x00
Comment 58 Stanislaw Gruszka 2018-09-19 13:13:10 UTC
Can you provide test results? The most interesting question is whether increasing the queue size makes bufferbloat (a lot) worse.

Note that if I have to do the testing myself, I cannot do other valuable work.
Comment 59 RussianNeuroMancer 2018-09-19 14:16:56 UTC
With the build from https://github.com/sgruszka/openwrt/commits/rt2x00, WiFi stopped working (without an error in the logs, which is expected, I guess) on the first day of testing, after around eight hours of uptime. However, it has not failed again since the router was rebooted.

> The most interesting thing is if increasing queue size do not make
> bufferbloat (a lot) worse.

What is the proper way of verifying this? Is running betterspeedtest.sh a few times with upstream 18.06.1 and with the rt2x00 tree sufficient? Do I need to disable the VPN connection on the gateway to get proper results, or does this not matter?
Comment 60 daiaji 2018-09-20 07:30:27 UTC
(In reply to Stanislaw Gruszka from comment #55)
> I updated openwrt rt2x00 branch (18.06 based) with 5 previous patches and 2
> that increase queue->limit to 256 and threshold to div 4:
> 
> https://github.com/sgruszka/openwrt/commits/rt2x00
> 
> Please test. Especially would be interesting if this work on older RT3***,
> RT5*** chips and if it does not make bufferbloat much worse. I.e. except
> with iperf testing please also test with some tools, which measure
> bufferbloat, see:
> 
> https://flent.org/
> https://www.bufferbloat.net/projects/bloat/wiki/Tests_for_Bufferbloat/
> 
> I considered to apply/post 5 first patches. Posting latest 2 patches depend
> of test results. I will maybe do queue->limit increase only on MT7620 or do
> increase to 128 on all chip. Not sure yet. 
> 
> I also will remove the message (make it at debug level) since I discovered
> that printing messages by itself can hang cpu for few seconds on my Nexx
> WT3020H router, what can make wireless connection drops!
> 
> Anyway please test. Thanks in advance!

I built the image using the rt2x00 branch and wrote the image to my phicomm psg1218 rev.a
After that I ran speedtest.net for 2.4Ghz wireless network load test.
It seems that as long as there is a load of 30Mbps, my phone will disconnect from phicomm psg1218 rev.a, but this will not cause rt2x00 to hang. It seems that I can continue to use 2.4Ghz network as long as I reconnect to the AP.
The system and the kernel log do not seem to have anything special.
Maybe I should continue testing?
Comment 61 Stanislaw Gruszka 2018-09-20 10:14:24 UTC
(In reply to RussianNeuroMancer from comment #59)
> With build from https://github.com/sgruszka/openwrt/commits/rt2x00 WiFi stop
> working (without error in the logs, which is expected, I guess) in first day
> of testing, after around eight hours of uptime. However it's still not
> failed after router reboot. 

I have not removed the message yet, so that we can check whether increasing the queue size and threshold helps with packets being queued to a full queue. I still plan to remove the message, though.

If wifi stops working, this is a different problem; it looks like a HW/FW hang. I think we would need to implement a watchdog to reset the HW/FW. However, so far I don't know how to detect the problem or how to perform the reset. I would need to experiment with that (and I cannot reproduce the issue, which is a problem).

> > The most interesting thing is if increasing queue size do not make
> > bufferbloat (a lot) worse.
> 
> What is proper way of verifying this? Running betterspeedtest.sh few times
> with upstream 18.06.1 and with rt2x00 tree is sufficient? Do I need to
> disable VPN connection on a gateway to get proper results or this is doesn't
> matter?

I don't know how to measure bufferbloat; I haven't run those tests. Please figure this out yourself based on the provided documentation. Please test with
the top of my rt2x00 branch and with the:

    661-0001-rt2800-change-queue-limit-to-256-for-pci-soc.patch
    661-0002-rt2x00-increase-threshold.patch

patches removed (you can achieve that by doing "git reset --hard HEAD~1")
Comment 62 daiaji 2018-09-20 15:27:21 UTC
(In reply to Stanislaw Gruszka from comment #61)
> (In reply to RussianNeuroMancer from comment #59)
> > With build from https://github.com/sgruszka/openwrt/commits/rt2x00 WiFi
> > stop working (without error in the logs, which is expected, I guess) in
> > first day of testing, after around eight hours of uptime. However it's
> > still not failed after router reboot. 
> 
> I did not remove message yet, to check if increasing queue size and
> threshold help with queuing packet to full queue. I plan to remove message
> though. 
> 
> If wifi stop working this is some different problem, look like HW/FW hung.
> So I think we would need to implement watchdog to reset HW/FW. However so
> far I don't know how to detect the problem and how to perform reset. I would
> need to experiment with that (and I can not reproduce that issue, so that's
> problem).
> 
> > > The most interesting thing is if increasing queue size do not make
> > > bufferbloat (a lot) worse.
> > 
> > > What is proper way of verifying this? Running betterspeedtest.sh few times
> > > with upstream 18.06.1 and with rt2x00 tree is sufficient? Do I need to
> > > disable VPN connection on a gateway to get proper results or this is
> > > doesn't matter?
> 
> I don't know how to measure buffer bloat, I haven't made those test. Please
> figure this out by yourself based on provided documentation. Please test with
> top of my rt2x00 branch and with:
> 
>     661-0001-rt2800-change-queue-limit-to-256-for-pci-soc.patch
>     661-0002-rt2x00-increase-threshold.patch
> 
> patches removed (you can achieve that by doing "git reset --hard HEAD~1")

I also encountered the same situation: rt2x00 stopped working after running for 22 hours.
My phone can't connect to the AP, and dmesg doesn't provide any valuable logs.
https://gist.github.com/daiaji/2d032bc9d849d276ceb3c26f79b2a704
Comment 63 Stanislaw Gruszka 2018-09-21 07:37:41 UTC
(In reply to daiaji from comment #62)
> My phone can't connect to the AP, and dmesg doesn't provide any valuable
> logs.
> https://gist.github.com/daiaji/2d032bc9d849d276ceb3c26f79b2a704

There are warnings:

rt2800_config_channel: Warning - Using incomplete support for external PA

I would check whether support for the external PA has been added in the dangole tree.
Comment 64 mmyz1234 2018-09-24 08:48:41 UTC
It is also unstable with the latest patches applied on my phicomm psg1218 rev.a. After a few hours, the 2.4G wireless stops working, without this error in the dmesg output. To connect to the 2.4G wireless again, I need to reboot the router. As mentioned before, there are warnings: "rt2800_config_channel: Warning - Using incomplete support for external PA".

There is another issue, and I don't know if I should report it here. If I set "option noscan 1" to enforce the use of a 40 MHz channel for 802.11n in the 2.4 GHz band, I cannot connect to the 2.4G wireless. After I reboot my router, I can connect to it, but I am quickly disconnected and cannot connect again later. I don't know if this issue is related to this bug or if other routers have it.

Thanks.
Comment 65 mmyz1234 2018-09-24 08:58:58 UTC
Created attachment 278729 [details]
dmesg output with latest patches - phicomm psg1218 rev.a

The dmesg output is here.
Comment 66 RussianNeuroMancer 2018-11-24 16:05:06 UTC
I haven't tested bufferbloat yet, because I wasn't able to get a long enough uptime (due to unrelated issues such as a power outage) to verify whether the rt2x00 branch is stable enough for my particular device (ASUS RT-N56U A1). 

Unfortunately, I have hit the issue described in Comment 59 three times so far. Today it happened on the 16th day of uptime, after copying 3 GB via the 5 GHz AP. There were no error messages in dmesg, and disabling/enabling the network in the OpenWRT settings was sufficient to recover. It seems a reboot is not necessary, or at least that is how it looks so far.

Stanislaw, please let me know if measuring bufferbloat is still necessary, even if the patches from the rt2x00 branch don't achieve stable WiFi operation?
Comment 67 Stanislaw Gruszka 2018-11-26 09:42:53 UTC
(In reply to RussianNeuroMancer from comment #66)
> Unfortunately, I got issue described in Comment 59 three times so far. Today
> it's happened on 16th day of uptime, after copying 3 GB via 5 GHz AP. There
> was no error messages in dmesg, and disabling/enabling network in OpenWRT
> setting was sufficient. Seems like reboot is not necessary, or it's how it
> looks like so far.
> 
> Stanislaw, please let me know if measuring buffer bloat is still necessary,
> even if patches from rt2x00 branch doesn't achieve stable WiFi operation?

Yes, I'm still waiting for bufferbloat results to see if we can increase the tx queue size.

Regarding your issue, as already pointed out in comment 61, it's a different problem. I guess we can make a deal: if you do the bufferbloat testing for me, I'm going to work on the watchdog :-)
Comment 68 Stanislaw Gruszka 2018-12-05 10:19:01 UTC
I did bufferbloat testing by myself. It's very disappointing that nobody wanted to do this.

There is indeed an effect on ping latency when the tx queue size is increased. I used this script:
https://github.com/richb-hanover/CeroWrtScripts/blob/master/betterspeedtest.sh

TX length 64
threshold 7

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 11:05:38 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
............................................................
 Download:  49.63 Mbps
  Latency: (in msec, 57 pings, 0.00% packet loss)
      Min: 4.850
    10pct: 20.100
   Median: 40.700
      Avg: 39.010
    90pct: 55.500
      Max: 77.100
.............................................................
   Upload:  58.13 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 0.986
    10pct: 9.210
   Median: 13.300
      Avg: 15.919
    90pct: 25.000
      Max: 52.600
[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 11:08:57 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
..............................................................
 Download:  12.79 Mbps
  Latency: (in msec, 62 pings, 0.00% packet loss)
      Min: 10.000
    10pct: 10.100
   Median: 12.600
      Avg: 16.202
    90pct: 25.300
      Max: 42.100
.............................................................
   Upload:  57.17 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 10.000
    10pct: 17.600
   Median: 22.100
      Avg: 23.969
    90pct: 32.900
      Max: 50.700
TX length 256
threshold 64

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 10:26:17 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
............................................................
 Download:  50.56 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 1.620
    10pct: 44.800
   Median: 75.400
      Avg: 75.286
    90pct: 104.000
      Max: 134.000
.............................................................
   Upload:  57.14 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 1.190
    10pct: 8.540
   Median: 15.000
      Avg: 16.940
    90pct: 26.600
      Max: 49.100
[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 10:29:16 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
.............................................................
 Download:  14.89 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 10.000
    10pct: 10.200
   Median: 14.400
      Avg: 18.274
    90pct: 30.600
      Max: 59.900
..............................................................
   Upload:  52.7 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 10.400
    10pct: 13.400
   Median: 21.000
      Avg: 22.331
    90pct: 30.900
      Max: 57.000



TX length 512
threshold 126

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 10:45:36 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
.............................................................
 Download:  50.2 Mbps
  Latency: (in msec, 58 pings, 0.00% packet loss)
      Min: 6.400
    10pct: 78.700
   Median: 122.000
      Avg: 133.705
    90pct: 208.000
      Max: 248.000
.............................................................
   Upload:  55.44 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 2.020
    10pct: 9.900
   Median: 14.400
      Avg: 17.089
    90pct: 26.800
      Max: 57.200
[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 10:50:41 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
..............................................................
 Download:  16.32 Mbps
  Latency: (in msec, 61 pings, 0.00% packet loss)
      Min: 15.400
    10pct: 15.600
   Median: 18.800
      Avg: 23.238
    90pct: 32.700
      Max: 69.600
..............................................................
   Upload:  50.52 Mbps
  Latency: (in msec, 62 pings, 0.00% packet loss)
      Min: 16.000
    10pct: 24.000
   Median: 35.000
      Avg: 37.535
    90pct: 51.000
      Max: 78.200
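
To make the trend in these runs easier to see, here is a small sketch that compares the ping latency measured during the download phase against the local router (192.168.10.1) for the three queue sizes. The numbers are copied from the output above; the dictionary layout and the latency_ratio() helper are only illustrative.

```python
# Median and 90th-percentile ping latency (ms) during the download phase
# against 192.168.10.1, copied from the betterspeedtest.sh runs above.
download_latency_ms = {
    64:  {"median": 40.7,  "p90": 55.5},
    256: {"median": 75.4,  "p90": 104.0},
    512: {"median": 122.0, "p90": 208.0},
}


def latency_ratio(queue_len, baseline=64, key="median"):
    """Latency at queue_len relative to the baseline queue length."""
    return download_latency_ms[queue_len][key] / download_latency_ms[baseline][key]


for qlen in sorted(download_latency_ms):
    print(f"TX length {qlen:3d}: median x{latency_ratio(qlen):.2f}, "
          f"90pct x{latency_ratio(qlen, key='p90'):.2f}")
```

Relative to a queue length of 64, the median latency under download load is a bit under 2x at 256 and about 3x at 512, while the measured download throughput stays at roughly 50 Mbps in all three runs.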
Comment 69 Stanislaw Gruszka 2019-01-02 08:07:54 UTC
T-Bone@parisc-linux.org and Cjcr, could you check whether removing the "Dropping frame ..." prints with the patch below is sufficient to fix the wifi hangs?

https://lore.kernel.org/linux-wireless/1545318971-28351-3-git-send-email-sgruszka@redhat.com/raw

There is a claim that the 5 patches that fix the issues for you cause a throughput regression, and that removing the "Dropping frame ..." prints is sufficient to fix the wifi hang problem. The full discussion is here:

https://lore.kernel.org/linux-wireless/20181221125146.GB30351@redhat.com/T/#m460700c2ae3cd1bdeb1d6001c0fbf2945f412bb0
Comment 70 Daniel Santos 2019-01-02 21:55:35 UTC
Created attachment 280251 [details]
Patch to replace printk with netlink accounting

So this is the patch we ended up going with.  I'll get it properly documented and submit it to LKML.  This occurs when there are many processes all doing sendmsg/sendto and they get preempted in that racy area of ieee80211_tx_frags, after it checks to see if the queue is stopped, but releases its spinlock prior to calling the driver's .tx function, thus allowing userspace threads to be preempted.  This happens when running chilli-coova, which can have 100+ child processes via fork.

To fully resolve the overall performance problems, modifications had to be made to chilli-coova as well (essentially replacing some non-blocking calls with blocking calls).

Perhaps the main reason the printk is so deadly on some systems is that they are configured to emit the kernel log over a 56k serial line. :)
Comment 71 Daniel Santos 2019-01-02 22:07:56 UTC
Oh, I forgot to add that while I was not able to personally reproduce the problem, I was able to get some diagnostics from somebody who could and we definitely are *really* waiting for the hardware to tx (I had suspected something else in software might be causing tx delays, but there is no evidence of this).  The tests were carried out in an RF environment that's almost certainly congested in the ISM band (downtown in a major city).  The same error could not be reproduced on an almost identical device in the same environment, but with 9dBi antennae vs the 2.5dBi antennae of the AP where the problem occurs.
Comment 72 Stanislaw Gruszka 2019-01-03 14:20:53 UTC
(In reply to Daniel Santos from comment #70)
> Created attachment 280251 [details]
> Patch to replace printk with netlink accounting

I already posted a patch which changes the print from err to dbg. Regarding adding drop statistics, this is a good idea, but I'm not sure if the solution from the patch is the right one.
Comment 73 Daniel Santos 2019-01-04 01:35:55 UTC
(In reply to Stanislaw Gruszka from comment #72)
> (In reply to Daniel Santos from comment #70)
> > Created attachment 280251 [details]
> > Patch to replace printk with netlink accounting
> 
> I already posted patch witch changes print from err to dbg . Regarding
> adding drop statistics, this is good idea but I'm not sure if solution from
> the patch is the right one.

Oh, sorry I didn't see that patch.  Probably the *best* thing to do is put a rate limiter on it and emit a max of once per x seconds telling how many messages were squelched?
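
The rate-limiting idea above (emit at most once per x seconds and report how many messages were squelched) can be sketched as a small userspace model. This is not driver code: the RateLimitedLog class and its parameters are made up for illustration, and in the kernel one would more likely reach for an existing helper such as printk_ratelimited().

```python
import time


class RateLimitedLog:
    """Emit at most one message per `interval` seconds and count the rest."""

    def __init__(self, interval, clock=time.monotonic, emit=print):
        self.interval = interval   # minimum number of seconds between messages
        self.clock = clock         # injectable time source (eases testing)
        self.emit = emit           # injectable sink, print() by default
        self.last_emit = None      # time of the last emitted message
        self.suppressed = 0        # messages squelched since the last emit

    def log(self, message):
        """Return True if the message was emitted, False if it was squelched."""
        now = self.clock()
        if self.last_emit is not None and now - self.last_emit < self.interval:
            self.suppressed += 1
            return False
        if self.suppressed:
            message = f"{message} ({self.suppressed} similar messages suppressed)"
        self.emit(message)
        self.last_emit = now
        self.suppressed = 0
        return True
```

With an interval of 5 seconds, a burst of "Dropping frame" messages would produce one line immediately and then one summary line per 5 seconds for as long as the drops continue.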

I promise that updating the statistics is the right thing to do, but I don't promise that I've done it the right way! :)  The screwy thing here is that, afaict, the mac80211 subsystem always increments the tx frame and byte count.

Personally, I think that the very best solution is to change the driver .tx function to return an int error code and let mac80211 manage the stats as well as adding the possibility for it to requeue the frame or fragment in its own queue so it doesn't necessarily need to be lost.  Please pardon my ignorance here, as I don't know how and where the decision is made when we must drop frames to avoid buffer bloat, cache thrash, etc.  Of course, this would be a substantial effort and would break out-of-tree drivers.
Comment 74 Stanislaw Gruszka 2019-01-04 12:18:02 UTC
(In reply to Daniel Santos from comment #73)
> Oh, sorry I didn't see that patch.  Probably the *best* thing to do is put a
> rate limiter on it and emit a max of once per x seconds telling how many
> messages were squelched?
I have done that for some other rt2x00 messages for USB drivers, since logs were flooded when some USB host driver misbehaved. For debug messages I don't think this is necessary.

> Personally, I think that the very best solution is to change the driver .tx
> function to return an int error code and let mac80211 manage the stats as
> well as adding the possibility for it to requeue the frame or fragment in
> its own queue so it doesn't necessarily need to be lost.  Please pardon my
> ignorance here, as I don't know how and where the decision is made when we
> must drop frames to avoid buffer bloat, cache thrash, etc.  Of course, this
> would be a substantial effort and would break out-of-tree drivers.

Historically we drop frames silently, and this also happens in the mac80211 tx path if some error occurs. We do not have a tx_dropped statistic anywhere in mac80211, only rx_dropped. Some wireless drivers keep their own tx_dropped, and some fullmac drivers use netdev->stats.tx_dropped. I think the best option for rt2x00 would be to add a per-queue tx_dropped field and export it via debugfs through rt2x00debug_read_queue_stats.
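
As a rough illustration of a per-queue drop counter, here is a toy model of a fixed-size TX queue with a limit, a stop threshold, and a tx_dropped field. It is not the rt2x00 implementation: the TxQueueModel class and its method names are hypothetical, the stop condition (fewer free entries than the threshold) is an assumption, and the debugfs export is not modelled.

```python
class TxQueueModel:
    """Toy model of a TX queue with a stop threshold and a drop counter."""

    def __init__(self, limit, threshold):
        self.limit = limit          # total number of queue entries
        self.threshold = threshold  # stop when fewer entries than this are free
        self.length = 0             # entries currently in use
        self.stopped = False        # True once the upper layer should stop sending
        self.tx_dropped = 0         # hypothetical per-queue statistic

    def available(self):
        """Number of free entries."""
        return self.limit - self.length

    def write_frame(self):
        """Queue one frame; return False and count the drop if the queue is full."""
        if self.length >= self.limit:
            self.tx_dropped += 1
            return False
        self.length += 1
        if self.available() < self.threshold:
            self.stopped = True
        return True

    def tx_done(self):
        """One frame completed; wake the queue once enough entries are free."""
        if self.length:
            self.length -= 1
        if self.available() >= self.threshold:
            self.stopped = False


# Senders that keep calling write_frame() after the queue was stopped
# (similar to the race described in comment 70) end up in tx_dropped.
queue = TxQueueModel(limit=64, threshold=7)
accepted = sum(queue.write_frame() for _ in range(70))
print(accepted, queue.tx_dropped, queue.stopped)
```

In this model a larger threshold leaves more headroom for frames that arrive after the queue was stopped, at the cost of stopping the queue earlier.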

BTW: have you tested some of the other patches, i.e. the 5 patches from 
https://github.com/sgruszka/openwrt/commit/61809eedbfab55cae8a5feb48f761f8b6dd8b308
or increasing the queue length? I would like to know if the claim from 
https://lore.kernel.org/linux-wireless/CAKR_QVK_2j6a9YiwUEKuWF+ss0-pr808Sr=AUrX4a6L3Zw=F0w@mail.gmail.com/
is true: that removing the "Dropping frame due..." print fixes the same problem as the 5 patches do. I would also like to know if increasing the queue length does any good, apart from making ping times worse when the queue is full.
Comment 75 Cjcr 2019-04-17 13:30:23 UTC
@Stanislaw

Hi! I'm using your latest version:

Model	Nexx WT3020 (8M)
Architecture	MediaTek MT7620N ver:2 eco:6
Firmware Version	OpenWrt 18.06-SNAPSHOT r7293-1e98677 / LuCI openwrt-18.06 branch (git-18.247.71242-9541751)
Kernel Version	4.14.67

It seems to be working fine, but I cannot get it to work with a 40 MHz channel width, even using the "noscan = 1" trick.

Here's the copy&paste of wireless:

config wifi-device 'radio0'
        option type 'mac80211'
        option hwmode '11g'
        option path 'platform/10180000.wmac'
        option txpower '20'
        option noscan '1'
        option legacy_rates '0'
        option htmode 'HT40'
        option country 'JP'
        option channel '3'

config wifi-iface 'default_radio0'
        option device 'radio0'
        option mode 'ap'
        option network 'lan'
        option ssid '*****'
        option disassoc_low_ack '0'
        option encryption 'psk2'
        option key '*****'

config wifi-iface
        option device 'radio0'
        option mode 'ap'
        option ssid '****'
        option network 'lan'
        option disassoc_low_ack '0'
        option encryption 'psk2'
        option key '*****'
        option wmm '1'


Any idea?
Comment 76 Stanislaw Gruszka 2019-04-18 08:09:06 UTC
Cjcr, this is not related to this bug; it would be better to open a separate report, e.g. an issue on github against my openwrt repo. Anyway, with this config

config wifi-device 'radio1'
	option type 'mac80211'
	option channel '11'
	option hwmode '11g'
	option htmode 'HT40'
	option noscan '1'
	option path 'platform/10180000.wmac'
	option disabled '0'
	option country 'CZ'

HT40 works for me.
Comment 77 Cjcr 2019-04-23 12:00:15 UTC
(In reply to Stanislaw Gruszka from comment #76)
> Cjcr , this is not related to this bug, better would be open different
> report i.e. open issue on github against my openwrt repo. Anyway with this
> config
> 
> config wifi-device 'radio1'
>       option type 'mac80211'
>       option channel '11'
>       option hwmode '11g'
>       option htmode 'HT40'
>       option noscan '1'
>       option path 'platform/10180000.wmac'
>       option disabled '0'
>       option country 'CZ'
> 
> HT40 works for me.

Oh sorry, it seems that it was my phone that was not able to connect with a 40 MHz channel width, maybe because of the ROM I'm using. With other devices it seems to work fine. Sorry!
Comment 78 Stanislaw Gruszka 2019-04-29 10:40:30 UTC
Code for the problems reported in this bug is already in upstream and in the official OpenWRT repo. 

For the remaining issue, which is random hangs, I provided a watchdog here (after some more testing it will be upstreamed as well):

https://github.com/sgruszka/openwrt/commit/4667e54f528544b150c9841e167c883ff0b79794
Comment 79 Nikita Kniazev 2019-08-17 20:11:04 UTC
The issue still persists. Easily reproducible by simply installing **miniupnpd**, you will instantly see the log growing with frame drop messages, and by the end of the day the network will halt.

Reproduced on Xiaomi MiWiFi Mini (MediaTek MT7620A ver:2 eco:6) running OpenWrt 18.06.4 r7808-ef686b7292 (Kernel 4.14.131)
Comment 80 Daniel Santos 2019-08-17 23:37:23 UTC
(In reply to Nikita Kniazev from comment #79)
> Reproduced on Xiaomi MiWiFi Mini (MediaTek MT7620A ver:2 eco:6) running
> OpenWrt 18.06.4 r7808-ef686b7292 (Kernel 4.14.131)

It's not in 18.06.4.  Should this be backported though?  I would suggest you grab the commits Stanislaw linked above and cherry-pick them into your git tree.
Comment 81 Stanislaw Gruszka 2019-08-18 09:43:33 UTC
... or update to openwrt 19.07
Comment 82 T-Bone 2019-08-18 10:21:08 UTC
Not all devices can update to 19.07, this really should be backported to the 18 series if possible.
Comment 83 T-Bone 2019-08-25 11:32:28 UTC
(In reply to Stanislaw Gruszka from comment #74)

> BTW: have you tested some other patches i.e. 5 patches from 
> https://github.com/sgruszka/openwrt/commit/61809eedbfab55cae8a5feb48f761f8b6dd8b308
> or increasing queue length ? I would like to know if claim from 
> https://lore.kernel.org/linux-wireless/CAKR_QVK_2j6a9YiwUEKuWF+ss0-pr808Sr=AUrX4a6L3Zw=F0w@mail.gmail.com/
> is true: removing "Dropping frame due..." print fixes the same problem as 5
> patches do.

FWIW, it doesn't. I just tested on an EX3700 (fresh 18.06.4 tree with '600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch' deleted and '666-0003-rt2x00-do-not-print-error-when-queue-is-full.patch' added) with only 2.4 radio active: it hangs almost instantly on traffic. Nothing in dmesg, as expected with the only extra patch applied.

After disabling 2.4 radio, dmesg showed:
[  259.989136] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 0 failed to flush
[  260.076080] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 2 failed to flush
[  260.402375] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 0 failed to flush
[  260.489194] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 2 failed to flush

HTH
T-Bone
Comment 84 Nikita Kniazev 2019-08-25 12:08:47 UTC
I have built 19.07 and run it for a few days and I did not get those warnings or catch network halt, so the problem seems to be fixed. Though, I still experience low ACK disassociations and average of 30-50 Mbit/s performance (measured with iperf3) with a client that is 1 meter away from the AP (5Ghz performance also is not perfect, about 40-60 Mbit/s) and which had been reaching 95 Mbit/s on 100 Mbit/s Internet when the AP was running Padavan firmware.

Also on 19.07 I experience 'deauthenticated due to inactivity' messages that I have not seen on 18.06 in such quantity (or at all?):

Sat Aug 24 09:38:07 2019 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 08:d4:2b:xx:xx:xx
Sat Aug 24 09:38:07 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: disassociated
Sat Aug 24 09:38:08 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sat Aug 24 09:38:11 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: authenticated
Sat Aug 24 09:38:11 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: associated (aid 1)
Sat Aug 24 09:38:11 2019 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 08:d4:2b:xx:xx:xx