Using OpenWrt svn revision 41808 on MT7620N, when the messages below occur, we can no longer connect to the AP.
----
[ 3702.380000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.390000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.400000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.410000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.420000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.440000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.450000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.460000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.470000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.480000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.120000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.130000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.140000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.150000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[101808.160000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.680000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.690000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.700000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.710000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[102318.720000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
Still present; known occurrences with kernel 4.1.10 and beyond in current OpenWRT trunk builds. It has been observed for a couple of years now (as seen at OpenWRT), and I got the suggestion to report it upstream. Bug ticket: https://dev.openwrt.org/ticket/12313 (this contains various additional information, like kernel traces, in chronological order). The wireless connection dies after some time with this bug. In any case, transmitting a lot of data makes it stop pretty quickly (the connection is still displayed, but even already connected devices are not able to transmit any data).
Hi all, I have the same problem in OpenWRT but with a different kernel:
Linux ARV7510PW22 3.18.29 #15 Fri Jun 3 10:40:06 CEST 2016 mips GNU/Linux
The output from dmesg is the same:
[...]
[36356.952000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.956000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.968000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.976000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.984000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[36356.992000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[...]
This is an issue with the MT7620 OpenWRT patch, which is not (yet) in the upstream kernel. Daniel Golle is working to improve the patch. You can support him here: https://www.kickstarter.com/projects/1327597961/better-support-for-mt7620a-n-in-openwrt-lede
The latest openwrt from today still has the same issue. I tested 2 routers: zbt8305 and PSG1218 (phicomm k2).
ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
If I run a speed test I can crash the router in 5 minutes at most on the 2.4ghz band. Strange that this bug is over 4 years old and still not resolved.
(In reply to sani from comment #4) > Latest openwrt from today still same issue. > I tested 2 routers. > zbt8305 and PSG1218 (phicomm k2) > > ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to > full tx queue 2 > > If i run speed test i can crash the router in 5 minutes maximum at 2.4ghz > band. > Strange that this bug is from over 4 years and its still not resolved. Some cool guru marked this as obsolete and closed it :) Still crash. Latest openwrt and kernel.
FWIW, I have a Nexx WT3020H and I'm not able to reproduce this problem. Perhaps the problem is specific to a hardware or software configuration. Do you have a version which includes this change: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f4a639a3d7d40b4f63c431c2d554c479fbcc6b74
I have tested 3 different routers using the latest openwrt. Just compiled and flashed; I set the channel to 5, changed the ssid, set a wpa2 password and ran speedtest.net. Routers I tested:
tp-link 840 version 4/5
ZBT-WR8305RT
Phicomm K2 PSG1218 (here the 5ghz part is fine)
All these routers use the same ralink, which crashes very fast under heavy traffic. All these routers, tested using pandorabox and other binary drivers, work just fine. This is a driver or kernel problem for sure. If you are interested in fixing the problem I can set up remote access to these routers, as well as a buildroot or anything else you need to compile.
I'm interested in solving the problem, but not so eager to do remote debugging. I might have some patches to test though. Since it is reproducible on all your routers, this looks more like a config option, or perhaps the problem happens with the particular station devices that you have. I just changed to channel 5 on my router and ran speedtest.net: no errors. However, I'm using this tree with the commit mentioned earlier: https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary
Hello, I already tested this git: https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary No effect. It still crashes very easily. I tested maybe 10 different versions and all of them crash; only the binary drivers are OK :( If you have patches newer than 14 March 2018 I can test them.
I forgot: gargoyle crashes too.
You can check the last 3 patches from https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft I didn't test them on openwrt, but they should be applicable and should not blow things up there, but who knows ;-)
For the record, commits can be downloaded from github as raw patches by adding a .patch suffix to the URL. In the case of these commits it will be:
https://github.com/sgruszka/wireless-drivers-next/commit/b96f881f78ad5e15b4b036b5eba87e66309dc2b2.patch
https://github.com/sgruszka/wireless-drivers-next/commit/fdf3a7b2209b5d180c40ebc9e87c3756e6f8a0a8.patch
https://github.com/sgruszka/wireless-drivers-next/commit/d2ed41f8998e985970ecf39da561877f56cad391.patch
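As a small illustration of the ".patch suffix" rule described above, here is a minimal Python sketch. The helper name `commit_to_patch_url` is hypothetical (it is not part of any tool mentioned in this thread); it only performs the string manipulation and does not download anything.

```python
def commit_to_patch_url(commit_url: str) -> str:
    """Return the raw-patch URL for a GitHub commit page URL.

    Hypothetical helper: it only appends the '.patch' suffix
    described in the comment above; it performs no network access.
    """
    url = commit_url.rstrip("/")
    if url.endswith(".patch"):
        return url  # already a raw-patch URL, leave it alone
    return url + ".patch"


# The three commit hashes referenced in this thread.
BASE = "https://github.com/sgruszka/wireless-drivers-next/commit/"
COMMITS = [
    "b96f881f78ad5e15b4b036b5eba87e66309dc2b2",
    "fdf3a7b2209b5d180c40ebc9e87c3756e6f8a0a8",
    "d2ed41f8998e985970ecf39da561877f56cad391",
]

PATCH_URLS = [commit_to_patch_url(BASE + sha) for sha in COMMITS]

for patch_url in PATCH_URLS:
    print(patch_url)
```

The resulting URLs can then be fetched with any download tool and dropped into the mac80211 patches directory of the build tree.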
I am affected by this bug (on a Netgear WN3000RPv3 running OpenWRT 18.06), and I’ve tested your patches: I have applied the 3 patches on top of OpenWRT 18.06 « backport-2017-11-01 »: it needed a bit of massaging as the rt2800mmio_interrupt() hunk in rt2x00mmio.c failed to apply so I manually edited the file to match your patch; and I tried to stress test the result for about 3 hours, exchanging ~20GB of data over wireless. The good news is: I couldn’t crash the wireless connection (usually it would collapse in a matter of minutes as soon as a few Mbps of traffic was happening). The bad news is: the error message is still randomly printed, along with one I don’t remember seeing before: [ 1517.748008] ieee80211 phy0: rt2800mmio_txstatus_is_spurious: Warning - 4 spurious TX_FIFO_STATUS interrupt(s) But the connection did survive these messages so far (2 days uptime now), which is an improvement over the previous situation :) HTH
Removing the patch package/kernel/mac80211/patches/600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch should possibly make the 3 patches apply cleanly, and it should also make this message go away: "Warning - 4 spurious TX_FIFO_STATUS interrupt(s)". I'll prepare another patch which should help with interrupt stability and perhaps get rid of this message: ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
(In reply to Stanislaw Gruszka from comment #14) > I'll prepare another patch which should help with interrupts stability and Actually that change is already incorporated in patch "rt2800mmio: use txdone/txstatus rutines from lib" So perhaps it was not applied due to conflict in rt2800mmio.c Please remove 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch, then apply 3 experimental patches and retest. If messages still happen please attach full dmesg output as txt file.
(In reply to Stanislaw Gruszka from comment #15) > (In reply to Stanislaw Gruszka from comment #14) > > I'll prepare another patch which should help with interrupts stability and > > Actually that change is already incorporated in patch > "rt2800mmio: use txdone/txstatus rutines from lib" > > So perhaps it was not applied due to conflict in rt2800mmio.c > > Please remove > 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch, then > apply 3 experimental patches and retest. If messages still happen please > attach full dmesg output as txt file. I have done just that. All 3 patches now apply cleanly (only a few lines offset). I am happy to report that after stressing the wireless link for a couple hours and exchanging ~12GB, I wasn't able to trigger either messages (dmesg remains clean), and the link stayed perfectly stable! HTH
Nice! I need to do a bit more work here as patches break USB version of the driver (and need to add some flush fixes). Hopefully I'll do this soon and provide patches to test.
I would like to test too but having problems applying these patches. Which tree do you use ? If these patches works why can't you just commit global ?
OK, I manually applied the 3 patches to dangole's git. I also removed 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch. So far the router does not crash, but the link is very unstable. I tested with an iphone7, an android phone and a macbook. Very often it can not connect or asks for the password. There is much work to be done. With a lan cable connected to the router it is okay. Still, the wifi driver is unusable. You can not count on it. I tested all possibilities with 20/40 mhz, all channels.....
(In reply to sani from comment #19) > Ok i manually applied 3 patches to dangole git. Also i erased I did not use dangole's git. I used a clean checkout of OpenWRT 18.06. > So far router does not crash, but link is very unstable. There might be something in dangole's git that causes these symptoms, or that prevented the patches from applying cleanly. Have you run make V=s to make sure that the patches are correctly applied?
(In reply to sani from comment #18) > If these patches works why can't you just commit global ? As I wrote before they break USB rt2800usb driver.
Created attachment 277789 [details] rt2800_flush_tx_timeouts.patch Here is an additional patch to test; it should be applied together with the previous patches. It might help with link stability.
Thanks, I will test that too. I do not need USB at all. I even disable it in the kernel.
Hi all, it may be a stupid question, but I cannot figure out the kernel version or the OpenWRT version to test the patches. Can someone give me the version in order to test the patches on an Astoria? Cheers, Angelo
I am applying the 4th patch for the stability/link issue. It is failing here:
Hunk #1 FAILED at 738.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c.rej
Patch failed! Please fix ./patches/600-27-patch4.patch!
--- drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
+++ drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
@@ -738,8 +738,12 @@ void rt2x00mac_flush(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
 	if (!test_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags))
 		return;

+	set_bit(DEVICE_STATE_FLUSHING, &rt2x00dev->flags);
+
 	tx_queue_for_each(rt2x00dev, queue)
 		rt2x00queue_flush_queue(queue, drop);
+
+	clear_bit(DEVICE_STATE_FLUSHING, &rt2x00dev->flags);
 }
 EXPORT_SYMBOL_GPL(rt2x00mac_flush);
I am using the latest openwrt. I will try manually applying the patch and test again.
OK, I compiled and flashed with the 4 patches, and removed patch 600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch before starting. I just enabled wifi and left everything at the defaults: the openwrt ssid, no password and channel 11. I can not even connect to the router. The reason shown on the iphone7 is "unable to join". On MacOS, again no success, or it connects but the internet is very slow or missing. After many tries the phone connected. Sorry to say, but I think the driver needs much work :( T-Bone, how did you not have problems at all? What kind of routers/hardware did you test? I tested a tp-link 840 version 5 and an evolution board kind of router like this one: openwrt-ramips-mt7628-mt7628-squashfs-sysupgrade.bin If I can do/test something I will do it with pleasure. This router has been sitting here for over 2 years.... still nothing to work with like openwrt or lede.
(In reply to sani from comment #26) > T-Bone how you did not have problems at all ? What kind of routers/hardware > you tested ? I have tested a Netgear WN3000RPv3, as I said in my first comment. It was non-functional before the patches and perfectly working after. I'm currently away and can't test the 4th patch before next week. I have a Netgear EX3700 somewhere which also has a 7620 radio for 2.4 which I plan to test too. Will report. Note that the TL-WR840N v4/v5 have a different radio: a 7628 which uses the mt76 driver, not the rt2800 which this particular ticket is about. Thus the patches submitted by sgruszka will have no effect on that device. Among the device you listed in comment #7, I think only the Phicomm K2 PSG1218 and ZBT-WR8305RT are concerned by these patches: they appear to be 7620. HTH
MT7620 chips can come in different variants, which require different register programming of the device by the driver. Moreover, even with the same chip variant, a board can have different external parts connected to the chip (oscillators, different antenna types, amplifiers, etc.), which also requires different device programming by the driver. Those settings are usually encoded in the EEPROM, but sometimes the EEPROM in the device is not correctly burned and a correct EEPROM image needs to be provided by the OS to the driver. Another thing is temperature compensation code: it is needed on some devices and the rt2800 driver lacks it. In summary, there are various factors why the driver works on some routers and not on others. To make things work on sani's routers, someone will need to figure out what the vendor binary drivers do and implement that in the rt2800 driver. This is not a trivial task. If sources of the vendor driver are available for those routers, the changes can be read from there. Otherwise the binary driver needs to be reverse engineered.
(In reply to Angelo Corsaro from comment #24)
> the OpenWRT version to test the patches. Can someone give me the version in
> order to test the patches on an Astoria?
I'll ask dangole to include the patches in his staging tree, but first I want to do some rework on them. Please do not test the 4th patch for now either; it needs some rework as well.
Yes i am testing only mediatek 7628. Any chance we have similar fixes like for 7620 ?
(In reply to sani from comment #30)
> Yes i am testing only mediatek 7628. Any chance we have similar fixes like
> for 7620 ?
That was confusing. The mt76 driver needs different fixes. What about the two other routers that use MT7620?
I will try them next week. Thanks for your support.
I have an updated version of the patches here: https://github.com/sgruszka/wireless-drivers-next/commits/rt2800-draft-v2 The first 3 patches did not change. I tested them on rt2800usb and now somehow they work. It looks like in my previous tests I must have had some other patch applied that broke USB. I asked Daniel to put the patches in his openwrt staging tree.
FTR: the patches are available for testing in dangole's tree: https://git.openwrt.org/?p=openwrt/staging/dangole.git;a=summary
Hi, I have applied the 5 patches over a clean checkout of OpenWRT 18.06.0 (that's my reference point for "not working" state). Patch 4 (feb8797) doesn't apply cleanly: patching file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c Hunk #1 FAILED at 720. I've manually fixed it to apply. I built (no warning) and stress-tested the result on Netgear WN3000RPv3: as far as I can tell everything works but it appears the error messages are back: ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2 I've slightly modified my test procedure (I now generate about symmetrical rx/tx bandwidth) so I'll retest with just the first 3 to see if they're also prone to showing the message under this scenario. HTH
Replying to myself: I've been able to get the error message with just the first 3 patches as well. It takes time, but it eventually happens. The link nevertheless survives and remains operational.
I really wish I knew under what circumstances we get those errors. We basically stop the queue once the number of available entries becomes less than a particular threshold. We should not get any packets from the upper mac80211 layer at that point. Let's try to increase the threshold...
Created attachment 277927 [details] rt2x00_queue_threshold.patch Please test this patch. If the error messages still happen, try to increase the threshold even more: queue->threshold = DIV_ROUND_UP(queue->limit, 4);
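To make the threshold arithmetic easier to follow, here is a minimal Python sketch of the calculation. It is a simplified model and not the driver code: `div_round_up` mirrors the kernel's DIV_ROUND_UP macro, the "stop when free entries drop below the threshold" rule follows the description given in this thread, and the limit/divisor pairs are the values mentioned elsewhere in the discussion (64, 128 and 256 entries; divisors 8 and 4). The function names are hypothetical.

```python
def div_round_up(n: int, d: int) -> int:
    """Integer ceiling division, like the kernel's DIV_ROUND_UP(n, d)."""
    return (n + d - 1) // d


def queue_threshold(limit: int, divisor: int) -> int:
    """Model of: queue->threshold = DIV_ROUND_UP(queue->limit, divisor)."""
    return div_round_up(limit, divisor)


def should_stop_queue(limit: int, length: int, threshold: int) -> bool:
    """Simplified model (assumption): the queue is stopped once the number
    of free entries (limit - length) falls below the threshold."""
    available = limit - length
    return available < threshold


# Limit/divisor combinations discussed in this thread.
for limit, divisor in [(64, 8), (64, 4), (128, 4), (256, 4)]:
    thr = queue_threshold(limit, divisor)
    # Smallest fill level at which this model would stop the queue.
    stop_at = limit - thr + 1
    print(f"limit={limit:3d} divisor={divisor} threshold={thr:2d} "
          f"stops at {stop_at} queued entries")
```

A larger threshold stops the queue earlier, leaving more free entries for frames that are already on their way down from mac80211 when the stop takes effect.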
Reporting back: I tried your latest "rt2x00_queue_threshold.patch" on top of a clean checkout of OpenWRT 18.06.0 plus your other 5 patches. The new patch does not apply cleanly. In particular, the second hunk for rt2x00queue.c (line 719) doesn't apply at all: the test calling rt2x00queue_pause_queue() doesn't exist in the resulting OpenWRT output. I have modified the patch to remove this offending hunk and adjust another one that would otherwise fail, and I've tested the result. The error message returned after about 40-45 minutes of constant data throughput. Likewise when changing the queue->threshold value. Interestingly, the errors seem to show up in bursts of 5-6, but only once (at least during my ~2h test runs). HTH
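The "bursts of 5-6" observation can be checked mechanically on a saved dmesg file. The following is a rough Python sketch, not an established tool: the function name `find_bursts` and the 1-second gap used to separate bursts are assumptions chosen for illustration, and the sample lines are copied from the first report in this thread. It assumes the timestamps appear in increasing order.

```python
import re

# Matches the error line quoted throughout this thread and captures
# the dmesg timestamp (seconds since boot) and the queue number.
ERROR_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+ieee80211 phy\d+: rt2x00queue_write_tx_frame: "
    r"Error - Dropping frame due to full tx queue (\d+)"
)


def find_bursts(dmesg_text: str, max_gap: float = 1.0):
    """Group error timestamps into bursts.

    Two consecutive errors belong to the same burst when they are at most
    max_gap seconds apart (max_gap is an arbitrary assumption).
    Returns a list of (first_timestamp, number_of_errors) tuples.
    """
    stamps = [float(m.group(1)) for m in ERROR_RE.finditer(dmesg_text)]
    bursts = []
    for t in stamps:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return [(b[0], len(b)) for b in bursts]


SAMPLE = """\
[ 3702.380000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.390000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[ 3702.400000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.440000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
[97845.450000] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
"""

for start, count in find_bursts(SAMPLE):
    print(f"burst starting at {start:.2f}s: {count} error(s)")
```

Running it on a full dmesg dump (for example the attachments in this ticket) would show how many bursts occurred and how far apart they were.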
I'm going to prepare a debug patch that prints the queue state when we get this error. For now you can check whether increasing queue->limit to 128 (while keeping the 4 div when calculating the threshold) makes the errors go away:
--- a/drivers/net/wireless/ralink/rt2x00/rt2800mmio.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800mmio.c
@@ -673,7 +673,7 @@ void rt2800mmio_queue_init(struct data_queue *queue)
 	case QID_AC_VI:
 	case QID_AC_BE:
 	case QID_AC_BK:
-		queue->limit = 64;
+		queue->limit = 128;
 		queue->data_size = AGGREGATION_SIZE;
 		queue->desc_size = TXD_DESC_SIZE;
 		queue->winfo_size = txwi_size;
Here is the debug patch: https://github.com/sgruszka/wireless-drivers-next/commit/525c50486e17446793b21ac7a8498cb48b3bb210.patch Please test it without the threshold changes and provide the dmesg output as a txt file attachment. I would like to see if we correctly stop the queue in mac80211.
Created attachment 278047 [details] dmesg output with debug patch applied Debug output attached. Triggering the message was much quicker and much more frequent this time. As usual, clean checkout of OpenWRT 18.06.0 with all your patches applied. Threshold unmodified from your patches (DIV: 8).
Hi, Stanislaw, here are more people doing test over Dangole's repo: https://bugs.openwrt.org/index.php?do=details&task_id=896
Hi Cjcr, so it looks like you have the same case as T-Bone: the patches made your router workable, but the error message still happens, correct?
(In reply to Stanislaw Gruszka from comment #44)
> Hi Cjcr, so looks you have the same case as T-Bone , patches made your
> router workable , but error message still happen, correct ?
That's right. It works, but the errors appear with high data transfer. Anyway, I'm very happy that now it at least doesn't die when that happens. Thank you for that, and thanks also to dangole for his collaboration.
Good morning, I'm new here but I decided to register to share my experience with this bug. I have lots of alfa R36 routers with the ralink rt2x00 driver, and I've been testing this router for over 3 years, experiencing the same bug (not present in the original alfa fw or the dd-wrt fw). I've also been following this saga for a while, waiting for a working patch (testing every single patch since I found this thread). I built the 18.06 branch with the patches provided here. My lack of knowledge did not help me solve the failed hunks needed to load the other 2 patches, so I was only able to load these 3 patches into the final compile:
***txstatus-routines-to
***txstatus-routines-from-lib
***txstatus-tiemout-every-time
I can confirm that the router has been working fantastically with no more hangs, but as the other members (Cjcr & Tbone) confirmed, the error message still appears. If somebody can give me instructions on how to load/fix the failed hunks for the 2 remaining patches I would really appreciate it. Thank you Stanislaw and dangole for this amazing collaboration!!!
Hi. Thank you for trying to fix this bug. I tried to compile openwrt 18.06.1, which is the current stable release, and patch it with your patches. But there is an error when applying 704-rt2x00-use-different-txstatus-timeouts-when-flushing.patch:
patching file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
Hunk #1 FAILED at 720.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c.rej
Patch failed! Please fix ./patches/704-rt2x00-use-different-txstatus-timeouts-when-flushing.patch!
I want to know how to fix it because I am a tiro. Thanks.
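When a build log contains output from many patches, it can help to list only the rejected hunks before fixing them by hand. Below is a minimal Python sketch under stated assumptions: the helper name `failed_hunks` is hypothetical, the line layout of the sample is reconstructed from the output quoted above, and only the "patching file ..." and "Hunk #N FAILED at M." lines are interpreted.

```python
import re

FILE_RE = re.compile(r"patching file (\S+)")
HUNK_RE = re.compile(r"Hunk #(\d+) FAILED at (\d+)\.")


def failed_hunks(patch_output: str):
    """Return a list of (file, hunk_number, line_number) tuples for every
    'Hunk #N FAILED at M.' line, attributed to the most recent
    'patching file ...' line seen before it (None if there was none)."""
    results = []
    current_file = None
    for raw in patch_output.splitlines():
        line = raw.strip()
        m = FILE_RE.match(line)
        if m:
            current_file = m.group(1)
            continue
        m = HUNK_RE.match(line)
        if m:
            results.append((current_file, int(m.group(1)), int(m.group(2))))
    return results


# Output quoted in the comment above (line breaks reconstructed).
SAMPLE = """\
patching file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c
Hunk #1 FAILED at 720.
1 out of 1 hunk FAILED -- saving rejects to file drivers/net/wireless/ralink/rt2x00/rt2x00mac.c.rej
"""

for path, hunk, line in failed_hunks(SAMPLE):
    print(f"{path}: hunk #{hunk} rejected near line {line}")
```

Each reported location corresponds to a .rej file next to the source file, which is the starting point for fixing the hunk manually.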
(In reply to mmyz1234 from comment #47)
> I want to know how to fix it because I am a tiro.
You can read https://elinux.org/Handling_Patch_Rejects to learn. However, I think I will just add an openwrt repo to my github page and apply the patches there, since there is interest in having them applied on top of the 18.06.1 release. Not sure why you don't want to use dangole's repo, though.
I am testing dangole's repo on an ASUS RT-N56U A1 with Ralink RT3883 and so far it's been great.
Created attachment 278199 [details] dmesg output with debug patch - phicomm psg1218 rev.a
(In reply to Stanislaw Gruszka from comment #48) > (In reply to mmyz1234 from comment #47) > > I want to know how to fix it because I am a tiro. > > You can read https://elinux.org/Handling_Patch_Rejects to learn . > > However I think I will just add openwrt repo to my github page and apply > patches there since there is interest of having them applied on top of > 18.06.1 release. Not sure why don't want to use dangole repo, though. Thanks. Now I compile and flash it successfully. The dmesg show errors so fast. It looks like the wireless 2.4g wireless still works fine. Maybe I need to observe it for a while.
(In reply to Stanislaw Gruszka from comment #48)
> However I think I will just add openwrt repo to my github page and apply
> patches there since there is interest of having them applied on top of
> 18.06.1 release. Not sure why don't want to use dangole repo, though.
Besides tracking master (and not the 18.06 release), dangole's repo contains more than just your fixes. I personally think it's more efficient (and best practice) to only test code meant to fix a bug when debugging, vs trying a melting pot of other things whose (side) effects aren't known. If we can get a well defined patch for this bug there's a chance to have it included in the current stable release, which would benefit all the users tracking releases. If we can't demonstrate that whatever patch you offer is sufficient standalone to fix the issue, we can't get that fix backported. QED :) Besides, dangole's repo contains code that has very little (if any) chance of ever getting accepted in openwrt master, so no point in messing with that IMO. Meanwhile, you didn't react after I sent debug data: was it useful? My 2c.
Hello, can somebody provide instructions and/or tell me if it's possible to apply these patches to another openwrt version (specifically Chaos Calmer)? I really need that version in order to successfully run coova-chilli; it can't run on lede/18.06 because of memory restrictions. Thanks in advance...
(In reply to T-Bone from comment #52) OK, makes sense. The debug data was useful, thanks, but I am still thinking about what to do next...
I updated the openwrt rt2x00 branch (18.06 based) with the 5 previous patches and 2 that increase queue->limit to 256 and the threshold to div 4: https://github.com/sgruszka/openwrt/commits/rt2x00 Please test. It would be especially interesting to know if this works on older RT3***, RT5*** chips and if it does not make bufferbloat much worse. I.e. besides iperf testing, please also test with some tools which measure bufferbloat, see: https://flent.org/ https://www.bufferbloat.net/projects/bloat/wiki/Tests_for_Bufferbloat/ I am considering applying/posting the first 5 patches. Posting the latest 2 patches depends on the test results. I will maybe do the queue->limit increase only on MT7620, or increase it to 128 on all chips. Not sure yet. I will also remove the message (make it debug level), since I discovered that printing messages can by itself hang the cpu for a few seconds on my Nexx WT3020H router, which can make the wireless connection drop! Anyway, please test. Thanks in advance!
Hello @Stanislaw I will test it ASAP and I will report the results here. Thank you!
After 12 days of uptime, the build from Dangole's repo started to throw the same error on the ASUS RT-N56U A1, and WiFi also stopped working. Now I am testing a build from https://github.com/sgruszka/openwrt/commits/rt2x00
Can you provide testing results? The most interesting thing is whether increasing the queue size makes bufferbloat (a lot) worse. Note that if I have to do the testing by myself, I can not do other valuable work.
With a build from https://github.com/sgruszka/openwrt/commits/rt2x00, WiFi stopped working (without an error in the logs, which is expected, I guess) on the first day of testing, after around eight hours of uptime. However, it has not failed again since the router reboot.
> The most interesting thing is if increasing queue size do not make
> bufferbloat (a lot) worse.
What is the proper way of verifying this? Is running betterspeedtest.sh a few times with upstream 18.06.1 and with the rt2x00 tree sufficient? Do I need to disable the VPN connection on the gateway to get proper results, or does this not matter?
(In reply to Stanislaw Gruszka from comment #55) > I updated openwrt rt2x00 branch (18.06 based) with 5 previous patches and 2 > that increase queue->limit to 256 and threshold to div 4: > > https://github.com/sgruszka/openwrt/commits/rt2x00 > > Please test. Especially would be interesting if this work on older RT3***, > RT5*** chips and if it does not make bufferbloat much worse. I.e. except > with iperf testing please also test with some tools, which measure > bufferbloat, see: > > https://flent.org/ > https://www.bufferbloat.net/projects/bloat/wiki/Tests_for_Bufferbloat/ > > I considered to apply/post 5 first patches. Posting latest 2 patches depend > of test results. I will maybe do queue->limit increase only on MT7620 or do > increase to 128 on all chip. Not sure yet. > > I also will remove the message (make it at debug level) since I discovered > that printing messages by itself can hang cpu for few seconds on my Nexx > WT3020H router, what can make wireless connection drops! > > Anyway please test. Thanks in advance! I built the image using the rt2x00 branch and wrote the image to my phicomm psg1218 rev.a After that I ran speedtest.net for 2.4Ghz wireless network load test. It seems that as long as there is a load of 30Mbps, my phone will disconnect from phicomm psg1218 rev.a, but this will not cause rt2x00 to hang. It seems that I can continue to use 2.4Ghz network as long as I reconnect to the AP. The system and the kernel log do not seem to have anything special. Maybe I should continue testing?
(In reply to RussianNeuroMancer from comment #59)
> With build from https://github.com/sgruszka/openwrt/commits/rt2x00 WiFi stop
> working (without error in the logs, which is expected, I guess) in first day
> of testing, after around eight hours of uptime. However it's still not
> failed after router reboot.
I did not remove the message yet, in order to check whether increasing the queue size and threshold helps with packets being queued to a full queue. I plan to remove the message though. If wifi stops working, that is a different problem; it looks like a HW/FW hang. So I think we would need to implement a watchdog to reset the HW/FW. However, so far I don't know how to detect the problem or how to perform the reset. I would need to experiment with that (and I can not reproduce the issue, so that's a problem).
> > The most interesting thing is if increasing queue size do not make
> > bufferbloat (a lot) worse.
>
> What is proper way of verifying this? Running betterspeedtest.sh few times
> with upstream 18.06.1 and with rt2x00 tree is sufficient? Do I need to
> disable VPN connection on a gateway to get proper results or this is doesn't
> matter?
I don't know how to measure bufferbloat; I haven't run those tests. Please figure this out yourself based on the provided documentation. Please test with the top of my rt2x00 branch, and also with these patches removed:
661-0001-rt2800-change-queue-limit-to-256-for-pci-soc.patch
661-0002-rt2x00-increase-threshold.patch
(you can achieve that by doing "git reset --hard HEAD~1")
(In reply to Stanislaw Gruszka from comment #61) > (In reply to RussianNeuroMancer from comment #59) > > With build from https://github.com/sgruszka/openwrt/commits/rt2x00 WiFi > stop > > working (without error in the logs, which is expected, I guess) in first > day > > of testing, after around eight hours of uptime. However it's still not > > failed after router reboot. > > I did not remove message yet, to check if increasing queue size and > threshold help with queuing packet to full queue. I plan to remove message > though. > > If wifi stop working this is some different problem, look like HW/FW hung. > So I think we would need to implement watchdog to reset HW/FW. However so > far I don't know how to detect the problem and how to perform reset. I would > need to experiment with that (and I can not reproduce that issue, so that's > problem). > > > > The most interesting thing is if increasing queue size do not make > > > bufferbloat (a lot) worse. > > > > What is proper way of verifying this? Running betterspeedtest.sh few times > > with upstream 18.06.1 and with rt2x00 tree is sufficient? Do I need to > > disable VPN connection on a gateway to get proper results or this is > doesn't > > matter? > > I don't know how to measure buffer bloat, I haven't made those test. Please > figure this out by yourself based on provided documentation. Please test with > top of my rt2x00 branch and with: > > 661-0001-rt2800-change-queue-limit-to-256-for-pci-soc.patch > 661-0002-rt2x00-increase-threshold.patch > > patches removed (you can achieve that by doing "git reset --hard HEAD~1") I also encountered the same situation, rt2x00 stopped working after 22 hours of work. My phone can't connect to the AP, and dmesg doesn't provide any valuable logs. https://gist.github.com/daiaji/2d032bc9d849d276ceb3c26f79b2a704
(In reply to daiaji from comment #62) > My phone can't connect to the AP, and dmesg doesn't provide any valuable > logs. > https://gist.github.com/daiaji/2d032bc9d849d276ceb3c26f79b2a704 There are warnings: rt2800_config_channel: Warning - Using incomplete support for external PA I would check dangole tree if support for external PA was added there.
It is also unstable with the latest patches applied on my phicomm psg1218 rev.a. After a few hours, 2.4G wireless stops working without this error in the dmesg output. In order to connect to the 2.4G wireless again, I need to reboot it. As mentioned before, there are warnings: rt2800_config_channel: Warning - Using incomplete support for external PA. There is another issue, and I don't know if I should mention it here. If I set option noscan 1 to enforce the use of a 40MHz channel for 802.11n in the 2.4Ghz band, I can not connect to the 2.4G wireless. After I reboot my router, I can connect to it but quickly get disconnected. I can't connect to it again later. I don't know if this issue is related to this bug or if other routers have it. Thanks.
Created attachment 278729 [details] dmesg output with latest patches - phicomm psg1218 rev.a The dmesg output is here.
I haven't tested bufferbloat yet, because I wasn't able to get a long enough uptime (due to unrelated issues such as power outages) to verify whether the rt2x00 branch is stable enough for my particular device (ASUS RT-N56U A1). Unfortunately, I have hit the issue described in Comment 59 three times so far. Today it happened on the 16th day of uptime, after copying 3 GB via the 5 GHz AP. There were no error messages in dmesg, and disabling/enabling the network in the OpenWrt settings was sufficient. It seems a reboot is not necessary, or at least that's how it looks so far. Stanislaw, please let me know if measuring bufferbloat is still necessary, even if the patches from the rt2x00 branch don't achieve stable WiFi operation?
(In reply to RussianNeuroMancer from comment #66)
> Unfortunately, I got issue described in Comment 59 three times so far. Today
> it's happened on 16th day of uptime, after copying 3 GB via 5 GHz AP. There
> was no error messages in dmesg, and disabling/enabling network in OpenWRT
> setting was sufficient. Seems like reboot is not necessary, or it's how it
> looks like so far.
>
> Stanislaw, please let me know if measuring buffer bloat is still necessary,
> even if patches from rt2x00 branch doesn't achieve stable WiFi operation?

Yes, I'm still waiting for bufferbloat results to see if we can increase the tx queue size. Regarding your issue, as already pointed out in comment 61, it's a different problem. I guess we can make a deal: if you do the bufferbloat testing for me, I'll work on the watchdog :-)
I did the bufferbloat testing myself. It's very disappointing that nobody wanted to do this. Indeed, increasing the tx queue size does affect ping latency. I used this script:

https://github.com/richb-hanover/CeroWrtScripts/blob/master/betterspeedtest.sh

TX length 64 threshold 7

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 11:05:38 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
.............................................................
Download: 49.63 Mbps
Latency: (in msec, 57 pings, 0.00% packet loss)
Min: 4.850  10pct: 20.100  Median: 40.700  Avg: 39.010  90pct: 55.500  Max: 77.100
.............................................................
Upload: 58.13 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 0.986  10pct: 9.210  Median: 13.300  Avg: 15.919  90pct: 25.000  Max: 52.600

[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 11:08:57 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
..............................................................
Download: 12.79 Mbps
Latency: (in msec, 62 pings, 0.00% packet loss)
Min: 10.000  10pct: 10.100  Median: 12.600  Avg: 16.202  90pct: 25.300  Max: 42.100
.............................................................
Upload: 57.17 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 10.000  10pct: 17.600  Median: 22.100  Avg: 23.969  90pct: 32.900  Max: 50.700

TX length 256 threshold 64

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 10:26:17 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
............................................................
Download: 50.56 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 1.620  10pct: 44.800  Median: 75.400  Avg: 75.286  90pct: 104.000  Max: 134.000
.............................................................
Upload: 57.14 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 1.190  10pct: 8.540  Median: 15.000  Avg: 16.940  90pct: 26.600  Max: 49.100

[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 10:29:16 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
.............................................................
Download: 14.89 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 10.000  10pct: 10.200  Median: 14.400  Avg: 18.274  90pct: 30.600  Max: 59.900
..............................................................
Upload: 52.7 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 10.400  10pct: 13.400  Median: 21.000  Avg: 22.331  90pct: 30.900  Max: 57.000

TX length 512 threshold 126

[stasiu@localhost ~]$ ./betterspeedtest.sh -H 192.168.10.1 -p 192.168.10.1
2018-12-05 10:45:36 Testing against 192.168.10.1 (ipv4) with 5 simultaneous sessions while pinging 192.168.10.1 (60 seconds in each direction)
.............................................................
Download: 50.2 Mbps
Latency: (in msec, 58 pings, 0.00% packet loss)
Min: 6.400  10pct: 78.700  Median: 122.000  Avg: 133.705  90pct: 208.000  Max: 248.000
.............................................................
Upload: 55.44 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 2.020  10pct: 9.900  Median: 14.400  Avg: 17.089  90pct: 26.800  Max: 57.200

[stasiu@localhost ~]$ ./betterspeedtest.sh
2018-12-05 10:50:41 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging gstatic.com (60 seconds in each direction)
..............................................................
Download: 16.32 Mbps
Latency: (in msec, 61 pings, 0.00% packet loss)
Min: 15.400  10pct: 15.600  Median: 18.800  Avg: 23.238  90pct: 32.700  Max: 69.600
..............................................................
Upload: 50.52 Mbps
Latency: (in msec, 62 pings, 0.00% packet loss)
Min: 16.000  10pct: 24.000  Median: 35.000  Avg: 37.535  90pct: 51.000  Max: 78.200
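To make the trend in the local runs (against 192.168.10.1) easier to see, the download-direction numbers can be put side by side. The following is only a convenience sketch: the values are copied from the output above, while the names `local_download` and `relative_to_baseline` are made up for illustration and are not part of betterspeedtest.sh.

```python
# Download-direction results from the local runs (against 192.168.10.1),
# keyed by (TX queue length, threshold). Latencies are in msec.
local_download = {
    (64, 7): {"mbps": 49.63, "median": 40.7, "p90": 55.5, "max": 77.1},
    (256, 64): {"mbps": 50.56, "median": 75.4, "p90": 104.0, "max": 134.0},
    (512, 126): {"mbps": 50.2, "median": 122.0, "p90": 208.0, "max": 248.0},
}


def relative_to_baseline(results, field, baseline=(64, 7)):
    """Return each configuration's value for `field` divided by the baseline's."""
    base = results[baseline][field]
    return {key: round(stats[field] / base, 2) for key, stats in results.items()}


if __name__ == "__main__":
    for field in ("mbps", "median", "p90"):
        print(field, relative_to_baseline(local_download, field))
```

Read this way, download throughput stays within about 2% across the three settings, while the median ping latency under load roughly doubles at length 256 and triples at length 512.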
T-Bone@parisc-linux.org and Cjcr, could you check whether removing the "Dropping frame ..." prints with the patch below is sufficient to fix the wifi hangs?

https://lore.kernel.org/linux-wireless/1545318971-28351-3-git-send-email-sgruszka@redhat.com/raw

There is a claim that the 5 patches that fixed the issues for you cause a throughput regression, and that removing the "Dropping frame ..." prints alone is sufficient to fix the wifi hang problem. The full discussion is here:

https://lore.kernel.org/linux-wireless/20181221125146.GB30351@redhat.com/T/#m460700c2ae3cd1bdeb1d6001c0fbf2945f412bb0
Created attachment 280251 [details] Patch to replace printk with netlink accounting So this is the patch we ended up going with. I'll get it properly documented and submit it to LKML. This occurs when there are many processes all doing sendmsg/sendto and they get preempted in that racy area of ieee80211_tx_frags, after it checks to see if the queue is stopped, but releases its spinlock prior to calling the driver's .tx function, thus allowing userspace threads to be preempted. This happens when running chilli-coova, which can have 100+ child processes via fork. To fully resolve the overall performance problems, modifications had to be made to chilli-coova as well (essentially replacing some non-blocking calls with blocking calls). Perhaps the main reason the printk is so deadly on some systems is that they are configured to emit the kernel log over a 56k serial line. :)
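The check-then-act window described above can be illustrated with a small deterministic toy model. This is not mac80211 or rt2x00 code: `ToyTxQueue` and `simulate` are hypothetical names, and the model simply assumes that the queue counts as "stopped" once fewer than `threshold` entries are free, and that every sender checks that flag before any of them is resumed to hand over its frame.

```python
class ToyTxQueue:
    """Toy TX queue: fixed limit, 'stopped' when free entries < threshold."""

    def __init__(self, limit, threshold, prefilled=0):
        self.limit = limit
        self.threshold = threshold
        self.used = prefilled
        self.dropped = 0

    @property
    def stopped(self):
        # Assumed stop rule for this model only.
        return (self.limit - self.used) < self.threshold

    def tx(self):
        """Driver-side enqueue: drop the frame if the queue is already full."""
        if self.used >= self.limit:
            self.dropped += 1
            return False
        self.used += 1
        return True


def simulate(senders, limit, threshold, prefilled):
    queue = ToyTxQueue(limit, threshold, prefilled)
    # Phase 1: every sender checks the flag, then is preempted before tx().
    passed_check = sum(1 for _ in range(senders) if not queue.stopped)
    # Phase 2: the senders resume one by one and call tx() without re-checking.
    for _ in range(passed_check):
        queue.tx()
    return queue


if __name__ == "__main__":
    print(simulate(10, limit=64, threshold=7, prefilled=57).dropped)
    print(simulate(10, limit=256, threshold=64, prefilled=192).dropped)
    print(simulate(100, limit=256, threshold=64, prefilled=192).dropped)
```

In this model a larger threshold only widens the margin: ten racing senders fit into the 64 free entries of the 256/64 configuration, but a hundred of them (comparable to the 100+ child processes mentioned above) still overflow it.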
Oh, I forgot to add that while I was not able to personally reproduce the problem, I was able to get some diagnostics from somebody who could and we definitely are *really* waiting for the hardware to tx (I had suspected something else in software might be causing tx delays, but there is no evidence of this). The tests were carried out in an RF environment that's almost certainly congested in the ISM band (downtown in a major city). The same error could not be reproduced on an almost identical device in the same environment, but with 9dBi antennae vs the 2.5dBi antennae of the AP where the problem occurs.
(In reply to Daniel Santos from comment #70)
> Created attachment 280251 [details]
> Patch to replace printk with netlink accounting

I already posted a patch which changes the print from err to dbg. Regarding adding drop statistics, this is a good idea, but I'm not sure the solution from the patch is the right one.
(In reply to Stanislaw Gruszka from comment #72)
> (In reply to Daniel Santos from comment #70)
> > Created attachment 280251 [details]
> > Patch to replace printk with netlink accounting
>
> I already posted patch witch changes print from err to dbg . Regarding
> adding drop statistics, this is good idea but I'm not sure if solution from
> the patch is the right one.

Oh, sorry I didn't see that patch. Probably the *best* thing to do is put a rate limiter on it and emit a max of once per x seconds telling how many messages were squelched?

I promise that updating the statistics is the right thing to do, but I don't promise that I've done it the right way! :) The screwy thing here is that, afaict, the mac80211 subsystem always increments the tx frame and byte count.

Personally, I think that the very best solution is to change the driver .tx function to return an int error code and let mac80211 manage the stats as well as adding the possibility for it to requeue the frame or fragment in its own queue so it doesn't necessarily need to be lost. Please pardon my ignorance here, as I don't know how and where the decision is made when we must drop frames to avoid buffer bloat, cache thrash, etc. Of course, this would be a substantial effort and would break out-of-tree drivers.
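The "at most once per x seconds, plus a count of squelched messages" idea can be sketched in a few lines. This is a user-space illustration only, not the driver change: `RateLimitedLog` is a hypothetical name, and the timestamp is passed in explicitly so the behaviour is deterministic. In the kernel, existing helpers such as printk_ratelimited() serve a similar purpose.

```python
class RateLimitedLog:
    """Emit a message at most once per `interval` seconds and report how
    many messages were suppressed since the previous emission."""

    def __init__(self, interval, emit=print):
        self.interval = interval
        self.emit = emit
        self.last_emit = None
        self.suppressed = 0

    def log(self, message, now):
        # Inside the quiet window: only count the message.
        if self.last_emit is not None and now - self.last_emit < self.interval:
            self.suppressed += 1
            return False
        if self.suppressed:
            message = f"{message} ({self.suppressed} similar messages suppressed)"
        self.emit(message)
        self.last_emit = now
        self.suppressed = 0
        return True


if __name__ == "__main__":
    log = RateLimitedLog(interval=5.0)
    msg = "Dropping frame due to full tx queue 2"
    # Timestamps taken from the log excerpt at the top of this report.
    for ts in (3702.38, 3702.39, 3702.40, 3702.41, 3702.42, 97845.44):
        log.log(msg, now=ts)
```

With the six timestamps used here, only two lines are emitted: the first message, and later a second one noting that four similar messages were suppressed.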
(In reply to Daniel Santos from comment #73)
> Oh, sorry I didn't see that patch. Probably the *best* thing to do is put a
> rate limiter on it and emit a max of once per x seconds telling how many
> messages were squelched?

I have done that for some other rt2x00 messages for USB drivers, since the logs were flooded when some USB host driver misbehaved. For debug messages I don't think this is necessary.

> Personally, I think that the very best solution is to change the driver .tx
> function to return an int error code and let mac80211 manage the stats as
> well as adding the possibility for it to requeue the frame or fragment in
> its own queue so it doesn't necessarily need to be lost. Please pardon my
> ignorance here, as I don't know how and where the decision is made when we
> must drop frames to avoid buffer bloat, cache thrash, etc. Of course, this
> would be a substantial effort and would break out-of-tree drivers.

Historically we drop frames silently, and this also happens in the mac80211 tx path if some error occurs. We do not have a tx_dropped statistic anywhere in mac80211, only rx_dropped. Some wireless drivers keep their own tx_dropped, and some fullmac drivers use netdev->stats.tx_dropped. I think the best option for rt2x00 would be to add a per-queue tx_dropped field and export it through debugfs via rt2x00debug_read_queue_stats.

BTW: have you tested some other patches, i.e. the 5 patches from
https://github.com/sgruszka/openwrt/commit/61809eedbfab55cae8a5feb48f761f8b6dd8b308
or increasing the queue length? I would like to know if the claim from
https://lore.kernel.org/linux-wireless/CAKR_QVK_2j6a9YiwUEKuWF+ss0-pr808Sr=AUrX4a6L3Zw=F0w@mail.gmail.com/
is true: that removing the "Dropping frame due..." print fixes the same problem as the 5 patches do. I would also like to know whether increasing the queue length does any good, apart from making ping times worse when the queue is full.
@Stanislaw Hi! I'm using your latest version:

Model: Nexx WT3020 (8M)
Architecture: MediaTek MT7620N ver:2 eco:6
Firmware version: OpenWrt 18.06-SNAPSHOT r7293-1e98677 / LuCI openwrt-18.06 branch (git-18.247.71242-9541751)
Kernel version: 4.14.67

It seems to be working fine, but I cannot get it working with a 40 MHz channel width, even using the "noscan = 1" trick. Here's a copy and paste of the wireless config:

config wifi-device 'radio0'
        option type 'mac80211'
        option hwmode '11g'
        option path 'platform/10180000.wmac'
        option txpower '20'
        option noscan '1'
        option legacy_rates '0'
        option htmode 'HT40'
        option country 'JP'
        option channel '3'

config wifi-iface 'default_radio0'
        option device 'radio0'
        option mode 'ap'
        option network 'lan'
        option ssid '*****'
        option disassoc_low_ack '0'
        option encryption 'psk2'
        option key '*****'

config wifi-iface
        option device 'radio0'
        option mode 'ap'
        option ssid '****'
        option network 'lan'
        option disassoc_low_ack '0'
        option encryption 'psk2'
        option key '*****'
        option wmm '1'

Any idea?
Cjcr, this is not related to this bug; it would be better to open a separate report, e.g. an issue on GitHub against my openwrt repo. Anyway, with this config

config wifi-device 'radio1'
        option type 'mac80211'
        option channel '11'
        option hwmode '11g'
        option htmode 'HT40'
        option noscan '1'
        option path 'platform/10180000.wmac'
        option disabled '0'
        option country 'CZ'

HT40 works for me.
(In reply to Stanislaw Gruszka from comment #76)
> Cjcr , this is not related to this bug, better would be open different
> report i.e. open issue on github against my openwrt repo. Anyway with this
> config
>
> config wifi-device 'radio1'
>         option type 'mac80211'
>         option channel '11'
>         option hwmode '11g'
>         option htmode 'HT40'
>         option noscan '1'
>         option path 'platform/10180000.wmac'
>         option disabled '0'
>         option country 'CZ'
>
> HT40 works for me.

Oh sorry, it seems it was my phone that is not able to connect with a 40 MHz channel width, maybe because of the ROM I'm using. With other devices it seems to work fine. Sorry!
The code for the problems reported in this bug is already upstream and in the official OpenWrt repo. For the remaining issue, which is random hangs, I provided a watchdog here (after some more testing it will be upstreamed as well):

https://github.com/sgruszka/openwrt/commit/4667e54f528544b150c9841e167c883ff0b79794
The issue still persists. Easily reproducible by simply installing **miniupnpd**, you will instantly see the log growing with frame drop messages, and by the end of the day the network will halt. Reproduced on Xiaomi MiWiFi Mini (MediaTek MT7620A ver:2 eco:6) running OpenWrt 18.06.4 r7808-ef686b7292 (Kernel 4.14.131)
(In reply to Nikita Kniazev from comment #79)
> Reproduced on Xiaomi MiWiFi Mini (MediaTek MT7620A ver:2 eco:6) running
> OpenWrt 18.06.4 r7808-ef686b7292 (Kernel 4.14.131)

It's not in 18.06.4. Should this be backported though? I would suggest you grab the commits Stanislaw linked above and cherry-pick them into your git tree.
... or update to openwrt 19.07
Not all devices can update to 19.07, this really should be backported to the 18 series if possible.
(In reply to Stanislaw Gruszka from comment #74)
> BTW: have you tested some other patches i.e. 5 patches from
> https://github.com/sgruszka/openwrt/commit/61809eedbfab55cae8a5feb48f761f8b6dd8b308
> or increasing queue length ? I would like to know if claim from
> https://lore.kernel.org/linux-wireless/CAKR_QVK_2j6a9YiwUEKuWF+ss0-pr808Sr=AUrX4a6L3Zw=F0w@mail.gmail.com/
> is true: removing "Dropping frame due..." print fixes the same problem as 5
> patches do.

FWIW, it doesn't. I just tested on an EX3700 (fresh 18.06.4 tree with '600-23-rt2x00-rt2800mmio-add-a-workaround-for-spurious-TX_F.patch' deleted and '666-0003-rt2x00-do-not-print-error-when-queue-is-full.patch' added) with only 2.4 radio active: it hangs almost instantly on traffic. Nothing in dmesg, as expected with the only extra patch applied. After disabling 2.4 radio, dmesg showed:

[ 259.989136] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 0 failed to flush
[ 260.076080] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 2 failed to flush
[ 260.402375] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 0 failed to flush
[ 260.489194] ieee80211 phy1: rt2x00queue_flush_queue: Warning - Queue 2 failed to flush

HTH

T-Bone
I have built 19.07 and run it for a few days, and I did not get those warnings or see the network halt, so the problem seems to be fixed.

However, I still experience low-ACK disassociations and an average throughput of 30-50 Mbit/s (measured with iperf3) with a client that is 1 meter away from the AP (5 GHz performance is also not perfect, about 40-60 Mbit/s). The same client had been reaching 95 Mbit/s on a 100 Mbit/s Internet connection when the AP was running Padavan firmware.

Also, on 19.07 I see 'deauthenticated due to inactivity' messages that I had not seen on 18.06 in such quantity (or at all?):

Sat Aug 24 09:38:07 2019 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 08:d4:2b:xx:xx:xx
Sat Aug 24 09:38:07 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: disassociated
Sat Aug 24 09:38:08 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sat Aug 24 09:38:11 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: authenticated
Sat Aug 24 09:38:11 2019 daemon.info hostapd: wlan1: STA 08:d4:2b:xx:xx:xx IEEE 802.11: associated (aid 1)
Sat Aug 24 09:38:11 2019 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 08:d4:2b:xx:xx:xx