Bug 70761
Summary: | AR8161 with alx driver: Randomly stops to receive packets with small MTU | ||
---|---|---|---|
Product: | Drivers | Reporter: | XmainframeX |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | NEW --- | ||
Severity: | normal | CC: | 34mailme, alan, bernardo.reino, bernhard, bugs, crow.jamesm+kernelbugzilla, Dagobertstaler, damaxx08, danyer, doaxan77, email, eugene.shatokhin, feng.tang, g2485269, gleb.simanov, hundycougar, jarod, kernelbz.bobd, ldap.tester, marekrusinowski, marianpoltak, mosonkonrad, mp-001, nstephenh, olelukoie, ongun.kanat+kernelbugzilla, patkoscsaba, peters.nico, prudy1, rainer.klier, sd, szg00000, thiago.mast3r, tiwai, tobias.regnery, zoot1612 |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | >3.6 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
direct fix for this bug
extra fix for consuming all memory
fix for recovery from rx underflow
attachment-18127-0.html
Buffer size sanitation, padding and consistency
attachment-11151-0.html
attachment-11265-0.html
new_skb_allocator
work_around_dma_issue.patch
new_dma_patch
new dma patch
dma patch for 4.1/4.4 stable kernel |
Description
XmainframeX
2014-02-18 14:42:49 UTC
The same problem on kernel 3.13.4-1-ARCH on a Lenovo Y580 with the same AR8161. Setting the MTU to 8192 really helped! Also, statistics for the alx driver are missing (/proc/net/dev), so it's difficult to debug what's going on.

I can confirm that the bug still exists in newer kernels (even with the linux 4-rc6 release), using a Lenovo IdeaPad Y580. When the connection is lost, ifconfig reports RX overruns that keep increasing (it looks like every single packet received causes an overrun). This lasts until the cable is unplugged and plugged in again, or until the alx module is reloaded. The workaround of setting the MTU to 8192 works.

Same problem with 3.18.6. When using IP forwarding it stops randomly.

I can confirm this with my IdeaPad Y580 as well. Workaround with MTU 8192 is working.

I can also confirm this with my IdeaPad Y580. Workaround with MTU 8192 works.

Me too. I confirm this issue and workaround with Lenovo Ideapad N581 (AR8161 rev 08).

I can also confirm this issue and that the workaround fixes it. Kernel 3.19

*** Bug 96911 has been marked as a duplicate of this bug. ***

I can confirm this issue. I see no received packets with wireshark. I have AR816x/AR817x (Lenovo Ideapad P580, I can't find specifics for the h/w). HOWEVER, I find that kernel 3.18.7 *works fine* without changing the MTU, while with kernel 3.19.5 the interface goes down shortly after being configured, unless I bump up the MTU to 8192. I have reproduced this several times. In my case, I am plugging in the network cable, disabling WiFi, and restarting the network. With 3.19.5 things work initially, then shortly thereafter the interface stops receiving packets. Changing the MTU to 8192 restores operation. With 3.18.7 I do the same procedure, but encounter no issues.

I too can confirm this. The kernel 3.18 LTS series did not have this issue, but when I updated to Ubuntu 15.04 it appeared again. Even on the new 4.1 rc7 the issue persists. Workaround with MTU 8192 is working.
I have a Lenovo W520 with a similar problem - pulling the cable out and putting it back in was the only fix, and it worked for only 30 seconds... lspci | grep net 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04) Hi, Atheros AR8161 : Same problem for me. Kernel 4.1.1. But, on the 4.0.6, all worked ... Network: Card-1: Intel Centrino Wireless-N 2230 driver: iwlwifi IF: wlp3s0 state: up mac: 68:5d:43:2a:f3:af Card-2: Qualcomm Atheros AR8161 Gigabit Ethernet driver: alx IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: same problem for me. i am on openSUSE 13.2 x86_64 on an Asus G750J. lspci: 04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10) dmesg: alx 0000:04:00.0 eth0: Qualcomm Atheros AR816x/AR817x Ethernet [ac:22:0b:b8:00:59] Workaround with MTU 8192 is working only on kernels below 4.1.x. i tried 4.1.0 and here it stops working even when using MTU 8192 workaround. currently i am back on 4.0.5. (In reply to Rainer Klier from comment #13) > Workaround with MTU 8192 is working only on kernels below 4.1.x. > i tried 4.1.0 and here it stops working even when using MTU 8192 workaround. > currently i am back on 4.0.5. i was wrong here. in fact, i failed setting the MTU to 8192. i thought i did it, but it didn't work. so, below kernel 4.1.x it worked most of the time for me, even without setting the MTU to 8192. but the error occured randomly. and with kernel 4.1.x the error occured instantly. currently i finally managed to set MTU to 8192 and trying again kernel 4.1.1.... 15 minutes after booting alx is still working.... ;-) (In reply to Rainer Klier from comment #14) > (In reply to Rainer Klier from comment #13) > > currently i finally managed to set MTU to 8192 and trying again kernel > 4.1.1.... > 15 minutes after booting alx is still working.... ;-) 40 minutes later it happened again, even with the MTU workaround. :-( back to Kernel 4.0.5. 
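Since at least one report above turned out to be a failed MTU change rather than a failed workaround, it may help to read the value back after setting it. The following is a minimal sketch in Python, assuming the iproute2 `ip` tool is installed, that the script runs as root, and that the interface name (`enp4s0`, taken from one of the reports) matches your system. The helper names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import subprocess
from pathlib import Path
from typing import List


def mtu_command(iface: str, mtu: int) -> List[str]:
    """Build the iproute2 command that changes the MTU of one interface."""
    return ["ip", "link", "set", "dev", iface, "mtu", str(mtu)]


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the MTU the kernel currently reports for the interface."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def set_and_verify_mtu(iface: str, mtu: int) -> bool:
    """Apply the workaround MTU and confirm that it actually took effect."""
    # Requires root; raises CalledProcessError if `ip` rejects the value.
    subprocess.run(mtu_command(iface, mtu), check=True)
    return read_mtu(iface) == mtu
```

Calling `set_and_verify_mtu("enp4s0", 8192)` as root returns `True` only if the kernel reports the new MTU afterwards. The setting does not survive a reboot or a driver reload, so it has to be reapplied (or configured in the distribution's network settings).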
Kernel 4.1.2 : same problem and same workaround : MTU = 8000 is ok Kernel 4.1.2 (on Arch Linux) and Lenovo Y580: can confirm the same problem and the MTU=8192 fix. I've the same problem with the Asus N76VM (AR8161) on Fedora 22 with kernel >= 4.1. Setting MTU to 8192 works, also kernel < 4.1. with kernel 4.1.3 and MTU 9000 it works now as stable as with kernel 4.0.5, which means that the bug/crash happens not that often, but happens from time to time. Same issue with Dell One 27 running Fedora 22 Kernel 4.1.5-200.fc22.x86_64 with Qualcomm Atheros AR816x/AR817x Ethernet driver alx. Set MTU 9000 has worked, not been running very long so don't know how stable it is. I confirm to have this same issue with newest distros. Ubuntu 15.04 and also Antergos. But Debian 8 Jessie works without this problem. At the moment: 3.19.0-28-generic #30-Ubuntu SMP Mon Aug 31 15:52:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux Kernel 4.1.7 : same problem, no correction :'( 4.1.8 and 4.1.9 : same problem A number of bug confirmation reports but is there anyone actually looking into this? It is a second time I see problem around here (https://bugzilla.kernel.org/show_bug.cgi?id=51671). And this time it is is also not occasional problem, it simply happen from first run. It that code tested at all? Why this buggy thing reached kernel at all? BTW. The MTU change does not help if few GB file to be transferred. (In reply to pru from comment #24) > A number of bug confirmation reports but is there anyone actually looking > into this? > > It is a second time I see problem around here > (https://bugzilla.kernel.org/show_bug.cgi?id=51671). And this time it is is > also not occasional problem, it simply happen from first run. It that code > tested at all? Why this buggy thing reached kernel at all? > > BTW. The MTU change does not help if few GB file to be transferred. I think no and that's the damn problem! Since no one is looking into this let me share my findings. 
The main problem is that RX signalling stops after a while. This covers both the interrupt flag and the update bit in the word3 register. Filling the skb buffers with a pattern and checking their content shows no signs of overflow, so the problem might be in how the chip handles them. Other drivers use extra buffer space like this for specific chips. Trying the same with the ar8162 shows that an extra 16 bytes are needed here. As there is no documentation, the only proof is testing a number of transfers with different MTUs. Thus the direct fix for this bug is to add 16 bytes of padding, which is in patch 0001.

Well, the more I tested, the more problems I found. There are two other things that I would link to this bug too.

First, under some conditions the rx buffer refill loops until memory runs out. This is because it refills up to the read index, which keeps running away in the background. This is fixed in patch 0002.

Second, in case of rx underflow there is no recovery from this state, as nothing will allocate new rx buffers. This is fixed in patch 0003.

Created attachment 191091 [details]
direct fix for this bug
Created attachment 191101 [details]
extra fix for consuming all memory
Created attachment 191111 [details]
fix for recovery from rx underflow
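The refill problem described above (a loop that chases a read index which keeps moving) can be illustrated with a toy model. This is a sketch in Python, not driver code: the ring, the `RunawayReader` class and both refill functions are hypothetical, and the contents of patch 0002 are not shown in this thread. The model only shows why refilling up to a single snapshot of the read index bounds the number of allocations, while re-reading a moving index on every iteration does not.

```python
from typing import Callable


class RunawayReader:
    """Hypothetical consumer whose read index advances every time it is read."""

    def __init__(self, start: int, ring_size: int) -> None:
        self.pos = start
        self.ring_size = ring_size

    def __call__(self) -> int:
        current = self.pos
        self.pos = (self.pos + 1) % self.ring_size
        return current


def refill_rereading(ring_size: int, write: int,
                     read_index: Callable[[], int], safety_cap: int) -> int:
    """Refill until the live read index is reached; return allocation count.

    safety_cap exists only so this toy model terminates. The failure mode
    described in the thread has no such cap.
    """
    allocated = 0
    nxt = (write + 1) % ring_size
    while nxt != read_index() and allocated < safety_cap:
        allocated += 1  # stands in for one rx buffer allocation
        write = nxt
        nxt = (write + 1) % ring_size
    return allocated


def refill_snapshot(ring_size: int, write: int,
                    read_index: Callable[[], int]) -> int:
    """Refill up to one snapshot of the read index; at most ring_size - 1."""
    stop = read_index()  # read once, then ignore further movement
    allocated = 0
    nxt = (write + 1) % ring_size
    while nxt != stop:
        allocated += 1
        write = nxt
        nxt = (write + 1) % ring_size
    return allocated
```

With an 8-entry ring, a write index of 0 and a reader starting at 3, `refill_snapshot` allocates 2 buffers and stops, whereas `refill_rereading` keeps allocating until it hits whatever `safety_cap` is passed in.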
(In reply to pru from comment #26) > Since no one is looking into this let me share my findings. > nothing will allocate new rx buffers. This is fixed in patch 0003. thanks for bringing light into this. great! but how and when will these fixes be part of the kernel? I'm not a kernel developer, I did some approved things and be glad to do another in future, but the patches must be reviewed, modified or not, approved etc. Hopefully this is an existing bug and it is assigned already, thus let assigned staff continue. (In reply to pru from comment #31) > I'm not a kernel developer, I did some approved things and be glad to do > another in future, but the patches must be reviewed, modified or not, > approved etc. where did you get the current alx source from? is it https://github.com/erikarn/alx ? i want to try out your patches. to which source did you apply your patches? @Rainer Klier, The link you posted is for the original Qualcomm unified driver. Kernel 3.8+ includes an in-tree driver (by Johannes Berg) based on that one but stripped off a lot of things. I suspect either something is broken in that driver, and/or the card (or some revisions thereof) have a hardware bug, which the original driver might be able to circumvent (something to do with tcp segmentation offload). I imagine @prui's patches apply to the in-tree driver. (In reply to Bernardo Reino from comment #33) > @Rainer Klier, > > The link you posted is for the original Qualcomm unified driver. Kernel 3.8+ > includes an in-tree driver (by Johannes Berg) based on that one but stripped > off a lot of things. ah, ok. > I imagine @prui's patches apply to the in-tree driver. i was just asking IF the source from https://github.com/erikarn/alx is the correct one to use the patch. at that time i didn't know any other source for this driver. 
now i assume the in-tree driver source is this: https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/atheros/alx anyhow, i tried to compile the driver from https://github.com/erikarn/alx but failed. :-( i think i have to wait for the fix to be included in one of the next kernel releases.... The patches are against https://github.com/torvalds/linux.git, sorry I could mention that before. I know the backport version has more code, but since the first entry in this bug says 'alx driver backport worked well up to' I assumed it also suffers from the same problem. Note to the patch 0003 - it might be good to schedule rx refill on a timer instead immediate queue on underflow. I spent only more than one day on this so treat the patches as the proof of concept, even they do the job. for all openSUSE users: i have created also a ticket in opensuse bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=952621 and Takashi Iwai kindly made a new kernel module with the above pathces to be tested. this new kernel module is located in this repositories: http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 (In reply to pru from comment #31) > I'm not a kernel developer, I did some approved things and be glad to do > another in future, but the patches must be reviewed, modified or not, > approved etc. > Hopefully this is an existing bug and it is assigned already, thus let > assigned staff continue. I guess no one else working on this, so your fix would be the best to be applied to upstream. Could you submit your fix patches (at least the first two) to upstream ML after brushing them up a bit? Make each subject line concise, and put more information in the changelog texts, describe for which bug it is and what each patch actually does. Better to put this bugzilla as the information point, and take a tested-by tag, for example. 
(In reply to pru from comment #35) > The patches are against https://github.com/torvalds/linux.git, sorry I could > mention that before. > I know the backport version has more code, but since the first entry in this > bug says 'alx driver backport worked well up to' I assumed it also suffers > from the same problem. Please post your patches to the netdev mailing list (netdev@vger.kernel.org). That's where all development for networking and network drivers happen, including code review Patches cannot be included in the kernel unless they are submitted to the appropriate mailing list. And most kernel developers (at least in networking) don't read bugzilla, sorry. I suspect hardware bugs on some models, as I have never encountered this bug, or any other problem, with my alx card, but I will test your patches. Documentation for submitting patches [0][1]. [0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches [1] http://kernelnewbies.org/FirstKernelPatch (In reply to Rainer Klier from comment #37) > and Takashi Iwai kindly made a new kernel module with the above pathces to > be tested. > > this new kernel module is located in this repositories: > http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 sadly, i had the crash again today even with the patched/new kernel module: [ 287.796808] alx 0000:04:00.0 eth0: fatal interrupt 0x4019607, resetting so it seems, that it only takes longer to appear, but the changes do not solve it. Argh, that's unfortunate. I'm doing a test build with these patches to take for a spin on my own alx-driven NIC (E2200, doesn't have this bug though), and trying to clean them up slightly to get them sent to netdev for proper review. Rainer – a fundamental question, without the patches and without the MTU set, did you observe the lost connection separately from the crash or it was the crash always? 
Because if you hit the lost connection without the crash then we probably have two separate problems here. Note the patches focus on the lost connection only as I never had a crash (well, did not test this log enough). Takashi/Sabrina – if there are two separate problems, so lost connection vs. crash, I can push the patches to the list, sadly not before the next week. Jarod – the driver needs some work, as I mentioned before I found two other problems during testing and I stopped looking further. Note my note to the 3rd patch, doing this differently it is a matter of preferences, but this patch is needed to recover from a dead end. (In reply to pru from comment #42) > Rainer – a fundamental question, without the patches and without the MTU > set, did you observe the lost connection separately from the crash or it was > the crash always? Because if you hit the lost connection without the crash for me it is always the same. the alx driver stops working. the connection is lost. the alx driver is not useable any more. and the dmesg output is flooded with the "eth0: fatal interrupt 0x4019607, resetting" messages. it once happened that my disk ran out of space because /var/log/messages was several gigabytes large.... in this situation i am not able to use the network card any more until reboot. i can only use wlan at this situation. besides this the computer is normally useable. so it does not crash completely. i only have to reboot as fast as possible because /var/log/messages is growing. but i think i remember that without the patches a reboot didn't always work cleanly. sometimes i had a bad kernel crash while trying to reboot. and in the debug output of this crash the alx module was mentioned as reason somehow. Created attachment 195371 [details] attachment-18127-0.html I have no problem using my network card on kernel 4.0. So maybe someone should take a look what happened since then that broke the driver. 
On Wed, Nov 25, 2015 at 9:47 AM <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=70761 > > --- Comment #43 from Rainer Klier <rainer.klier@gmx.at> --- > (In reply to pru from comment #42) > > Rainer – a fundamental question, without the patches and without the MTU > > set, did you observe the lost connection separately from the crash or it > was > > the crash always? Because if you hit the lost connection without the > crash > > for me it is always the same. > the alx driver stops working. > the connection is lost. > the alx driver is not useable any more. > and the dmesg output is flooded with the "eth0: fatal interrupt 0x4019607, > resetting" messages. it once happened that my disk ran out of space because > /var/log/messages was several gigabytes large.... > > in this situation i am not able to use the network card any more until > reboot. > i can only use wlan at this situation. > besides this the computer is normally useable. > so it does not crash completely. > i only have to reboot as fast as possible because /var/log/messages is > growing. > > but i think i remember that without the patches a reboot didn't always work > cleanly. sometimes i had a bad kernel crash while trying to reboot. > and in the debug output of this crash the alx module was mentioned as > reason > somehow. > > -- > You are receiving this mail because: > You are on the CC list for the bug. The symptoms of this bug seem to vary from report to report. Maybe it depends on exactly which model chip you have. I have two machines, Dell All-In-One 27 with a Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08). The alx driver works fine with kernel 4.0.x but not with 4.1.x and later. My symptoms of the problem are that the interface works fine for about a minute after boot and then stops communicating. Packets are transmitted out of the machine, but no packets are received. The machine works fine otherwise. Only the ethernet communication fails. 
After about 25 minutes, I get a message like: kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x8400, resetting. The network works again for about a minute and then stops working. I tested the above posted patches with kernel 4.2.5, and saw essentially the same symptoms - only minor differences in timing. Maybe I did the patching wrong but I don't think so. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1251434 Created attachment 196341 [details]
Buffer size sanitation, padding and consistency
If some folks with affected hardware could give this patch a spin, it would be much appreciated. It's based loosely on Przemek's first patch, I've not yet dug into the other issues. I've tested this lightly on my own E2200-equipped laptop, which had no problems to begin with, and continues to function just fine with this patch.
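To make the padding idea from the first patch concrete, here is a small worked calculation in Python. It is a sketch under stated assumptions, not the driver's actual formula: the header, FCS and VLAN tag sizes are the standard Ethernet values, the 16-byte pad is the figure reported earlier in this thread, and the 8-byte alignment and the function names are hypothetical choices for illustration. What the attached patch really computes is not shown here.

```python
ETH_HLEN = 14     # Ethernet header
ETH_FCS_LEN = 4   # frame check sequence
VLAN_HLEN = 4     # one 802.1Q tag
RX_PAD = 16       # extra bytes reported as necessary in this thread
ALIGNMENT = 8     # hypothetical alignment, chosen for illustration


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def rx_buffer_size(mtu: int, pad: int = RX_PAD) -> int:
    """Size of one rx buffer for a given MTU, with optional extra padding."""
    frame = mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN
    return align_up(frame + pad, ALIGNMENT)


print(rx_buffer_size(1500, pad=0))  # 1528: frame only
print(rx_buffer_size(1500))         # 1544: frame plus 16 bytes of padding
```

Under these assumptions the padded buffer for the default MTU of 1500 is 16 bytes larger than the unpadded one, which is the whole change the first patch is described as making.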
(In reply to Patkós Csaba from comment #44)
> I have no problem using my network card on kernel 4.0. So maybe someone
> should take a look what happened since then that broke the driver.

Well, for starters...

$ git log v4.0..v4.3 -- drivers/net/ethernet/atheros/alx/
<absolutely no changes to the alx driver code>

Thanks for working on this, Jarod. I tested your patch with kernel-4.2.5-201.fc22.x86_64 and still got the same failure. The symptoms are essentially the same as I described in my last post. About one minute after reboot, the interface stops receiving packets. It transmits OK. It stays that way for about 25 minutes and then I see these messages:

kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x400, resetting
NetworkManager[689]: <info> (p5p1): link disconnected (deferring action for 4 seconds)
NetworkManager[689]: <info> (p5p1): link connected
kernel: alx 0000:06:00.0 p5p1: NIC Up: 1 Gbps Full

upon which it works again for minute, and then stops receiving again for another 25.

I note that ifconfig reports:

RX packets 24070 bytes 204666 (199.8 KiB)
RX errors 21893 dropped 5 overruns 21893 frame 0
TX packets 2009 bytes 198321 (193.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

I confirm that the bug is on lenovo y580 with kernel 4.x. 3.18.24-1 kernel works fine. Is there a new version of the kernel to fix this problem?

(In reply to doaxan77 from comment #49)
> I confirm that the bug is on lenovo y580 with kernel 4.x. 3.18.24-1 kernel
> works fine. Is there a new version of the kernel to fix this problem?

If 3.18.x works, your problem might be different from what other people face here. And if it's really so (3.18.x works and later not), try git bisect at best.

I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.
But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).

I have Lenovo Y580 OpenSuse Leap 42.1 kernel 4.1.13
lspci | grep Ethernet
02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)

perhaps there are some differences between these ethernet cards ?

PS: definitely working kernel was 3.11.0 with driver http://linuxwireless.org/download/compat-wireless-2.6/compat-wireless-2012-07-03-pc.tar.bz2

Created attachment 197451 [details] attachment-11151-0.html

This is a very elusive issue. My onboard NIC on an HP desktop works perfectly up until kernel 4.0. Any kernel above that, 4.1 and later, breaks it, though there are no changes to the driver between these kernels. So most probably something changed __in the kernel__ that made the driver act up.

On Tue, Dec 15, 2015 at 9:45 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=70761
>
> --- Comment #51 from Marian Poltak <marianpoltak@centrum.sk> ---
> I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.
> But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).
>
> I have Lenovo Y580 OpenSuse Leap 42.1 kernel 4.1.13
> lspci | grep Ethernet
> 02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)
>
> perhaps there are some differences between these ethernet cards ?
>
> PS:
> definitely working kernel was 3.11.0 with driver
> http://linuxwireless.org/download/compat-wireless-2.6/compat-wireless-2012-07-03-pc.tar.bz2
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

Created attachment 197461 [details] attachment-11265-0.html

I forgot to mention that I have the same NIC in my Asus laptop. Works with kernel 4.0. Didn't try anything newer.

On Tue, Dec 15, 2015 at 9:55 PM Patkós Csaba <patkoscsaba@gmail.com> wrote:
> This is a very elusive issue. My onboard NIC on an HP desktop works perfectly up until kernel 4.0. Any kernel above that, 4.1 and later, breaks it, though there are no changes to the driver between these kernels. So most probably something changed __in the kernel__ that made the driver act up.
>
> On Tue, Dec 15, 2015 at 9:45 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=70761
>>
>> --- Comment #51 from Marian Poltak <marianpoltak@centrum.sk> ---
>> I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.
>> But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).
>>
>> I have Lenovo Y580 OpenSuse Leap 42.1 kernel 4.1.13
>> lspci | grep Ethernet
>> 02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)
>>
>> perhaps there are some differences between these ethernet cards ?
>>
>> PS:
>> definitely working kernel was 3.11.0 with driver
>> http://linuxwireless.org/download/compat-wireless-2.6/compat-wireless-2012-07-03-pc.tar.bz2
>>
>> --
>> You are receiving this mail because:
>> You are on the CC list for the bug.
>>
> --
> Patkós Csaba
> Software Developer @ Syneto LTD
> www.syneto.eu
> "A little bit more agile every day"

With the Manjaro distribution I tested several Linux kernels and got the following results:

4.2.6-1       NOT
4.3.0-1       NOT
4.4rc4-1      NOT
4.1.13-1      NOT
3.19.8.10-1   NOT
3.18.24-1     worked
3.16.7.20-1   worked

In the cases where the internet did not work, MTU 9000 helped.
I hope for the solution of this problem for home users of Linux.

(In reply to Marian Poltak from comment #51)
> I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.

Which above patch (or patches)? There are four. :)

> But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).

The patches all touch only the alx nic driver, they shouldn't have any impact on wifi or host only vbox networking...

(In reply to Jarod Wilson from comment #55)
> (In reply to Marian Poltak from comment #51)
> > I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.
>
> Which above patch (or patches)? There are four. :)

These patches, from this repository (for opensuse leap 42.1 x86_64): http://download.opensuse.org/repositories/home:/tiwai:/bnc952621

> > But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).
>
> The patches all touch only the alx nic driver, they shouldn't have any impact on wifi or host only vbox networking...

I know the other connections are working fine, but when they are active and have some traffic, it somehow makes the alx driver fail much faster. (I do not know why. Perhaps something about timing? Maybe the tcp/ip stack processes those other adapters sooner and therefore makes alx wait longer and fail? I absolutely do not know.)

Typical situation (checking via ifconfig):
1. connect to internet via Ethernet (alx working, RX overrun: 0)
2. create ad-hoc wifi for internet sharing (alx working, RX overrun: 0)
3. connect smartphone to the shared wifi (alx working, RX overrun: 0)
4. start browsing net on smartphone via shared wifi (alx breaks after a few seconds, RX overrun starts growing)

Ping over wifi is working (smartphone and PC) with no problem; only ethernet (alx) cannot receive any packet. It is similar when using the vbox host-only adapter. The patches help only for scenario 1.

Previously I tried the above patches only when connected via a single connection on ethernet, and the connection was stable. Then I tried both ethernet and wifi, but wifi had no traffic, so the connection was still stable. Unfortunately, this morning I tried wifi sharing with my phone and started streaming from youtube, and alx stopped working after a few seconds again, the same as before (the youtube page was not even fully loaded). So in this case the patches did not solve the problem; still the same behaviour.

PS. The workaround of setting the MTU to 9000 also works in the wifi sharing scenario. I did not try the last patch (attachment 196341 [details]), nor build it from source, because of lack of time, but I definitely will try it during the Christmas holiday.
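The scenario above depends on watching the RX overrun counter in ifconfig between steps. A minimal sketch in Python of how that check could be scripted, assuming the newer net-tools output layout seen in the ifconfig excerpt quoted earlier in this thread (older versions print `overruns:0` style fields, which this pattern does not handle). The function names are hypothetical.

```python
import re
from typing import Dict

RX_PACKETS_RE = re.compile(r"RX packets\s+(\d+)\s+bytes\s+(\d+)")
RX_ERRORS_RE = re.compile(
    r"RX errors\s+(\d+)\s+dropped\s+(\d+)\s+overruns\s+(\d+)\s+frame\s+(\d+)")

# Counters copied from the ifconfig output reported in this thread.
SAMPLE = """\
RX packets 24070 bytes 204666 (199.8 KiB)
RX errors 21893 dropped 5 overruns 21893 frame 0
TX packets 2009 bytes 198321 (193.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
"""


def parse_rx_counters(ifconfig_text: str) -> Dict[str, int]:
    """Extract the receive counters from one interface's ifconfig output."""
    packets = RX_PACKETS_RE.search(ifconfig_text)
    errors = RX_ERRORS_RE.search(ifconfig_text)
    if packets is None or errors is None:
        raise ValueError("no RX counters found in ifconfig output")
    return {
        "packets": int(packets.group(1)),
        "bytes": int(packets.group(2)),
        "errors": int(errors.group(1)),
        "dropped": int(errors.group(2)),
        "overruns": int(errors.group(3)),
        "frame": int(errors.group(4)),
    }


def overruns_growing(before: Dict[str, int], after: Dict[str, int]) -> bool:
    """True if the RX overrun counter increased between two samples."""
    return after["overruns"] > before["overruns"]
```

Taking one sample before step 4 and another a few seconds into it, then passing both to `overruns_growing`, would flag the failure without having to eyeball the counters.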
Just FYI. The patch for alx from comment #46 helped on a system with AR8162 network adapter (PCI IDs: 1969:1090) and kernel 4.1.x in ROSA Linux. One of our users has such hardware. Before the patch, the wired networking seemed not to work at all on that system. The workaround with MTU 9000 helped though. After the patch was applied, the wired connection has been working normally for at least several days now. No problems so far. Thanks for investigating the issue and providing workarounds and the patches! (In reply to Eugene A. Shatokhin from comment #58) > Just FYI. > > The patch for alx from comment #46 helped on a system with AR8162 network > adapter (PCI IDs: 1969:1090) and kernel 4.1.x in ROSA Linux. One of our > users has such hardware. > > Before the patch, the wired networking seemed not to work at all on that > system. The workaround with MTU 9000 helped though. After the patch was > applied, the wired connection has been working normally for at least several > days now. No problems so far. > > Thanks for investigating the issue and providing workarounds and the patches! I've gone ahead and sent this patch to the netdev mailing list for review, since it hasn't caused any regressions, and shows significant improvements for at least one user. (In reply to Jarod Wilson from comment #59) > > I've gone ahead and sent this patch to the netdev mailing list for review, > since it hasn't caused any regressions, and shows significant improvements > for at least one user. since i use the patched driver from Takashi Iwai (using patch from comment #46) located in http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 i experienced the crash only one time. so it is, of course, a significant improvement/enhancement. 
(In reply to Rainer Klier from comment #60) > (In reply to Jarod Wilson from comment #59) > > > > I've gone ahead and sent this patch to the netdev mailing list for review, > > since it hasn't caused any regressions, and shows significant improvements > > for at least one user. > > since i use the patched driver from Takashi Iwai (using patch from comment > #46) located in > http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 i > experienced the crash only one time. > > so it is, of course, a significant improvement/enhancement. This patch is now committed in davem's net-next tree, and should find it's way into kernel 4.5. We can keep working on the remaining issues, of course. Jarod, Thank you! Thank you! Thank you! I have a computer with an affected network card. It is a Dell 2710 all in one and replacing the NIC was not possible. I have been tracking issues with this hardware for some time. Before I upgraded to Ubuntu 15.10 (kernel 4.2.0) I was using the workaround of a large MTU. Kernel 4.2.0 completely broke my hardware. ~$ lspci -vs 04:00 04:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08) Subsystem: Dell Device 054b Flags: bus master, fast devsel, latency 0, IRQ 41 Memory at f7800000 (64-bit, non-prefetchable) [size=256K] I/O ports at e000 [size=128] Capabilities: <access denied> Kernel driver in use: alx Your patch from comment #46 fixes my card. I can remove the USB3 ethernet device I had been using. Sincerely, James Jarod, Thank you! Works fine for me and your patch has be merged into master. https://github.com/torvalds/linux/commit/c406700cdf882b89cb036117414fcd8b0cc2656d It will be available for Kernel 4.5 Nicolas Jarod, I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide. It did not solve my problem. Same symptoms as I have reported before. I have a Dell 2710 All in One with an AR8161. I would like to retract my previous comment #62 of the patch fixing my problem. 
It improved my ethernet, but there are still issues. I have reverted to kernel 3.16.7 which works flawlessly.

For the benefit of someone trying to replicate my issues, here is my setup:
- Dell 2710 AIO (AR8161) system connected to an ethernet switch
- switch connected to a router/nat device
- router/nat device connected to a cable modem
- MTU 1500 all through to the Internet

My memory is not perfect on which MTU setting caused/resolved which issue, but I set the MTU to 1 of 3 values:
- 8192
- 1500
- 1492 (normal for DSL, but not required for my cable connection)

MTU of 1500 did not work reliably. One of the other MTU values made scp file copies stall at 2208 KB. These were to another Linux host on the same switch. One of the other MTU settings caused Internet connectivity to be unreliable. (I do not have a packet capture to know exactly what was happening.) Symptoms, were my program to sync files to Google Drive would stall on small files and never recover...even after a restart of the entire system. Web browsing experienced frequent page stalls ( my guess is packets timed out and had to be resent). Google Chrome seemed to handle this better than Firefox did.

I am happy to test other patches, but for now I am running an older kernel that functions flawlessly.

Thank you,
James

Created attachment 216961 [details]
new_skb_allocator
(In reply to ldap.tester from comment #64)
> Jarod,
> I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide. It did not solve my problem. Same symptoms as I have reported before. I have a Dell 2710 All in One with an AR8161.

can you try the attachment 216961 [details] , I made it before I googled this bugzilla and Jarod's patch

(In reply to Feng Tang from comment #67)
<snip>
> can you try the attachment 216961 [details] , I made it before I googled this bugzilla and Jarod's patch

Hi Feng,

I applied your patch to kernel v.4.5.5 from Ubuntu Xenial and installed it on my machine.

First thing I tried was to copy a 650MB ISO to another machine. This worked! Normally this would stall at some point and timeout. I only got ~4.5MB/sec transfer rate. Not sure why it was so low. Normal speed is 30-40MB/s.

I am now trying some general web surfing. This always seemed to expose problems as well. After more time I will report back with my findings.

Thank you,
James Crow

(In reply to James from comment #68)
> (In reply to Feng Tang from comment #67)
> <snip>
> > can you try the attachment 216961 [details] , I made it before I googled this bugzilla and Jarod's patch
>
> Hi Feng,
>
> I applied your patch to kernel v.4.5.5 from Ubuntu Xenial and installed it on my machine.
>
> First thing I tried was to copy a 650MB ISO to another machine. This worked! Normally this would stall at some point and timeout. I only got ~4.5MB/sec transfer rate. Not sure why it was so low. Normal speed is 30-40MB/s.
>
> I am now trying some general web surfing. This always seemed to expose problems as well. After more time I will report back with my findings.
>
> Thank you,
> James Crow

After further testing, my ethernet seems to be stable. SSH transfer speed (which was 30-40MB/s with older kernels) has been much slower. I just completed the copy of a 1.9GB file at an average speed of 1.4MB/s.

My Internet surfing has seemed to remain constant.
With all other 4.0+ kernels I have seen significant issues. Pages partially load or load, but only after significant waits. My guess is some TCP timeout occurs and the browser resends the request.

Overall, this is progress, but the speed seems more like a 50MBit ethernet adapter rather than a gigabit card.

Thank you,
James Crow

(In reply to James from comment #69)
<snip>
> After further testing, my ethernet seems to be stable. SSH transfer speed (which was 30-40MB/s with older kernels) has been much slower. I just completed the copy of a 1.9GB file at an average speed of 1.4MB/s.
>
> My Internet surfing has seemed to remain constant. With all other 4.0+ kernels I have seen significant issues. Pages partially load or load, but only after significant waits. My guess is some TCP timeout occurs and the browser resends the request.
>
> Overall, this is progress, but the speed seems more like a 50MBit ethernet adapter rather than a gigabit card.
>
> Thank you,
> James Crow

One more data point, I tried to copy two large files (1.3G & 3.3G) from another host to my desktop. These transfers were at the normal speed 65-70MB/s.

It seems only when my card is sending data is there a slow down.

Thank you,
James Crow

> can you try the attachment 216961 [details] , I made it before I googled this bugzilla and Jarod's patch

I've tested this patch with kernel 4.1.15 (the patch required a bit of rediff) and it seems like my AR8161 (Asus n56vz laptop) works like a charm (at least for the last half an hour) using MTU=1360. No speed problems (8 MiB/s, the maximum speed with my connection), no disconnections. Tested with direct HTTP, FTP downloads and with some torrents (used these images: https://www.mageia.org/en/6/ ).

(In reply to James from comment #70)
> One more data point, I tried to copy two large files (1.3G & 3.3G) from another host to my desktop. These transfers were at the normal speed 65-70MB/s.
> It seems only when my card is sending data is there a slow down.

Hi James,

Thanks for trying and for the detailed info.

For the TX slow issue, I just did a test of scp a big file to another machine, and the TX speed here is about 35 MB/s on my Y580, similar to the RX speed (scp file copy). My kernel is 4.4 + the patch.

- Feng

(In reply to Feng Tang from comment #72)
> For the TX slow issue, I just did a test of scp a big file to another machine, and the TX speed here is about 35 MB/s on my Y580, similar to the RX speed (scp file copy). My kernel is 4.4 + the patch.
>
> - Feng

I did another round of tests with kernel 4.6; the TX speed for scp of a big file out is also around 30 MB/s.

Thanks,
Feng

I did some more debugging, and it looks very likely to be related to the RX DMA address. An address of 0xXXXX..XXf80 is very likely to cause the RX overflow interrupt and corrupt the link.

For kernel 4.5.0, which merged Jarod's patch and works fine with my AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag() so that the allocated buffer can get an address ending in 0x...f80, then it hits the same issue. So I tend to believe that the 0x..f80 address causes the silicon to behave abnormally.

Both Jarod's patch and mine actually keep the RX buffer from hitting the 0x..f80 address. My patch is a copy or port of Eric's patch, which is more general, while Jarod's patch cannot guarantee that 0x...f80 won't happen for some device, some MTU value, or some kernel.

Created attachment 218211 [details]
work_around_dma_issue.patch
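
The attachment itself is not reproduced in this thread. As a rough illustration of the observation above (an RX buffer whose address ends in a particular offset within its 4 KiB page appears to trigger the overflow), the check can be sketched in plain C. The helper name `addr_has_page_offset` is hypothetical and is not a driver function; the offset is passed in as a parameter because the thread mentions both 0x...f80 (for the DMA address) and 0xfc0 (in a later patch excerpt).

```c
#include <stdbool.h>
#include <stdint.h>

/* The low 12 bits of an address select the offset inside a 4 KiB page. */
#define PAGE_OFFSET_MASK 0xfffu

/*
 * Hypothetical helper, for illustration only (not part of the alx driver):
 * report whether `addr` sits at `offset` within its 4 KiB page.
 * The suspect offset is an assumption taken from the comments in this
 * thread (0xf80 or 0xfc0), not from a datasheet.
 */
static bool addr_has_page_offset(uintptr_t addr, unsigned int offset)
{
    return (addr & PAGE_OFFSET_MASK) == (offset & PAGE_OFFSET_MASK);
}
```

A caller would pass the candidate buffer address and the suspect offset, for example `addr_has_page_offset(buf_addr, 0xf80)`, and avoid handing that buffer to the hardware when the result is true.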
Hi James and Ole Lukoie,

Could you help to test the new work_around_dma_issue.patch? thanks

The idea is instead of adding a new custom allocator, we shoot at the right target of error.

(In reply to Feng Tang from comment #76)
> Hi James and Ole Lukoie,
>
> Could you help to test the new work_around_dma_issue.patch? thanks
>
> The idea is instead of adding a new custom allocator, we shoot at the right target of error.

I think I'll be able to test this patch but I don't know when. Probably not before the weekend...

for all openSUSE users:

Takashi Iwai kindly made a new kernel module with the latest patch to be tested.
the opensuse ticket for this is:
https://bugzilla.opensuse.org/show_bug.cgi?id=952621

this new kernel module is located in this repository:
http://download.opensuse.org/repositories/home:/tiwai:/bnc952621

i have already installed the new module and up to now it seems to work.
but as i am using it only a few hours now, this does not really tell us much.

(In reply to Rainer Klier from comment #78)
> for all openSUSE users:
>
> Takashi Iwai kindly made a new kernel module with the latest patch to be tested.
> i have already installed the new module and up to now it seems to work.
> but as i am using it only a few hours now, this does not really tell us much.

Thanks Rainer/Takashi Iwai for the try and update!

I read https://bugzilla.opensuse.org/show_bug.cgi?id=952621 carefully, and your problem looks different from most of the reports here. Your error info is:

[ 287.796808] alx 0000:04:00.0 eth0: fatal interrupt 0x4019607, resetting

This ISR value indicates that all the fatal interrupts happened, like ALX_ISR_PCIE_LNKDOWN, ALX_ISR_DMAW and ALX_ISR_DMAR, while mixing with RX and TX complete ISR, which is really scary!
:) (I don't have a datasheet, but only guess from the driver and header file)

And the error we met most often is 0x8 (RXF overflow interrupt), so I suggest you open a new bug in kernel bugzilla for this, and give detailed info like:
1. which kernel version is the last one to work well, and which version is the first to not work
2. detailed kernel error message
3. your steps to reproduce the issue, like tx/rx big files

(In reply to Feng Tang from comment #75)
> Created attachment 218211 [details]
> work_around_dma_issue.patch

So I tested your patch and my ethernet is still usable after several hours of iperf benchmarking. I see no performance difference between this patch and the custom skb allocator. Without one of those patches my AR8161 isn't usable after a few minutes of network traffic.

(In reply to Feng Tang from comment #75)
> Created attachment 218211 [details]
> work_around_dma_issue.patch

I've done some testing too. Both patches (this one and the previous one) behave equally, no difference in network connection speed or stability.

Just one note: I'm a bit scared by the potential infinite loop in the latest patch:

    retry:
        skb = __netdev_alloc_skb ...;
        ...
        if (((u32)skb->data & 0xfff) == 0xfc0) {
            ...
            dev_kfree_skb(skb);
            goto retry;
        }

Is there any guarantee against constantly repeating wrong allocations?

(In reply to olelukoie from comment #81)
> Are there any guarantee against constantly repeating wrong allocations?

Thank you for the test. Mathematically, the endless loop should not happen.

Eric Dumazet has suggested another way: instead of retrying, we allocate 64B more space and offset it by 64 when 0x...fc0 is found. Could you also try this new patch?

Created attachment 219281 [details]
new_dma_patch
Created attachment 219291 [details]
new dma patch
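
The attached patches are not shown in the thread. The following is only a user-space sketch of the "allocate 64B more and offset by 64" idea described above, so that the two approaches can be compared. The names `rx_alloc_size` and `rx_fixup_addr` are hypothetical, the 0xfc0 offset is taken from the excerpt quoted in comment #81, and a real driver would operate on an skb rather than a bare address.

```c
#include <stddef.h>
#include <stdint.h>

#define BAD_PAGE_OFFSET 0xfc0u /* offset quoted in the patch excerpt above */
#define EXTRA_BYTES     64u    /* "64B more space", as suggested in the thread */

/* Hypothetical helper: size to request so the buffer can be shifted later. */
static size_t rx_alloc_size(size_t bufsize)
{
    return bufsize + EXTRA_BYTES;
}

/*
 * Hypothetical helper: given the start address of a freshly allocated
 * buffer, return the address to hand to the hardware. The address is moved
 * forward by 64 bytes only when it lands on the suspect page offset.
 * There is no retry loop, so the work per buffer is bounded.
 */
static uintptr_t rx_fixup_addr(uintptr_t start)
{
    if ((start & 0xfffu) == BAD_PAGE_OFFSET)
        return start + EXTRA_BYTES;
    return start;
}
```

Because 0xfc0 + 64 wraps to offset 0x000 of the next page, the adjusted address can never end in 0xfc0, which is what removes the need for the retry loop questioned earlier.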
The latest patch works too.

I've just noticed that GKH used the patch with custom skb allocator instead of the latest version called "new_dma_patch". Is it ok, or is there something wrong with the "new_dma_patch" version?

And another question: are there plans on applying the patch to current longterm kernels (4.1 & 4.4)?

(In reply to olelukoie from comment #86)
> I've just noticed that GKH used the patch with custom skb allocator instead of the latest version called "new_dma_patch". Is it ok, or is there something wrong with the "new_dma_patch" version?

No, I don't see anything wrong with the "new_dma_patch". I think GKH picked the new allocator patch only because this is the one inside Linus' mainline tree.

> And another question: are there plans on applying the patch to current longterm kernels (4.1 & 4.4)?

The patch won't apply as is (I tried), I'll try to follow the stable kernel rules to see what I can do.

btw, I make a patch against 4.4 kernel (and it could apply to 4.1 as well), could you help to test?

Thanks,
Feng

Created attachment 221141 [details]
dma patch for 4.1/4.4 stable kernel
(In reply to Feng Tang from comment #87)
> btw, I make a patch against 4.4 kernel (and it could apply to 4.1 as well), could you help to test?

I've already updated kernel to version 4.4.13 a week ago and the patch works normally with it.

4.4.37 : no problems for me :) Patch included in kernel ?

Yes, there are patches merged upstream and brought back to the stable trees that should solve all the problems reported here. I think this bug can probably be closed now.

hi,

(In reply to Feng Tang from comment #79)
> (In reply to Rainer Klier from comment #78)
> > for all openSUSE users:
> >
> while mixing with RX and TX complete ISR, which is really scary! :)
> (I don't have a datasheet, but only guess from the driver and header file)
>
> And the most error we met is about 0x8 (RXF overflow interrupt), so I suggest you to new a bug in kernel bugzilla for this, and give detail info

thanks.
it seems that this is already done by somebody else:
https://bugzilla.kernel.org/show_bug.cgi?id=102171
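
For anyone comparing their own log against the two cases discussed in comment #79, here is a small sketch that lists which bits are set in an interrupt status value and checks the 0x8 mask that the thread associates with the RXF overflow interrupt. The helper names are hypothetical. Only the 0x8 mask comes from the thread; the mapping of the other bit positions to ALX_ISR_* names lives in the driver's register header and is deliberately not reproduced here.

```c
#include <stdint.h>

/* "0x8 (RXF overflow interrupt)" as stated in comment #79. */
#define RXF_OVERFLOW_MASK 0x8u

/*
 * Hypothetical helper: store the positions of all set bits of `isr` in
 * `bits` (lowest first) and return how many were found.
 */
static int isr_set_bits(uint32_t isr, int bits[32])
{
    int n = 0;

    for (int b = 0; b < 32; b++) {
        if (isr & (UINT32_C(1) << b))
            bits[n++] = b;
    }
    return n;
}

/* Hypothetical helper: non-zero when the RXF overflow bit is set. */
static int isr_has_rxf_overflow(uint32_t isr)
{
    return (isr & RXF_OVERFLOW_MASK) != 0;
}
```

Applied to the value 0x4019607 from the openSUSE report, `isr_has_rxf_overflow` returns 0, which fits the remark that this report looks different from the RXF overflow cases tracked in this bug.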