Bug 70761 - AR8161 with alx driver: Randomly stops to receive packets with small MTU
Summary: AR8161 with alx driver: Randomly stops to receive packets with small MTU
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Duplicates: 96911
Depends on:
Blocks:

Reported: 2014-02-18 14:42 UTC by XmainframeX
Modified: 2016-12-14 12:56 UTC
CC List: 36 users

See Also:
Kernel Version: >3.6
Tree: Mainline
Regression: Yes


Attachments
direct fix for this bug (3.42 KB, patch), 2015-10-25 16:13 UTC, pru
extra fix for consuming all memory (2.04 KB, patch), 2015-10-25 16:14 UTC, pru
fix for recovery from rx underflow (2.56 KB, patch), 2015-10-25 16:14 UTC, pru
attachment-18127-0.html (2.26 KB, text/html), 2015-11-25 09:04 UTC, Patkós Csaba
Buffer size sanitation, padding and consistency (4.32 KB, patch), 2015-12-02 16:27 UTC, Jarod Wilson
attachment-11151-0.html (2.12 KB, text/html), 2015-12-15 19:56 UTC, Patkós Csaba
attachment-11265-0.html (2.81 KB, text/html), 2015-12-15 19:57 UTC, Patkós Csaba
new_skb_allocator (5.22 KB, application/octet-stream), 2016-05-20 09:19 UTC, Feng Tang
work_around_dma_issue.patch (805 bytes, patch), 2016-05-30 06:42 UTC, Feng Tang
new_dma_patch (809 bytes, application/octet-stream), 2016-06-07 09:53 UTC, Feng Tang
new dma patch (969 bytes, application/octet-stream), 2016-06-07 09:57 UTC, Feng Tang
dma patch for 4.1/4.4 stable kernel (1.06 KB, patch), 2016-06-24 07:42 UTC, Feng Tang

Description XmainframeX 2014-02-18 14:42:49 UTC
I have an Asus NV56VZ laptop with an AR8161 network card. The alx driver backport worked well up to kernel version 3.6. With newer kernels, the alx driver included from 3.8 and all backported drivers exhibit the following behaviour:

When I plug in a network cable, the network card stops receiving packets after a random amount of time (between roughly 10 and 120 seconds): Wireshark does not show any incoming packets anymore. Reloading the alx module fixes the problem, but only for another 10-120 seconds.
The connection only works permanently if the network cable is plugged in before booting and no changes are made to any network settings at all (e.g. enabling wireless networking makes the wired connection stop working, too).

I diffed the alx driver source for kernels 3.6 and 3.8 and stumbled over some changes in the calculation of the RX buffer size. Besides that, I noticed that the RX buffer calculation differs between MTU > 7*1024 bytes and MTU <= 7*1024 bytes.

Therefore I changed my MTU from 1542 to 8192 as a shot in the dark, which fixed the problem for me. However, this does not seem to be a good permanent solution to me, so I think an expert on the alx driver should review the code paths related to MTUs < 7*1024.
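
The boundary and the workaround described above can be sketched in a few lines of Python. This is only an illustration, not driver code: the 7*1024 boundary is taken from the description, the helper names are made up for this example, and the interface name `eth0` is an assumption. The `ip link set dev <iface> mtu <n>` form is the usual iproute2 invocation and needs root.

```python
import subprocess

# Boundary mentioned in the description: the RX buffer calculation
# differs for MTUs above 7*1024 bytes.
MTU_THRESHOLD = 7 * 1024


def above_threshold(mtu):
    """Return True if the MTU is on the 'large' side of the boundary."""
    return mtu > MTU_THRESHOLD


def mtu_command(iface, mtu):
    """Build the iproute2 command that sets the MTU (hypothetical helper)."""
    return ["ip", "link", "set", "dev", iface, "mtu", str(mtu)]


def apply_workaround(iface="eth0", mtu=8192):
    """Run the command; requires root and an existing interface."""
    subprocess.run(mtu_command(iface, mtu), check=True)


if __name__ == "__main__":
    # A default 1500-byte MTU sits below the boundary, 8192 above it.
    for mtu in (1500, 8192):
        print(mtu, above_threshold(mtu))
    print(" ".join(mtu_command("eth0", 8192)))
```

Calling `apply_workaround()` only changes the running configuration; it does not persist across reboots or module reloads.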

Feel free to ask for any additional information you might need to solve the problem!
Comment 1 Konrad "morsik" Mosoń 2014-02-21 19:55:12 UTC
The same problem on kernel 3.13.4-1-ARCH on Lenovo Y580 with the same AR8161. Setting MTU to 8192 really helped!
Also, the alx driver's statistics are missing from /proc/net/dev, which makes it difficult to debug what's going on.
Comment 2 Marian Poltak 2015-03-30 16:44:02 UTC
I can confirm that the bug still exists in newer kernels (even with the Linux 4-rc6 release), using a Lenovo IdeaPad Y580.
When the connection is lost, ifconfig reports RX overruns, and the count keeps increasing (it looks like every single packet received causes an overrun). This lasts until the cable is unplugged and plugged in again, or until the alx module is reloaded.
The workaround with setting MTU to 8192 is working.
Comment 3 Ongun Kanat 2015-04-07 02:35:05 UTC
Same problem with 3.18.6. When using IP forwarding it stops randomly.
Comment 4 Marco B. 2015-04-27 05:42:50 UTC
I can confirm this with my IdeaPad Y580 as well. Workaround with MTU 8192 is working.
Comment 5 Bernhard Rausch (harti) 2015-04-28 09:03:29 UTC
I can also confirm this with my IdeaPad Y580. Workaround with MTU 8192 works.
Comment 6 Bernardo Reino 2015-04-28 09:30:17 UTC
Me too. I confirm this issue and workaround with Lenovo Ideapad N581 (AR8161 rev 08).
Comment 7 nstephenh 2015-05-02 00:09:17 UTC
I can also confirm this issue and that the workaround fixes it. Kernel 3.19
Comment 8 Sergio Perez 2015-05-04 12:02:44 UTC
*** Bug 96911 has been marked as a duplicate of this bug. ***
Comment 9 Bob Doolittle 2015-05-19 21:06:41 UTC
I can confirm this issue. I see no received packets with wireshark. I have AR816x/AR817x (Lenovo Ideapad P580, I can't find specifics for the h/w).

HOWEVER, I find that kernel 3.18.7 *works fine* without changing mtu, while with kernel 3.19.5 the interface goes down shortly after configured, unless I bump up the mtu to 8192. I have reproduced this several times.

In my case, I am plugging in the network cable, disabling WiFi, and restarting the network. With 3.19.5 things work initially, then shortly thereafter stops receiving packets. Changing mtu to 8192 restores operation.

With 3.18.7 I do the same procedure, but encounter no issues.
Comment 10 Serious Sam 2015-06-08 18:54:06 UTC
I too can confirm this. Kernel 3.18 LTS series did not have this issue, but when I have updated to Ubuntu 15.04 it appeared again. Even on new 4.1 rc7 the issue persists.  Workaround with MTU 8192 is working.
Comment 11 hundycougar 2015-06-24 17:18:17 UTC
I have a Lenovo W520 with a similar problem - pulling the cable out and putting it back in was the only fix, and it worked for only 30 seconds...

 lspci | grep net
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
Comment 12 Adrien DAUGABEL 2015-07-01 11:44:40 UTC
Hi,

Atheros AR8161 : Same problem for me. Kernel 4.1.1. But, on the 4.0.6, all worked ...

Network:   Card-1: Intel Centrino Wireless-N 2230 driver: iwlwifi
           IF: wlp3s0 state: up mac: 68:5d:43:2a:f3:af
           Card-2: Qualcomm Atheros AR8161 Gigabit Ethernet driver: alx
           IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac:
Comment 13 Rainer Klier 2015-07-02 06:48:48 UTC
same problem for me.
i am on openSUSE 13.2 x86_64 on an Asus G750J.

lspci:
04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)

dmesg:
alx 0000:04:00.0 eth0: Qualcomm Atheros AR816x/AR817x Ethernet [ac:22:0b:b8:00:59]

Workaround with MTU 8192 is working only on kernels below 4.1.x.
i tried 4.1.0 and here it stops working even when using MTU 8192 workaround.
currently i am back on 4.0.5.
Comment 14 Rainer Klier 2015-07-02 08:02:32 UTC
(In reply to Rainer Klier from comment #13)

> Workaround with MTU 8192 is working only on kernels below 4.1.x.
> i tried 4.1.0 and here it stops working even when using MTU 8192 workaround.
> currently i am back on 4.0.5.

i was wrong here.
in fact, i failed setting the MTU to 8192.
i thought i did it, but it didn't work.

so, below kernel 4.1.x it worked most of the time for me, even without setting the MTU to 8192.

but the error occurred randomly.

and with kernel 4.1.x the error occurred instantly.

currently i finally managed to set MTU to 8192 and trying again kernel 4.1.1....
15 minutes after booting alx is still working.... ;-)
Comment 15 Rainer Klier 2015-07-02 08:59:35 UTC
(In reply to Rainer Klier from comment #14)
> (In reply to Rainer Klier from comment #13)
> 

> currently i finally managed to set MTU to 8192 and trying again kernel
> 4.1.1....
> 15 minutes after booting alx is still working.... ;-)

40 minutes later it happened again, even with the MTU workaround.
:-(

back to Kernel 4.0.5.
Comment 16 Adrien DAUGABEL 2015-07-12 20:39:39 UTC
Kernel 4.1.2 : same problem and same workaround : MTU = 8000 is ok
Comment 17 Thiago Okada 2015-07-22 16:30:02 UTC
Kernel 4.1.2 (on Arch Linux) and Lenovo Y580: can confirm the same problem and the MTU=8192 fix.
Comment 18 mp-001 2015-08-03 20:13:59 UTC
I've the same problem with the Asus N76VM (AR8161) on Fedora 22 with kernel >= 4.1. 

Setting MTU to 8192 works, as do kernels < 4.1.
Comment 19 Rainer Klier 2015-08-04 07:19:36 UTC
with kernel 4.1.3 and MTU 9000 it works now as stable as with kernel 4.0.5, which means that the bug/crash happens not that often, but happens from time to time.
Comment 20 John 2015-08-26 00:59:14 UTC
Same issue with Dell One 27 running Fedora 22 Kernel 4.1.5-200.fc22.x86_64 with Qualcomm Atheros AR816x/AR817x Ethernet driver alx.

Setting MTU 9000 has worked, but it has not been running very long, so I don't know how stable it is.
Comment 21 Gleb Simanov 2015-09-11 04:38:00 UTC
I confirm to have this same issue with newest distros. Ubuntu 15.04 and also Antergos. But Debian 8 Jessie works without this problem.

At the moment:

3.19.0-28-generic #30-Ubuntu SMP Mon Aug 31 15:52:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Comment 22 Adrien DAUGABEL 2015-09-17 19:09:43 UTC
Kernel 4.1.7 : same problem, no correction :'(
Comment 23 Adrien DAUGABEL 2015-10-02 10:26:59 UTC
4.1.8 and 4.1.9 : same problem
Comment 24 pru 2015-10-13 20:25:29 UTC
There are a number of bug confirmation reports, but is anyone actually looking into this?

This is the second time I have seen a problem around here (https://bugzilla.kernel.org/show_bug.cgi?id=51671). And this time it is not an occasional problem either; it simply happens from the first run. Is that code tested at all? Why did this buggy thing reach the kernel at all?

BTW, the MTU change does not help if a file of a few GB is transferred.
Comment 25 Gleb Simanov 2015-10-13 20:29:01 UTC
(In reply to pru from comment #24)
> A number of bug confirmation reports but is there anyone actually looking
> into this?
> 
> It is a second time I see problem around here
> (https://bugzilla.kernel.org/show_bug.cgi?id=51671). And this time it is is
> also not occasional problem, it simply happen from first run. It that code
> tested at all? Why this buggy thing reached kernel at all?
> 
> BTW. The MTU change does not help if few GB file to be transferred.

I think no and that's the damn problem!
Comment 26 pru 2015-10-25 16:12:50 UTC
Since no one is looking into this, let me share my findings.
The main problem is that RX signalling stops after a while. This covers both the interrupt flag and the update bit in the word3 register. Checking the skb buffer contents against a fill pattern shows no signs of overflow, so the problem might be in how the chip handles the buffers. Looking at other drivers, extra space is reserved for specific chips. Trying the same with the AR8162 shows that an extra 16 bytes are needed here. As there is no documentation, the only proof of that is testing a number of transfers with different MTUs. Thus the direct fix for this bug is adding 16 bytes of padding, which is in patch 0001.
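
As a rough illustration of the padding idea, the arithmetic might look like the sketch below. The header sizes are the standard Ethernet values (14-byte header, 4-byte FCS, 4-byte VLAN tag). The 8-byte alignment and the function names are assumptions made for this example only; the formula actually used by the driver and by patch 0001 may differ.

```python
ETH_HLEN = 14     # Ethernet header
ETH_FCS_LEN = 4   # frame check sequence
VLAN_HLEN = 4     # one VLAN tag
RX_PAD = 16       # extra bytes the chip appears to need, per the testing above


def align_up(value, boundary):
    """Round value up to the next multiple of boundary."""
    return (value + boundary - 1) // boundary * boundary


def rx_buf_size(mtu, pad=RX_PAD):
    """Hypothetical RX buffer size: raw frame length, aligned, plus padding."""
    raw = mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN
    return align_up(raw, 8) + pad


if __name__ == "__main__":
    # Compare the unpadded and padded size for a standard 1500-byte MTU.
    print(rx_buf_size(1500, pad=0), rx_buf_size(1500))
```

Under these assumptions the only effect of the fix is that every RX buffer is 16 bytes larger than the frame it is expected to hold.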

Well, the more I tested, the more problems I found. There are two other things that I would link to this bug too.
First – under some conditions the rx buffer refill loops until memory is exhausted. This is because it refills up to the read index, which keeps running away in the background. This is fixed in patch 0002.
Second – in case of rx underflow there is no recovery from this state, as nothing will allocate new rx buffers. This is fixed in patch 0003.
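
The first of these two problems can be illustrated with a toy model. This is not the alx code and not the content of patch 0002; the class, the function names and the ring size are invented for the example. It only shows why a refill loop that chases a live read index may never terminate, while one that works against a snapshot does.

```python
RING_SIZE = 8


class RxRing:
    """Toy RX ring: the driver fills slots at `write`, the chip consumes at `read`."""

    def __init__(self, size=RING_SIZE):
        self.size = size
        self.write = 0
        self.read = 0
        self.allocated = 0  # number of buffer allocations so far

    def chip_consumes(self, n=1):
        """Simulate the chip advancing the read index in the background."""
        self.read = (self.read + n) % self.size


def refill_chasing(ring, max_steps=1000):
    """Refill until `write` catches up with the live read index.

    If the read index advances as fast as we refill, the gap never
    closes; max_steps stands in for 'until memory runs out'.
    """
    steps = 0
    while (ring.write + 1) % ring.size != ring.read and steps < max_steps:
        ring.allocated += 1
        ring.write = (ring.write + 1) % ring.size
        ring.chip_consumes()  # the read index runs away at the same pace
        steps += 1
    return steps


def refill_bounded(ring):
    """Refill only up to a snapshot of the read index taken once."""
    target = ring.read  # later movement of the read index is ignored this round
    steps = 0
    while (ring.write + 1) % ring.size != target:
        ring.allocated += 1
        ring.write = (ring.write + 1) % ring.size
        ring.chip_consumes()
        steps += 1
    return steps
```

In this model `refill_chasing` only stops because of the artificial `max_steps` cap, whereas `refill_bounded` stops after at most one pass over the ring.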
Comment 27 pru 2015-10-25 16:13:20 UTC
Created attachment 191091 [details]
direct fix for this bug
Comment 28 pru 2015-10-25 16:14:00 UTC
Created attachment 191101 [details]
extra fix for consuming all memory
Comment 29 pru 2015-10-25 16:14:51 UTC
Created attachment 191111 [details]
fix for recovery from rx underflow
Comment 30 Rainer Klier 2015-10-27 14:18:09 UTC
(In reply to pru from comment #26)
> Since no one is looking into this let me share my findings.
> nothing will allocate new rx buffers. This is fixed in patch 0003.

thanks for bringing light into this.
great!
but how and when will these fixes be part of the kernel?
Comment 31 pru 2015-10-27 17:12:26 UTC
I'm not a kernel developer. I have had some things accepted and would be glad to do more in the future, but the patches must be reviewed, modified if needed, approved, etc.
Hopefully, since this is an existing bug and it is already assigned, the assigned staff can continue.
Comment 32 Rainer Klier 2015-10-29 08:56:19 UTC
(In reply to pru from comment #31)
> I'm not a kernel developer, I did some approved things and be glad to do
> another in future, but the patches must be reviewed, modified or not,
> approved etc.

where did you get the current alx source from?
is it https://github.com/erikarn/alx ?

i want to try out your patches.
to which source did you apply your patches?
Comment 33 Bernardo Reino 2015-10-29 12:36:37 UTC
@Rainer Klier,

The link you posted is for the original Qualcomm unified driver. Kernel 3.8+ includes an in-tree driver (by Johannes Berg) based on that one but stripped of a lot of things.

I suspect either something is broken in that driver, and/or the card (or some revisions thereof) has a hardware bug, which the original driver might be able to circumvent (something to do with TCP segmentation offload).

I imagine @pru's patches apply to the in-tree driver.
Comment 34 Rainer Klier 2015-10-29 12:54:30 UTC
(In reply to Bernardo Reino from comment #33)
> @Rainer Klier,
> 
> The link you posted is for the original Qualcomm unified driver. Kernel 3.8+
> includes an in-tree driver (by Johannes Berg) based on that one but stripped
> off a lot of things.

ah, ok.

> I imagine @prui's patches apply to the in-tree driver.

i was just asking IF the source from  https://github.com/erikarn/alx is the correct one to use the patch.
at that time i didn't know any other source for this driver.

now i assume the in-tree driver source is this:
https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/atheros/alx

anyhow, i tried to compile the driver from https://github.com/erikarn/alx but failed. :-(

i think i have to wait for the fix to be included in one of the next kernel releases....
Comment 35 pru 2015-10-29 15:55:24 UTC
The patches are against https://github.com/torvalds/linux.git; sorry, I should have mentioned that before.
I know the backport version has more code, but since the first entry in this bug says 'alx driver backport worked well up to' I assumed it also suffers from the same problem.
Comment 36 pru 2015-10-29 16:09:23 UTC
A note on patch 0003: it might be good to schedule the rx refill on a timer instead of queueing it immediately on underflow. I spent only a little more than one day on this, so treat the patches as a proof of concept, even though they do the job.
Comment 37 Rainer Klier 2015-11-09 14:32:06 UTC
for all openSUSE users:
i have created also a ticket in opensuse bugzilla:
https://bugzilla.opensuse.org/show_bug.cgi?id=952621

and Takashi Iwai kindly made a new kernel module with the above patches to be tested.

this new kernel module is located in this repository:
http://download.opensuse.org/repositories/home:/tiwai:/bnc952621
Comment 38 Takashi Iwai 2015-11-23 14:18:56 UTC
(In reply to pru from comment #31)
> I'm not a kernel developer, I did some approved things and be glad to do
> another in future, but the patches must be reviewed, modified or not,
> approved etc.
> Hopefully this is an existing bug and it is assigned already, thus let
> assigned staff continue.

I guess no one else is working on this, so your fix would be the best one to apply upstream.

Could you submit your fix patches (at least the first two) to upstream ML after brushing them up a bit?  Make each subject line concise, and put more information in the changelog texts, describe for which bug it is and what each patch actually does.  Better to put this bugzilla as the information point, and take a tested-by tag, for example.
Comment 39 Sabrina Dubroca 2015-11-23 14:35:50 UTC
(In reply to pru from comment #35)
> The patches are against https://github.com/torvalds/linux.git, sorry I could
> mention that before.
> I know the backport version has more code, but since the first entry in this
> bug says 'alx driver backport worked well up to' I assumed it also suffers
> from the same problem.

Please post your patches to the netdev mailing list (netdev@vger.kernel.org). That's where all development for networking and network drivers happens, including code review. Patches cannot be included in the kernel unless they are submitted to the appropriate mailing list. And most kernel developers (at least in networking) don't read bugzilla, sorry.

I suspect hardware bugs on some models, as I have never encountered this bug, or any other problem, with my alx card, but I will test your patches.

Documentation for submitting patches [0][1].

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches
[1] http://kernelnewbies.org/FirstKernelPatch
Comment 40 Rainer Klier 2015-11-24 09:44:36 UTC
(In reply to Rainer Klier from comment #37)

> and  Takashi Iwai kindly made a new kernel module with the above pathces to
> be tested.
> 
> this new kernel module is located in this repositories:
> http://download.opensuse.org/repositories/home:/tiwai:/bnc952621

sadly, i had the crash again today even with the patched/new kernel module:

[  287.796808] alx 0000:04:00.0 eth0: fatal interrupt 0x4019607, resetting

so it seems, that it only takes longer to appear, but the changes do not solve it.
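
For anyone collecting such messages, a small filter like the following could help to count the resets and to compare the reported interrupt status values between machines. It is a sketch only: the regular expression is derived from the single log line quoted above, and the helper names are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   alx 0000:04:00.0 eth0: fatal interrupt 0x4019607, resetting
FATAL_RE = re.compile(
    r"alx (?P<pci>[0-9a-f:.]+) (?P<iface>\S+): "
    r"fatal interrupt (?P<status>0x[0-9a-fA-F]+), resetting"
)


def fatal_interrupts(lines):
    """Yield (iface, status) for every alx fatal-interrupt line."""
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            yield match.group("iface"), int(match.group("status"), 16)


def summarize(lines):
    """Count fatal interrupts per (iface, status) pair."""
    return Counter(fatal_interrupts(lines))
```

Feeding it the lines of a saved `dmesg` dump (for example `summarize(open("dmesg.txt"))`, where the file name is just a placeholder) gives one counter entry per interface and status value.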
Comment 41 Jarod Wilson 2015-11-24 16:10:38 UTC
Argh, that's unfortunate. I'm doing a test build with these patches to take for a spin on my own alx-driven NIC (E2200, doesn't have this bug though), and trying to clean them up slightly to get them sent to netdev for proper review.
Comment 42 pru 2015-11-24 18:34:11 UTC
Rainer – a fundamental question: without the patches and without the MTU set, did you observe the lost connection separately from the crash, or was it always the crash? Because if you hit the lost connection without the crash, then we probably have two separate problems here.
Note that the patches focus on the lost connection only, as I never had a crash (well, I did not test this long enough).

Takashi/Sabrina – if there are two separate problems (lost connection vs. crash), I can push the patches to the list, sadly not before next week.

Jarod – the driver needs some work; as I mentioned before, I found two other problems during testing and I stopped looking further. See my note on the 3rd patch: doing this differently is a matter of preference, but this patch is needed to recover from a dead end.
Comment 43 Rainer Klier 2015-11-25 07:47:48 UTC
(In reply to pru from comment #42)
> Rainer – a fundamental question, without the patches and without the MTU
> set, did you observe the lost connection separately from the crash or it was
> the crash always? Because if you hit the lost connection without the crash

for me it is always the same.
the alx driver stops working.
the connection is lost.
the alx driver is not useable any more.
and the dmesg output is flooded with the "eth0: fatal interrupt 0x4019607, resetting" messages. it once happened that my disk ran out of space because /var/log/messages was several gigabytes large....

in this situation i am not able to use the network card any more until reboot.
i can only use wlan at this situation.
besides this the computer is normally useable.
so it does not crash completely.
i only have to reboot as fast as possible because /var/log/messages is growing.

but i think i remember that without the patches a reboot didn't always work cleanly. sometimes i had a bad kernel crash while trying to reboot.
and in the debug output of this crash the alx module was mentioned as reason somehow.
Comment 44 Patkós Csaba 2015-11-25 09:04:57 UTC
Created attachment 195371 [details]
attachment-18127-0.html

I have no problem using my network card on kernel 4.0. So maybe someone
should take a look what happened since then that broke the driver.

Comment 45 ldap.tester 2015-11-30 16:17:55 UTC
The symptoms of this bug seem to vary from report to report.  Maybe it depends on exactly which model chip you have.  I have two machines, Dell All-In-One 27 with a Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08).  The alx driver works fine with kernel 4.0.x but not with 4.1.x and later.  My symptoms of the problem are that the interface works fine for about a minute after boot and then stops communicating.  Packets are transmitted out of the machine, but no packets are received.  The machine works fine otherwise.  Only the ethernet communication fails.  After about 25 minutes, I get a message like: kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x8400, resetting.
The network works again for about a minute and then stops working.  I tested the above posted patches with kernel 4.2.5, and saw essentially the same symptoms - only minor differences in timing.  Maybe I did the patching wrong but I don't think so.  Please see https://bugzilla.redhat.com/show_bug.cgi?id=1251434
Comment 46 Jarod Wilson 2015-12-02 16:27:26 UTC
Created attachment 196341 [details]
Buffer size sanitation, padding and consistency

If some folks with affected hardware could give this patch a spin, it would be much appreciated. It's based loosely on Przemek's first patch, I've not yet dug into the other issues. I've tested this lightly on my own E2200-equipped laptop, which had no problems to begin with, and continues to function just fine with this patch.
Comment 47 Jarod Wilson 2015-12-02 16:41:17 UTC
(In reply to Patkós Csaba from comment #44)
> I have no problem using my network card on kernel 4.0. So maybe someone
> should take a look what happened since then that broke the driver.

Well, for starters...

$ git log v4.0..v4.3 -- drivers/net/ethernet/atheros/alx/
<absolutely no changes to the alx driver code>
Comment 48 ldap.tester 2015-12-03 17:00:09 UTC
Thanks for working on this, Jarod.  I tested your patch with kernel-4.2.5-201.fc22.x86_64 and still got the same failure.  The symptoms are essentially the same as I described in my last post.  About one minute after reboot, the interface stops receiving packets.  It transmits OK.  It stays that way for about 25 minutes and then I see these messages:

kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x400, resetting
NetworkManager[689]: <info>  (p5p1): link disconnected (deferring action for 4 seconds)
NetworkManager[689]: <info>  (p5p1): link connected
kernel: alx 0000:06:00.0 p5p1: NIC Up: 1 Gbps Full
     
upon which it works again for a minute, and then stops receiving again for another 25.

I note that ifconfig reports:

RX packets 24070  bytes 204666 (199.8 KiB)
RX errors 21893  dropped 5  overruns 21893  frame 0
TX packets 2009  bytes 198321 (193.6 KiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
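
The RX counters in such output can be extracted with a short script, which makes it easier to compare snapshots taken before and after the interface stalls. This is a sketch: the patterns are written against the two RX lines pasted above, and the function names are invented for the example.

```python
import re

# Patterns for the RX lines of ifconfig output in the format pasted above.
RX_PKT_RE = re.compile(r"RX packets (\d+)\s+bytes (\d+)")
RX_ERR_RE = re.compile(
    r"RX errors (\d+)\s+dropped (\d+)\s+overruns (\d+)\s+frame (\d+)"
)


def rx_stats(text):
    """Return the RX counters found in ifconfig output as a dict of ints."""
    pkt = RX_PKT_RE.search(text)
    err = RX_ERR_RE.search(text)
    if pkt is None or err is None:
        raise ValueError("RX counters not found")
    names = ("packets", "bytes", "errors", "dropped", "overruns", "frame")
    values = pkt.groups() + err.groups()
    return dict(zip(names, (int(v) for v in values)))


def overrun_growth(before, after):
    """Overruns added between two snapshots taken with rx_stats()."""
    return after["overruns"] - before["overruns"]


SAMPLE = """\
RX packets 24070  bytes 204666 (199.8 KiB)
RX errors 21893  dropped 5  overruns 21893  frame 0
"""

if __name__ == "__main__":
    stats = rx_stats(SAMPLE)
    # In the sample above every RX error is also counted as an overrun.
    print(stats["errors"] == stats["overruns"])
```

A steadily positive `overrun_growth` between two snapshots matches the "RX overruns keep increasing" symptom described earlier in this report.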
Comment 49 Doaxan 2015-12-15 14:19:37 UTC
I confirm that the bug is on lenovo y580 with kernel 4.x. 3.18.24-1 kernel works fine. Is there a new version of the kernel to fix this problem?
Comment 50 Takashi Iwai 2015-12-15 14:26:26 UTC
(In reply to doaxan77 from comment #49)
> I confirm that the bug is on lenovo y580 with kernel 4.x. 3.18.24-1 kernel
> works fine. Is there a new version of the kernel to fix this problem?

If 3.18.x works, your problem might be different from what other people face here.  And if that is really so (3.18.x works and later kernels do not), the best option is to try git bisect.
Comment 51 Marian Poltak 2015-12-15 19:44:53 UTC
I have been using above patch for about one week now and had no problem since then so definitely the patch fixes the problem for me. But will continue to use it and test it.
But without the patch the connection has dropped after a few minutes and almost immediately after connected with both ethernet and wifi (or host only adapter from virtualbox).

I have Lenovo Y580 OpenSuse Leap 42.1 kernel 4.1.13
lspci | grep Ethernet
02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)

perhaps there are some differences between these ethernet cards ?

PS:
definitely working kernel was 3.11.0 with driver http://linuxwireless.org/download/compat-wireless-2.6/compat-wireless-2012-07-03-pc.tar.bz2
Comment 52 Patkós Csaba 2015-12-15 19:56:01 UTC
Created attachment 197451 [details]
attachment-11151-0.html

This is a very elusive issue. My onboard NIC on an HP desktop works
perfectly up until kernel 4.0. Any kernel above, 4.1 and above, breaks it,
though there are no changes to the driver between these kernels. So most
probably something changed __in the kernel__ that made the driver act up.

Comment 53 Patkós Csaba 2015-12-15 19:56:59 UTC
Created attachment 197461 [details]
attachment-11265-0.html

I forgot to mention that I have the same NIC in my Asus laptop. Works with
kernel 4.0. Didn't try anything newer.


Comment 54 Doaxan 2015-12-16 20:17:50 UTC
With distribution Manjaro I tested the Linux kernel, and got the following results:
4.2.6-1 NOT
4.3.0-1 NOT
4.4rc4-1 NOT
4.1.13-1 NOT
3.19.8.10-1 NOT
3.18.24-1 worked 
3.16.7.20-1 worked

In the cases where the network did not work, MTU 9000 helped. I hope for a solution to this problem for home users of Linux.
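
An ordered list of good and bad versions like this one is the kind of input that `git bisect` (suggested in comment 50) works on. The sketch below is a toy version of the idea, operating on the package versions from this comment rather than on commits. It assumes that once a version is bad all later ones are bad too, which may not hold for this bug, since other reporters saw different boundaries.

```python
# Results reported in this comment, ordered by kernel version (True = network worked).
RESULTS = [
    ("3.16.7.20-1", True),
    ("3.18.24-1", True),
    ("3.19.8.10-1", False),
    ("4.1.13-1", False),
    ("4.2.6-1", False),
    ("4.3.0-1", False),
    ("4.4rc4-1", False),
]


def first_bad(versions, is_good):
    """Binary search for the good/bad boundary.

    Assumes versions[0] is good, versions[-1] is bad, and that the
    results are monotonic in between. Returns the last good version,
    the first bad version and the number of versions tested.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[lo], versions[hi], tests


if __name__ == "__main__":
    good = dict(RESULTS)
    versions = [version for version, _ in RESULTS]
    print(first_bad(versions, lambda version: good[version]))
```

A real bisection between the corresponding kernel tags would need a kernel build and a test run at every step, but the number of steps grows only logarithmically with the number of commits in between.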
Comment 55 Jarod Wilson 2015-12-18 16:00:14 UTC
(In reply to Marian Poltak from comment #51)
> I have been using above patch for about one week now and had no problem
> since then so definitely the patch fixes the problem for me. But will
> continue to use it and test it.

Which above patch (or patches)? There are four. :)

> But without the patch the connection has dropped after a few minutes and
> almost immediately after connected with both ethernet and wifi (or host only
> adapter from virtualbox).

The patches all touch only the alx nic driver, they shouldn't have any impact on wifi or host only vbox networking...
Comment 56 Marian Poltak 2015-12-19 09:33:21 UTC
(In reply to Jarod Wilson from comment #55)
> (In reply to Marian Poltak from comment #51)
> > I have been using above patch for about one week now and had no problem
> > since then so definitely the patch fixes the problem for me. But will
> > continue to use it and test it.
> 
> Which above patch (or patches)? There are four. :)

the patches from this repository (for openSUSE Leap 42.1 x86_64):
http://download.opensuse.org/repositories/home:/tiwai:/bnc952621
> 
> > But without the patch the connection has dropped after a few minutes and
> > almost immediately after connected with both ethernet and wifi (or host
> only
> > adapter from virtualbox).
> 
> The patches all touch only the alx nic driver, they shouldn't have any
> impact on wifi or host only vbox networking...

I know the other connections are working fine, but when they are active and carry some traffic, it somehow makes the alx driver fail much faster (I do not know why; perhaps something about timing? Maybe the TCP/IP stack processes the other adapters sooner and therefore makes alx wait longer and fail? I really do not know).

Typical situation (checked via ifconfig):
1. Connect to the internet via Ethernet (alx working, RX overruns: 0).
2. Create an ad-hoc wifi network for internet sharing (alx working, RX overruns: 0).
3. Connect a smartphone to the shared wifi (alx working, RX overruns: 0).
4. Start browsing the net on the smartphone via the shared wifi (alx breaks after a few seconds, RX overruns start growing).
Ping over wifi (smartphone and PC) works without problems;
only ethernet (alx) cannot receive any packets.

It is similar when using the vbox host-only adapter.

The patches help only for scenario 1.

Previously I tried the above patches only when connected via a single ethernet connection, and the connection was stable. Then I tried both ethernet and wifi, but with no traffic on wifi, so the connection was still stable.
Unfortunately, this morning I tried the wifi sharing with my phone and started streaming from YouTube, and alx stopped working after a few seconds again, before the YouTube page had even fully loaded.
So in this case the patches did not solve the problem; the behaviour is still the same.
PS: the workaround of setting the MTU to 9000 also works in the wifi sharing scenario.
Comment 57 Marian Poltak 2015-12-19 09:38:55 UTC
I have not tried the last patch (attachment 196341 [details]) or built it from source because of lack of time, but I will definitely try it during the Christmas holiday.
Comment 58 Eugene A. Shatokhin 2015-12-20 18:51:02 UTC
Just FYI. 

The patch for alx from comment #46 helped on a system with AR8162 network adapter (PCI IDs: 1969:1090) and kernel 4.1.x in ROSA Linux. One of our users has such hardware.

Before the patch, the wired networking seemed not to work at all on that system. The workaround with MTU 9000 helped though. After the patch was applied, the wired connection has been working normally for at least several days now. No problems so far. 

Thanks for investigating the issue and providing workarounds and the patches!
Comment 59 Jarod Wilson 2016-01-05 23:14:25 UTC
(In reply to Eugene A. Shatokhin from comment #58)
> Just FYI. 
> 
> The patch for alx from comment #46 helped on a system with AR8162 network
> adapter (PCI IDs: 1969:1090) and kernel 4.1.x in ROSA Linux. One of our
> users has such hardware.
> 
> Before the patch, the wired networking seemed not to work at all on that
> system. The workaround with MTU 9000 helped though. After the patch was
> applied, the wired connection has been working normally for at least several
> days now. No problems so far. 
> 
> Thanks for investigating the issue and providing workarounds and the patches!

I've gone ahead and sent this patch to the netdev mailing list for review, since it hasn't caused any regressions, and shows significant improvements for at least one user.
Comment 60 Rainer Klier 2016-01-07 16:53:55 UTC
(In reply to Jarod Wilson from comment #59)
> 
> I've gone ahead and sent this patch to the netdev mailing list for review,
> since it hasn't caused any regressions, and shows significant improvements
> for at least one user.

since i use the patched driver from Takashi Iwai (using patch from comment #46) located in http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 i experienced the crash only one time.

so it is, of course, a significant improvement/enhancement.
Comment 61 Jarod Wilson 2016-01-07 19:03:37 UTC
(In reply to Rainer Klier from comment #60)
> (In reply to Jarod Wilson from comment #59)
> > 
> > I've gone ahead and sent this patch to the netdev mailing list for review,
> > since it hasn't caused any regressions, and shows significant improvements
> > for at least one user.
> 
> since i use the patched driver from Takashi Iwai (using patch from comment
> #46) located in
> http://download.opensuse.org/repositories/home:/tiwai:/bnc952621 i
> experienced the crash only one time.
> 
> so it is, of course, a significant improvement/enhancement.

This patch is now committed in davem's net-next tree, and should find its way into kernel 4.5. We can keep working on the remaining issues, of course.
Comment 62 James 2016-02-04 00:06:06 UTC
Jarod,

Thank you! Thank you! Thank you!

I have a computer with an affected network card. It is a Dell 2710 all in one and replacing the NIC was not possible. I have been tracking issues with this hardware for some time. Before I upgraded to Ubuntu 15.10 (kernel 4.2.0) I was using the workaround of a large MTU. Kernel 4.2.0 completely broke my hardware.

~$ lspci -vs 04:00
04:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)
	Subsystem: Dell Device 054b
	Flags: bus master, fast devsel, latency 0, IRQ 41
	Memory at f7800000 (64-bit, non-prefetchable) [size=256K]
	I/O ports at e000 [size=128]
	Capabilities: <access denied>
	Kernel driver in use: alx


Your patch from comment #46 fixes my card. I can remove the USB3 ethernet device I had been using.

Sincerely,
James
Comment 63 Nicolas Peters 2016-02-06 23:08:52 UTC
Jarod,

Thank you! Works fine for me and your patch has been merged into master.
https://github.com/torvalds/linux/commit/c406700cdf882b89cb036117414fcd8b0cc2656d

It will be available for Kernel 4.5


Nicolas
Comment 64 ldap.tester 2016-03-01 21:21:07 UTC
Jarod,
I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide.  It did not solve my problem.  Same symptoms as I have reported before.  I have a Dell 2710 All in One with an AR8161.
Comment 65 James 2016-03-04 15:13:34 UTC
I would like to retract my previous comment #62 of the patch fixing my problem. It improved my ethernet, but there are still issues. I have reverted to kernel 3.16.7 which works flawlessly.

For the benefit of someone trying to replicate my issues, here is my setup:
- Dell 2710 AIO (AR8161) system connected to an ethernet switch
- switch connected to a router/nat device
- router/nat device connected to a cable modem 
- MTU 1500 all through to the Internet

My memory is not perfect on which MTU setting caused/resolved which issue, but I set the MTU to 1 of 3 values:
- 8192
- 1500
- 1492 (normal for DSL, but not required for my cable connection)

MTU of 1500 did not work reliably.
One of the other MTU values made scp file copies stall at 2208 KB. These were to another Linux host on the same switch.
One of the other MTU settings caused Internet connectivity to be unreliable. (I do not have a packet capture to know exactly what was happening.) The symptom was that my program to sync files to Google Drive would stall on small files and never recover, even after a restart of the entire system.

Web browsing experienced frequent page stalls (my guess is packets timed out and had to be resent). Google Chrome seemed to handle this better than Firefox did.

I am happy to test other patches, but for now I am running an older kernel that functions flawlessly.


Thank you,
James
Comment 66 Feng Tang 2016-05-20 09:19:19 UTC
Created attachment 216961 [details]
new_skb_allocator
Comment 67 Feng Tang 2016-05-20 09:20:53 UTC
(In reply to ldap.tester from comment #64)
> Jarod,
> I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide.  It
> did not solve my problem.  Same symptoms as I have reported before.  I have
> a Dell 2710 All in One with an AR8161.

Can you try attachment 216961 [details]? I made it before I found this Bugzilla entry and Jarod's patch.
Comment 68 James 2016-05-21 11:17:15 UTC
(In reply to Feng Tang from comment #67)
<snip>
> can you try the attachment 216961 [details] , I made it before I googled
> this bugzilla and Jarod's patch

Hi Feng,

I applied your patch to kernel v.4.5.5 from Ubuntu Xenial and installed it on my machine. 

First thing I tried was to copy a 650MB ISO to another machine. This worked! Normally this would stall at some point and timeout. I only got ~4.5MB/sec transfer rate. Not sure why it was so low. Normal speed is 30-40MB/s.

I am now trying some general web surfing. This always seemed to expose problems as well. After more time I will report back with my findings.

Thank you,
James Crow
Comment 69 James 2016-05-21 11:43:08 UTC
(In reply to James from comment #68)
> (In reply to Feng Tang from comment #67)
> <snip>
> > can you try the attachment 216961 [details] , I made it before I googled
> > this bugzilla and Jarod's patch
> 
> Hi Feng,
> 
> I applied your patch to kernel v.4.5.5 from Ubuntu Xenial and installed it
> on my machine. 
> 
> First thing I tried was to copy a 650MB ISO to another machine. This worked!
> Normally this would stall at some point and timeout. I only got ~4.5MB/sec
> transfer rate. Not sure why it was so low. Normal speed is 30-40MB/s.
> 
> I am now trying some general web surfing. This always seemed to expose
> problems as well. After more time I will report back with my findings.
> 
> Thank you,
> James Crow

After further testing, my ethernet seems to be stable. SSH transfer speed (which was 30-40MB/s with older kernels) has been much slower. I just completed the copy of a 1.9GB file at an average speed of 1.4MB/s.

My Internet surfing has seemed to remain constant. With all other 4.0+ kernels I have seen significant issues. Pages partially load or load, but only after significant waits. My guess is some TCP timeout occurs and the browser resends the request.

Overall, this is progress, but the speed seems more like a 50MBit ethernet adapter rather than a gigabit card.

Thank you,
James Crow
Comment 70 James 2016-05-21 14:13:28 UTC
(In reply to James from comment #69)
<snip>
> 
> After further testing, my ethernet seems to be stable. SSH transfer speed
> (which was 30-40MB/s with older kernels) has been much slower. I just
> completed the copy of a 1.9GB file at an average speed of 1.4MB/s.
> 
> My Internet surfing has seemed to remain constant. With all other 4.0+
> kernels I have seen significant issues. Pages partially load or load, but
> only after significant waits. My guess is some TCP timeout occurs and the
> browser resends the request.
> 
> Overall, this is progress, but the speed seems more like a 50MBit ethernet
> adapter rather than a gigabit card.
> 
> Thank you,
> James Crow

One more data point, I tried to copy two large files (1.3G & 3.3G) from another host to my desktop. These transfers were at the normal speed 65-70MB/s.

It seems only when my card is sending data is there a slow down.

Thank you,
James Crow
Comment 71 olelukoie 2016-05-21 14:23:53 UTC
> can you try the attachment 216961 [details] , I made it before I googled
> this bugzilla and Jarod's patch

I've tested this patch with kernel 4.1.15 (the patch required a bit of rediffing) and it seems my AR8161 (Asus n56vz laptop) works like a charm (at least for the last half an hour) using MTU=1360. No speed problems (8 MiB/s, the maximum speed of my connection) and no disconnections. Tested with direct HTTP and FTP downloads and with some torrents (used these images: https://www.mageia.org/en/6/ ).
Comment 72 Feng Tang 2016-05-23 02:40:04 UTC
(In reply to James from comment #70)

> 
> One more data point, I tried to copy two large files (1.3G & 3.3G) from
> another host to my desktop. These transfers were at the normal speed
> 65-70MB/s.
> 
> It seems only when my card is sending data is there a slow down.

Hi James,

Thanks for trying it and for the detailed info.

For the slow TX issue, I just tested scp of a big file to another machine, and the TX speed here is about 35 MB/s on my Y580, similar to the RX speed (scp file copy). My kernel is 4.4 + the patch.

- Feng
Comment 73 Feng Tang 2016-05-23 03:16:00 UTC
(In reply to Feng Tang from comment #72)

> For the TX slow issue, I just did a test of scp a big file to another
> machine, and the TX speed here is about 35 MB/s on my Y580, similar to the
> RX speed (scp file copy). My kernel is 4.4 + the patch.
> 
> - Feng

I did another round of tests with kernel 4.6; the TX speed for scp of a big file
out is also around 30 MB/s.

Thanks, 
Feng
Comment 74 Feng Tang 2016-05-25 05:20:44 UTC
I did some more debugging, and the problem looks very likely to be related to the RX DMA address: an address of the form 0xXXXX..XXf80 is very likely to cause the RX overflow interrupt and corrupt the link.

For kernel 4.5.0, which merged Jarod's patch and works fine with my AR8161/Lenovo Y580, if I change __netdev_alloc_skb --> __alloc_page_frag() so that the allocated buffer can get an address ending in 0x...f80, it hits the same issue. So I tend to believe that the 0x..f80 address causes the silicon to behave abnormally.

Both Jarod's patch and mine actually keep the RX buffer from hitting the 0x..f80 address. My patch is a copy or port of Eric's patch and is more general, while Jarod's patch cannot guarantee that 0x...f80 never happens for some devices, MTU values, or kernels.
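
To illustrate why the buffer size can decide whether a "bad" address ever shows up, here is a toy userspace model, not driver code. It assumes fragments of a fixed size are carved downward from the top of a page-aligned region and checks whether any fragment start has a given pattern in its low 12 bits. The function name, the region size and the carving order are assumptions for illustration; the real skb layout adds headroom and alignment that this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model (hypothetical helper, not from the alx driver):
 * return 1 if carving 'fragsz'-byte fragments downward from the top of a
 * page-aligned region of 'region' bytes ever yields a fragment whose start
 * offset has 'pattern' in its low 12 bits, and 0 otherwise.
 */
static int frag_hits_pattern(size_t region, size_t fragsz, uint32_t pattern)
{
	size_t off = region;

	assert(fragsz > 0 && fragsz <= region);

	while (off >= fragsz) {
		off -= fragsz;			/* start of the next fragment */
		if ((off & 0xfff) == pattern)
			return 1;
	}
	return 0;
}
```

In this model, 128-byte fragments in a 4096-byte region hit 0xf80 on the very first fragment, while 2048-byte fragments never do. That is only meant to show how a change in buffer size can hide the problem for one configuration and not for another.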
Comment 75 Feng Tang 2016-05-30 06:42:07 UTC
Created attachment 218211 [details]
work_around_dma_issue.patch
Comment 76 Feng Tang 2016-05-30 06:44:19 UTC
Hi James and Ole Lukoie,

Could you help test the new work_around_dma_issue.patch? Thanks.

The idea is that, instead of adding a new custom allocator, we aim directly at the actual cause of the error.
Comment 77 olelukoie 2016-05-30 17:41:46 UTC
(In reply to Feng Tang from comment #76)
> Hi James and Ole Lukoie,
> 
> Could you help to test the new work_around_dma_issue.patch? thanks
> 
> The idea is instead of adding a new custom allocator, we shoot at the right
> target of error.

I think I'll be able to test this patch but I don't know when. Probably not before the weekend...
Comment 78 Rainer Klier 2016-06-01 07:22:01 UTC
for all openSUSE users:

Takashi Iwai kindly made a new kernel module with the latest patch to be tested.

the opensuse ticket for this is:
https://bugzilla.opensuse.org/show_bug.cgi?id=952621

this new kernel module is located in this repositories:
http://download.opensuse.org/repositories/home:/tiwai:/bnc952621

i have already installed the new module and up to now it seems to work.
but as i am using it only a few hours now, this does not really tell us much.
Comment 79 Feng Tang 2016-06-03 03:32:55 UTC
(In reply to Rainer Klier from comment #78)
> for all openSUSE users:
> 
> Takashi Iwai kindly made a new kernel module with the latest patch to be
> tested.
> i have already installed the new module and up to now it seems to work.
> but as i am using it only a few hours now, this does not really tell us much.

Thanks Rainer/Takashi Iwai for the try and update!

I read https://bugzilla.opensuse.org/show_bug.cgi?id=952621 carefully, and your problem looks different from most of the reports here. Your error info is:

[  287.796808] alx 0000:04:00.0 eth0: fatal interrupt 0x4019607, resetting

This ISR value indicates that all the fatal interrupts happened, such as
  ALX_ISR_PCIE_LNKDOWN
  ALX_ISR_DMAW
  ALX_ISR_DMAR

mixed with the RX and TX complete interrupts, which is really scary! :)
(I don't have a datasheet; I am only guessing from the driver and header file.)

The error most of us hit here is 0x8 (RXF overflow interrupt), so I suggest you open a new bug in the kernel bugzilla for this and give detailed info such as:
1. which kernel version is the last one that works well, and which version is the first that does not
2. the detailed kernel error message
3. your steps to reproduce the issue, like TX/RX of big files
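
For anyone who wants to compare ISR values from their own logs, a small userspace sketch follows that lists which bit positions are set in a value such as 0x4019607 or 0x8. It deliberately does not name the bits: the mapping from bit position to ALX_ISR_* name has to be looked up in the driver's header, and the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Store the positions of the set bits of 'isr' in 'bits' (lowest first)
 * and return how many were found. 'bits' must have room for 32 entries.
 */
static int isr_set_bits(uint32_t isr, int bits[32])
{
	int n = 0;

	assert(bits != NULL);

	for (int i = 0; i < 32; i++)
		if (isr & (UINT32_C(1) << i))
			bits[n++] = i;
	return n;
}

/* Print an ISR value followed by the positions of its set bits. */
static void isr_print(uint32_t isr)
{
	int bits[32];
	int n = isr_set_bits(isr, bits);

	printf("0x%08x:", (unsigned int)isr);
	for (int i = 0; i < n; i++)
		printf(" %d", bits[i]);
	printf("\n");
}
```

Calling isr_print(0x4019607) lists nine set bits (0, 1, 2, 9, 10, 12, 15, 16 and 26), while 0x8 is bit 3 alone, which makes the difference between the two kinds of report easy to see.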
Comment 80 Tobias Regnery 2016-06-03 10:16:45 UTC
(In reply to Feng Tang from comment #75)
> Created attachment 218211 [details]
> work_around_dma_issue.patch

So I tested your patch and my ethernet is still usable after several hours of iperf benchmarking. I see no performance difference between this patch and the custom skb allocator. 

Without one of those patches my AR8161 isn't usable after a few minutes of network traffic.
Comment 81 olelukoie 2016-06-06 18:52:48 UTC
(In reply to Feng Tang from comment #75)
> Created attachment 218211 [details]
> work_around_dma_issue.patch

I've done some testing too. Both patches (this one and the previous one) behave the same; there is no difference in network connection speed or stability.

Just one note: I'm a bit worried about a potential infinite loop in the latest patch:

retry:
skb = __netdev_alloc_skb ...;
...
if (((u32)skb->data & 0xfff) == 0xfc0) {
  ...
  dev_kfree_skb(skb);
  goto retry;
}

Is there any guarantee against constantly repeating wrong allocations?
Comment 82 Feng Tang 2016-06-07 09:47:49 UTC
(In reply to olelukoie from comment #81)
> 
> Are there any guarantee against constantly repeating wrong allocations?

Thank you for testing. Mathematically, the endless loop should not happen.

Meanwhile, Eric Dumazet has suggested another way: instead of retrying, we allocate 64 more bytes and, when 0x...fc0 is found, offset the buffer start by 64.

Could you also try this new patch?
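
The offset idea can be sketched in userspace as follows. This is only an illustration of the address arithmetic, not the attached patch: it assumes the buffer was allocated with 64 spare bytes, and the macro and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BAD_LOW12 0xfc0u	/* low 12 bits tested in the snippet from comment #81 */
#define SHIFT     64u		/* extra bytes allocated, as described above */

/*
 * Hypothetical helper: given the start address of a buffer that was
 * allocated with SHIFT spare bytes, return the address to use for RX.
 * If the start ends in 0xfc0, skip forward by SHIFT bytes; 0xfc0 + 64
 * is 0x1000, so the result begins exactly at the next 4 KiB boundary.
 */
static uintptr_t rx_buf_start(uintptr_t buf)
{
	uintptr_t start = buf;

	if ((start & 0xfff) == BAD_LOW12)
		start += SHIFT;

	/* One shift is always enough, so no loop or retry is needed. */
	assert((start & 0xfff) != BAD_LOW12);
	return start;
}
```

Because a single fixed shift always moves the start off the bad value, this variant has no retry path at all, which sidesteps the infinite loop question from comment #81.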
Comment 83 Feng Tang 2016-06-07 09:53:11 UTC
Created attachment 219281 [details]
new_dma_patch
Comment 84 Feng Tang 2016-06-07 09:57:50 UTC
Created attachment 219291 [details]
new dma patch
Comment 85 olelukoie 2016-06-09 05:28:31 UTC
The latest patch works too.
Comment 86 olelukoie 2016-06-24 05:19:35 UTC
I've just noticed that GKH used the patch with the custom skb allocator instead of the latest version called "new_dma_patch". Is that OK, or is there something wrong with the "new_dma_patch" version?

And another question: are there plans on applying the patch to current longterm kernels (4.1 & 4.4)?
Comment 87 Feng Tang 2016-06-24 07:39:16 UTC
(In reply to olelukoie from comment #86)
> I've just noticed that GKH used the patch with custom skb allocator instead
> of the latest version called "new_dma_patch". Is it ok and there's something
> wrong with "new_dma_patch" version?

No, I don't see anything wrong with the "new_dma_patch". I think GKH picked the new allocator patch only because it is the one in Linus' mainline tree.

> 
> And another question: are there plans on applying the patch to current
> longterm kernels (4.1 & 4.4)?

The patch won't apply as is (I tried). I'll try to follow the stable kernel rules and see what I can do.

By the way, I made a patch against the 4.4 kernel (it should apply to 4.1 as well). Could you help test it?

Thanks,
Feng
Comment 88 Feng Tang 2016-06-24 07:42:03 UTC
Created attachment 221141 [details]
dma patch for 4.1/4.4 stable kernel
Comment 89 olelukoie 2016-06-24 16:12:37 UTC
(In reply to Feng Tang from comment #87)
> btw, I make a patch against 4.4 kernel (and it could apply to 4.1 as well),
> could you help to test?

I already updated the kernel to version 4.4.13 a week ago, and the patch works normally with it.
Comment 90 Adrien DAUGABEL 2016-12-13 19:46:19 UTC
4.4.37 : no problems for me :)

Is the patch included in the kernel?
Comment 91 Jarod Wilson 2016-12-13 19:54:45 UTC
Yes, there are patches merged upstream and brought back to the stable trees that should solve all the problems reported here. I think this bug can probably be closed now.
Comment 92 Rainer Klier 2016-12-14 12:56:50 UTC
hi,

(In reply to Feng Tang from comment #79)
> (In reply to Rainer Klier from comment #78)
> > for all openSUSE users:
> > 
> while mixing with RX and TX complete ISR, which is really scary! :)
> (I don't have a datasheet, but only guess from the driver and header file)
> 
> And the most error we met is about 0x8 (RXF overflow interrupt), so I
> suggest you to new a bug in kernel bugzilla for this, and give detail info

thanks.

it seems that this is already done by somebody else:
https://bugzilla.kernel.org/show_bug.cgi?id=102171
