Bug 13404 - with atl1e: Corrupted MAC on input
Status: RESOLVED OBSOLETE
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assigned To: drivers_network@kernel-bugs.osdl.org
Depends on:
Blocks:
Reported: 2009-05-30 20:10 UTC by Gene Czarcinski
Modified: 2012-10-30 17:32 UTC (History)
CC List: 10 users

See Also:
Kernel Version: 2.6.29.4-167.fc11.x86_64
Tree: Mainline
Regression: No


Attachments
lspci -v output showing both old and new NIC interfaces (9.42 KB, text/plain)
2009-05-30 20:12 UTC, Gene Czarcinski
cpuinfo (3.17 KB, text/plain)
2009-06-09 13:26 UTC, Gene Czarcinski
meminfo (1.06 KB, text/plain)
2009-06-09 13:37 UTC, Gene Czarcinski
lspci -v output (10.35 KB, text/plain)
2009-07-01 17:01 UTC, Ville Törhönen

Description Gene Czarcinski 2009-05-30 20:10:31 UTC
There appears to be a serious problem with the "atl1e" driver, which supports the Attansic Technology Corp. Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller.

For me, the problem occurred when I was copying hundreds of gigabytes of ISO image files from one system to another using ssh's "scp" command/program.

At random points during copying I would get the error "Corrupted MAC on input" which then terminated the scp command.  This "test" was run multiple (about 6) times and each time it failed at some (random) point.
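For context: the "MAC" in this message is SSH's per-packet message authentication code, not the NIC's hardware address. ssh drops the connection when the code it computes over a received packet does not match the one the sender attached. The sketch below illustrates that check with Python's standard hmac module. It is a simplification: the key, the payload, and the helper names are made up for illustration, and real SSH also mixes a packet sequence number into the computation.

```python
import hashlib
import hmac


def tag(key: bytes, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the payload (stand-in for SSH's per-packet MAC)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key: bytes, payload: bytes, received_tag: bytes) -> bool:
    """Return True only if the payload still matches the tag the sender attached."""
    return hmac.compare_digest(tag(key, payload), received_tag)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating corruption in transit."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


if __name__ == "__main__":
    key = b"hypothetical-session-key"          # assumption: any shared secret
    payload = b"chunk of an ISO image" * 64    # assumption: arbitrary packet data
    sent_tag = tag(key, payload)
    print("intact packet accepted:", verify(key, payload, sent_tag))
    print("damaged packet accepted:", verify(key, flip_bit(payload, 1234), sent_tag))
```

A single inverted bit anywhere in the packet is enough to make the check fail, which is why scp stops at the first damaged packet instead of writing bad data.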

The software: Fedora 11 preview with "latest" updates and the 2.6.29.4-167.fc11.x86_64 kernel.

The hardware: ASUS M4A78 PRO motherboard, AMD Phenom II 940 processor (3 GHz, four CPUs), 8 GB system memory.  The Atheros Ethernet Controller is integrated on the motherboard. (I will be attaching the output of "lspci".)

Why do I believe it is the driver --

1.  I installed Fedora 10 running the 2.6.27.24-170.2.68.fc10.x86_64 kernel.  I again ran a half dozen tests with NO failures.

2.  I installed Fedora 11 preview with updates on another system (4400 dual processor) with a Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10) NIC. I then ran 4 test copies with NO failures.

3.  Finally, I installed a new PCI Express NIC on the Phenom system --  D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter (rev 13).  I then ran 8 copy tests with NO failures.

Conclusion: major problem in the atl1e driver

Although I did not test this and thus have no proof, I suspect that copying large amounts of data to this system via atl1e with something like ftp would also result in corrupted data, but the only way to detect it would be by checksumming the files.
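The checksumming mentioned above could be done along these lines: hash every file on the sending side, hash the copies on the receiving side, and compare the two listings. This is only a sketch. The function names and the choice of SHA-256 are assumptions, not something that was used in this report.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large ISO images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(source: dict[str, str], copy: dict[str, str]) -> list[str]:
    """List files that are missing on one side or whose digests differ."""
    names = sorted(set(source) | set(copy))
    return [n for n in names if source.get(n) != copy.get(n)]
```

manifest() would be run once on each machine against the local directory; only the two small digest listings then need to be brought together for differing_files().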
Comment 1 Gene Czarcinski 2009-05-30 20:12:06 UTC
Created attachment 21634 [details]
lspci -v output showing both old and new NIC interfaces
Comment 2 Andrew Morton 2009-06-02 02:23:21 UTC
Cc maintainers.
Comment 3 Chuck Ebbert 2009-06-03 08:01:14 UTC
Duplicate of bug 12282 ??
Comment 4 Gene Czarcinski 2009-06-03 18:25:05 UTC
It sure looks like a dup to me but I will leave this open for someone a lot more expert than me to judge.

Note: all "tests" that I ran consisted of attempting to copy approximately 600GB from one system to another.  The link between the two systems is a Netgear gigabit switch with short (~3 m) cables.  The "to" system was always the one with the atl1e driver.
Comment 5 Jay Cliburn 2009-06-05 12:50:27 UTC
Just to capture the maintainer's response in the bug report...

On Fri, 5 Jun 2009 12:44:19 +0800
Jie Yang <Jie.Yang@Atheros.com> wrote:

>  On  Friday, June 05, 2009 7:03 AM
> Jay Cliburn <jcliburn@gmail.com> wrote:
[...]
> > Jie,
> >
> > Could you please look into these reports of corruption?
> >
> 
> sure, I will try to reproduce this bug first.
> 
> Best wishes
> jie
Comment 6 Anonymous Emailer 2009-06-09 05:38:55 UTC
Reply-To: Jie.Yang@Atheros.com

Oh, I failed to reproduce this bug on my platform.

Mainboard: ASUS M3A79-T Deluxe
CPU: AMD Phenom(tm) 9950 Quad-Core Processor
Mem: 6G
Software platform: 2.6.29.1-102.fc11.x86_64

I used scp to copy about 4GB and it succeeded.

[root@localhost ~]# scp /tmp/Fedora-11-Preview-x86_64-DVD.iso root@192.168.0.1:/dev/null
Address 192.168.0.1 maps to leo-pc.users.atheros.com, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
root@192.168.0.1's password:
Fedora-11-Preview-x86_64-DVD.iso                                                             100% 4397MB  36.3MB/s   02:01

[root@localhost ~]# ifconfig eth7
eth7      Link encap:Ethernet  HWaddr 00:13:74:12:14:01
          inet addr:192.168.0.2  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::213:74ff:fe12:1401/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1692161 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14447145 errors:0 dropped:0 overruns:0 carrier:1
          collisions:0 txqueuelen:1000
          RX bytes:123255824 (117.5 MiB)  TX bytes:20737024713 (19.3 GiB)
          Interrupt:30

Attached is the detailed info: lspci, cpuinfo, meminfo.

Can you give me some advice on how to reproduce this bug?
Comment 7 Gene Czarcinski 2009-06-09 13:26:31 UTC
Created attachment 21826 [details]
cpuinfo
Comment 8 Gene Czarcinski 2009-06-09 13:37:58 UTC
Created attachment 21827 [details]
meminfo

OK, lspci -v output, /proc/cpuinfo and /proc/meminfo now attached.

My test involved copying a lot of data with scp: 182GB (not just 4GB).

Sometimes the error would occur almost immediately but other times it took a while.

I suspect (I have not done this) that a test could be constructed using netcat (nc) to repeatedly transfer a file with a known checksum and then test that file to see if it was different.

Anyway, I suspect [I have not looked/tested] that other data handled by scp is being corrupted besides that which produces the "Corrupted MAC on Input" error.

Do you need help setting up such a netcat test?  After you transferred the 4GB file, did you run a checksum to see if it was exactly the same as the original file?
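A repeated-transfer test of the kind described above might look roughly like the following. It uses plain Python sockets in place of netcat, and all function names are made up for illustration. Unlike scp, a plain TCP transfer does not abort on damaged data, so any corruption that slips past the TCP checksum shows up as a digest mismatch. The sketch runs both ends on loopback, which never touches the NIC and only proves that the harness works; for a real test the receiving half would have to run on the atl1e machine and the sending half on another host.

```python
import hashlib
import socket
import threading


def receive_digest(server: socket.socket, result: list) -> None:
    """Accept one connection, hash everything received until EOF, store the hex digest."""
    conn, _ = server.accept()
    digest = hashlib.sha256()
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            digest.update(chunk)
    result.append(digest.hexdigest())


def send_bytes(host: str, port: int, payload: bytes) -> None:
    """Connect to the receiver, send the payload, and close to signal end of data."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)


def run_trials(payload: bytes, trials: int, host: str = "127.0.0.1") -> int:
    """Transfer the payload repeatedly and return how many copies arrived altered."""
    expected = hashlib.sha256(payload).hexdigest()
    mismatches = 0
    for _ in range(trials):
        # Port 0 lets the OS pick a free port for each trial.
        with socket.create_server((host, 0)) as server:
            port = server.getsockname()[1]
            result: list = []
            receiver = threading.Thread(target=receive_digest, args=(server, result))
            receiver.start()
            send_bytes(host, port, payload)
            receiver.join()
        if result != [expected]:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # 4 MiB of a fixed byte pattern, sent three times over loopback.
    print("mismatches:", run_trials(b"\xa5" * (4 << 20), trials=3))
```

Because the failures reported here appear at random points in very large copies, a meaningful run would need many trials or a much larger payload than this demo uses.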
Comment 9 Ville Törhönen 2009-07-01 17:01:05 UTC
Created attachment 22168 [details]
lspci -v output

I am experiencing this same problem on my hardware:

ASUS P5QL Pro 
Intel Core 2 Duo E6600
4GB RAM
Fedora 11, running 2.6.29.5-191.fc11.x86_64

I've managed to reproduce this error by logging in to the new computer over SSH, and by transferring files to the computer with scp.

This is definitely a problem with the atl1e driver.
Comment 10 Jay Cliburn 2009-07-01 18:44:00 UTC
[Adding Atheros maintainer to cc list.]

Comment 11 Dietrich Clauss 2010-02-03 13:49:33 UTC
Same here on an Aspire 6530G laptop (AMD Turion X2).

I tried to switch off TSO and boot with maxcpus=1.  This had no effect.

The problem disappears when the machine is under heavy load.  Running a loop like 'while true; do false; done' while the network activity takes place makes the problem disappear.
Comment 12 account 2010-11-21 05:15:48 UTC
Had the same problem with the same hardware as the original poster (Asus M4A78 PRO, AMD Phenom II 940). After changing network cables and switches, I finally tried a BIOS update and that fixed it. So it doesn't seem to be a driver problem, at least for this hardware setup.
Comment 13 nyxkn 2011-03-28 22:11:48 UTC
My controller is "Atheros Communications AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0)", as per lspci, and I've been having the same issue. 
However, it appears to have been fixed (at least for me) with the latest Atheros AR81 driver. The package name is "AR81Family-linux-v1.0.1.14.tar.gz", and it's available from http://partner.atheros.com/Drivers.aspx. modinfo of this new atl1e module now displays "1.0.1.14" in the version field, while the one that came with my kernel 2.6.37.4 had "1.0.0.7-NAPI". 
Maybe the kernel driver module needs to be updated?
(This could probably also fix the issues from bug 12282 and maybe bug 27712)
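To compare driver versions across machines, the version field that modinfo prints can be read programmatically. The following is a small sketch; the helper names are hypothetical, and it assumes the usual "field: value" layout of modinfo output.

```python
import re
import subprocess
from typing import Optional


def parse_version(modinfo_text: str) -> Optional[str]:
    """Extract the value of the 'version:' field from modinfo output, if present."""
    # Anchoring at line start keeps 'srcversion:' from matching.
    match = re.search(r"^version:\s*(\S+)", modinfo_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def installed_driver_version(module: str = "atl1e") -> Optional[str]:
    """Run modinfo for the module and return its version field (None if the field is missing)."""
    output = subprocess.run(
        ["modinfo", module], capture_output=True, text=True, check=True
    ).stdout
    return parse_version(output)
```

Note that modinfo describes the module file installed for the running kernel, so the result only reflects the driver in use once that module has actually been (re)loaded.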
Comment 14 Alan 2012-10-30 17:32:50 UTC
If this is still seen on modern kernels, then please re-open/update.
