Bug 7517

Summary: r8169: Packet corruption on receive for 8111B chip
Product: Drivers
Reporter: Mike Isely (isely)
Component: Network
Assignee: Francois Romieu (romieu)
Status: CLOSED CODE_FIX
Severity: blocking
CC: isely, romieu
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.19-rc5
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
- Tarball containing lspci output from machine under test and saved wireshark capture output showing packet corruption
- sync with Realtek's init sequence
- debug helper
- Test results with patches applied
- align more carefully
- align even more carefully
- Test results with id=9567 patch applied

Description Mike Isely 2006-11-14 00:00:30 UTC
Most recent kernel where this bug did not occur: N/A
Distribution: Debian (but stock vanilla kernel.org kernel tree)
Hardware Environment: Intel Core 2 Duo E6300 on Gigabyte GA-945GM-S2 mainboard,
NIC under test running 100Mbps, full duplex
Software Environment: Debian testing installation (last updated early Nov 2006)
Problem Description: About 2/3 of received packets arrive corrupted when using
the r8169 driver to operate the onboard Realtek 8111B gigabit ethernet chip

Steps to reproduce:
modprobe r8169, then configure the interface appropriately for your network
ping the NIC from another machine
Notice that over 60% of the pings don't return

An examination of the traffic with wireshark running on the test machine reveals
that the machine is not returning most of the pings because the received packet
data is being corrupted.  The nature of the corruption is that the first 4 bytes
of the corrupted packet data are simply missing (and there are 4 bytes of garbage
appended at the end so the overall packet size stays the same).  Smells like a
fencepost error somewhere...
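
To make the described corruption concrete, here is a minimal Python sketch (not driver code) that models it: the first 4 bytes vanish and 4 filler bytes are appended so the length stays the same. The helper names, the filler value 0xAA and the example frame with its made-up MAC addresses are all assumptions for illustration. The sketch also hints at why such frames would plausibly show up as "LLC" in wireshark: after the shift, the Ethernet type/length field holds what used to be the IPv4 total-length field, and a value of 1500 or less is read as an 802.3 length rather than an EtherType. That explanation is an inference, not something stated in the report.

```python
SHIFT = 4  # number of bytes lost at the front, per the report

# Hypothetical 98-byte ping frame: made-up MACs, EtherType 0x0800 (IPv4),
# then the start of an IPv4 header (version/IHL 0x45, TOS 0, total length 84).
EXAMPLE_FRAME = bytes.fromhex("001122334455" "66778899aabb" "0800" "45000054") + bytes(80)


def simulate_shift(frame: bytes, shift: int = SHIFT) -> bytes:
    """Drop the first `shift` bytes and pad the tail so the length is unchanged."""
    if not 0 <= shift <= len(frame):
        raise ValueError("shift must be between 0 and len(frame)")
    return frame[shift:] + b"\xaa" * shift  # 0xAA stands in for the garbage bytes


def is_shifted_copy(expected: bytes, received: bytes, shift: int = SHIFT) -> bool:
    """True if `received` equals `expected` with its first `shift` bytes missing."""
    if len(expected) != len(received) or shift > len(expected):
        return False
    keep = len(expected) - shift
    return received[:keep] == expected[shift:]


def type_or_length(frame: bytes) -> int:
    """Return the 16-bit Ethernet type/length field (bytes 12 and 13)."""
    return int.from_bytes(frame[12:14], "big")
```

With the example frame, `type_or_length(EXAMPLE_FRAME)` is 0x0800, while `type_or_length(simulate_shift(EXAMPLE_FRAME))` is 84, the former IPv4 total length, which a dissector would treat as an 802.3 length field.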

I have not tried this with older kernels, but then before 2.6.19, the r8169
driver apparently did not support this chip.
Comment 1 Mike Isely 2006-11-14 00:07:37 UTC
Created attachment 9493 [details]
Tarball containing lspci output from machine under test and saved wireshark capture output showing packet corruption

The attached tarball contains the following:

1. lspci output of machine under test, showing specific data about the Realtek
8111B NIC where the problem is happening.

2. A saved packet capture file ("bad-pings").

Load up the packet capture in ethereal / wireshark and look at the packet
contents.  The machine under test is where this data was captured.  Its IP
address was 192.168.27.10.  The machine sending the ping requests was
192.168.27.1.  The packets labelled "LLC" by wireshark are examples of
corrupted data.  You can actually count those up and notice that they
correspond to the missing ping requests (the gap in the ICMP sequence numbers
matches the intervening count of these packets).
Comment 2 Francois Romieu 2006-11-14 14:08:47 UTC
Created attachment 9509 [details]
sync with Realtek's init sequence
Comment 3 Francois Romieu 2006-11-14 14:21:33 UTC
Created attachment 9510 [details]
debug helper

Please try the previous patch. I am not very confident that it will change
things but it deserves to be tested. If it does not work better, please apply
the debug patch above on top of it, start a ping test and capture the traffic.
If you avoid unrelated traffic on the link, it should not spam your logs too
much. The output of the test (capture file + kernel log + ping transmission
rate) would be welcome for something like a hundred or two hundred ICMP
packets.

-- 
Ueimor
Comment 4 Mike Isely 2006-11-14 18:49:51 UTC
Created attachment 9514 [details]
Test results with patches applied

I applied the initialization patch (which as expected didn't help) and the
patch that added some debug code.  Results are in the attachment.  There's a
readme file within the tarball that describes its contents.
Comment 5 Francois Romieu 2006-11-17 14:20:12 UTC
Created attachment 9550 [details]
align more carefully

- 69% packet loss.
- 1/3 of skb->data aligned on a 16-byte boundary
  2/3 of (skb->data - 4) aligned on a 16-byte boundary

No need to shout, it seems crystal clear. Can you add the patch above on top of
the previous series and repeat the last test?

Your .config would be welcome too.
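
The alignment breakdown above can be reproduced from logged buffer addresses with a small Python sketch. It only classifies addresses the way the two bullet points do and shows the usual round-up-to-boundary arithmetic; it is not the attached patch, and the function names and sample addresses are made up for illustration. The roughly 2/3 share of buffers sitting 4 bytes past a 16-byte boundary lines up with the 69% loss figure, which is presumably what makes the cause look clear.

```python
from collections import Counter


def classify_alignment(addr: int, boundary: int = 16, offset: int = 4) -> str:
    """Classify a buffer address relative to a `boundary`-byte boundary."""
    remainder = addr % boundary
    if remainder == 0:
        return "aligned"                 # skb->data itself is aligned
    if remainder == offset % boundary:
        return f"aligned+{offset}"       # (skb->data - offset) is aligned
    return "other"


def align_up(addr: int, boundary: int = 16) -> int:
    """Round `addr` up to the next multiple of `boundary` (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return (addr + boundary - 1) & ~(boundary - 1)


def tally(addrs) -> Counter:
    """Count how many addresses fall into each alignment class."""
    return Counter(classify_alignment(a) for a in addrs)
```

Feeding the skb->data values printed by a debug patch into `tally` would give the same kind of 1/3 versus 2/3 split, assuming the log exposes those addresses.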

-- 
Ueimor
Comment 6 Francois Romieu 2006-11-19 11:37:22 UTC
Created attachment 9567 [details]
align even more carefully

Oops

-- 
Ueimor
Comment 7 Mike Isely 2006-11-19 14:15:46 UTC
Created attachment 9568 [details]
Test results with id=9567 patch applied

All requested data is in the tar-bzip file
Comment 8 Francois Romieu 2006-11-19 15:55:35 UTC
It seems to perform better.

Can you apply patch #9567 on top of the latest 2.6.19-rc6 (without the debug
stuff) and check that nothing bad happens?

-- 
Ueimor
Comment 9 Anonymous Emailer 2006-11-19 17:38:29 UTC
Reply-To: isely@isely.net


Patch applied, new kernel (2.6.19-rc6) built, and it works.

Any chance this can get into 2.6.19?...

  -Mike

Comment 10 Francois Romieu 2006-11-21 13:19:47 UTC
bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> :
[...]
> Any chance this can get into 2.6.19?...

A patch that I had submitted two weeks ago has been postponed to 2.6.20.
It was related to the link management. I can guess the answer if a patch
which messes with the alignment in the skb is submitted.

Comment 11 Francois Romieu 2006-12-07 16:40:38 UTC
The patch has been committed to Linus's tree.

For details, see:
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cc9f022d97d08e4e36d38661857991fe91447d68

-- 
Ueimor