Bug 6807

Summary: r8169: freeze at higher speeds and MCE
Product: Drivers Reporter: Mourad De Clerck (bugs-kernel)
Component: Network Assignee: Francois Romieu (romieu)
Status: RESOLVED CODE_FIX    
Severity: normal CC: 7eggert, christian, dion, eike-kernel, flo, jost, kernel.org, leonard.norrgard, mwflaher
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.29rc6 Subsystem:
Regression: No Bisected commit-id:
Attachments: current kernel .config
full dmesg log
full lspci output
Fix a performance regression on plain 8169
lspci -nn -vvv
dmesg
r8169 oops on AMD64 machine
Patch to get r8168 Realtek module to compile cleanly with 2.6.29 kernel

Description Mourad De Clerck 2006-07-10 02:01:34 UTC
Most recent kernel where this bug did not occur:
This is recently bought hardware, and I haven't found an older kernel where this
bug did not occur. 

Distribution:
Debian unstable


Hardware Environment:
nforce2 chipset, sata_sil/r8169 combo pci card


Software Environment:
wget, scp


Problem Description:

The network card seems pretty stable and functional at low speeds. But as soon
as I transfer things at relatively higher speeds (> 10MB/sec) it locks up. CPU
intensive transfers (like scp) will usually lock it up faster than wget, but
given a large enough transfer (1GB) it will lock up with wget too.

When it locks up, it locks up hard - keyboard lights don't work etc.

I include some pointers to previous discussion on the netdev mailing list:

http://marc.theaimsgroup.com/?l=linux-netdev&m=114986904805281&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115010829624722&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115065165514318&w=2


Steps to reproduce:

Transfer a 1GB file with wget or scp at 100MBit or Gigabit speeds.
Comment 1 Mourad De Clerck 2006-07-10 02:04:26 UTC
Created attachment 8517 [details]
current kernel .config
Comment 2 Mourad De Clerck 2006-07-10 02:06:13 UTC
Created attachment 8518 [details]
full dmesg log
Comment 3 Mourad De Clerck 2006-07-10 02:07:21 UTC
Created attachment 8519 [details]
full lspci output
Comment 4 Mourad De Clerck 2006-07-10 02:15:30 UTC
As an additional data point:
* the r1000 driver from Realtek has the same issue
* windows 2000 and its driver are perfectly stable
Comment 5 Matt Flaherty 2006-09-05 05:40:23 UTC
I'm seeing this as well, except with a Netgear GA311 card in PCI slot 3 of an
Abit VT7 motherboard. Found this bug searching around before I do a kernel
upgrade, looking to see if anything sounds familiar.

-------------------
storage:/usr/share# uname -a
Linux storage 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux

storage:/usr/share# lspci
0000:00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 0258
0000:00:00.1 Host bridge: VIA Technologies, Inc.: Unknown device 1258
0000:00:00.2 Host bridge: VIA Technologies, Inc.: Unknown device 2258
0000:00:00.3 Host bridge: VIA Technologies, Inc.: Unknown device 3258
0000:00:00.4 Host bridge: VIA Technologies, Inc.: Unknown device 4258
0000:00:00.7 Host bridge: VIA Technologies, Inc.: Unknown device 7258
0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI Bridge
0000:00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169
Gigabit Ethernet (rev 10)

0000:00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID
Controller (rev 80)
0000:00:0f.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
0000:00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
Controller (rev 81)
0000:00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
Controller (rev 81)
0000:00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
Controller (rev 81)
0000:00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
Controller (rev 81)
0000:00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
0000:00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [K8T800 South]
0000:01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX
5200] (rev a1)

storage:/usr/share# dmesg
....
r8169 Gigabit Ethernet driver 1.2 loaded
ACPI: PCI interrupt 0000:00:0a.0[A] -> GSI 18 (level, low) -> IRQ 169
eth0: Identified chip type is 'RTL8169s/8110s'.
eth0: RTL8169 at 0xe0820000, 00:14:6c:c1:b2:07, IRQ 169
eth0: Auto-negotiation Enabled.
eth0: 1000Mbps Full-duplex operation.
....
Comment 6 Francois Romieu 2006-12-07 16:43:07 UTC
Please give the upcoming 2.6.20-rc1 a try.

-- 
Ueimor
Comment 7 Leonard Norrgard 2006-12-29 01:19:09 UTC
I see this too, with Linus' tree as of now, 2.6.20rc2-git (29th Dec).

To trigger the bug I did an "scp -pr remote:hugefiles/ .".  I was expecting the
crash, so I let it work for a few minutes.  I then decided I'd browse the web
while the copying was underway.  As soon as the browser window had been restored
(un-minimized), the system froze.

I'll hook up a serial port debugging cable later today and do some more testing.

The motherboard (MSI K9A Platinum) has two ports (identifying as different
chips), for this test the first one below was used.

CPU:
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 5200+

# uname -a
Linux x 2.6.20-rc2-git #0 SMP Fri Dec 29 03:54:00 EET 2006 x86_64 GNU/Linux

# lspci -nn -vvv
...
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller [10ec:8168] (rev 01)
        Subsystem: Micro-Star International Co., Ltd. Unknown device [1462:280c]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR+ <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: I/O ports at a800 [size=256]
        Region 2: Memory at fe9ff000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at fe9c0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1
Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 1024 bytes, PhantFunc 0, ExtTag+
                Device: Latency L0s <1us, L1 unlimited
                Device: AtnBtn+ AtnInd+ PwrInd+
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
                Link: Speed 2.5Gb/s, Width x1
        Capabilities: [84] Vendor Specific Information

03:03.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8169SC
Gigabit Ethernet [10ec:8167] (rev 10)
        Subsystem: Micro-Star International Co., Ltd. Unknown device [1462:280c]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (8000ns min, 16000ns max), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 21
        Region 0: I/O ports at b800 [size=256]
        Region 1: Memory at feaff400 (32-bit, non-prefetchable) [size=256]
        Expansion ROM at dfe00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-


# dmesg
....
r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 18 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:02:00.0 to 64
eth0: RTL8168b/8111b at 0xffffc20000042000, 00:16:17:9b:26:ca, IRQ 18
r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded
ACPI: PCI Interrupt 0000:03:03.0[A] -> GSI 21 (level, low) -> IRQ 21
eth1: RTL8169sc/8110sc at 0xffffc20000044400, 00:16:17:9b:26:cb, IRQ 21
...
r8169: eth0: link up
r8169: eth0: link up
...
eth0: no IPv6 routers present
Comment 8 Francois Romieu 2006-12-29 03:15:38 UTC
Created attachment 9961 [details]
Fix a performance regression on plain 8169
Comment 9 Francois Romieu 2006-12-29 03:20:59 UTC
Leonard, can you send your .config and full dmesg ?

Please add the patch above to your 2.6.20-rc2. It will (almost surely) not fix
your problem but the driver is wrong without it.

-- 
Ueimor
Comment 10 Leonard Norrgard 2006-12-29 06:47:13 UTC
Francois,

Good news - in further tests, the driver passes with a clean record. The machine
kept freezing, but finally I shut down X11 and did the tests from the consoles.
Then it was no problem at all to transfer more than 35 GB with two simultaneous
"scp -pr" commands on a completely saturated 100 Mbps link to two other (much
slower machines) for a continuous link speed of about 10.8 MB/s, if scp's
numbers for the huge files are to be believed.
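
As a rough cross-check of the figure above, the reported rate can be converted to a share of the nominal link speed. This is only a sketch: the function names are made up for illustration, and it assumes scp's "MB" means 10^6 bytes (if scp counts 2^20 bytes, the result is a few percent higher).

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert MB/s (assumed 10**6 bytes) to Mbit/s (10**6 bits)."""
    return mbytes_per_s * 8.0


def link_utilisation(mbytes_per_s: float, link_mbits: float) -> float:
    """Fraction of the nominal link rate taken up by the payload."""
    return mbytes_to_mbits(mbytes_per_s) / link_mbits


rate = 10.8  # MB/s, the figure scp reported
print(f"{mbytes_to_mbits(rate):.1f} Mbit/s payload")
print(f"{link_utilisation(rate, 100.0):.0%} of a 100 Mbit/s link")
```

Once protocol overhead is added on top of the payload rate, this is consistent with the link being close to saturated.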

Further support that the driver is ok is that 1) in the console, I was able to
watch TV using aatv(1) without problems, while under X11 it would crash within
2-3 seconds, most often immediately and that 2) glxgears would likewise crash
very fast. All in all, it looks like the gfx card is to blame (a brand new,
previously unproven one), not the r8169 driver.

I will do further tests tonight, with a crossover cable between the two ports on
the motherboard.  Unfortunately it's just a CAT-5 cable, so I probably won't be
able to reach gigabit speeds. I will also test your latest patch then and post a
final note here.
Comment 11 Leonard Norrgard 2006-12-31 06:04:39 UTC
Francois,

More good news: here are the results of round 2 of my tests at 100 Mbps (I don't
have 1000 Mbps hardware at hand).

Test setup 1: three machines a, b and c. Machine b has two realtek ports (b1 and
b2), a and c have other makes. I set up two simultaneous nc pipes a>b1>c and
a<b2<c, each starting with "cat knoppix.iso|" and ending in "|md5sum", comparing
the sums. The iso image was 695 MiB. The test was repeated three times.

Result: all ok.

Test setup 2: The same machines, but this time there are four pipes, set up as
follows: a<b1 and a>b1, b2<c and b2>c, all four cat:ing the same image and
md5:ing. The test was repeated three times.

Result: all ok.

Both tests were done using git master as of today on machine b, with no X11
running on b (as my machine crashes then).
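
For anyone wanting to repeat this kind of check without nc, the idea behind the "cat | nc ... | md5sum" pipes can be sketched in Python: stream data over a TCP connection, hash what arrives, and compare with the hash of what was sent. This is an illustration only, not the commands used in the tests; the helper names are hypothetical, and it runs over loopback, so it exercises the method rather than the NIC. To test real hardware, the sender and receiver would have to run on different machines.

```python
import hashlib
import socket
import threading

CHUNK = 64 * 1024  # read size in bytes; arbitrary choice for this sketch


def md5_of_stream(chunks) -> str:
    """Hash an iterable of byte chunks, as md5sum does at the end of a pipe."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def receive_md5(server: socket.socket, result: dict) -> None:
    """Accept one connection and hash everything received until EOF."""
    conn, _ = server.accept()
    with conn:
        result["md5"] = md5_of_stream(iter(lambda: conn.recv(CHUNK), b""))


def send_and_compare(payload: bytes, host: str = "127.0.0.1") -> bool:
    """Send payload over TCP and report whether both ends see the same MD5."""
    result = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        receiver = threading.Thread(target=receive_md5, args=(server, result))
        receiver.start()
        with socket.create_connection((host, port)) as client:
            client.sendall(payload)  # closing the socket signals EOF
        receiver.join()
    return result["md5"] == hashlib.md5(payload).hexdigest()
```

Calling `send_and_compare(open("knoppix.iso", "rb").read())` would mimic one direction of one pipe; running several transfers concurrently corresponds to the four-pipe setup.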
Comment 12 Mourad De Clerck 2006-12-31 07:54:48 UTC
I'll verify and try to confirm with the above patch on rc2, compiling as I type
this...
Comment 13 Francois Romieu 2006-12-31 08:05:22 UTC
bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> :
[...]
> I'll verify and try to confirm with the above patch on rc2, compiling
> as I type this...

Even if it works outside of X, please, please, take the time to attach
a complete dmesg, a .config and an 'lspci -vvx'. It helps to find bug
patterns.

Comment 14 Francois Romieu 2006-12-31 08:08:02 UTC
Francois:
[...]

The previous message was intended for Leonard, sorry.

-- 
Ueimor
Comment 15 Mourad De Clerck 2006-12-31 19:32:34 UTC
I just got back from New Year's festivities, and had a kernel waiting for me to
try out. I just did, and am happy to report that sending a 200MB and 500MB file
with scp seemed to have worked flawlessly. I'll do some further testing with a
clearer head, but it seems to look good for the time being. I'll add further
comments later.

Happy New Year, by the way ;)
Comment 16 Leonard Norrgard 2007-01-02 00:57:12 UTC
For lspci/config/dmesg files, please see my attachments in Bug 7759, which is
for the same box.
Comment 17 Francois Romieu 2007-02-19 15:22:57 UTC
Does the current kernel fix the issue for everybody ?

I'd welcome a datapoint before publishing new stuff.

-- 
Ueimor
Comment 18 Mourad De Clerck 2007-02-20 16:53:09 UTC
Just tested on 2.6.20 - and sad to report I still have the same issue.

I don't know what changed since rc2; I really thought we had a winner then.
Maybe I just got lucky that night.

I triggered the bug now by booting in single mode, and transferring a big file
using scp. (It doesn't trigger immediately, I had to try twice with files >500MB
before it locked up).

I tried with CONFIG_R8169_NAPI set and unset. I'm still going to try with
CONFIG_R8169_VLAN unset, just in case (I seem to remember with my rc2 config
that they were both unset)

I'll keep you posted.
Comment 19 Christian Rish 2007-03-28 18:42:21 UTC
r8169 consistently hangs on high loads for me as well.

Test setup: Two identical servers, running 'iperf -s' on one and 'iperf -c <IP>'
on the other.

Result: No traffic gets through. ifconfig reports: RX packets:33 errors:0
dropped:20 overruns:0 frame:20.
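
When comparing runs, it can help to pull these counters out of the ifconfig text mechanically. A minimal sketch, assuming the old net-tools "name:value" layout shown above; `parse_counters` is a hypothetical helper, not part of any tool.

```python
import re


def parse_counters(line: str) -> dict:
    """Extract 'name:value' pairs from an ifconfig RX/TX counter line."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


rx = parse_counters("RX packets:33 errors:0 dropped:20 overruns:0 frame:20")
suspicious = {name: count for name, count in rx.items()
              if name != "packets" and count > 0}
print(rx)
print("non-zero error counters:", suspicious)
```

Here the non-zero "dropped" and "frame" counters stand out against only 33 received packets.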

We're running Ubuntu Edgy on Opteron 1218 (SMP). 

# uname -a
Linux dub 2.6.20.4 #1 SMP Wed Mar 28 22:42:31 CEST 2007 x86_64 GNU/Linux

Comment 20 Christian Rish 2007-03-28 18:43:53 UTC
Created attachment 10986 [details]
lspci -nn -vvv
Comment 21 Christian Rish 2007-03-28 18:44:20 UTC
Created attachment 10987 [details]
dmesg
Comment 22 Christian Rish 2007-03-28 18:50:11 UTC
I compiled a kernel with '#define RTL8169_DEBUG 1' in r8169.c to get more
debugging information. However, I see no extra information in the logs. Feel
free to instruct me how to provide additional debugging information.
Comment 23 Francois Romieu 2007-03-29 00:16:21 UTC
bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> :
[...]
> I compiled a kernel with '#define RTL8169_DEBUG 1' in r8169.c to get more
> debugging information. However, I see no extra information in the logs. Feel
> free to instruct me how to provide additional debugging information.

Can you give a try to the patchkit available at:
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc4/r8169-20070316

Comment 24 Christian Rish 2007-03-29 15:27:28 UTC
FYI: Patching against 2.6.21-rc4:

# for f in ../patches/r8169/*.patch; patch -p1 --input=$f          
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c
Reversed (or previously applied) patch detected!  Assume -R? [n] 
Apply anyway? [n] y
Hunk #1 FAILED at 250.
Hunk #2 FAILED at 2518.
2 out of 2 hunks FAILED -- saving rejects to file drivers/net/r8169.c.rej
patching file drivers/net/r8169.c
patching file drivers/net/r8169.c

I will try compiling anyway.
Comment 25 Francois Romieu 2007-03-29 15:40:29 UTC
christian@rishoj.net:
> FYI: Patching against 2.6.21-rc4:
> 
> # for f in ../patches/r8169/*.patch; patch -p1 --input=$f          

Please echo the name of the patch. The series should contain 13 patches.

Comment 26 Christian Rish 2007-03-29 15:51:34 UTC
# for f in ../patches/r8169/*.patch; do echo "Applying $f"; patch -p1 --input=$f; done
Applying ../patches/r8169/0001-r8169-fix-suspend-resume-for-down-interface.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0002-r8169-add-per-device-hw_start-handler-1-2.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0003-r8169-add-per-device-hw_start-handler-2-2.patch
patching file drivers/net/r8169.c
Applying
../patches/r8169/0004-r8169-merge-with-version-6.001.00-of-Realtek-s-r8169-driver.patch
patching file drivers/net/r8169.c
Applying
../patches/r8169/0005-r8169-merge-with-version-8.001.00-of-Realtek-s-r8168-driver.patch
patching file drivers/net/r8169.c
Applying
../patches/r8169/0006-r8169-confusion-between-hardware-and-IP-header-alignment.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0007-r8169-small-8101-comment.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0008-r8169-remove-the-media-option.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0009-r8169-cleanup.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0010-r8169-MSI-support.patch
patching file drivers/net/r8169.c
Applying
../patches/r8169/0011-r8169-add-bit-description-for-the-TxPoll-register.patch
patching file drivers/net/r8169.c
Applying
../patches/r8169/0011-r8169.c-add-bit-description-for-the-TxPoll-register.patch
patching file drivers/net/r8169.c
Reversed (or previously applied) patch detected!  Assume -R? [n] 
Apply anyway? [n] 
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file drivers/net/r8169.c.rej
Applying
../patches/r8169/0012-r8169-align-the-IP-header-when-there-is-no-DMA-constraint.patch
patching file drivers/net/r8169.c
Applying ../patches/r8169/0013-r8169-mac-address-change-support.patch
patching file drivers/net/r8169.c

Turns out patch 0011 was there twice, though not in the series file. I suppose
I ought to learn to use quilt.

Ignoring one of the duplicates, the series applies. Compiling now...
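
A duplicate like this can be caught before applying anything by grouping the patch files by their four-digit prefix. The sketch below assumes the "NNNN-description.patch" naming seen in the listing above; the function name is hypothetical.

```python
import re
from collections import defaultdict


def duplicate_sequence_numbers(filenames):
    """Return {prefix: [names]} for every numeric prefix used more than once."""
    by_number = defaultdict(list)
    for name in filenames:
        match = re.match(r"(\d{4})-", name)
        if match:
            by_number[match.group(1)].append(name)
    return {number: names for number, names in by_number.items() if len(names) > 1}


patches = [
    "0010-r8169-MSI-support.patch",
    "0011-r8169-add-bit-description-for-the-TxPoll-register.patch",
    "0011-r8169.c-add-bit-description-for-the-TxPoll-register.patch",
    "0012-r8169-align-the-IP-header-when-there-is-no-DMA-constraint.patch",
]
print(duplicate_sequence_numbers(patches))
```

In practice the list would come from `os.listdir()` on the patch directory; applying only the files named in the series file avoids the problem altogether.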
Comment 27 Christian Rish 2007-03-29 16:48:23 UTC
Perfect! After applying the patchset:

% iperf -i1 -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.198.56.22 port 5001 connected with 10.198.56.23 port 48526
[  4]  0.0- 1.0 sec    101 MBytes    847 Mbits/sec
[  4]  1.0- 2.0 sec    102 MBytes    855 Mbits/sec
[  4]  2.0- 3.0 sec    102 MBytes    855 Mbits/sec
[  4]  3.0- 4.0 sec    102 MBytes    855 Mbits/sec
[  4]  4.0- 5.0 sec    102 MBytes    855 Mbits/sec
[  4]  5.0- 6.0 sec    102 MBytes    855 Mbits/sec
[  4]  6.0- 7.0 sec    102 MBytes    855 Mbits/sec
[  4]  7.0- 8.0 sec    102 MBytes    855 Mbits/sec
[  4]  8.0- 9.0 sec    102 MBytes    855 Mbits/sec
[  4]  9.0-10.0 sec    102 MBytes    855 Mbits/sec
[  4]  0.0-10.0 sec  1019 MBytes    854 Mbits/sec

% ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:18:E7:16:04:7C  
          inet addr:10.198.56.22  Bcast:10.255.255.255  Mask:255.0.0.0
          inet6 addr: fe80::218:e7ff:fe16:47c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:7200  Metric:1
          RX packets:150238 errors:0 dropped:0 overruns:0 frame:0
          TX packets:75161 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1078873520 (1.0 GiB)  TX bytes:4960790 (4.7 MiB)
          Interrupt:11 Base address:0x6c00 

This is much appreciated.

Any idea when this patchset will make it into the kernel?
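
The two columns of the iperf report can be checked against each other. The sketch below assumes the iperf 2 convention that "MBytes" means 2^20 bytes while "Mbits" means 10^6 bits, which is what the figures above suggest; the regular expression and helper names are illustrative only.

```python
import re

INTERVAL = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-\s*([\d.]+) sec\s+([\d.]+) MBytes\s+([\d.]+) Mbits/sec"
)


def parse_interval(line: str):
    """Return (start_s, end_s, mbytes, mbits_per_s) or None for other lines."""
    match = INTERVAL.search(line)
    if match is None:
        return None
    return tuple(float(field) for field in match.groups())


def recomputed_mbits(mbytes: float, seconds: float) -> float:
    """Recompute Mbit/s assuming MBytes = 2**20 bytes and Mbit = 10**6 bits."""
    return mbytes * 2**20 * 8 / seconds / 1e6


summary = parse_interval("[  4]  0.0-10.0 sec  1019 MBytes    854 Mbits/sec")
start, end, mbytes, reported = summary
print(f"reported {reported:.0f}, recomputed {recomputed_mbits(mbytes, end - start):.1f} Mbit/s")
```

The recomputed value agrees with the reported one to within rounding, so the run really did sustain roughly 85% of gigabit line rate.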
Comment 28 Jost Diederichs 2007-04-01 23:44:18 UTC
Created attachment 11023 [details]
r8169 oops on AMD64 machine

I have been following this bug, thinking it might be related to my problem, see

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231269
recap: with recent kernels I couldn't even ifup my r8169 card. 
So I tried the above patchkit set on the most recent fedora development kernel
- and - voila, no panic.
However, I then proceeded to try the patchkit on the most recent kernel from
here, ie 2.6.21-rc5-git7 (using the fedora devel config) and the symptoms are
just like before. Panic as soon as the device is configured (ifup eth0). The
module insmod's fine otherwise. 
The kernel dump is attached. 
I hope this is not totally unrelated.
Comment 29 Francois Romieu 2007-04-02 13:44:33 UTC
bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> :
[...]
> So I tried the above patchkit set on the most recent fedora development kernel
> - and - voila, no panic.

Ok.

> However, I then proceeded to try the patchkit on the most recent kernel from
> here, ie 2.6.21-rc5-git7 (using the fedora devel config) and the symptoms are
> just like before. Panic as soon as the device is configured (ifup eth0). The
> module insmod's fine otherwise. 

Ok.

> The kernel dump is attached. 
> I hope this is not totally unrelated. 

Please try against latest 2.6.21-rcX-git_of_the_day the patchkit available at:
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070403

Comment 30 Jost Diederichs 2007-04-02 20:58:33 UTC
success.
I have tried the patch on the above mentioned 2.6.21-rc5-git7 and also on the
latest 2.6.21-rc5-git9. 
I have done various stress tests, including a script that repeatedly ran
modprobe r8169 ; ifup eth0 ; ifdown eth0 ; rmmod r8169.
No problems so far. Thanks
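
The script itself was not posted. A minimal sketch of such a load/unload loop might look as follows; the command sequence is taken from the comment above, while the function name and the dry-run default are assumptions added so the sketch is safe to run anywhere.

```python
import subprocess

# One cycle, in the order quoted in the comment.
CYCLE = [
    ["modprobe", "r8169"],
    ["ifup", "eth0"],
    ["ifdown", "eth0"],
    ["rmmod", "r8169"],
]


def run_cycles(count: int, dry_run: bool = True) -> list:
    """Repeat the cycle; with dry_run=True only record what would be run."""
    executed = []
    for _ in range(count):
        for command in CYCLE:
            executed.append(command)
            if not dry_run:
                # Needs root; check=True stops at the first failing command.
                subprocess.run(command, check=True)
    return executed


for command in run_cycles(2):
    print(" ".join(command))
```

Running it for real requires `run_cycles(n, dry_run=False)` as root on a machine where losing eth0 for a while is acceptable.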
Comment 31 Francois Romieu 2007-07-11 14:30:26 UTC
Mourad, will you be kind enough to give 2.6.23-rc1 a try when it goes out ?

-- 
Ueimor
Comment 32 Mourad De Clerck 2007-07-11 21:23:38 UTC
OK, I will.
Comment 33 Mourad De Clerck 2007-07-30 14:05:21 UTC
I just tried with 2.6.23-rc1, and it seemed to work... at first.

I booted in single user mode, and transferred 1GB of data to another machine - twice. This succeeded, however I made the observation that it only transferred at 100Mbit speed - a quick check with mii-tool confirmed this:
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
... no gigabit speed available.

Just to make sure the stability problems were fixed, I started using the machine like I normally do - I wanted to see if it stayed stable after a day or so of "normal" use. Sadly enough, it locked up within 5 minutes, while finishing copying a file (using Nautilus) to a (cifs) network share.

As usual, it became completely unresponsive, no mouse movement, no capslock, no SysRq, no console switching, no network login. However, there was incessant harddisk activity, like it was thrashing continuously.

So basically:
- gigabit speeds don't work
- freezes still happen, albeit a lot less quickly than it used to. I managed to transfer quite a lot of data before it locked up. When it finally did freeze, there seemed to be a lot of harddisk activity (swapping?).
Comment 34 Francois Romieu 2007-07-30 14:45:34 UTC
1. CIFS == user space smbd or in kernel cifs support ?
   It may make sense to monitor the swap/mem activity with 'vmstat 1' during
   the file copy.
2. Sorry for the gigabit regression :o/
   Can you send the output of 'mii-tool -vv eth0' for an old kernel and
   for 2.6.23-rc1 ?
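
Since the machine freezes hard, anything still sitting in a buffer is lost. One way to keep a record, sketched here as an alternative to watching 'vmstat 1' on screen, is to sample the swap counters from /proc/vmstat and fsync each line to disk. The `pswpin` and `pswpout` counters are standard /proc/vmstat fields; the function names and log format are made up for illustration.

```python
import os
import time


def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before: dict, after: dict) -> tuple:
    """Pages swapped in and out between two samples."""
    return (after.get("pswpin", 0) - before.get("pswpin", 0),
            after.get("pswpout", 0) - before.get("pswpout", 0))


def log_swap_activity(log_path: str, samples: int, interval: float = 1.0,
                      source: str = "/proc/vmstat") -> None:
    """Append one line per interval and force it to disk before sleeping again."""
    with open(source) as handle:
        previous = parse_vmstat(handle.read())
    with open(log_path, "a") as log:
        for _ in range(samples):
            time.sleep(interval)
            with open(source) as handle:
                current = parse_vmstat(handle.read())
            swapped_in, swapped_out = swap_delta(previous, current)
            log.write(f"{time.time():.0f} pswpin={swapped_in} pswpout={swapped_out}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last sample survives a hard freeze
            previous = current
```

The last few lines of the log after a reboot would then show whether swapping picked up before the lockup.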

-- 
Ueimor
Comment 35 Mourad De Clerck 2007-07-31 06:30:25 UTC
1. in kernel cifs support. I tried monitoring with vmstat 1 when I did the 2x1GB transfers, but of course it only seems to happen when I'm not monitoring... I'll see if I can get some vmstat output

2. To be fair, I'm not sure it actually is a regression - the oldest kernel I have around is 2.6.18 and that one's even worse; I remember seeing a link at 1000Mbit when I first reported it (was it 2.6.16?), and I know the hardware is supposed to be able to do it. You can see in my old email here that I believed I was running at 1000Mbit:
http://marc.theaimsgroup.com/?l=linux-netdev&m=115010829624722&w=2

However, I just noticed there's a discrepancy between what mii-tool reports and what ethtool reports: one says I have link at 1000Mbit, the other tells me I'm at 100Mbit. Probably also the reason why I thought 2.6.16 was running at 1000Mbit - it probably never did?


To illustrate, with 2.6.22:

mii-tool:
Using SIOCGMIIPHY=0x8947
eth1: negotiated 100baseTx-FD flow-control, link ok
  registers for MII PHY 32: 
    1000 796d 001c c910 0de1 cde1 000d 2001
    40bd 0300 7800 1000 1007 f880 0000 3000
    0060 acc0 0000 0000 0060 0000 ef84 0108
    2740 6789 0000 010e 0990 0000 0000 98e0
  product info: vendor 00:07:32, model 17 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

ethtool:

Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000033 (51)
        Link detected: yes




For completeness' sake, here's mii-tool output for the other kernels:

2.6.18:
Using SIOCGMIIPHY=0x8947
eth1: 10 Mbit, half duplex, link ok
  registers for MII PHY 32: 
    0000 794d 001c c910 0de1 0020 0004 2001
    0000 0300 0000 1000 1007 f880 0000 3000
    0060 0c40 0000 0440 0060 0000 009a 0108
    2740 6669 0000 8000 8400 0000 0000 48b0
  product info: vendor 00:07:32, model 17 rev 0
  basic mode:   10 Mbit, half duplex
  basic status: link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 10baseT-HD

2.6.23-rc1:
Using SIOCGMIIPHY=0x8947
eth1: negotiated 100baseTx-FD flow-control, link ok
  registers for MII PHY 32: 
    1000 796d 001c c910 0de1 cde1 000d 2001
    4680 0300 3800 1000 1007 f880 0000 3000
    0060 acc0 0000 0000 0060 0000 ef84 0108
    2740 6669 0000 010f 0910 0000 0000 98e0
  product info: vendor 00:07:32, model 17 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
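
One possible reading of the mii-tool/ethtool discrepancy: the mii-tool summary lines above list only 10/100 modes, while gigabit negotiation lives in registers 9 and 10 of the same dump. The sketch below decodes those two registers using the bit positions from the standard MII register layout (the same values as ADVERTISE_1000FULL and LPA_1000FULL in linux/mii.h). That this PHY follows the standard layout, and that the dump rows are registers 0-7 and 8-15, are assumptions; the helper names are hypothetical.

```python
ADVERTISE_1000FULL = 0x0200  # register 9, bit 9: we advertise 1000baseT full duplex
ADVERTISE_1000HALF = 0x0100  # register 9, bit 8: we advertise 1000baseT half duplex
LPA_1000FULL = 0x0800        # register 10, bit 11: partner is 1000baseT-FD capable
LPA_1000HALF = 0x0400        # register 10, bit 10: partner is 1000baseT-HD capable

# First two rows (registers 0-15) of the dumps quoted above.
DUMP_2_6_22 = """1000 796d 001c c910 0de1 cde1 000d 2001
40bd 0300 7800 1000 1007 f880 0000 3000"""
DUMP_2_6_18 = """0000 794d 001c c910 0de1 0020 0004 2001
0000 0300 0000 1000 1007 f880 0000 3000"""


def parse_registers(dump: str) -> list:
    """Turn the hex words of a mii-tool -vv dump into a list of ints."""
    return [int(word, 16) for word in dump.split()]


def gigabit_summary(registers: list) -> dict:
    """Decode the 1000baseT advertisement bits of registers 9 and 10."""
    ctrl1000, stat1000 = registers[9], registers[10]
    return {
        "local_advertises_1000FD": bool(ctrl1000 & ADVERTISE_1000FULL),
        "local_advertises_1000HD": bool(ctrl1000 & ADVERTISE_1000HALF),
        "partner_advertises_1000FD": bool(stat1000 & LPA_1000FULL),
        "partner_advertises_1000HD": bool(stat1000 & LPA_1000HALF),
    }


for label, dump in (("2.6.22", DUMP_2_6_22), ("2.6.18", DUMP_2_6_18)):
    print(label, gigabit_summary(parse_registers(dump)))
```

Under these assumptions the 2.6.22 dump shows both ends advertising 1000baseT full duplex, which would match the ethtool report, whereas the 2.6.18 dump shows no gigabit-capable partner. This does not explain the 100Mbit transfer speeds that were observed.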
Comment 36 Francois Romieu 2007-10-10 14:09:08 UTC
Mourad, there have been several changes in the r8169 driver from 2.6.23-rc1
to 2.6.23. May I ask you to give 2.6.23 a try ?

Thanks in advance.

-- 
Ueimor
Comment 37 Francois Romieu 2007-11-29 15:24:07 UTC
Ping.

-- 
Ueimor
Comment 38 Mourad De Clerck 2007-12-16 10:36:53 UTC
Hi, sorry I didn't get back to you sooner.

With 2.6.23 I can still get a complete freeze with that card. It's just a matter of sending enough data (SCP'ing a couple of GB usually does it).

I'm seriously starting to wonder whether this could be a hardware issue after all. Like I said in the beginning, I did test it in Windows and it seemed perfectly stable, but it could have been a (un)lucky fluke. If there's any other way I could try to figure out whether this is a hardware issue, let me know (I'd prefer not to have to install Windows again, but I'll do so if really needed)

I've stopped using this network card obviously, but I don't mind continuing to plug it in and test new kernel versions now and again. I also won't mind if you'd prefer to close this bug, if you're satisfied this is most likely a hardware issue.
Comment 39 Dmitry Nezhevenko 2008-02-08 13:30:02 UTC
Hi all. I think that I have almost the same issue.
My machine is an MSI M673 laptop. It has the following Ethernet card:

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
        Subsystem: Micro-Star International Co., Ltd. Unknown device 3fdf
        Flags: bus master, fast devsel, latency 0, IRQ 18
        I/O ports at b800 [size=256]
        Memory at f8cff000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at f8cc0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Vital Product Data <?>
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [84] Vendor Specific Information <?>
        Kernel driver in use: r8169
        Kernel modules: r8169

It's connected to a 100Mbit D-Link Ethernet switch. The card just "freezes" when the transfer rate is too high, with no messages in syslog. After this I need to reconfigure the network interface (ifdown lan0 && ifup lan0) to make it work again.

My distro is Debian unstable with a self-built 2.6.24 kernel. dmesg and some other useful info about the laptop is available at http://inhex.net/dion/lj/m673/

If the issue is not the same, I will open a new bug.
Comment 40 Rolf Eike Beer 2008-05-20 22:58:27 UTC
Same driver, same problem:

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E PCI Express Fast Ethernet controller (rev 01)
        Subsystem: Toshiba America Info Systems Unknown device ff00
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 220
        Region 0: I/O ports at 4000 [size=256]
        Region 2: Memory at da000000 (64-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at d4000000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
                Address: 00000000fee0100c  Data: 41e9
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
                Device: Latency L0s <1us, L1 unlimited
                Device: AtnBtn+ AtnInd+ PwrInd+
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
                Link: Speed 2.5Gb/s, Width x1
        Capabilities: [84] Vendor Specific Information
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [12c] Virtual Channel
        Capabilities: [148] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
        Capabilities: [154] Power Budgeting

That's the built-in network chip of a Toshiba Satellite A110-178. I've seen this for ages now with more or less recent kernels (Linus' git). When I do this on console I get this:

CPU 1: Machine Check Exception 000000000005
Bank0: b200004000000800
Bank5: b200120020080400
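
The bank values can be split into their architectural fields. The sketch below uses the flag positions of the x86 machine-check status register (IA32_MCi_STATUS) as described in the Intel Software Developer's Manual; it only separates the fields and makes no attempt to interpret the error codes. The function name is hypothetical.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS.
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check bank status into flags and code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF              # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


for bank, value in (("Bank0", 0xB200004000000800), ("Bank5", 0xB200120020080400)):
    fields = decode_mci_status(value)
    set_flags = [name for name in FLAGS if fields[name]]
    print(bank, set_flags,
          f"mca=0x{fields['mca_error_code']:04x}",
          f"model=0x{fields['model_specific_code']:04x}")
```

Both banks have VAL, UC, EN and PCC set, i.e. a valid, uncorrected error with the processor context marked as corrupt, which fits a hard freeze rather than a recoverable driver fault.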

It seems to me that this is easier to reproduce if the receiver is slower than me, e.g. sending stuff to my Pentium I at 10 MBit/s froze even when I limited the transfer to something like 50 kByte/s. I have NAPI enabled and thought this fixed it, but I still have this issue when I copy larger files.

When I saw this using scp from console I had the effect that suddenly the transfer rate dropped and within seconds the system froze.
Comment 41 Francois Romieu 2008-09-23 14:13:11 UTC
Is the behavior the same with:
- 2.6.27-rc7
- 2.6.27-rc6 +
  http://userweb.kernel.org/~romieu/r8169/2.6.27-rc6/20080913-r8169-test.patch

There are enough changes in the r8169 driver for it to deserve a try. Please
note that the 8168 (Dmitry) and the 8101 (Rolf) will not necessarily behave
the same. Actually, one can expect differences as soon as the XIDs displayed
by the r8169 driver in the kernel log (since 2.6.23) are not the same.
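
To compare chips between reporters, the XID can be pulled out of a saved kernel log. The following Python sketch assumes the driver prints the identifier as the token "XID" followed by a hexadecimal number on its probe line; the sample line in the code is made up for illustration and is not taken from any log in this report, and `extract_xids` is a hypothetical helper name.

```python
import re

# Assumed format: "... XID <hex digits> ..." somewhere in the probe line.
XID_RE = re.compile(r"\bXID\s+([0-9a-fA-F]+)\b")


def extract_xids(log_text: str) -> list[str]:
    """Return every XID value found in the log text, lowercased, in order."""
    return [m.group(1).lower() for m in XID_RE.finditer(log_text)]


if __name__ == "__main__":
    # Hypothetical sample line, for illustration only.
    sample = "eth0: RTL8168b/8111b at 0xf8000000, 00:11:22:33:44:55, XID 38000000 IRQ 17"
    print(extract_xids(sample))
```

Two machines whose logs yield different XID values should not be expected to behave the same with a given driver version.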

-- 
Ueimor
Comment 42 Rolf Eike Beer 2008-09-26 01:15:53 UTC
I'm on 2.6.27-rc7-git now and was not able to reproduce this until now.
Comment 43 Rolf Eike Beer 2008-10-03 03:24:56 UTC
Freeze is still there but looks like it is harder to hit. Or I just had luck.
Comment 44 Dmitry Nezhevenko 2008-11-15 12:30:09 UTC
This seems to work for me with 2.6.27.x kernels. At least I can't reproduce it for now.
Comment 45 Rolf Eike Beer 2009-02-24 07:53:06 UTC
I tried Linus' tree from 2009-02-20 (that's basically 2.6.29-rc6 when looking at net drivers) and still got this.
Comment 46 Florian Engelhardt 2009-04-30 09:03:24 UTC
I had the same problem with an "Intel® D945GCLF2 inkl. Intel® Atom 330" mainboard. It comes with an r8168b Gigabit NIC on board. Archlinux tried to use the r8169 kernel module, but at high transfer rates the NIC froze. It did not respond to ping, nor was I able to ping other computers from that server. After several minutes it automagically worked again, but only at low transfer speeds.

I tried the kernel module for the r8168 from the Realtek homepage. You have to fix some defines to get it to compile with the 2.6.29 kernel, but then it compiles and works.

To check that it's working, I transferred about 320 GB from my desktop to the server running that r8168 module from Realtek, at 118 MB/s on average (RAID 0 in the desktop, RAID 10 on the server).
No freezing, no locks.
Comment 47 Ralph Seichter 2009-05-02 10:20:06 UTC
Just like Florian, I use an Intel Essential Series D945GCLF2 Board with Realtek RTL8111/8168B NIC, and I'm experiencing the same problems with the Module "/lib64/modules/2.6.29-gentoo-r2/kernel/drivers/net/r8169.ko". The NIC is currently attached to a 100 Mbit hub, and when large amounts of data are transferred simultaneously inbound and outbound, I see transmit timeouts:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xcd/0x16f()
Hardware name:
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: nfsd lockd nfs_acl sunrpc exportfs smsc47m1 smsc47m192 hwmon_vid ehci_hcd uhci_hcd i2c_i801 usbcore
Pid: 0, comm: swapper Tainted: G        W  2.6.29-gentoo-r2 #3
Call Trace:
 <IRQ>  [<ffffffff8103b483>] warn_slowpath+0xd3/0x10f
 [<ffffffff811763b9>] ? cpumask_next_and+0x2b/0x3c
 [<ffffffff81032005>] ? enqueue_task_fair+0x25/0x92
 [<ffffffff8102efcf>] ? enqueue_task+0x50/0x5b
 [<ffffffff8102f0cc>] ? activate_task+0x28/0x31
 [<ffffffff81035a1b>] ? try_to_wake_up+0x255/0x267
 [<ffffffff81035a3a>] ? default_wake_function+0xd/0xf
 [<ffffffff812bf4cd>] ? dev_watchdog+0x0/0x16f
 [<ffffffff8102f5f1>] ? __wake_up_common+0x46/0x75
 [<ffffffff812bf49d>] ? netif_tx_lock+0x48/0x78
 [<ffffffff812bf4cd>] ? dev_watchdog+0x0/0x16f
 [<ffffffff812bf59a>] dev_watchdog+0xcd/0x16f
 [<ffffffff81043e60>] run_timer_softirq+0x18b/0x200
 [<ffffffff81057296>] ? clockevents_program_event+0x77/0x80
 [<ffffffff810403be>] __do_softirq+0x83/0x121
 [<ffffffff8100d2bc>] call_softirq+0x1c/0x28
 [<ffffffff8100e1d4>] do_softirq+0x34/0x76
 [<ffffffff81040154>] irq_exit+0x3f/0x79
 [<ffffffff8101bd07>] smp_apic_timer_interrupt+0x93/0xac
 [<ffffffff8100ccf3>] apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8101219e>] ? mwait_idle+0x6e/0x73
 [<ffffffff8100b244>] ? enter_idle+0x22/0x24
 [<ffffffff8100b298>] ? cpu_idle+0x52/0x93
 [<ffffffff8131832f>] ? start_secondary+0x175/0x17a
---[ end trace f425effd8183898b ]---
r8169: eth0: link up
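
To check a longer log for how often this happens, the watchdog message can be matched directly. This is a small Python sketch built only on the "NETDEV WATCHDOG: eth0 (r8169): transmit timed out" line shown in the trace above; the helper name `find_tx_timeouts` is made up for this example, and other kernel versions may word the message differently.

```python
import re

# Matches the watchdog line exactly as it appears in the trace above.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit timed out"
)


def find_tx_timeouts(log_text: str) -> list[tuple[str, str]]:
    """Return (interface, driver) pairs for each transmit timeout found."""
    return [(m.group(1), m.group(2)) for m in WATCHDOG_RE.finditer(log_text)]


if __name__ == "__main__":
    log = (
        "WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xcd/0x16f()\n"
        "NETDEV WATCHDOG: eth0 (r8169): transmit timed out\n"
        "r8169: eth0: link up\n"
    )
    for iface, driver in find_tx_timeouts(log):
        print(f"transmit timeout on {iface} (driver {driver})")
```

Feeding it the saved output of dmesg gives one entry per timeout, which makes it easier to compare the r8169 and r8168 modules under the same load.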

I downloaded Realtek drivers from

http://152.104.125.41/downloads/downloadsView.aspx?Langid=1&PNid=5&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

but the source won't compile out of the box. Florian, could you please
tell me what modifications you made? Thanks!
Comment 48 Florian Engelhardt 2009-05-02 14:08:20 UTC
Created attachment 21186 [details]
Patch to get r8168 Realtek module to compile cleanly with 2.6.29 kernel

OK, I added the patch. Just apply it, and it should compile with the 2.6.29 kernel. There are two warnings (unused variable and foo defined but not used) which you can ignore.
Comment 49 Ralph Seichter 2009-05-02 18:16:15 UTC
> Ok, i added the patch. Just apply it, and it should compile with 2.6.29
> kernel.

Yes, it compiles OK and the NIC works with the new module. Thank you! I'll perform some load testing to see how it behaves. For the record, I now use the following modules on my machine:

  # lsmod
  Module                  Size  Used by
  smsc47m1               10168  0
  smsc47m192             15288  0
  hwmon_vid               2616  1 smsc47m192
  hwmon                   2648  2 smsc47m1,smsc47m192
  af_packet              14216  2
  nfsd                  100520  13
  lockd                  67044  1 nfsd
  nfs_acl                 2936  1 nfsd
  sunrpc                179112  10 nfsd,lockd,nfs_acl
  exportfs                4200  1 nfsd
  ehci_hcd               48400  0
  uhci_hcd               31576  0
  r8168                  40296  0
  usbcore               154704  3 ehci_hcd,uhci_hcd
  iTCO_wdt               12352  0
  i2c_i801                9364  0
  iTCO_vendor_support     3356  1 iTCO_wdt
  bitrev                  1960  1 r8168
  crc32                   3960  1 r8168
Comment 50 Ralph Seichter 2009-05-07 18:11:58 UTC
The "r8168" module works fine here. Is there a chance to add this module to the Linux Kernel sources?
Comment 51 Francois Romieu 2009-06-15 22:02:21 UTC
This ought to be fixed in 2.6.30.

Can you give it a try?

-- 
Ueimor
Comment 52 Ralph Seichter 2009-06-18 21:00:07 UTC
I've built a kernel "2.6.30-gentoo-r1" with the r8169 module, but I did not yet have the opportunity to do a network stress test. I hope to find the required time during the next weekend.
Comment 53 Ralph Seichter 2009-06-19 20:27:00 UTC
I performed some tests today, and so far I have not experienced any failures with Kernel 2.6.30-gentoo-r1 and the r8169 NIC driver module. Nice work, Francois.
Comment 54 Francois Romieu 2009-06-19 22:12:47 UTC
I do little work. Many people contribute.

Thanks for your patience.

-- 
Ueimor