Bug 6929

Summary: T60 and e1000 long ping
Product: Drivers Reporter: Daniel Smolik (marvin)
Component: Network Assignee: Auke Kok (auke-jan.h.kok)
Status: CLOSED CODE_FIX    
Severity: normal CC: auke-jan.h.kok, bunk, erik.andren, jdelvare, jesse.brandeburg, john.ronciak, kernel.org, normes-kbt, sluskyb, Thorsten, u288
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.17.1 Subsystem:
Regression: --- Bisected commit-id:
Attachments: lspci
lspci
Cisco VPN Client Connection w/e1000 and alternate nic

Description Daniel Smolik 2006-07-31 00:29:47 UTC
Most recent kernel where this bug did not occur:-
Distribution:Debian Sarge
Hardware Environment: Thinkpad T60
Software Environment: only kernel
Problem Description:
 Hi,
I have a new T60 notebook with an e1000 network card and a Core
Duo CPU. The e1000 is connected via PCIe. If I ping the T60 I
get these pings:
64 bytes from 192.168.3.74: icmp_seq=88 ttl=64 time=1000.2 ms
64 bytes from 192.168.3.74: icmp_seq=89 ttl=64 time=0.2 ms

If I ping from the T60 to my computer, the situation is normal.

If I disable IRQ balancing in the kernel config, pings are
much better but still not ideal :-)

64 bytes from 192.168.3.74: icmp_seq=29 ttl=64 time=12.7 ms
64 bytes from 192.168.3.74: icmp_seq=30 ttl=64 time=10.0 ms
64 bytes from 192.168.3.74: icmp_seq=31 ttl=64 time=7.3 ms
64 bytes from 192.168.3.74: icmp_seq=32 ttl=64 time=4.5 ms

I tested kernel 2.6.17.1 with the included driver and with the
latest driver from sf.net, and upgraded the BIOS, but nothing
helps. I also tested with NAPI and without NAPI.


Steps to reproduce:
Compile kernel with SMP irq balancing enabled and ping to T60
Comment 1 Jesse Brandeburg 2006-07-31 09:02:50 UTC
Please try loading the driver with parameter InterruptThrottleRate=0
modprobe e1000 InterruptThrottleRate=0

also please send the output of lspci, cat /proc/interrupts and ethtool -e eth0

Is this an e1000 problem or an smp irqbalance problem?  It seems your results
are normal when disabling in-kernel smp irqbalance so I would say it is probably
not e1000.
Comment 2 Daniel Smolik 2006-07-31 09:57:05 UTC
Created attachment 8655 [details]
lspci

output of lspci
Comment 3 Daniel Smolik 2006-07-31 09:57:44 UTC
Created attachment 8656 [details]
lspci

output of lspci -v
Comment 4 Daniel Smolik 2006-07-31 10:00:09 UTC
InterruptThrottleRate=0 - does not help
boot with init=/bin/bash nosmp - does not help
uniprocessor kernel - does not help
 ethtool  -e eth0
Offset          Values
------          ------
0x0000          00 15 58 2a 8a a7 30 0b b2 ff 51 00 ff ff ff ff 
0x0010          53 00 03 01 6b 02 01 20 aa 17 9a 10 86 80 df 80 
0x0020          00 00 00 20 54 7e 00 00 14 00 da 00 04 00 00 27 
0x0030          c9 6c 50 31 3e 07 0b 04 8b 29 00 00 00 f0 02 0f 
0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff 
0x0050          14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00 
0x0060          00 01 00 40 1f 12 07 40 ff ff ff ff ff ff ff ff 
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 97 b8 

    CPU0       CPU1       
  0:     982852          0    IO-APIC-edge  timer
  1:        234          0    IO-APIC-edge  i8042
  9:        184          0   IO-APIC-level  acpi
 12:       2350          0    IO-APIC-edge  i8042
 14:         12          0    IO-APIC-edge  ide0
 66:       2891          0         PCI-MSI  libata
 74:        161          0   IO-APIC-level  HDA Intel, uhci_hcd:usb2
 82:          0          0   IO-APIC-level  uhci_hcd:usb3
 90:         44          0   IO-APIC-level  uhci_hcd:usb4, ehci_hcd:usb5
 98:       1060          0         PCI-MSI  eth0
169:          1          0   IO-APIC-level  uhci_hcd:usb1, yenta
NMI:          0          0 
LOC:     982652     982653 
ERR:          0
MIS:          0
output of lspci + lspci -v attached
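
One way to read the /proc/interrupts table above is to pull out just the per-CPU counters for eth0 and sample them twice while the slow pings are running. The helper below is a minimal sketch, not something posted in this report: the function name `irq_counts` is made up for illustration, and it assumes the two-CPU column layout shown above and an IRQ line that is not shared with other devices.

```sh
# Print the CPU0 and CPU1 interrupt counts for one device,
# reading /proc/interrupts-style text from stdin.
# Assumes two CPU columns and that the device name is the last field
# (i.e. the IRQ is not shared, as with the PCI-MSI eth0 line above).
irq_counts() {
    # $1: device name as shown in the last column, e.g. eth0
    awk -v dev="$1" '$NF == dev { print $2, $3 }'
}

# Usage: sample twice while pinging the machine from another host.
#   irq_counts eth0 < /proc/interrupts
#   sleep 10
#   irq_counts eth0 < /proc/interrupts
```

If the counter barely moves between the two samples while pings are arriving, that would suggest receive interrupts are being delayed rather than misrouted between CPUs.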
Comment 5 Jean Delvare 2006-10-02 12:04:28 UTC
Same problem here on a Thinkpad T60p. Switching to an UP kernel didn't help so
it doesn't have anything to do with irq balancing. I tried several kernels from
2.6.16 to 2.6.18, and the 7.2.9 e1000 driver from sourceforge, the problem is
still there.

I tried InterruptThrottleRate=0, it doesn't help.

I observed that pings to the local network don't appear to be affected by the
problem. I also observed that the problem doesn't exist for larger packets, i.e.
ping -s 1000 works OK. I narrowed the limit down to 342, i.e. ping -s 342 and
less have the problem, ping -s 343 and above are OK. See for yourself
(62.4.16.243 is the hop right after my home router):

jdelvare@amber:~> ping -c 10 62.4.16.243
PING 62.4.16.243 (62.4.16.243) 56(84) bytes of data.
64 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=1010 ms
64 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=392 ms
64 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=1000 ms
64 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=392 ms
64 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=1000 ms
64 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=392 ms
64 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=1000 ms
64 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=392 ms
64 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=1000 ms
64 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=392 ms

--- 62.4.16.243 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9011ms
rtt min/avg/max/mdev = 392.354/697.534/1010.802/304.989 ms, pipe 2
jdelvare@amber:~> ping -c 10 62.4.16.243 -s 342
PING 62.4.16.243 (62.4.16.243) 342(370) bytes of data.
350 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=838 ms
350 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=1000 ms
350 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=40.5 ms
350 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=1000 ms
350 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=836 ms
350 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=1000 ms
350 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=836 ms
350 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=1000 ms
350 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=836 ms
350 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=1836 ms

--- 62.4.16.243 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9002ms
rtt min/avg/max/mdev = 40.570/922.476/1836.399/408.233 ms, pipe 2
jdelvare@amber:~> ping -c 10 62.4.16.243 -s 343
PING 62.4.16.243 (62.4.16.243) 343(371) bytes of data.
351 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=41.7 ms
351 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=42.7 ms
351 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=40.6 ms
351 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=40.6 ms
351 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=40.7 ms
351 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=40.7 ms
351 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=40.7 ms
351 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=40.7 ms
351 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=40.6 ms
351 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=44.7 ms

--- 62.4.16.243 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 40.623/41.425/44.780/1.308 ms
jdelvare@amber:~>

The same commands run from any other host of my local network don't exhibit
these high latencies and jitter.

I'm not attaching lspci info, as this is exactly the same as Daniel's.

I am willing to help debug this issue, but I just don't know what to try next.
Any hint?
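
The manual size bisection above can be repeated on other machines with a small loop. This is only a sketch: `avg_rtt` and `sweep_sizes` are hypothetical helper names, and the parsing assumes the iputils summary line format seen in the output above ("rtt min/avg/max/mdev = ...").

```sh
# Extract the average RTT in ms from ping's summary line on stdin.
avg_rtt() {
    awk -F'=' '/^rtt min\/avg\/max\/mdev/ {
        split($2, parts, "/")   # parts: min, avg, max, "mdev ms..."
        gsub(/ /, "", parts[2])
        print parts[2]
    }'
}

# Ping a host once per payload size and print "size<TAB>avg_rtt".
sweep_sizes() {
    host="$1"; shift
    for size in "$@"; do
        printf '%s\t%s\n' "$size" "$(ping -c 10 -s "$size" "$host" | avg_rtt)"
    done
}

# Usage (host and sizes taken from the test above):
#   sweep_sizes 62.4.16.243 56 342 343 1000
```

A sharp drop in the average between two adjacent sizes would mark the same kind of threshold as the 342/343 boundary reported here.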
Comment 6 Jean Delvare 2006-10-07 04:15:23 UTC
Peter A. Zannucci found that using RxIntDelay=5 when loading the e1000 driver
makes the problem go away, and I confirm that. As I understand it we don't want
to make this the default, so this is in no way a fix, but it may help the
developers understand the cause of the problem.
Comment 7 Jesse Brandeburg 2006-12-18 15:01:16 UTC
we have a root cause to this issue, and the fix involves an eeprom change.  We
will shortly attach a script users can run after loading the driver.
Comment 8 Jesse Brandeburg 2007-01-18 11:47:39 UTC
Status update:
My previous comment regarding an eeprom fix was incorrect.  Currently the
hardware team says the recommended workaround is to load driver with
RxIntDelay=8 or RxIntDelay=32 (basically any non-zero value), only on 82573L
(device id 8086:109a)

The issue is still being worked.
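
Since the workaround is meant only for the 82573L, a script can check for the PCI ID quoted above before reloading the driver. The following is a rough sketch under that assumption; `has_82573l` is an illustrative name, not part of any Intel tooling.

```sh
# Succeeds if `lspci -n` output on stdin lists an 82573L (PCI ID 8086:109a).
has_82573l() {
    grep -qi '8086:109a'
}

# Usage (as root; reloading the driver drops the link briefly):
#   if lspci -n | has_82573l; then
#       rmmod e1000 && modprobe e1000 RxIntDelay=8
#   fi
```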
Comment 9 John Marrett 2007-02-13 05:32:13 UTC
Created attachment 10403 [details]
Cisco VPN Client Connection w/e1000 and alternate nic
Comment 10 John Marrett 2007-02-13 05:34:29 UTC
I believe that this bug may also cause issues with the cisco vpn client
(vpnclient-linux-x86_64-4.8.00.0490-k9).

I have a T60p w/a 82573L running ubuntu dapper (2.6.15-28-386), I can reproduce
the issue with ubuntu edgy (2.6.17). 

While modifying the RxIntDelay settings does stabilise my ping response times, I
am unable to connect to a vpn using the Cisco vpn client in TCP tunneling mode.

I am able to connect using the ipw3945 interface, as well as with a 3com PC Card
ethernet card (3CCFE575CT) on the same system, so it does seem to be related to
the e1000 driver.

Packet captures at the router reveal that traffic is sent to and from the
machine, so the vpn client is able to use the network, just not able to connect
to the vpn concentrator.

Is there anything else I can do that might improve this situation?

Is there any diagnostic information that I can provide you with that might help
to identify the specifics of the problem?

I've attached a file showing the behaviour of the client.
Comment 11 Jean Delvare 2007-02-13 06:42:37 UTC
John, I suspect you are affected by a different bug:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=753eab76a3337863a0d86ce045fa4eb6c3cbeef9

Try a kernel with CONFIG_E1000_DISABLE_PACKET_SPLIT=y or backport the fix above.
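
Before rebuilding, it may be worth checking how the running kernel was configured. This is a minimal sketch: `packet_split_disabled` is a hypothetical helper, and the config file location in the usage line is a distribution-dependent assumption (some kernels expose /proc/config.gz instead).

```sh
# Succeeds if the kernel config on stdin has e1000 packet split disabled.
packet_split_disabled() {
    grep -q '^CONFIG_E1000_DISABLE_PACKET_SPLIT=y$'
}

# Usage:
#   if packet_split_disabled < "/boot/config-$(uname -r)"; then
#       echo "packet split is disabled"
#   else
#       echo "packet split is enabled (or the option is absent)"
#   fi
```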
Comment 12 John Marrett 2007-02-13 06:50:11 UTC
Jean, you raise a very interesting point.

A quick question though: are you certain that this issue can also affect TCP
encapsulated traffic? It seems to perhaps be UDP specific.

With the vpn configuration I'm using I don't do traditional IPSec using isakmp
and esp, instead the traffic is encapsulated into a TCP connection.
Comment 13 Jean Delvare 2007-02-13 07:04:34 UTC
If your VPN doesn't make use of encapsulated ESP packets, then indeed it's
unlikely that the patch I mentioned will help you. But maybe there's a similar
bug in the TCP code, so I think it's still worth trying
CONFIG_E1000_DISABLE_PACKET_SPLIT=y. If it doesn't help, then it means my guess
wasn't correct and your problem is different.
Comment 14 John Marrett 2007-02-14 05:39:52 UTC
Jean,

You are entirely right. I recompiled the kernel on edgy after changing the 
CONFIG_E1000_DISABLE_PACKET_SPLIT parameter to y. My e1000 now works properly
with the vpn client.

Do you have a link to the bug for the patch you referred me to? I'd like to
update it and point out that it appears to affect (some?) TCP communication as
well. If a new bug would be more appropriate please let me know.

Sorry to ask such questions, quite new to bug reporting.
Comment 15 Jean Delvare 2007-02-14 06:19:11 UTC
John, the bug in question was opened against SLES 10 in Novell's bugzilla, and
it isn't public, sorry.

I suggest that you try a recent kernel (at least 2.6.18.6, but the more recent
the better) and confirm that the problem is still there. Maybe it's already
fixed after all. If the bug is still there, then opening a new bug sounds sane.
Comment 16 John Marrett 2007-02-19 17:09:39 UTC
I've opened Bug 8042 on the VPN related issues discussed above.

http://bugzilla.kernel.org/show_bug.cgi?id=8042

Jean,

Thanks for all your help, much appreciated.
Comment 17 Petri Lehtolainen 2007-03-30 05:49:48 UTC
Updating to the latest BIOS version (2.11) fixed the long ping issue on my T60
(2007-FUG).
Comment 18 Mattias Holmlund 2007-03-30 12:45:51 UTC
Bios 2.11 from 

http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-63027

solved it for me as well. Ping over 100 Mbps Ethernet shows

rtt min/avg/max/mdev = 0.320/0.517/0.739/0.153 ms

Thank you!
Comment 19 Jesse Brandeburg 2007-04-03 15:16:46 UTC
Looks like the issue is resolved with the latest bios available.  Closing.
Comment 20 Jesse Brandeburg 2007-04-04 13:14:35 UTC
allow me to correct myself.

This issue was not completely resolved by the BIOS update 2.11 but by the patch
that changes the RDTR value. This patch will be included shortly in a patch to
the kernel, probably for 2.6.22.
Comment 21 Erik Andrén 2007-05-22 09:51:37 UTC
Jesse, what patch are you referring to?
Comment 22 Jesse Brandeburg 2007-05-22 11:06:08 UTC
the patch is still not upstream, because the release preceding the one that
contains this patch was rejected.

the fix is simple, and can be done by a user at module load time
rmmod e1000; modprobe e1000 RxIntDelay=8

the driver fix is available in the driver on our sourceforge site, version
7.5.5.  it contains a patch to e1000_param.c to make sure the RxIntDelay is >= 8
on 82573.
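
To keep the module-load workaround across reboots, the parameter can be placed in a modprobe options file. The sketch below only generates the options line; `e1000_options_line` and the file name in the usage comment are illustrative assumptions, and some distributions of this era read /etc/modprobe.conf rather than a modprobe.d directory.

```sh
# Print a modprobe "options" line that sets RxIntDelay for e1000.
# $1: delay value (defaults to 8, the value suggested above).
e1000_options_line() {
    delay="${1:-8}"
    printf 'options e1000 RxIntDelay=%s\n' "$delay"
}

# Usage (as root; the file name is hypothetical):
#   e1000_options_line 8 > /etc/modprobe.d/e1000-rxintdelay.conf
#   rmmod e1000; modprobe e1000
```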
Comment 23 justy 2007-06-10 21:56:38 UTC
Installed CentOS 4.4 on T60.
update BIOS to 2.12 (79ETD2WW). 
e1000 driver version 7.5.5.
used RxIntDelay=5, 8 or 32, ping is still slow.


64 bytes from 192.168.1.1: icmp_seq=19 ttl=64 time=0.824 ms
64 bytes from 192.168.1.1: icmp_seq=20 ttl=64 time=124 ms
64 bytes from 192.168.1.1: icmp_seq=21 ttl=64 time=123 ms
64 bytes from 192.168.1.1: icmp_seq=22 ttl=64 time=122 ms
64 bytes from 192.168.1.1: icmp_seq=23 ttl=64 time=4.06 ms
64 bytes from 192.168.1.1: icmp_seq=24 ttl=64 time=121 ms
64 bytes from 192.168.1.1: icmp_seq=25 ttl=64 time=121 ms
64 bytes from 192.168.1.1: icmp_seq=26 ttl=64 time=1.97 ms
64 bytes from 192.168.1.1: icmp_seq=27 ttl=64 time=0.824 ms
64 bytes from 192.168.1.1: icmp_seq=28 ttl=64 time=952 ms
64 bytes from 192.168.1.1: icmp_seq=29 ttl=64 time=0.833 ms
64 bytes from 192.168.1.1: icmp_seq=30 ttl=64 time=108 ms
64 bytes from 192.168.1.1: icmp_seq=31 ttl=64 time=107 ms
64 bytes from 192.168.1.1: icmp_seq=32 ttl=64 time=0.822 ms
64 bytes from 192.168.1.1: icmp_seq=33 ttl=64 time=0.823 ms
64 bytes from 192.168.1.1: icmp_seq=34 ttl=64 time=947 ms
64 bytes from 192.168.1.1: icmp_seq=35 ttl=64 time=98.3 ms
Comment 24 Petri Lehtolainen 2007-06-15 03:51:37 UTC
justy: Does your T60 have an ATI video chip? If yes, and you are using open source driver, give ATI proprietary drivers a shot.

I recently installed CentOS 5, and the ping issue came back. The RxIntDelay fix works, but only if I'm using ATI proprietary drivers. What makes this even stranger is that if I use Dual Head mode the problem appears again.

It seems that there are some issues with e1000 and ATI drivers.
Comment 25 gagarine 2007-07-03 08:03:13 UTC
Same thing as justy, and I have the ATI proprietary driver. Running on Feisty with kernel 2.6.20-16-generic (386).

A bug is open in ubuntu launchpad https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.15/+bug/42572
Comment 26 Auke Kok 2007-07-03 09:41:50 UTC
STATUS UPDATE:

Issue is still being worked. I finally have hardware that I can test this issue on.
Comment 27 Zoltan Hidvegi 2007-07-18 09:00:39 UTC
I've recently upgraded to 2.6.21.6 (Debian/unstable kernel) (from 2.6.18) on my X60, and the problem is back.  My hardware is an X60 with the 945GM chipset.  I've also upgraded to version 2.1.0 of the xserver-xorg-video-intel driver, but I upgraded the kernel and the driver at more or less the same time, so I'm not sure what effect each had.  With 2.6.18 and the old i810 video driver, I got slow but steady pings with around 20ms roundtrip times.  Now it's back to alternating 1000ms / 0.5ms when pinged from a remote host.
Comment 28 gagarine 2007-10-12 02:02:11 UTC
With kernel 2.6.22 (Gutsy) I have slow pings too, on a fresh install.
Comment 29 Nikolay Ulyanitsky 2007-10-31 08:32:36 UTC
X60
Fedora 7
Kernel 2.6.23.1 from updates
Bios v2.12
Bug not fixed.

RxIntDelay=8 fixes the problem
Comment 30 Auke Kok 2007-11-13 10:49:17 UTC
The final fix (disabling L1 ASPM completely) was merged in jgarzik/netdev-2.6 #upstream as of this week.

please test and confirm. I'll set this bug to close.
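
To see whether L1 ASPM is active on the NIC's PCIe link, before or after testing the fix, the link control line from lspci can be inspected. This is a sketch only: `link_ctl` is a made-up helper name, the exact wording of the LnkCtl line varies between pciutils versions, and the PCI ID in the usage line is the 82573L ID quoted earlier in this report.

```sh
# Print the PCIe "LnkCtl:" line(s) from `lspci -vv` output on stdin,
# with leading whitespace removed. The ASPM state appears on this line.
link_ctl() {
    awk '/LnkCtl:/ { sub(/^[ \t]+/, ""); print }'
}

# Usage (as root, otherwise lspci may not show PCIe capabilities):
#   lspci -vv -d 8086:109a | link_ctl
```

A line that reports ASPM as disabled would be consistent with the fix being in effect; a line that reports L1 as enabled would suggest it is not.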