Most recent kernel where this bug did not occur: -
Distribution: Debian Sarge
Hardware Environment: Thinkpad T60
Software Environment: only kernel
Problem Description:

Hi, I have a new T60 notebook with an e1000 network card and a Core Duo CPU. The e1000 is connected via PCIe. If I ping the T60, I get these replies:

    64 bytes from 192.168.3.74: icmp_seq=88 ttl=64 time=1000.2 ms
    64 bytes from 192.168.3.74: icmp_seq=89 ttl=64 time=0.2 ms
    64 bytes from 192.168.3.74: icmp_seq=88 ttl=64 time=1000.2 ms
    64 bytes from 192.168.3.74: icmp_seq=89 ttl=64 time=0.2 ms

If I ping from the T60 to my computer, the situation is normal. If I disable IRQ balancing in the kernel config, pings are much better but still not the best :-)

    64 bytes from 192.168.3.74: icmp_seq=29 ttl=64 time=12.7 ms
    64 bytes from 192.168.3.74: icmp_seq=30 ttl=64 time=10.0 ms
    64 bytes from 192.168.3.74: icmp_seq=31 ttl=64 time=7.3 ms
    64 bytes from 192.168.3.74: icmp_seq=32 ttl=64 time=4.5 ms

I tested kernel 2.6.17.1 with the included driver and with the latest driver from sf.net, and upgraded the BIOS, but nothing helps. I also tested with and without NAPI.

Steps to reproduce: Compile a kernel with SMP IRQ balancing enabled and ping the T60.
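For anyone comparing captures, the alternating pattern in the ping output above can be checked mechanically. The following is only a minimal sketch: the function names and the 100 ms cutoff are assumptions chosen for illustration, not something taken from the report.

```python
import re

# Matches the "time=<value> ms" field of a ping reply line.
RTT_RE = re.compile(r"time=([0-9.]+)\s*ms")


def parse_rtts(ping_output):
    """Return every round-trip time (in ms) found in ping output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]


def slow_fraction(rtts, threshold_ms=100.0):
    """Fraction of replies slower than threshold_ms (an arbitrary cutoff)."""
    if not rtts:
        return 0.0
    return sum(1 for r in rtts if r > threshold_ms) / len(rtts)


# Two reply lines copied from the report above.
sample = """\
64 bytes from 192.168.3.74: icmp_seq=88 ttl=64 time=1000.2 ms
64 bytes from 192.168.3.74: icmp_seq=89 ttl=64 time=0.2 ms
"""

rtts = parse_rtts(sample)
print(rtts, slow_fraction(rtts))
```

A slow fraction near one half, with the fast replies well under a millisecond, matches the "every other packet takes a second" behaviour described here.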
Please try loading the driver with the parameter InterruptThrottleRate=0:

    modprobe e1000 InterruptThrottleRate=0

Also, please send the output of lspci, cat /proc/interrupts and ethtool -e eth0.

Is this an e1000 problem or an SMP irqbalance problem? It seems your results are normal when in-kernel SMP irqbalance is disabled, so I would say it is probably not e1000.
Created attachment 8655: lspci (output of lspci)
Created attachment 8656: lspci (output of lspci -v)
InterruptThrottleRate=0: did not help
Booting with init=/bin/bash nosmp: did not help
Uniprocessor kernel: did not help

Output of ethtool -e eth0:

    Offset          Values
    ------          ------
    0x0000          00 15 58 2a 8a a7 30 0b b2 ff 51 00 ff ff ff ff
    0x0010          53 00 03 01 6b 02 01 20 aa 17 9a 10 86 80 df 80
    0x0020          00 00 00 20 54 7e 00 00 14 00 da 00 04 00 00 27
    0x0030          c9 6c 50 31 3e 07 0b 04 8b 29 00 00 00 f0 02 0f
    0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
    0x0050          14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
    0x0060          00 01 00 40 1f 12 07 40 ff ff ff ff ff ff ff ff
    0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 97 b8

Output of cat /proc/interrupts:

               CPU0       CPU1
      0:     982852          0    IO-APIC-edge  timer
      1:        234          0    IO-APIC-edge  i8042
      9:        184          0   IO-APIC-level  acpi
     12:       2350          0    IO-APIC-edge  i8042
     14:         12          0    IO-APIC-edge  ide0
     66:       2891          0         PCI-MSI  libata
     74:        161          0   IO-APIC-level  HDA Intel, uhci_hcd:usb2
     82:          0          0   IO-APIC-level  uhci_hcd:usb3
     90:         44          0   IO-APIC-level  uhci_hcd:usb4, ehci_hcd:usb5
     98:       1060          0         PCI-MSI  eth0
    169:          1          0   IO-APIC-level  uhci_hcd:usb1, yenta
    NMI:          0          0
    LOC:     982652     982653
    ERR:          0
    MIS:          0

The output of lspci and lspci -v is attached.
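The /proc/interrupts dump above shows eth0 on a PCI-MSI vector with every interrupt delivered to CPU0. As a rough aid for reading such dumps, here is a small sketch that parses the text and reports which CPUs have serviced a given device. The helper names are hypothetical, and the parser assumes the simple two-part layout seen in this report (per-CPU counts followed by a free-form description).

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counters than CPUs
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def cpus_handling(table, device):
    """Return the sorted CPU indices with a non-zero count for the device."""
    cpus = set()
    for counts, desc in table.values():
        if device in desc.replace(",", " ").split():
            cpus.update(i for i, c in enumerate(counts) if c > 0)
    return sorted(cpus)


# Rows copied from the dump above.
sample = """\
           CPU0       CPU1
 66:       2891          0         PCI-MSI  libata
 98:       1060          0         PCI-MSI  eth0
ERR:          0
"""

table = parse_interrupts(sample)
print(cpus_handling(table, "eth0"))
```

Seeing only CPU0 in the result is consistent with the later finding that IRQ balancing was not the real cause.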
Same problem here on a Thinkpad T60p. Switching to a UP kernel didn't help, so it doesn't have anything to do with IRQ balancing. I tried several kernels from 2.6.16 to 2.6.18, and the 7.2.9 e1000 driver from SourceForge; the problem is still there. I tried InterruptThrottleRate=0; it doesn't help.

I observed that pings to the local network don't appear to be affected by the problem. I also observed that the problem doesn't exist for larger packets, i.e. ping -s 1000 works OK. I narrowed the limit down to 342, i.e. ping -s 342 and below have the problem, while ping -s 343 and above are OK. See for yourself (62.4.16.243 is the hop right after my home router):

    jdelvare@amber:~> ping -c 10 62.4.16.243
    PING 62.4.16.243 (62.4.16.243) 56(84) bytes of data.
    64 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=1010 ms
    64 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=392 ms
    64 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=1000 ms
    64 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=392 ms
    64 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=1000 ms
    64 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=392 ms
    64 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=1000 ms
    64 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=392 ms
    64 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=1000 ms
    64 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=392 ms

    --- 62.4.16.243 ping statistics ---
    10 packets transmitted, 10 received, 0% packet loss, time 9011ms
    rtt min/avg/max/mdev = 392.354/697.534/1010.802/304.989 ms, pipe 2

    jdelvare@amber:~> ping -c 10 62.4.16.243 -s 342
    PING 62.4.16.243 (62.4.16.243) 342(370) bytes of data.
    350 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=838 ms
    350 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=1000 ms
    350 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=40.5 ms
    350 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=1000 ms
    350 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=836 ms
    350 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=1000 ms
    350 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=836 ms
    350 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=1000 ms
    350 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=836 ms
    350 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=1836 ms

    --- 62.4.16.243 ping statistics ---
    10 packets transmitted, 10 received, 0% packet loss, time 9002ms
    rtt min/avg/max/mdev = 40.570/922.476/1836.399/408.233 ms, pipe 2

    jdelvare@amber:~> ping -c 10 62.4.16.243 -s 343
    PING 62.4.16.243 (62.4.16.243) 343(371) bytes of data.
    351 bytes from 62.4.16.243: icmp_seq=1 ttl=254 time=41.7 ms
    351 bytes from 62.4.16.243: icmp_seq=2 ttl=254 time=42.7 ms
    351 bytes from 62.4.16.243: icmp_seq=3 ttl=254 time=40.6 ms
    351 bytes from 62.4.16.243: icmp_seq=4 ttl=254 time=40.6 ms
    351 bytes from 62.4.16.243: icmp_seq=5 ttl=254 time=40.7 ms
    351 bytes from 62.4.16.243: icmp_seq=6 ttl=254 time=40.7 ms
    351 bytes from 62.4.16.243: icmp_seq=7 ttl=254 time=40.7 ms
    351 bytes from 62.4.16.243: icmp_seq=8 ttl=254 time=40.7 ms
    351 bytes from 62.4.16.243: icmp_seq=9 ttl=254 time=40.6 ms
    351 bytes from 62.4.16.243: icmp_seq=10 ttl=254 time=44.7 ms

    --- 62.4.16.243 ping statistics ---
    10 packets transmitted, 10 received, 0% packet loss, time 9001ms
    rtt min/avg/max/mdev = 40.623/41.425/44.780/1.308 ms
    jdelvare@amber:~>

The same commands run from any other host on my local network don't exhibit these high latencies and jitter. I'm not attaching lspci info, as it is exactly the same as Daniel's. I am willing to help debug this issue, but I just don't know what to try next. Any hints?
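The payload-size boundary reported above can be located with a binary search rather than by hand. The sketch below assumes there is a single crossover (slow at or below some size, fine above it), which is what this comment describes but is not guaranteed in general. The predicate is a stand-in: a real one would run something like ping -c N -s SIZE against the host and compare the average RTT to a cutoff.

```python
def largest_slow_payload(is_slow, lo, hi):
    """Binary search for the largest payload size that still shows the problem.

    Assumes is_slow(lo) is True, is_slow(hi) is False, and that there is
    exactly one crossover between them.
    """
    if not is_slow(lo) or is_slow(hi):
        raise ValueError("expected is_slow(lo) and not is_slow(hi)")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(mid):
            lo = mid  # problem still present, boundary is higher
        else:
            hi = mid  # problem gone, boundary is lower
    return lo


def fake_is_slow(size):
    """Stand-in predicate reproducing the boundary reported in this comment."""
    return size <= 342


print(largest_slow_payload(fake_is_slow, 56, 1000))  # prints 342
```

With a range of 56 to 1000 bytes this needs about ten probes, which matters when each probe is a multi-second ping run.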
Peter A. Zannucci found that using RxIntDelay=5 when loading the e1000 driver makes the problem go away, and I confirm that. As I understand it we don't want to make this the default, so this is in no way a fix, but it may help the developers understand the cause of the problem.
We have a root cause for this issue, and the fix involves an EEPROM change. We will shortly attach a script that users can run after loading the driver.
Status update: My previous comment regarding an EEPROM fix was incorrect. Currently the hardware team says the recommended workaround is to load the driver with RxIntDelay=8 or RxIntDelay=32 (basically any non-zero value), and only on the 82573L (device ID 8086:109a). The issue is still being worked on.
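Since the workaround is meant only for the 82573L, it may help to check for device ID 8086:109a before applying it. The sketch below scans the text of lspci -n (numeric IDs) for that ID and builds a modprobe options line. The function names are hypothetical, the sample lspci lines are illustrative rather than taken from this report, and which file under /etc/modprobe.d the line belongs in depends on the distribution.

```python
import re

# Device ID named in the status update above (82573L).
AFFECTED_IDS = {"8086:109a"}

# vendor:device pairs as printed by "lspci -n", e.g. "8086:109a".
ID_RE = re.compile(r"\b([0-9a-f]{4}:[0-9a-f]{4})\b", re.IGNORECASE)


def has_affected_nic(lspci_n_output):
    """True if any vendor:device pair in the text is a known affected ID."""
    return any(m.group(1).lower() in AFFECTED_IDS
               for m in ID_RE.finditer(lspci_n_output))


def modprobe_options_line(rx_int_delay=8):
    """Build an 'options' line for modprobe configuration."""
    if rx_int_delay <= 0:
        raise ValueError("the workaround needs a non-zero RxIntDelay")
    return "options e1000 RxIntDelay=%d" % rx_int_delay


# Illustrative lspci -n lines; bus addresses and class codes are made up.
sample = "02:00.0 0200: 8086:109a\n03:00.0 0280: 8086:4227\n"

if has_affected_nic(sample):
    print(modprobe_options_line(8))
```

Keeping the check separate from the option string makes it easy to skip the setting on machines with other e1000 variants.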
Created attachment 10403: Cisco VPN Client Connection w/e1000 and alternate nic
I believe that this bug may also cause issues with the Cisco VPN client (vpnclient-linux-x86_64-4.8.00.0490-k9). I have a T60p with an 82573L running Ubuntu Dapper (2.6.15-28-386), and I can reproduce the issue with Ubuntu Edgy (2.6.17). While modifying the RxIntDelay setting does stabilise my ping response times, I am unable to connect to a VPN using the Cisco VPN client in TCP tunneling mode. I am able to connect using the ipw3945 interface, as well as with a 3Com PC Card Ethernet card (3CCFE575CT) on the same system, so it does seem to be related to the e1000 driver. Packet captures at the router reveal that traffic is sent to and from the machine, so the VPN client is able to use the network, just not able to connect to the VPN concentrator. Is there anything else I can do that might improve this situation? Is there any diagnostic information that I can provide that might help to identify the specifics of the problem? I've attached a file showing the behaviour of the client.
John, I suspect you are affected by a different bug:

    http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=753eab76a3337863a0d86ce045fa4eb6c3cbeef9

Try a kernel with CONFIG_E1000_DISABLE_PACKET_SPLIT=y or backport the fix above.
Jean, you bring up a very interesting point. A quick question though: are you certain that this issue can also affect TCP-encapsulated traffic? It seems to perhaps be UDP specific. With the VPN configuration I'm using I don't do traditional IPsec using ISAKMP and ESP; instead the traffic is encapsulated in a TCP connection.
If your VPN doesn't make use of encapsulated ESP packets, then indeed it's unlikely that the patch I mentioned will help you. But maybe there's a similar bug in the TCP code, so I think it's still worth trying CONFIG_E1000_DISABLE_PACKET_SPLIT=y. If it doesn't help, then it means my guess wasn't correct and your problem is different.
Jean, you are entirely right. I recompiled the kernel on Edgy after changing the CONFIG_E1000_DISABLE_PACKET_SPLIT parameter to y. My e1000 now works properly with the VPN client. Do you have a link to the bug for the patch you referred me to? I'd like to update it and point out that it appears to affect (some?) TCP communication as well. If a new bug would be more appropriate, please let me know. Sorry to ask such questions; I'm quite new to bug reporting.
John, the bug in question was opened against SLES 10 in Novell's bugzilla, and it isn't public, sorry. I suggest that you try a recent kernel (at least 2.6.18.6, but the more recent the better) and confirm that the problem is still there. Maybe it's already fixed after all. If the bug is still there, then opening a new bug sounds sane.
I've opened Bug 8042 on the VPN related issues discussed above. http://bugzilla.kernel.org/show_bug.cgi?id=8042 Jean, Thanks for all your help, much appreciated.
Updating to the latest BIOS version (2.11) fixed the long ping issue on my T60 (2007-FUG).
BIOS 2.11 from http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-63027 solved it for me as well. Ping over 100 Mbps Ethernet shows rtt min/avg/max/mdev = 0.320/0.517/0.739/0.153 ms. Thank you!
Looks like the issue is resolved with the latest BIOS available. Closing.
Allow me to correct myself. This issue was not completely resolved by the BIOS 2.11 update, but by the patch that changes the RDTR value. This patch will be included shortly in a patch to the kernel, probably for 2.6.22.
Jesse, what patch are you referring to?
The patch is still not upstream, because the release preceding the one that contains this patch was rejected. The fix is simple, and can be applied by a user at module load time:

    rmmod e1000; modprobe e1000 RxIntDelay=8

The driver fix is available in the driver on our SourceForge site, version 7.5.5. It contains a patch to e1000_param.c that makes sure RxIntDelay is >= 8 on the 82573.
Installed CentOS 4.4 on a T60. Updated the BIOS to 2.12 (79ETD2WW). e1000 driver version 7.5.5. I used RxIntDelay=5, 8 and 32; ping is still slow.

    64 bytes from 192.168.1.1: icmp_seq=19 ttl=64 time=0.824 ms
    64 bytes from 192.168.1.1: icmp_seq=20 ttl=64 time=124 ms
    64 bytes from 192.168.1.1: icmp_seq=21 ttl=64 time=123 ms
    64 bytes from 192.168.1.1: icmp_seq=22 ttl=64 time=122 ms
    64 bytes from 192.168.1.1: icmp_seq=23 ttl=64 time=4.06 ms
    64 bytes from 192.168.1.1: icmp_seq=24 ttl=64 time=121 ms
    64 bytes from 192.168.1.1: icmp_seq=25 ttl=64 time=121 ms
    64 bytes from 192.168.1.1: icmp_seq=26 ttl=64 time=1.97 ms
    64 bytes from 192.168.1.1: icmp_seq=27 ttl=64 time=0.824 ms
    64 bytes from 192.168.1.1: icmp_seq=28 ttl=64 time=952 ms
    64 bytes from 192.168.1.1: icmp_seq=29 ttl=64 time=0.833 ms
    64 bytes from 192.168.1.1: icmp_seq=30 ttl=64 time=108 ms
    64 bytes from 192.168.1.1: icmp_seq=31 ttl=64 time=107 ms
    64 bytes from 192.168.1.1: icmp_seq=32 ttl=64 time=0.822 ms
    64 bytes from 192.168.1.1: icmp_seq=33 ttl=64 time=0.823 ms
    64 bytes from 192.168.1.1: icmp_seq=34 ttl=64 time=947 ms
    64 bytes from 192.168.1.1: icmp_seq=35 ttl=64 time=98.3 ms
justy: Does your T60 have an ATI video chip? If yes, and you are using the open source driver, give the ATI proprietary driver a shot. I recently installed CentOS 5, and the ping issue came back. The RxIntDelay fix works, but only if I'm using the ATI proprietary driver. What makes this even stranger is that if I use Dual Head mode, the problem appears again. It seems that there is some interaction between e1000 and the ATI drivers.
Same as justy here, and I have the ATI proprietary driver. Running Feisty with kernel 2.6.20-16-generic (386). A bug is open in Ubuntu Launchpad: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.15/+bug/42572
STATUS UPDATE: Issue is still being worked. I finally have hardware that I can test this issue on.
I've recently upgraded from 2.6.18 to 2.6.21.6 (Debian/unstable kernel) on my X60, and the problem is back. My hardware is an X60 with the 945GM chipset. I've also upgraded to version 2.1.0 of the xserver-xorg-video-intel driver, but since I upgraded the kernel and the driver at more or less the same time, I'm not sure which of the two had an effect. With 2.6.18 and the old i810 video driver, I got slow but steady pings with around 20 ms round-trip times. Now it's back to alternating 1000 ms / 0.5 ms when pinged from a remote host.
With kernel 2.6.22 (Gutsy) I have slow pings too, on a fresh install.
X60, Fedora 7, kernel 2.6.23.1 from updates, BIOS v2.12. Bug not fixed. RxIntDelay=8 fixes the problem.
The final fix (disabling L1 ASPM completely) was merged into jgarzik/netdev-2.6 #upstream as of this week. Please test and confirm. I'll set this bug to closed.
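When testing the fix, one way to confirm that L1 ASPM is actually off is to look at the LnkCtl line that lspci -vv prints for the NIC's PCI Express capability. The sketch below extracts the ASPM field from such text. It assumes the "LnkCtl: ASPM <state>;" layout; the exact wording differs between pciutils versions, and the sample lines are illustrative, not taken from an affected machine.

```python
import re

# Captures the ASPM field of a link-control line, up to the first semicolon.
ASPM_RE = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_states(lspci_vv_output):
    """Return the ASPM field of every LnkCtl line, e.g. 'L1 Enabled'."""
    return [m.group(1).strip() for m in ASPM_RE.finditer(lspci_vv_output)]


def l1_enabled(state):
    """True if an ASPM field reports L1 as enabled."""
    return "L1" in state and "Enabled" in state


# Illustrative lines in the style of lspci -vv output (assumed layout).
before = "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+\n"
after = "\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+\n"

print([l1_enabled(s) for s in aspm_states(before)])
print([l1_enabled(s) for s in aspm_states(after)])
```

If the field still reports L1 as enabled after loading the patched driver, that is worth mentioning when reporting test results.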