Bug 7579
Summary: | Sky2 receive checksum errors | ||
---|---|---|---|
Product: | Drivers | Reporter: | Badalian Slava (slavon.net) |
Component: | Network | Assignee: | Stephen Hemminger (stephen) |
Status: | REJECTED DUPLICATE | ||
Severity: | high | CC: | bryce, bunk, flyboy, grail, rakhal, stephen, zigamlinar |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.18 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | config; Filtered messages log; detect and turn off receive checksum |
Description
Badalian Slava
2006-11-24 07:09:06 UTC
Does it fail instantly, or only under load? The sky2 receive error message occurs if the receiver can't keep up with the arriving data. How loaded is this machine?

The computer is used as a DNS server and is 99% idle most of the time. I have a second computer that also has problems (newly installed system, no clients or connections, only 1-5 ssh sessions).

Symptoms: the machine does not answer pings or requests for 2-10 minutes, and after that everything works again. Unplugging and replugging the network cable also makes everything work again.

Some more information:
1. What is system name/motherboard, perhaps I can find one to try and reproduce the problem.
2. What is output of driver on boot up (dmesg | grep sky2)? The driver prints chip version information.

Please retry with 2.6.18.3 or 2.6.19-rc6, there were fixes after 2.6.18 for sky2.

1. OS: Gentoo, latest version and latest portage. I can't see the motherboard name; the computer is located 200 km from me =( The second computer now has 2 e1000 cards and goes to the first computer... I can't see its motherboard name either =( I can get you any other info that can be obtained from Linux remotely.
2. sky2 v1.7 addr 0xdfffc000 irq 17 Yukon-EC (0xb6) rev 1

OK, I will try 2.6.18.3 and post the results, but the bug may take up to a week to reproduce.

After update to 2.6.18.3:

ns1 ~ # uname -a
Linux ns1 2.6.18-gentoo-r3 #1 SMP Wed Nov 29 09:32:49 MSK 2006 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
ns1 ~ # dmesg | grep sky2
sky2 v1.5 addr 0xdfffc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:2f:88:9d:e4
sky2 eth0: enabling interface
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control none

The version did not change ;)

Does turning off receive checksumming fix the problem?
    ethtool -K eth0 rx off
Are you using both ports?
What is the kernel configuration?
What is the IRQ assignment? i.e. cat /proc/interrupts

2.6.18.3:

ip_ct_ras: decoding error: out of bound
eth0: hw csum failure.
 [<c02e9245>] __skb_checksum_complete+0x67/0x69
 [<c0336bc8>] udp_error+0xca/0x1a8
 [<c0114d6c>] try_to_wake_up+0x40/0x401
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c0334e91>] ip_conntrack_in+0xa4/0x4b8
 [<c0321b3c>] udp_queue_rcv_skb+0xa8/0x2a8
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c02fb286>] nf_iterate+0x66/0x8a
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c02fb3f9>] nf_hook_slow+0x59/0xd9
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c03023f3>] ip_rcv+0x304/0x51b
 [<c0301b50>] ip_rcv_finish+0x0/0x27b
 [<c02ebf3d>] __net_timestamp+0x14/0x27
 [<c02ec0d2>] netif_receive_skb+0x182/0x1fc
 [<c02905d1>] sky2_poll+0x545/0xa69
 [<c0274ee5>] i8042_interrupt+0x1e7/0x22a
 [<c02edb24>] net_rx_action+0x7d/0x10a
 [<c01200e2>] __do_softirq+0x73/0xdf
 [<c0120189>] do_softirq+0x3b/0x3d
 [<c01054f5>] do_IRQ+0x30/0x6b
 [<c01036de>] common_interrupt+0x1a/0x20
 [<c0101a99>] mwait_idle+0x2a/0x34
 [<c0101a59>] cpu_idle+0x63/0x79
 [<c042e7af>] start_kernel+0x34a/0x3fb
 [<c042e1eb>] unknown_bootoption+0x0/0x27a

ns1 ~ # cat /proc/interrupts
           CPU0       CPU1
  0:   26205345          0    IO-APIC-edge  timer
  7:          0          0    IO-APIC-edge  parport0
  9:          0          0   IO-APIC-level  acpi
 17:    6791063          0   IO-APIC-level  sky2
 19:      70142          0   IO-APIC-level  libata
NMI:          0          0
LOC:   26088304   26088304
ERR:          0
MIS:          0

> Does turning off receive checksumming fix the problem?
> ethtool -K eth0 rx off

I have only tried it now... I will report whether it helps.

> Are you using both ports?

My motherboard has only 1 Ethernet port. I am attaching config.gz.

Created attachment 9673 [details]
config
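The information requested so far in this report can be gathered in one pass. The script below is a minimal sketch, not something posted by the maintainer: the interface name `eth0`, the output path, and the helper names `section` and `collect` are assumptions for illustration. It only runs commands already mentioned in this thread, plus `ethtool -k` to show the current offload settings.

```shell
#!/bin/sh
# Collect the diagnostics asked for in this report into one text file.
# IFACE and OUT are assumptions; adjust them for the machine in question.
IFACE="${IFACE:-eth0}"
OUT="${OUT:-${TMPDIR:-/tmp}/sky2-report.txt}"

# Run one command and label its output. A failing or missing command is
# noted in the report instead of aborting the whole collection.
section() {
    echo "### $*"
    "$@" 2>&1 || echo "(no output, command failed, or tool not installed)"
    echo
}

collect() {
    section uname -a
    section sh -c 'dmesg | grep sky2'
    section ethtool -i "$IFACE"
    section ethtool -k "$IFACE"      # current offload settings, incl. rx checksumming
    section cat /proc/interrupts
    section lspci
    section iptables-save
}

collect > "$OUT"
echo "report written to $OUT"
```

Some of these commands need root to produce useful output (for example `iptables-save`, and `dmesg` on restricted systems); the report records the failure rather than stopping.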
OK, now I have no error messages in dmesg, but I still have a connection problem: sometimes the server does not respond on its IP (no ping, no ssh connection). If I shut down the port on the Cisco and bring it up again, everything is normal for some time, until the next outage =(

In dmesg:

sky2 eth0: Link is down.
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control none
sky2 eth0: Link is down.
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control none

I have the same problem...

System information:
Motherboard: AOPEN i915gmm-hfs
Distribution: debian

atos:/etc# uname -a
Linux atos 2.6.18.2 #1 PREEMPT Sat Nov 18 19:28:03 CET 2006 i686 GNU/Linux
atos:/etc# lspci
0000:00:00.0 Host bridge: Intel Corporation Mobile 915GM/PM/GMS/910GML Express Processor to DRAM Controller (rev 03)
0000:00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
0000:00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
0000:00:1c.0 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 1 (rev 04)
0000:00:1c.1 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 2 (rev 04)
0000:00:1c.2 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 3 (rev 04)
0000:00:1c.3 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 4 (rev 04)
0000:00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #1 (rev 04)
0000:00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #2 (rev 04)
0000:00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #3 (rev 04)
0000:00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #4 (rev 04)
0000:00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB2 EHCI Controller (rev 04)
0000:00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev d4)
0000:00:1f.0 ISA bridge: Intel Corporation 82801FBM (ICH6M) LPC Interface Bridge (rev 04)
0000:00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) IDE Controller (rev 04)
0000:00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 04)
0000:00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller (rev 04)
0000:02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 19)
0000:03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 19)
0000:05:04.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
0000:05:05.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
atos:/etc# ethtool -i eth2
driver: sky2
version: 1.5
firmware-version: N/A
bus-info: 0000:02:00.0

Problem Description:
Recently I downloaded and compiled the 2.6.18.2 kernel because I needed to add USB support to the kernel. Previously I ran a 2.6.14 kernel patched with sk98lin from SysKonnect without any problem for over a year. After the kernel update the 88E8053 ethernet interface randomly freezes.

The machine has two 3Com ethernet cards installed. The first I installed when the machine was new, because when I used the 88E8053 the ISP's DHCP server stepped in, activated a failsafe and shut down the link: my ethernet interfaces had reported over 30 different MAC addresses to the server. The second one I installed today, when I was tired of my internal network being down due to the new sky2 bug.

Created attachment 9721 [details]
Filtered messages log
*** Bug 7617 has been marked as a duplicate of this bug. ***

*** Bug 7615 has been marked as a duplicate of this bug. ***

Change title to match description

*** Bug 7611 has been marked as a duplicate of this bug. ***

What are the netfilter/iptables rules in use? I need to audit those code paths.

ns1 ~ # lsmod
Module                  Size  Used by
iptable_filter          3456  0
ns1 ~ # iptables-save
# Generated by iptables-save v1.3.6 on Thu Dec 7 10:41:39 2006
*filter
:INPUT ACCEPT [198729160:19918983787]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [201042059:31695809779]
COMMIT
# Completed on Thu Dec 7 10:41:39 2006

See the config for the modules compiled into the kernel.

Stephen Hemminger, could you write down all the commands, in bash, whose output you would like to see? I will gladly post it here.
Ziga

Created attachment 10514 [details]
detect and turn off receive checksum
I have seen a similar problem on a Sony laptop.
This patch automatically turns off hardware rx checksumming if we get a bogus value.
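For kernels without that patch, a rough userspace approximation is possible. The sketch below is not the attached patch and does not reproduce its logic: it only looks for the `hw csum failure` message quoted earlier in this report and, if it is present, applies the `ethtool -K <iface> rx off` command suggested above. The function names and the `ETHTOOL` override are assumptions for illustration.

```shell
#!/bin/sh
# NOT the attached kernel patch: a userspace approximation that disables
# rx checksum offload once the kernel log reports a checksum failure.

# Succeeds if the kernel log on stdin contains a checksum failure for
# interface $1, in the form quoted in this report ("eth0: hw csum failure.").
has_csum_failure() {
    grep -q "$1: hw csum failure"
}

# Turn off rx checksum offload on $1 when the log on stdin shows a failure.
# ETHTOOL may be overridden with another command name for a dry run.
disable_rx_csum_if_bad() {
    if has_csum_failure "$1"; then
        "${ETHTOOL:-ethtool}" -K "$1" rx off
    fi
}

# Typical use, as root:
#   dmesg | disable_rx_csum_if_bad eth0
```

The setting does not survive a driver reload or reboot, so the check would have to be repeated after `rmmod sky2` or a restart.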
I have also been experiencing this problem, under several Fedora Core 6 kernels (2.6.18 to 2.6.19-1.2911.6.5). No reboot is needed; only

rmmod sky2
ifup eth0

This is a web server with a phpBB2 forum and a proxy for only 6-7 people, i.e. the machine is under no heavy network load (it does have a small, but almost constant load). Before this started to happen, I could transfer hundreds of MB over the network in one go with no trouble at all. Under heavy testing, the card stopped responding after somewhere around 6GB was transferred (testing was done with an 8GB file).

From dmesg:

NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 107 .. 84 report=110 done=110
sky2 status report lost?

grep sky2 /var/log/messages:

Mar 10 20:53:56 xxx kernel: sky2 eth0: tx timeout
Mar 10 20:53:56 xxx kernel: sky2 status report lost?
Mar 10 20:54:01 xxx kernel: sky2 eth0: tx timeout
Mar 10 20:54:01 xxx kernel: sky2 hardware hung? flushing

Ethernet is Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 19), built on some Intel motherboard.

Any ideas from which kernel this started, so I can downgrade until it is fixed?

Forgot to mention: I've found somewhere that adding idle=poll to the kernel boot options helps sometimes (with the additional comment "I don't really understand why it helps" :-D) and have tried it with no visible improvement.

Heavy load DOES increase the probability of network failure. I.e. a simple aMule client will produce that effect.
Distribution: Debian
Linux oft10 2.6.18-3-686 #1 SMP Mon Dec 4 16:41:14 UTC 2006 i686 GNU/Linux

I want to confirm that with Debian 2.6.18-3-686 we were seeing the problem:

Feb 19 10:39:17 oft14 kernel: sky2 v1.5 addr 0xdf100000 irq 169 Yukon-EC (0xb6) rev 2
Feb 19 10:39:17 oft14 kernel: sky2 eth0: addr 00:03:25:27:c3:62
Feb 19 10:39:17 oft14 kernel: sky2 v1.5 addr 0xdf200000 irq 177 Yukon-EC (0xb6) rev 2
Feb 19 10:39:17 oft14 kernel: sky2 eth1: addr 00:03:25:27:c3:63
Feb 19 10:39:17 oft14 kernel: sky2 eth0: enabling interface
Feb 19 10:39:17 oft14 kernel: sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
Mar 2 17:24:14 oft14 kernel: sky2 hardware hung? flushing
Mar 2 21:55:27 oft14 kernel: sky2 status report lost?
Mar 2 22:12:31 oft14 kernel: sky2 hardware hung? flushing

As you can see above, the lock-up occurs after a few days of high load; it seems like a cumulative effect of handling large bandwidths. We have now upgraded to Debian 2.6.18-4-686. We are waiting to see if the event recurs and will report it if it does. (If we see the problem again we would be willing to try different settings upon request, to help diagnose the problem.)

Stumbled upon this bug: http://bugs.gentoo.org/show_bug.cgi?id=127367
They claim that the issue (not sure if it was/is the same issue!) was solved in sky2 1.4. Funny thing is: my FC6 kernel (2.6.19-1.2911.6.5.fc6) has sky2 1.10, while the other machine, running an older kernel (2.6.18-1.2798.fc6), has sky2 1.5. Unless .10 is more than .5, this is confusing. The other machine is under almost no load, so I cannot confirm that everything is OK.

Version 1.4 is in kernel 2.6.17; 2.6.20 has 1.10.
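The confusion over whether 1.10 is newer than 1.5 goes away once the version is read as dot-separated integer fields rather than as a decimal number. A small sketch of that comparison, using the version strings mentioned in this thread (the function name `newer_version` is an assumption for illustration):

```shell
#!/bin/sh
# Print the newer of two dotted version strings, comparing each
# dot-separated field numerically (so field "10" sorts after field "5").
newer_version() {
    printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | tail -n 1
}

newer_version 1.5 1.10      # prints 1.10
newer_version 0.15.1 1.4    # prints 1.4
```

Read this way, the driver versions reported here order as 0.15.1, 1.4, 1.5, 1.10, which matches the kernel versions they are reported with.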
Btw, a little workaround for when the issue arises: add to /etc/crontab:

6-56/10 * * * * root if (ping -c1 -q -w1 ip1 >& /dev/null || ping -c1 -q -w1 ip2 >& /dev/null); then echo -n > /dev/null; else /sbin/rmmod sky2; /sbin/ifup eth0; date | tee /root/network_restart | /bin/mail -s "Network restart" your@email; fi

ip1, ip2 are IP addresses of some machines that are always up. This chunk pings those machines every 10 minutes and, if none of them is responding, it restarts sky2 and eth0.

I've been running kernel 2.6.16.43 with sky2 version 0.15.1 for 4.5 days now and had no problems with it (aMule running all the time, so there is a constant network load which previously was enough to crash sky2 in a matter of hours). It seems to me that this issue is resolved.

Following up on comment#23, I want to report that we again had the sky2 hang incident this morning: 21.03.2007 09:10:00 MET. The incident seems to consistently repeat every 10 days or so under our load.

Inspired by comment#24 (thanks Vedran) we have placed the following entry in /etc/crontab:

* * * * * root if ! (ping -c1 -q -w2 IP >& /dev/null) then /sbin/rmmod sky2;/sbin/modprobe sky2;/usr/bin/mailx -s "Sky2 Restarted" EMAIL;fi

[where IP is a machine which is always on, and EMAIL is where you want notification]

Forgot to mention, we are using:
Debian
Linux oft14 2.6.18-4-686 #1 SMP Wed Feb 21 16:06:54 UTC 2007 i686 GNU/Linux

Can you please reproduce the problem on a 2.6.20 or later kernel?

I have the same problem (I believe) under 2.6.21. See: http://lkml.org/lkml/2007/4/28/105

I have updated the network driver on my machine and have now reduced the number of network interface hangs to just a few in a month. The bug is still there. I do not find any messages in the logs when the hang occurs. Is there something I have to configure to get the messages (I have the standard messages when I start and stop the device and plug and unplug the cable)?
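The two crontab one-liners quoted above can also be written as a small script, which is easier to read and to dry-run. This is a sketch under assumptions, not a tested replacement: the function names, the `PING` and `RELOAD` overrides, and the example addresses are illustrative, and the mail notification from the original entries is left out.

```shell
#!/bin/sh
# Watchdog sketch based on the crontab workarounds quoted in this report.
# IFACE, PING and RELOAD are assumptions added for illustration.
IFACE="${IFACE:-eth0}"

# Succeeds if at least one of the given hosts answers a single ping.
any_host_up() {
    for host in "$@"; do
        if ${PING:-ping} -c1 -q -w2 "$host" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Reload the driver and bring the interface back up (needs root).
reload_sky2() {
    /sbin/rmmod sky2
    /sbin/modprobe sky2
    /sbin/ifup "$IFACE"
}

# Reload only when none of the reference hosts is reachable.
watchdog() {
    if any_host_up "$@"; then
        echo "network ok"
    else
        echo "no host reachable, reloading sky2"
        ${RELOAD:-reload_sky2}
    fi
}

# From a root cron job, with hosts that are always up, for example:
#   watchdog 192.0.2.1 192.0.2.2
```

Checking more than one reference host, as the first crontab entry does, avoids reloading the driver just because a single remote machine is down.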
I bought a manageable switch to rule out that the switch had anything to do with the hangs.

atos:/# uname -a
Linux atos 2.6.20.2 #5 PREEMPT Mon Mar 12 00:34:27 CET 2007 i686 GNU/Linux
atos:/# ethtool -i eth2
driver: sky2
version: 1.10
firmware-version: N/A
bus-info: 0000:03:00.0

The remaining problems look like the (long) list reported in another bug, so I am going to mark this bug as a duplicate.

*** This bug has been marked as a duplicate of 7546 ***