The BCM5906M (14E4:1713) network adapter in a Lenovo S10e netbook reproducibly loses connectivity after two to thirty minutes under network load (involuntary test load here: the machine is a distcc client).

Occasionally the connection recovers after a long, varying delay (2 minutes minimum, longest observed about 10 minutes); more often it does not recover, even after hours (overnight). When recovery happens, some log messages appear at the time of recovery; unfortunately I do not have an example from a vanilla 3.2.2 at hand. I'll post the log as soon as I can capture one. When the connection does not recover by itself, nothing is logged at all; in particular, there are no messages at the point the connection is actually lost.

Sometimes, unplugging the ethernet cable and reconnecting it after about 8 seconds reestablishes the connection. Removing and reloading the tg3 module always helps.

tg3 module load messages:

tg3.c:v3.121 (November 2, 2011)
tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
tg3 0000:02:00.0: setting latency timer to 64
tg3 0000:02:00.0: eth0: Tigon3 [partno(BCM95906) rev c002] (PCI Express) MAC address 00:23:8b:18:d8:32
tg3 0000:02:00.0: eth0: attached PHY is 5906 (10/100Base-TX Ethernet) (Wire Speed[0], EEE[0])
tg3 0000:02:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[0]
tg3 0000:02:00.0: eth0: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex
tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX

The link is connected to a Netgear GS108 switch; the link parameters match what I would expect. In the "connection lost" state, the LEDs on the switch still indicate connectivity; the switch also indicates constant activity on that port.

From a software point of view, no connection is possible: distcc, ssh, ping, arp and tcpdumps on both ends all behave identically to a hypothetical

HOST <-> SWITCH <- no connection -> SWITCH <-> HOST

setup, i.e. the links are believed to be up but there is never any traffic besides the host's own TX.

I'll be happy to provide any further information or help with narrowing down the issue.
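To compare link parameters before and after a stall without reading dmesg by eye, the two tg3 status lines can be parsed mechanically. The following is only a sketch: the function name and the returned keys are made up for illustration, and the patterns are modelled on the two lines quoted in this report (the "half" duplex and "off" flow-control variants are assumed to use the same wording).

```python
import re

# Patterns follow the two tg3 status lines quoted in the report.
LINK_RE = re.compile(r"(\w+): Link is up at (\d+) Mbps, (full|half) duplex")
FLOW_RE = re.compile(r"(\w+): Flow control is (on|off) for TX and (on|off) for RX")


def parse_tg3_link_state(log_text):
    """Return the last link and flow-control state seen in log_text."""
    state = {}
    for line in log_text.splitlines():
        link = LINK_RE.search(line)
        if link:
            state["iface"] = link.group(1)
            state["speed_mbps"] = int(link.group(2))
            state["duplex"] = link.group(3)
        flow = FLOW_RE.search(line)
        if flow:
            state["iface"] = flow.group(1)
            state["flow_tx"] = flow.group(2) == "on"
            state["flow_rx"] = flow.group(3) == "on"
    return state


# The two status lines from the module load messages in this report.
SAMPLE = """\
tg3 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex
tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX
"""

if __name__ == "__main__":
    print(parse_tg3_link_state(SAMPLE))
```

Feeding it the output of dmesg after each re-plug would show whether speed, duplex or flow control differ between working and stuck states.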
On a hunch, I tried disabling flow control on the tg3 while it was in the stuck state:

# ethtool -A eth0 autoneg off tx off rx off

This made the connection operational again, although only briefly: after about 30 seconds the link became opaque again. Besides the link down / link up messages, this time with flow control off, no messages are logged. Messages are logged when the carrier state changes, even while the connection does not work and does not recover on re-plug.

I've also tried the following:

1. unload the tg3 module
2. unplug the ethernet cable
3. load the tg3 module
4. disable flow control as described above
5. connect the ethernet cable

After 30 - 60 seconds under load, the connection failed again.

Subjectively, there may be some connection with the type/direction of traffic:

0. ping is running from an external host to the affected system
1. reestablish connection by some means
2. connection is used by distcc, works at first
3. connect to the affected system using ssh
=> at that moment, the connection appears to fail, i.e. the ssh connection fails with "No route to host" and at roughly the same time ping starts showing "host unreachable"
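The software part of the module reload sequence (unload tg3, load tg3, disable flow control) could be scripted so that every attempt runs exactly the same commands. This is a rough sketch under assumptions: the report does not say how the module was unloaded and loaded, so `modprobe -r` / `modprobe` are my choice, and the helper names are hypothetical. The cable steps remain manual, and the interface may need to be up before the ethtool call takes effect.

```python
import subprocess


def tg3_reset_plan(iface="eth0", module="tg3"):
    """Commands for the software steps; unplugging and plugging stay manual."""
    return [
        ["modprobe", "-r", module],  # unload the driver (assumed command)
        ["modprobe", module],        # load it again (assumed command)
        # Flow control off, exactly as in the ethtool call from this report.
        ["ethtool", "-A", iface, "autoneg", "off", "tx", "off", "rx", "off"],
    ]


def run_plan(plan, dry_run=True):
    """Print the commands (dry run) or execute them; executing needs root."""
    rendered = [" ".join(cmd) for cmd in plan]
    for text, cmd in zip(rendered, plan):
        if dry_run:
            print("# " + text)
        else:
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run_plan(tg3_reset_plan())  # dry run: only prints the three commands
```

The dry run is the default on purpose, so the plan can be reviewed before anything touches the driver.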
Sorry for the fragmented reports - I forgot to mention: as far as I remember, this problem is long-standing; from operational experience, the Gentoo kernels 2.6.36, 2.6.38, 2.6.39 and 3.0.6 were affected as well. With kernel 3.2.2 and this bug report I'm finally taking the time to reproduce and report the issue with a vanilla kernel.
Created attachment 72535
Log when tg3 recovers on its own

In the meantime, I managed to observe an instance of the connection recovering on its own after a few (3-7) minutes of "opaqueness". I'm attaching the log generated at the moment of recovery; nothing was logged when the connection was lost earlier.
In the meantime, I found this: https://bugzilla.redhat.com/show_bug.cgi?id=509759

There are a few workarounds mentioned; I first tried:

# ethtool -K eth0 sg off

With this workaround, the connection is not dropped completely as described above (over a time frame in which this should have happened multiple times), BUT I see a number of TCP connections in a stuck state:

# netstat -n -et
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode
tcp        0  14161 192.168.42.42:45241     192.168.42.1:3632       ESTABLISHED 0          7815
tcp        0  14142 192.168.42.42:45246     192.168.42.1:3632       ESTABLISHED 0          7839
tcp        0  14150 192.168.42.42:45227     192.168.42.1:3632       ESTABLISHED 0          7749
tcp        0  14138 192.168.42.42:45244     192.168.42.1:3632       ESTABLISHED 0          7831
tcp        0  14126 192.168.42.42:45242     192.168.42.1:3632       ESTABLISHED 0          7820
tcp        0  14480 192.168.42.42:45223     192.168.42.1:3632       ESTABLISHED 0          7738

These (distcc) connections make no progress over a timespan of about a minute, while other distcc connections to the same target host communicate happily. Hm. Conjecture: the ACKs for these connections are never received?

I'll try the other workaround mentioned (noapic); moving the card to a different slot, as suggested, can't be done on the netbook.
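Spotting the stuck connections means comparing Send-Q between two netstat runs taken some time apart. A minimal sketch of that comparison follows; the function names are made up, the first snapshot reuses two rows from the output above, and the second snapshot is invented purely to exercise the code. An unchanged non-zero Send-Q is only a hint, since a busy connection could show the same value twice by coincidence.

```python
def parse_netstat(text):
    """Map (local, foreign) -> Send-Q for every TCP row of `netstat -n -et`."""
    queues = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows look like: Proto Recv-Q Send-Q Local Foreign State User Inode
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            queues[(fields[3], fields[4])] = int(fields[2])
    return queues


def stalled_connections(before, after):
    """Connections whose non-empty send queue is identical in both snapshots."""
    return sorted(
        conn for conn, sendq in before.items()
        if sendq > 0 and after.get(conn) == sendq
    )


# Two rows taken from the netstat output in this report.
SNAPSHOT_1 = """\
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode
tcp 0 14161 192.168.42.42:45241 192.168.42.1:3632 ESTABLISHED 0 7815
tcp 0 14142 192.168.42.42:45246 192.168.42.1:3632 ESTABLISHED 0 7839
"""

# Hypothetical later snapshot: first queue unchanged, second one drained.
SNAPSHOT_2 = """\
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode
tcp 0 14161 192.168.42.42:45241 192.168.42.1:3632 ESTABLISHED 0 7815
tcp 0 0 192.168.42.42:45246 192.168.42.1:3632 ESTABLISHED 0 7839
"""

if __name__ == "__main__":
    for conn in stalled_connections(parse_netstat(SNAPSHOT_1),
                                    parse_netstat(SNAPSHOT_2)):
        print("possibly stalled:", conn)
```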
With 'noapic' alone, the connection is lost in the same way as in the original report.
Both 'noapic' and 'ethtool -K eth0 sg off' behave identically to comment 4, i.e. I see stalled TCP connections with a Send-Q > 0. The tg3 interrupt on this machine appears to be shared with graphics and one USB controller; I'll attach the lshw output.
Created attachment 72539
lshw output on the affected machine, with 'noapic'
Clarification to comment 6: "Both 'noapic' and 'ethtool ...' ..." is supposed to mean: "The system behaves identically to comment 4 when both 'noapic' and 'ethtool ...' are given."
The stalled TCP connections eventually lead to distcc errors:

distcc[12694] (dcc_writex) ERROR: failed to write: Connection reset by peer
distcc[12694] (dcc_readx) ERROR: unexpected eof on fd5
distcc[12694] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[12694] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[12694] Warning: failed to distribute ...

With the 'sg off' workaround in place, these only happen for a small percentage of distcc runs; most of the (concurrent) distcc processes communicate and return OK.
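To put a number on "a small percentage", the failing distcc processes could be counted from the client log. A sketch, assuming the log lines have exactly the shape shown above; the function name is hypothetical, and the total number of distcc runs would have to come from elsewhere (for example the build log).

```python
import re

# distcc[PID] (function) ERROR: message
ERROR_RE = re.compile(r"^distcc\[(\d+)\] \((\w+)\) ERROR: (.*)$")


def failed_distcc_pids(log_text):
    """Map PID -> list of (function, message) for each ERROR line."""
    failures = {}
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            pid = int(match.group(1))
            failures.setdefault(pid, []).append((match.group(2), match.group(3)))
    return failures


# The error lines quoted in this report; the Warning line carries no
# "(function) ERROR:" part and is therefore not counted.
SAMPLE = """\
distcc[12694] (dcc_writex) ERROR: failed to write: Connection reset by peer
distcc[12694] (dcc_readx) ERROR: unexpected eof on fd5
distcc[12694] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[12694] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[12694] Warning: failed to distribute ...
"""

if __name__ == "__main__":
    failures = failed_distcc_pids(SAMPLE)
    print(len(failures), "failing distcc process(es):", sorted(failures))
```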
I have the same bug. I reported it two months ago at https://bugzilla.kernel.org/show_bug.cgi?id=42663 but nobody has picked it up yet.
It seems that this bug is fixed by this commit: https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5250def1935443f4bbddce101281e0aaf2e66838

So far I've had no problems with 3.4.9 and a patched 3.0.35.