Bug 42856 - tg3 loses connectivity under load
Summary: tg3 loses connectivity under load
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-03-04 08:31 UTC by Joe Breuer
Modified: 2016-02-15 21:56 UTC

See Also:
Kernel Version: 3.2.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Log when tg3 recovers on its own (10.94 KB, text/plain)
2012-03-04 09:40 UTC, Joe Breuer
lshw output on the affected machine, with 'noapic' (18.97 KB, text/plain)
2012-03-05 07:43 UTC, Joe Breuer

Description Joe Breuer 2012-03-04 08:31:23 UTC
The BCM5906M (14E4:1713) network adapter in a Lenovo S10e netbook reproducibly loses connectivity after two to thirty minutes under network load (the load here is incidental rather than a deliberate test: the machine is a distcc client).

Seldom, the connection will recover after a long, varying delay (2 minutes minimum, longest observed ~ 10 minutes); more often the connection will not recover, even after hours (overnight).

When recovery happens, there are some log messages at the time of recovery; unfortunately I do not have an example with a vanilla 3.2.2 at hand. I'll post the log as soon as I can capture one.

When the connection does not recover by itself, nothing is logged at all; especially, there are no messages at the point the connection is actually lost.

Sometimes, removing and reconnecting (after about 8 seconds wait) the ethernet cable will reestablish the connection.

Always, removing and reloading the tg3 module will help.


tg3 module load messages:

tg3.c:v3.121 (November 2, 2011)
tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
tg3 0000:02:00.0: setting latency timer to 64
tg3 0000:02:00.0: eth0: Tigon3 [partno(BCM95906) rev c002] (PCI Express) MAC address 00:23:8b:18:d8:32
tg3 0000:02:00.0: eth0: attached PHY is 5906 (10/100Base-TX Ethernet) (Wire Speed[0], EEE[0])
tg3 0000:02:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[0]
tg3 0000:02:00.0: eth0: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex
tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX

The link is connected to a Netgear GS108 switch, the link parameters match what I would expect.

In the "connection lost" state, the LEDs on the switch still indicate connectivity; the switch also indicates constant activity at that port. From a software point of view, no connection is possible: distcc, ssh, ping, arp and tcpdumps on both ends all behave identically to a hypothetical HOST <-> SWITCH <- no connection -> SWITCH <-> HOST setup, i.e. the links are believed to be up, but each side never sees any traffic besides its own TX.


I'll be happy to provide any further information / help with narrowing down the issue.
Comment 1 Joe Breuer 2012-03-04 08:45:07 UTC
On a hunch, I tried disabling flow control on the tg3 while it was in the stuck state with:

# ethtool -A eth0 autoneg off tx off rx off

This caused the connection to become operational again, although only briefly: After about 30 seconds the link became opaque again.

Besides the link down / link up messages, this time w/ flow control off, there are no messages logged.


There are messages logged when the carrier state changes even while the connection does not work, and does not recover on re-plug.


Also, I've tried to [0940]:
1. unload the tg3 module
2. unplug the ethernet cable
3. load the tg3 module
4. disable flow control as described above
5. connect the ethernet cable

After 30 - 60 seconds under load, the connection failed again.
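
For anyone trying to repeat the sequence above, here is a minimal sketch of steps 1 to 5 as a script. It is an illustration only, not something from the original test runs: `reload_steps` and `run` are hypothetical helper names, `modprobe -r` / `modprobe` are assumed as the way the module was unloaded and loaded, and `eth0` is taken from the module load messages. The two cable steps stay manual. By default it only prints what it would do; running the commands for real requires root.

```python
import subprocess
from typing import List, Optional, Tuple

# One step = (description, command); command is None for manual steps.
Step = Tuple[str, Optional[List[str]]]


def reload_steps(iface: str = "eth0") -> List[Step]:
    """Return the five steps from the comment, in order (hypothetical helper)."""
    return [
        # Assumption: modprobe is used to unload/load the module.
        ("unload the tg3 module", ["modprobe", "-r", "tg3"]),
        ("unplug the ethernet cable", None),
        ("load the tg3 module", ["modprobe", "tg3"]),
        ("disable flow control",
         ["ethtool", "-A", iface, "autoneg", "off", "tx", "off", "rx", "off"]),
        ("connect the ethernet cable", None),
    ]


def run(steps: List[Step], dry_run: bool = True) -> None:
    """Print each step; unless dry_run, execute commands and wait on manual steps."""
    for number, (description, command) in enumerate(steps, start=1):
        if command is None:
            print(f"{number}. MANUAL: {description}")
            if not dry_run:
                input("   press Enter when done ")
        else:
            print(f"{number}. {description}: {' '.join(command)}")
            if not dry_run:
                subprocess.run(command, check=True)


if __name__ == "__main__":
    run(reload_steps())  # dry run: prints the plan without touching the system
```

Keeping the cable steps as explicit prompts preserves the ordering from the list, where the module is loaded and flow control is disabled before the link comes up.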


Subjectively, there may be some connection with the type/direction of traffic:
0. ping is running from an external host to the affected system
1. reestablish connection by some means
2. connection is used by distcc, works at first
3. connect to the affected system using ssh
   => at that moment, the connection appears to fail,
      i.e. the ssh connection fails with "No route to host"
      and roughly at that moment, ping starts showing "host unreachable"
Comment 2 Joe Breuer 2012-03-04 08:47:35 UTC
Sorry for the fragmented reports - I forgot to mention:

As far as I remember, this problem is long-standing; from operational experience, the Gentoo kernels 2.6.36, 2.6.38, 2.6.39 and 3.0.6 were affected as well.

With kernel 3.2.2 and this bug I'm finally taking the time to reproduce and report the issue with a vanilla kernel.
Comment 3 Joe Breuer 2012-03-04 09:40:56 UTC
Created attachment 72535 [details]
Log when tg3 recovers on its own

In the meantime, I managed to see an instance of the connection recovering on its own after a few (3-7) minutes of "opaqueness".

I'm attaching the log that's generated at the instant of recovery; nothing was logged when the connection was lost earlier.
Comment 4 Joe Breuer 2012-03-05 07:33:21 UTC
In the meantime, I found this:
  https://bugzilla.redhat.com/show_bug.cgi?id=509759

There are a few workarounds mentioned; I first tried:
# ethtool -K eth0 sg off

With this workaround, the connection is not dropped completely as described above (in a time frame where this should have happened multiple times), BUT:
I see a number of TCP connections in a stuck state:

# netstat -n -et
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode     
tcp        0  14161 192.168.42.42:45241     192.168.42.1:3632       ESTABLISHED 0          7815      
tcp        0  14142 192.168.42.42:45246     192.168.42.1:3632       ESTABLISHED 0          7839      
tcp        0  14150 192.168.42.42:45227     192.168.42.1:3632       ESTABLISHED 0          7749      
tcp        0  14138 192.168.42.42:45244     192.168.42.1:3632       ESTABLISHED 0          7831      
tcp        0  14126 192.168.42.42:45242     192.168.42.1:3632       ESTABLISHED 0          7820      
tcp        0  14480 192.168.42.42:45223     192.168.42.1:3632       ESTABLISHED 0          7738      

These (distcc) connections never progress over a timespan of about a minute, with other distcc connections happily communicating (with the same target host).
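
A minimal sketch of how such stalled connections could be picked out automatically, by comparing two `netstat -n -et` snapshots taken some time apart. This is an illustration, not part of the original diagnosis: `Conn`, `parse_netstat` and `stalled` are hypothetical names, and "non-zero Send-Q that is unchanged between snapshots" is only a heuristic for "not progressing".

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    local: str
    remote: str
    state: str
    send_q: int


def parse_netstat(text: str) -> List[Conn]:
    """Parse the TCP rows of `netstat -n -et` output (hypothetical helper)."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows: Proto Recv-Q Send-Q Local Foreign State User Inode
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[2].isdigit():
            conns.append(Conn(fields[3], fields[4], fields[5], int(fields[2])))
    return conns


def stalled(before: str, after: str) -> List[Conn]:
    """Connections whose non-zero Send-Q is identical in both snapshots."""
    earlier = {(c.local, c.remote): c.send_q for c in parse_netstat(before)}
    return [
        c for c in parse_netstat(after)
        if c.send_q > 0 and earlier.get((c.local, c.remote)) == c.send_q
    ]
```

Feeding it two captures taken about a minute apart would list only the connections whose queued bytes did not move, which is the pattern shown in the netstat output above.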

Hm. Conjecture: The ACKs for these connections are never received?

I'll try the other workaround mentioned (noapic); moving the card to a different slot as suggested can't be done on the netbook.
Comment 5 Joe Breuer 2012-03-05 07:37:51 UTC
With 'noapic' alone, the connection is lost just as in the original report.
Comment 6 Joe Breuer 2012-03-05 07:41:51 UTC
Both 'noapic' and 'ethtool -K eth0 sg off' behave identically to comment 4, i.e. I see stalled TCP connections with a Send-Q > 0.

The tg3 interrupt on this machine appears to be shared with graphics and one USB controller, I'll attach the lshw output.
Comment 7 Joe Breuer 2012-03-05 07:43:47 UTC
Created attachment 72539 [details]
lshw output on the affected machine, with 'noapic'
Comment 8 Joe Breuer 2012-03-05 07:44:45 UTC
Clarification to comment 6:

"Both 'noapic' and 'ethtool ...' ..." is supposed to mean:

"The system behaves identically to comment 4 when both 'noapic' and 'ethtool ...' are given."
Comment 9 Joe Breuer 2012-03-05 07:47:16 UTC
The stalled TCP connections eventually lead to distcc errors:

distcc[12694] (dcc_writex) ERROR: failed to write: Connection reset by peer
distcc[12694] (dcc_readx) ERROR: unexpected eof on fd5
distcc[12694] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[12694] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[12694] Warning: failed to distribute ...


With the 'sg off' workaround in place, these only happen for a small percentage of distcc runs; most of the (concurrent) distcc processes communicate and return OK.
Comment 10 Alexey Kunitskiy 2012-08-23 16:16:31 UTC
I have the same bug, https://bugzilla.kernel.org/show_bug.cgi?id=42663, posted 2 months before, but nobody has handled it yet.
Comment 11 Alexey Kunitskiy 2012-09-14 08:59:59 UTC
It seems that the bug is fixed by this commit: https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5250def1935443f4bbddce101281e0aaf2e66838

So far I've had no problems with 3.4.9 and a patched 3.0.35.
