Bug 42856
Summary: | tg3 loses connectivity under load | | |
---|---|---|---|
Product: | Drivers | Reporter: | Joe Breuer (linux-kernel) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | NEW | | |
Severity: | normal | CC: | alexey.kv, mcarlson, mchan, nsujir, szg00000 |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 3.2.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Log when tg3 recovers on its own; lshw output on the affected machine, with 'noapic' | | |
Description
Joe Breuer
2012-03-04 08:31:23 UTC
On a hunch, I tried disabling flow control on the tg3 while it was in the stuck state with:

# ethtool -A eth0 autoneg off tx off rx off

This caused the connection to become operational again, although only briefly: After about 30 seconds the link became opaque again. Besides the link down / link up messages, this time w/ flow control off, there are no messages logged. There are messages logged when the carrier state changes even while the connection does not work, and does not recover on re-plug.

Also, I've tried to [0940]:
1. unload the tg3 module
2. unplug the ethernet cable
3. load the tg3 module
4. disable flow control as described above
5. connect the ethernet cable

After 30 - 60 seconds under load, the connection failed again.

Subjectively, there may be some connection with the type/direction of traffic:
0. ping is running from an external host to the affected system
1. reestablish connection by some means
2. connection is used by distcc, works at first
3. connect to the affected system using ssh
=> at that moment, the connection appears to fail, i.e. the ssh connection fails with "No route to host" and roughly at that moment, ping starts showing "host unreachable"

Sorry for the fragmented reports - I forgot to mention: In my memory, this problem appears long-standing; i.e. from operations experience the gentoo kernels 2.6.36, 2.6.38, 2.6.39, 3.0.6 were affected as well. With kernel 3.2.2 and this bug I'm finally taking the time to reproduce and report the issue with a vanilla kernel.

Created attachment 72535
Log when tg3 recovers on its own
In the meantime, I managed to see an instance of the connection recovering on its own after a few (3-7) minutes of "opaqueness".
I'm attaching the log that's generated at the instance of recovery; there is nothing logged when the connection was lost earlier.
In the meantime, I found this: https://bugzilla.redhat.com/show_bug.cgi?id=509759

There are a few workarounds mentioned; I first tried:

# ethtool -K eth0 sg off

With this workaround, the connection is not dropped completely as described above (in a time frame where this should have happened multiple times), BUT: I see a number of TCP connections in a stuck state:

# netstat -n -et
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address       Foreign Address   State       User Inode
tcp   0      14161  192.168.42.42:45241 192.168.42.1:3632 ESTABLISHED 0    7815
tcp   0      14142  192.168.42.42:45246 192.168.42.1:3632 ESTABLISHED 0    7839
tcp   0      14150  192.168.42.42:45227 192.168.42.1:3632 ESTABLISHED 0    7749
tcp   0      14138  192.168.42.42:45244 192.168.42.1:3632 ESTABLISHED 0    7831
tcp   0      14126  192.168.42.42:45242 192.168.42.1:3632 ESTABLISHED 0    7820
tcp   0      14480  192.168.42.42:45223 192.168.42.1:3632 ESTABLISHED 0    7738

These (distcc) connections never progress over a timespan of about a minute, with other distcc connections happily communicating (with the same target host). Hm. Conjecture: The ACKs for these connections are never received?

I'll try the other workaround mentioned (noapic); moving the card to a different slot as suggested can't be done on the netbook.

'noapic' alone loses the connection same as in the original report.

Both 'noapic' and 'ethtool -K eth0 sg off' behave identically to comment 4, i.e. I see stalled TCP connections with a Send-Q > 0. The tg3 interrupt on this machine appears to be shared with graphics and one USB controller; I'll attach the lshw output.

Created attachment 72539
lshw output on the affected machine, with 'noapic'
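
The stalled connections described above (ESTABLISHED, Send-Q > 0, no progress for about a minute) could be picked out automatically by comparing two `netstat -n -et` snapshots. What follows is only a minimal sketch of that idea, not a tool used in this report: the names `Conn`, `parse_netstat` and `stalled` are hypothetical, and the column order is assumed to match the output quoted in this thread. An unchanged Send-Q is a heuristic; a busy connection could show the same value twice by coincidence.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    """One TCP row from netstat output (hypothetical helper type)."""
    local: str
    remote: str
    state: str
    send_q: int


def parse_netstat(text: str) -> List[Conn]:
    """Parse TCP rows from `netstat -n -et` style output.

    Assumes the column order shown in this report:
    Proto Recv-Q Send-Q Local Foreign State User Inode.
    """
    conns = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with the protocol name; banner and header rows are skipped.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        conns.append(Conn(local=fields[3], remote=fields[4],
                          state=fields[5], send_q=int(fields[2])))
    return conns


def stalled(before: str, after: str) -> List[Conn]:
    """Return ESTABLISHED connections whose non-empty Send-Q is unchanged."""
    earlier = {(c.local, c.remote): c.send_q for c in parse_netstat(before)}
    return [c for c in parse_netstat(after)
            if c.state == "ESTABLISHED"
            and c.send_q > 0
            and earlier.get((c.local, c.remote)) == c.send_q]
```

Taking the two snapshots roughly a minute apart would match the observation interval mentioned earlier in this thread; connections that drain their queue in between are not reported.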
Clarification to comment 6: "Both 'noapic' and 'ethtool ...' ..." is supposed to mean: "The system behaves identically to comment 4 when both 'noapic' and 'ethtool ...' are given."

The stalled TCP connections eventually lead to distcc errors:

distcc[12694] (dcc_writex) ERROR: failed to write: Connection reset by peer
distcc[12694] (dcc_readx) ERROR: unexpected eof on fd5
distcc[12694] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[12694] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[12694] Warning: failed to distribute ...

With the 'sg off' workaround in place, these only happen for a small percentage of distcc runs; most of the (concurrent) distcc processes communicate and return OK.

I have the same bug, https://bugzilla.kernel.org/show_bug.cgi?id=42663, posted 2 months earlier, but nobody has picked it up yet.

It seems that the bug is fixed by this commit: https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5250def1935443f4bbddce101281e0aaf2e66838

So far I've had no problems with 3.4.9 and patched 3.0.35.