Bug 7440 - (net 3c59x) suddenly receives no more packets
Summary: (net 3c59x) suddenly receives no more packets
Status: CLOSED OBSOLETE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 normal
Assignee: Steffen Klassert
URL:
Keywords:
Duplicates: 6444
Depends on: 6444
Blocks:
Reported: 2006-10-31 18:17 UTC by Pierre Ynard
Modified: 2012-05-14 03:03 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.18.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
3c59x module logs (136 bytes, text/html)
2006-10-31 18:21 UTC, Pierre Ynard

Description Pierre Ynard 2006-10-31 18:17:56 UTC
Most recent kernel where this bug did not occur: < 2.6.16
Distribution: Debian (unstable)
Hardware Environment: Pentium MMX, Ethernet controller: 3Com Corporation 3c905B
100BaseTX [Cyclone] (rev 64)

Problem Description:
Without any warning, my 3c59x network interface suddenly receives no more
packets. A tcpdump shows output packets going out, but no received packet at all
(whereas there *is* traffic on the network). From the outside, it seems that the
affected host still outputs packets in a normal way, but is not responding
anymore to ARP or anything, and this results in an almost total loss of network
connectivity. Fortunately, it seems that the bug only appears under rare
conditions: it has happened to me two or three times over the past year.

Lately, I have been able to correlate it to echo problems on the network. For
some reason, it may happen on my network that there is a loop causing echo in
IPv6 traffic: my network card then receives duplicate IPv6 packets, with an
interval around 200 ns between each other. At some apparently random point, the
bug is triggered.

See attached logs below.

Reloading the 3c59x module restores normal operation.
Comment 1 Pierre Ynard 2006-10-31 18:21:57 UTC
Created attachment 9390
3c59x module logs

Kernel logs with full debug output from 3c59x module.
Judging by host timeouts, I estimate that the bug was triggered around 19:15.
Comment 2 Pierre Ynard 2007-02-19 03:34:58 UTC
I have read through the driver code and done further debugging. My understanding
is the following.

At the moment when the device "crashes", the logs indicate that it receives 32
packets from the network, which is the size of the rx_ring for incoming packets.
I can confirm that those 32 packets are really 32 different valid packets, sent
on the network at exactly the same time, i.e. between two interrupts, and that
actually there are very likely more than 32 of them, which is more than the size
of the rx_ring... It looks like those further overflowing packets are eventually
"ignored", since the interrupt handler reads the 32 packets on the rx_ring and
returns, and then no more interrupt is ever sent by the device to signal
reception of new packets.

Note that it is only my understanding, but yes, I have reasons to believe that
my network does send more than 32 different packets at the same time before my
driver can handle them (at least that is what happens according to the driver).
I guess that I could increase the size of the rx_ring to see if it "fixes" the
problem.
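
The hypothesis above can be sketched as a toy model. This is only an illustration of the reasoning in this comment, not the 3c59x driver's code: the struct, the function names and the `restart_on_stall` flag are all hypothetical, and the assumption that the device stops signalling reception once the ring overflows is exactly the unverified part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a receive ring; NOT the real 3c59x data structures. */
struct toy_rx_ring {
    size_t size;     /* number of descriptors (cf. RX_RING_SIZE) */
    size_t filled;   /* descriptors holding a packet not yet handled */
    size_t dropped;  /* packets that arrived with no free descriptor */
    bool   stalled;  /* ASSUMPTION: device stops receiving after overflow */
};

void toy_ring_init(struct toy_rx_ring *r, size_t size)
{
    assert(size > 0);
    r->size = size;
    r->filled = 0;
    r->dropped = 0;
    r->stalled = false;
}

/* Device side: `burst` packets arrive between two interrupts. */
void toy_device_receive(struct toy_rx_ring *r, size_t burst)
{
    for (size_t i = 0; i < burst; i++) {
        if (r->stalled) {               /* already stuck: packet is lost */
            r->dropped++;
        } else if (r->filled == r->size) {
            r->stalled = true;          /* ring full: model the "crash" */
            r->dropped++;
        } else {
            r->filled++;
        }
    }
}

/*
 * Driver side: drain every filled descriptor and return how many were handled.
 * `restart_on_stall` is a hypothetical knob modelling a handler that notices
 * the overflow and restarts reception.
 */
size_t toy_interrupt_handler(struct toy_rx_ring *r, bool restart_on_stall)
{
    size_t handled = r->filled;

    r->filled = 0;
    if (restart_on_stall)
        r->stalled = false;
    return handled;
}
```

In this model, a burst of 40 packets into a 32-entry ring leaves the ring stalled after the handler has drained its 32 packets, and every later packet is dropped. With 256 entries the same burst is handled normally, but a burst larger than 256 stalls the ring in the same way, so a bigger ring only raises the threshold.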
Comment 3 Pierre Ynard 2007-03-08 18:03:29 UTC
I tried changing RX_RING_SIZE from 32 to 256 packets, and it definitely solved
the problem.
Comment 4 Adryan Ban 2007-10-07 18:09:35 UTC
I have the same problem with both the TX and RX rings. Which ring is affected depends on which direction carries more traffic. In my case, one interface receives traffic and the other forwards it: I get the message for the RX ring on the first interface and for the TX ring on the second.

I raised RX_RING_SIZE and TX_RING_SIZE to 256, and it seems to be OK. Do not forget max_interrupt_work! Raise it from 32 to about 1024 or 2048. This is another big problem with this driver :(
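
To illustrate why a per-interrupt work limit matters, here is a minimal sketch of a bounded event loop. It is a toy model under stated assumptions: only the name max_interrupt_work and the values 32 and 1024 come from this comment; the struct and function are hypothetical, and what the real driver does once the limit is exceeded is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one toy interrupt-handler invocation (hypothetical type). */
struct toy_irq_result {
    size_t handled;   /* events processed in this invocation */
    size_t deferred;  /* events left pending because the budget ran out */
};

/*
 * Process pending events until none remain or `max_work` events have been
 * handled. `max_work` plays the role of max_interrupt_work in this model.
 */
struct toy_irq_result toy_handle_events(size_t pending, size_t max_work)
{
    struct toy_irq_result res = { 0, 0 };

    assert(max_work > 0);
    while (pending > 0 && res.handled < max_work) {
        pending--;
        res.handled++;
    }
    res.deferred = pending;  /* whatever is left must wait */
    return res;
}
```

With 100 pending events, a budget of 32 handles 32 and defers 68, while a budget of 1024 handles all 100 in a single invocation. In this model the budget therefore needs to be at least as large as the ring sizes for a full ring to be drained in one pass.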
Comment 5 Natalie Protasevich 2008-04-04 01:31:49 UTC
What is the status of this problem? Maybe the fix described in #3 and #4 needs to be submitted to lkml?
Comment 6 Steffen Klassert 2008-04-04 03:22:46 UTC
I tried to decrease RX_RING_SIZE and TX_RING_SIZE to 2 and added some printk's.
I did some stress tests with iperf and the tx/rx rings were full many times
but I've got no such problems with the driver.
I tested this with 2.6.24 and 3c905B/C cards. 

It seems that those problems occur when the rings are full. Increasing the ring
sizes of course makes it less likely that a ring fills up, but the problem
is probably still there.

Could somebody check whether the problems are still present in newer kernels,
and send the config if they are?
Comment 7 Steffen Klassert 2008-04-15 03:53:50 UTC
I added a dependency on bug #6444; the symptoms are very similar.
I also posted a patch there to support ring size changes with ethtool.
Details about the patch can be found on the page for bug #6444.
Comment 8 Devin Crain 2008-04-24 10:30:03 UTC
I've increased RX_RING_SIZE, TX_RING_SIZE and max_interrupt_work, and I still have to ifdown/ifup the device running this driver every so often. It doesn't get to the state where it won't accept or send any more packets, but it does hit a size limit: I can get most webpages or small files, but anything larger than a few KB fails to download, either timing out after the initial chunk or failing silently.
Comment 9 Alan 2009-03-17 08:57:10 UTC
*** Bug 6444 has been marked as a duplicate of this bug. ***
Comment 10 Pierre Ynard 2012-05-14 03:03:13 UTC
I understand that you'd close the ticket because there is little point in keeping it open. But, for the record, I still need to patch the driver to increase the ring buffer size to mitigate this issue.