Most recent kernel where this bug did not occur: < 2.6.16
Distribution: Debian (unstable)
Hardware Environment: Pentium MMX, Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 64)
Problem Description: Without any warning, my 3c59x network interface suddenly stops receiving packets. A tcpdump shows packets going out, but no received packets at all (although there *is* traffic on the network). From the outside, the affected host still appears to send packets normally, but it no longer responds to ARP or anything else, which results in an almost total loss of network connectivity.

Fortunately, the bug seems to appear only under rare conditions: it happened to me two or three times during the last year. Lately, I have been able to correlate it with echo problems on the network. For some reason, a loop on my network sometimes causes echoes in IPv6 traffic: my network card then receives duplicate IPv6 packets, about 200 ns apart. At some apparently random point, the bug is triggered. See the attached logs below. Reloading the 3c59x module restores normal operation.
Created attachment 9390
3c59x module logs

Kernel logs with full debug output from the 3c59x module. Judging by host timeouts, I estimate that the bug was triggered around 19:15.
I have read through the driver code and done some further debugging. My understanding is the following. At the moment when the device "crashes", the logs indicate that it receives 32 packets from the network, which is the size of the rx_ring for incoming packets. I can confirm that those 32 packets really are 32 different valid packets, sent on the network at the same time, i.e. between two interrupts, and that there are very likely more than 32 of them, which is more than the rx_ring can hold. It looks like the overflowing packets are eventually "ignored": the interrupt handler reads the 32 packets on the rx_ring and returns, and after that the device never raises another interrupt to signal the reception of new packets. Note that this is only my understanding, but I do have reasons to believe that my network sends more than 32 different packets before my driver can handle them (at least that is what happens according to the driver). I guess that I could increase the size of the rx_ring to see if it "fixes" the problem.
I tried changing RX_RING_SIZE from 32 to 256 packets, and it definitely solved the problem.
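To make the reasoning in the two comments above concrete, here is a minimal toy model of a receive ring, not code from the 3c59x driver. It only illustrates why a burst that arrives between two interrupts can exceed a 32-slot ring but fit in a 256-slot one. The function name `simulate_burst`, the struct `rx_result`, and the burst size of 40 are hypothetical; the sketch does not model the part of the bug where the device stops raising interrupts afterwards.

```c
#include <assert.h>

/* Toy model, NOT driver code: counts what happens to a burst of packets
 * that arrives before the interrupt handler gets a chance to drain the ring. */
struct rx_result {
    unsigned stored;   /* packets that found a free ring slot */
    unsigned dropped;  /* packets that arrived while the ring was full */
};

/* ring_size plays the role of RX_RING_SIZE; burst is the number of packets
 * received between two interrupts (a hypothetical value chosen by the caller). */
static struct rx_result simulate_burst(unsigned ring_size, unsigned burst)
{
    struct rx_result r = {0, 0};

    assert(ring_size > 0);
    for (unsigned i = 0; i < burst; i++) {
        if (r.stored < ring_size)
            r.stored++;     /* a free slot is available */
        else
            r.dropped++;    /* ring is full: this packet is lost */
    }
    return r;
}
```

In this model, `simulate_burst(32, 40)` stores 32 packets and drops 8, while `simulate_burst(256, 40)` stores all 40. A larger ring only raises the burst size needed to overflow it; it does not remove the overflow case.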
I've got the same problem with both the TX and RX rings. Which ring is affected depends on which direction carries more traffic. In my case, when one interface receives traffic and I forward that traffic out through a second interface, I get the message about the RX ring on the first interface and about the TX ring on the second. I've raised RX_RING_SIZE and TX_RING_SIZE to 256, and it seems to be OK. Do not forget max_interrupt_work! Raise it from 32 to about 1024 or 2048. This is another big problem with this driver :(
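The role of max_interrupt_work mentioned above can be sketched with a similar toy model. This assumes, as I understand the parameter, that it caps how many events the interrupt handler services per invocation; the function `service_interrupt` and the figure of 100 pending events are hypothetical and not taken from the driver.

```c
#include <assert.h>

/* Toy model, NOT driver code: the handler services one pending event per
 * loop iteration and stops once the work budget is used up.
 * Returns the number of events still pending when the handler exits. */
static unsigned service_interrupt(unsigned pending, unsigned budget)
{
    assert(budget > 0);
    while (pending > 0 && budget > 0) {
        pending--;  /* one event handled */
        budget--;   /* one unit of the work budget consumed */
    }
    return pending;
}
```

With 100 events pending, `service_interrupt(100, 32)` leaves 68 events unhandled, while `service_interrupt(100, 1024)` leaves none. Under that assumption, a small budget combined with larger rings could leave work behind on every interrupt, which would explain why the two settings are raised together.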
What is the status of this problem? Maybe the fix described in #3 and #4 needs to be submitted to lkml?
I tried decreasing RX_RING_SIZE and TX_RING_SIZE to 2 and added some printk's. I ran some stress tests with iperf, and the tx/rx rings were full many times, but I did not see any such problems with the driver. I tested this with 2.6.24 and 3c905B/C cards. It seems that those problems occur when the rings are full. Increasing the ring sizes of course makes it less likely that a ring fills up, but the problem is probably still there. Could somebody check whether the problems are still present in newer kernels, and send the config if they are?
I added a dependency on bug #6444; the symptoms are very similar. I also posted a patch there to support ring size changes with ethtool. Details about the patch can be found on the page for bug #6444.
I've increased RX_RING_SIZE, TX_RING_SIZE and max_interrupt_work, and I still have to ifdown/ifup the device running this driver every so often. It doesn't get to the state where it won't accept or send any more packets, but it does hit a size limit: I can get most web pages or small files, but anything larger than a few KB fails to download, either timing out after getting the initial chunk or silently failing.
*** Bug 6444 has been marked as a duplicate of this bug. ***
I understand that you'd close the ticket because there is little point in keeping it open. But, for the record, I still need to patch the driver to increase the ring buffer size to mitigate this issue.