Most recent kernel where this bug did *NOT* occur: 2.6.18.6 Distribution: CentOS 4.4 Hardware Environment: ASUS M2N-E (nForce 570 chipset) AMD64 X2 Dual Core 5200+ Built in nic hooked up to a HP ProCurve 2428 gigabit switch. Software Environment: Using a plain kernel without any distro patches. Using forcedeth network driver. No nVidia binary drivers what so ever. IP tables is turned off. Problem Description: At random all network traffic just stops without apparent reason. Running tcpdump shows outgoing packets but no incoming packets. Ifconfig does not show any rx or tx errors. No error messages show up in dmesg. Transfering data at high speed triggers the problem almost right away. When running an interactive session in ssh it tends to happen after 5-10 minutes. Kernel 2.6.19.2 has the same problem as well. Doing a ifdown eth0 + ifup eth0 makes things work again for a while, but sooner or later it happens again. When hooking it up to a 100 megabit switch it works perfectly. I've tried messing around with all parameters you can set with ethtool but nothing helped.
what does "ethtool etho" say when it stops working?
sorry, 'ethtool eth0' i meant to type (if that wasn't obvious)
It shows: Settings for eth0: Supported ports: [ MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Advertised auto-negotiation: Yes Speed: 1000Mb/s Duplex: Full Port: MII PHYAD: 1 Transceiver: externel Auto-negotiation: on Supports Wake-on: g Wake-on: d Link detected: yes Which is also exactly the same output as when the net works OK.
there have been some changes to the forcedeath driver since the kernel you are using are you able to test a mainline kernel just as another reference point? also, do you know if this happens if you use another switch (is that possible to test?)
Sure, which kernel version do you want me to test? Unfortunately I don't have a different gigabit switch to test with.
i wondered if commit f35723ec48ca60f2f3493ea40d63a9bc5b585c28 might address this case, 2.6.20 has that change Ayaz perhaps you can comment here if that is likely to change anything?
For the failing case, could you please enable dprintk macro near the top of the forcedeth.c file (Change the #if 0 to #if 1) and send me the output of dmesg?
I've already tried 2.6.20, thats why I put it in the version field. :) It falls flat on its face, both 2.6.19.2 and 2.6.20 behaves exactly the same. Seems like I made a typo, the switch model number is 2824 and nothing else. I'll change the printk and give it a spin first thing tomorrow morning.
Created attachment 10328 [details] Logfile with forcedeth dprintk enabled. I've now done a run with logging enabled on v2.6.20. I've cut out the relevant part from /var/log/messages and attached as messages.gz. The bug triggered at about 12:24 and at 12:26 I did an ifdown/ifup to fix it. Also, due to the debug messages network traffic slowed to a crawl (topped out at 400 kb/s or so) and it took quite a while before the bug triggered. Another thing I forgot to mention is that I have "Use Rx and Tx Polling (NAPI)" enabled. When I was recompiling to turn of debugging again I decided to try disabling polling as well. It turns out that now things seem to work just great, network traffic has been working without interruption for over an hour. So the bug is related to the rx/tx polling option.
Great! That really helps in narrowing it down. I will take alook at the code. Thanks.
Could you please send me the output of "ethtool -d ethX" and "ethtool -S ethX" and "ifconfig" when it is in the state where you can not receive packets?
Created attachment 10361 [details] Fix for NAPI Can you try this patch please?
Created attachment 10366 [details] Output of ethtool -d eth0 I've attached the output from ethtool -d since it was quite long. ethtool -S eth0 says: NIC statistics: tx_bytes: 188278 tx_zero_rexmt: 860 tx_one_rexmt: 0 tx_many_rexmt: 0 tx_late_collision: 0 tx_fifo_errors: 0 tx_carrier_errors: 0 tx_excess_deferral: 0 tx_retry_error: 0 tx_deferral: 0 tx_packets: 860 tx_pause: 0 rx_frame_error: 0 rx_extra_byte: 0 rx_late_collision: 0 rx_runt: 0 rx_frame_too_long: 0 rx_over_errors: 1 rx_crc_errors: 0 rx_frame_align_error: 0 rx_length_error: 0 rx_unicast: 1240 rx_multicast: 13 rx_broadcast: 808 rx_bytes: 203676 rx_pause: 0 rx_drop_frame: 186 rx_packets: 2061 rx_errors_total: 1 And ifconfig: eth0 Link encap:Ethernet HWaddr 00:18:F3:E2:07:27 inet addr:x.x.x.2 Bcast:x.x.x.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1866 errors:0 dropped:0 overruns:0 frame:0 TX packets:845 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:176313 (172.1 KiB) TX bytes:182984 (178.6 KiB) Interrupt:20 Base address:0x8000 I was unable to test your patch, it did not apply to to 2.6.20. I did try to apply it by hand as well but it turned out that the functions nv_alloc_rx_optimized and nv_rx_process_optimized does not exist in 2.6.20. Perhaps you can include the entire forcedeth.c file?
Ok, I know why the patch didn't apply. I was also following a thread with similar symptoms which is against 2.6.20-rc6-mm3. The patch fix won't help in this case. I did notice that the Rx interrupt mask is off. Can you try running in the failing setup (NAPI) with MSI/MSIX turned off? There are two module params that you can use: msi=0 msix=0
Having been a bit busy I havn't gotten around to test with MSI off until I got the golden opportunity today. Suddenly network traffic went all erratic with lots of dropouts and this error message appeared in dmesg every 10 seconds: Feb 13 20:36:39 srv002 kernel: NETDEV WATCHDOG: eth0: transmit timed out Feb 13 20:36:39 srv002 kernel: eth0: Got tx_timeout. irq: 00000000 Feb 13 20:36:39 srv002 kernel: eth0: Ring at 6f33e000: next 28084618 nic 28084362 Feb 13 20:36:39 srv002 kernel: eth0: Dumping tx registers Feb 13 20:36:39 srv002 kernel: 0: 00002010 000000ff 00000003 036e03ca 00000000 00000000 00000000 00000000 Feb 13 20:36:39 srv002 kernel: 20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ... This was with NAPI diabled but MSI and MSIX enabled. Let me know if the complete register dump is of any use to you. As a test I unloaded the forcedeth module and loaded my "other" version compiled with NAPI enabled and also adding the msi=0 msix=0 options. This caused the error messages and network dropouts to stop. It has been running perfectly for an hour now and I've also run a couple of network benchmarks, which would previously cause traffic to stop almost immediatly, without problems. So it seems that NAPI works fine if MSI interrupts are turned off.
Thanks. Can we try to isolate it a bit further? Could you please try with msi=1 and msix=0 followed by msi=0 and msix=1?
Loading the module with msi=1 msix=0 seems to work fine, it ran without incident for 30 minutes and passed my benchmark tests as well. When I loaded it with msi=0 and msix=1 it failed to bring the link up, dmesg said: forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.59. ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [APCH] -> GSI 20 (level, low) -> IRQ 20 forcedeth: using HIGHDMA eth0: forcedeth.c: subsystem: 01043:8239 bound to 0000:00:08.0 eth0: no link during initialization. But then the line "eth0: link up." that normally follows did not appear. However ethtool showed the link to be detected: Settings for eth0: Supported ports: [ MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Advertised auto-negotiation: Yes Speed: 1000Mb/s Duplex: Full Port: MII PHYAD: 1 Transceiver: externel Auto-negotiation: on Supports Wake-on: g Wake-on: d Link detected: yes ifconfig started to show plenty of dropped packets: eth0 Link encap:Ethernet HWaddr 00:18:F3:E2:07:27 inet addr:x.x.x.2 Bcast:x.x.x.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:295 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Interrupt:20 Base address:0x8000
I gave it a another try with msi=0 msix=1 and this time I got link and the problems in my original report where replicated. Network traffic hangs after a while and a ifdown & ifup cures it temporarily. Perhaps it just happened to hang before link negotiation completed on my first try or something.
It turns out that about every 2 days the "NETDEV WATCHDOG: eth0: transmit timed out" error messages show up and network traffic goes bonkers. Therefore I've enabled napi and msi with msix turned off. However now this mysterious message have started to show up in my logs: eth0: too many iterations (6) in nv_nic_irq. There are quick bursts where this message is printed several hundred times now and then. Network traffic seems to continue working OK though.
You can change the "max work" parameter fir the adapter. The error message "too many iterations" is just a warning.
Mmmm, kay. But are the msix problems fixed now? In that case, where can I find the pach?
Bug was fixed by disabling msix by default in the driver. This is now in 2.6.20.1.
OK! Thanks for the info.