Bug 4552

Summary:	(net forcedeth) nForce NIC sometimes stop working
Product:	Drivers	Reporter:	Robert Cernansky (openhs)
Component:	Network	Assignee:	Jeff Garzik (jgarzik)
Status:	REJECTED DOCUMENTED
Severity:	normal	CC:	aabdulla, antoine, bakhos, bunk, faceprint, hostmaster, jstubbs, kernel, khanreaper, mail, manfred, netllama, niels, osdl, peter, svec
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.12_rc3	Subsystem:
Regression:	---	Bisected commit-id:
Attachments:	Add statistics to forcedeth tx handler; dmesg with debug info; Forcedeth with debug info, pinging out; Forcedeth with debug info, being pinged; Forcedeth tcpdump -n -i eth0 output; unconditionally check for completed packets; Forcedeth with debug info, pinging out (added debug patch); dump MAC registers and tx ring on timeout; Register dump of the hung NIC; Another register dump of the hung NIC; Crash dump...; Register dump (crashed with patch); Register dump (crashed with forcedeth 038); YARD (hung with forcedeth 038); Forcedeth 040 hang; Tx ring size increasement; Modified packet filter flags; Forcedeth 041 with patch hang; further linkspeed changes; forcedeth 0.42; IRQ are going high; dmesg output; dmesg output; dprintk during failure; dprintk after reboot

Description Robert Cernansky 2005-04-26 13:14:11 UTC

Distribution: Gentoo
Hardware Environment: AMD Athlon64 3200+, K8N Neo4 Platinum (nForce4 chipset)
Software Environment: forcedeth kernel module
Problem Description:

The nForce onboard NIC sometimes stops working and remains blocked even after a
reboot. No packets come in or go out through this NIC.

The following errors appear in /var/log/messages:

Apr 24 21:44:30 amit NETDEV WATCHDOG: eth0: transmit timed out
Apr 24 21:44:30 amit nv_stop_tx: TransmitterStatus remained busy<7>eth0:
tx_timeout: dead entries!
Apr 24 21:44:30 amit Badness in local_bh_enable at kernel/softirq.c:140
Apr 24 21:44:30 amit
Apr 24 21:44:30 amit Call Trace: <IRQ> <ffffffff80135905>{local_bh_enable+53}
<ffffffff880e3d35>{:ip_conntrack:destroy_conntrack+53}
Apr 24 21:44:30 amit <ffffffff802bf764>{__kfree_skb+196}
<ffffffff88000535>{:forcedeth:nv_drain_tx+133}
Apr 24 21:44:30 amit <ffffffff880008ed>{:forcedeth:nv_tx_timeout+93}
<ffffffff801e7ed0>{cursor_timer_handler+0}
Apr 24 21:44:30 amit <ffffffff802d2350>{dev_watchdog+0}
<ffffffff802d23b3>{dev_watchdog+99}
Apr 24 21:44:30 amit <ffffffff801390de>{run_timer_softirq+366}
<ffffffff80135833>{__do_softirq+83}
Apr 24 21:44:30 amit <ffffffff801358c5>{do_softirq+53} <ffffffff801110b7>{do_IRQ+71}
Apr 24 21:44:30 amit <ffffffff8010ec69>{ret_from_intr+0}  <EOI>
<ffffffff8030e184>{thread_return+0}
Apr 24 21:44:30 amit <ffffffff88072aa0>{:parport:parport_ieee1284_write_compat+0}
Apr 24 21:44:30 amit <ffffffff8010ca60>{default_idle+0}
<ffffffff8010ca80>{default_idle+32}
Apr 24 21:44:30 amit <ffffffff8010cb91>{cpu_idle+49}
<ffffffff804f47cf>{start_kernel+463}
Apr 24 21:44:30 amit <ffffffff804f4263>{_sinittext+611}
Apr 24 21:44:30 amit Badness in local_bh_enable at kernel/softirq.c:140
Apr 24 21:44:30 amit
Apr 24 21:44:30 amit Call Trace: <IRQ> <ffffffff80135905>{local_bh_enable+53}
<ffffffff880e3d9f>{:ip_conntrack:destroy_conntrack+159}
Apr 24 21:44:30 amit <ffffffff802bf764>{__kfree_skb+196}
<ffffffff88000535>{:forcedeth:nv_drain_tx+133}
Apr 24 21:44:30 amit <ffffffff880008ed>{:forcedeth:nv_tx_timeout+93}
<ffffffff801e7ed0>{cursor_timer_handler+0}
Apr 24 21:44:30 amit <ffffffff802d2350>{dev_watchdog+0}
<ffffffff802d23b3>{dev_watchdog+99}
Apr 24 21:44:30 amit <ffffffff801390de>{run_timer_softirq+366}
<ffffffff80135833>{__do_softirq+83}
Apr 24 21:44:30 amit <ffffffff801358c5>{do_softirq+53} <ffffffff801110b7>{do_IRQ+71}
Apr 24 21:44:30 amit <ffffffff8010ec69>{ret_from_intr+0}  <EOI>
<ffffffff8030e184>{thread_return+0}
Apr 24 21:44:30 amit <ffffffff88072aa0>{:parport:parport_ieee1284_write_compat+0}
Apr 24 21:44:30 amit <ffffffff8010ca60>{default_idle+0}
<ffffffff8010ca80>{default_idle+32}
Apr 24 21:44:30 amit <ffffffff8010cb91>{cpu_idle+49}
<ffffffff804f47cf>{start_kernel+463}
Apr 24 21:44:30 amit <ffffffff804f4263>{_sinittext+611}

The only way out of this blocked state is to power the computer off completely
(unplug it from the mains). For me it happens once or twice a day. It depends on
network traffic (more traffic means a bigger chance of the NIC blocking).

The kernels that I've tried:
 * Gentoo patched kernel 2.6.9-gentoo-r14 - works fine
 * Gentoo patched kernel 2.6.11-gentoo-r6 - bug is present
 * Vanilla kernel 2.6.12_rc3 - bug is present

See the bug report in the Gentoo Bugzilla: http://bugs.gentoo.org/show_bug.cgi?id=90069

Other sources describing this bug:

http://forums.gentoo.org/viewtopic-t-320241.html
http://forums.gentoo.org/viewtopic-t-318214.html
http://forums.gentoo.org/viewtopic-t-310223.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0502.0/0219.html

Steps to reproduce:

1. Boot Linux.
2. Generate some network traffic (copy a big file from/to the network). The
hang is not easy to reproduce.

Comment 1 Manfred Spraul 2005-06-06 10:56:37 UTC

I've tried to reproduce the bug, without success.
10 hours at 80 MByte/sec, no hang with the latest kernel (2.6.12-rc5-git10) on
an nForce 250-Gb board.

Could you add a few more details?
- what's the link partner? A switch or a cross-over cable to another nic?
- at which link speed do you operate? Gigabit or 100 mbit?
- Could you send me (manfred@colorfullife.com) the source code from forcedeth.c
from 2.6.9-gentoo-r14? I'm not aware of a change that might have introduced the
regression.

Comment 2 Robert Cernansky 2005-06-07 09:53:19 UTC

The computer is connected to a 10/100 Mbit switch with a straight-through cable.
The speed is 100 Mbit.

I've sent you the forcedeth.c from 2.6.9-gentoo-r14.

Comment 3 Michael Bakhos 2005-06-24 13:07:34 UTC

Distribution: Gentoo    
Hardware Environment: nForce4 chipset on 2 different systems (see below for
details)
Software Environment: forcedeth kernel module
Problem Description:
I have to second this problem. I have 2 computers with this problem.

One is an Asus K8N (nForce4 chipset, Athlon64 CPU). When I'm using the NVIDIA
NIC (the motherboard also has a built-in sk98lin NIC) I sometimes have the
problem. I have had it ever since I moved to a 2.6.11 kernel (all
gentoo-sources versions I tried; I moved from 2.6.9 to 2.6.11 and never tried
2.6.10).
It happened both when the card was connected with a straight-through cable
directly to a 100 Mbit/s card and when it was connected with a straight-through
cable to a 1000 Mbit/s switch (Netgear).

The second is a Tyan Thunder K8WE (S2895) (nForce4 chipset, dual dual-core
Opteron; this board has 2 NICs from two NVIDIA parts ("1st from nForce™ Prof.
2200, 2nd from nForce Prof.
2050" (http://www.tyan.com/products/html/thunderk8we_spec.html)), each
connected to a CPU). The problem occurred just 2 days ago (the system is only 3
weeks old). It currently affects only the first NIC; the second one still works
properly. The first NIC is connected to a 1000 Mbit/s switch (I can't check the
brand right now, but I could if needed), the second to the second port of a
Broadcom BCM5704C dual-channel integrated NIC (tg3 driver) in another computer;
both use straight-through cables.
The problem on this system was noticed about 2 hours (but it could have
happened earlier) after a night-long shutdown (normally it stays on, but we had
to turn it off since the building A/C was being repaired). On power-on it booted
a 2.6.12-rc6 kernel (previously 2.6.12-rc5), downloaded from the kernel.org
website and NOT a Gentoo kernel. We needed a 2.6.12 kernel to get the system
working with the Interrupt Mode BIOS option set to APIC (the default); 2.6.11
gentoo-sources only worked with Interrupt Mode=PIC, and then with a lot of lag.
     
In both cases, when the problem occurs I get (in the dmesg log):
NETDEV WATCHDOG: eth0: transmit timed out
nv_stop_tx: TransmitterStatus remained busy

and after some time:
nv_stop_tx: TransmitterStatus remained busy<7>eth0: tx_timeout: dead entries!

The "NETDEV WATCHDOG: eth0: transmit timed out" message appears from time to
time, but usually when I try to restart the adapter. Otherwise the "nv_stop_tx:
TransmitterStatus remained busy<7>eth0: tx_timeout: dead entries!" line can
start appearing more and more often, but at different rates (it seems to
depend on how much is being transmitted).
     
I've noticed that power cycling (unplugging the power and plugging it back in),
as stated previously, corrects the problem (until it happens again).

Also, on one or two occasions, I only removed the network cable and did a
soft reboot, and it worked.

This problem is intermittent. Sometimes I use the Asus system for 1 hour before
I get the problem; other times it works for over a week without any problem.
But I can't say it's related to high load; it also seems to happen at low load.

One difference from the first report: it seems that sometimes the NIC is still
able to receive some packets (only a few) when this problem occurs, but it's
not transmitting anything.

Comment 4 Michael Bakhos 2005-06-24 13:14:14 UTC

Correction: it is an Asus A8N (not K8N), specifically the A8N-SLI Deluxe (still
an nForce4 chipset). Sorry for the typo.

Comment 5 Peter 2005-06-27 05:59:27 UTC

I have a MSI K8N Diamond with all the same problems and errors under Debian
2.6.11-1-k7.

Comment 6 Santiago Garc 2005-06-28 12:12:20 UTC

Same problem here on a pristine 2.6.12 on an Asus A8N-E board; it has an nForce4
chipset with a Marvell PHY. The machine is running Debian Sarge AMD64
and is connected to a gigabit switch, but this can also happen when it is
connected with a direct cable to another NIC or whatever.

If you need more info just ask.

Regards!

Comment 7 Manfred Spraul 2005-06-28 13:03:36 UTC

Created attachment 5234 [details]
Add statistics to forcedeth tx handler

Comment 8 Manfred Spraul 2005-06-28 13:12:38 UTC

Hi all.
It's a known bug, but unfortunately I do not understand what exactly causes the
hang. It would be great if you could do the following:
- bring the nic into the hung state.
- rmmod the driver.
- Enable all debug outputs (The change is in line 124: replace "#if 0" with "#if
1"), recompile the driver, load the new driver and try to use the nic.
- Attach the complete dmesg output.

Other interesting tests might be to use version 0.35 of the forcedeth driver [it
should be in the latest git tree], to use the attached patch or to double check
that really both packet receive and transmit do not work: ping from another
computer and check with "tcpdump -n" if any packet is received. But a full debug
output would be the best starting point for me.

Comment 9 Daniel Drake 2005-06-29 04:06:50 UTC

FYI gentoo-sources-2.6.12-r2 includes forcedeth 0.35

Comment 10 Jan Gutter 2005-06-30 06:43:39 UTC

Created attachment 5242 [details]
dmesg with debug info

This is the dmesg of linux-2.6.12-gentoo-r2.
The network operated as normal until: 
nv_stop_tx: TransmitterStatus remained busy

After the freeze-up I removed forcedeth, and inserted the debug version.

I rebooted to Windows XP (SP2+nforce drivers) and the network card was still
inoperative. Did the power-down/plug out trick. The network socket's LEDs were
both lit without blinking until I plugged out the power cord. They were on even
when I plugged out the network cable! After the power-down/plug out trick, the
interface came back up normally.

Don't hesitate to ask me to experiment on my hardware... I work all day on my
PC and I experience network hangups about 2-4 times a day.

Comment 11 Manfred Spraul 2005-07-02 01:53:46 UTC

Thanks.

Unfortunately you didn't wait long enough with the debug version:
The initialization worked as expected.
The link was detected.
packet receive worked.
Packets were queued for transmission.
The only thing that is missing is either the timeout from the packet transmission
or the tx done interrupt. The timeout is 5 seconds: ping another computer, wait
5 seconds and then check for a line that starts with "nv_tx_done: looking at
packet" in the dmesg dump.
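
The check described above can also be scripted. What follows is only a minimal sketch, not part of the driver or of any attached patch: it scans saved dmesg text for the "nv_tx_done: looking at packet" debug line. The function name and the sample text are made up for illustration, and the match is on a substring because the logged line may carry an interface prefix such as "eth0: ".

```python
import re

# Matches the debug line named in this comment. A prefix such as "eth0: "
# is tolerated because the pattern is searched for anywhere in the text.
TX_DONE = re.compile(r"nv_tx_done: looking at packet (\d+)")


def tx_done_packets(dmesg_text):
    """Return the packet indexes of all nv_tx_done debug lines found."""
    return [int(m.group(1)) for m in TX_DONE.finditer(dmesg_text)]


# Hypothetical sample; in practice pass the saved output of `dmesg`.
sample = (
    "eth0: link up.\n"
    "eth0: nv_tx_done: looking at packet 0, Flags 0xa0800059.\n"
)
found = tx_done_packets(sample)
print("tx completion lines:", len(found))
```

An empty result after the 5 second wait would suggest that neither the tx done path nor the timeout path was logged, which is the missing piece described above.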

Comment 12 Jan Gutter 2005-07-06 04:47:01 UTC

Created attachment 5285 [details]
Forcedeth with debug info, pinging out

This is take 2 of the dmesg dump.... I've actually concatenated a bunch of
dmesgs together to get this one, so it's long (deleted the duplicate lines by
hand so it *should* be just like one long dmesg). It's about 10-20 seconds of
me pinging out on a hung-up forcedeth connection.

Comment 13 Jan Gutter 2005-07-06 04:50:12 UTC

Created attachment 5286 [details]
Forcedeth with debug info, being pinged

Like above, but I stopped the outgoing ping and started an incoming ping from
another PC on the net.

Comment 14 Jan Gutter 2005-07-06 04:54:06 UTC

Created attachment 5287 [details]
Forcedeth tcpdump -n -i eth0 output

Finally, while the interface was "hung" I did a tcpdump -n -i eth0 to confirm
that data is actually received. Therefore, it looks like it's mainly outgoing
packets that are affected. Which might explain why my NIC didn't hang
yesterday: nobody copied from me...

Comment 15 Manfred Spraul 2005-07-06 12:50:29 UTC

Created attachment 5290 [details]
unconditionally check for completed packets

Thanks for the dmesg output - finally progress on that bug.
Could you again use the stock driver until the nic hangs, and then load a nic
driver with
- dprintk enabled
- the attached patch applied?
The patch will show if the tx engine really hangs, or if it just doesn't
generate interrupts.

Comment 16 Jan Gutter 2005-07-07 04:00:09 UTC

Created attachment 5294 [details]
Forcedeth with debug info, pinging out (added debug patch)

Another day, another hang... ;-) Here's the dmesg with the patched forcedeth.
Now it shows: 
eth0: nv_tx_done: looking at packet 0, Flags 0xa0800059.

Comment 17 Ayaz Abdulla 2005-07-08 15:16:34 UTC

Could you please output all the MAC registers (from 0x0 to 0x400) and the 
whole Tx ring after the hang? Thanks!

Comment 18 Jan Gutter 2005-07-08 15:45:23 UTC

How would I do that? I assume I use something like ethtool, or if you give me a
little C program I can run it too...

Comment 19 Manfred Spraul 2005-07-08 23:46:24 UTC

Created attachment 5303 [details]
dump MAC registers and tx ring on timeout

Unfortunately forcedeth doesn't support the ethtool register dump command yet.

I've written a patch that dumps everything interesting on a tx timeout.
Could you add this patch to a normal nic driver and then use it until it hangs?

Instead of the "normal" error message that you got so far, i.e.

  NETDEV WATCHDOG: eth1: transmit timed out
  eth1: Got tx_timeout. irq: 00000000

, the patched driver dumps around 100 lines with all registers and all tx
entries. Please send us that part of your dmesg file.

Thanks for your patience!

Comment 20 Jan Gutter 2005-07-14 01:14:13 UTC

Created attachment 5329 [details]
Register dump of the hung NIC

FINALLY! I've been running the patch since Monday, and this morning the NIC
hung! Attached, please find the register dump... Sorry, the hangs *seem* less
frequent than they used to be...

Comment 21 Jan Gutter 2005-07-14 05:27:58 UTC

Created attachment 5330 [details]
Another register dump of the hung NIC

Nope, it still hangs randomly. Another hang, another register dump, just in
case having two dumps might be useful.

Comment 22 Ayaz Abdulla 2005-07-14 12:44:13 UTC

Thanks for the debug output. I see some changes that can be made and will work 
with Manfred to give you a patch to try out.

Comment 23 Manfred Spraul 2005-07-15 12:45:15 UTC

Hi Ayaz,

The attached patch is what you want, correct?
The timeout was 5 seconds - that should be long enough to guarantee that 
the queue is drained.

--
    Manfred
--- 2.6/drivers/net/forcedeth.c	2005-07-15 21:42:35.000000000 +0200
+++ build-2.6/drivers/net/forcedeth.c	2005-07-15 21:42:30.000000000 +0200
@@ -306,7 +306,7 @@
 
 #define NV_TX2_LASTPACKET	(1<<29)
 #define NV_TX2_RETRYERROR	(1<<18)
-#define NV_TX2_LASTPACKET1	(1<<23)
+#define NV_TX2_LASTPACKET1	(1<<30)
 #define NV_TX2_DEFERRED		(1<<25)
 #define NV_TX2_CARRIERLOST	(1<<26)
 #define NV_TX2_LATECOLLISION	(1<<27)
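
To see what the one-line change means for a real descriptor, the Flags value 0xa0800059 reported in comment 16 can be decoded against the bit definitions visible in this hunk. This is a rough sketch for illustration only: it knows just the names shown above, so any other set bits (for example bit 31 and the low bits) are ignored, and the helper name `set_names` is made up.

```python
# Bit names and positions are taken from the patch context above; the two
# LASTPACKET1 values are the definition before and after the one-line change.
COMMON_BITS = {
    "NV_TX2_RETRYERROR": 18,
    "NV_TX2_DEFERRED": 25,
    "NV_TX2_CARRIERLOST": 26,
    "NV_TX2_LATECOLLISION": 27,
    "NV_TX2_LASTPACKET": 29,
}
LASTPACKET1_OLD = 23
LASTPACKET1_NEW = 30


def set_names(flags, lastpacket1_bit):
    """Names of the known flag bits that are set in `flags`."""
    bits = dict(COMMON_BITS, NV_TX2_LASTPACKET1=lastpacket1_bit)
    return sorted(name for name, bit in bits.items() if flags & (1 << bit))


flags = 0xA0800059  # value reported in comment 16
print("old define:", set_names(flags, LASTPACKET1_OLD))
print("new define:", set_names(flags, LASTPACKET1_NEW))
```

With the old definition the value has both NV_TX2_LASTPACKET and NV_TX2_LASTPACKET1 set; with the new one only NV_TX2_LASTPACKET matches, since bit 30 is clear in that value.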

Comment 24 Ayaz Abdulla 2005-07-15 13:09:21 UTC

Yes, that is correct.

Jan, please give it a try and let me know if you encounter the hang.

Comment 25 Jan Gutter 2005-07-19 06:51:28 UTC

Created attachment 5347 [details]
Crash dump...

I *think* I left my PC on with the updated forcedeth, but I might be completely
wrong. This is what I found this morning... I won't trust this for
incontrovertible evidence of a hang, but I'll log a new dump as soon as it
hangs. Cross fingers!

Comment 26 Jan Gutter 2005-07-20 08:48:41 UTC

Created attachment 5352 [details]
Register dump (crashed with patch)

Definite hang this time with the one-liner patch in the previous comment.

Comment 27 Ayaz Abdulla 2005-07-21 08:27:40 UTC

One more experiment to try is the new tx interrupt scheme (in forcedeth version 
38). You can find the patch here: 
http://www.colorfullife.com/~manfred/Linux-kernel/forcedeth/

Thanks,
Ayaz

Comment 28 Jan Gutter 2005-07-25 05:19:47 UTC

Created attachment 5373 [details]
Register dump (crashed with forcedeth 038)

Up for roughly 5 hours, then a crash. This time with forcedeth 038, so even
with the new interrupt routine it fails...

Comment 29 Jan Gutter 2005-07-25 05:29:05 UTC

Created attachment 5374 [details]
YARD (hung with forcedeth 038)

So, 2 minutes after rebooting from the previous hang, it hangs again! I'm now
trying forcedeth 040... I wonder if 64-bit DMA will have any effect?

Comment 30 Jan Gutter 2005-07-25 08:00:47 UTC

Created attachment 5375 [details]
Forcedeth 040 hang

Forcedeth 040 also fails to fix the problem...

Comment 31 Ayaz Abdulla 2005-07-25 09:50:27 UTC

Created attachment 5376 [details]
Tx ring size increasement

Could you please try out this new patch? It will increase the Tx ring size to
1024.

Thanks,
Ayaz

Comment 32 Jan Gutter 2005-07-26 02:02:03 UTC

Sorry, even with the enlarged tx ring it hangs. Do you want another register
dump, or are they getting a bit redundant?

Comment 33 Ayaz Abdulla 2005-07-29 14:21:58 UTC

Can you describe your exact setup? i.e. switch? hub? link speed/duplex? network
environment (many machines, access to internet, etc.).
Also, what kind of network traffic are you generating? A lot of FTP traffic? Web
traffic? NFS shares? etc.

Comment 34 Jan Gutter 2005-08-02 09:18:45 UTC

1) Link characteristics:
I'm currently on a corporate LAN that's using loads of different equipment and
has grown over years... I'll get the specific switch specs from the IT guys
tomorrow... BTW, I've had crashes at home running a crossover between the nvidia
NIC and a 3com NIC (100Mb/s).

Ethtool reports: 100Mb/s Full duplex, autonegotiated.

2) I've found that opening a samba share on my box (sharing nigh 200MB data ;-)
causes the link to go down almost reliably after +- 30 minutes to 2 hours. This
is totally random, though, and may manifest itself as fast as 10 minutes or I
can have a whole day relatively problem-free. So yeah, I assume with lots of
outgoing traffic, the hangs manifest quicker. 

I'm currently using the backup sk98lin NIC on my motherboard, but since it's a
patch out of mainline, I'd actually prefer to get the nforce NIC working.

Yah, intermittent failures are HORRIBLE to debug...

Comment 35 Jan Gutter 2005-08-02 09:20:08 UTC

Sorry, meant 200GB. AFAIK, as soon as a couple of GB is transferred to one
destination, it hangs...

Comment 36 Jan Gutter 2005-08-03 02:35:59 UTC

I've found out the switch we're using is an Alcatel 6024 with cat6 utp copper
cabling... Any other info that's needed?

Comment 37 Ayaz Abdulla 2005-08-03 15:03:00 UTC

Created attachment 5496 [details]
Modified packet filter flags

Could you please try this new patch? Thanks.

Comment 38 Jan Gutter 2005-08-04 03:16:36 UTC

Created attachment 5504 [details]
Forcedeth 041 with patch hang

Hung again after +/- 3 hours of samba traffic...

Comment 39 Ayaz Abdulla 2005-08-04 15:20:34 UTC

Certain switches send pause frames to the ethernet device when there is heavy
load. The previous patch I created disabled pause frame handling in forcedeth.
But it might be worthwhile to see if you can turn off pause (flow control)
frames on your switch as a cross check.

Also, it will be worthwhile to try "nvnet" binary driver to help isolate the 
issue as a forcedeth driver issue vs. hardware issue. The nvnet driver can be 
downloaded from nvidia website.

Comment 40 Jan Gutter 2005-08-04 23:54:10 UTC

I don't think it's the pause frame issue because I've had the same problem with
a crossover cable and a 3c59x card (unless that ALSO can send pause frames), but
I'll ask the IT people if they can turn off pause/flow frames on my port...
Also, I've never experienced this problem in Windows and I've done some serious
traffic there too with the exact same setup. Depending, naturally on whether the
Windows driver uses different features on the NIC to do the same thing...
Finally, the nvnet driver refuses to build cleanly with the newer kernel
(2.6.12-gentoo-r7 x86_64) I'm using, so I'll look up a patch or *gasp* try to
figure out how to finagle one...

Comment 41 Peter 2005-08-05 00:02:17 UTC

I have the same damn error.
I only have a cable modem and I simply surf the internet (http), and I
get this error, too.
No FTP, no NFS.

Comment 42 Jan Gutter 2005-08-05 00:16:38 UTC

I've just confirmed with the IT guys that flow control/pause frames are disabled
on our switch. I've also successfully patched nvenet.c (one-liner) and I'm
currently running on nvnet. If it hangs I'll report, otherwise assume that it's
stable ;-)

Comment 43 Manfred Spraul 2005-08-06 14:21:43 UTC

Let me try to summarize the reports: The TX engine crashes, only a hard power
cycle can restart it.

A) What could cause a hang of the tx engine?
- PAUSE frame reception. Some 3c59x cards support PAUSE, but I don't know if
they generate PAUSE frames or only listen to them. But PAUSE is definitely
unlikely with a cable modem.
- PAUSE frame sending: Some nics send a pause packet on rx ring overflow. I'll
try to test that.
- DMA underruns. Perhaps a graphics card blocks the HT link too long? Does the
bug appear from text mode, without loading the nv module? I've tried to force
that condition by reducing the HT frequency, but I didn't run into any problems.
Perhaps my graphics card is not fast enough (ATI 9200, open source driver).

B) Is there a way to reset the tx engine even harder than we do right now in
nv_probe/nv_open?

Comment 44 Peter Svec 2005-08-09 08:12:54 UTC

I would like to summarize some of my personal experiences:

My Hardware:
three nForce 2 GigaBit Boards (MSI K7N Delta 2) running GentooLinux and WinXP
one brandnew Apple Powerbook 15" GigaBit-LAN running MacOSX 10.4.2
one D-Link DGS-1005D GigaBit-Switch
             
My problems clearly started when upgrading from kernel 2.6.10 to 2.6.11.x, with
the ethtool patch that brought forcedeth to v0.31:
http://www.colorfullife.com/~manfred/Linux-kernel/forcedeth/patch-forcedeth-031-ethtool

My perfectly working solution so far (kernel 2.6.12) is to use the good old v0.30.

After various tests I've found that netio v1.23 tells me very reliably whether
my nForce NIC is working or not.
- with v0.30 I'm getting fantastic speeds of 90-113 MBytes/s
- with v0.31, v0.35, v0.41 only 80 KBytes/s - 2 MBytes/s and some lockups
  which needed the full power cycle
- with v0.42 it improved very much, but not perfectly, to 75 - 105 MB/s
  and very strangely from my PowerBook to only 30 - 100 MBytes/s
Other test variants, such as direct crossover cables, vanilla- vs.
gentoo-sources, 2.6.11 vs. 2.6.12.x, WinXP vs. Linux vs. MacOS, didn't show any
significant differences so far.

So whatever is causing these problems, for me it was introduced in v0.31 and only
partially resolved in v0.42.

PS: I would also like to announce that from now on I have a spare nForce machine
with Linux and WinXP installed, so I am ready to run further tests and patches.

Comment 45 Manfred Spraul 2005-08-09 22:52:18 UTC

Created attachment 5573 [details]
further linkspeed changes

Thanks!
I've sent 0.42 to Jeff, it should appear in the kernel soon.

Now we must figure out how to fix the powerbook slowdown:
- Could you try the attached patch, on top of 0.42?
- Could you boot without a network cable attached, and then attach it after
boot?
- Boot, and then load the network interface manually: ifdown ethX;rmmod
forcedeth; modprobe forcedeth;<wait>;ifup ethX. It seems that recent Linux
distros do ifdown/ifup during boot. Perhaps they issue some ethtool commands,
too.

Ayaz: Perhaps excessive collisions cause the problem, not PAUSE frames?
Do "modprobe forcedeth; ifup ethX;<wait>;ifdown ethX;<wait>;ifup ethX". Then
the link speed registers are not initialized properly.

Comment 46 Ayaz Abdulla 2005-08-10 06:44:36 UTC

The logic in nv_tx_timeout is fine. You can be more aggressive and perform a 
reset of the NvRegTxRxControl register but then you will also need to halt the 
rx engine.

I believe collisions only happen in half duplex.

Comment 47 Peter Svec 2005-08-10 15:07:24 UTC

OK Manfred Spraul, I applied your patch #5573 and followed all your instructions
for both v0.42 and v0.42 with the patch.
I've also tested on 2.6.12 and 2.6.13-rc6, with fully manual net config (only
ifconfig up/down ...).
But I'm sorry, nothing yielded a significant change.

Here you have my netio v1.23 measurements:
(PowerBook was client -> Linux-forcedeth was netio-server)

-> with forcedeth v0.30
Packet size  1k bytes:   65904 KByte/s Tx,   71846 KByte/s Rx.
Packet size  2k bytes:   85714 KByte/s Tx,   86864 KByte/s Rx.
Packet size  4k bytes:  103276 KByte/s Tx,   93461 KByte/s Rx.
Packet size  8k bytes:  110972 KByte/s Tx,  108088 KByte/s Rx.
Packet size 16k bytes:  110914 KByte/s Tx,  109928 KByte/s Rx.
Packet size 32k bytes:  112113 KByte/s Tx,  109661 KByte/s Rx.

-> with forcedeth v0.42 with patch #5573
Packet size  1k bytes:   35722 KByte/s Tx,   70138 KByte/s Rx.
Packet size  2k bytes:   47363 KByte/s Tx,   86635 KByte/s Rx.
Packet size  4k bytes:   57858 KByte/s Tx,   93541 KByte/s Rx.
Packet size  8k bytes:   63085 KByte/s Tx,  108703 KByte/s Rx.
Packet size 16k bytes:   64960 KByte/s Tx,  110812 KByte/s Rx.
Packet size 32k bytes:   66648 KByte/s Tx,  110466 KByte/s Rx.

As you can see, the receiving speed of the nForce NIC is nearly cut in half
 (+- 10 MBytes/s) while sending is quite stable.

Could it be that v0.42 uses more CPU power on receive than v0.30,
given that netio is very CPU-intensive?
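
To quantify the drop, the client-side Tx columns of the two tables above can be divided row by row. This is just a back-of-the-envelope sketch using the figures as posted; the variable names are arbitrary.

```python
# KByte/s figures copied from the two netio tables in this comment
# (client-side Tx column, PowerBook client -> Linux forcedeth server).
sizes = ["1k", "2k", "4k", "8k", "16k", "32k"]
tx_v030 = [65904, 85714, 103276, 110972, 110914, 112113]
tx_v042 = [35722, 47363, 57858, 63085, 64960, 66648]

# Fraction of the v0.30 throughput that v0.42 (with patch #5573) reaches.
ratios = [new / old for new, old in zip(tx_v042, tx_v030)]
for size, ratio in zip(sizes, ratios):
    print(f"{size:>3} bytes: v0.42 reaches {ratio:.0%} of v0.30")
```

For every packet size the patched v0.42 stays between roughly 54% and 60% of the v0.30 figure, while the Rx column differs by about 2% at most.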

Comment 48 Peter Svec 2005-08-11 03:26:16 UTC

For completeness, here are my measurements nForce <-> nForce
(WinXP was client -> Linux-forcedeth was netio-server)

-> with forcedeth v0.30
Packet size  1k bytes:   83960 KByte/s Tx,  115644 KByte/s Rx.
Packet size  2k bytes:  100446 KByte/s Tx,  115671 KByte/s Rx.
Packet size  4k bytes:  113597 KByte/s Tx,  115664 KByte/s Rx.
Packet size  8k bytes:  114886 KByte/s Tx,  115683 KByte/s Rx.
Packet size 16k bytes:  114898 KByte/s Tx,  115682 KByte/s Rx.
Packet size 32k bytes:  114903 KByte/s Tx,  115667 KByte/s Rx.

-> with forcedeth v0.42 with patch #5573
Packet size  1k bytes:   69623 KByte/s Tx,  113654 KByte/s Rx.
Packet size  2k bytes:   88212 KByte/s Tx,  113944 KByte/s Rx.
Packet size  4k bytes:   95181 KByte/s Tx,  114135 KByte/s Rx.
Packet size  8k bytes:  112202 KByte/s Tx,  114065 KByte/s Rx.
Packet size 16k bytes:  112974 KByte/s Tx,  114051 KByte/s Rx.
Packet size 32k bytes:  113790 KByte/s Tx,  114100 KByte/s Rx.

I've also done some tests with "top" and the Windows Task Manager running to
check my "CPU power" suspicion, but I wasn't able to find any particular
relationship between CPU load, netio and v0.30 vs. v0.42.

Comment 49 Jan Gutter 2005-08-11 09:14:43 UTC

Sorry, had a bunch of public holidays this week...

1) I can conclusively state that nvnet works without lockups: strike hardware...
2) Forcedeth 042 looks like it did the trick!

Another reason to keep the "forced" in the name: forced linkinit!

I'll report back the moment the NIC hangs again, hopefully never...

Comment 50 Manfred Spraul 2005-08-11 09:40:08 UTC

Thanks for the netio output. I must reproduce it myself and then figure out what
went wrong. Perhaps rx hardware checksumming doesn't work anymore, or the 64-bit
DMA patch has unintended side effects.

Which board do you use? nForce 3 or 4?

Comment 51 Peter Svec 2005-08-11 14:02:21 UTC

Neither, I have only nForce 2 GigaBit boards from MSI, look here:
http://www.msi.com.tw/program/products/mainboard/mbd/pro_mbd_detail.php?UID=613

The Powerbook has a so-called "SunGEM" (Sun Gigabit Ethernet) NIC. Sorry, I don't
have any further info about it, but maybe you can have a look at
kernel-sources/drivers/net/sungem.c

Comment 52 Ed 2005-08-13 02:24:36 UTC

Can someone attach v0.42? I want to test it too.

Comment 53 Manfred Spraul 2005-08-13 04:20:31 UTC

Created attachment 5623 [details]
forcedeth 0.42

Here is the forcedeth 0.42 patch.
All recent patches are available from 
 http://www.colorfullife.com/~manfred/Linux-kernel/forcedeth/

The slowdown is still unresolved, I'll try to look at the issue tomorrow.

Comment 54 Peter 2005-08-13 08:50:49 UTC

Created attachment 5629 [details]
IRQ are going high

Hi, I am a kernel dummy.
But I noticed that every time my network card stopped this week (4 times),
I had a lot of IRQ traffic (see attachment), and this in the log:
root@pc1:~# cat /var/log/messages| grep "TransmitterStatus"
Aug  7 12:28:51 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 12:30:17 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 12:31:32 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 12:32:48 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 17:10:12 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 17:11:57 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  7 17:12:18 pc1 kernel: nv_stop_tx: TransmitterStatus remained
busy<7>capilib_new_ncci:kcapi: appl 2 ncci 0x10101 up
Aug  7 21:41:11 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  8 09:42:49 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug  8 09:43:59 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:51:11 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:51:21 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:51:32 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:51:42 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:51:53 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:03 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:14 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:24 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:35 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:45 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:52:56 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:53:06 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:53:17 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:53:27 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
Aug 11 17:53:38 pc1 kernel: nv_stop_tx: TransmitterStatus remained busy<7>eth1:
tx_timeout:dead entries!
root@pc1:~#

I run http://hotsanic.sourceforge.net/ for some debugging on my PC.

Maybe it helps someone, maybe not, I do not know...

Comment 55 Mudreac Nelu 2005-08-22 23:23:38 UTC

I have the same problem with my new motherboard, an ASUS A8N SLI Premium (nForce4
chipset), with an AMD X2 4400+ CPU and 2 GB of RAM.
Maybe my dmesg output helps:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
nv_stop_tx: TransmitterStatus remained busy<6>forcedeth.c: Reverse Engineered
nForce ethernet driver. Version 0.35.
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LMAC] -> GSI 3 (level, low) -> IRQ 3
PCI: Setting latency timer of device 0000:00:0a.0 to 64
eth0: forcedeth.c: subsystem: 01043:8141 bound to 0000:00:0a.0
nv_stop_tx: TransmitterStatus remained busy<7>eth0: no IPv6 routers present
nv_stop_tx: TransmitterStatus remained busy<6>ld[2613]: segfault at
0000000000000020 rip 00002aaaaad194d5 rsp 00007fffff8e0470 error 4
ld[8686]: segfault at 0000000000000020 rip 00002aaaaad194d5 rsp 00007fffff824bc0
error 4
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.35.
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LMAC] -> GSI 3 (level, low) -> IRQ 3
PCI: Setting latency timer of device 0000:00:0a.0 to 64
eth0: forcedeth.c: subsystem: 01043:8141 bound to 0000:00:0a.0
eth0: no link during initialization.
nv_stop_tx: TransmitterStatus remained busyeth0: no link during initialization.
eth0: link up.
eth0: no IPv6 routers present
nv_stop_tx: TransmitterStatus remained busy<6>forcedeth.c: Reverse Engineered
nForce ethernet driver. Version 0.35.
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LMAC] -> GSI 3 (level, low) -> IRQ 3
PCI: Setting latency timer of device 0000:00:0a.0 to 64
eth0: forcedeth.c: subsystem: 01043:8141 bound to 0000:00:0a.0
eth0: no link during initialization.
nv_stop_tx: TransmitterStatus remained busyeth0: no link during initialization.
eth0: no IPv6 routers present
eth0: link up.
nv_stop_tx: TransmitterStatus remained busy<6>ld[2085]: segfault at
0000000000000020 rip 00002aaaaad194d5 rsp 00007fffff85c9c0 error 4
ld[2322]: segfault at 0000000000000020 rip 00002aaaaad194d5 rsp 00007fffff99c590
error 4

I hope the problem will be fixed soon.
Thanks and best regards

Comment 56 Michael Bakhos 2005-08-23 10:10:54 UTC

I've been running with forcedeth 0.42 for a week now and never had the hang 
again.

Comment 57 Ed 2005-08-25 14:20:41 UTC

On my box, 0.42 seems stable.

Comment 58 Greg Poucher 2005-08-27 14:28:10 UTC

I've been experiencing the same lockups on my Abit NI8-SLI running a P4 with
EM64T, and also previously with an Epox board and an Athlon64 (can't recall the
board model). Hopefully the 0.42 patch will fix the problem.

However, the actual reason I'm posting here is that I noticed that the 0.42
patch doesn't update the driver version number. It adds a line to the changelog
about 0.42, but the #define FORCEDETH_VERSION line is still "0.41". I didn't
know where else to post this - it certainly doesn't seem to deserve its own bug.
Anyway, just a heads-up.

Comment 59 Jan Gutter 2005-08-30 08:12:46 UTC

The force-linkinit patch (the bit that fixes the hangup problems) is in the
latest gentoo-sources kernel. Since I've had that one-liner patch on my kernel,
the NIC hasn't blinked.

Comment 60 Mudreac Nelu 2005-09-03 02:12:13 UTC

I cannot find these patches in the latest vanilla kernel, 2.6.13?

Comment 61 Mudreac Nelu 2005-09-03 02:21:20 UTC

I tried to apply patch 0.42 to the 2.6.13 kernel and got a reject in forcedeth.c.rej:

cat forcedeth.c.rej
--- 2180,2188 ----
                writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus);
                dprintk(KERN_INFO "startup: got 0x%08x.\n", miistat);
        }
+       /* set linkspeed to invalid value, thus force nv_update_linkspeed
+        * to init hw */
+       np->linkspeed = 0;
        ret = nv_update_linkspeed(dev);
        nv_start_rx(dev);
        nv_start_tx(dev);

Comment 62 Jan Gutter 2005-09-03 06:03:19 UTC

Re: post #61

On vanilla kernel 2.6.13, I applied: 

http://www.colorfullife.com/~manfred/Linux-kernel/forcedeth/patch-forcedeth-042-forcelinkinit

It caused one rejection (in the comments, so it can be ignored). The patch
basically adds the line:

np->linkspeed = 0;

in front of the line:

ret = nv_update_linkspeed(dev);

in the file:

drivers/net/forcedeth.c (relative to the kernel sources directory)

This fixed the bug for me (the gentoo sources are patched in the exact same way,
according to the .orig file ;-)

Personally I vote for this bug to be closed THE MOMENT that one-liner gets
accepted into the "stable kernel release". Any other problems with the driver
are clearly not related to this specific bug. The link speed regressions should
be handled as a new bug, not a variant of this one. I'd be very surprised if the
hangs and regressions are symptoms of the same bug...
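
For readers following along, the effect of the one-liner can be sketched as a
small user-space model. This is only an illustration, assuming that
nv_update_linkspeed skips reprogramming the hardware when the cached speed and
duplex already match (which is what the patch comment about forcing it "to init
hw" suggests). struct fake_priv, fake_update_linkspeed, fake_open and the
hw_writes counter are hypothetical names, not driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver's private state. */
struct fake_priv {
	int linkspeed;	/* last speed value programmed; 0 means "invalid" */
	int duplex;	/* last duplex value programmed */
	int hw_writes;	/* how often the pretend hardware was reprogrammed */
};

/*
 * Assumed behaviour: only touch the hardware when the newly detected
 * values differ from the cached ones. Returns 1 if it reprogrammed.
 */
static int fake_update_linkspeed(struct fake_priv *np, int newls, int newdup)
{
	assert(np != NULL);
	if (np->linkspeed == newls && np->duplex == newdup)
		return 0;	/* cache matches: hardware left untouched */
	np->linkspeed = newls;
	np->duplex = newdup;
	np->hw_writes++;	/* stands in for the real register writes */
	return 1;
}

/* Models the open path with the one-liner applied. */
static int fake_open(struct fake_priv *np, int newls, int newdup)
{
	np->linkspeed = 0;	/* invalidate the cache, as in the patch */
	return fake_update_linkspeed(np, newls, newdup);
}
```

With a cached value of 100/full, calling fake_update_linkspeed with the same
values leaves hw_writes unchanged, while fake_open always bumps it, which is
the behaviour the patch is after.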

Comment 63 Manfred Spraul 2005-09-04 01:54:21 UTC

I agree: As far as I can see, 0.42 fixes the bug.
The patch is in 2.6.13-git3. Unfortunately, I forgot to increase the version
number, but I won't send another patch just to increase the number.

Could someone close the bug?

I see the performance regression, too, although less severe: The cpu load at 60
MB/sec is around 60% instead of 50% as it was before. I have no idea why, but
I'll try to figure out which change caused that.

Comment 64 Ayaz Abdulla 2005-10-12 20:32:32 UTC

Peter, can you close out this bug?

Comment 65 Peter 2005-10-13 01:11:40 UTC

Hi,
first of all I wonder why I am the owner of this bug?
Maybe someone else can change this for me, please?

Last week I got this:

/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
Badness in local_bh_enable at kernel/softirq.c:140
 [<c011feb2>] local_bh_enable+0x72/0x80
 [<f91d90d3>] destroy_conntrack+0x83/0xd0 [ip_conntrack]
 [<c0238c27>] __kfree_skb+0xa7/0x130
 [<c0238b74>] kfree_skbmem+0x24/0x30
 [<f8a8084b>] nv_drain_tx+0x3b/0x70 [forcedeth]
 [<f8a80b66>] nv_tx_timeout+0x56/0xd0 [forcedeth]
 [<c024f3e0>] dev_watchdog+0x0/0xa0
 [<c024f477>] dev_watchdog+0x97/0xa0
 [<c0123b66>] run_timer_softirq+0xb6/0x1a0
 [<c011fdfd>] __do_softirq+0x7d/0x90
 [<c011fe36>] do_softirq+0x26/0x30
 [<c010565e>] do_IRQ+0x1e/0x30
 [<c0103b0a>] common_interrupt+0x1a/0x20
 [<f8d0aa70>] acpi_processor_idle+0x0/0x299 [processor]
 [<f8d0ab75>] acpi_processor_idle+0x105/0x299 [processor]
 [<c01010d8>] cpu_idle+0x48/0x60
 [<c03627db>] start_kernel+0x17b/0x1c0
 [<c0362390>] unknown_bootoption+0x0/0x1b0
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
nv_stop_tx: TransmitterStatus remained busy<7>capilib_new_ncci: kcapi: appl 2
ncci 0x10101 up
kcapi: appl 2 ncci 0x10101 down
capilib_new_ncci: kcapi: appl 2 ncci 0x10101 up
capidrv-1: incoming call ,1,1,6099700
capidrv-1: patching si2=1 to 0 for VBOX
isdn_net: Incoming call without OAD, assuming '0'
isdn_net: call from 0 -> 0 6099700 ignored
isdn_tty: Incoming call without OAD, assuming '0'
isdn_tty: call from 0 -> 6099700 ignored
capidrv-1: incoming call ,1,0,6099700 ignored
kcapi: appl 2 ncci 0x10101 down
eth1: no IPv6 routers present
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
NETDEV WATCHDOG: eth1: transmit timed out
nv_stop_tx: TransmitterStatus remained busy<7>eth1: tx_timeout: dead entries!
nv_stop_tx: TransmitterStatus remained busy<7>eth1: no IPv6 routers present

This is with kernel Linux pc1 2.6.13.1 #1 Thu Sep 15 20:44:51 CEST 2005 i686 GNU/Linux.
Is that a good or a bad sign?

Regards,
Peter

Comment 68 Manfred Spraul 2005-10-13 14:09:06 UTC

Try 2.6.13.2, it contains the fix:

http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.13.2.bz2;z=all#11

Comment 69 Ayaz Abdulla 2005-11-09 12:05:24 UTC

Is this issue still a problem?

Comment 70 Michael Bakhos 2005-11-09 16:09:37 UTC

The problem seems to have resurfaced for me (Tyan Thunder K8WE (S2895),
nForce4 chipset, dual dual-core Opteron) in the last week, on several kernel
versions: Gentoo kernels 2.6.12-r9 and 2.6.12-r3 (both with forcedeth updated
to 0.42), and 2.6.13-r5 (both with the default driver and with driver versions
up to 0.46).
Also, the problem seems to have resurfaced not long after the computer was
moved to a 100 Mbps switch.

I'm going to send the output of dmesg.

Comment 71 Michael Bakhos 2005-11-09 16:14:22 UTC

Created attachment 6515 [details]
dmesg output

dmesg output of Tyan K8WE, nic connected to 100mbps switch.

Comment 72 Ayaz Abdulla 2005-11-09 23:47:06 UTC

Based on the dmesg dump, it seems that the link speed is 100mbps and the 
duplex is Half. 

Can you use ethtool to check the link speed and duplex before and after the tx
hang?

Comment 73 Michael Bakhos 2005-11-10 07:24:37 UTC

The link is 100 Mbps half-duplex, both before and after. Note that the speed
and duplex are auto-negotiated.

Comment 74 Jason Stubbs 2005-11-20 01:12:48 UTC

Created attachment 6620 [details]
dmesg output

Same here on a Tyan K8WE and kernel 2.6.14-ck5. Attached via cross-over to a
100 Mbps ADSL "router" which is configured for full-duplex. Likewise, I generally
only get hangs when there's a lot of outbound traffic.

Comment 75 Ayaz Abdulla 2005-11-20 14:57:20 UTC

Michael, what is the model of the switch? What kind of traffic are you
running? I cannot reproduce the problem when running at 100 Mbps and half
duplex.

Jason, based on your dmesg output, you are running version 41 which does not 
have the fix for this. Please try version 42 and above.

Comment 76 Jason Stubbs 2005-11-20 15:29:29 UTC

The patch is applied... I confirmed that while initially reading through this 
bug.

Comment 77 Ayaz Abdulla 2005-11-21 15:26:26 UTC

Could you please enable the debug messages and send me the output after the 
timeout?

Please set the following define to 1

#if 0
#define dprintk			printk
#else
#define dprintk(x...)		do { } while (0)
#endif

Comment 78 Michael Bakhos 2005-11-21 18:02:07 UTC

The switch is a: 
3com Superstack II Switch 3000 
 
As for debug messages, I will do it as soon as I can get the system back on
the onboard NIC (we borrowed another one since we have to do a few critical
things over the next few days).

Comment 79 Jason Stubbs 2005-11-22 06:28:10 UTC

Created attachment 6642 [details]
dprintk during failure

(gzipped due to size constraints)

I've tried to kill off most of the output and just capture a bit of normal
traffic before the failure, what happens during the failure and a couple of
unsuccessful resets afterward. I've still got the full log so let me know if
I've cut it too close.

Comment 80 Jason Stubbs 2005-11-22 06:29:54 UTC

Created attachment 6643 [details]
dprintk after reboot

(gzipped due to size constraints)

This is what happens during the following reboot.

Comment 81 Ayaz Abdulla 2005-11-22 12:29:02 UTC

Jason,
Thanks for the output. Just want to double check your duplex settings. In an 
earlier post you said you were running at full-duplex. Based on output, you 
are running in half duplex. Is that true? You can use ethtool to verify before 
the failure and after the failure.

The reason I ask is that if there is a discrepancy between what the MAC thinks
the speed/duplex is and what the PHY is running at, there could be a hang.

Comment 82 Jason Stubbs 2005-11-22 15:51:55 UTC

Bingo. ethtool reports half duplex. That's consistent with the scenarios under 
which it hangs too - lots of incoming and outgoing packets. I'll switch the 
router to half duplex for the time being, but is there anything that can be 
done for that?

Comment 83 Ayaz Abdulla 2005-11-22 19:18:08 UTC

Can you send me the full output? Based on the capture before the failure, the 
NIC was already in half duplex (due to router only advertising 100 half). I 
want to know if it ever was in full duplex to begin with.

Comment 84 Jason Stubbs 2005-11-22 22:41:54 UTC

I don't have the log since the boot, but there is about 20 minutes worth 
(=several gigabytes) before the hang. I researched a little into how to read 
the output and it seemed that the link was in fact 100/half from the start of 
what I do have. I've done a few tests with my router and come up with the 
following result though. 
 
With the router set to auto-negotiate and plugging in the network cable:
 
eth0: mii_rw read from reg 5 at PHY 1: 0x41e1. 
eth0: nv_update_linkspeed: PHY advertises 0x0de1, lpa 0x41e1. 
eth0: changing link setting from 66536/0 to 65636/1. 
eth0: link up. 
eth0: nv_start_rx 
eth0: nv_start_rx to duplex 1, speed 0x00010064. 
 
Everything seems good. However, specifically setting the router to any of
100/full, 100/half, 10/full or 10/half yields:
 
eth0: mii_rw read from reg 5 at PHY 1: 0x81. 
eth0: nv_update_linkspeed: PHY advertises 0x0de1, lpa 0x0081. 
eth0: changing link setting from 66536/0 to 65636/0. 
eth0: link up. 
eth0: nv_start_rx 
eth0: nv_start_rx to duplex 0, speed 0x00010064. 
 
The router is always being detected as 100/half and subsequent 
nv_update_linkspeed and nv_start_rx lines continue to show the same. However, 
ethtool (nv_get_settings?) correctly detects a speed of 10mb/s... When I got 
the last hang the router was specifically set to run at 100/full so it'd be 
safe to assume that the nic was running in 100/half from the start.
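
The two traces differ only in the link partner ability (lpa) value, and the
duplex result follows from it. Below is a minimal sketch of that resolution,
assuming the standard MII ability bits shared by registers 4 and 5 and the
usual priority order (100/full, 100/half, 10/full, 10/half; 100BASE-T4 is
ignored here). resolve_link and struct link_mode are hypothetical names, not
the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ability bits shared by MII register 4 (advertisement) and 5 (lpa). */
#define ABIL_10HALF	0x0020
#define ABIL_10FULL	0x0040
#define ABIL_100HALF	0x0080
#define ABIL_100FULL	0x0100

struct link_mode {
	int speed_mbps;
	int full_duplex;	/* 1 = full, 0 = half */
};

/*
 * Pick the best mode both ends advertise.
 * Returns 0 on success, -1 if there is no common mode.
 */
static int resolve_link(uint16_t adv, uint16_t lpa, struct link_mode *out)
{
	uint16_t common = adv & lpa;

	assert(out != NULL);
	if (common & ABIL_100FULL) {
		out->speed_mbps = 100;
		out->full_duplex = 1;
	} else if (common & ABIL_100HALF) {
		out->speed_mbps = 100;
		out->full_duplex = 0;
	} else if (common & ABIL_10FULL) {
		out->speed_mbps = 10;
		out->full_duplex = 1;
	} else if (common & ABIL_10HALF) {
		out->speed_mbps = 10;
		out->full_duplex = 0;
	} else {
		return -1;	/* nothing in common */
	}
	return 0;
}
```

Feeding it the values from the logs, adv 0x0de1 with lpa 0x41e1 resolves to
100/full, while lpa 0x0081 only has the 100/half bit in common, which matches
the "duplex 0" lines above.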

Comment 85 Ayaz Abdulla 2005-11-22 22:52:19 UTC

That is the problem. You cannot use forced settings on the router and then use
autoneg on the client machine. If you force one side, you must force the other
side as well. Otherwise, the outcome is undetermined.

Please use autoneg on the router and autoneg on forcedeth (autoneg is 
default). Or if you use force settings on router, you must use forced settings 
on forcedeth (through ethtool). I suggest leaving it to autoneg. That should 
resolve your tx hang problem.
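
Why mixing forced and autoneg settings goes wrong can be sketched as follows.
This is a simplified model, assuming the standard behaviour that an
autonegotiating port facing a forced partner gets no negotiation information
and falls back to half duplex (parallel detection), and that two
autonegotiating ports both advertise full duplex. The enum and function names
are hypothetical.

```c
#include <assert.h>

enum port_cfg {
	CFG_AUTONEG,
	CFG_FORCED_HALF,
	CFG_FORCED_FULL
};

/* Duplex a port ends up using (1 = full), given its own and its peer's setup. */
static int effective_full_duplex(enum port_cfg self, enum port_cfg peer)
{
	if (self == CFG_FORCED_FULL)
		return 1;
	if (self == CFG_FORCED_HALF)
		return 0;
	/* self is autonegotiating */
	if (peer == CFG_AUTONEG)
		return 1;	/* assumed: both ends advertise full duplex */
	return 0;		/* forced peer: parallel detection gives half */
}

/* 1 if the two ends of the link disagree about duplex. */
static int duplex_mismatch(enum port_cfg a, enum port_cfg b)
{
	return effective_full_duplex(a, b) != effective_full_duplex(b, a);
}
```

In this model, autoneg against a forced-full partner yields a mismatch (one end
runs full duplex, the other half), whereas autoneg on both ends, or autoneg
against a forced-half partner, does not.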

Comment 86 Jason Stubbs 2005-11-22 23:42:15 UTC

Changing it was actually a misguided attempt at solving hangs when the router 
was set to auto-negotiate. I wouldn't call it "the" problem as mismatched 
settings shouldn't cause a hard-lock of the nic but there's a FIXME in the 
code for 'parallel detection' which is good enough for me. 
 
Sorry for the noise on this issue, but if I get another hang I'll
definitely be able to produce more useful info. :)

Comment 87 Ayaz Abdulla 2005-12-05 11:45:56 UTC

Jeff, can this bug be closed out now?

Comment 88 ElColonel 2006-01-15 19:22:39 UTC

I seem to be running into something similar to this problem. It may just be
coincidence, but it seems to happen almost always when I load KDE and NOT before
I login. No real network activity either way, just minor work over SSH.

asus a8n-vm csm
2.6.14-1.1656_FC4 x86_64

Comment 89 Lonni J Friedman 2006-01-15 19:34:33 UTC

2.6.14-1.1656_FC4 ships with forcedeth-0.41.  You need to upgrade to a much
newer version of forcedeth to resolve this problem.  See here for semi-official
FC4 kernel RPMS with the latest in-kernel forcedeth:
http://people.redhat.com/linville/kernels/fedora-netdev/