Bug 6480

Summary: Transmitter of CK804 hangs and cannot be reset (except by power-cycling it)
Product: Drivers
Reporter: Ingo Oeser (kernel-bugs)
Component: Network
Assignee: Ayaz Abdulla (aabdulla)
Status: CLOSED CODE_FIX
Severity: high
CC: aabdulla, manfred, protasnb, stephen
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.16.11
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: The hardware as lshw sees it of the problematic machine
Log messages from the kernel regarding this bug
My kernel config
Fix for calling interrupt routine in nv_do_nic_poll
Fix tx timeout routine

Description Ingo Oeser 2006-05-02 06:06:53 UTC
Most recent kernel where this bug did not occur:
Distribution: Debian sarge
Hardware Environment: Tyan NF-CK804, AMD Athlon(tm) 64 Processor 3200+, 
 CK804 Ethernet Controller (the problem); for more, see the attachment
Software Environment: Services: Routing, Firewalling, Mail, Proxy
Problem Description:
 After a while the transmitter of the CK804 NIC hangs and recovers
 only after reboot. The netdev watchdog cannot successfully reset the NIC.
Steps to reproduce:
 No usage pattern detected so far. Happens after more than one day of use.
 Cabling is Cat5 crossover connection to one LAN-port of a "Sonicwall TZ170 
 Enhanced". Cables and machines were replaced with no luck. 
 Using the other Onboard NIC here (the Net-Extreme) helps.
 The same combination with a Cisco switch between the Sonicwall and the
 CK804-NIC works without problems (at least since 2.6.13.4).
Comment 1 Ingo Oeser 2006-05-02 06:08:46 UTC
Created attachment 8010 [details]
The hardware as lshw sees it of the problematic machine
Comment 2 Ingo Oeser 2006-05-02 06:12:27 UTC
Created attachment 8011 [details]
Log messages from the kernel regarding this bug
Comment 3 Ingo Oeser 2006-05-02 06:15:04 UTC
Created attachment 8012 [details]
My kernel config

Please ignore the drbd part. It is never opened, so it is nothing more
than a block device hanging around in the device tree.
Comment 4 Ingo Oeser 2006-05-02 06:18:07 UTC
Oh! Forgot to complete the summary line
Comment 5 Ingo Oeser 2006-05-02 06:18:58 UTC
Comment on attachment 8010 [details]
The hardware as lshw sees it of the problematic machine

correct mime type
Comment 6 Anonymous Emailer 2006-05-02 06:49:49 UTC
Reply-To: netdev@axxeo.de

Hi Manfred,

I filed BUG 6480 describing the problem and providing lots
of info. If you need more info, just ask.

At the moment I have no idea about the cause,
apart from crossover cabling vs. using a switch.

The machine is a production machine, so I cannot test
kernels and patches. 

But I have several machines with identical hardware where
I can test this.

Once we can reproduce it without a Sonicwall, I can build a 
test setup using any kernel hackery required to resolve the issue :-)

Could you please contact the right nVIDIA people, if needed?

PS: Stephen, you are not CC'ed from bugzilla and 
      "agreed to help out with net driver maintenance" 
       so I CC'ed you here manually.

Regards

Ingo Oeser

Comment 7 Stephen Hemminger 2006-12-18 12:02:27 UTC
Does this still occur with the current forcedeth driver (2.6.18 or later)?
Comment 8 Ingo Oeser 2006-12-19 03:32:36 UTC
Hi Stephen,

bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From shemminger@osdl.org  2006-12-18 12:02 -------
> Does this still occur with the current forcedeth driver (2.6.18 or later)?

I'll ask the customer whether I can flip interfaces on his mail server again, 
but this will take time. 

I can also only test up to 2.6.18.x at the moment, until that disk corruption 
problem is sorted out.

But I'll try my best to set up a test.

Comment 9 Ayaz Abdulla 2007-02-23 16:38:50 UTC
Can you attach your forcedeth.c file?
Comment 10 Ayaz Abdulla 2007-03-11 20:59:16 UTC
Created attachment 10705 [details]
Fix for calling interrupt routine in nv_do_nic_poll

Patch 1/2
Comment 11 Ayaz Abdulla 2007-03-11 21:00:28 UTC
Created attachment 10706 [details]
Fix tx timeout routine

Patch 2/2
Comment 12 Natalie Protasevich 2007-07-19 01:05:10 UTC
The patches are in now. Ingo, does everything work for you now?
If so, we can close this bug. Thanks.
Comment 13 Ingo Oeser 2007-07-24 06:01:53 UTC
We've hit it only with this customer; we have several machines deployed, 
4 of them in the same configuration, without problems.

We'll try this at the next complete on-site day for the customer.
But this day is one or two months away.

So I'll close this bug and reopen it if we hit it again. 

NIC cannot be replaced (without soldering), as it is on-board in the chipset.