Bug 6480 - Transmitter of CK804 hangs and cannot be reset (except by power-cycling it)
Summary: Transmitter of CK804 hangs and cannot be reset (except by power-cycling it)
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 high
Assignee: Ayaz Abdulla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-02 06:06 UTC by Ingo Oeser
Modified: 2007-07-24 06:01 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.16.11
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
The hardware as lshw sees it of the problematic machine (12.37 KB, text/plain)
2006-05-02 06:08 UTC, Ingo Oeser
Log messages from the kernel regarding this bug (121.13 KB, text/x-log)
2006-05-02 06:12 UTC, Ingo Oeser
My kernel config (33.73 KB, text/plain)
2006-05-02 06:15 UTC, Ingo Oeser
Fix for calling interrupt routine in nv_do_nic_poll (454 bytes, patch)
2007-03-11 20:59 UTC, Ayaz Abdulla
Fix tx timeout routine (381 bytes, patch)
2007-03-11 21:00 UTC, Ayaz Abdulla

Description Ingo Oeser 2006-05-02 06:06:53 UTC
Most recent kernel where this bug did not occur:
Distribution: Debian sarge
Hardware Environment: Tyan NF-CK804, AMD Athlon(tm) 64 Processor 3200+,
 CK804 Ethernet Controller (the problem); for more, see the attachment
Software Environment: Services: Routing, Firewalling, Mail, Proxy
Problem Description:
 After a while the transmitter of the CK804 NIC hangs and recovers
 only after reboot. The netdev watchdog cannot successfully reset the NIC.
Steps to reproduce:
 No usage pattern detected so far. Happens after more than one day of usage.
 Cabling is Cat5 crossover connection to one LAN-port of a "Sonicwall TZ170 
 Enhanced". Cables and machines were replaced with no luck. 
 Using the other Onboard NIC here (the Net-Extreme) helps.
 The same combination with a Cisco switch between the Sonicwall and the
 CK804-NIC works without problems (at least since 2.6.13.4).
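
Editorial sketch (not part of the original report): since no usage pattern has been found, one way to look for one is to count the netdev watchdog timeouts per interface in the attached kernel log. The snippet below is a minimal sketch under stated assumptions: the message format "NETDEV WATCHDOG: <iface>: transmit timed out" is what 2.6-era kernels print, while the function name, the sample lines, and the interface name eth1 are made up for illustration and are not taken from the attachment.

```python
import re
from collections import Counter

# Assumed format of the netdev watchdog message in 2.6-era kernels.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\S+): transmit timed out")


def count_tx_timeouts(lines):
    """Return a Counter mapping interface name to its number of watchdog timeouts."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines (hypothetical, not from the attached log).
sample = [
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "kernel: some unrelated message",
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
]

for iface, hits in count_tx_timeouts(sample).items():
    print(f"{iface}: {hits} transmit timeouts")
```

To use it on a real log, pass an open file object instead of the sample list; comparing the timestamps of the matching lines may then show whether the timeouts cluster around particular times.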
Comment 1 Ingo Oeser 2006-05-02 06:08:46 UTC
Created attachment 8010
The hardware as lshw sees it of the problematic machine
Comment 2 Ingo Oeser 2006-05-02 06:12:27 UTC
Created attachment 8011
Log messages from the kernel regarding this bug
Comment 3 Ingo Oeser 2006-05-02 06:15:04 UTC
Created attachment 8012
My kernel config

Please ignore the drbd part. It is never opened, so it is nothing more
than a block device hanging around in the device tree.
Comment 4 Ingo Oeser 2006-05-02 06:18:07 UTC
Oh! Forgot to complete the summary line
Comment 5 Ingo Oeser 2006-05-02 06:18:58 UTC
Comment on attachment 8010
The hardware as lshw sees it of the problematic machine

correct mime type
Comment 6 Anonymous Emailer 2006-05-02 06:49:49 UTC
Reply-To: netdev@axxeo.de

Hi Manfred,

I filed BUG 6480 describing the problem and providing lots
of info. If you need more info, just ask.

At the moment I have no idea about the cause,
except for the difference between crossover cabling and using a switch.

The machine is a production machine, so I cannot test
kernels and patches. 

But I have several machines with identical hardware where
I can test this.

Once we can reproduce it without a Sonicwall, I can build a 
test setup using any kernel hackery required to resolve the issue :-)

Could you please contact the right nVIDIA people, if needed?

PS: Stephen, you are not CC'ed from bugzilla and 
      "agreed to help out with net driver maintenance" 
       so I CC'ed you here manually.

Regards

Ingo Oeser

Comment 7 Stephen Hemminger 2006-12-18 12:02:27 UTC
Does this still occur with the current forcedeth driver (2.6.18 or later)?
Comment 8 Ingo Oeser 2006-12-19 03:32:36 UTC
Hi Stephen,

bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From shemminger@osdl.org  2006-12-18 12:02 -------
> Does this still occur with the current forcedeth driver (2.6.18 or later)?

I'll ask the customer whether I can flip interfaces on his mailserver again,
but this will take time.

I can also only test up to 2.6.18.x at the moment, until that disk corruption
problem is sorted out.

But I'll try my best to set up a test.

Comment 9 Ayaz Abdulla 2007-02-23 16:38:50 UTC
Can you attach your forcedeth.c file?
Comment 10 Ayaz Abdulla 2007-03-11 20:59:16 UTC
Created attachment 10705
Fix for calling interrupt routine in nv_do_nic_poll

Patch 1/2
Comment 11 Ayaz Abdulla 2007-03-11 21:00:28 UTC
Created attachment 10706
Fix tx timeout routine

Patch 2/2
Comment 12 Natalie Protasevich 2007-07-19 01:05:10 UTC
The patches are in now. Ingo, does everything work for you now?
If so, we can close this bug, thanks.
Comment 13 Ingo Oeser 2007-07-24 06:01:53 UTC
We've hit it only with this customer; we have several other machines deployed,
4 of them in the same configuration, without problems.

We'll try this at the next complete on-site day for the customer.
But this day is one or two months away.

So I'll close this bug and reopen it if we hit it again.

NIC cannot be replaced (without soldering), as it is on-board in the chipset.
