6638 – tg3 output freezes on compaq nc6000

Summary: tg3 output freezes on compaq nc6000

Status:	CLOSED PATCH_ALREADY_AVAILABLE

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	Network
Hardware:	i386 Linux

Importance:	P2 normal
Assignee:	Jeff Garzik

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-06-02 05:37 UTC by Klaus Reichl
Modified:	2006-11-09 04:49 UTC
CC List:	2 users

See Also:
Kernel Version:	2.6.16.19
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
Requested action data taken (346.01 KB, text/plain) 2006-06-14 07:47 UTC, Klaus Reichl
Requested action data taken (346.01 KB, text/plain) 2006-06-14 07:48 UTC, Klaus Reichl
output of "ethtool -d eth0" after NIC broke down (341.19 KB, text/plain) 2006-11-08 16:11 UTC, Timo Reimann

Description Klaus Reichl 2006-06-02 05:37:58 UTC

Most recent kernel where this bug did not occur: none
Distribution: Debian Sarge with latest kernel
Hardware Environment: compaq nc6000
Software Environment: 
Problem Description: 
The output engine of the tg3 driver freezes under high load.

`ifconfig' shows incoming packets; however, the outgoing counter is no longer
incremented.

Resetting the device (ifdown eth0, ifup eth0) heals the problem.
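The symptom described above (receive counter still rising while the transmit counter stays flat) can also be checked from /proc/net/dev instead of ifconfig. A minimal sketch, assuming the standard Linux /proc/net/dev layout of eight receive fields followed by eight transmit fields; parse_net_dev and tx_stalled are hypothetical helper names, not part of any existing tool:

```python
def parse_net_dev(text):
    """Map interface name -> (rx_packets, tx_packets) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, data = line.partition(":")
        if not sep:
            continue
        fields = data.split()
        # Receive:  bytes packets errs drop fifo frame compressed multicast
        # Transmit: bytes packets errs drop fifo colls carrier compressed
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def tx_stalled(before, after, iface):
    """True if iface received packets between the two samples but sent none."""
    rx0, tx0 = before[iface]
    rx1, tx1 = after[iface]
    return rx1 > rx0 and tx1 == tx0
```

Reading /proc/net/dev a few seconds apart during the NFS copy and passing both snapshots to tx_stalled(before, after, "eth0") would flag the freeze. Because RX must have grown, an idle link is not reported; a host that legitimately only receives would be a false positive.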

Steps to reproduce:
Copy large amounts of data to an NFS-mounted disk.
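A reproduction load can be scripted rather than done by hand. A minimal sketch, assuming the NFS export is mounted at a directory you pass in (for example a hypothetical /mnt/nfs); write_load, the file names and the default sizes are arbitrary choices for illustration:

```python
import os
from pathlib import Path


def write_load(target_dir, n_files=10, size_bytes=64 * 1024 * 1024,
               chunk_size=1024 * 1024):
    """Write n_files files of size_bytes each into target_dir.

    Each file is fsync'ed so the data actually crosses the network instead of
    sitting in the client page cache. Returns the total number of bytes written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    chunk = os.urandom(chunk_size)  # incompressible payload
    total = 0
    for i in range(n_files):
        with open(target / f"load-{i:04d}.bin", "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                n = min(remaining, chunk_size)
                f.write(chunk[:n])
                remaining -= n
                total += n
            f.flush()
            os.fsync(f.fileno())  # force the writes out to the NFS server
    return total
```

Running this in a loop against the NFS mount approximates the "heavy copy" described in the report; the fsync calls keep the transmit path busy rather than letting writes accumulate locally.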

Comment 1 Andrew Morton 2006-06-02 11:20:00 UTC

On Fri, 2 Jun 2006 05:40:51 -0700
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=6638
> 
>            Summary: tg3 output freezes on compaq nc6000
>     Kernel Version: 2.6.16.19
>             Status: NEW
>           Severity: normal
>              Owner: jgarzik@pobox.com
>          Submitter: Klaus.Reichl@alcatel.at
> 
> 
> Most recent kernel where this bug did not occur: none
> Distribution: Debian Sarge with latest kernel
> Hardware Environment: compaq nc6000
> Software Environment: 
> Problem Description: 
> The output engine of the tg3 driver freezes under high load.
> 
> `ifconfig' shows incoming packets; however, the outgoing counter is no longer
> incremented.
> 
> Resetting the device (ifdown eth0, ifup eth0) heals the problem.
> 
> Steps to reproduce:
> Copy large amounts of data to an NFS-mounted disk.

Comment 2 Anonymous Emailer 2006-06-02 16:00:42 UTC

Reply-To: mchan@broadcom.com

On Fri, 2006-06-02 at 11:22 -0700, Andrew Morton wrote:
> On Fri, 2 Jun 2006 05:40:51 -0700
> bugme-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=6638
> > 
> >            Summary: tg3 output freezes on compaq nc6000
> >     Kernel Version: 2.6.16.19
> >             Status: NEW
> >           Severity: normal
> >              Owner: jgarzik@pobox.com
> >          Submitter: Klaus.Reichl@alcatel.at
> > 
> > 
> > Most recent kernel where this bug did not occur: none
> > Distribution: Debian Sarge with latest kernel
> > Hardware Environment: compaq nc6000
> > Software Environment: 
> > Problem Description: 
> > The output engine of the tg3 driver freezes under high load.
> > 
> > `ifconfig' shows incoming packets; however, the outgoing counter is no longer
> > incremented.
> > 
Please provide:

1. tg3 probing output during ifconfig up.
2. /proc/interrupts output, to see whether the interrupt counter is still
increasing after the failure.
3. "ethtool -d eth0 > dump" after the failure.
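Item 2 amounts to comparing two snapshots of /proc/interrupts taken some time apart. A minimal sketch, assuming the usual layout of one row per IRQ (number, per-CPU counts, controller type, then comma-separated device names when the line is shared); irq_count and irq_rate are hypothetical helpers:

```python
def irq_count(text, device):
    """Total interrupts (summed over CPUs) on the /proc/interrupts row naming `device`.

    Returns None if no numbered IRQ row lists the device.
    """
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # CPU header row or named rows such as NMI
        fields = rest.split()
        counts = []
        for field in fields:
            if not field.isdigit():
                break
            counts.append(int(field))
        # After the counts: controller type, then device names, comma-separated
        # when the IRQ is shared (e.g. "i915@pci:0000:00:02.0, eth0").
        names = [f.rstrip(",") for f in fields[len(counts):]]
        if device in names:
            return sum(counts)
    return None


def irq_rate(count_before, count_after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    return (count_after - count_before) / seconds
```

If the IRQ line is shared with another device, the count includes that device's interrupts too, so a near-zero rate while pinging is the more telling sign.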

Comment 3 Klaus Reichl 2006-06-14 07:47:17 UTC

Created attachment 8305
Requested action data taken

Comment 4 Klaus Reichl 2006-06-14 07:48:02 UTC

Created attachment 8306
Requested action data taken

Comment 5 Klaus Reichl 2006-06-14 07:52:37 UTC

Attachment 8306 is a duplicate of 8305.

Comment 6 Thomas M Steenholdt 2006-08-08 13:34:40 UTC

Does the NETDEV WATCHDOG catch this after some time (a few hours)?
Also, do you have TCP segmentation offload (TSO) enabled (ethtool -k eth0)?

I'm seeing a problem that looks like what you describe, and I haven't had the
problem for a few days, since I disabled TSO (ethtool -K eth0 tso off).

I'm wondering if this is the same issue or something else.

I'm on 2.6.17.4 btw.

Comment 7 Klaus Reichl 2006-08-17 00:29:23 UTC

>>>>> Thomas M Steenholdt == tmus@tmus.dk writes:

> ------- Additional Comments From tmus@tmus.dk  2006-08-08 13:34 -------
> Does the NETDEV WATCHDOG catch this after some time (a few hours)?

I'm not sure whether I waited that long (this laptop is my workhorse
:-().

Although I wrote in one of my earlier postings that I could easily reproduce
the situation, this is no longer true.

It seems that some necessary precondition has changed on our network, or
2.6.17 has had a positive effect. Yes, I know I should not change kernels
while hunting a bug, but as I said before, I work on that machine.

> Also, do you have tcp segmentation offloading enabled (ethtool -k eth0)?

Segmentation offload is off:

bash# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
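Output like the above is easy to check mechanically when comparing machines. A minimal sketch that turns `ethtool -k` text into a dictionary and builds the `ethtool -K eth0 tso off` command mentioned in Comment 6; parse_offloads and tso_off_command are hypothetical names, and actually running the command needs root:

```python
def parse_offloads(text):
    """Parse `ethtool -k` output into {feature name: True/False}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        words = value.split()
        # Newer ethtool versions may append e.g. "[fixed]"; only the first word matters.
        if sep and words and words[0] in ("on", "off"):
            features[name.strip()] = words[0] == "on"
    return features


def tso_off_command(iface):
    """argv equivalent to `ethtool -K <iface> tso off`."""
    return ["ethtool", "-K", iface, "tso", "off"]
```

The header line ("Offload parameters for eth0:") has nothing after the colon, so it is skipped automatically.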

> I'm seeing a problem that looks like what you describe and I haven't had the
> problem in a few days, since I disabled tso (ethtool -K eth0 tso off).

> I'm wondering if this is the same issue or something else.

> I'm on 2.6.17.4 btw.

This is 2.6.17. I will upgrade, however, as soon as things have stabilized
after my holidays.

Best regards,
Klaus

Comment 8 Theodor Milkov 2006-09-20 01:24:06 UTC

We have a similar problem. Some of our servers lose their network connection
during backups to an NFS mount.

servers  NIC (PCI vendor:device ID)
     44  14e4:1648 (rev 10)
     32  14e4:16a6 (rev 02)
    119  14e4:16a7 (rev 02)
     59  14e4:16c7 (rev 10)

Only the 59 servers equipped with "14e4:16c7 (rev 10)" freeze occasionally.
None of the other servers has this problem.
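Assuming the first column is a server count (as the remark about the 59 servers indicates), the figures can be tallied to show how concentrated the failures are. The list resembles `sort | uniq -c` output over PCI IDs, though how it was produced is not stated; parse_fleet and affected_share are hypothetical helpers:

```python
def parse_fleet(text):
    """Parse lines of the form '<count>  <vendor:device> (rev NN)' into a dict."""
    fleet = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or header lines
        rev = parts[2].strip() if len(parts) > 2 else ""
        fleet[(parts[1], rev)] = int(parts[0])
    return fleet


def affected_share(fleet, pci_id):
    """Return (servers with pci_id, total servers)."""
    affected = sum(n for (pid, _rev), n in fleet.items() if pid == pci_id)
    return affected, sum(fleet.values())
```

With the four rows above this gives (59, 254): roughly 23% of the servers, all of them on the single chip ID that freezes.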

Comment 9 Timo Reimann 2006-10-24 16:46:59 UTC

I can confirm this bug using 2.6.15 from the Ubuntu distribution (Dapper).
Before upgrading from Breezy (which used 2.6.12), I didn't have this problem.

The bug first showed up when I tried to back up my laptop (HP nc6120) using
rdiff-backup over NFS. Strangely, the backup succeeded when I used
rdiff-backup's client-server mode instead of NFS.

Today, I was writing changes to a bunch of MP3 files on my server, mounted via
NFS. Again, the driver locked up: I wasn't able to do any networking, either
locally or on the Internet, until I ran ifdown and ifup on the eth0 device
(ifup and ifdown being Debian's higher-level networking tools).

If there's still interest in resolving this bug, I'd be willing to do some
testing on the 2.6.{12|15|17} kernel series. The newest of these will be
available after my upgrade to Ubuntu Edgy in less than a week.

Comment 10 Timo Reimann 2006-11-08 16:09:44 UTC

I've upgraded to 2.6.17 but still suffer from this bug. :(

Although I've spent hours modifying MP3 files over NFS without a problem on
the new kernel, rdiff-backup traffic still seems to be too much for the
driver, which drops out.

Has anyone made any progress on this issue? I'd be really willing to help
debug it.

Here is what mchan@broadcom.com asked for earlier:

1. dmesg output of ifconfig -v eth0 up:

   Nov  9 00:50:58 localhost kernel: [17210726.392000] ADDRCONF(NETDEV_UP): eth0: link is not ready
   Nov  9 00:51:00 localhost kernel: [17210727.960000] tg3: eth0: Link is up at 100 Mbps, full duplex.
   Nov  9 00:51:00 localhost kernel: [17210727.960000] tg3: eth0: Flow control is on for TX and on for RX.
   Nov  9 00:51:00 localhost kernel: [17210727.960000] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

   dmesg output of modprobe -v:

   [17210352.316000] tg3.c:v3.59.1 (August 25, 2006)
   [17210352.316000] ACPI: PCI Interrupt 0000:02:0e.0[A] -> GSI 16 (level, low) -> IRQ 169
   [17210352.348000] eth0: Tigon3 [partno(BCM95705A50) rev 3003 PHY(5705)] (PCI:33MHz:32-bit) 10/100/1000BaseT Ethernet 00:14:38:1a:20:0a
   [17210352.348000] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[0] TSOcap[1]
   [17210352.348000] eth0: dma_rwctrl[763f0000] dma_mask[64-bit]


2. The interrupt counter of eth0 (`i915@pci:0000:00:02.0, eth0' in
   /proc/interrupts) keeps increasing after the network failure, but extremely
   slowly, very close to stalling (note: I'm pinging an Internet host to make
   sure there is still some traffic).


3. The ethtool -d output is attached.
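The probe output in item 1 lists the chip's capability flags as Name[value] tokens. A small sketch for extracting them from a dmesg line, useful when comparing probe logs across machines; the regex and parse_tg3_flags are illustrative and not part of any tool. Presumably TSOcap[1] marks the chip as TSO-capable, which bears on the TSO discussion in Comment 6, although this log does not show whether TSO was actually enabled:

```python
import re

# A word immediately followed by [value], e.g. "TSOcap[1]" or "dma_mask[64-bit]".
# The leading "[timestamp]" is not matched because no word character precedes it.
FLAG_RE = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_tg3_flags(line):
    """Return {flag name: value string} for every Name[value] token in a log line."""
    return dict(FLAG_RE.findall(line))
```

Values are kept as strings because some, such as dma_rwctrl[763f0000] or dma_mask[64-bit], are not plain integers.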

Comment 11 Timo Reimann 2006-11-08 16:11:44 UTC

Created attachment 9433
output of "ethtool -d eth0" after NIC broke down

Comment 12 Klaus Reichl 2006-11-09 04:25:22 UTC

Hi,

I recently upgraded to 2.6.18.1 and am running Emacs with my INBOX on the
laptop again (actually, auto-saving my rather big INBOX is what originally
triggered the problem).

Since then (more than a week now), I have not seen the problem again.

I don't have time to compare the tg3 driver versions; however, I saw in the
ChangeLogs that there was activity on tg3. Maybe somebody who knows about the
changes could comment here, and maybe we can close the issue ;-).

Cheers,
Klaus
