Bug 11147

Summary: tg3 + jumbo frames + scatter-gather == inoperative NIC
Product: Drivers
Reporter: Jon Nelson (jnelson-kernel-bugzilla)
Component: Network
Assignee: drivers_network (drivers_network)
Status: CLOSED OBSOLETE
Severity: normal
CC: alan, mcarlson
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25.9
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  { lspci -t ; lspci -vvv; lspci -k -xxx ; } > lspci-all.frank
  registers from ethtool -d eth0 > registers.tg3
  registers (9000 byte mtu)
  registers (9000 byte mtu) with scatter-gather turned off

Description Jon Nelson 2008-07-22 12:45:54 UTC
Latest working kernel version: n/a
Earliest failing kernel version: 2.6.25.9
Distribution: openSUSE 11.0
Hardware Environment: 32bit x86 (AMD Athlon XP 2200+)
Software Environment: openSUSE 11.0
Problem Description:

When I enable jumbo frames, the board still *appears* to work, but
nobody can communicate with it. The other machines (multiple) on the
network don't see ANY traffic FROM the tg3. On the machine with the
tg3, tcpdump shows ARP requests (from the other machines) and ARP
replies.

After much work, I have determined the problem to be the combination of
scatter-gather and jumbo frames. When both are enabled, the card does not
work correctly. Disabling *either* one appears to fix it, but only after
the interface is brought down and back up again.

An "ethtool -t eth0" shows a single failed test (registers). I have a
register dump and anything else anybody might want if this can be
useful in bug-fixing.

I also saw this:

tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000006]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2


Steps to reproduce:

I've also tried tg3.[c,h] compiled against 2.6.25.9, using the latest git source (from either linux-2.6 or netdev-2.6, as of 17 July 2008).

tg3.c:v3.92.1 (June 9, 2008)
The tg3.c sha1sum is 97c198a8152045f2e7da5fe0d702df1cd185cb8d.
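
The report leaves "Steps to reproduce" empty, so here is a minimal sketch of the sequence the description implies: jumbo frames plus scatter-gather, followed by an interface down/up. It is an assumption pieced together from the text above, not the reporter's actual commands. The names IFACE, RUN and repro_tg3_jumbo_sg are made up for illustration, and the use of "ip link" rather than ifconfig is also an assumption. By default the sketch only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands by default instead of running them.
# IFACE, RUN and the function name are illustrative, not from the report.
IFACE=${IFACE:-eth0}
RUN=${RUN-echo}    # set RUN= (empty) and run as root to execute for real

repro_tg3_jumbo_sg() {
    $RUN ip link set dev "$IFACE" down
    $RUN ip link set dev "$IFACE" mtu 9000   # jumbo frames
    $RUN ethtool -K "$IFACE" sg on           # scatter-gather (reported as on by default)
    $RUN ip link set dev "$IFACE" up
}
```

Calling repro_tg3_jumbo_sg with the defaults prints the four commands for eth0; after running them for real, the symptom to look for would be other hosts seeing no traffic from the tg3.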
Comment 1 Anonymous Emailer 2008-07-22 13:14:03 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 22 Jul 2008 12:45:54 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11147
> 
>            Summary: tg3 + jumbo frames + scatter-gather == inoperative NIC
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.25.9
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: jnelson-kernel-bugzilla@jamponi.net
> 
Comment 2 Anonymous Emailer 2008-07-22 18:38:29 UTC
Reply-To: jnelson@jamponi.net

What other information can I provide?
This is 100% reproducible.
Comment 3 Michael Chan 2008-07-22 23:26:01 UTC
Jon Nelson wrote:

> What other information can I provide?
> This is 100% reproducible.
>

Please provide lspci output and tg3 signon output so that
we know what device is failing.
Comment 4 Jon Nelson 2008-07-23 06:32:18 UTC
Created attachment 16949
{ lspci -t ; lspci -vvv; lspci -k -xxx ; } > lspci-all.frank
Comment 5 Jon Nelson 2008-07-23 06:36:18 UTC
Created attachment 16950
registers from ethtool -d eth0 > registers.tg3
Comment 6 Anonymous Emailer 2008-07-23 06:42:44 UTC
Reply-To: jnelson@jamponi.net

On Wed, Jul 23, 2008 at 1:25 AM, Michael Chan <mchan@broadcom.com> wrote:
> Jon Nelson wrote:
>
>> What other information can I provide?
>> This is 100% reproducible.
>>
>
> Please provide lspci output and tg3 signon output so that
> we know what device is failing.

OK. I added the output of:

{ lspci -t ; lspci -vvv; lspci -k -xxx ; } > lspci-all.frank

as an attachment to the bug.

Also:


frank:~ # ethtool -t eth0
The test result is FAIL
The test extra info:
nvram test     (online)          0
link test      (online)          0
register test  (offline)         1
memory test    (offline)         0
loopback test  (offline)         0
interrupt test (offline)         0

frank:~ #
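
For anyone scripting this check, the following is a small sketch that picks the failing test names out of "ethtool -t" output laid out like the listing above. The function name failed_tests is hypothetical, and the parsing assumes each result line has the form "name (online|offline) value" with a non-zero value meaning failure, which matches this listing but is not guaranteed for every driver.

```shell
#!/bin/sh
# Reads "ethtool -t" output on stdin and prints the name of every test
# whose result value (the last field on the line) is non-zero.
# failed_tests is an illustrative name, not part of ethtool.
failed_tests() {
    awk '/\((online|offline)\)/ && $NF != 0 {
        # Drop the "(online)"/"(offline)" tag and everything after it.
        sub(/[ \t]*\((online|offline)\).*/, "")
        print
    }'
}
```

Usage would be along the lines of "ethtool -t eth0 | failed_tests", run as root since the offline tests take the interface down.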

I also attached the output of:

ethtool -d eth0

All of the above is with 1500-byte frames.
If I switch to 9000-byte frames and take the interface down and then
back up again, I am unable to communicate with it, *provided*
scatter-gather is also enabled (which it is by default).

From a cold boot (and 1500 byte frames):

frank:~ # ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
frank:~ #
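
Since the failure depends on whether scatter-gather is on, a quick way to read that one setting out of "ethtool -k" output is sketched here. The function name sg_state is made up, and the sketch assumes the "scatter-gather: on|off" line format shown in the listing above.

```shell
#!/bin/sh
# Reads "ethtool -k" output on stdin and prints the scatter-gather state
# ("on" or "off"). sg_state is an illustrative name, not part of ethtool.
sg_state() {
    # Split on ": " and match the feature name exactly, so other lines
    # that merely contain the word are ignored.
    awk -F': ' '$1 == "scatter-gather" { print $2 }'
}
```

For example, "ethtool -k eth0 | sg_state" should print "on" for the listing above.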

I'll be adding ethtool -d using 9000 byte frames in the next 30 minutes.
Is there anything else I can provide?
Comment 7 Jon Nelson 2008-07-23 06:51:37 UTC
Created attachment 16951
registers (9000 byte mtu)
Comment 8 Jon Nelson 2008-07-23 06:51:56 UTC
Created attachment 16952
registers (9000 byte mtu) with scatter-gather turned off
Comment 9 Anonymous Emailer 2008-07-23 08:43:35 UTC
Reply-To: jnelson@jamponi.net

I am able to confirm that 2.6.26 from openSUSE-Factory
(2.6.26-20-pae-20.1) seems to work for a minute or two and then fails
in the same manner as 2.6.25.[9,11].

--
Jon
Comment 10 Jon Nelson 2008-11-04 11:19:12 UTC
Confirmed for 2.6.27.1.
Will be trying 2.6.27.4 ASAP.

All I have to do is turn scatter-gather *off* and the card works again.
With it on *and* jumbo frames enabled (9000 MTU), the card becomes inoperative.

I can provide whatever debugging you like.
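
The workaround described in this thread (scatter-gather off, then the interface taken down and brought back up) could be sketched as below. This is an assumption about the command sequence, not something quoted from the reporter. IFACE, RUN and workaround_sg_off are illustrative names, and the sketch only prints the commands unless RUN is set to empty.

```shell
#!/bin/sh
# Sketch only: prints the commands by default instead of running them.
# IFACE, RUN and the function name are illustrative, not from the report.
IFACE=${IFACE:-eth0}
RUN=${RUN-echo}    # set RUN= (empty) and run as root to execute for real

workaround_sg_off() {
    $RUN ethtool -K "$IFACE" sg off      # disable scatter-gather
    # The description says the change has no effect until the interface
    # has been cycled, so take it down and bring it back up.
    $RUN ip link set dev "$IFACE" down
    $RUN ip link set dev "$IFACE" up
}
```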
Comment 11 Alan 2012-05-22 13:03:21 UTC
Closing as obsolete. If this can be reproduced with a modern kernel, please re-open.