Bug 11039

Summary: 2.6.28-rc8-git3 forcedeth WARNING (kills the interface)
Product: Drivers
Reporter: Rafael J. Wysocki (rjw)
Component: Network
Assignee: drivers_network (drivers_network)
Status: CLOSED CODE_FIX
Severity: normal
CC: brad, florian
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.28-rc8-git3
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 10492
Attachments: dmesg of complete boot and failure with acpi, apic and msi disabled

Description Rafael J. Wysocki 2008-07-05 13:07:50 UTC
Subject    : 2.6.28-rc8-git3 forcedeth WARNING (kills the interface)
Submitter  : Brad Campbell <brad@wasp.net.au>
Date       : 2008-07-03 10:07
References : http://marc.info/?l=linux-netdev&m=121508714430752&w=4

This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2008-07-06 10:11:18 UTC
On Sunday, 6 of July 2008, Brad Campbell wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.25.  Please verify if it still should be listed.
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=11039
> > Subject             : 2.6.28-rc8-git3 forcedeth WARNING (kills the interface)
> > Submitter   : Brad Campbell <brad@wasp.net.au>
> > Date                : 2008-07-03 10:07 (4 days old)
> > References  : http://marc.info/?l=linux-netdev&m=121508714430752&w=4
> 
> While it is certainly a problem, I can't verify it as a regression. When I got the machine, I ran it with 2.6.25 but found SATA errors were locking the box.
> 
> The SATA issue is resolved with 2.6.26-rc and I'm not terribly keen to risk my data to go back and check unless someone absolutely needs me to.
> 
> It does appear to be quite a problem though.
> 
> brad@srv:~$ dmesg | head -n5
> [    0.000000] Linux version 2.6.26-rc8-git4 (brad@srv) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #5 SMP Fri Jul 4 23:08:38 GST 2008
> [    0.000000] Command line: root=/dev/md1 ro
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009d400 (usable)
> [    0.000000]  BIOS-e820: 000000000009d400 - 00000000000a0000 (reserved)
> 
> brad@srv:~$ dmesg | grep 'eth1: tx_timeout' | wc -l
> 27
> 
> brad@srv:~$ uptime
>   17:40:25 up 1 day,  1:15,  5 users,  load average: 0.73, 0.61, 0.49
Comment 2 Brad Campbell 2008-07-07 00:07:06 UTC
Created attachment 16755 [details]
dmesg of complete boot and failure with acpi, apic and msi disabled
Comment 3 Rafael J. Wysocki 2008-07-13 13:03:12 UTC
On Sunday, 13 of July 2008, Brad Campbell wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.25.  Please verify if it still should be listed.
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=11039
> > Subject             : 2.6.28-rc8-git3 forcedeth WARNING (kills the interface)
> > Submitter   : Brad Campbell <brad@wasp.net.au>
> > Date                : 2008-07-03 10:07 (11 days old)
> > References  : http://marc.info/?l=linux-netdev&m=121508714430752&w=4
> 
> I tested a 2.6.26-rc kernel with forcedeth.c from 2.6.25 and it failed in the same way, so I don't think it's a regression as such. In any case, with -rc9 it still fails regularly.
Comment 4 Brad Campbell 2008-08-04 22:18:52 UTC
I've just returned from a week away. As nobody was home, traffic on eth0 was almost nil while eth1 remained quite busy. Over this period the interface did not lock up once.

On further testing, it appears eth1 only locks up when eth0 is under heavier load.

eth0 is a GBit connection to my home network, while eth1 is connected directly to my ADSL router. eth1 is rate limited to 256 kbit/s out and 760 kbit/s in, so it's nowhere near fully loaded, but it does process a lot of packets per second and the traffic stream is pretty much constant at the limits.

eth0 utilisation varies from 0 to about 20MB/s as it is the main file serving link for the network.
Comment 5 Florian Mickler 2010-08-16 12:13:05 UTC
Hi Brad!

I just stumbled upon this bug report. I'm guessing this has been fixed in the meantime?

Or do you still suffer from these lockups, or have you changed hardware?

Also, there was a commit in 2009 that fixed the tx ring setup on tx_timeout, which could possibly have fixed the reported warning:

commit 8f955d7f042e4ac44891a400d5000928f8db9f58
Author: Ayaz Abdulla <aabdulla@nvidia.com>
Date:   Sat Apr 25 09:17:56 2009 +0000

    forcedeth: tx timeout fix
    
    This patch fixes the tx_timeout() to properly handle the clean up of the
    tx ring. It also sets the tx put pointer back to the correct position to
    be in sync with HW.
    
    Signed-off-by: Ayaz Abdulla <aabdulla@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


This would fit the reported stack trace:

[ 2936.865224] Call Trace:
[ 2936.865224]  <IRQ>  [<ffffffff802338e4>] warn_on_slowpath+0x64/0xa0
[ 2936.865224]  [<ffffffffa0046198>] :forcedeth:reg_delay+0x58/0xb0
[ 2936.865224]  [<ffffffffa0046fb6>] :forcedeth:nv_drain_tx+0xb6/0x1a0
[ 2936.865224]  [<ffffffffa004504a>] :forcedeth:setup_hw_rings+0x2a/0x100
[ 2936.865224]  [<ffffffffa00492e7>] :forcedeth:nv_tx_timeout+0x287/0x2c0
[ 2936.865224]  [<ffffffff80463ce5>] dev_watchdog+0xf5/0x110
[ 2936.865224]  [<ffffffff8023d1b2>] run_timer_softirq+0x192/0x200
[ 2936.865224]  [<ffffffff802391a9>] __do_softirq+0x69/0xe0
[ 2936.865224]  [<ffffffff8020c61c>] call_softirq+0x1c/0x30
[ 2936.865224]  [<ffffffff8020eb85>] do_softirq+0x35/0x70
[ 2936.865224]  [<ffffffff80239137>] irq_exit+0x87/0x90
[ 2936.865224]  [<ffffffff8021b5fc>] smp_apic_timer_interrupt+0x7c/0xc0
[ 2936.865224]  [<ffffffff8020c0c6>] apic_timer_interrupt+0x66/0x70
[ 2936.865224]  <EOI>
[ 2936.865224] ---[ end trace 6e6bcab61ac567c9 ]---
Comment 6 Florian Mickler 2010-08-18 09:07:48 UTC
Please reopen if this still happens.