Bug 8962

Summary: sky2: network intermittently unavailable after ifdown/ifup under load
Product: Drivers Reporter: Greg Bailey (gbailey)
Component: Network Assignee: Stephen Hemminger (stephen)
Status: REJECTED INSUFFICIENT_DATA    
Severity: normal CC: alan, nhorman, rf, tony, ucelsanicin
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.23-rc4 Subsystem:
Regression: --- Bisected commit-id:
Attachments: debugfs sky2
debugfs sky2 (when it did work)

Description Greg Bailey 2007-08-30 11:22:39 UTC
Most recent kernel where this bug did not occur:  Unknown

Distribution:  CentOS 4.5

Hardware Environment:  Intel server board SE7320VP21

02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 18)
        Subsystem: Intel Corporation: Unknown device 3466
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at deefc000 (64-bit, non-prefetchable) [size=16K]
        I/O ports at b800 [size=256]
        Expansion ROM at deec0000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
        Capabilities: [e0] Express Legacy Endpoint IRQ 0

Software Environment:  CentOS 4.5 install + "vanilla" kernel 2.6.23-rc4

Problem Description:

Discovered while attempting to troubleshoot:
https://bugzilla.redhat.com/show_bug.cgi?id=228733

I'm trying to understand the "tx timeout" messages, and how to reproduce them.  In my test environment, I have 2 servers, each of which has a sky2 Marvell NIC connected to a switch as "eth0".

On server "B", I type "nc -l -p 3409 > /dev/null"

On server "A", I type "nc server-B 3409 < /dev/zero"

I see lots of traffic from A->B, as would be expected.  If I shut down eth0 on server "B" using "ifdown eth0", wait a few seconds, and then re-enable eth0 on server "B" using "ifup eth0", I see the following in "dmesg" output on server B:

sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
ip_tables: (C) 2000-2006 Netfilter Core Team

As expected...  The problem is that server B can occasionally end up in a state where it is unable to ping or access the local subnet anymore.  Both "mii-tool" and "ethtool eth0" show a link present.

If I perform "ifdown eth0; ifup eth0" on server B, it doesn't help anything. 
If I unload the sky2 module, then things clear up and I'm back on the network again.

I'm curious about this testcase because the symptom seems to match the earlier "tx timeout" messages; the driver tried to re-enable itself after a timeout, but it's still not able to see any traffic.

Steps to reproduce:

See "Problem Description" above.  While traffic is continuously being transmitted from server "A" to server "B", shut down the network interface on server "B", and then start the interface on server "B" again.  Monitoring RX traffic on server "B" will indicate when it is no longer receiving the bytes sent from server "A".
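The RX monitoring step could be scripted along these lines. This is only a sketch: reading the counter from /proc/net/dev is an assumption (the report does not say how RX traffic was monitored), the function names rx_bytes and rx_stalled are hypothetical, and the sample counter values are made up for illustration.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Return the received-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        # Interface rows look like "  eth0: <rx bytes> <rx packets> ...".
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            return int(rest.split()[0])
    raise KeyError(iface)


def rx_stalled(before, after, min_delta=1):
    """True if the counter grew by less than min_delta between two samples."""
    return (after - before) < min_delta


# Made-up sample in the usual /proc/net/dev layout (not taken from the report).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 5000000    4000    0    0    0     0          0         0   200000    3000    0    0    0     0       0          0
"""

if __name__ == "__main__":
    import time

    # On server "B": sample the live counter twice while server "A" is sending.
    with open("/proc/net/dev") as f:
        first = rx_bytes(f.read(), "eth0")
    time.sleep(5)
    with open("/proc/net/dev") as f:
        second = rx_bytes(f.read(), "eth0")
    print("RX stalled" if rx_stalled(first, second) else "RX still flowing")
```

Run after the ifdown/ifup cycle while the "nc" transfer from server "A" is still active; a stalled counter with link reported up matches the failure described above.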
Comment 1 Stephen Hemminger 2007-08-30 13:40:30 UTC
CentOS has an older version of the driver; please update to the latest version from 2.6.22.6 or 2.6.23-rc4.  The older driver had several bugs that caused tx timeouts (hung chip),
and a problem that led to PHY clock issues.
Comment 2 Greg Bailey 2007-08-30 14:42:09 UTC
The kernel version I encountered this on is 2.6.23-rc4, as marked in the bug report, which is why I wrote 'CentOS 4.5 install + "vanilla" kernel 2.6.23-rc4' under "Software Environment".
Comment 3 Stephen Hemminger 2007-09-05 06:35:35 UTC
Please enable the sky2 debugfs kernel configuration option.
Mount debugfs somewhere (e.g. /debug).
Reproduce the hang, then capture the sky2 state.  (cat /debug/sky2/eth0 >savefile)
It will show the status of IRQ and receive/transmit.
Comment 4 Greg Bailey 2007-09-07 13:59:57 UTC
Rebuilt 2.6.23-rc5 with SKY2_DEBUG.  I've reproduced the issue where ifdown/ifup does not reset the interface properly.

# cat /debug/sky2/eth0
IRQ src=0 mask=c000001d control=0
Status ring (empty)
Tx ring pending=24...24 report=24 done=24

Rx ring hw get=169 put=169 last=1023
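
For comparing a dump like the one above against one taken while the interface works, the key=value fields can be pulled into a dictionary. This is a rough sketch that assumes the dump has exactly the line shapes shown in this comment; the names parse_sky2_dump and tx_pending_range are hypothetical, and treating the two numbers in "pending=24...24" as decimal start and end positions is a guess, not something the driver documentation in this thread confirms.

```python
import re

# Matches tokens such as "mask=c000001d" or "pending=24...24".
PAIR = re.compile(r"(\w+)=(\S+)")

# The dump quoted in Comment 4, copied verbatim.
DUMP = """\
IRQ src=0 mask=c000001d control=0
Status ring (empty)
Tx ring pending=24...24 report=24 done=24

Rx ring hw get=169 put=169 last=1023
"""


def parse_sky2_dump(text):
    """Collect key=value tokens, prefixed by the section they appear in."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("IRQ"):
            prefix = "irq"
        elif line.startswith("Tx ring"):
            prefix = "tx"
        elif line.startswith("Rx ring"):
            prefix = "rx"
        else:
            continue  # e.g. "Status ring (empty)" carries no key=value pairs
        for key, value in PAIR.findall(line):
            fields[prefix + "." + key] = value  # values are kept as strings
    return fields


def tx_pending_range(fields):
    """Split 'pending=A...B' into (A, B); decimal parsing is an assumption."""
    start, _, end = fields["tx.pending"].partition("...")
    return int(start), int(end)


if __name__ == "__main__":
    for key, value in sorted(parse_sky2_dump(DUMP).items()):
        print(key, value)
```

Diffing the dictionaries from a failed and a working capture shows at a glance which counters differ.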
Comment 5 Anthony Awtrey 2007-09-10 13:49:12 UTC
I can confirm that we can reproduce this issue (or one nearly identical to it). We are using the current stable 2.6.22.6 kernel on a system with a Marvell 88E8055 (Panasonic Toughbook CF-74).

To reproduce it, we can open any kind of persistent socket connection (such as an Apache SSL connection using a browser) and then yank the cable. We wait a bit and pop the cable back in and the driver is dead. We can't ping in or out until we down the interface, remove and reinsert the sky2 driver and bring the interface back up.

I will be happy to provide any info or test any patches you provide.
Comment 6 Andy Matteson 2007-09-11 17:04:30 UTC
I'm having trouble here too.  Using an Ubuntu Gutsy kernel:

Linux andy-desktop 2.6.22-10-generic #1 SMP Wed Aug 22 07:42:05 GMT 2007 x86_64 GNU/Linux

I'm not getting tx timeouts AFAIK.  I'm not getting any driver crash dumps either.  I'm just having connection issues.  I'm not transferring anything big.  I will be browsing the web, then all of a sudden the interface will get in some type of corrupted state where nothing works.  Sometimes ifdown/ifup will do it, sometimes it will not.  Sometimes dhclient works, sometimes not.  Unloading sky2 and reloading it *always* fixes the problem, indicating some type of issue with the "current state" of the driver.  Maybe a variable not getting cleared/etc but I can only guess.

Sometimes ifdown/ifup will work and then it will only work for about a minute.  Redoing ifdown/ifup will make sky2 work for another few hours (it's like refilling your gas tank, just on a smaller level ;)).

Sometimes I will get Destination Host Unreachable from pinging my router, sometimes ping says nothing at all.

I tried with the modprobe sky2 debug=16 option but the log output looks not much different from when the adapter is working.  And, I haven't caught it just when it stopped working, yet.  I have only turned on my monitor to notice that my net wasn't working and then dumped a few logs of it.  In any case, I don't think they're helpful but if you need them I will gladly post them.

Most importantly, this is a regression from 2.6.20.  I hope this can get fixed and if so I'll notify those at Ubuntu and get this into the kernel and hopefully an exception for it if necessary.

Ubuntu bug link: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/138611
Comment 7 Andy Matteson 2007-09-16 11:14:38 UTC
I fixed my problems by using 2.6.23-rc6.
Comment 8 Neil Horman 2007-09-17 11:49:19 UTC
Interesting, the only thing that went in between rc5 and rc6 was the restore multicast list on resume, which while potentially applicable, doesn't sound like it addresses the whole of the problem.  Does rc5 fix the problem for you as well?
Comment 9 Andy Matteson 2007-09-26 15:26:02 UTC
Sorry for the misunderstanding.

I fixed my problems by upgrading from the Ubuntu Linux 2.6.22-11 kernel to the vanilla 2.6.23-rc6 kernel.  I hadn't even tried any other 2.6.23 yet.  I'm thinking the Ubuntu kernel has a problem due to mismatched or partially backported patches, at least in my case.
Comment 10 Andy Matteson 2007-09-30 16:55:57 UTC
Created attachment 13006 [details]
debugfs sky2
Comment 11 Andy Matteson 2007-09-30 16:56:11 UTC
Created attachment 13007 [details]
debugfs sky2 (when it did work)
Comment 12 Andy Matteson 2007-09-30 16:56:43 UTC
I am still having issues with 2.6.23-rc6 and rc8, but it took a while for them to begin happening again.  I attached two debugfs logs of sky2.
Comment 13 Richard F 2007-10-27 10:20:54 UTC
I'm running SuSe 10.3 and with an updated kernel (2.6.23.1-164-default) the problem remains. 
The interface is listed as "sky2 0000:02:00.0: v1.18 addr 0xd5020000 irq 17 Yukon-EC (0xb6) rev 1"
I only run 100 Mbit to a switch.  I'm using it on a media server and, unfortunately, after a few hours of reasonably heavy use streaming media, the interface dies; then, 3-4 hours later, the machine crashes.
If I get to the machine before it dies, I can restart the interface, but as others report, it lasts for a shorter time.  
When restarting it, "ifstatus" reports it as up in the failed mode, doing an "ifdown" and "ifup" restarts it.
ifup reports: "device: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 19)"
I see nothing in dmesg when the interface dies
Comment 14 Stephen Hemminger 2007-10-29 08:15:48 UTC
There is a problem on Yukon-EC that causes the receive fifo to hang.
There is workaround code in 2.6.23 that is supposed to detect and fix it.

The problem also only occurs if there is no flow control. The sky2
autonegotiates to enable flow control, but some hardware doesn't support
flow control or has it disabled.
Comment 15 Richard F 2007-10-29 19:49:16 UTC
Thanks.
Unfortunately the log reports:
kernel: [  982.916325] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
So I'm not sure it's limited to the case when flow control is off.
I noticed some threads earlier this year where you tried flow control off. Is that worth trying again with latest release, if so how?
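
To check what was negotiated each time the link comes up, the "Link is up" lines can be extracted from the kernel log. A minimal sketch, assuming the message format matches the lines quoted in this report; the function name parse_link_up is hypothetical, and only the flow control values "rx" and "both" actually appear in this thread.

```python
import re

# Pattern modelled on the log lines quoted in this report.
LINK_UP = re.compile(
    r"sky2 (?P<iface>\S+): Link is up at (?P<speed>\d+) Mbps, "
    r"(?P<duplex>full|half) duplex, flow control (?P<flow>\S+)"
)


def parse_link_up(line):
    """Return link parameters from a sky2 'Link is up' line, or None."""
    m = LINK_UP.search(line)
    if m is None:
        return None
    return {
        "iface": m.group("iface"),
        "speed_mbps": int(m.group("speed")),
        "duplex": m.group("duplex"),
        "flow_control": m.group("flow"),
    }


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python this_script.py
    for log_line in sys.stdin:
        info = parse_link_up(log_line)
        if info is not None:
            print(info)
```

Piping dmesg through this after each failure gives a quick record of whether flow control was negotiated at the time.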
Comment 16 Richard F 2007-10-30 11:50:32 UTC
I can repeat the failure by trying to copy about 20G of files over a Samba connection from a Windows box.  I can never get past 5G before it fails.  So perhaps I can do some debugging for you?
Comment 17 Stephen Hemminger 2008-01-07 22:27:32 UTC
Is this the same bug as the original report, or is the bug becoming a tar baby for all the possible "my sky2 has hung" reports?

The original report said problem was reproducible after up/down. Not one
of the "my box hangs under load" problems.
Comment 18 Richard F 2008-01-09 16:45:47 UTC
Sorry, no, to avoid raising another bug on sky2 this was the nearest I could find.
Sky2 hangs under load, that's the problem.  Very repeatable.  
I've now compiled and switched to the Marvell driver sk98lin, and that gives me no problems... 
Comment 19 Andrej Krutak 2008-04-24 15:28:20 UTC
Tried to find the bug source, but couldn't ;-( I used ubuntu 2.6.24 sources, placed the 2.6.22 (ubuntu) sky2.[ch] (ver. 1.18) files into the tree and applied the

[NET]: Make NAPI polling independent of struct net_device objects.
+
[NET]: Nuke SET_MODULE_OWNER macro.

patches (from git). Then I built the module and did a rmmod/modprobe, but nothing changed - the sky2 still fails with "sky2 eth0: rx error ..." in the dmesg.

Thus I guess the error could be somewhere else (maybe the napi polling isn't working quite right?), or maybe... I guess I'm gonna try to really find the bug...
Comment 20 Ryan Roth 2008-04-30 14:34:41 UTC
I have consistently had the same issue reported above, where the kernel reports the following and the interface does not work.  It seems to work fine the first time you bring up the interface, but if you do an ifdown/ifup you get the following message, but no connection.

"sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both"
Comment 21 Alan 2009-03-23 11:34:11 UTC
Closing out old bugs