Bug 24902

Summary: r8169 regression with lockups
Product: Drivers Reporter: Jason Newton (nevion)
Component: Network Assignee: Francois Romieu (romieu)
Status: RESOLVED INSUFFICIENT_DATA    
Severity: normal CC: akpm, alan, romieu
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.37rc5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: lspci -vvv
dmesg

Description Jason Newton 2010-12-14 19:15:58 UTC
I have the following entries from lspci (vvv version attached):
07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

These dual NICs are integrated onto my EX58-EXTREME motherboard from Gigabyte.  They used to work fine, though I got a lot of netlink cannot allocate memory problems in 2.6.34.  Now in 2.6.36 and 2.6.37, I get a lot of lockups and automatic reboots (is a watchdog catching the lockups?), and in 2.6.37 I finally got a message in syslog to help track the problem down:

Dec 14 07:00:52 archer kernel: [26968.898341] ------------[ cut here ]------------
Dec 14 07:00:52 archer kernel: [26968.898350] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0xec/0x14c()
Dec 14 07:00:52 archer kernel: [26968.898352] Hardware name: EX58-EXTREME
Dec 14 07:00:52 archer kernel: [26968.898354] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Dec 14 07:00:52 archer kernel: [26968.898356] Modules linked in: tun fuse fglrx(P) cpufreq_conservative cpufreq_powersave acpi_cpufreq mperf xt_mark btrfs zlib_deflate crc32c libcrc32c hid_logitech usbhid r8169 button uhci_hcd ehci_hcd usbcore ext4 mbcache jbd2 ata_piix ata_generic pata_jmicron edd fan thermal processor
Dec 14 07:00:52 archer kernel: [26968.898376] Pid: 0, comm: kworker/0:0 Tainted: P            2.6.37-rc5 #2
Dec 14 07:00:52 archer kernel: [26968.898378] Call Trace:
Dec 14 07:00:52 archer kernel: [26968.898380]  <IRQ>  [<ffffffff8103cdb2>] warn_slowpath_common+0x80/0x98
Dec 14 07:00:52 archer kernel: [26968.898389]  [<ffffffff8103ce5e>] warn_slowpath_fmt+0x41/0x43
Dec 14 07:00:52 archer kernel: [26968.898392]  [<ffffffff81386bfb>] ? netif_tx_lock+0x3f/0x67

It looks incomplete but that's all there is before my system apparently rebooted.  I've been away from the computer each time it has happened but it seems to reboot pretty quickly.

This has been a very unwanted regression that had me thinking my power supply had gone AWOL, and worse yet, it's my main workstation :-(.
Comment 1 Jason Newton 2010-12-14 19:18:00 UTC
Created attachment 40162
lspci -vvv
Comment 2 Francois Romieu 2010-12-14 22:56:59 UTC
This message is mostly a symptom: eth0 was unable to send for too long
and the network device TX watchdog kicked in. It will not reboot the computer
by itself. Can you try running the little script below?

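#!/bin/sh
# Snapshot interrupt counters and slab usage once a minute.
# Reading /proc/slabinfo may require root.

while : ; do
        dir=/tmp/gloo/$(date +%Y%m%d%H%M%S)
        mkdir -p "${dir}"
        cat /proc/interrupts > "${dir}/interrupts"
        cat /proc/slabinfo > "${dir}/slab"
        # Flush to disk so the last snapshot survives a sudden reboot.
        sync
        sleep 60
done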
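
Once a few snapshots exist, the interrupt counts can be compared across them to see whether a NIC's counter stalls or jumps before a lockup. This is only a rough sketch of one way to read the data: the `irq_total` helper name and the `eth0` pattern are illustrative assumptions, and it sums the per-CPU columns of any /proc/interrupts line whose text matches the pattern.

```shell
#!/bin/sh
# Hypothetical helper (not part of the script above).
# irq_total FILE PATTERN: sum the numeric per-CPU counters on every
# line of a /proc/interrupts snapshot that matches PATTERN.
irq_total() {
    awk -v pat="$2" '
        $0 ~ pat {
            # Field 1 is the IRQ number ("45:"); later numeric fields are counts.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) sum += $i
        }
        END { print sum + 0 }
    ' "$1"
}

# Print one "timestamp total" line per snapshot directory.
for d in /tmp/gloo/*/; do
    [ -f "${d}interrupts" ] || continue
    printf '%s %s\n' "$(basename "$d")" "$(irq_total "${d}interrupts" eth0)"
done
```

A total that stops growing between consecutive snapshots while traffic is expected would be consistent with the TX timeout, though it does not by itself identify the cause.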

Please check that your system logger does not operate asynchronously btw.

Can you add a bit of context, say :
- usual uptime with a 2.6.34 kernel
- a short description of the network usage on both interfaces
- MTU
- complete dmesg, especially the XID line from the r8169 driver
- no overclocking in sight ?
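
Some of the items above can be gathered in one go. The following is a minimal sketch, assuming a Linux system with sysfs mounted and a readable kernel log; the function names (`mtu_of`, `xid_lines`, `collect_context`) are made up for illustration, and the network usage description and overclocking status still have to be written by hand.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# mtu_of IFACE: print the MTU reported by sysfs (prints nothing if IFACE is absent).
mtu_of() {
    cat "/sys/class/net/$1/mtu" 2>/dev/null
}

# xid_lines: keep only the lines from stdin that mention XID.
xid_lines() {
    grep 'XID' || true
}

# collect_context IFACE...: print uptime, per-interface MTU and XID lines.
collect_context() {
    uptime
    for iface in "$@"; do
        printf '%s mtu: %s\n' "$iface" "$(mtu_of "$iface")"
    done
    # dmesg may need elevated privileges on some systems.
    dmesg 2>/dev/null | xid_lines
}
```

Running `collect_context eth0 eth1 > context.txt` would produce a file that can be attached alongside the complete dmesg.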

I am not comfortable with the proprietary fglrx module. It would be
nice to reproduce the problem after a fresh boot without this module.

-- 
Ueimor
Comment 3 Jason Newton 2010-12-14 23:01:59 UTC
Created attachment 40202
dmesg
Comment 4 Jason Newton 2010-12-14 23:15:01 UTC
I am running the script now though I'm remote.

Usual uptime with 2.6.34: 2-3 months, usually taken out by a power outage.
eth0: main uplink.  Constant trickle with a few hours of relatively intense (1MB+) usage every day.  The problem occurs more often during these intense periods, though that also happens to be when I'm using the computer most.
eth1: LAN traffic. It used to carry a lot of traffic all day (this machine serves as a gateway); lately only a few hours of 100-400kB traffic a day, if even that.

Both devices have an MTU of 1500, although upon checking just now eth0 was at 576 for some reason (this interface is dynamically configured).
As for overclocking, yes, this machine (an i7 920) is lightly overclocked but not overvolted or anything.  It has been since I got it, and I have never had any problems with the typical benchmarks or strange behaviors otherwise (superpi and memcheck have both run for hours without any problems, on top of kernel compiles and countless other workloads). I know it's good troubleshooting to turn it off, but really, I think the probability that this is the culprit is insanely low.

As for getting it to happen with fglrx, I'll see what I can do later tonight.

I use openSUSE 11.3 and syslog-ng. Is there any way to check whether I'm using async logging?
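
One possible way to look into this is to inspect the buffering options in the syslog-ng configuration. The sketch below assumes the configuration lives at /etc/syslog-ng/syslog-ng.conf (the path may differ) and uses a hypothetical helper name, `show_flush_settings`. It lists uncommented `sync()`, `flush_lines()` and `flush_timeout()` settings, which control how many lines syslog-ng buffers before writing; treating these as the relevant "asynchronous" knobs is an interpretation, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative only.
# show_flush_settings FILE: print, with line numbers, the uncommented
# lines of a syslog-ng config that set sync(), flush_lines() or flush_timeout().
show_flush_settings() {
    # Blank out comments first so line numbers stay aligned with the file.
    sed 's/#.*//' "$1" |
        grep -nE '(sync|flush_lines|flush_timeout)[[:space:]]*\('
}
```

For example, `show_flush_settings /etc/syslog-ng/syslog-ng.conf` printing nothing would mean none of these options are set explicitly, so the defaults of the installed syslog-ng version apply.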
Comment 5 Francois Romieu 2010-12-14 23:39:30 UTC
Ok, this is an 8168d (r8169.c::RTL_GIGA_MAC_VER_25).

Before looking any further, you should rebuild your r8169 module with the
patches available at :
- http://marc.info/?l=linux-netdev&m=129118104512684
- http://marc.info/?l=linux-netdev&m=129119732929951

-- 
Ueimor
Comment 6 Jason Newton 2010-12-15 08:49:06 UTC
I found and applied V2 of net-r8169-Remove-the-firmware-of-RTL8111D.patch and the firmware adding patch, compiled and reloaded the module.  I'll notify the next time it crashes.
Comment 7 Jason Newton 2010-12-16 03:14:57 UTC
It seems it happened again 2 hours ago (I'm still at work). There is no crash log though, so it's not certain that it is still the same problem.  I'll have to stress test it over the weekend or so, with and without fglrx.
Comment 8 Francois Romieu 2011-02-22 17:04:08 UTC
A lockup with r8169.c::RTL_GIGA_MAC_VER_25 has been fixed in current -git
kernel (see 1519e57fe81c14bb8fa4855579f19264d1ef63b4).

Can you give it a try ?

Current -git kernel includes (not for long) a nasty cast error but your
8168 revision can not notice it.

-- 
Ueimor
Comment 9 Jason Newton 2011-02-24 05:39:13 UTC
I've had a few sudden reboots in the interim, with a much lower chance of occurring on 2.6.37 from openSUSE Tumbleweed. It occurs a lot more on the desktop flavour kernel than on generic (desktop won't last the night, generic can last a month+).

I don't really have time to test it out now; maybe in a few weeks.