I have the following entries from lspci (vvv version attached):

07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

These dual NICs are integrated onto my EX58-EXTREME motherboard from Gigabyte. They used to work fine, though I got a lot of "netlink cannot allocate memory" problems in 2.6.34. Now in 2.6.36 and 2.6.37 I get a lot of lockups and automatic reboots (a watchdog is catching the lockups?), and in 2.6.37 I finally got a message in syslog to help track the problem down:

Dec 14 07:00:52 archer kernel: [26968.898341] ------------[ cut here ]------------
Dec 14 07:00:52 archer kernel: [26968.898350] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0xec/0x14c()
Dec 14 07:00:52 archer kernel: [26968.898352] Hardware name: EX58-EXTREME
Dec 14 07:00:52 archer kernel: [26968.898354] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Dec 14 07:00:52 archer kernel: [26968.898356] Modules linked in: tun fuse fglrx(P) cpufreq_conservative cpufreq_powersave acpi_cpufreq mperf xt_mark btrfs zlib_deflate crc32c libcrc32c hid_logitech usbhid r8169 button uhci_hcd ehci_hcd usbcore ext4 mbcache jbd2 ata_piix ata_generic pata_jmicron edd fan thermal processor
Dec 14 07:00:52 archer kernel: [26968.898376] Pid: 0, comm: kworker/0:0 Tainted: P 2.6.37-rc5 #2
Dec 14 07:00:52 archer kernel: [26968.898378] Call Trace:
Dec 14 07:00:52 archer kernel: [26968.898380] <IRQ> [<ffffffff8103cdb2>] warn_slowpath_common+0x80/0x98
Dec 14 07:00:52 archer kernel: [26968.898389] [<ffffffff8103ce5e>] warn_slowpath_fmt+0x41/0x43
Dec 14 07:00:52 archer kernel: [26968.898392] [<ffffffff81386bfb>] ? netif_tx_lock+0x3f/0x67

It looks incomplete, but that's all there is before my system apparently rebooted. I've been away from the computer each time it has happened, but it seems to reboot pretty quickly.
This has been a very unwanted regression that had me thinking my power supply went awol and worse yet it's my main workstation :-(.
Created attachment 40162 [details] lspci -vvv
This message is mostly a symptom: eth0 was not able to send for too long and the network device TX watchdog kicked in. It will not reboot the computer by itself.

Can you try running the little script below ?

#!/bin/sh

while : ; do
        dir=/tmp/gloo/$(date +%Y%m%d%H%M%S)
        mkdir -p ${dir}
        cat /proc/interrupts > ${dir}/interrupts
        cat /proc/slabinfo > ${dir}/slab
        sync
        sleep 60
done

Please check that your system logger does not operate asynchronously btw.

Can you add a bit of context, say:
- usual uptime with a 2.6.34 kernel
- a short description of the network usage on both interfaces
- MTU
- complete dmesg, especially the XID line from the r8169 driver
- no overclocking in sight ?

I am not comfortable with the proprietary fglrx module. It would be nice to reproduce the problem after a fresh boot without this module.

-- 
Ueimor
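Once a few snapshots exist, the interrupt counts for the NIC can be compared between two of them. The following is only a sketch: the helper names `irq_total` and `irq_delta` are made up for illustration, and it assumes the device name (e.g. eth0) is the last field on its line in /proc/interrupts, which may not hold when the IRQ is shared.

```shell
#!/bin/sh
# Sum the per-CPU interrupt counts for one device in a saved
# /proc/interrupts snapshot.
# Assumption: the device name (e.g. eth0) is the last field on its line.
irq_total() {
    awk -v dev="$2" '
        $NF == dev {
            for (i = 2; i <= NF; i++) {
                if ($i !~ /^[0-9]+$/) break   # stop at the first non-count column
                total += $i
            }
        }
        END { print total + 0 }
    ' "$1"
}

# Print how many interrupts the device received between two snapshots.
# Arguments: older snapshot, newer snapshot, device name.
irq_delta() {
    old=$(irq_total "$1" "$3")
    new=$(irq_total "$2" "$3")
    echo $((new - old))
}
```

For example, `irq_delta /tmp/gloo/<older>/interrupts /tmp/gloo/<newer>/interrupts eth0`, where `<older>` and `<newer>` stand for two of the timestamped directories the script created. A delta near zero while the link should be busy would hint that interrupts stopped arriving, though that alone does not identify the cause.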
Created attachment 40202 [details] dmesg
I am running the script now, though I'm remote.

Usual uptime with 2.6.34: 2-3 months, usually ended by a power outage.

eth0: main uplink. Constant trickle with a few hours of relatively intense (1mB+) usage every day. The problem occurs more often during these intense periods, though that also happens to be when I'm using the computer most.

eth1: LAN traffic. It used to carry a lot of traffic all day (this machine serves as a gateway); lately only a few hours of 100-400kB traffic a day, if even that.

Both devices have an MTU of 1500, although upon checking just now eth0 was at 576 for some reason (this iface is dynamically configured).

As for overclocking, yes, this machine (an i7 920) is lightly overclocked but not overvolted or anything. It has been since I got it, and I have never had any problems with the typical benchmarks or strange behavior otherwise (superpi and memcheck have both run for hours without any problems, on top of kernel compiles and countless other workloads). I know it's good troubleshooting to turn it off, but really, I think the probability that this is the culprit is insanely low.

As for getting it to happen with fglrx, I'll see what I can do later tonight.

I use openSUSE 11.3 and syslog-ng - any way to check if I'm using async logging?
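Since eth0 turned up at 576 instead of 1500, it may be worth recording the MTU of both interfaces alongside the other snapshots. A minimal sketch, assuming the MTU is exposed at /sys/class/net/<iface>/mtu; the function name `print_mtus` is made up for illustration:

```shell
#!/bin/sh
# Print "<iface> <mtu>" for each interface named after the sysfs root.
# The first argument is the sysfs network directory, normally /sys/class/net.
print_mtus() {
    root=$1
    shift
    for ifc in "$@"; do
        if [ -r "$root/$ifc/mtu" ]; then
            echo "$ifc $(cat "$root/$ifc/mtu")"
        else
            # Interface missing or not readable in this sysfs tree.
            echo "$ifc unknown"
        fi
    done
}
```

Calling `print_mtus /sys/class/net eth0 eth1` once per minute and saving the output would show when the value on eth0 changes.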
Ok, this is a 8168d (r8169.c::RTL_GIGA_MAC_VER_25).

Before looking any further, you should rebuild your r8169 module with the patches available at:
- http://marc.info/?l=linux-netdev&m=129118104512684
- http://marc.info/?l=linux-netdev&m=129119732929951

-- 
Ueimor
I found and applied V2 of net-r8169-Remove-the-firmware-of-RTL8111D.patch and the firmware adding patch, compiled and reloaded the module. I'll notify the next time it crashes.
It seems it did it again 2 hours ago (I'm still at work). There is no crash log though, so it's not certain that it's still the same problem. I'll have to stress test it on the weekend or something, with and without fglrx.
A lockup with r8169.c::RTL_GIGA_MAC_VER_25 has been fixed in current -git kernel (see 1519e57fe81c14bb8fa4855579f19264d1ef63b4). Can you give it a try ? Current -git kernel includes (not for long) a nasty cast error but your 8168 revision can not notice it. -- Ueimor
I've had a few sudden reboots in the interim, with a much lower chance of occurring on 2.6.37 from openSUSE Tumbleweed. It occurs a lot more on the desktop flavour kernel than on generic (desktop won't last the night, generic can last a month+). I don't really have time to test it out now, maybe in a few weeks.