Bug 14709

Summary: 8169 hangs box hard under heavy load
Product: Drivers Reporter: Jon Nelson (jnelson-kernel-bugzilla)
Component: NetworkAssignee: Francois Romieu (romieu)
Status: CLOSED CODE_FIX    
Severity: high CC: akpm, alan, bugs, dariush, fabricio.nihues, masp01, moneta.mace, richard, victorpablosceruelo
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31.5 Subsystem:
Regression: Yes Bisected commit-id:

Description Jon Nelson 2009-12-01 01:59:00 UTC
I have a box with a realtek8169 as the built in gigabit ethernet.
I'm unable to put it under heavy network traffic as if I do the box hangs
shortly thereafter.  Can be replicated 100% of the time, usually within 2-3 minutes of heavy (network) load.

from /v/l/m


Nov 22 16:34:33 frank kernel: [28634.284677] The following is only an harmless
informational message.
Nov 22 16:34:33 frank kernel: [28634.284687] Unless you get a
_continuous_flood_ of these messages it means
Nov 22 16:34:33 frank kernel: [28634.284693] everything is working fine.
Allocations from irqs cannot be
Nov 22 16:34:33 frank kernel: [28634.284697] perfectly reliable and the kernel
is designed to handle that.
Nov 22 16:34:33 frank kernel: [28634.284704] swapper: page allocation failure.
order:0, mode:0x20
Nov 22 16:34:33 frank kernel: [28634.284713] Pid: 0, comm: swapper Not tainted
2.6.31.5-0.1-default #1
Nov 22 16:34:33 frank kernel: [28634.284718] Call Trace:
Nov 22 16:34:33 frank kernel: [28634.284748]  [<ffffffff81011749>]
try_stack_unwind+0x189/0x1b0
Nov 22 16:34:33 frank kernel: [28634.284763]  [<ffffffff8101013d>]
dump_trace+0x9d/0x330
Nov 22 16:34:33 frank kernel: [28634.284776]  [<ffffffff81011254>]
show_trace_log_lvl+0x64/0x90
Nov 22 16:34:33 frank kernel: [28634.284787]  [<ffffffff810112a3>]
show_trace+0x23/0x40
Nov 22 16:34:33 frank kernel: [28634.284801]  [<ffffffff81554378>]
dump_stack+0x81/0x9e
Nov 22 16:34:33 frank kernel: [28634.284814]  [<ffffffff811110c2>]
__alloc_pages_slowpath+0x572/0x580
Nov 22 16:34:33 frank kernel: [28634.284826]  [<ffffffff81111221>]
__alloc_pages_nodemask+0x151/0x160
Nov 22 16:34:33 frank kernel: [28634.284838]  [<ffffffff8114beac>]
kmem_getpages+0x6c/0x190
Nov 22 16:34:33 frank kernel: [28634.284850]  [<ffffffff8114cfca>]
fallback_alloc+0x1ca/0x290
Nov 22 16:34:33 frank kernel: [28634.284861]  [<ffffffff8114cd10>]
____cache_alloc_node+0xb0/0x1a0
Nov 22 16:34:33 frank kernel: [28634.284874]  [<ffffffff8114c1a2>]
kmem_cache_alloc_node+0xa2/0x220
Nov 22 16:34:33 frank kernel: [28634.284885]  [<ffffffff8114c39f>]
__kmalloc_node+0x7f/0x110
Nov 22 16:34:33 frank kernel: [28634.284898]  [<ffffffff81461cf4>]
__alloc_skb+0x84/0x1a0
Nov 22 16:34:33 frank kernel: [28634.284910]  [<ffffffff81462150>]
__netdev_alloc_skb+0x40/0x80
Nov 22 16:34:33 frank kernel: [28634.284936]  [<ffffffffa011f0a8>]
rtl8169_rx_fill+0xc8/0x280 [r8169]
Nov 22 16:34:34 frank kernel: [28634.284966]  [<ffffffffa011f63c>]
rtl8169_rx_interrupt+0x3dc/0x590 [r8169]
Nov 22 16:34:34 frank kernel: [28634.284990]  [<ffffffffa0120bca>]
rtl8169_poll+0x4a/0x298 [r8169]
Nov 22 16:34:34 frank kernel: [28634.285011]  [<ffffffff81470b49>]
net_rx_action+0x149/0x2e0
Nov 22 16:34:34 frank kernel: [28634.285024]  [<ffffffff81076713>]
__do_softirq+0xd3/0x240
Nov 22 16:34:34 frank kernel: [28634.285035]  [<ffffffff8100d80c>]
call_softirq+0x1c/0x30
Nov 22 16:34:34 frank kernel: [28634.285046]  [<ffffffff8100f8c5>]
do_softirq+0xb5/0x110
Nov 22 16:34:34 frank kernel: [28634.285056]  [<ffffffff810763a5>]
irq_exit+0xb5/0xd0
Nov 22 16:34:34 frank kernel: [28634.285066]  [<ffffffff8100ed1c>]
do_IRQ+0x7c/0x100
Nov 22 16:34:34 frank kernel: [28634.285078]  [<ffffffff8100d013>]
ret_from_intr+0x0/0x11
Nov 22 16:34:34 frank kernel: [28634.285092]  [<ffffffff81039766>]
native_safe_halt+0x6/0x10
Nov 22 16:34:34 frank kernel: [28634.285107]  [<ffffffff810168c2>]
default_idle+0x62/0x120
Nov 22 16:34:34 frank kernel: [28634.285119]  [<ffffffff81016a42>]
c1e_idle+0xc2/0x170
Nov 22 16:34:34 frank kernel: [28634.285129]  [<ffffffff8100b44c>]
cpu_idle+0xcc/0x120
Nov 22 16:34:34 frank kernel: [28634.285140]  [<ffffffff8153da49>]
rest_init+0x79/0xa0
Nov 22 16:34:34 frank kernel: [28634.285152]  [<ffffffff819e4766>]
start_kernel+0x3db/0x3f8
Nov 22 16:34:34 frank kernel: [28634.285165]  [<ffffffff819e370d>]
x86_64_start_reservations+0x134/0x14f
Nov 22 16:34:34 frank kernel: [28634.285176]  [<ffffffff819e3883>]
x86_64_start_kernel+0x15b/0x17e
Nov 22 16:34:34 frank kernel: [28634.285185] Mem-Info:
Nov 22 16:34:34 frank kernel: [28634.285189] Node 0 DMA per-cpu:
Nov 22 16:34:34 frank kernel: [28634.285196] CPU    0: hi:    0, btch:   1 usd:
  0
Nov 22 16:34:34 frank kernel: [28634.285201] CPU    1: hi:    0, btch:   1 usd:
  0
Nov 22 16:34:34 frank kernel: [28634.285206] CPU    2: hi:    0, btch:   1 usd:
  0
Nov 22 16:34:34 frank kernel: [28634.285210] CPU    3: hi:    0, btch:   1 usd:
  0
Nov 22 16:34:34 frank kernel: [28634.285214] Node 0 DMA32 per-cpu:
Nov 22 16:34:34 frank kernel: [28634.285220] CPU    0: hi:  186, btch:  31 usd:
170
Nov 22 16:34:34 frank kernel: [28634.285226] CPU    1: hi:  186, btch:  31 usd:
162
Nov 22 16:34:34 frank kernel: [28634.285231] CPU    2: hi:  186, btch:  31 usd:
182
Nov 22 16:34:34 frank kernel: [28634.285235] CPU    3: hi:  186, btch:  31 usd:
178



It sometimes takes longer than 3-5 minutes, but eventually the entire box becomes totally unresponsive. Not even magic sysrq works.
Comment 1 Dariush Forouher 2009-12-06 15:28:46 UTC
I'm observing the same issue with 2.6.32 (on a 10ec:8168 card).

Only  difference to the bug reporter's problem is that I don't see the above kernel warning before the kernel locks up.

1. Produce medium load on the NIC (3-4 MB/s over half an hour are usually enough)
2. After a while network connectivity breaks down (no pings come through)
3. about a minute later the whole system locks up (magic sysrq doesn't work either)

ciao
Dariush
Comment 2 Maximilian Güntner 2009-12-06 21:12:38 UTC
I can also confirm the impact but not the symptoms, which means that I don't get any warning messages prior to the freeze.
After some minutes of traffic (in 2.6.32 it happens subjectively faster compared with 2.6.31) the systems hangs and doesn't react to any input.
After waiting about 3 minutes the system sometimes recovers and I'm able to work until the next Ooops occurs.
It's also worth mentioning that multiple connections (like bittorrent) lead to this bug more frequently than heavy traffic with only one connection (e.g. nfs).

Here's how the dmesg looks after the freeze:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x1ec/0x200()
Hardware name: GA-MA790XT-UD4P
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Modules linked in: ipv6 isofs fan battery ac snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device nfs lockd fscache nfs_acl snd_pcm_oss ohci1394 auth_rpcgss snd_mixer_oss nvidia(P) ieee1394 snd_hda_codec_realtek sunrpc processor sg snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc joydev floppy usbhid hid evdev ati_agp agpgart i2c_piix4 i2c_core button thermal r8169 mii kvm_amd kvm loop rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sr_mod cdrom pata_jmicron sd_mod pata_atiixp pata_acpi ahci ata_generic ohci_hcd ehci_hcd libata scsi_mod
Pid: 0, comm: swapper Tainted: P           2.6.32-pae-x52 #1
Call Trace:
 [<c103bdae>] ? warn_slowpath_common+0x6e/0xb0
 [<c124599c>] ? dev_watchdog+0x1ec/0x200
 [<c103be3b>] ? warn_slowpath_fmt+0x2b/0x30
 [<c124599c>] ? dev_watchdog+0x1ec/0x200
 [<c1050039>] ? sys_getpriority+0x69/0x1f0
 [<c1018613>] ? lapic_next_event+0x13/0x20
 [<c12457b0>] ? dev_watchdog+0x0/0x200
 [<c104972d>] ? run_timer_softirq+0x12d/0x310
 [<c1063f66>] ? tick_dev_program_event+0x46/0xd0
 [<c104218f>] ? __do_softirq+0x8f/0x200
 [<c104233d>] ? do_softirq+0x3d/0x40
 [<c1042515>] ? irq_exit+0x65/0x70
 [<c1018fe4>] ? smp_apic_timer_interrupt+0x54/0x90
 [<c1004535>] ? apic_timer_interrupt+0x31/0x38
 [<c1021262>] ? native_safe_halt+0x2/0x10
 [<c100b41f>] ? default_idle+0x3f/0xb0
 [<c100b8a8>] ? c1e_idle+0x88/0x110
 [<c1002ae4>] ? cpu_idle+0x94/0xe0
---[ end trace 40eab05dc9dc63c7 ]---
Comment 3 Jon Nelson 2009-12-08 16:32:32 UTC
See Novell bug 557623 - https://bugzilla.novell.com/show_bug.cgi?id=557623 - wherein a specific git commit was tried and it appears to have helped.

The commit, 05af2142d09845de2f4ae34181c72addd72d5ef9, which has date Sat Oct 24 06:47:33 2009 -0700, may be related.
Comment 4 Dariush Forouher 2009-12-08 20:50:11 UTC
Thanks for the link. As this commit is included in 2.6.32, which I already use, sadly it doesn't seem to help in my case.

(To be sure I also reverted it - the bug occurs with and without this commit.)

The first symptom seems to be the break of network connectivity, after which the
system usually remains responsive for several minutes and thus could be debugged. Is there any information (sysfs readouts, additional printk's, register dumps?) which could help find the cause of this problem?

I have a dedicated system on which I can reproduce the problem usually within an hour (and which I can't use for anything productive at the moment anyway). So I'll gladly test out any crazy patches thrown at me. :)

ciao
Dariush
Comment 5 Victor Pablos Ceruelo 2009-12-18 08:49:33 UTC
I think that this bug is related to debian linux image bug #528362:
It is present from linux image 2.6.29 to 2.6.32.

Bug#528362: linux-image-2.6.29-2-686: r8169 hangs when your transmission speed is really high.

More info:
http://lists.debian.org/debian-kernel/2009/05/msg00251.html
http://osdir.com/ml/general/2009-10/msg04031.html
http://www.mail-archive.com/debian-kernel@lists.debian.org/msg50859.html
http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b423a272cb443897/a9e8129f124378c0?lnk=raot&fwc=1
http://www.linux-archive.org/debian-kernel/296795-bug-528362-linux-image-2-6-32-trunk-686-not-fixed-2-6-32-a.html

regards, 

Victor
Comment 6 Victor Pablos Ceruelo 2009-12-18 09:21:38 UTC
Here is the complete bug report in debian:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528362
Comment 7 Fabrício Nihues 2010-01-15 15:01:39 UTC
I got the same problems, and occurs exactoly as related in #4, 
It started with an Archlinux update from december 2009, it updated kernel to .32, nvidia to 190 and xorg, and a lot more.

I think it is related to nvidia 190+ drivers, if I remember correclty I was using 185 driver, but I cannot revert back without reverting xorg too, it is more complicated in arch.

I've installed ubuntu 9.10 and got same problems, on arch tried ,30, ,31, ,32 and ,33rc4 kernel, tried the realtek 8168 driver and the problem persist, tried acpi=off, pci=noapic

But, there is something new, I can reproduce really fast the crash, it ONLY occours here if I go play Warcraft 3 using wine (no matter the wine version), if I play battle.net the error takes 1seg to 20min, if I play Single Player takes ~10min max, I've used Dota Maps only, this is why I think is NV driver related too...

This is my crash:

RNING: at net/sched/sch_generic.c:246 dev_watchdog+0x1ec/0x200()
Hardware name: GeForce 8000 series
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Modules linked in: w83627ehf hwmon_vid nfs fscache ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ext2 snd_usb_audio snd_usb_lib snd_rawmidi snd_hda_codec_nvhdmi usbhid hid snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_idt nvidia(P) agpgart fan battery usb_storage ac wmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer cpufreq_powersave snd soundcore snd_page_alloc cpufreq_ondemand thermal button powernow_k8 freq_table sg i2c_nforce2 i2c_core processor psmouse shpchp pci_hotplug evdev ohci_hcd r8169 mii ehci_hcd serio_raw pcspkr vboxdrv autofs4 usblp usbcore fuse rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sr_mod cdrom sd_mod pata_acpi ata_generic ahci pata_amd libata scsi_mod
Pid: 0, comm: swapper Tainted: P 2.6.31-ARCH #1
Call Trace:
[<c10464da>] ? warn_slowpath_common+0x7a/0xc0
[<c12807fc>] ? dev_watchdog+0x1ec/0x200
[<c1046597>] ? warn_slowpath_fmt+0x37/0x60
[<c12807fc>] ? dev_watchdog+0x1ec/0x200
[<c106e5c6>] ? getnstimeofday+0x56/0x110
[<c1067821>] ? ktime_get_ts+0x31/0x80
[<c101e522>] ? lapic_next_event+0x22/0x40
[<c1280610>] ? dev_watchdog+0x0/0x200
[<c105381e>] ? run_timer_softirq+0x11e/0x210
[<c10731de>] ? tick_dev_program_event+0x4e/0xe0
[<c104db2f>] ? __do_softirq+0x9f/0x220
[<c1067a73>] ? hrtimer_interrupt+0x1a3/0x250
[<c104dd05>] ? do_softirq+0x55/0x60
[<c104df1d>] ? irq_exit+0x7d/0x90
[<c101f19f>] ? smp_apic_timer_interrupt+0x5f/0xb0
[<c1054543>] ? get_next_timer_interrupt+0x1d3/0x250
[<c10046d5>] ? apic_timer_interrupt+0x31/0x38
[<c1028da2>] ? native_safe_halt+0x2/0x10
[<c100c8fa>] ? default_idle+0x6a/0x150
[<c100cf13>] ? c1e_idle+0xa3/0x110
[<c1002aec>] ? cpu_idle+0x9c/0xf0
---[ end trace 85b7c75960f71918 ]---
r8169: eth0: link down
r8169: eth0: link down

lspci -v
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)

Another think I've noted, when the error occours, the (rev 02) change to random number like (rev 4f).
Comment 8 Richard Wall 2010-04-26 09:54:05 UTC
I'm seeing a similar problem. We have a pair of very busy Squid servers (peak ~15000 TCP connections, ~100Mb/sec) using identical Asus server motherboards. The networking on one box intermittently fails - sometimes after a few hours sometimes after as much as 7 days.

Here's the relevant info and an extract from syslog / dmesg

# uname -r
2.6.31.12

# cat /sys/class/net/eth0/device/vendor 
0x14e4
# cat /sys/class/net/eth0/device/device 
0x1659

{{{
http://www.pcidatabase.com/vendor_details.php?id=767
0x1659	
Chip Number:	BCM5721
Chip Description:	NetXtreme Gigabit Ethernet PCI Express
}}}

# ethtool -i eth0
driver: tg3
version: 3.99
firmware-version: 5721-v3.65
bus-info: 0000:03:00.0

#cat /var/log/syslog
Apr 22 23:49:35  kernel: [623449.988504] ------------[ cut here ]------------
Apr 22 23:49:35  kernel: [623449.988511] WARNING: at net/sched/sch_generic.c:246 dev_watchdog+0x1be/0x1d0()
Apr 22 23:49:35  kernel: [623449.988514] Hardware name: System Product Name
Apr 22 23:49:35  kernel: [623449.988516] NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Apr 22 23:49:35  kernel: [623449.988518] Modules linked in: ip_gre e1000 via_rhine tg3 libphy r8169 pcnet32 e100 8139too mii w83627ehf vt8231 via686a hwm
on_vid coretemp asus_atk0110 hwmon
Apr 22 23:49:35  kernel: [623449.988533] Pid: 0, comm: swapper Not tainted 2.6.31.12 #2
Apr 22 23:49:35  kernel: [623449.988535] Call Trace:
Apr 22 23:49:35  kernel: [623449.988540]  [<c012864e>] ? warn_slowpath_common+0x6e/0xb0
Apr 22 23:49:35  kernel: [623449.988543]  [<c034e4fe>] ? dev_watchdog+0x1be/0x1d0
Apr 22 23:49:35  kernel: [623449.988546]  [<c01286db>] ? warn_slowpath_fmt+0x2b/0x30
Apr 22 23:49:35  kernel: [623449.988549]  [<c034e4fe>] ? dev_watchdog+0x1be/0x1d0
Apr 22 23:49:35  kernel: [623449.988553]  [<c011f842>] ? __wake_up+0x42/0x60
Apr 22 23:49:35  kernel: [623449.988557]  [<c01376f2>] ? insert_work+0x42/0x50
Apr 22 23:49:35  kernel: [623449.988560]  [<c034e340>] ? dev_watchdog+0x0/0x1d0
Apr 22 23:49:35  kernel: [623449.988564]  [<c0131149>] ? run_timer_softirq+0xf9/0x1c0
Apr 22 23:49:35  kernel: [623449.988567]  [<c012d2c0>] ? __do_softirq+0x80/0x100
Apr 22 23:49:35  kernel: [623449.988570]  [<c012d36d>] ? do_softirq+0x2d/0x40
Apr 22 23:49:35  kernel: [623449.988574]  [<c0114d94>] ? smp_apic_timer_interrupt+0x54/0x90
Apr 22 23:49:35  kernel: [623449.988577]  [<c0103676>] ? apic_timer_interrupt+0x2a/0x30
Apr 22 23:49:35  kernel: [623449.988581]  [<c03f00d8>] ? klist_add_before+0x18/0x50
Apr 22 23:49:35  kernel: [623449.988585]  [<c0109dc2>] ? mwait_idle+0x42/0x60
Apr 22 23:49:35  kernel: [623449.988587]  [<c0101d55>] ? cpu_idle+0x35/0x60
Apr 22 23:49:35  kernel: [623449.988590] ---[ end trace 346a74434bf31555 ]---
Apr 22 23:49:35  kernel: [623449.988592] tg3: eth0: transmit timed out, resetting
Apr 22 23:49:35  kernel: [623449.988596] tg3: DEBUG: MAC_TX_STATUS[0000000f] MAC_RX_STATUS[00000008]
Apr 22 23:49:35  kernel: [623449.988601] tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
Apr 22 23:49:35  kernel: [623450.089694] tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Apr 22 23:49:35  kernel: [623450.248154] tg3: eth0: Link is down.
Comment 9 Alan 2012-06-18 12:52:42 UTC
Closing as obsolete, if this is still seen on the r8169 with a modern kernel please update the bug report