Bug 14709
Summary: | 8169 hangs box hard under heavy load | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jon Nelson (jnelson-kernel-bugzilla) |
Component: | Network | Assignee: | Francois Romieu (romieu) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | akpm, alan, bugs, dariush, fabricio.nihues, masp01, moneta.mace, richard, victorpablosceruelo |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.31.5 | Subsystem: | |
Regression: | Yes | Bisected commit-id: |
Description
Jon Nelson
2009-12-01 01:59:00 UTC
I'm observing the same issue with 2.6.32 (on a 10ec:8168 card). Only difference to the bug reporter's problem is that I don't see the above kernel warning before the kernel locks up. 1. Produce medium load on the NIC (3-4 MB/s over half an hour are usually enough) 2. After a while network connectivity breaks down (no pings come through) 3. about a minute later the whole system locks up (magic sysrq doesn't work either) ciao Dariush I can also confirm the impact but not the symptoms, which means that I don't get any warning messages prior to the freeze. After some minutes of traffic (in 2.6.32 it happens subjectively faster compared with 2.6.31) the systems hangs and doesn't react to any input. After waiting about 3 minutes the system sometimes recovers and I'm able to work until the next Ooops occurs. It's also worth mentioning that multiple connections (like bittorrent) lead to this bug more frequently than heavy traffic with only one connection (e.g. nfs). Here's how the dmesg looks after the freeze: ------------[ cut here ]------------ WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x1ec/0x200() Hardware name: GA-MA790XT-UD4P NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out Modules linked in: ipv6 isofs fan battery ac snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device nfs lockd fscache nfs_acl snd_pcm_oss ohci1394 auth_rpcgss snd_mixer_oss nvidia(P) ieee1394 snd_hda_codec_realtek sunrpc processor sg snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc joydev floppy usbhid hid evdev ati_agp agpgart i2c_piix4 i2c_core button thermal r8169 mii kvm_amd kvm loop rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sr_mod cdrom pata_jmicron sd_mod pata_atiixp pata_acpi ahci ata_generic ohci_hcd ehci_hcd libata scsi_mod Pid: 0, comm: swapper Tainted: P 2.6.32-pae-x52 #1 Call Trace: [<c103bdae>] ? warn_slowpath_common+0x6e/0xb0 [<c124599c>] ? dev_watchdog+0x1ec/0x200 [<c103be3b>] ? warn_slowpath_fmt+0x2b/0x30 [<c124599c>] ? dev_watchdog+0x1ec/0x200 [<c1050039>] ? sys_getpriority+0x69/0x1f0 [<c1018613>] ? lapic_next_event+0x13/0x20 [<c12457b0>] ? dev_watchdog+0x0/0x200 [<c104972d>] ? run_timer_softirq+0x12d/0x310 [<c1063f66>] ? tick_dev_program_event+0x46/0xd0 [<c104218f>] ? __do_softirq+0x8f/0x200 [<c104233d>] ? do_softirq+0x3d/0x40 [<c1042515>] ? irq_exit+0x65/0x70 [<c1018fe4>] ? smp_apic_timer_interrupt+0x54/0x90 [<c1004535>] ? apic_timer_interrupt+0x31/0x38 [<c1021262>] ? native_safe_halt+0x2/0x10 [<c100b41f>] ? default_idle+0x3f/0xb0 [<c100b8a8>] ? c1e_idle+0x88/0x110 [<c1002ae4>] ? cpu_idle+0x94/0xe0 ---[ end trace 40eab05dc9dc63c7 ]--- See Novell bug 557623 - https://bugzilla.novell.com/show_bug.cgi?id=557623 - wherein a specific git commit was tried and it appears to have helped. The commit, 05af2142d09845de2f4ae34181c72addd72d5ef9, which has date Sat Oct 24 06:47:33 2009 -0700, may be related. Thanks for the link. As this commit is included in 2.6.32, which I already use, sadly it doesn't seem to help in my case. (To be sure I also reverted it - the bug occurs with and without this commit.) The first symptom seems to be the break of network connectivity, after which the system usually remains responsive for several minutes and thus could be debugged. Is there any information (sysfs readouts, additional printk's, register dumps?) which could help find the cause of this problem? I have a dedicated system on which I can reproduce the problem usually within an hour (and which I can't use for anything productive at the moment anyway). So I'll gladly test out any crazy patches thrown at me. :) ciao Dariush I think that this bug is related to debian linux image bug #528362: It is present from linux image 2.6.29 to 2.6.32. Bug#528362: linux-image-2.6.29-2-686: r8169 hangs when your transmission speed is really high. More info: http://lists.debian.org/debian-kernel/2009/05/msg00251.html http://osdir.com/ml/general/2009-10/msg04031.html http://www.mail-archive.com/debian-kernel@lists.debian.org/msg50859.html http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b423a272cb443897/a9e8129f124378c0?lnk=raot&fwc=1 http://www.linux-archive.org/debian-kernel/296795-bug-528362-linux-image-2-6-32-trunk-686-not-fixed-2-6-32-a.html regards, Victor Here is the complete bug report in debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528362 I got the same problems, and occurs exactoly as related in #4, It started with an Archlinux update from december 2009, it updated kernel to .32, nvidia to 190 and xorg, and a lot more. I think it is related to nvidia 190+ drivers, if I remember correclty I was using 185 driver, but I cannot revert back without reverting xorg too, it is more complicated in arch. I've installed ubuntu 9.10 and got same problems, on arch tried ,30, ,31, ,32 and ,33rc4 kernel, tried the realtek 8168 driver and the problem persist, tried acpi=off, pci=noapic But, there is something new, I can reproduce really fast the crash, it ONLY occours here if I go play Warcraft 3 using wine (no matter the wine version), if I play battle.net the error takes 1seg to 20min, if I play Single Player takes ~10min max, I've used Dota Maps only, this is why I think is NV driver related too... This is my crash: RNING: at net/sched/sch_generic.c:246 dev_watchdog+0x1ec/0x200() Hardware name: GeForce 8000 series NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out Modules linked in: w83627ehf hwmon_vid nfs fscache ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ext2 snd_usb_audio snd_usb_lib snd_rawmidi snd_hda_codec_nvhdmi usbhid hid snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_idt nvidia(P) agpgart fan battery usb_storage ac wmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer cpufreq_powersave snd soundcore snd_page_alloc cpufreq_ondemand thermal button powernow_k8 freq_table sg i2c_nforce2 i2c_core processor psmouse shpchp pci_hotplug evdev ohci_hcd r8169 mii ehci_hcd serio_raw pcspkr vboxdrv autofs4 usblp usbcore fuse rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sr_mod cdrom sd_mod pata_acpi ata_generic ahci pata_amd libata scsi_mod Pid: 0, comm: swapper Tainted: P 2.6.31-ARCH #1 Call Trace: [<c10464da>] ? warn_slowpath_common+0x7a/0xc0 [<c12807fc>] ? dev_watchdog+0x1ec/0x200 [<c1046597>] ? warn_slowpath_fmt+0x37/0x60 [<c12807fc>] ? dev_watchdog+0x1ec/0x200 [<c106e5c6>] ? getnstimeofday+0x56/0x110 [<c1067821>] ? ktime_get_ts+0x31/0x80 [<c101e522>] ? lapic_next_event+0x22/0x40 [<c1280610>] ? dev_watchdog+0x0/0x200 [<c105381e>] ? run_timer_softirq+0x11e/0x210 [<c10731de>] ? tick_dev_program_event+0x4e/0xe0 [<c104db2f>] ? __do_softirq+0x9f/0x220 [<c1067a73>] ? hrtimer_interrupt+0x1a3/0x250 [<c104dd05>] ? do_softirq+0x55/0x60 [<c104df1d>] ? irq_exit+0x7d/0x90 [<c101f19f>] ? smp_apic_timer_interrupt+0x5f/0xb0 [<c1054543>] ? get_next_timer_interrupt+0x1d3/0x250 [<c10046d5>] ? apic_timer_interrupt+0x31/0x38 [<c1028da2>] ? native_safe_halt+0x2/0x10 [<c100c8fa>] ? default_idle+0x6a/0x150 [<c100cf13>] ? c1e_idle+0xa3/0x110 [<c1002aec>] ? cpu_idle+0x9c/0xf0 ---[ end trace 85b7c75960f71918 ]--- r8169: eth0: link down r8169: eth0: link down lspci -v 06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02) Another think I've noted, when the error occours, the (rev 02) change to random number like (rev 4f). I'm seeing a similar problem. We have a pair of very busy Squid servers (peak ~15000 TCP connections, ~100Mb/sec) using identical Asus server motherboards. The networking on one box intermittently fails - sometimes after a few hours sometimes after as much as 7 days. Here's the relevant info and an extract from syslog / dmesg # uname -r 2.6.31.12 # cat /sys/class/net/eth0/device/vendor 0x14e4 # cat /sys/class/net/eth0/device/device 0x1659 {{{ http://www.pcidatabase.com/vendor_details.php?id=767 0x1659 Chip Number: BCM5721 Chip Description: NetXtreme Gigabit Ethernet PCI Express }}} # ethtool -i eth0 driver: tg3 version: 3.99 firmware-version: 5721-v3.65 bus-info: 0000:03:00.0 #cat /var/log/syslog Apr 22 23:49:35 kernel: [623449.988504] ------------[ cut here ]------------ Apr 22 23:49:35 kernel: [623449.988511] WARNING: at net/sched/sch_generic.c:246 dev_watchdog+0x1be/0x1d0() Apr 22 23:49:35 kernel: [623449.988514] Hardware name: System Product Name Apr 22 23:49:35 kernel: [623449.988516] NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out Apr 22 23:49:35 kernel: [623449.988518] Modules linked in: ip_gre e1000 via_rhine tg3 libphy r8169 pcnet32 e100 8139too mii w83627ehf vt8231 via686a hwm on_vid coretemp asus_atk0110 hwmon Apr 22 23:49:35 kernel: [623449.988533] Pid: 0, comm: swapper Not tainted 2.6.31.12 #2 Apr 22 23:49:35 kernel: [623449.988535] Call Trace: Apr 22 23:49:35 kernel: [623449.988540] [<c012864e>] ? warn_slowpath_common+0x6e/0xb0 Apr 22 23:49:35 kernel: [623449.988543] [<c034e4fe>] ? dev_watchdog+0x1be/0x1d0 Apr 22 23:49:35 kernel: [623449.988546] [<c01286db>] ? warn_slowpath_fmt+0x2b/0x30 Apr 22 23:49:35 kernel: [623449.988549] [<c034e4fe>] ? dev_watchdog+0x1be/0x1d0 Apr 22 23:49:35 kernel: [623449.988553] [<c011f842>] ? __wake_up+0x42/0x60 Apr 22 23:49:35 kernel: [623449.988557] [<c01376f2>] ? insert_work+0x42/0x50 Apr 22 23:49:35 kernel: [623449.988560] [<c034e340>] ? dev_watchdog+0x0/0x1d0 Apr 22 23:49:35 kernel: [623449.988564] [<c0131149>] ? run_timer_softirq+0xf9/0x1c0 Apr 22 23:49:35 kernel: [623449.988567] [<c012d2c0>] ? __do_softirq+0x80/0x100 Apr 22 23:49:35 kernel: [623449.988570] [<c012d36d>] ? do_softirq+0x2d/0x40 Apr 22 23:49:35 kernel: [623449.988574] [<c0114d94>] ? smp_apic_timer_interrupt+0x54/0x90 Apr 22 23:49:35 kernel: [623449.988577] [<c0103676>] ? apic_timer_interrupt+0x2a/0x30 Apr 22 23:49:35 kernel: [623449.988581] [<c03f00d8>] ? klist_add_before+0x18/0x50 Apr 22 23:49:35 kernel: [623449.988585] [<c0109dc2>] ? mwait_idle+0x42/0x60 Apr 22 23:49:35 kernel: [623449.988587] [<c0101d55>] ? cpu_idle+0x35/0x60 Apr 22 23:49:35 kernel: [623449.988590] ---[ end trace 346a74434bf31555 ]--- Apr 22 23:49:35 kernel: [623449.988592] tg3: eth0: transmit timed out, resetting Apr 22 23:49:35 kernel: [623449.988596] tg3: DEBUG: MAC_TX_STATUS[0000000f] MAC_RX_STATUS[00000008] Apr 22 23:49:35 kernel: [623449.988601] tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000] Apr 22 23:49:35 kernel: [623450.089694] tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2 Apr 22 23:49:35 kernel: [623450.248154] tg3: eth0: Link is down. Closing as obsolete, if this is still seen on the r8169 with a modern kernel please update the bug report |