Bug 12500

Summary: r8169: NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Product: Drivers Reporter: Rafael J. Wysocki (rjw)
Component: Network Assignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX    
Severity: normal CC: Arsen.Shnurkov, dsd, eike-kernel, frol, huwald, janjoris, kernel, kernel, kernel, ldorileo, mark, mpagano, paradyse, pilo, reflexeos, rm+bko, romieu, soltys, untrusted1
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11808    

Description Rafael J. Wysocki 2009-01-19 14:26:03 UTC
Subject    : 2.6.28: r8169: NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Submitter  : Justin Piszcz <jpiszcz@lucidpixels.com>
Date       : 2009-01-13 21:19
References : http://marc.info/?l=linux-kernel&m=123188160811322&w=4
Handled-By : Francois Romieu <romieu@fr.zoreil.com>

This entry is being used for tracking a regression from 2.6.27.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rolf Eike Beer 2009-01-24 05:26:52 UTC
probably dupe of #10109
Comment 2 Michal Soltys 2009-03-08 03:10:09 UTC
I can confirm this bug as well (2.6.28.6) - it happens every couple of days, whenever I put some higher load on the NIC.

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 316
        Region 0: I/O ports at de00 [size=256]
        Region 2: Memory at fdfff000 (64-bit, prefetchable) [size=4K]
        Region 4: Memory at fdfe0000 (64-bit, prefetchable) [size=64K]
        [virtual] Expansion ROM at fdf00000 [disabled] [size=64K]
        Capabilities: <access denied>
        Kernel driver in use: r8169
        Kernel modules: r8169

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf1/0x171()
NETDEV WATCHDOG: ethi (r8169): transmit timed out
Modules linked in: pppoatm ppp_generic slhc speedtch usbatm atm ohci_hcd uhci_hcd cpufreq_ondemand powernow_k8 xt_CLASSIFY xt_connmark xt_CONNMARK iptable_raw sch_sfq sch_hfsc sr_mod cdrom ata_generic ehci_hcd pata_atiixp i2c_piix4 pcspkr usbcore k8temp 3c59x i2c_core r8169 sg ati_agp evdev thermal processor fan button battery ac xt_iprange ipt_REJECT iptable_filter ipt_LOG xt_tcpudp xt_MARK iptable_mangle ipt_MASQUERADE xt_conntrack xt_multiport iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ip_tables x_tables fbcon tileblit font bitblit softcursor fb rtc_cmos rtc_core rtc_lib sd_mod ahci raid1 ext4 mbcache jbd2 md_mod dm_mod pata_pdc2027x libata scsi_mod [last unloaded: speedtch]
Pid: 0, comm: swapper Not tainted 2.6.28.6-HVQ1 #1
Call Trace:
 [<c012346e>] warn_slowpath+0x5a/0x79
 [<c02769cd>] tcp_current_mss+0x5b/0xd3
 [<c02758c3>] tcp_rcv_established+0x5f6/0x794
 [<c025ea5c>] nf_iterate+0x30/0x61
 [<c027b38c>] tcp_v4_do_rcv+0x22/0x174
 [<c01d59d8>] strlcpy+0x11/0x3d
 [<c0256691>] dev_watchdog+0xf1/0x171
 [<c01345ba>] hrtimer_forward+0x10c/0x124
 [<c013804d>] getnstimeofday+0x4a/0xcd
 [<c010ee69>] lapic_next_event+0x10/0x13
 [<c02565a0>] dev_watchdog+0x0/0x171
 [<c012a3e1>] run_timer_softirq+0x138/0x18f
 [<c02565a0>] dev_watchdog+0x0/0x171
 [<c0127357>] __do_softirq+0x83/0x11e
 [<c0127424>] do_softirq+0x32/0x36
 [<c0127522>] irq_exit+0x35/0x69
 [<c010f4e1>] smp_apic_timer_interrupt+0x6e/0x78
 [<c0104440>] apic_timer_interrupt+0x28/0x30
 [<c01083a0>] default_idle+0x2a/0x3d
 [<c01084ff>] c1e_idle+0xc4/0xc7
 [<c0102983>] cpu_idle+0x68/0x81
---[ end trace 6271a577780eb771 ]---
r8169: ethi: link up
Comment 3 Leandro Dorileo 2009-04-01 22:06:49 UTC
I can confirm the bug, under the same conditions as described by Michal Soltys.

# lspci -vvv -s 05:00.0

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 01)
	Subsystem: Toshiba America Info Systems Device ff00
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 508
	Region 0: I/O ports at 4000 [size=256]
	Region 2: Memory at da000000 (64-bit, non-prefetchable) [size=4K]
	[virtual] Expansion ROM at d4000000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
		Not readable
	Capabilities: [50] MSI: Mask- 64bit+ Count=1/2 Enable+
		Address: 00000000fee0300c  Data: 4181
	Capabilities: [60] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [84] Vendor Specific Information <?>
	Capabilities: [100] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [12c] Virtual Channel <?>
	Capabilities: [148] Device Serial Number 36-81-ec-10-00-00-10-01
	Capabilities: [154] Power Budgeting <?>
	Kernel driver in use: r8169

Syslog -----------------------------------------------------------------------
Mar 23 09:52:38 lpt kernel: [ 4277.000064] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf6/0x17c()
Mar 23 09:52:38 lpt kernel: [ 4277.000067] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Mar 23 09:52:38 lpt kernel: [ 4277.000070] Modules linked in: i915 drm binfmt_misc parport_pc parport ipv6 ext2 firewire_sbp2 loop hid_dell hid_pl hid_cypress hid_zpff hid_gyration hid_bright hid_sony hid_samsung hid_microsoft hid_tmff hid_monterey hid_ezkey hid_apple hid_a4tech hid_logitech usbhid ff_memless hid_cherry hid_sunplus hid_petalynx hid_belkin hid_chicony hid arc4 ecb joydev ath5k pcmcia snd_hda_intel sdhci_pci sdhci mac80211 snd_pcm_oss snd_mixer_oss sg tifm_7xx1 led_class yenta_socket rsrc_nonstatic rng_core snd_pcm uhci_hcd ehci_hcd i2c_i801 r8169 mmc_core tifm_core firewire_ohci firewire_core pcmcia_core psmouse iTCO_wdt snd_timer snd soundcore snd_page_alloc rfkill usbcore i2c_core intel_agp agpgart sr_mod mii cfg80211 crc_itu_t pcspkr serio_raw cdrom input_polldev video evdev battery container ac button output ext3 jbd mbcache sd_mod crc_t10dif thermal processor fan thermal_sys ide_pci_generic ide_core ata_generic ata_piix libata scsi_mod
Mar 23 09:52:38 lpt kernel: [ 4277.000172] Pid: 0, comm: swapper Not tainted 2.6.28-1-686 #1
Mar 23 09:52:38 lpt kernel: [ 4277.000175] Call Trace:
Mar 23 09:52:38 lpt kernel: [ 4277.000182]  [<c0126d7a>] warn_slowpath+0x5a/0x79
Mar 23 09:52:38 lpt kernel: [ 4277.000187]  [<c011ba68>] place_entity+0x63/0x92
Mar 23 09:52:38 lpt kernel: [ 4277.000191]  [<c011de3f>] enqueue_entity+0x6b/0x112
Mar 23 09:52:38 lpt kernel: [ 4277.000194]  [<c011e161>] enqueue_task_fair+0x19/0x51
Mar 23 09:52:38 lpt kernel: [ 4277.000198]  [<c011beed>] enqueue_task+0x52/0x5d
Mar 23 09:52:38 lpt kernel: [ 4277.000202]  [<c011bfeb>] activate_task+0x16/0x1b
Mar 23 09:52:38 lpt kernel: [ 4277.000205]  [<c0121f16>] try_to_wake_up+0x168/0x172
Mar 23 09:52:38 lpt kernel: [ 4277.000210]  [<c01fc234>] strlcpy+0x11/0x3d
Mar 23 09:52:38 lpt kernel: [ 4277.000213]  [<c028dbb6>] dev_watchdog+0xf6/0x17c
Mar 23 09:52:38 lpt kernel: [ 4277.000216]  [<c011d6e4>] __wake_up+0x29/0x39
Mar 23 09:52:38 lpt kernel: [ 4277.000220]  [<c01346d7>] __queue_work+0x4d/0x5a
Mar 23 09:52:38 lpt kernel: [ 4277.000223]  [<c028dac0>] dev_watchdog+0x0/0x17c
Mar 23 09:52:38 lpt kernel: [ 4277.000228]  [<c012e4f8>] run_timer_softirq+0x14a/0x1b4
Mar 23 09:52:38 lpt kernel: [ 4277.000231]  [<c028dac0>] dev_watchdog+0x0/0x17c
Mar 23 09:52:38 lpt kernel: [ 4277.000247]  [<c012b2d3>] __do_softirq+0x8c/0x130
Mar 23 09:52:38 lpt kernel: [ 4277.000250]  [<c012b3bc>] do_softirq+0x45/0x53
Mar 23 09:52:38 lpt kernel: [ 4277.000253]  [<c012b4c4>] irq_exit+0x35/0x69
Mar 23 09:52:38 lpt kernel: [ 4277.000258]  [<c0105bc0>] do_IRQ+0x6c/0x7c
Mar 23 09:52:38 lpt kernel: [ 4277.000261]  [<c0104507>] common_interrupt+0x23/0x28
Mar 23 09:52:38 lpt kernel: [ 4277.000281]  [<f818e147>] acpi_idle_enter_bm+0x31c/0x3a5 [processor]
Mar 23 09:52:38 lpt kernel: [ 4277.000285]  [<c026ee11>] menu_select+0x37/0x96
Mar 23 09:52:38 lpt kernel: [ 4277.000289]  [<c026e48f>] cpuidle_idle_call+0x5d/0x8e
Mar 23 09:52:38 lpt kernel: [ 4277.000292]  [<c0102a37>] cpu_idle+0x71/0x8a
Mar 23 09:52:38 lpt kernel: [ 4277.000295] ---[ end trace 3d198b186ab92aff ]---
Mar 23 09:52:38 lpt kernel: [ 4277.016115] r8169: eth0: link up
Mar 23 09:53:20 lpt kernel: [ 4319.016107] r8169: eth0: link up
Mar 23 09:53:45 lpt kernel: [ 4343.858427] r8169: eth0: link up
Mar 23 09:53:55 lpt kernel: [ 4354.084070] eth0: no IPv6 routers present
Comment 4 reflexeos 2009-04-03 05:14:22 UTC
I'm SADLY part of the crew. Is there a way out of this?

Debian Squeeze (2.6.26-1-686 and 2.6.28-1-686)
I read somewhere that using r8101 may solve the problem but it didn't.
(from 2.6.26-1-686)


05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 01)
	Subsystem: Toshiba America Info Systems Device ff00
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	Region 0: I/O ports at 4000 [size=256]
	Region 2: Memory at da000000 (64-bit, non-prefetchable) [size=4K]
	[virtual] Expansion ROM at d4000000 [disabled] [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: r8101
	Kernel modules: r8169




(from 2.6.28-1-686 with r8169):

------------[ cut here ]------------
[ 3515.989047] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf6/0x17c()
[ 3515.989050] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[ 3515.989052] Modules linked in: i915 drm rfcomm l2cap bluetooth ppdev parport_pc lp parport ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave firewire_sbp2 loop arc4 ecb snd_hda_intel iwl3945 snd_pcm snd_seq snd_timer snd_seq_device joydev mac80211 snd led_class tifm_7xx1 i2c_i801 soundcore rng_core serio_raw tifm_core rfkill rsrc_nonstatic pcmcia_core i2c_core snd_page_alloc intel_agp agpgart pcspkr cfg80211 iTCO_wdt psmouse evdev input_polldev battery video output container button ac ext3 jbd mbcache sg usbhid hid sr_mod cdrom sd_mod crc_t10dif ide_pci_generic ide_core ata_generic ata_piix sdhci_pci sdhci firewire_ohci mmc_core libata firewire_core crc_itu_t scsi_mod ehci_hcd uhci_hcd usbcore r8169 mii thermal processor fan thermal_sys
[ 3515.989138] Pid: 0, comm: swapper Not tainted 2.6.28-1-686 #1
[ 3515.989141] Call Trace:
[ 3515.989149]  [<c0126d7a>] warn_slowpath+0x5a/0x79
[ 3515.989154]  [<c029f33f>] ip_finish_output+0x1c4/0x1fb
[ 3515.989160]  [<c01f860c>] __next_cpu+0x12/0x21
[ 3515.989164]  [<c01f860c>] __next_cpu+0x12/0x21
[ 3515.989167]  [<c011f8eb>] find_busiest_group+0x307/0x78f
[ 3515.989170]  [<c011f8eb>] find_busiest_group+0x307/0x78f
[ 3515.989176]  [<c01fc234>] strlcpy+0x11/0x3d
[ 3515.989179]  [<c028dbb6>] dev_watchdog+0xf6/0x17c
[ 3515.989185]  [<c013aef7>] sched_clock_tick+0x95/0x9e
[ 3515.989188]  [<c0138dea>] hrtimer_forward+0x10c/0x124
[ 3515.989192]  [<c013cadf>] getnstimeofday+0x4f/0xd1
[ 3515.989197]  [<c0110245>] lapic_next_event+0x10/0x13
[ 3515.989200]  [<c028dac0>] dev_watchdog+0x0/0x17c
[ 3515.989206]  [<c012e4f8>] run_timer_softirq+0x14a/0x1b4
[ 3515.989209]  [<c028dac0>] dev_watchdog+0x0/0x17c
[ 3515.989213]  [<c012b2d3>] __do_softirq+0x8c/0x130
[ 3515.989216]  [<c012b3bc>] do_softirq+0x45/0x53
[ 3515.989219]  [<c012b4c4>] irq_exit+0x35/0x69
[ 3515.989223]  [<c01109ad>] smp_apic_timer_interrupt+0x6e/0x78
[ 3515.989227]  [<c0104620>] apic_timer_interrupt+0x28/0x30
[ 3515.989256]  [<f8075147>] acpi_idle_enter_bm+0x31c/0x3a5 [processor]
[ 3515.989270]  [<f807539d>] acpi_idle_enter_simple+0x1cd/0x224 [processor]
[ 3515.989276]  [<c026ee11>] menu_select+0x37/0x96
[ 3515.989279]  [<c026e48f>] cpuidle_idle_call+0x5d/0x8e
[ 3515.989283]  [<c0102a37>] cpu_idle+0x71/0x8a
[ 3515.989286] ---[ end trace 7b6311bc953d584e ]---
[ 3516.005184] r8169: eth0: link up
Comment 5 kernel 2009-04-06 18:23:57 UTC
I have reproduced this bug in 2.6.29.1 FWIW. It takes quite a bit to reproduce this, but I am able to do so within a few hours. I have 10 machines running this kernel; running scp in a round robin style will cause this timeout on a random machine about every 2-3 hours. As of right now, the stable kernel that I can run and not have this problem is 2.6.26.3. The only other kernel I have tried is 2.6.28.9, and that produced the bug as well.

Apr  6 00:26:14 svc52 kernel: ------------[ cut here ]------------
Apr  6 00:26:14 svc52 kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x11f/0x1b0()
Apr  6 00:26:14 svc52 kernel: Hardware name: TF720 A2+
Apr  6 00:26:14 svc52 kernel: NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Apr  6 00:26:14 svc52 kernel: Modules linked in: forcedeth
Apr  6 00:26:14 svc52 kernel: Pid: 0, comm: swapper Not tainted 2.6.29.1 #2
Apr  6 00:26:14 svc52 kernel: Call Trace:
Apr  6 00:26:14 svc52 kernel:  <IRQ>  [<ffffffff8023428f>] warn_slowpath+0xd8/0xf5
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8022ce88>] default_wake_function+0x0/0x9
Apr  6 00:26:14 svc52 kernel:  [<ffffffff804b4541>] __qdisc_run+0x103/0x1d7
Apr  6 00:26:14 svc52 kernel:  [<ffffffff80360500>] cpumask_next_and+0x2a/0x3e
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8022b598>] find_busiest_group+0x27f/0x7b2
Apr  6 00:26:14 svc52 kernel:  [<ffffffff804b495b>] dev_watchdog+0x11f/0x1b0
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8024ab69>] getnstimeofday+0x57/0xb7
Apr  6 00:26:14 svc52 kernel:  [<ffffffff80247902>] ktime_get_ts+0x22/0x4b
Apr  6 00:26:14 svc52 kernel:  [<ffffffff804b483c>] dev_watchdog+0x0/0x1b0
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8023c192>] run_timer_softirq+0x12c/0x193
Apr  6 00:26:14 svc52 kernel:  [<ffffffff802387ec>] __do_softirq+0x7a/0x13d
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8020c23c>] call_softirq+0x1c/0x28
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8020d0f8>] do_softirq+0x2c/0x6c
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8021ac5c>] smp_apic_timer_interrupt+0x93/0xab
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8020bc73>] apic_timer_interrupt+0x13/0x20
Apr  6 00:26:14 svc52 kernel:  <EOI>  [<ffffffff80210c40>] default_idle+0x27/0x3b
Apr  6 00:26:14 svc52 kernel:  [<ffffffff80210e43>] c1e_idle+0xe5/0xe9
Apr  6 00:26:14 svc52 kernel:  [<ffffffff8020a047>] cpu_idle+0x40/0x5e
Apr  6 00:26:14 svc52 kernel: ---[ end trace f604628d7fa5821b ]---
Apr  6 00:26:14 svc52 kernel: r8169: eth0: link up

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: Biostar Microtech Int'l Corp Device 2307
        Flags: bus master, fast devsel, latency 0, IRQ 1275
        I/O ports at e800 [size=256]
        Memory at febff000 (64-bit, non-prefetchable) [size=4K]
        Memory at fbff0000 (64-bit, prefetchable) [size=64K]
        Expansion ROM at febc0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [70] Express Endpoint, MSI 01
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=2
        Capabilities: [d0] Vital Product Data <?>
        Kernel driver in use: r8169
Comment 6 Christoph Nelles 2009-04-12 10:41:20 UTC
Same here. Put a bit of load on the card and it breaks. With the old realtek driver it ran without problems. I have two r8169 in the computer, the first onboard, the second one as an add-on card.

Kernel 2.6.29.1:
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1fa/0x210()
Hardware name: GA-MA78G-DS3H
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: cpufreq_stats snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 xt_state nf_nat_ftp nf_conntrack_ftp genrtc fuse i2c_piix4 ohci1394 st ieee1394 i2c_core psmouse evdev sg
Pid: 0, comm: swapper Not tainted 2.6.29.1 #3
Call Trace:
 <IRQ>  [<ffffffff80238fe2>] warn_slowpath+0xf2/0x130
 [<ffffffff80431030>] ahci_qc_prep+0x50/0x130
 [<ffffffff8041ca87>] ata_qc_issue+0x127/0x2b0
 [<ffffffff8036d8a0>] sg_init_table+0x20/0x50
 [<ffffffff803f4440>] scsi_done+0x0/0x10
 [<ffffffff80363183>] cpumask_next_and+0x23/0x40
 [<ffffffff8022f72c>] find_busiest_group+0x20c/0x8a0
 [<ffffffff80234e26>] tg_shares_up+0xe6/0x1e0
 [<ffffffff8022a7cb>] enqueue_task+0xb/0x20
 [<ffffffff8022a85a>] activate_task+0x1a/0x30
 [<ffffffff80234e26>] tg_shares_up+0xe6/0x1e0
 [<ffffffff80253fa9>] getnstimeofday+0x49/0xe0
 [<ffffffff8036893e>] strlcpy+0x4e/0x80
 [<ffffffff8051371a>] dev_watchdog+0x1fa/0x210
 [<ffffffff80251020>] ktime_get_ts+0x20/0x60
 [<ffffffff802426aa>] run_timer_softirq+0x1ba/0x220
 [<ffffffff8023e3bb>] __do_softirq+0x8b/0x150
 [<ffffffff80221672>] hpet_interrupt_handler+0x12/0x40
 [<ffffffff8020c6fc>] call_softirq+0x1c/0x30
 [<ffffffff8020e015>] do_softirq+0x35/0x80
 [<ffffffff8023e325>] irq_exit+0x85/0x90
 [<ffffffff8020e243>] do_IRQ+0x83/0x110
 [<ffffffff8020bfd3>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80212ca7>] default_idle+0x27/0x40
 [<ffffffff80212eca>] c1e_idle+0xba/0x100
 [<ffffffff8020a316>] cpu_idle+0x46/0x90
---[ end trace 4c2c60d95eca3dcf ]---
r8169: eth0: link up

Boot up messages:
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:03:00.0: setting latency timer to 64
r8169 0000:03:00.0: irq 28 for MSI/MSI-X
eth0: RTL8168b/8111b at 0xffffc20000026000, 00:e0:4c:68:0d:9b, XID 38000000 IRQ 28
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:04:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
r8169 0000:04:00.0: setting latency timer to 64
r8169 0000:04:00.0: irq 29 for MSI/MSI-X
eth1: RTL8168c/8111c at 0xffffc2000002a000, 00:1f:d0:54:e7:bb, XID 3c4000c0 IRQ 29

Device listing:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 28
        Region 0: I/O ports at ee00 [size=256]
        Region 2: Memory at fdeff000 (64-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at fdd00000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data <?>
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
                Address: 00000000fee0300c  Data: 4181
        Capabilities: [60] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
                        ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited
                        ClockPM- Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [84] Vendor Specific Information <?>
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [12c] Virtual Channel <?>
        Capabilities: [148] Device Serial Number 68-81-ec-10-00-00-0d-b4
        Capabilities: [154] Power Budgeting <?>
        Kernel driver in use: r8169

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: Giga-byte Technology Unknown device e000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 29
        Region 0: I/O ports at de00 [size=256]
        Region 2: Memory at fdbff000 (64-bit, prefetchable) [size=4K]
        Region 4: Memory at fdbe0000 (64-bit, prefetchable) [size=64K]
        [virtual] Expansion ROM at fdb00000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
                Address: 00000000fee0300c  Data: 4189
        Capabilities: [70] Express (v1) Endpoint, MSI 01
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <8us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM+ Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=2
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data <?>
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [140] Virtual Channel <?>
        Capabilities: [160] Device Serial Number 78-56-34-12-78-56-34-12
        Kernel driver in use: r8169
Comment 7 Jan Huwald 2009-05-02 10:57:43 UTC
Can confirm it on 2.6.29.1 and 2.6.29.2. I need approx. 20s to reproduce it using the 'iperf' benchmarking tool and two cross-connected, r8169-driven nodes.

It should be mentioned that I get this bug only if the bandwidth utilization is close to 1 Gb/s. In a rate-limited environment (e.g. ~ 450 Mb/s during tcpdump on one host) I never got the bug (tested ~ 3 days on 4 nodes).

None of the workarounds suggested elsewhere using ethtool (disabling autoneg, setting the rate by hand, ...) worked.
Comment 8 Daniel Drake 2009-05-11 22:59:36 UTC
Another report:
https://bugs.gentoo.org/show_bug.cgi?id=266761

Given that some people can reproduce this easily, and it is a regression, would anyone be up for doing a git bisect? It will identify the exact commit which introduced this bug.
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
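
For anyone attempting the bisection, the good/bad check for each build could be automated by scanning the kernel log for the watchdog message. The following is only a rough sketch under stated assumptions: the function names `has_tx_timeout` and `bisect_exit_code` and the sample log lines are hypothetical, and the pattern simply matches the "NETDEV WATCHDOG: <iface> (r8169): transmit timed out" line quoted in the reports above. The exit codes follow the `git bisect run` convention (0 for good, 1 for bad).

```python
import re

# Matches the watchdog line quoted in this report, e.g.
# "NETDEV WATCHDOG: eth0 (r8169): transmit timed out".
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit timed out")


def has_tx_timeout(log_text, driver="r8169"):
    """Return the interface names that hit a transmit timeout for `driver`."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match and match.group(2) == driver:
            hits.append(match.group(1))
    return hits


def bisect_exit_code(log_text):
    """Map a kernel log to a `git bisect run` exit code: 0 = good, 1 = bad."""
    return 1 if has_tx_timeout(log_text) else 0


if __name__ == "__main__":
    # Hypothetical sample input; in practice this would be the dmesg output
    # captured after the load test on the build under test.
    sample = "\n".join([
        "[ 4277.000067] NETDEV WATCHDOG: eth0 (r8169): transmit timed out",
        "[ 4277.016115] r8169: eth0: link up",
    ])
    print(has_tx_timeout(sample))
    print(bisect_exit_code(sample))
```

A check like this only reports what is in the log at the time it runs, so the load test has to run long enough for the timeout to show up before a build is marked good.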
Comment 9 Jan Huwald 2009-05-12 07:20:50 UTC
I will do a bisection between 2.6.25.4 (works for me) and 2.6.29.2. But given the wide range of patches, note that bisection will find neither multiple nor hierarchical regressions. I expect to finish on Friday.
Comment 10 Jan Huwald 2009-05-13 07:46:52 UTC
Bad news. Bisecting between 2.6.25 and 2.6.29 delivered the following potential first bad commit:

# git bisect bad
1d8cca44b6a244b7e378546d719041819049a0f9 is first bad commit
commit 1d8cca44b6a244b7e378546d719041819049a0f9
Author: Harvey Harrison <harvey.harrison@gmail.com>
Date:   Sat Oct 18 20:28:37 2008 -0700

    byteorder: provide swabb.h generically in asm/byteorder.h

    This is needed during the transition to the new byteorder headers as the
    swabb.h functionality will be provided from asm/byteorder.h in the new
    version.  To avoid breakage on arches still using the old implementation,
    provide swabb.h from asm/byteorder.h as well.

    Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 e548e9798e03daeee187cf0012ba7c16712d7261 b747283fd042fd18b1fd6275de9c99a1f904f730 M      include
#

Well, that doesn't seem to be what we've been looking for (although one should still investigate it). I have to mention that the frequency of the bug occurrence varied across the different builds (between 0.2 Hz and 0.01 Hz). I tested whether a build was good using 50 iterations of iperf (which does a 10s test at 1 Gb/s if the bug does not occur). Perhaps in a build marked as good the frequency was too low to be discovered.

I'll stand by until it is clear that a bisection with longer tests is necessary.
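
To put a rough number on how likely a 50-iteration test is to miss the bug, the timeouts can be modelled as a Poisson process. This is an assumption made for illustration only; nothing in this report establishes that the timeouts are independent or occur at a constant rate. The function name `miss_probability` is hypothetical, and the 0.001 Hz rate is an invented example that was not observed here.

```python
import math


def miss_probability(rate_hz, iterations, seconds_per_iteration):
    """Probability of seeing zero timeouts, assuming a Poisson process."""
    total_seconds = iterations * seconds_per_iteration
    expected_events = rate_hz * total_seconds
    return math.exp(-expected_events)


if __name__ == "__main__":
    # 50 iperf iterations of 10 s each, as described above.
    # 0.2 and 0.01 Hz are the observed extremes; 0.001 Hz is hypothetical.
    for rate in (0.2, 0.01, 0.001):
        p = miss_probability(rate, iterations=50, seconds_per_iteration=10)
        print(f"rate {rate} Hz -> P(no timeout in 500 s) = {p:.4f}")
```

Under that assumption, a true rate of 0.01 Hz gives an expected 5 timeouts in 500 s and a miss probability of exp(-5), roughly 0.7%, so a build wrongly marked good would need a rate well below the observed range. At the hypothetical 0.001 Hz the miss probability is exp(-0.5), roughly 61%, which is the regime where longer tests would matter.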
Comment 11 Erik Karlsson 2009-05-17 15:16:18 UTC
I have an RTL8169sb/8110sb PCI card which breaks during high TCP traffic, but rarely during UDP. Every kernel from 2.6.22 to 2.6.30-rc5 is affected; compiling 2.6.20 now.

I use one outgoing iperf and three incoming, from different hosts, on the affected computer which causes the NIC to exhibit symptoms within seconds.

Curiously, I have an identical card in one of the "attacking" machines which does not exhibit any symptoms, so perhaps my case is not related.
Comment 12 Jan Joris Vereijken 2009-05-18 04:55:52 UTC
I have the same issue on an Intel D945GCLF2 MB with embedded NIC. It is very reproducible; when I put full load on it, it occurs about once a minute.

Observations:

1) Issue only occurs for me on outgoing traffic. For incoming traffic, the full gigabit load can be on for 10 hours, and everything works (tested).

2) Issue is very reproducible for me under Ubuntu 09.04 amd64 (2.6.28) but not under Debian Lenny amd64 (2.6.26).

This box was intended as central backup server, but the bug renders it unable to enter production: I can make backups fine (e.g. store 100GB in a big burst over Samba), but cannot possibly restore (i.e. pulling back 100GB always crashes in the first couple GBs).

Thanks,

- Jan Joris -

From dmesg:

[    3.399902] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    3.399944] r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    3.399976] r8169 0000:01:00.0: setting latency timer to 64
[    3.400168] r8169 0000:01:00.0: irq 2300 for MSI/MSI-X
[    3.402117] eth0: RTL8168c/8111c at 0xffffc20000036000, 00:1c:c0:b5:00:65, XID 3c4000c0 IRQ 2300

From uname:

Linux pinky 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux

From lspci:

00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)

From syslog:

May 13 05:14:18 pinky kernel: [23718.804043] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
May 13 05:14:18 pinky kernel: [23718.804047] Modules linked in: i915 drm binfmt_misc bridge stp bnep video output input_polldev nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event ppdev snd_seq snd_timer snd_seq_device iTCO_wdt iTCO_vendor_support psmouse serio_raw pcspkr usblp intel_agp snd soundcore snd_page_alloc parport_pc parport usb_storage r8169 mii fbcon tileblit font bitblit softcursor
May 13 05:14:18 pinky kernel: [23718.804131] Pid: 0, comm: swapper Not tainted 2.6.28-11-generic #42-Ubuntu
May 13 05:14:18 pinky kernel: [23718.804135] Call Trace:
May 13 05:14:18 pinky kernel: [23718.804140]  <IRQ>  [<ffffffff80250927>] warn_slowpath+0xb7/0xf0
May 13 05:14:18 pinky kernel: [23718.804158]  [<ffffffff80416ffa>] ? __next_cpu+0x1a/0x30
May 13 05:14:18 pinky kernel: [23718.804165]  [<ffffffff802476dc>] ? find_busiest_group+0x1dc/0x9a0
May 13 05:14:18 pinky kernel: [23718.804175]  [<ffffffff80270969>] ? getnstimeofday+0x59/0xe0
May 13 05:14:18 pinky kernel: [23718.804182]  [<ffffffff8026c659>] ? ktime_get_ts+0x59/0x60
May 13 05:14:18 pinky kernel: [23718.804188]  [<ffffffff8026c671>] ? ktime_get+0x11/0x50
May 13 05:14:18 pinky kernel: [23718.804196]  [<ffffffff8041d2da>] ? strlcpy+0x4a/0x60
May 13 05:14:18 pinky kernel: [23718.804203]  [<ffffffff805cb6f0>] dev_watchdog+0x270/0x280
May 13 05:14:18 pinky kernel: [23718.804211]  [<ffffffff802424b2>] ? enqueue_entity+0x122/0x2b0
May 13 05:14:18 pinky kernel: [23718.804218]  [<ffffffff802486cd>] ? enqueue_task_fair+0x3d/0x80
May 13 05:14:18 pinky kernel: [23718.804226]  [<ffffffff802199e6>] ? read_tsc+0x16/0x40
May 13 05:14:18 pinky kernel: [23718.804233]  [<ffffffff805cb480>] ? dev_watchdog+0x0/0x280
May 13 05:14:18 pinky kernel: [23718.804241]  [<ffffffff8025be79>] run_timer_softirq+0x179/0x260
May 13 05:14:18 pinky kernel: [23718.804249]  [<ffffffff8027375f>] ? clockevents_program_event+0x4f/0x90
May 13 05:14:18 pinky kernel: [23718.804257]  [<ffffffff80256acc>] __do_softirq+0x9c/0x170
May 13 05:14:18 pinky kernel: [23718.804264]  [<ffffffff80213d8c>] call_softirq+0x1c/0x30
May 13 05:14:18 pinky kernel: [23718.804271]  [<ffffffff80214ffd>] do_softirq+0x5d/0xa0
May 13 05:14:18 pinky kernel: [23718.804277]  [<ffffffff8025684d>] irq_exit+0x8d/0xa0
May 13 05:14:18 pinky kernel: [23718.804286]  [<ffffffff80227648>] smp_apic_timer_interrupt+0x88/0xc0
May 13 05:14:18 pinky kernel: [23718.804293]  [<ffffffff80213668>] apic_timer_interrupt+0x88/0x90
May 13 05:14:18 pinky kernel: [23718.804297]  <EOI>  [<ffffffff8021a95a>] ? mwait_idle+0x4a/0x50
May 13 05:14:18 pinky kernel: [23718.804311]  [<ffffffff80210dd2>] ? enter_idle+0x22/0x30
May 13 05:14:18 pinky kernel: [23718.804318]  [<ffffffff80210e85>] ? cpu_idle+0x65/0xc0
May 13 05:14:18 pinky kernel: [23718.804326]  [<ffffffff80698f23>] ? start_secondary+0x9e/0xcb
May 13 05:14:18 pinky kernel: [23718.804331] ---[ end trace 1cfb5a1b92b2d7c9 ]---
May 13 05:14:18 pinky kernel: [23718.821866] r8169: eth0: link up
Comment 13 Arsen.Shnurkov 2009-06-02 17:39:18 UTC
My bug disappeared when I added the 'noacpi' option as a kernel command line parameter.
Comment 14 Arsen.Shnurkov 2009-06-02 19:53:10 UTC
>'acpi=off' option 
didn't help. After some time the network card stops sending packets again.
Comment 15 Francois Romieu 2009-06-15 21:43:48 UTC
Please give 2.6.30 a try. Some specific changes in it could fix the bug.

-- 
Ueimor
Comment 16 Lev Povalahev 2009-06-21 14:16:00 UTC
I had this problem with 2.6.26 and 2.6.29. Just tried 2.6.30 and it seems to be gone in the first tests. I'll be testing this more now, but it looks good so far.
Comment 17 Christoph Nelles 2009-10-02 22:19:56 UTC
As of 2.6.31.1 it is fixed for me. Both NICs are recognized and work correctly and very fast, even under load.