Bug 12500
Summary: | r8169: NETDEV WATCHDOG: eth0 (r8169): transmit timed out | ||
---|---|---|---|
Product: | Drivers | Reporter: | Rafael J. Wysocki (rjw) |
Component: | Network | Assignee: | Jeff Garzik (jgarzik) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | Arsen.Shnurkov, dsd, eike-kernel, frol, huwald, janjoris, kernel, kernel, kernel, ldorileo, mark, mpagano, paradyse, pilo, reflexeos, rm+bko, romieu, soltys, untrusted1 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.28 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 11808 |
Description
Rafael J. Wysocki
2009-01-19 14:26:03 UTC
Probably a dupe of #10109.

I can confirm this bug as well (2.6.28.6) - it happens every couple of days, whenever I put some higher load on the NIC.

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 316
Region 0: I/O ports at de00 [size=256]
Region 2: Memory at fdfff000 (64-bit, prefetchable) [size=4K]
Region 4: Memory at fdfe0000 (64-bit, prefetchable) [size=64K]
[virtual] Expansion ROM at fdf00000 [disabled] [size=64K]
Capabilities: <access denied>
Kernel driver in use: r8169
Kernel modules: r8169

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf1/0x171()
NETDEV WATCHDOG: ethi (r8169): transmit timed out
Modules linked in: pppoatm ppp_generic slhc speedtch usbatm atm ohci_hcd uhci_hcd cpufreq_ondemand powernow_k8 xt_CLASSIFY xt_connmark xt_CONNMARK iptable_raw sch_sfq sch_hfsc sr_mod cdrom ata_generic ehci_hcd pata_atiixp i2c_piix4 pcspkr usbcore k8temp 3c59x i2c_core r8169 sg ati_agp evdev thermal processor fan button battery ac xt_iprange ipt_REJECT iptable_filter ipt_LOG xt_tcpudp xt_MARK iptable_mangle ipt_MASQUERADE xt_conntrack xt_multiport iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ip_tables x_tables fbcon tileblit font bitblit softcursor fb rtc_cmos rtc_core rtc_lib sd_mod ahci raid1 ext4 mbcache jbd2 md_mod dm_mod pata_pdc2027x libata scsi_mod [last unloaded: speedtch]
Pid: 0, comm: swapper Not tainted 2.6.28.6-HVQ1 #1
Call Trace:
[<c012346e>] warn_slowpath+0x5a/0x79
[<c02769cd>] tcp_current_mss+0x5b/0xd3
[<c02758c3>] tcp_rcv_established+0x5f6/0x794
[<c025ea5c>] nf_iterate+0x30/0x61
[<c027b38c>]
tcp_v4_do_rcv+0x22/0x174 [<c01d59d8>] strlcpy+0x11/0x3d [<c0256691>] dev_watchdog+0xf1/0x171 [<c01345ba>] hrtimer_forward+0x10c/0x124 [<c013804d>] getnstimeofday+0x4a/0xcd [<c010ee69>] lapic_next_event+0x10/0x13 [<c02565a0>] dev_watchdog+0x0/0x171 [<c012a3e1>] run_timer_softirq+0x138/0x18f [<c02565a0>] dev_watchdog+0x0/0x171 [<c0127357>] __do_softirq+0x83/0x11e [<c0127424>] do_softirq+0x32/0x36 [<c0127522>] irq_exit+0x35/0x69 [<c010f4e1>] smp_apic_timer_interrupt+0x6e/0x78 [<c0104440>] apic_timer_interrupt+0x28/0x30 [<c01083a0>] default_idle+0x2a/0x3d [<c01084ff>] c1e_idle+0xc4/0xc7 [<c0102983>] cpu_idle+0x68/0x81 ---[ end trace 6271a577780eb771 ]--- r8169: ethi: link up I can confirm the bug. Under the same conditions as described by Michal Soltys. # lspci -vvv -s 05:00.0 05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 01) Subsystem: Toshiba America Info Systems Device ff00 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 508 Region 0: I/O ports at 4000 [size=256] Region 2: Memory at da000000 (64-bit, non-prefetchable) [size=4K] [virtual] Expansion ROM at d4000000 [disabled] [size=64K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] Vital Product Data pcilib: sysfs_read_vpd: read failed: Connection timed out Not readable Capabilities: [50] MSI: Mask- 64bit+ Count=1/2 Enable+ Address: 00000000fee0300c Data: 4181 Capabilities: [60] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [84] Vendor Specific Information <?> Capabilities: [100] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [12c] Virtual Channel <?> Capabilities: [148] Device Serial Number 36-81-ec-10-00-00-10-01 Capabilities: [154] Power Budgeting <?> Kernel driver in use: r8169 Syslog ----------------------------------------------------------------------- Mar 23 09:52:38 lpt kernel: [ 4277.000064] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf6/0x17c() Mar 23 09:52:38 lpt kernel: [ 4277.000067] NETDEV WATCHDOG: eth0 (r8169): transmit timed out Mar 23 09:52:38 lpt kernel: [ 4277.000070] Modules linked in: i915 drm binfmt_misc parport_pc parport ipv6 ext2 firewire_sbp2 loop hid_dell hid_pl hid_cypress hid_zpff hid_gyration hid_bright hid_sony hid_samsung hid_microsoft hid_tmff hid_monterey hid_ezkey hid_apple hid_a4tech hid_logitech usbhid ff_memless hid_cherry hid_sunplus hid_petalynx hid_belkin hid_chicony hid arc4 ecb joydev ath5k pcmcia snd_hda_intel sdhci_pci sdhci mac80211 snd_pcm_oss snd_mixer_oss sg 
tifm_7xx1 led_class yenta_socket rsrc_nonstatic rng_core snd_pcm uhci_hcd ehci_hcd i2c_i801 r8169 mmc_core tifm_core firewire_ohci firewire_core pcmcia_core psmouse iTCO_wdt snd_timer snd soundcore snd_page_alloc rfkill usbcore i2c_core intel_agp agpgart sr_mod mii cfg80211 crc_itu_t pcspkr serio_raw cdrom input_polldev video evdev battery container ac button output ext3 jbd mbcache sd_mod crc_t10dif thermal processor fan thermal_sys ide_pci_generic ide_core ata_generic ata_piix libata scsi_mod Mar 23 09:52:38 lpt kernel: [ 4277.000172] Pid: 0, comm: swapper Not tainted 2.6.28-1-686 #1 Mar 23 09:52:38 lpt kernel: [ 4277.000175] Call Trace: Mar 23 09:52:38 lpt kernel: [ 4277.000182] [<c0126d7a>] warn_slowpath+0x5a/0x79 Mar 23 09:52:38 lpt kernel: [ 4277.000187] [<c011ba68>] place_entity+0x63/0x92 Mar 23 09:52:38 lpt kernel: [ 4277.000191] [<c011de3f>] enqueue_entity+0x6b/0x112 Mar 23 09:52:38 lpt kernel: [ 4277.000194] [<c011e161>] enqueue_task_fair+0x19/0x51 Mar 23 09:52:38 lpt kernel: [ 4277.000198] [<c011beed>] enqueue_task+0x52/0x5d Mar 23 09:52:38 lpt kernel: [ 4277.000202] [<c011bfeb>] activate_task+0x16/0x1b Mar 23 09:52:38 lpt kernel: [ 4277.000205] [<c0121f16>] try_to_wake_up+0x168/0x172 Mar 23 09:52:38 lpt kernel: [ 4277.000210] [<c01fc234>] strlcpy+0x11/0x3d Mar 23 09:52:38 lpt kernel: [ 4277.000213] [<c028dbb6>] dev_watchdog+0xf6/0x17c Mar 23 09:52:38 lpt kernel: [ 4277.000216] [<c011d6e4>] __wake_up+0x29/0x39 Mar 23 09:52:38 lpt kernel: [ 4277.000220] [<c01346d7>] __queue_work+0x4d/0x5a Mar 23 09:52:38 lpt kernel: [ 4277.000223] [<c028dac0>] dev_watchdog+0x0/0x17c Mar 23 09:52:38 lpt kernel: [ 4277.000228] [<c012e4f8>] run_timer_softirq+0x14a/0x1b4 Mar 23 09:52:38 lpt kernel: [ 4277.000231] [<c028dac0>] dev_watchdog+0x0/0x17c Mar 23 09:52:38 lpt kernel: [ 4277.000247] [<c012b2d3>] __do_softirq+0x8c/0x130 Mar 23 09:52:38 lpt kernel: [ 4277.000250] [<c012b3bc>] do_softirq+0x45/0x53 Mar 23 09:52:38 lpt kernel: [ 4277.000253] [<c012b4c4>] irq_exit+0x35/0x69 
Mar 23 09:52:38 lpt kernel: [ 4277.000258] [<c0105bc0>] do_IRQ+0x6c/0x7c
Mar 23 09:52:38 lpt kernel: [ 4277.000261] [<c0104507>] common_interrupt+0x23/0x28
Mar 23 09:52:38 lpt kernel: [ 4277.000281] [<f818e147>] acpi_idle_enter_bm+0x31c/0x3a5 [processor]
Mar 23 09:52:38 lpt kernel: [ 4277.000285] [<c026ee11>] menu_select+0x37/0x96
Mar 23 09:52:38 lpt kernel: [ 4277.000289] [<c026e48f>] cpuidle_idle_call+0x5d/0x8e
Mar 23 09:52:38 lpt kernel: [ 4277.000292] [<c0102a37>] cpu_idle+0x71/0x8a
Mar 23 09:52:38 lpt kernel: [ 4277.000295] ---[ end trace 3d198b186ab92aff ]---
Mar 23 09:52:38 lpt kernel: [ 4277.016115] r8169: eth0: link up
Mar 23 09:53:20 lpt kernel: [ 4319.016107] r8169: eth0: link up
Mar 23 09:53:45 lpt kernel: [ 4343.858427] r8169: eth0: link up
Mar 23 09:53:55 lpt kernel: [ 4354.084070] eth0: no IPv6 routers present

I'm SADLY part of the crew. Is there a way out of this?
Debian Squeeze (2.6.26-1-686 and 2.6.28-1-686)
I read somewhere that using r8101 may solve the problem, but it didn't.

(from 2.6.26-1-686)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 01) Subsystem: Toshiba America Info Systems Device ff00 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 18 Region 0: I/O ports at 4000 [size=256] Region 2: Memory at da000000 (64-bit, non-prefetchable) [size=4K] [virtual] Expansion ROM at d4000000 [disabled] [size=64K] Capabilities: <access denied> Kernel driver in use: r8101 Kernel modules: r8169 (from 2.6.28-1-686 with r8196): ------------[ cut here ]------------ [ 3515.989047] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf6/0x17c() [ 3515.989050] NETDEV WATCHDOG: eth0 (r8169): transmit timed out [ 3515.989052] Modules linked in: i915 drm rfcomm l2cap bluetooth ppdev parport_pc lp parport ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave firewire_sbp2 loop arc4 ecb snd_hda_intel iwl3945 snd_pcm snd_seq snd_timer snd_seq_device joydev mac80211 snd led_class tifm_7xx1 i2c_i801 soundcore rng_core serio_raw tifm_core rfkill rsrc_nonstatic pcmcia_core i2c_core snd_page_alloc intel_agp agpgart pcspkr cfg80211 iTCO_wdt psmouse evdev input_polldev battery video output container button ac ext3 jbd mbcache sg usbhid hid sr_mod cdrom sd_mod crc_t10dif ide_pci_generic ide_core ata_generic ata_piix sdhci_pci sdhci firewire_ohci mmc_core libata firewire_core crc_itu_t scsi_mod ehci_hcd uhci_hcd usbcore r8169 mii thermal processor fan thermal_sys [ 3515.989138] Pid: 0, comm: swapper Not tainted 2.6.28-1-686 #1 [ 3515.989141] Call Trace: [ 3515.989149] [<c0126d7a>] warn_slowpath+0x5a/0x79 [ 3515.989154] [<c029f33f>] ip_finish_output+0x1c4/0x1fb [ 3515.989160] [<c01f860c>] __next_cpu+0x12/0x21 [ 3515.989164] [<c01f860c>] __next_cpu+0x12/0x21 [ 3515.989167] [<c011f8eb>] find_busiest_group+0x307/0x78f [ 
3515.989170] [<c011f8eb>] find_busiest_group+0x307/0x78f
[ 3515.989176] [<c01fc234>] strlcpy+0x11/0x3d
[ 3515.989179] [<c028dbb6>] dev_watchdog+0xf6/0x17c
[ 3515.989185] [<c013aef7>] sched_clock_tick+0x95/0x9e
[ 3515.989188] [<c0138dea>] hrtimer_forward+0x10c/0x124
[ 3515.989192] [<c013cadf>] getnstimeofday+0x4f/0xd1
[ 3515.989197] [<c0110245>] lapic_next_event+0x10/0x13
[ 3515.989200] [<c028dac0>] dev_watchdog+0x0/0x17c
[ 3515.989206] [<c012e4f8>] run_timer_softirq+0x14a/0x1b4
[ 3515.989209] [<c028dac0>] dev_watchdog+0x0/0x17c
[ 3515.989213] [<c012b2d3>] __do_softirq+0x8c/0x130
[ 3515.989216] [<c012b3bc>] do_softirq+0x45/0x53
[ 3515.989219] [<c012b4c4>] irq_exit+0x35/0x69
[ 3515.989223] [<c01109ad>] smp_apic_timer_interrupt+0x6e/0x78
[ 3515.989227] [<c0104620>] apic_timer_interrupt+0x28/0x30
[ 3515.989256] [<f8075147>] acpi_idle_enter_bm+0x31c/0x3a5 [processor]
[ 3515.989270] [<f807539d>] acpi_idle_enter_simple+0x1cd/0x224 [processor]
[ 3515.989276] [<c026ee11>] menu_select+0x37/0x96
[ 3515.989279] [<c026e48f>] cpuidle_idle_call+0x5d/0x8e
[ 3515.989283] [<c0102a37>] cpu_idle+0x71/0x8a
[ 3515.989286] ---[ end trace 7b6311bc953d584e ]---
[ 3516.005184] r8169: eth0: link up

I have reproduced this bug in 2.6.29.1, FWIW. It takes quite a bit to reproduce this, but I am able to do so within a few hours. I have 10 machines running this kernel; running scp in a round-robin style will cause this timeout on a random machine about every 2-3 hours. As of right now, the stable kernel that I can run without this problem is 2.6.26.3. The only other kernel I have tried is 2.6.28.9, and that produced the bug as well.

Apr 6 00:26:14 svc52 kernel: ------------[ cut here ]------------ Apr 6 00:26:14 svc52 kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x11f/0x1b0() Apr 6 00:26:14 svc52 kernel: Hardware name: TF720 A2+ Apr 6 00:26:14 svc52 kernel: NETDEV WATCHDOG: eth0 (r8169): transmit timed out Apr 6 00:26:14 svc52 kernel: Modules linked in: forcedeth Apr 6 00:26:14 svc52 kernel: Pid: 0, comm: swapper Not tainted 2.6.29.1 #2 Apr 6 00:26:14 svc52 kernel: Call Trace: Apr 6 00:26:14 svc52 kernel: <IRQ> [<ffffffff8023428f>] warn_slowpath+0xd8/0xf5 Apr 6 00:26:14 svc52 kernel: [<ffffffff8022ce88>] default_wake_function+0x0/0x9 Apr 6 00:26:14 svc52 kernel: [<ffffffff804b4541>] __qdisc_run+0x103/0x1d7 Apr 6 00:26:14 svc52 kernel: [<ffffffff80360500>] cpumask_next_and+0x2a/0x3e Apr 6 00:26:14 svc52 kernel: [<ffffffff8022b598>] find_busiest_group+0x27f/0x7b2 Apr 6 00:26:14 svc52 kernel: [<ffffffff804b495b>] dev_watchdog+0x11f/0x1b0 Apr 6 00:26:14 svc52 kernel: [<ffffffff8024ab69>] getnstimeofday+0x57/0xb7 Apr 6 00:26:14 svc52 kernel: [<ffffffff80247902>] ktime_get_ts+0x22/0x4b Apr 6 00:26:14 svc52 kernel: [<ffffffff804b483c>] dev_watchdog+0x0/0x1b0 Apr 6 00:26:14 svc52 kernel: [<ffffffff8023c192>] run_timer_softirq+0x12c/0x193 Apr 6 00:26:14 svc52 kernel: [<ffffffff802387ec>] __do_softirq+0x7a/0x13d Apr 6 00:26:14 svc52 kernel: [<ffffffff8020c23c>] call_softirq+0x1c/0x28 Apr 6 00:26:14 svc52 kernel: [<ffffffff8020d0f8>] do_softirq+0x2c/0x6c Apr 6 00:26:14 svc52 kernel: [<ffffffff8021ac5c>] smp_apic_timer_interrupt+0x93/0xab Apr 6 00:26:14 svc52 kernel: [<ffffffff8020bc73>] apic_timer_interrupt+0x13/0x20 Apr 6 00:26:14 svc52 kernel: <EOI> [<ffffffff80210c40>] default_idle+0x27/0x3b Apr 6 00:26:14 svc52 kernel: [<ffffffff80210e43>] c1e_idle+0xe5/0xe9 Apr 6 00:26:14 svc52 kernel: [<ffffffff8020a047>] cpu_idle+0x40/0x5e Apr 6 00:26:14 svc52 kernel: ---[ end trace f604628d7fa5821b ]--- Apr 6 00:26:14 svc52 kernel: r8169: eth0: link up 05:00.0 Ethernet controller: Realtek 
Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Subsystem: Biostar Microtech Int'l Corp Device 2307
Flags: bus master, fast devsel, latency 0, IRQ 1275
I/O ports at e800 [size=256]
Memory at febff000 (64-bit, non-prefetchable) [size=4K]
Memory at fbff0000 (64-bit, prefetchable) [size=64K]
Expansion ROM at febc0000 [disabled] [size=128K]
Capabilities: [40] Power Management version 3
Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Mask- TabSize=2
Capabilities: [d0] Vital Product Data <?>
Kernel driver in use: r8169

Same here. Put a bit of load on the card and it breaks. With the old Realtek driver it ran without problems. I have two r8169 NICs in the computer: the first onboard, the second as an add-on card.

Kernel 2.6.29.1:
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1fa/0x210()
Hardware name: GA-MA78G-DS3H
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: cpufreq_stats snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 xt_state nf_nat_ftp nf_conntrack_ftp genrtc fuse i2c_piix4 ohci1394 st ieee1394 i2c_core psmouse evdev sg
Pid: 0, comm: swapper Not tainted 2.6.29.1 #3
Call Trace:
<IRQ> [<ffffffff80238fe2>] warn_slowpath+0xf2/0x130
[<ffffffff80431030>] ahci_qc_prep+0x50/0x130
[<ffffffff8041ca87>] ata_qc_issue+0x127/0x2b0
[<ffffffff8036d8a0>] sg_init_table+0x20/0x50
[<ffffffff803f4440>] scsi_done+0x0/0x10
[<ffffffff80363183>] cpumask_next_and+0x23/0x40
[<ffffffff8022f72c>] find_busiest_group+0x20c/0x8a0
[<ffffffff80234e26>] tg_shares_up+0xe6/0x1e0
[<ffffffff8022a7cb>] enqueue_task+0xb/0x20
[<ffffffff8022a85a>] activate_task+0x1a/0x30
[<ffffffff80234e26>] tg_shares_up+0xe6/0x1e0
[<ffffffff80253fa9>] getnstimeofday+0x49/0xe0
[<ffffffff8036893e>] strlcpy+0x4e/0x80
[<ffffffff8051371a>] dev_watchdog+0x1fa/0x210
[<ffffffff80251020>] ktime_get_ts+0x20/0x60 [<ffffffff802426aa>] run_timer_softirq+0x1ba/0x220 [<ffffffff8023e3bb>] __do_softirq+0x8b/0x150 [<ffffffff80221672>] hpet_interrupt_handler+0x12/0x40 [<ffffffff8020c6fc>] call_softirq+0x1c/0x30 [<ffffffff8020e015>] do_softirq+0x35/0x80 [<ffffffff8023e325>] irq_exit+0x85/0x90 [<ffffffff8020e243>] do_IRQ+0x83/0x110 [<ffffffff8020bfd3>] ret_from_intr+0x0/0xa <EOI> [<ffffffff80212ca7>] default_idle+0x27/0x40 [<ffffffff80212eca>] c1e_idle+0xba/0x100 [<ffffffff8020a316>] cpu_idle+0x46/0x90 ---[ end trace 4c2c60d95eca3dcf ]--- r8169: eth0: link up Boot up messages: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded r8169 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 r8169 0000:03:00.0: setting latency timer to 64 r8169 0000:03:00.0: irq 28 for MSI/MSI-X eth0: RTL8168b/8111b at 0xffffc20000026000, 00:e0:4c:68:0d:9b, XID 38000000 IRQ 28 r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded r8169 0000:04:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18 r8169 0000:04:00.0: setting latency timer to 64 r8169 0000:04:00.0: irq 29 for MSI/MSI-X eth1: RTL8168c/8111c at 0xffffc2000002a000, 00:1f:d0:54:e7:bb, XID 3c4000c0 IRQ 29 Device listing: 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01) Subsystem: Realtek Semiconductor Co., Ltd. 
RTL8111/8168B PCI Express Gigabit Ethernet controller Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 28 Region 0: I/O ports at ee00 [size=256] Region 2: Memory at fdeff000 (64-bit, non-prefetchable) [size=4K] [virtual] Expansion ROM at fdd00000 [disabled] [size=128K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] Vital Product Data <?> Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+ Address: 00000000fee0300c Data: 4181 Capabilities: [60] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited ClockPM- Suprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [84] Vendor Specific Information <?> Capabilities: [100] Advanced Error Reporting <?> Capabilities: [12c] Virtual Channel <?> Capabilities: [148] Device Serial Number 68-81-ec-10-00-00-0d-b4 Capabilities: [154] Power Budgeting <?> Kernel driver in use: r8169 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02) Subsystem: Giga-byte Technology Unknown device e000 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 29 Region 0: I/O ports at de00 [size=256] Region 2: Memory at fdbff000 (64-bit, prefetchable) [size=4K] Region 4: Memory at fdbe0000 (64-bit, prefetchable) [size=64K] [virtual] Expansion ROM at fdb00000 [disabled] [size=64K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+ Address: 00000000fee0300c Data: 4189 Capabilities: [70] Express (v1) Endpoint, MSI 01 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <8us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM+ Suprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [b0] MSI-X: Enable- Mask- TabSize=2 Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [d0] Vital Product Data <?> Capabilities: [100] Advanced Error Reporting <?> Capabilities: [140] Virtual Channel <?> Capabilities: [160] Device Serial Number 78-56-34-12-78-56-34-12 Kernel driver in use: r8169 Can confirm it on 2.6.29.1, 2.6.29.2. I need approx. 
20 s to reproduce it using the 'iperf' benchmarking tool and two cross-connected, r8169-driven nodes. It should be mentioned that I get this bug only if the bandwidth utilization is close to 1 Gb/s. In a rate-limited environment (e.g. ~450 Mb/s during tcpdump on one host) I never got the bug (tested ~3 days on 4 nodes). The workarounds suggested elsewhere using ethtool (disabling autoneg, setting the rate by hand, ...) all didn't work.

Another report: https://bugs.gentoo.org/show_bug.cgi?id=266761

Given that some people can reproduce this easily, and it is a regression, would anyone be up for doing a git bisect? It will identify the exact commit which introduced this bug. http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/

I will do a bisection between 2.6.25.4 (works for me) and 2.6.29.2. But given the wide range of patches, I should point out that bisection will find neither multiple nor hierarchical regressions. I expect to finish on Friday.

Bad news. Bisecting between 2.6.25 and 2.6.29 delivered the following potential first bad commit:

# git bisect bad
1d8cca44b6a244b7e378546d719041819049a0f9 is first bad commit
commit 1d8cca44b6a244b7e378546d719041819049a0f9
Author: Harvey Harrison <harvey.harrison@gmail.com>
Date: Sat Oct 18 20:28:37 2008 -0700

byteorder: provide swabb.h generically in asm/byteorder.h

This is needed during the transition to the new byteorder headers as the swabb.h functionality will be provided from asm/byteorder.h in the new version. To avoid breakage on arches still using the old implementation, provide swabb.h from asm/byteorder.h as well.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 e548e9798e03daeee187cf0012ba7c16712d7261 b747283fd042fd18b1fd6275de9c99a1f904f730 M include
#

Well, that doesn't seem to be what we've been looking for (although one should still investigate if). I have to mention that the frequency of the bug occurrence varied across the different builds (between 0.2 Hz and 0.01 Hz). I tested whether a build was good using 50 iterations of iperf (which does a 10 s test at 1 Gb/s if the bug does not occur). Perhaps in a build marked as good the frequency was too low to be discovered. I'll stand by until it is clear that a bisection with a longer test is necessary.

I have an RTL8169sb/8110sb PCI card which breaks during high TCP traffic, but rarely during UDP. Every kernel from 2.6.22 to 2.6.30-rc5 is affected; compiling 2.6.20 now. On the affected computer I run one outgoing iperf and three incoming ones from different hosts, which causes the NIC to exhibit symptoms within seconds. Curiously, I have an identical card in one of the "attacking" machines which does not exhibit any symptoms, so perhaps my case is not related.

I have the same issue on an Intel D945GCLF2 MB with embedded NIC. It is very reproducible; when I put full load on it, it occurs about once a minute.

Observations:
1) Issue only occurs for me on outgoing traffic. For incoming traffic, the full gigabit load can be on for 10 hours and everything works (tested).
2) Issue is very reproducible for me under Ubuntu 09.04 amd64 (2.6.28) but not under Debian Lenny amd64 (2.6.26).

This box was intended as a central backup server, but the bug renders it unable to enter production: I can make backups fine (e.g. store 100GB in a big burst over Samba), but cannot possibly restore (i.e. pulling back 100GB always crashes in the first couple GBs).
Thanks, - Jan Joris - From dmesg: [ 3.399902] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [ 3.399944] r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 [ 3.399976] r8169 0000:01:00.0: setting latency timer to 64 [ 3.400168] r8169 0000:01:00.0: irq 2300 for MSI/MSI-X [ 3.402117] eth0: RTL8168c/8111c at 0xffffc20000036000, 00:1c:c0:b5:00:65, XID 3c4000c0 IRQ 2300 From uname: Linux pinky 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux From lspci: 00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01) 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02) From syslog: May 13 05:14:18 pinky kernel: [23718.804043] NETDEV WATCHDOG: eth0 (r8169): transmit timed out May 13 05:14:18 pinky kernel: [23718.804047] Modules linked in: i915 drm binfmt_misc bridge stp bnep video output input_polldev nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event ppdev snd_seq snd_timer snd_seq_device iTCO_wdt iTCO_vendor_support psmouse serio_raw pcspkr usblp intel_agp snd soundcore snd_page_alloc parport_pc parport usb_storage r8169 mii fbcon tileblit font bitblit softcursor May 13 05:14:18 pinky kernel: [23718.804131] Pid: 0, comm: swapper Not tainted 2.6.28-11-generic #42-Ubuntu May 13 05:14:18 pinky kernel: [23718.804135] Call Trace: May 13 05:14:18 pinky kernel: [23718.804140] <IRQ> [<ffffffff80250927>] warn_slowpath+0xb7/0xf0 May 13 05:14:18 pinky kernel: [23718.804158] [<ffffffff80416ffa>] ? __next_cpu+0x1a/0x30 May 13 05:14:18 pinky kernel: [23718.804165] [<ffffffff802476dc>] ? find_busiest_group+0x1dc/0x9a0 May 13 05:14:18 pinky kernel: [23718.804175] [<ffffffff80270969>] ? getnstimeofday+0x59/0xe0 May 13 05:14:18 pinky kernel: [23718.804182] [<ffffffff8026c659>] ? 
ktime_get_ts+0x59/0x60 May 13 05:14:18 pinky kernel: [23718.804188] [<ffffffff8026c671>] ? ktime_get+0x11/0x50 May 13 05:14:18 pinky kernel: [23718.804196] [<ffffffff8041d2da>] ? strlcpy+0x4a/0x60 May 13 05:14:18 pinky kernel: [23718.804203] [<ffffffff805cb6f0>] dev_watchdog+0x270/0x280 May 13 05:14:18 pinky kernel: [23718.804211] [<ffffffff802424b2>] ? enqueue_entity+0x122/0x2b0 May 13 05:14:18 pinky kernel: [23718.804218] [<ffffffff802486cd>] ? enqueue_task_fair+0x3d/0x80 May 13 05:14:18 pinky kernel: [23718.804226] [<ffffffff802199e6>] ? read_tsc+0x16/0x40 May 13 05:14:18 pinky kernel: [23718.804233] [<ffffffff805cb480>] ? dev_watchdog+0x0/0x280 May 13 05:14:18 pinky kernel: [23718.804241] [<ffffffff8025be79>] run_timer_softirq+0x179/0x260 May 13 05:14:18 pinky kernel: [23718.804249] [<ffffffff8027375f>] ? clockevents_program_event+0x4f/0x90 May 13 05:14:18 pinky kernel: [23718.804257] [<ffffffff80256acc>] __do_softirq+0x9c/0x170 May 13 05:14:18 pinky kernel: [23718.804264] [<ffffffff80213d8c>] call_softirq+0x1c/0x30 May 13 05:14:18 pinky kernel: [23718.804271] [<ffffffff80214ffd>] do_softirq+0x5d/0xa0 May 13 05:14:18 pinky kernel: [23718.804277] [<ffffffff8025684d>] irq_exit+0x8d/0xa0 May 13 05:14:18 pinky kernel: [23718.804286] [<ffffffff80227648>] smp_apic_timer_interrupt+0x88/0xc0 May 13 05:14:18 pinky kernel: [23718.804293] [<ffffffff80213668>] apic_timer_interrupt+0x88/0x90 May 13 05:14:18 pinky kernel: [23718.804297] <EOI> [<ffffffff8021a95a>] ? mwait_idle+0x4a/0x50 May 13 05:14:18 pinky kernel: [23718.804311] [<ffffffff80210dd2>] ? enter_idle+0x22/0x30 May 13 05:14:18 pinky kernel: [23718.804318] [<ffffffff80210e85>] ? cpu_idle+0x65/0xc0 May 13 05:14:18 pinky kernel: [23718.804326] [<ffffffff80698f23>] ? 
start_secondary+0x9e/0xcb
May 13 05:14:18 pinky kernel: [23718.804331] ---[ end trace 1cfb5a1b92b2d7c9 ]---
May 13 05:14:18 pinky kernel: [23718.821866] r8169: eth0: link up

My bug disappears when I add the 'noacpi' option as a kernel command line parameter.

>'acpi=off' option
didn't help. After some time the network card stops sending packets again.

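A workaround that looks good at first and then fails "after some time" is the same difficulty the bisection comment raised: with an intermittent bug, a short test can wrongly mark a kernel as good. The following is a minimal sketch of that reasoning, assuming timeouts arrive as a Poisson process with a constant rate (a modelling assumption, not something measured in this report). The 0.2 Hz and 0.01 Hz rates and the 50 x 10 s test length come from the bisection comment; the "one per 2.5 h" rate is a hypothetical value loosely based on the scp round-robin comment, and the function names are illustrative only.

```python
import math


def miss_probability(rate_hz: float, test_seconds: float) -> float:
    """Chance that a test of the given length sees no timeout at all.

    Assumes timeouts arrive as a Poisson process with a constant rate,
    which is a modelling assumption, not a measurement from this report.
    """
    return math.exp(-rate_hz * test_seconds)


def required_seconds(rate_hz: float, max_miss: float) -> float:
    """Test length needed so the chance of a false 'good' is at most max_miss."""
    return -math.log(max_miss) / rate_hz


if __name__ == "__main__":
    # 50 iperf iterations of 10 s each, as described in the bisection comment.
    test_seconds = 50 * 10
    scenarios = [
        ("0.2 Hz (reported)", 0.2),
        ("0.01 Hz (reported)", 0.01),
        ("one per 2.5 h (hypothetical)", 1 / 9000),
    ]
    for label, rate in scenarios:
        miss = miss_probability(rate, test_seconds)
        need_min = required_seconds(rate, 0.01) / 60
        print(f"{label:>30}: P(no timeout in {test_seconds} s) = {miss:.4f}, "
              f"test length for a 1% miss chance = {need_min:.1f} min")
```

Under these assumptions a 500 s test misses a 0.01 Hz bug less than 1% of the time, but would miss a bug that fires once per 2.5 hours about 95% of the time, so a "good" verdict on such a build says very little.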
Please give 2.6.30 a try. Some specific changes in it could fix the bug. -- Ueimor

I had this problem with 2.6.26 and 2.6.29. Just tried 2.6.30 and it seems to be gone in the first tests. I'll be testing this more now, but it looks good so far.

As of 2.6.31.1 it is fixed for me. Both NICs are recognized and work correctly and very fast, even under load.
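For anyone re-testing a newer kernel as in the last comments, checking the kernel log for the watchdog message can be automated. The sketch below counts "transmit timed out" lines per interface and driver, using the message format exactly as it appears in the traces in this report. The function names and the sample text are illustrative, and mapping the result to an exit code follows the `git bisect run` convention (0 means good, 1 means bad); it is an assumption that a test harness would be wired up this way.

```python
import re
from collections import Counter

# Matches the watchdog line as it appears in the traces in this report, e.g.
# "NETDEV WATCHDOG: eth0 (r8169): transmit timed out".
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def count_timeouts(log_text: str) -> Counter:
    """Count watchdog transmit timeouts per (interface, driver) pair."""
    return Counter(
        (m["iface"], m["driver"]) for m in WATCHDOG_RE.finditer(log_text)
    )


def bisect_exit_code(log_text: str, driver: str = "r8169") -> int:
    """Return 1 (bad) if the driver logged any timeout, else 0 (good)."""
    hits = sum(
        n for (_iface, drv), n in count_timeouts(log_text).items() if drv == driver
    )
    return 1 if hits else 0


# Sample lines copied from the logs quoted in this report.
SAMPLE_LOG = """\
Mar 23 09:52:38 lpt kernel: [ 4277.000067] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Mar 23 09:52:38 lpt kernel: [ 4277.016115] r8169: eth0: link up
May 13 05:14:18 pinky kernel: [23718.804043] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
NETDEV WATCHDOG: ethi (r8169): transmit timed out
"""

if __name__ == "__main__":
    for (iface, driver), n in sorted(count_timeouts(SAMPLE_LOG).items()):
        print(f"{iface} ({driver}): {n} transmit timeout(s)")
    print("exit code for git bisect run:", bisect_exit_code(SAMPLE_LOG))
```

In a real run the text would come from the kernel log captured after the load test rather than from `SAMPLE_LOG`, and the test has to run long enough for an intermittent timeout to show up at all.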