Bug 12411

Summary: 2.6.28: BUG in r8169
Product: Drivers
Reporter: Rafael J. Wysocki (rjw)
Component: Network
Assignee: Francois Romieu (romieu)
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
CC: alan, andres.430, db.pub.mail, eike-kernel, florian, kernel, masp01, mpagano, romieu, vadim.v.panov, victorpablosceruelo, vr5
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32.2
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 11808

Description Rafael J. Wysocki 2009-01-10 15:59:16 UTC
Subject    : 2.6.28: BUG in r8169
Submitter  : "Andrey Vul" <andrey.vul@gmail.com>
Date       : 2008-12-31 18:37
References : http://marc.info/?l=linux-kernel&m=123074869611409&w=4
Notify-Also : Francois Romieu <romieu@fr.zoreil.com>

This entry is being used for tracking a regression from 2.6.27.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rolf Eike Beer 2009-01-24 05:26:54 UTC
Probably a duplicate of #10109.
Comment 2 Doug Bazarnic 2009-02-16 00:20:25 UTC
I just had this pop up again today, running 2.6.28.1 on CentOS 5.2 x86_64. If irqbalance is enabled, it happens immediately. Uptime is 17 days on brand-new hardware, and the box is under heavy network load.

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x206/0x220()
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: dm_mirror dm_region_hash dm_log dm_multipath dm_mod serio_raw pcspkr
Pid: 0, comm: swapper Not tainted 2.6.28.1 #6
Call Trace:
 <IRQ>  [<ffffffff8023633c>] warn_slowpath+0x10c/0x150
 [<ffffffff8024ab39>] autoremove_wake_function+0x9/0x30
 [<ffffffff8022c65a>] __wake_up_common+0x5a/0x90
 [<ffffffff8022c367>] source_load+0x37/0x70
 [<ffffffff803cfb5a>] __next_cpu+0x1a/0x30
 [<ffffffff8022daec>] find_busiest_group+0x18c/0x820
 [<ffffffff803d52fe>] strlcpy+0x4e/0x80
 [<ffffffff804b78b6>] dev_watchdog+0x206/0x220
 [<ffffffff802122d9>] read_tsc+0x9/0x20
 [<ffffffff80250368>] getnstimeofday+0x48/0xe0
 [<ffffffff8024d5e8>] run_hrtimer_pending+0x18/0x120
 [<ffffffff804b76b0>] dev_watchdog+0x0/0x220
 [<ffffffff8023fbff>] run_timer_softirq+0x15f/0x1c0
 [<ffffffff8023b52c>] __do_softirq+0x9c/0x170
 [<ffffffff8020c87c>] call_softirq+0x1c/0x30
 [<ffffffff8020e135>] do_softirq+0x35/0x70
 [<ffffffff8021c205>] smp_apic_timer_interrupt+0x85/0xd0
 [<ffffffff8020c2cb>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff802130c1>] mwait_idle+0x41/0x50
 [<ffffffff8020a2da>] cpu_idle+0x3a/0x70
---[ end trace 33d76deea67d0fe1 ]---
r8169: eth0: link up
r8169: eth0: link up
Comment 3 Rafael J. Wysocki 2009-02-16 08:31:14 UTC
Please try something later than 2.6.28.1, preferably 2.6.29-rc5 or the latest 2.6.28.y.
Comment 4 John Daiker 2009-03-21 11:53:45 UTC
I can confirm a similar bug on 2.6.28.7 and 2.6.28.8:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x247/0x260()
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: nfsd exportfs r8169 mii thermal processor fuse
Pid: 0, comm: swapper Not tainted 2.6.28.8-090316-2248 #1
Call Trace:
 <IRQ>  [<ffffffff8023e657>] warn_slowpath+0xb7/0xf0
 [<ffffffff8050153b>] ? sock_def_readable+0x3b/0x70
 [<ffffffff80502349>] ? sock_queue_rcv_skb+0xe9/0x130
 [<ffffffff803cd2a9>] ? __next_cpu+0x19/0x30
 [<ffffffff8023589c>] ? find_busiest_group+0x1dc/0x990
 [<ffffffff80213319>] ? read_tsc+0x9/0x20
 [<ffffffff8025b489>] ? getnstimeofday+0x59/0xe0
 [<ffffffff80258389>] ? ktime_get_ts+0x59/0x60
 [<ffffffff803d3209>] ? strlcpy+0x49/0x60
 [<ffffffff8021e895>] ? lapic_next_event+0x15/0x20
 [<ffffffff8051e627>] dev_watchdog+0x247/0x260
 [<ffffffff8025a313>] ? sched_clock_cpu+0x143/0x190
 [<ffffffff8051e3e0>] ? dev_watchdog+0x0/0x260
 [<ffffffff80248d4f>] run_timer_softirq+0x13f/0x210
 [<ffffffff80258389>] ? ktime_get_ts+0x59/0x60
 [<ffffffff8025e20f>] ? clockevents_program_event+0x4f/0x90
 [<ffffffff80243e14>] __do_softirq+0x94/0x160
 [<ffffffff8020cddc>] call_softirq+0x1c/0x30
 [<ffffffff8020e815>] do_softirq+0x45/0x80
 [<ffffffff80243b9d>] irq_exit+0x8d/0xa0
 [<ffffffff8021f138>] smp_apic_timer_interrupt+0x88/0xc0
 [<ffffffff8020c82b>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff802141aa>] ? mwait_idle+0x4a/0x50
 [<ffffffff8020a502>] ? enter_idle+0x22/0x30
 [<ffffffff8020a5ae>] ? cpu_idle+0x5e/0xb0
 [<ffffffff805a9faf>] ? start_secondary+0x152/0x1a3
---[ end trace 7b3bb601968af4c5 ]---
r8169: eth0: link up
Comment 5 Andres Suarez 2009-03-23 17:57:17 UTC
I'm having the same issue with Ubuntu 9.04 alpha:

[ 3538.000050] WARNING: at /build/buildd/linux-2.6.28/net/sched/sch_generic.c:226 dev_watchdog+0x219/0x230()
[ 3538.000056] NETDEV WATCHDOG: eth2 (r8169): transmit timed out
[ 3538.000060] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat binfmt_misc i915 drm bridge stp bnep video output input_polldev smsc47m1 smsc47m192 hwmon_vid i2c_i801 sbp2 ieee1394 lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy ppdev snd_seq_oss psmouse snd_seq_midi snd_rawmidi serio_raw iTCO_wdt iTCO_vendor_support snd_seq_midi_event snd_seq pcspkr snd_timer snd_seq_device snd soundcore intel_agp snd_page_alloc parport_pc parport agpgart usbhid usb_storage r8169 mii fbcon tileblit font bitblit softcursor
[ 3538.000146] Pid: 0, comm: swapper Not tainted 2.6.28-11-generic #36-Ubuntu
[ 3538.000151] Call Trace:
[ 3538.000162]  [<c0139a60>] warn_slowpath+0x60/0x80
[ 3538.000171]  [<c02c0030>] ? as_deactivate_request+0x30/0x40
[ 3538.000178]  [<c013128d>] ? find_busiest_group+0x15d/0x7f0
[ 3538.000185]  [<c012c6dc>] ? enqueue_entity+0x13c/0x360
[ 3538.000193]  [<c0156873>] ? getnstimeofday+0x53/0x110
[ 3538.000201]  [<c0119a00>] ? lapic_get_maxlvt+0x0/0x30
[ 3538.000208]  [<c047ced7>] ? icmp_send+0x167/0x560
[ 3538.000215]  [<c0156873>] ? getnstimeofday+0x53/0x110
[ 3538.000221]  [<c02cadbd>] ? strlcpy+0x1d/0x60
[ 3538.000229]  [<c0431382>] ? netdev_drivername+0x32/0x40
[ 3538.000235]  [<c0445ee9>] dev_watchdog+0x219/0x230
[ 3538.000243]  [<c043b7ab>] ? neigh_table_init_no_netlink+0x14b/0x1d0
[ 3538.000249]  [<c043af80>] ? neigh_periodic_timer+0x0/0x190
[ 3538.000257]  [<c01444d7>] ? mod_timer+0x37/0x80
[ 3538.000263]  [<c043b0a6>] ? neigh_periodic_timer+0x126/0x190
[ 3538.000269]  [<c0143aa0>] run_timer_softirq+0x130/0x200
[ 3538.000275]  [<c0445cd0>] ? dev_watchdog+0x0/0x230
[ 3538.000281]  [<c0445cd0>] ? dev_watchdog+0x0/0x230
[ 3538.000289]  [<c013f147>] __do_softirq+0x97/0x170
[ 3538.000296]  [<c0152c36>] ? hrtimer_interrupt+0x186/0x1b0
[ 3538.000302]  [<c0152a89>] ? ktime_get+0x19/0x40
[ 3538.000308]  [<c013f27d>] do_softirq+0x5d/0x60
[ 3538.000314]  [<c013f3f5>] irq_exit+0x55/0x90
[ 3538.000321]  [<c011a0ab>] smp_apic_timer_interrupt+0x5b/0x90
[ 3538.000328]  [<c0105318>] apic_timer_interrupt+0x28/0x30
[ 3538.000335]  [<c010b002>] ? mwait_idle+0x42/0x50
[ 3538.000340]  [<c010285d>] cpu_idle+0x6d/0xd0
[ 3538.000348]  [<c04f11ee>] rest_init+0x4e/0x60
[ 3538.000353] ---[ end trace 05614c40c8f508dd ]-
Comment 6 Andres Suarez 2009-03-24 10:33:49 UTC
Confirmed again on Ubuntu 8.10 Intrepid Ibex, which is odd since it used to work fine there. Maybe I need to update my system in this case.
Comment 7 Andres Suarez 2009-03-24 13:29:06 UTC
dmesg output from Ubuntu 8.10 Intrepid Ibex (2.6.27-11-generic):

[ 2521.964027] ------------[ cut here ]------------
[ 2521.964044] WARNING: at /build/buildd/linux-2.6.27/net/sched/sch_generic.c:219 dev_watchdog+0x21a/0x230()
[ 2521.964051] NETDEV WATCHDOG: eth2 (r8169): transmit timed out
[ 2521.964056] Modules linked in: nls_cp437 cifs i915 drm binfmt_misc af_packet bridge stp bnep sco rfcomm l2cap bluetooth ppdev cp$
[ 2521.964227] Pid: 0, comm: swapper Not tainted 2.6.27-11-generic #1
[ 2521.964235]  [<c0131e15>] warn_slowpath+0x65/0x90
[ 2521.964246]  [<c0240030>] ? get_request+0xc0/0x360
[ 2521.964256]  [<c012990d>] ? find_busiest_group+0x15d/0x7c0
[ 2521.964267]  [<c037edde>] ? account_scheduler_latency+0xe/0x220
[ 2521.964277]  [<c037edde>] ? account_scheduler_latency+0xe/0x220
[ 2521.964286]  [<c0118e38>] ? read_hpet+0x8/0x20
[ 2521.964297]  [<c014e6eb>] ? getnstimeofday+0x4b/0x100
[ 2521.964307]  [<c0136a26>] ? set_normalized_timespec+0x16/0x90
[ 2521.964316]  [<c0154437>] ? timer_stats_update_stats+0x17/0x250
[ 2521.964325]  [<c0254a19>] ? strlen+0x9/0x20
[ 2521.964333]  [<c0252a9d>] ? strlcpy+0x1d/0x60
[ 2521.964341]  [<c02f16a7>] ? netdev_drivername+0x37/0x40
[ 2521.964350]  [<c03068aa>] dev_watchdog+0x21a/0x230
[ 2521.964358]  [<c01136c0>] ? lapic_next_event+0x20/0x30
[ 2521.964368]  [<c0151dbf>] ? clockevents_program_event+0x9f/0x150
[ 2521.964377]  [<c013c038>] run_timer_softirq+0x138/0x210
[ 2521.964386]  [<c0306690>] ? dev_watchdog+0x0/0x230
[ 2521.964394]  [<c0306690>] ? dev_watchdog+0x0/0x230
[ 2521.964429]  [<c0137732>] __do_softirq+0x92/0x120
[ 2521.964436]  [<c013781d>] do_softirq+0x5d/0x60
[ 2521.964443]  [<c0137995>] irq_exit+0x55/0x90
[ 2521.964450]  [<c0113f8d>] smp_apic_timer_interrupt+0x5d/0x90
[ 2521.964459]  [<c01050f8>] apic_timer_interrupt+0x28/0x30
[ 2521.964468]  [<c010acca>] ? mwait_idle+0x4a/0x50
[ 2521.964476]  [<c010288d>] cpu_idle+0x7d/0x140
[ 2521.964483]  [<c037b471>] start_secondary+0x9d/0xcc
[ 2521.964492]  =======================
[ 2521.964497] ---[ end trace 436b5311b7770f56 ]---
[ 2521.983387] r8169: eth2: link up
Comment 8 Victor Pablos Ceruelo 2009-12-21 12:19:25 UTC
Same bug on Debian kernels 2.6.29-4, 2.6.30-8 and 2.6.32-2.

See:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528362
http://www.kerneloops.org/submitresult.php?number=1044137

Probably a duplicate of #12500:
http://bugzilla.kernel.org/show_bug.cgi?id=12500

and of #14709 (even with a non-gigabit Ethernet card):
http://bugzilla.kernel.org/show_bug.cgi?id=14709

and of #10134 (because sometimes the system hangs and it is not possible to get any log):
http://bugzilla.kernel.org/show_bug.cgi?id=10134
Comment 9 Victor Pablos Ceruelo 2009-12-21 12:29:48 UTC
The problem usually goes away for a while when booting with the parameters

noacpi
pci=nomsi

or even after reloading the mii and r8169 modules.

But sooner or later it happens again.
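
When testing this workaround, it helps to confirm that the running kernel was really booted with both parameters. A minimal Python sketch follows, not part of the original report: the helper names missing_workaround_params and read_cmdline are hypothetical, and the parameter spellings are copied from the comment as written, without checking that the kernel recognises both of them.

```python
from pathlib import Path

# Parameters named in the comment above, copied as reported (not verified).
WORKAROUND_PARAMS = ("noacpi", "pci=nomsi")


def missing_workaround_params(cmdline, wanted=WORKAROUND_PARAMS):
    """Return the wanted parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    return Path(path).read_text().strip()
```

On a Linux machine, missing_workaround_params(read_cmdline()) returns an empty list when both parameters are on the command line.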
Comment 10 vadim 2010-05-25 17:19:56 UTC
I had this problem with 2.6.33 from FC13.
It is partially solved as of 2.6.34-rc4: no kernel error appears any more.
But the interface still freezes for 10-20 seconds because an RX overflow occurs.
Using the "noacpi" and "pci=nomsi" boot parameters helps.
Tested with an 8169B rev2, mac_version 0x14.
Comment 11 vadim 2010-05-28 15:20:22 UTC
The problem seems to be fixed in 2.6.34-rc7.

Backporting the module changes to the Fedora 13 kernel does not help. The kernel error disappears, but something strange in MSI support causes RX overflows to occur too quickly: the network works for 5 seconds, then freezes for 10 seconds.

There are no problems with 2.6.34-rc7, and possibly with earlier release candidates, but the module fixes came in with 2.6.34-rc4.
Comment 12 Florian Mickler 2010-10-25 18:59:49 UTC
Thank you for following up on this!

Could you please test the latest 2.6.27.y and 2.6.32.y stable kernels, to see if that issue is also fixed there? 

Regards,
Flo
Comment 13 Florian Mickler 2010-11-11 19:13:46 UTC
Alright, if it is not fixed in 2.6.32.y and 2.6.27.y, please reopen this bug report.

There are 5 commits from 2.6.34 regarding r8169 in 2.6.32.y.

Regards,
Flo
Comment 14 david b 2011-01-12 08:06:04 UTC
Actually this may not be fixed... :/
I picked up a Netgear GA311 today and it doesn't work properly.
It is just spewing management frames out onto the network uncontrollably (perhaps the hardware I picked up is broken).

Anyway, here is some kernel output.

[  693.610782] r8169 0000:06:00.0: eth1: link down
[  697.165046] r8169 0000:06:00.0: eth1: link up
[  841.824026] ------------[ cut here ]------------
[  841.824038] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0x151/0x1ff()
[  841.824041] Hardware name:         
[  841.824044] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
[  841.824048] Modules linked in: ipv6 joydev hid_logitech ff_memless usbhid hid loop snd_hda_codec_realtek tpm_tis serio_raw tpm tpm_bios psmouse evdev i2c_i801 i2c_core pcspkr snd_hda_intel parport_pc parport processor rng_core snd_hda_codec button snd_pcm iTCO_wdt snd_timer shpchp intel_agp pci_hotplug intel_gtt snd soundcore snd_page_alloc ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc dm_crypt dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid1 md_mod sg sr_mod sd_mod cdrom crc_t10dif ide_pci_generic piix ide_core r8169 ata_piix ata_generic e100 mii libata ehci_hcd scsi_mod uhci_hcd thermal fan thermal_sys
[  841.824127] Pid: 0, comm: swapper Tainted: G   M        2.6.37 #1
[  841.824130] Call Trace:
[  841.824134]  <IRQ>  [<ffffffff810441eb>] warn_slowpath_common+0x80/0x98
[  841.824147]  [<ffffffff81044297>] warn_slowpath_fmt+0x41/0x43
[  841.824152]  [<ffffffff8127dec4>] dev_watchdog+0x151/0x1ff
[  841.824158]  [<ffffffff8105aeca>] ? __queue_work+0x24a/0x259
[  841.824164]  [<ffffffff81050a86>] run_timer_softirq+0x210/0x2de
[  841.824169]  [<ffffffff8127dd73>] ? dev_watchdog+0x0/0x1ff
[  841.824176]  [<ffffffff81066cf0>] ? ktime_get+0x60/0xb9
[  841.824181]  [<ffffffff8104a0c8>] __do_softirq+0xd3/0x19b
[  841.824186]  [<ffffffff8106ad3d>] ? tick_program_event+0x21/0x23
[  841.824192]  [<ffffffff8100395c>] call_softirq+0x1c/0x28
[  841.824196]  [<ffffffff810055ad>] do_softirq+0x41/0x7e
[  841.824200]  [<ffffffff81049f53>] irq_exit+0x36/0x78
[  841.824207]  [<ffffffff8101b745>] smp_apic_timer_interrupt+0x88/0x96
[  841.824213]  [<ffffffff81003413>] apic_timer_interrupt+0x13/0x20
[  841.824216]  <EOI>  [<ffffffff8100a460>] ? mwait_idle+0xbc/0xca
[  841.824226]  [<ffffffff81062f20>] ? atomic_notifier_call_chain+0x13/0x15
[  841.824231]  [<ffffffff81001e46>] cpu_idle+0xb4/0x125
[  841.824239]  [<ffffffff812dc679>] rest_init+0x6d/0x6f
[  841.824245]  [<ffffffff816c1d60>] start_kernel+0x3c8/0x3d3
[  841.824250]  [<ffffffff816c12b1>] x86_64_start_reservations+0xb8/0xbc
[  841.824255]  [<ffffffff816c13bb>] x86_64_start_kernel+0x106/0x115
[  841.824259] ---[ end trace 370ec34b2c707a8e ]---
[  841.840080] r8169 0000:06:00.0: eth1: link up


lspci 
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
Comment 15 Florian Mickler 2011-01-12 11:59:00 UTC
Can you bisect it? As far as I understand, this bug was introduced between 2.6.27 and 2.6.28?
Comment 16 Francois Romieu 2011-07-21 10:03:48 UTC
I would welcome several things:

- above all, the XID info and the ethtool -d output. lspci is not specific enough.
  You can find the XID line in dmesg. Attaching a complete dmesg is fine.

- a status report with a current kernel if the bug does not take ages to
  happen - Doug's problem may be a bit different. I do not have the resources
  to go out and figure out the different sauces in each vendor's kernel.

"current kernel" means either Linus's or David Miller's net-next git
branches. For instance see:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
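
The XID line requested above can be pulled out of a saved dmesg with a few lines of Python. This is only a sketch: the function name find_xid_lines is hypothetical, and it assumes nothing about the line beyond the literal token "XID", since the rest of the message format may differ between kernel versions.

```python
import re

# Match "XID" as a whole word; everything else on the line is left unparsed.
XID_RE = re.compile(r"\bXID\b")


def find_xid_lines(dmesg_text):
    """Return the lines of a dmesg dump that mention an XID, whitespace-trimmed."""
    return [line.strip() for line in dmesg_text.splitlines()
            if XID_RE.search(line)]
```

Run it on the output of dmesg saved to a file and paste the matching lines into the report together with the ethtool -d output.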

David B.: is it an identified regression for you, or did you only
experience NETDEV WATCHDOG messages _and_ loss of connectivity without
any previously known working kernel version? Last time I tried, my GA311
sent MAC control frames too, but it did not crash under pktgen (less than
a week ago, current davem-next).

-- 
Ueimor