Bug 34092 - intermittent device hangs with e1000e
Summary: intermittent device hangs with e1000e
Status: RESOLVED OBSOLETE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-29 15:47 UTC by Bill Nottingham
Modified: 2012-08-29 17:33 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Description Bill Nottingham 2011-04-29 15:47:49 UTC
MSI mini-itx Atom motherboard with:

01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

onboard.

Roughly once a month (hard to state definitively, since it usually takes me a week or two to notice), I get:

Oct  7 17:09:41 localhost kernel: ------------[ cut here ]------------
Oct  7 17:09:41 localhost kernel: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0xc6/0x154()
Oct  7 17:09:41 localhost kernel: Hardware name: A9830IMS
Oct  7 17:09:41 localhost kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Oct  7 17:09:41 localhost kernel: Modules linked in: sunrpc ipv6 cpufreq_ondemand acpi_cpufreq snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd iTCO_wdt soundcore iTCO_vendor_support e1000e snd_page_alloc serio_raw i2c_i801 raid1 i915 drm_kms_helper usb_storage drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Oct  7 17:09:41 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.34.7-56.fc13.i686.PAE #1
Oct  7 17:09:41 localhost kernel: Call Trace:
Oct  7 17:09:41 localhost kernel: [<c043f32a>] warn_slowpath_common+0x6a/0x81
Oct  7 17:09:41 localhost kernel: [<c072646c>] ? dev_watchdog+0xc6/0x154
Oct  7 17:09:41 localhost kernel: [<c043f37f>] warn_slowpath_fmt+0x29/0x2c
Oct  7 17:09:41 localhost kernel: [<c072646c>] dev_watchdog+0xc6/0x154
Oct  7 17:09:41 localhost kernel: [<c0449abc>] ? internal_add_timer+0x93/0x97
Oct  7 17:09:41 localhost kernel: [<c0449b58>] ? cascade+0x50/0x63
Oct  7 17:09:41 localhost kernel: [<c0449cd2>] run_timer_softirq+0x167/0x1e9
Oct  7 17:09:41 localhost kernel: [<c07263a6>] ? dev_watchdog+0x0/0x154
Oct  7 17:09:41 localhost kernel: [<c0444751>] __do_softirq+0xb1/0x157
Oct  7 17:09:41 localhost kernel: [<c044482d>] do_softirq+0x36/0x41
Oct  7 17:09:41 localhost kernel: [<c044494b>] irq_exit+0x2e/0x61
Oct  7 17:09:41 localhost kernel: [<c041e7e5>] smp_apic_timer_interrupt+0x73/0x81
Oct  7 17:09:41 localhost kernel: [<c07a37f5>] apic_timer_interrupt+0x31/0x38
Oct  7 17:09:41 localhost kernel: [<c040edba>] ? mwait_idle+0x61/0x6c
Oct  7 17:09:41 localhost kernel: [<c0407e94>] cpu_idle+0x96/0xb2
Oct  7 17:09:41 localhost kernel: [<c079eaea>] start_secondary+0x24d/0x28d
Oct  7 17:09:41 localhost kernel: ---[ end trace e6b78ff394e12b5c ]---
Oct  7 17:09:41 localhost kernel: 0000:01:00.0: eth0: Error reading PHY register

or

Apr 25 17:09:51 sulaco kernel: [579284.016028] ------------[ cut here ]------------
Apr 25 17:09:51 sulaco kernel: [579284.016050] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0xe2/0x147()
Apr 25 17:09:51 sulaco kernel: [579284.016059] Hardware name: A9830IMS
Apr 25 17:09:51 sulaco kernel: [579284.016066] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Apr 25 17:09:51 sulaco kernel: [579284.016072] Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm microcode snd_timer i2c_i801 serio_raw e1000e snd joydev soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support raid1 i915 usb_storage uas drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Apr 25 17:09:51 sulaco kernel: [579284.016140] Pid: 0, comm: swapper Not tainted 2.6.38.3-15.rc1.fc15.i686.PAE #1
Apr 25 17:09:51 sulaco kernel: [579284.016145] Call Trace:
Apr 25 17:09:51 sulaco kernel: [579284.016158]  [<c044200d>] ? warn_slowpath_common+0x7c/0x91
Apr 25 17:09:51 sulaco kernel: [579284.016167]  [<c076036e>] ? dev_watchdog+0xe2/0x147
Apr 25 17:09:51 sulaco kernel: [579284.016175]  [<c076036e>] ? dev_watchdog+0xe2/0x147
Apr 25 17:09:51 sulaco kernel: [579284.016183]  [<c04420ad>] ? warn_slowpath_fmt+0x33/0x35
Apr 25 17:09:51 sulaco kernel: [579284.016191]  [<c076036e>] ? dev_watchdog+0xe2/0x147
Apr 25 17:09:51 sulaco kernel: [579284.016201]  [<c044ce3d>] ? run_timer_softirq+0x152/0x207
Apr 25 17:09:51 sulaco kernel: [579284.016210]  [<c076028c>] ? dev_watchdog+0x0/0x147
Apr 25 17:09:51 sulaco kernel: [579284.016219]  [<c04472f8>] ? __do_softirq+0xa9/0x163
Apr 25 17:09:51 sulaco kernel: [579284.016227]  [<c044724f>] ? __do_softirq+0x0/0x163
Apr 25 17:09:51 sulaco kernel: [579284.016235]  [<c044724f>] ? __do_softirq+0x0/0x163
Apr 25 17:09:51 sulaco kernel: [579284.016240]  <IRQ>  [<c044744b>] ? irq_exit+0x3c/0x70
Apr 25 17:09:51 sulaco kernel: [579284.016255]  [<c040b193>] ? do_IRQ+0x7e/0x92
Apr 25 17:09:51 sulaco kernel: [579284.016263]  [<c0409c30>] ? common_interrupt+0x30/0x38
Apr 25 17:09:51 sulaco kernel: [579284.016271]  [<c04400e0>] ? mmput+0xab/0xc8
Apr 25 17:09:51 sulaco kernel: [579284.016280]  [<c061859d>] ? intel_idle+0xc0/0xe7
Apr 25 17:09:51 sulaco kernel: [579284.016290]  [<c0726b4c>] ? cpuidle_idle_call+0xc5/0x136
Apr 25 17:09:51 sulaco kernel: [579284.016298]  [<c0408460>] ? cpu_idle+0x8e/0xa8
Apr 25 17:09:51 sulaco kernel: [579284.016308]  [<c07cb401>] ? rest_init+0x5d/0x5f
Apr 25 17:09:51 sulaco kernel: [579284.016317]  [<c0a70827>] ? start_kernel+0x357/0x35d
Apr 25 17:09:51 sulaco kernel: [579284.016325]  [<c0a7021d>] ? unknown_bootoption+0x0/0x19e
Apr 25 17:09:51 sulaco kernel: [579284.016334]  [<c0a700e7>] ? i386_start_kernel+0xd6/0xdc
Apr 25 17:09:51 sulaco kernel: [579284.016340] ---[ end trace 643bcc246665b4f3 ]---
Apr 25 17:09:51 sulaco kernel: [579284.016370] e1000e 0000:01:00.0: eth0: Reset adapter

When this happens, the device is unresponsive.
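
Since the hang can go unnoticed for a week or two, one way to catch it sooner is to scan the kernel log for the watchdog line shown in both traces above. The following is a minimal sketch, not part of the original report: the function name find_watchdog_timeouts is hypothetical, and the pattern is derived only from the two log excerpts quoted here, so other kernel versions may word the message differently.

```python
import re

# Pattern derived from the "NETDEV WATCHDOG" lines quoted in this report.
# Assumption: other kernels use the same wording.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return (interface, driver, queue) for each watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "Oct  7 17:09:41 localhost kernel: Hardware name: A9830IMS",
        "Oct  7 17:09:41 localhost kernel: NETDEV WATCHDOG: eth0 (e1000e): "
        "transmit queue 0 timed out",
    ]
    print(find_watchdog_timeouts(sample))
```

Feeding it the lines of a syslog file (for example from a periodic job) would flag the timeout shortly after it is logged.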

When I attempt to reset it by reloading the module, I only get:

Apr 27 00:22:01 sulaco kernel: [691614.851609] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.20-k2
Apr 27 00:22:01 sulaco kernel: [691614.851619] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
Apr 27 00:22:01 sulaco kernel: [691614.851731] e1000e 0000:01:00.0: enabling device (0000 -> 0002)
Apr 27 00:22:01 sulaco kernel: [691614.851756] e1000e 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Apr 27 00:22:01 sulaco kernel: [691614.853345] e1000e 0000:01:00.0: PCI INT A disabled
Apr 27 00:22:01 sulaco kernel: [691614.853597] e1000e: probe of 0000:01:00.0 failed with error -2
Apr 27 00:22:01 sulaco kernel: [691614.853648] e1000e 0000:02:00.0: enabling device (0000 -> 0002)
Apr 27 00:22:01 sulaco kernel: [691614.853670] e1000e 0000:02:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Apr 27 00:22:01 sulaco kernel: [691614.854341] e1000e 0000:02:00.0: PCI INT A disabled
Apr 27 00:22:01 sulaco kernel: [691614.854367] e1000e: probe of 0000:02:00.0 failed with error -2

I need to reboot to get networking back.
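
The "failed with error -2" in the probe lines above follows the kernel convention of returning a negated errno value. A small sketch for decoding such a value is shown here; the helper name describe_probe_error is hypothetical, and the decoding only names the error code. It does not identify which step of the driver's probe routine returned it.

```python
import errno
import os


def describe_probe_error(code):
    """Map a negative kernel return value such as -2 to (errno name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # The value reported for both 0000:01:00.0 and 0000:02:00.0.
    print(describe_probe_error(-2))
```

On Linux, the name reported for -2 is ENOENT.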

On occasion with 2.6.37/2.6.38, I have also gotten messages of the form:

Apr 20 01:44:00 sulaco kernel: [91733.059212] e1000e 0000:01:00.0: eth0: Detected Hardware Unit Hang:
Apr 20 01:44:00 sulaco kernel: [91733.059216]   TDH                  <4b>
Apr 20 01:44:00 sulaco kernel: [91733.059219]   TDT                  <3c>
Apr 20 01:44:00 sulaco kernel: [91733.059221]   next_to_use          <3c>
Apr 20 01:44:00 sulaco kernel: [91733.059223]   next_to_clean        <48>
Apr 20 01:44:00 sulaco kernel: [91733.059225] buffer_info[next_to_clean]:
Apr 20 01:44:00 sulaco kernel: [91733.059228]   time_stamp           <0>
Apr 20 01:44:00 sulaco kernel: [91733.059230]   next_to_watch        <3d>
Apr 20 01:44:00 sulaco kernel: [91733.059232]   jiffies              <5732863>
Apr 20 01:44:00 sulaco kernel: [91733.059234]   next_to_watch.status <0>
Apr 20 01:44:00 sulaco kernel: [91733.059237] MAC Status             <80080783>
Apr 20 01:44:00 sulaco kernel: [91733.059239] PHY Status             <796d>
Apr 20 01:44:00 sulaco kernel: [91733.059241] PHY 1000BASE-T Status  <3c00>
Apr 20 01:44:00 sulaco kernel: [91733.059243] PHY Extended Status    <3000>
Apr 20 01:44:00 sulaco kernel: [91733.059246] PCI Status             <10>

at some point prior to the tx queue hang (in this case, 5 days before).
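
For anyone comparing several of these "Detected Hardware Unit Hang" dumps, the fields can be pulled into numbers with a short script. This is a rough sketch under stated assumptions: the names parse_hang_dump and ring_distance are hypothetical, every bracketed value is assumed to be hexadecimal (as <4b> and <3c> suggest), and the ring size of 256 used in the example is an assumption, since the report does not state the configured transmit ring size.

```python
import re

# Matches "...] <name>   <hexvalue>" at the end of a syslog line.
FIELD_RE = re.compile(r"\]\s+(?P<name>.+?)\s+<(?P<value>[0-9a-fA-F]+)>\s*$")


def parse_hang_dump(lines):
    """Return {field name: int} for every 'name <hex>' line in the dump."""
    fields = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            # Assumption: all values in the dump are printed in hexadecimal.
            fields[match.group("name")] = int(match.group("value"), 16)
    return fields


def ring_distance(head, tail, ring_size):
    """Number of descriptors from head up to tail on a circular ring."""
    return (tail - head) % ring_size


if __name__ == "__main__":
    dump = [
        "Apr 20 01:44:00 sulaco kernel: [91733.059216]   TDH                  <4b>",
        "Apr 20 01:44:00 sulaco kernel: [91733.059219]   TDT                  <3c>",
        "Apr 20 01:44:00 sulaco kernel: [91733.059225] buffer_info[next_to_clean]:",
    ]
    fields = parse_hang_dump(dump)
    # 256 is an assumed ring size, not a value taken from the report.
    print(fields, ring_distance(fields["TDH"], fields["TDT"], 256))
```

The distance from TDH to TDT gives the number of descriptors the hardware has been handed but has not yet processed, which is only meaningful if the assumed ring size matches the real one.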
Comment 1 Alan 2012-08-23 13:45:32 UTC
Is this still seen with modern kernels?
Comment 2 Bill Nottingham 2012-08-23 14:48:36 UTC
It appears to have gone away as of 3.4/3.5 or so.