Bug 47331 - e1000: Detected Tx Unit Hang - network is not operational
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-10 19:51 UTC by Khomutov Vladimir
Modified: 2018-03-01 13:29 UTC
CC List: 21 users

See Also:
Kernel Version: 3.5.3
Tree: Mainline
Regression: No


Attachments
dmesg (1.96 KB, text/plain)
2012-09-10 19:51 UTC, Khomutov Vladimir
ethtool (1.20 KB, text/plain)
2012-09-10 19:52 UTC, Khomutov Vladimir
kernel config (91.25 KB, text/plain)
2012-09-10 19:52 UTC, Khomutov Vladimir
lshw (25.38 KB, text/plain)
2012-09-10 19:52 UTC, Khomutov Vladimir
lspci -vvv (41.34 KB, text/plain)
2012-09-10 19:53 UTC, Khomutov Vladimir
uname (131 bytes, text/plain)
2012-09-10 19:53 UTC, Khomutov Vladimir
dmesg output with verbose settings on (26.17 KB, text/plain)
2012-10-25 05:01 UTC, Khomutov Vladimir
lspci -vvv after the issue occurred (40.12 KB, text/plain)
2012-10-25 05:04 UTC, Khomutov Vladimir
lspci -vvv after reboot (no issues) (40.10 KB, text/plain)
2012-10-25 17:52 UTC, Khomutov Vladimir
kernel log with full dump of registers after the issue (148.67 KB, text/x-log)
2012-10-25 17:53 UTC, Khomutov Vladimir
logs demonstrating the issue with 3.7.9 and 2 gig of ram (133.01 KB, application/x-gzip)
2013-02-27 20:30 UTC, Khomutov Vladimir
dmesg (31.85 KB, text/plain)
2015-02-10 23:02 UTC, abandoned account

Description Khomutov Vladimir 2012-09-10 19:51:47 UTC
Created attachment 79641
dmesg

Under any significant load the driver starts producing
"Detected Tx Unit Hang" messages and resets the adapter, so
networking is not really usable.
By significant load I mean an attempt to download a big (>2 MB)
file over ssh.
Please see the attached files for system details.
The bug is 100% reproducible.

It looks like the problem has been known for a long time, but
the situation with the Linux e1000 drivers is confusing: the
kernel contains the 7.x version (alive), and there is an
8.x version by Intel on SourceForge (not maintained, and it
doesn't compile for newer kernels). A lot of the suggestions
are about Intel's version of the driver.
Please shed some light on the situation with e1000 in Linux...
Comment 1 Khomutov Vladimir 2012-09-10 19:52:14 UTC
Created attachment 79651
ethtool
Comment 2 Khomutov Vladimir 2012-09-10 19:52:32 UTC
Created attachment 79661
kernel config
Comment 3 Khomutov Vladimir 2012-09-10 19:52:54 UTC
Created attachment 79671
lshw
Comment 4 Khomutov Vladimir 2012-09-10 19:53:18 UTC
Created attachment 79681
lspci -vvv
Comment 5 Khomutov Vladimir 2012-09-10 19:53:37 UTC
Created attachment 79691
uname
Comment 6 Stefan de Konink 2012-10-16 20:08:27 UTC
I want to confirm this also happened to me today using 3.3.1-gentoo after 192 days of uptime, while the system was _not_ under any significant load.

e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
buffer_info[next_to_clean]
  time_stamp           <5fd26739>
  next_to_watch        <74>
  jiffies              <5fd267fb>
  next_to_watch.status <0>
e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
buffer_info[next_to_clean]
  time_stamp           <5fd26739>
  next_to_watch        <74>
  jiffies              <5fd268c3>
  next_to_watch.status <0>
e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
buffer_info[next_to_clean]
  time_stamp           <5fd26739>
  next_to_watch        <74>
  jiffies              <5fd2698b>
  next_to_watch.status <0>
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x1b3/0x1bc()
Hardware name: PowerEdge 650              
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Modules linked in: usb_storage usb_libusual xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables ohci_hcd usbcore usb_common
Pid: 0, comm: swapper Not tainted 3.3.1-gentoo #3
Call Trace:
 [<c101fa16>] warn_slowpath_common+0x67/0x8e
 [<c123d824>] ? dev_watchdog+0x1b3/0x1bc
 [<c123d824>] ? dev_watchdog+0x1b3/0x1bc
 [<c101fab9>] warn_slowpath_fmt+0x2e/0x30
 [<c123d824>] dev_watchdog+0x1b3/0x1bc
 [<c102a8d1>] run_timer_softirq+0xfb/0x28a
 [<c123d671>] ? netif_carrier_off+0x26/0x26
 [<c102461b>] __do_softirq+0x72/0x14a
 [<c10245a9>] ? __tasklet_hi_schedule_first+0x4b/0x4b
 <IRQ>  [<c102485f>] ? irq_exit+0x64/0x85
 [<c100395d>] ? do_IRQ+0x3d/0x84
 [<c12b6029>] ? common_interrupt+0x29/0x30
 [<c10082b3>] ? default_idle+0x4d/0x129
 [<c100167f>] ? cpu_idle+0x40/0x63
 [<c12aa515>] ? rest_init+0x55/0x60
 [<c13ce607>] ? start_kernel+0x24c/0x252
 [<c13ce13f>] ? loglevel+0x2b/0x2b
 [<c13ce044>] ? i386_start_kernel+0x44/0x46
---[ end trace 7a625de8614c18af ]---
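
For anyone reading the dump above, here is a minimal Python sketch of how the numbers can be compared. It is not part of the driver. TDH and TDT are the hardware transmit descriptor head and tail registers; the other fields are the driver's own ring bookkeeping. The helper names `parse_hang_reports` and `summarize` are made up for this example, and the ring size of 256 descriptors is an assumption (I believe it is the e1000 default; check `ethtool -g` on your system).

```python
import re

# Fields copied from the three reports above (trimmed to the ones used here).
SAMPLE_LOG = """\
e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
  time_stamp           <5fd26739>
  jiffies              <5fd267fb>
e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
  time_stamp           <5fd26739>
  jiffies              <5fd268c3>
e1000 0000:01:03.0: eth0: Detected Tx Unit Hang
  TDH                  <bd>
  TDT                  <bd>
  next_to_use          <bd>
  next_to_clean        <73>
  time_stamp           <5fd26739>
  jiffies              <5fd2698b>
"""

# A dump line looks like "  name   <hexvalue>".
FIELD = re.compile(r"^\s*(.+?)\s+<([0-9a-f]+)>\s*$")


def parse_hang_reports(log):
    """Return one dict per 'Detected Tx Unit Hang' report, values decoded from hex."""
    reports = []
    for line in log.splitlines():
        if "Detected Tx Unit Hang" in line:
            reports.append({})
            continue
        match = FIELD.match(line)
        if match and reports:
            reports[-1][match.group(1)] = int(match.group(2), 16)
    return reports


def summarize(report, ring_size=256):  # ring_size is an assumed default
    """Derive a few comparable numbers from one report."""
    return {
        # Head == tail: the hardware has nothing left to fetch.
        "hw_idle": report["TDH"] == report["TDT"],
        # Descriptors the driver queued but has not yet cleaned up.
        "pending": (report["next_to_use"] - report["next_to_clean"]) % ring_size,
        # Age of the oldest uncleaned buffer, in jiffies (32-bit wraparound safe).
        "age_jiffies": (report["jiffies"] - report["time_stamp"]) & 0xFFFFFFFF,
    }


for report in parse_hang_reports(SAMPLE_LOG):
    print(summarize(report))
```

On these three reports the hardware head and tail agree, 74 descriptors are still waiting to be cleaned, and the same buffer is 194, 394 and 594 jiffies old. One possible reading is that the hardware considers the ring drained while the driver never sees a completed status for the oldest packet, but that is an interpretation, not something the dump proves on its own.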
Comment 7 Tushar 2012-10-23 22:10:55 UTC
Set the message level with 'ethtool -s ethx msglvl 0x2c01' so that the driver prints hardware ring info when the problem occurs.
Please submit the full dmesg log and the lspci -vvv output after the issue occurs.

-Tushar
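
A note on the value 0x2c01: msglvl is a bitmask. The small Python sketch below decodes it, assuming the bit assignments of the NETIF_MSG_* flags in include/linux/netdevice.h as I understand them; verify them against your own kernel headers. The function name `decode_msglvl` is made up for this example.

```python
# Bit values assumed from the kernel's NETIF_MSG_* definitions.
NETIF_MSG = {
    "drv": 0x0001,
    "probe": 0x0002,
    "link": 0x0004,
    "timer": 0x0008,
    "ifdown": 0x0010,
    "ifup": 0x0020,
    "rx_err": 0x0040,
    "tx_err": 0x0080,
    "tx_queued": 0x0100,
    "intr": 0x0200,
    "tx_done": 0x0400,
    "rx_status": 0x0800,
    "pktdata": 0x1000,
    "hw": 0x2000,
    "wol": 0x4000,
}


def decode_msglvl(value):
    """List the message classes enabled by a msglvl bitmask, lowest bit first."""
    return [name for name, bit in NETIF_MSG.items() if value & bit]


print(decode_msglvl(0x2C01))  # ['drv', 'tx_done', 'rx_status', 'hw']
```

Under those assumptions, 0x2c01 turns on the drv, tx_done, rx_status and hw message classes.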
Comment 8 Khomutov Vladimir 2012-10-25 05:01:50 UTC
Created attachment 84761
dmesg output with verbose settings on
Comment 9 Khomutov Vladimir 2012-10-25 05:04:01 UTC
Created attachment 84771
lspci -vvv after the issue occurred
Comment 10 Tushar 2012-10-25 05:14:25 UTC
The dmesg log seems to have been overwritten; it does not contain the Tx ring info. If you have not attached the full dmesg, please attach it.

I do see a PCI Master Abort error in lspci:
07:03.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
                                                                              ^^^^^^^^

You may need to boot the system fresh and dump lspci -vvv before and after the Tx hang occurs to confirm that MAbort is the cause.
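
The before/after comparison suggested here can be done by eye, but a small script makes it harder to miss a flag. A minimal Python sketch follows, assuming each input is the lspci -vvv excerpt for a single device, so that the first "Status:" line is the PCI status register. The "before" text is a made-up illustration with MAbort clear, not taken from the attachments; `status_flags` and `changed_flags` are hypothetical helper names.

```python
import re

# A flag is a name (optionally prefixed by < or >) followed by + or -.
FLAG = re.compile(r"([<>]?\w+)([+-])(?!\w)")

# Illustrative "before" state (assumed, not from the attached logs).
BEFORE = """\
07:03.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
"""

# "After" state as quoted in the comment above.
AFTER = """\
07:03.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
"""


def status_flags(lspci_text):
    """Map each +/- flag on the first 'Status:' line to True (+) or False (-)."""
    for line in lspci_text.splitlines():
        if line.strip().startswith("Status:"):
            return {name: sign == "+" for name, sign in FLAG.findall(line)}
    return {}


def changed_flags(before, after):
    """Return {flag: (old, new)} for every status flag that differs."""
    old, new = status_flags(before), status_flags(after)
    return {name: (old.get(name), new[name]) for name in new if old.get(name) != new[name]}


print(changed_flags(BEFORE, AFTER))  # {'<MAbort': (False, True)}
```

Values without a +/- suffix, such as DEVSEL=medium, are ignored on purpose.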
Comment 11 Khomutov Vladimir 2012-10-25 17:52:36 UTC
Created attachment 84821
lspci -vvv after reboot (no issues)
Comment 12 Khomutov Vladimir 2012-10-25 17:53:43 UTC
Created attachment 84831
kernel log with full dump of registers after the issue
Comment 13 Tushar 2012-10-25 23:25:19 UTC
Comparing lspci -vvv before and after confirms that the root cause of the Tx hang is a PCI MAbort. (For some reason I don't see the Tx descriptor ring dump in the dmesg log.)

Looks like system has 8GB RAM. Can you test with only 2 GB ram and see if issue occurs?
Comment 14 Stefan de Konink 2012-10-25 23:28:13 UTC
My problem was on a much older system, with only 1.5GB of RAM. Doubt that is related.
Comment 15 Tushar 2012-10-25 23:33:14 UTC
Did this work at all before, and did the problem start appearing after a kernel upgrade?
Comment 16 Khomutov Vladimir 2012-10-26 10:08:05 UTC
(In reply to comment #13)
 
> Looks like system has 8GB RAM. Can you test with only 2 GB ram and see if
> issue occurs?

When I had just installed the hardware and hit the issue, I spent some time
looking for a solution. One of the possible causes named was RAM > 4 GB, so I
tried with 4 GB and the result was the same. I can't remember whether I tried
with 2 GB. Anyway, I need it working with 8 GB.
Comment 17 Tushar 2012-11-01 20:53:44 UTC
A PCI bus trace would be very helpful for finding the cause of the MAbort error. Do you have a facility to capture a bus trace?

If not, can you send me the dmesg log again with the Tx ring info?
(The dmesg log you sent last time, in Comment #12, did not dump the Tx ring. Make sure msglvl is set to 0x2c01: 'ethtool -s ethx msglvl 0x2c01')
Comment 18 Khomutov Vladimir 2012-11-01 21:01:27 UTC
I have no idea how to get PCI bus trace.
If it does not require some special hardware, I can try if you explain how.

I don't know why kernel didn't dump TX ring, since I've entered 
command 'ethtool -s ethx msglvl 0x2c01' and the result was that
there were a lot of verbose messages from driver, not just errors.

I thought that this line was one marking start of what you are looking for:

>> Oct 25 21:22:13 myhost kernel: e1000: Tx descriptor cache in 64bit format
Comment 19 Phil 2012-11-19 18:21:37 UTC
I upgraded a number of boxes using 82546GB from 3.1.x kernels to 3.6.x recently, and a number of them have started having these TX hangs regularly.  So what changed between 3.1 and 3.6 to cause this?  Any output I can provide which would assist?
Comment 20 Stefan de Konink 2012-11-19 18:27:19 UTC
With my report regarding 3.3.1 we can reduce that to: what happened between 3.1 and 3.3.1.
Comment 21 ulf kypke 2013-01-10 15:06:37 UTC
Hey,
I see this with kernel 2.6 as well as with kernel 3.3.8.
I use OpenWrt on a router with 5 Intel e1000 and e1000e cards.
Most of these routers have 1 or 2 GB of RAM.
The bug happens after approx. 5 hours of running, regardless of whether there is a lot of traffic or not.
I can reproduce this bug on various kernel versions, but with kernel 3.3.8 it happens far more often than with kernel 2.6.
I also posted this bug at 52571 (sorry for crossposting).

Best, Ulf
Comment 22 Tushar 2013-01-10 21:22:57 UTC
(In reply to comment #18)
> I have no idea how to get PCI bus trace.
> If it does not require some special hardware, I can try if you explain how.
> I don't know why kernel didn't dump TX ring, since I've entered 
> command 'ethtool -s ethx msglvl 0x2c01' and the result was that
> there were a lot of verbose messages from driver, not just errors.

Yes, please send me the full dmesg log.
> I thought that this line was one marking start of what you are looking for:
> >> Oct 25 21:22:13 myhost kernel: e1000: Tx descriptor cache in 64bit format

Would you also please try disabling TSO with 'ethtool -K ethx tso off' and see if that makes any difference? Meanwhile, I will see whether I can get hold of a system similar to yours and reproduce the issue locally.
Comment 23 Rich Ercolani 2013-02-06 23:42:44 UTC
So this looks a lot like the problem I once had with these cards.

http://sourceforge.net/p/e1000/bugs/266/ is the relevant bug report.

Verbatim quote from the bug report at the time, though it seems to be gone from the updated version (sf.net migrated bug trackers I suppose):

"We were able to reproduce this bug here, and verify that the system in
question has a cache coherency problem.  the driver is correctly updating
the system memory and then requesting hardware to DMA the data.  When the
hardware DMA request is completed by the memory controller, the data in
question is stale (the value prior to the update) and then the software
suffers an apparent "tx hang"

We are still investigating if there is a fix possible."

I believe they eventually did indeed have a fix in the 8.x series - at least, at some point I downloaded it and used it, and the problem went away.

The motherboard being Intel-branded implies it probably isn't a crap board with cache coherency bugs...one hopes, at least.  But that is the same era of hardware we had these problems on.
Comment 24 Tushar 2013-02-06 23:43:13 UTC
I am on vacation 02/06

-Tushar
Comment 25 Khomutov Vladimir 2013-02-27 20:30:05 UTC
We all hope the vacation was great =))

But back on topic:

1) I was able to reproduce the problem with kernel 3.7.9
2) The problem is 100% reproducible with 2 gig of RAM
3) Turning tso off doesn't help
4) I was able to get full logs with "ethtool -s ethx msglvl 0x2c01"

I'm attaching a tarball with logs (kernel log, lspci and some other related information) for all these cases.
Comment 26 Khomutov Vladimir 2013-02-27 20:30:42 UTC
Created attachment 94201
logs demonstrating the issue with 3.7.9 and 2 gig of ram
Comment 27 abandoned account 2015-02-10 23:02:13 UTC
Created attachment 166421
dmesg

I got this same thing, unexpectedly, inside VirtualBox (8 GB RAM) with kernel 3.16.5-gentoo (found on install-amd64-minimal-20141204.iso)
Comment 28 peter.eldridge.bailey 2016-03-09 05:36:49 UTC
I am also affected by this on 4.4.0-1-686-pae #1 SMP Debian 4.4.2-3 (2016-02-21).
Comment 29 Victor Pablos Ceruelo 2016-05-16 15:58:53 UTC
Me too, Ubuntu 4.4.0-22.39-generic 4.4.8 + Intel(R) PRO/1000 Network Driver - 3.2.6-k

Trying now with options:

ethtool -K eno1 gso off gro off tso off
ethtool -s eno1 msglvl 0x2c01
ethtool --set-eee eno1 eee off
ethtool --set-eee eno1 advertise 0

I have not yet tried disabling Active-State Power Management (boot option):

pcie_aspm=off

It seems the link goes down and comes back up at times, but there are no more details in dmesg.

I am thinking about forcing the network adapter to work at 100 Mb/s ...
Comment 30 Victor Pablos Ceruelo 2016-05-16 16:39:17 UTC
Hi again.

It seems the problem is fixed by setting the options I wrote before and reducing the speed to 100 Mb/s.

ethtool -s eno1 speed 100 duplex full

I've been downloading for half an hour at 600Kb-1Mb.
At home I do not need more, but it would be interesting to know why ...

There are more ideas around, like 

ethtool -K eno1 gso off gro off tso off lro off

and upgrading to e1000e-3.3.3

https://communities.intel.com/thread/70244
https://downloadcenter.intel.com/download/15817/Network-Adapter-Driver-for-PCI-E-Gigabit-Network-Connections-under-Linux-?v=t

Hope it helps others ...
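
When trying the offload workaround, it is worth checking that the settings actually took effect, since some features can be fixed by the driver. Below is a minimal Python sketch that inspects text captured from 'ethtool -k eno1'. The sample output is an illustrative, shortened example and not taken from any machine in this report; `parse_features` and `still_enabled` are hypothetical helper names, and the short-to-long name mapping reflects my understanding of ethtool's feature names.

```python
# Mapping from the short names used with 'ethtool -K' to the names printed by 'ethtool -k'.
SHORT_TO_LONG = {
    "tso": "tcp-segmentation-offload",
    "gso": "generic-segmentation-offload",
    "gro": "generic-receive-offload",
    "lro": "large-receive-offload",
}

# Illustrative sample only (assumed shape of 'ethtool -k' output).
ETHTOOL_K_SAMPLE = """\
Features for eno1:
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: off
    tx-tcp-segmentation: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off [fixed]
"""


def parse_features(text):
    """Parse 'name: on|off [fixed]' lines into {name: bool}."""
    features = {}
    for line in text.splitlines():
        name, sep, state = line.strip().partition(": ")
        words = state.split()
        if sep and words and words[0] in ("on", "off"):
            features[name] = words[0] == "on"
    return features


def still_enabled(text, short_names=("gso", "gro", "tso", "lro")):
    """Return the short names of offloads that are still switched on."""
    features = parse_features(text)
    return [s for s in short_names if features.get(SHORT_TO_LONG[s], False)]


print(still_enabled(ETHTOOL_K_SAMPLE))  # ['gro']
```

In the sample, gro is still on, so the workaround would not yet be fully applied.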
Comment 31 Till Schäfer 2018-03-01 13:29:19 UTC
I can confirm the issue with gentoo sources 4.15.2 on a C220 chipset under heavy load (> 500 Mbit/s; the adapter hangs every few seconds). I can also confirm that the following workaround helps. The issue only popped up recently, after using the hardware without any problems up to recent kernels.

ethtool -K eno1 gso off gro off tso off



00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I217-LM [8086:153a] (rev 05)
        Subsystem: ASUSTeK Computer Inc. Ethernet Connection I217-LM [1043:8535]
        Flags: bus master, fast devsel, latency 0, IRQ 27
        Memory at f7d00000 (32-bit, non-prefetchable) [size=128K]
        Memory at f7d35000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at f080 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] PCI Advanced Features
        Kernel driver in use: e1000e
