Bug 118721 - e1000e hardware unit hangs when TSO is on
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-22 21:30 UTC by Steinar H. Gunderson
Modified: 2025-05-21 07:21 UTC
CC List: 11 users

See Also:
Kernel Version: 4.6.0-trunk-amd64 (Debian)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
e1000e Detected Hardware Unit Hang when using VLAN and routing (72.68 KB, text/plain)
2016-11-19 02:54 UTC, Stefan Agner
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25 (365.66 KB, text/x-log)
2017-02-03 17:02 UTC, Marcel Ziswiler
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25 (361.12 KB, text/x-log)
2017-02-03 17:03 UTC, Marcel Ziswiler

Description Steinar H. Gunderson 2016-05-22 21:30:46 UTC
Hi,

I've seen this with a lot of (mainline) kernels, but I've only gotten around to reporting it now, on a Debian kernel.

I have an Atom system with an onboard e1000e. It's not particularly loaded (it handles my home network), but it has two VLANs on a bridge. Every few minutes, it hangs with a message like this:

[26318.324173] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH                  <92>
                 TDT                  <a7>
                 next_to_use          <a7>
                 next_to_clean        <92>
               buffer_info[next_to_clean]:
                 time_stamp           <100633915>
                 next_to_watch        <95>
                 jiffies              <100634004>
                 next_to_watch.status <0>
               MAC Status             <80083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[26320.323906] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
[26320.327575] br0: port 2(eno1.10) entered disabled state
[26320.327786] br0: port 1(eno1.11) entered disabled state
[26324.141990] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[26324.142387] br0: port 2(eno1.10) entered blocking state
[26324.142390] br0: port 2(eno1.10) entered forwarding state
[26324.142550] br0: port 1(eno1.11) entered blocking state
[26324.142553] br0: port 1(eno1.11) entered forwarding state

Then it resets, and continues.

If I turn off TSO (ethtool -K eno1 tso off), the problem goes away.
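For anyone comparing dumps: the bracketed values in the hang message are printed in hexadecimal, and TDH/TDT are the transmit descriptor head and tail indices, so the gap between them is the number of descriptors the driver has handed to the NIC that the NIC has not yet processed. Below is a minimal Python sketch of that arithmetic. The function names are made up for illustration, and the ring size of 256 is an assumption (check `ethtool -g` for the actual value on your interface).

```python
import re

# Matches dump lines such as "TDH                  <92>"; values are hex.
FIELD_RE = re.compile(r"^\s*([A-Za-z_][\w.\[\] -]*?)\s+<([0-9a-fA-F]+)>\s*$")

def parse_hang_dump(text):
    """Return {field name: int} for every 'name <hex>' line in a hang dump."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = int(match.group(2), 16)
    return fields

def pending_descriptors(fields, ring_size=256):
    """Descriptors queued by the driver (TDT) but not yet processed (TDH).

    ring_size=256 is an assumed TX ring size, not read from the device.
    """
    return (fields["TDT"] - fields["TDH"]) % ring_size

# Values copied from the dump in the description.
sample = """\
  TDH                  <92>
  TDT                  <a7>
  time_stamp           <100633915>
  jiffies              <100634004>
"""
fields = parse_hang_dump(sample)
print(pending_descriptors(fields))               # 21
print(fields["jiffies"] - fields["time_stamp"])  # 1775 jiffies
```

The modulo keeps the result correct when the tail index has wrapped around the end of the ring.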
Comment 1 Stefan Agner 2016-07-20 20:26:47 UTC
I have been seeing the same thing for quite a while on a Lenovo T431s with various Arch Linux kernels. I use the interface untagged and with one VLAN, tag 99. It seems to work fine as long as I don't use VLAN 99, but as soon as I get traffic through the VLAN I see those Hardware Unit Hangs quite often. Turning off TSO seems to alleviate the problem here too.

Currently I am running 4.6.3-1-ARCH. 

[147992.037386] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                  TDH                  <3d>
                  TDT                  <46>
                  next_to_use          <46>
                  next_to_clean        <3a>
                buffer_info[next_to_clean]:
                  time_stamp           <102a40703>
                  next_to_watch        <3d>
                  jiffies              <102a40a6e>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[147994.037539] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                  TDH                  <3d>
                  TDT                  <46>
                  next_to_use          <46>
                  next_to_clean        <3a>
                buffer_info[next_to_clean]:
                  time_stamp           <102a40703>
                  next_to_watch        <3d>
                  jiffies              <102a40cc6>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>

# lspci -v
...
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
	Subsystem: Lenovo Device 21f3
	Flags: bus master, fast devsel, latency 0, IRQ 31
	Memory at f1500000 (32-bit, non-prefetchable) [size=128K]
	Memory at f153b000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 4080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
...
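A side note on reading the two dumps in comment 1: TDH, TDT and time_stamp are identical in both, and only jiffies has advanced, which suggests the same descriptor is stuck in both. Dividing the jiffies difference by the difference in dmesg timestamps gives a rough estimate of the kernel tick rate, which in turn converts `jiffies - time_stamp` into seconds. The sketch below assumes that time_stamp holds the jiffies value from when the buffer was queued, and that the dmesg clock and jiffies advance at the same rate; the function names are hypothetical.

```python
def estimate_hz(t0, jiffies0, t1, jiffies1):
    """Ticks per second implied by two (dmesg seconds, jiffies) samples."""
    return (jiffies1 - jiffies0) / (t1 - t0)

def stalled_for(time_stamp, jiffies, hz):
    """Seconds between the stuck buffer's time_stamp and the dump."""
    return (jiffies - time_stamp) / hz

# Values copied from the two dumps in comment 1 (hex in the log).
hz = round(estimate_hz(147992.037386, 0x102a40a6e,
                       147994.037539, 0x102a40cc6))
print(hz)                                                    # 300
print(round(stalled_for(0x102a40703, 0x102a40a6e, hz), 2))   # 2.92
print(round(stalled_for(0x102a40703, 0x102a40cc6, hz), 2))   # 4.92
```

So under these assumptions the buffer had been waiting roughly 3 seconds at the first message and 5 seconds at the second.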
Comment 2 [account disabled by the administrator] 2016-08-19 02:33:08 UTC
Send me your syslog/dmesg, as there may be warnings from the driver that give further hints toward solving this problem.
Comment 3 Stefan Agner 2016-11-19 02:54:58 UTC
Created attachment 245091
e1000e Detected Hardware Unit Hang when using VLAN and routing

Sorry for the delay, dmesg attached now. Still reliably reproducible on 4.8.8.
Comment 4 Steinar H. Gunderson 2016-11-19 09:13:47 UTC
I was bitten by this on a Skylake NUC recently, too. Eventually I had to turn off absolutely all forms of acceleration (TSO, checksumming, scatter/gather…) _and_ compile a kernel (4.8.1) with CONFIG_PM=n. Neither was enough on its own.
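For reference, several offloads can be switched off in a single `ethtool -K` call. The comment above does not list the exact set that was disabled, so the feature list in this sketch is an assumption: it uses the short feature names from the ethtool man page for segmentation offloads, scatter/gather and TX/RX checksumming. The helper name is hypothetical, and it only builds the command; it does not run it.

```python
import shlex

# Assumed feature set, not the commenter's actual list.
DEFAULT_FEATURES = ("tso", "gso", "gro", "sg", "tx", "rx")

def offload_off_command(iface, features=DEFAULT_FEATURES):
    """Build (but do not run) an 'ethtool -K' command turning features off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd

print(shlex.join(offload_off_command("eno1")))
# ethtool -K eno1 tso off gso off gro off sg off tx off rx off
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` as root. Settings made this way do not survive a reboot, so they need to be reapplied from whatever network configuration mechanism the system uses.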

Unfortunately I had to leave the site before I could collect enough data, but there were no other warnings before the hangs.
Comment 5 Marcel Ziswiler 2017-02-03 17:02:55 UTC
Created attachment 253981
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25

I also see this issue when connected to an HPE 1820-8G (J9979A, running Linux) or HP 1810-8G (J9802A, running eCos) switch, with a separate VLAN being routed by my Lenovo T440s running Fedora 25. For example, running git clone on OpenCV from an ARM target that is connected to that VLAN through the switch, with the notebook routing it to the Internet, reliably triggers this within a few seconds. I tried both the older 4.8.16-300.fc25 and the latest 4.9.6-200.fc25 kernels.
Comment 6 Marcel Ziswiler 2017-02-03 17:03:39 UTC
Created attachment 253991
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25

And the log file from running the latest kernel.
Comment 7 S. Eckardt 2017-02-15 16:14:48 UTC
Same issue here.
Running debian 4.8 with 3.16.0-4-amd64 as a router with several VLANs. The hangs occur even when the system is not under any significant load. They occur with a Supermicro X9SCM-F and an Intel Desktop Board. I've already upgraded the driver to 3.3.5.3-NAPI and disabled EEE and ASPM. This did not change anything.
Disabling TSO seems to work as a workaround, but I don't like the idea of keeping it disabled.
I can supply further logs if helpful, but from my perspective they look similar to the ones already uploaded.
Comment 8 Dominik Mierzejewski 2017-11-27 11:01:06 UTC
Probably the same here. Fedora 26, Intel 82579V adapter:
# uname -r
4.13.13-200.fc26.x86_64
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:7751]
	Flags: bus master, fast devsel, latency 0, IRQ 28
	Memory at f7c00000 (32-bit, non-prefetchable) [size=128K]
	Memory at f7c38000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at f080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
# ethtool -i eno1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version: 
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

I've disabled TSO to see if it helps for now.
Comment 9 maze 2022-05-20 22:26:16 UTC
Same here, on Ubuntu 18.04:

# uname -r
4.15.0-177-generic
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
        Subsystem: Super Micro Computer Inc 82579LM Gigabit Network Connection (Lewisville) [15d9:1502]
        Flags: bus master, fast devsel, latency 0, IRQ 26
        Memory at f7a00000 (32-bit, non-prefetchable) [size=128K]
        Memory at f7a23000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at f020 [size=32]
        Capabilities: <access denied>
        Kernel driver in use: e1000e
        Kernel modules: e1000e
# ethtool -i em1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version: 
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Disabling TSO seems to have fixed the problem for me. (I needed to set it after a fresh boot, *before* the interface starts bailing out continually.)
Comment 10 Peter Jose De Sousa 2023-08-17 20:45:53 UTC
I also hit this issue. This might be helpful to others: reloading the module with the parameter Node=0 (the NUMA node my NIC is on; modprobe e1000e Node=0) appears to have worked around the issue.
Comment 11 Patrick Schaaf 2024-12-05 11:56:44 UTC
Getting this here on stable 6.12.1 kernel, after upgrading from previous LTS 6.6.63 on same hardware. Never saw it before. With 6.12.1 it happened twice, after around 5 days of uptime. System becomes unresponsive and unreachable. I'll try a third time with tso switched off, as mentioned earlier here.

The kernels are self-built from kernel.org sources, with config that I've been running for years.

There is nothing else related to e1000e in dmesg, prior to the issue appearing. Once it appears, the Hang message gets logged every two seconds.

Dec 05 02:07:57 teller kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
                                 TDH                  <46>
                                 TDT                  <6b>
                                 next_to_use          <6b>
                                 next_to_clean        <45>
                               buffer_info[next_to_clean]:
                                 time_stamp           <1023c6d0d>
                                 next_to_watch        <46>
                                 jiffies              <1023c6dd0>
                                 next_to_watch.status <0>
                               MAC Status             <80083>
                               PHY Status             <796d>
                               PHY 1000BASE-T Status  <3c00>
                               PHY Extended Status    <3000>
                               PCI Status             <10>
Comment 12 Patrick Schaaf 2024-12-05 12:07:07 UTC
In case it matters or helps someone understand the issue, the hardware is this, with an Intel(R) Celeron(R) 2955U processor:

> [    0.000000] DMI: CompuLab Ltd. Intense-PC2 (IPC2)/Intense-PC2 (IPC2), BIOS IPC2_3.330.3 X64 09/03/2014

which has a mix of ethernet controller types

> 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I218-LM (rev 04)
> 02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

That second controller is using the igb driver. Both ethernet ports are configured in an active-backup bond, but only eth0, the one with the e1000e driver that hangs, is Up and active.

Here's the dmesg output of kernel 6.12.1 from boot

> teller:~ # dmesg | egrep '(igb|e1000)'
> [    0.974808] e1000e: Intel(R) PRO/1000 Network Driver
> [    0.974811] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
> [    0.975004] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    1.047712] e1000e 0000:00:19.0 0000:00:19.0 (uninitialized): registered PHC clock
> [    1.112485] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:01:c0:16:1d:76
> [    1.112492] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
> [    1.112518] e1000e 0000:00:19.0 eth0: MAC: 11, PHY: 12, PBA No: FFFFFF-0FF
> [    1.112554] igb: Intel(R) Gigabit Ethernet Network Driver
> [    1.112556] igb: Copyright (c) 2007-2014 Intel Corporation.
> [    1.141437] igb 0000:02:00.0: added PHC on eth1
> [    1.141457] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [    1.141460] igb 0000:02:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 00:01:c0:16:1d:77
> [    1.141464] igb 0000:02:00.0: eth1: PBA No: FFFFFF-0FF
> [    1.141466] igb 0000:02:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
> [   16.042536] e1000e 0000:00:19.0 eth0: NIC Link is Down
> [   18.974728] e1000e 0000:00:19.0 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> [   19.062699] e1000e 0000:00:19.0 eth0: entered promiscuous mode
> [   19.062800] e1000e 0000:00:19.0 eth0: entered allmulticast mode
> [   19.095345] e1000e 0000:00:19.0 eth0: left promiscuous mode
> [   33.497246] e1000e 0000:00:19.0 eth0: entered promiscuous mode
Comment 13 Jelle Geerts 2025-01-27 17:26:13 UTC
Just in case it helps pinpoint the issue: I've just had this problem for the first time as well, similar to Patrick Schaaf (see comment #11). Yesterday, after upgrading from Fedora 40 to Fedora 41, which includes kernel 6.12.10, I started noticing Ethernet communication stopping completely at random times.

NIC hardware: Intel Corporation Ethernet Connection (16) I219-V [8086:1a1f] (rev 01)

(What seems to alleviate the issue is unplugging the cable and re-inserting it. I have not yet tried other workarounds.)

Anyway, this message kept being shown by dmesg every second or so:

[10164.837549] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                 TDH                  <c6>
                 TDT                  <e4>
                 next_to_use          <e4>
                 next_to_clean        <c5>
               buffer_info[next_to_clean]:
                 time_stamp           <10094e81e>
                 next_to_watch        <c6>
                 jiffies              <100968240>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <7800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[10166.822527] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                 TDH                  <c6>
                 TDT                  <e4>
                 next_to_use          <e4>
                 next_to_clean        <c5>
               buffer_info[next_to_clean]:
                 time_stamp           <10094e81e>
                 next_to_watch        <c6>
                 jiffies              <100968a01>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <7800>
               PHY Extended Status    <3000>
               PCI Status             <10>
Comment 14 jronpaul 2025-02-14 19:31:10 UTC
Is there a fix for this?

I'm seeing it on a Dell Precision T5600
running Linux Mint 21.3.

ethtool -i enp0s25 
driver: e1000e
version: 5.15.0-131-generic
firmware-version: 0.13-4
expansion-rom-version: 
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ethtool --show-features enp0s25 |egrep "gso|gro|tso"
tx-gso-robust: off [fixed]
tx-gso-partial: off [fixed]
tx-gso-list: off [fixed]
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
Comment 15 jronpaul 2025-02-14 19:32:53 UTC
I also added pcie_aspm=off to the kernel command line in GRUB.
Comment 16 Jelle Geerts 2025-04-21 16:11:07 UTC
For me the issue seems to happen more often when I'm running either Docker containers and/or libvirt-based VMs.
Comment 17 Michael Orlitzky 2025-04-28 16:23:59 UTC
Does disabling TSO still work around this? It seems like this has re(?)appeared with a vengeance circa 6.11 / 6.12 so I'm wondering if the root cause has changed.

Here are two more reports of what is likely the same issue:

https://bugzilla.kernel.org/show_bug.cgi?id=219489
Comment 18 Michael Orlitzky 2025-05-21 01:13:17 UTC
To answer my own question, yes, disabling TSO can still fix the problem.

I thought this was a new issue in 6.12.x, but it was happening with the old kernel, too. What changed is that in 6.12 the NIC is (apparently) no longer able to recover on its own. I could still see the hangs/resets in dmesg on the older kernel though.

So, I disabled TSO, and the dmesg entries stopped. After a while I worked up the courage to reboot into 6.12. Even that has been running for a while now without issue (now that TSO is off).
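One way to check whether a machine is in the "hangs but recovers" state or the "hangs and never recovers" state is to count the hang and reset messages per interface in the kernel log. The following is a rough Python sketch, not anything taken from the driver; the function name is hypothetical and the match strings are simply copied from the log excerpts in this bug.

```python
import re
from collections import Counter

HANG = "Detected Hardware Unit Hang"
RESET = "Reset adapter unexpectedly"
# Matches "e1000e <pci address> <interface>: " as seen in the logs above.
IFACE_RE = re.compile(r"e1000e \S+ (\S+): ")

def count_hang_events(lines):
    """Count hang and reset messages per interface in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = IFACE_RE.search(line)
        if not match:
            continue
        iface = match.group(1)
        if HANG in line:
            counts[(iface, "hang")] += 1
        elif RESET in line:
            counts[(iface, "reset")] += 1
    return counts

log = [
    "[26318.324173] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:",
    "[26320.323906] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly",
    "[26324.141990] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None",
]
for key, count in sorted(count_hang_events(log).items()):
    print(key, count)
```

Feeding it the output of `dmesg` or `journalctl -k` line by line would show, for example, many hang messages with no matching resets in the non-recovering case.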
Comment 19 jronpaul 2025-05-21 03:22:25 UTC
(In reply to Jelle Geerts from comment #16)
> For me the issue seems to happen more often when I'm running either Docker
> containers and/or libvirt-based VMs.

Huh, funny you should mention that.
I think that was happening to me too.
I was running multiple VirtualBox VMs, with Docker inside of those.
Comment 20 Jelle Geerts 2025-05-21 07:21:20 UTC
In my case, the issue occurs despite TSO being disabled. Perhaps that has to do with other offloading settings such as 'generic-segmentation-offload', which was actually enabled here, even though 'tcp-segmentation-offload' was disabled. By 'the issue' I mean the endless e1000e 'Detected Hardware Unit Hang' kernel messages. The driver is unable to recover the NIC to a working state. Kernel version: 6.12.10

The output below clearly shows that TSO (tcp-segmentation-offload) is disabled but GSO (generic-segmentation-offload) is *enabled*.

Note that I did *not* change any settings. These are the defaults.

$ ethtool --show-features enp0s31f6 |egrep "gso|gro|tso|offload"
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
tx-gso-robust: off [fixed]
tx-gso-partial: off [fixed]
tx-gso-list: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
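To make this kind of check repeatable across machines, the feature listing can be parsed and filtered for segmentation offloads that are still on. This is a small Python sketch with hypothetical function names; it only relies on the "name: on|off [fixed]" layout visible in the output above. Keep in mind that GSO is carried out in software by the network stack rather than by the NIC, so it being on does not by itself show that the hardware is segmenting packets.

```python
def parse_features(text):
    """Map feature name -> (enabled, fixed) from 'ethtool --show-features'."""
    features = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        state = rest.split()
        # Skips header lines such as "Features for <iface>:".
        if state and state[0] in ("on", "off"):
            features[name.strip()] = (state[0] == "on", "[fixed]" in state)
    return features

def active_segmentation_offloads(features):
    """Names of segmentation-related features that are currently on."""
    return sorted(name for name, (enabled, _fixed) in features.items()
                  if enabled and "segmentation" in name)

sample = """\
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""
print(active_segmentation_offloads(parse_features(sample)))
# ['generic-segmentation-offload']
```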
