Bug 118721 - e1000e hardware unit hangs when TSO is on
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-22 21:30 UTC by Steinar H. Gunderson
Modified: 2023-08-17 20:45 UTC
CC List: 7 users

See Also:
Kernel Version: 4.6.0-trunk-amd64 (Debian)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
e1000e Detected Hardware Unit Hang when using VLAN and routing (72.68 KB, text/plain)
2016-11-19 02:54 UTC, Stefan Agner
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25 (365.66 KB, text/x-log)
2017-02-03 17:02 UTC, Marcel Ziswiler
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25 (361.12 KB, text/x-log)
2017-02-03 17:03 UTC, Marcel Ziswiler

Description Steinar H. Gunderson 2016-05-22 21:30:46 UTC
Hi,

I've seen this with many mainline kernels, but I only got around to reporting it now, on a Debian kernel.

I have an Atom system with an onboard e1000e. It's not particularly loaded (it carries my home network), but it has two VLANs on a bridge. Every few minutes, it hangs with a message like this:

[26318.324173] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
                 TDH                  <92>
                 TDT                  <a7>
                 next_to_use          <a7>
                 next_to_clean        <92>
               buffer_info[next_to_clean]:
                 time_stamp           <100633915>
                 next_to_watch        <95>
                 jiffies              <100634004>
                 next_to_watch.status <0>
               MAC Status             <80083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[26320.323906] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
[26320.327575] br0: port 2(eno1.10) entered disabled state
[26320.327786] br0: port 1(eno1.11) entered disabled state
[26324.141990] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[26324.142387] br0: port 2(eno1.10) entered blocking state
[26324.142390] br0: port 2(eno1.10) entered forwarding state
[26324.142550] br0: port 1(eno1.11) entered blocking state
[26324.142553] br0: port 1(eno1.11) entered forwarding state

Then it resets, and continues.
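
For anyone trying to read the dump above, here is a minimal sketch in Python that parses it. It assumes every bracketed value is hexadecimal (TDT <a7> suggests so) and that the TX ring has the e1000e default size of 256 descriptors. The names parse_hang_dump and pending_descriptors are illustrative only, not part of the driver.

```python
import re

# First part of the hang dump from this report (values assumed hexadecimal).
DUMP = """\
e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <92>
  TDT                  <a7>
  next_to_use          <a7>
  next_to_clean        <92>
buffer_info[next_to_clean]:
  time_stamp           <100633915>
  next_to_watch        <95>
  jiffies              <100634004>
  next_to_watch.status <0>
"""

# Matches lines of the form "  name   <hexvalue>".
FIELD = re.compile(r"^\s*([A-Za-z_.]+)\s+<([0-9a-fA-F]+)>\s*$", re.MULTILINE)


def parse_hang_dump(text):
    """Return {field_name: int} for every 'name <hex>' line in the dump."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def pending_descriptors(fields, ring_size=256):
    """Descriptors between the hardware head (TDH) and the tail (TDT).

    ring_size=256 is an assumption (the e1000e default); the real value
    is shown by `ethtool -g <iface>`.
    """
    return (fields["TDT"] - fields["TDH"]) % ring_size


fields = parse_hang_dump(DUMP)
print("pending descriptors:", pending_descriptors(fields))
print("buffer age in jiffies:", fields["jiffies"] - fields["time_stamp"])
print("head stuck at next_to_clean:", fields["TDH"] == fields["next_to_clean"])
```

With these values, 21 descriptors (0xa7 - 0x92) are queued behind the head, TDH equals next_to_clean, and next_to_watch.status is 0, so the adapter appears to have stopped at the very descriptor the driver is waiting on.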

If I turn off TSO (ethtool -K eno1 tso off), the problem goes away.
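
A small sketch of how the workaround could be scripted, for example to check the current state before changing it. It assumes ethtool is installed and that `ethtool -k <iface>` prints one `feature: on|off` pair per line; the function names are made up for illustration. The setting is not persistent, so it has to be reapplied after every boot.

```python
import subprocess


def parse_features(ethtool_k_output):
    """Parse the text printed by `ethtool -k <iface>` into {feature: bool}."""
    features = {}
    for line in ethtool_k_output.splitlines():
        name, sep, state = line.partition(":")
        if not sep:
            continue
        words = state.split()
        # Skip the "Features for <iface>:" header; ignore "[fixed]" suffixes.
        if words and words[0] in ("on", "off"):
            features[name.strip()] = words[0] == "on"
    return features


def tso_enabled(iface):
    """Ask ethtool whether TCP segmentation offload is currently on."""
    out = subprocess.run(["ethtool", "-k", iface], check=True,
                         capture_output=True, text=True).stdout
    return parse_features(out).get("tcp-segmentation-offload", False)


def disable_tso(iface):
    """Equivalent to `ethtool -K <iface> tso off`; needs root."""
    subprocess.run(["ethtool", "-K", iface, "tso", "off"], check=True)
```

Usage would be along the lines of `if tso_enabled("eno1"): disable_tso("eno1")`, run as root.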
Comment 1 Stefan Agner 2016-07-20 20:26:47 UTC
I have been seeing the same thing for quite a while on a Lenovo T431s with various Arch Linux kernels. I use the interface untagged and with one VLAN (tag 99). It seems to work fine as long as I don't use VLAN 99, but as soon as traffic goes through the VLAN I see these Hardware Unit Hangs quite often. Turning off TSO seems to alleviate the problem here too.

Currently I am running 4.6.3-1-ARCH. 

[147992.037386] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                  TDH                  <3d>
                  TDT                  <46>
                  next_to_use          <46>
                  next_to_clean        <3a>
                buffer_info[next_to_clean]:
                  time_stamp           <102a40703>
                  next_to_watch        <3d>
                  jiffies              <102a40a6e>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[147994.037539] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                  TDH                  <3d>
                  TDT                  <46>
                  next_to_use          <46>
                  next_to_clean        <3a>
                buffer_info[next_to_clean]:
                  time_stamp           <102a40703>
                  next_to_watch        <3d>
                  jiffies              <102a40cc6>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
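
The two dumps above can be compared with a few lines of arithmetic. This is only a sketch: it assumes the bracketed values are hexadecimal and estimates the tick rate (HZ) from how far jiffies advanced between the two log timestamps; stall_seconds is an illustrative helper name.

```python
# Values copied from the two dumps in this comment (hexadecimal in the log).
LOG_TIME_1, LOG_TIME_2 = 147992.037386, 147994.037539  # dmesg timestamps (s)
TIME_STAMP_1 = TIME_STAMP_2 = 0x102a40703              # buffer_info time_stamp
JIFFIES_1, JIFFIES_2 = 0x102a40a6e, 0x102a40cc6

# Ticks per second, inferred from the jiffies delta between the two dumps.
hz = round((JIFFIES_2 - JIFFIES_1) / (LOG_TIME_2 - LOG_TIME_1))


def stall_seconds(jiffies, time_stamp, hz):
    """How long the buffer at next_to_clean had been waiting, in seconds."""
    return (jiffies - time_stamp) / hz


print("estimated HZ:", hz)
print("stall at first dump (s):", stall_seconds(JIFFIES_1, TIME_STAMP_1, hz))
print("stall at second dump (s):", stall_seconds(JIFFIES_2, TIME_STAMP_2, hz))
```

Both dumps show the same next_to_clean (3a) and the same time_stamp, so the same buffer was still waiting two seconds later: roughly 2.9 s old at the first dump and 4.9 s at the second, under the estimated HZ of 300.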

# lspci -v
...
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
	Subsystem: Lenovo Device 21f3
	Flags: bus master, fast devsel, latency 0, IRQ 31
	Memory at f1500000 (32-bit, non-prefetchable) [size=128K]
	Memory at f153b000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 4080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
...
Comment 2 [account disabled by the administrator] 2016-08-19 02:33:08 UTC
Please send me your syslog/dmesg output; there may be driver warnings in it that give further hints toward solving this problem.
Comment 3 Stefan Agner 2016-11-19 02:54:58 UTC
Created attachment 245091
e1000e Detected Hardware Unit Hang when using VLAN and routing

Sorry for the delay; dmesg is attached now. Still reliably reproducible on 4.8.8.
Comment 4 Steinar H. Gunderson 2016-11-19 09:13:47 UTC
I was bothered by this on a Skylake NUC recently, too. Eventually I had to turn off absolutely all forms of acceleration (TSO, checksumming, scatter/gather…) _and_ compile a kernel (4.8.1) with CONFIG_PM=n. Neither would do it on its own.
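
As a rough illustration of what "all forms of acceleration" could translate to on the command line, the sketch below builds a single ethtool invocation. The feature list is an assumption (the exact set used on the NUC is not stated), the interface name eno1 is a placeholder, and offload_off_command is a made-up helper name.

```python
# Short feature names accepted by `ethtool -K`; which ones were actually
# turned off on the NUC is not stated, so this list is an assumption.
OFFLOADS = ("tso", "gso", "gro", "sg", "tx", "rx")


def offload_off_command(iface, offloads=OFFLOADS):
    """Build the argv for one `ethtool -K` call that disables each offload."""
    argv = ["ethtool", "-K", iface]
    for name in offloads:
        argv += [name, "off"]
    return argv


# "eno1" is a placeholder interface name.
print(" ".join(offload_off_command("eno1")))
```

The printed command can be run as root, or the list can be passed to subprocess.run directly. The CONFIG_PM=n part has no runtime equivalent in this sketch; it requires rebuilding the kernel.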

Unfortunately I had to leave the site before I could collect enough data, but there were no other warnings before the hangs.
Comment 5 Marcel Ziswiler 2017-02-03 17:02:55 UTC
Created attachment 253981
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25

I also see this issue when connected to an HPE 1820-8G (J9979A, running Linux) or HP 1810-8G (J9802A, running eCos) switch, with a separate VLAN being routed by my Lenovo T440s running Fedora 25. For example, a git clone of OpenCV on an ARM target that sits on that VLAN behind the switch, with the notebook routing it to the Internet, reliably triggers the hang within a few seconds. I tried both the older 4.8.16-300.fc25 and the latest 4.9.6-200.fc25 kernels.
Comment 6 Marcel Ziswiler 2017-02-03 17:03:39 UTC
Created attachment 253991
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25

And the log file from the latest kernel.
Comment 7 S. Eckardt 2017-02-15 16:14:48 UTC
Same issue here. 
Running Debian 4.8 with 3.16.0-4-amd64 as a router with several VLANs. The hangs occur even when the system is not under any notable load. They occur on both a Supermicro X9SCM-F and an Intel Desktop Board. I've already upgraded the driver to 3.3.5.3-NAPI and disabled EEE and ASPM. This did not change anything.
Disabling TSO seems to be a working workaround, but I don't like the idea of keeping it disabled.
I can supply further logs if helpful, but from my perspective they look similar to the ones already uploaded.
Comment 8 Dominik Mierzejewski 2017-11-27 11:01:06 UTC
Probably the same here. Fedora 26, Intel 82579V adapter:
# uname -r
4.13.13-200.fc26.x86_64
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:7751]
	Flags: bus master, fast devsel, latency 0, IRQ 28
	Memory at f7c00000 (32-bit, non-prefetchable) [size=128K]
	Memory at f7c38000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at f080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
# ethtool -i eno1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version: 
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

I've disabled TSO to see if it helps for now.
Comment 9 maze 2022-05-20 22:26:16 UTC
Same here. Ubuntu 18.04:

# uname -r
4.15.0-177-generic
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
        Subsystem: Super Micro Computer Inc 82579LM Gigabit Network Connection (Lewisville) [15d9:1502]
        Flags: bus master, fast devsel, latency 0, IRQ 26
        Memory at f7a00000 (32-bit, non-prefetchable) [size=128K]
        Memory at f7a23000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at f020 [size=32]
        Capabilities: <access denied>
        Kernel driver in use: e1000e
        Kernel modules: e1000e
# ethtool -i em1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version: 
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Disabling TSO seems to have fixed the problem for me. (I needed to set it after a fresh boot, *before* the interface starts bailing out continually.)
Comment 10 Peter Jose De Sousa 2023-08-17 20:45:53 UTC
I also hit this issue. This might be helpful to others: reloading the module with the parameter Node=0 (the NUMA node my NIC is on; modprobe e1000e Node=0) appears to have worked around the issue.
