Hi, I've seen this with a lot of (mainline) kernels, but I only got around to reporting it now, on a Debian kernel. I have an Atom system with an onboard e1000e. It's not particularly loaded (it handles my home network), but it has two VLANs on a bridge. Every few minutes, it hangs with a message like this:

[26318.324173] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <92>
  TDT                  <a7>
  next_to_use          <a7>
  next_to_clean        <92>
buffer_info[next_to_clean]:
  time_stamp           <100633915>
  next_to_watch        <95>
  jiffies              <100634004>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
[26320.323906] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
[26320.327575] br0: port 2(eno1.10) entered disabled state
[26320.327786] br0: port 1(eno1.11) entered disabled state
[26324.141990] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[26324.142387] br0: port 2(eno1.10) entered blocking state
[26324.142390] br0: port 2(eno1.10) entered forwarding state
[26324.142550] br0: port 1(eno1.11) entered blocking state
[26324.142553] br0: port 1(eno1.11) entered forwarding state

Then it resets and continues. If I turn off TSO (ethtool -K eno1 tso off), the problem goes away.
I've been seeing the same thing for quite a while on a Lenovo T431s with various Arch Linux kernels. I use the interface both untagged and with one VLAN tag (99). It seems to work fine as long as I don't use VLAN 99, but as soon as traffic goes through the VLAN I see these Hardware Unit Hangs quite often. Turning off TSO seems to alleviate the problem here too. I am currently running 4.6.3-1-ARCH.

[147992.037386] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
  TDH                  <3d>
  TDT                  <46>
  next_to_use          <46>
  next_to_clean        <3a>
buffer_info[next_to_clean]:
  time_stamp           <102a40703>
  next_to_watch        <3d>
  jiffies              <102a40a6e>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
[147994.037539] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
  TDH                  <3d>
  TDT                  <46>
  next_to_use          <46>
  next_to_clean        <3a>
buffer_info[next_to_clean]:
  time_stamp           <102a40703>
  next_to_watch        <3d>
  jiffies              <102a40cc6>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

# lspci -v
...
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
	Subsystem: Lenovo Device 21f3
	Flags: bus master, fast devsel, latency 0, IRQ 31
	Memory at f1500000 (32-bit, non-prefetchable) [size=128K]
	Memory at f153b000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 4080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
...
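For comparing reports like the two above, here is a minimal sketch that pulls the ring fields out of a hang message. My reading of the fields is inferred from their names, not stated in this thread: TDH and TDT are the hardware's transmit descriptor head and tail, so TDT minus TDH (modulo the ring size) counts descriptors handed to the NIC that it has not yet processed, and jiffies minus time_stamp is how long the oldest unfinished buffer has been waiting. The function names are hypothetical, and the default ring size of 256 is an assumption (ethtool -g eno1 shows the real value).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading e1000e "Detected Hardware Unit Hang" messages.

# hang_field MSG NAME: print the hex value shown as "NAME <hex>" in MSG.
# MSG may be the one-line form or the multi-line dmesg form.
hang_field() {
  # Join lines so both forms look the same, then extract the value after NAME
  # (if NAME appears more than once, the last occurrence wins).
  printf ' %s\n' "$1" | tr '\n' ' ' |
    sed -n "s/.*[[:space:]]$2[[:space:]]*<\([0-9a-fA-F]*\)>.*/\1/p"
}

# hang_summary MSG [RING_SIZE]: print descriptors still owned by the NIC
# (TDT - TDH, modulo the ring size) and how long the oldest unfinished
# buffer has waited (jiffies - time_stamp). RING_SIZE defaults to 256,
# which is an assumption; check yours with "ethtool -g <iface>".
hang_summary() {
  local msg=$1 ring=${2:-256} tdh tdt ts jf
  tdh=$(hang_field "$msg" TDH)
  tdt=$(hang_field "$msg" TDT)
  ts=$(hang_field "$msg" time_stamp)
  jf=$(hang_field "$msg" jiffies)
  # Give up if any field is missing (not a complete hang message).
  if [ -z "$tdh" ] || [ -z "$tdt" ] || [ -z "$ts" ] || [ -z "$jf" ]; then
    return 1
  fi
  echo "pending=$(( (0x$tdt - 0x$tdh + ring) % ring )) stall_jiffies=$(( 0x$jf - 0x$ts ))"
}
```

On the reports above this gives pending=21 stall_jiffies=1775 for the Atom box in the first report and pending=9 stall_jiffies=875 for the first ThinkPad message. In the ThinkPad log the two consecutive messages are 2 s apart, TDH stays at 3d, and jiffies advances by 600 between them, which suggests a 300 Hz tick; the stall would then have lasted roughly 2.9 s at the first report.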
Please send me your syslog/dmesg output; there may be earlier warnings from the driver that give further hints toward solving this problem.
Created attachment 245091: e1000e Detected Hardware Unit Hang when using VLAN and routing
Sorry for the delay, dmesg attached now. Still reliably reproducible on 4.8.8.
I was bothered by this on a Skylake NUC recently, too. Eventually I had to turn off absolutely all forms of offloading (TSO, checksumming, scatter/gather…) _and_ build a kernel (4.8.1) with CONFIG_PM=n. Neither change was enough on its own. Unfortunately I had to leave the site before I could collect enough data, but there were no other warnings before the hangs.
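To confirm that "absolutely all" offloads really are off after something like ethtool -K eno1 tso off gso off gro off sg off tx off rx off (standard ethtool -K keywords; a given e1000e part may refuse some of them, and ethtool reports that), a small filter over ethtool -k output can help. This is a sketch: the function name is made up, and it only looks at top-level feature lines, not the indented sub-features.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "ethtool -k <iface>" output on stdin and print the
# top-level features that are still "on". Indented sub-features are skipped,
# and so are "on [fixed]" entries, which ethtool -K cannot change anyway.
offloads_still_on() {
  awk '/^[^[:space:]].*: on$/ { sub(/: on$/, ""); print }'
}
```

Typical use would be ethtool -k eno1 | offloads_still_on; empty output means every changeable top-level offload is off.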
Created attachment 253981: Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25
I also see this issue when connected to an HPE 1820-8G (J9979A, running Linux) or HP 1810-8G (J9802A, running eCos) switch, with a separate VLAN routed by my Lenovo T440s running Fedora 25. For example, running git clone of OpenCV on an ARM target that sits on that VLAN behind the switch, with the notebook routing it to the Internet, reliably triggers the hang within a few seconds. I tried both the older 4.8.16-300.fc25 and the latest 4.9.6-200.fc25 kernels.
Created attachment 253991: Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25
And the log from the latest kernel.
Same issue here. Running Debian 4.8 with 3.16.0-4-amd64 as a router with several VLANs. The hangs occur even when the system is not under any significant load. It happens on both a Supermicro X9SCM-F and an Intel Desktop Board. I've already upgraded the driver to 3.3.5.3-NAPI and disabled EEE and ASPM; none of this changed anything. Disabling TSO seems to be a working workaround, but I don't like the idea of keeping it disabled. I can supply further logs if helpful, but from my perspective they look similar to the ones already uploaded.
Probably the same here. Fedora 26, Intel 82579V adapter:

# uname -r
4.13.13-200.fc26.x86_64
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:7751]
	Flags: bus master, fast devsel, latency 0, IRQ 28
	Memory at f7c00000 (32-bit, non-prefetchable) [size=128K]
	Memory at f7c38000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at f080 [size=32]
	Capabilities: [c8] Power Management version 2
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e
# ethtool -i eno1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

I've disabled TSO for now to see if it helps.
Same here. Ubuntu 18.04:

# uname -r
4.15.0-177-generic
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
	Subsystem: Super Micro Computer Inc 82579LM Gigabit Network Connection (Lewisville) [15d9:1502]
	Flags: bus master, fast devsel, latency 0, IRQ 26
	Memory at f7a00000 (32-bit, non-prefetchable) [size=128K]
	Memory at f7a23000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at f020 [size=32]
	Capabilities: <access denied>
	Kernel driver in use: e1000e
	Kernel modules: e1000e
# ethtool -i em1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Disabling TSO seems to have fixed the problem for me. (I had to set it right after a fresh boot, *before* the interface started bailing out continually.)
I also hit this issue. In case it helps others: reloading the module with the parameter Node=0 (the NUMA node my NIC is on: modprobe e1000e Node=0) appears to have worked around it.
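A small helper for the Node= workaround above: it reads the NUMA node the kernel reports for the NIC's PCI device from sysfs (/sys/class/net/<interface>/device/numa_node) and prints a matching Node= option, or nothing when the kernel reports -1 (no NUMA affinity). As far as I know, Node is a parameter of Intel's out-of-tree e1000e driver rather than the in-kernel one, so check modinfo -p e1000e before relying on it. The function name and the SYSFS override are made up for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "Node=<n>" for the NUMA node of a network
# interface's PCI device, or nothing if the kernel reports no affinity (-1).
# SYSFS can be overridden (e.g. for testing); it defaults to /sys.
e1000e_node_option() {
  local ifname=$1 sysfs=${SYSFS:-/sys} node
  # Fails (non-zero status) if the interface has no PCI device in sysfs.
  node=$(cat "$sysfs/class/net/$ifname/device/numa_node") || return 1
  if [ "$node" -ge 0 ]; then
    echo "Node=$node"
  fi
}
```

For example, as root from a local console (unloading the module takes the link down): opt=$(e1000e_node_option eno1); modprobe -r e1000e && modprobe e1000e $opt. To keep the setting across reboots, a line such as options e1000e Node=0 in a file under /etc/modprobe.d/ is the usual mechanism.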