Bug 118721
Summary: | e1000e hardware unit hangs when TSO is on | | |
---|---|---|---|
Product: | Drivers | Reporter: | Steinar H. Gunderson (steinar+kernel) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | NEW --- | | |
Severity: | normal | CC: | bugs_kernel.org, dominik, ingvarthorvald, jellegeerts, jronpaul, kernelorg, marcel, maze, michael, petedes, stefan |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 4.6.0-trunk-amd64 (Debian) | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments:
e1000e Detected Hardware Unit Hang when using VLAN and routing
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25
Description
Steinar H. Gunderson
2016-05-22 21:30:46 UTC
I have been seeing the same thing for quite a while on a Lenovo T431s with various Arch Linux kernels. I use the interface untagged and with one VLAN tag, 99. It seems to work fine as long as I don't address VLAN 99, but as soon as I get traffic through the VLAN I see these Hardware Unit Hangs quite often. Turning off TSO seems to alleviate the problem here too. Currently I am running 4.6.3-1-ARCH.

[147992.037386] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
  TDH <3d>
  TDT <46>
  next_to_use <46>
  next_to_clean <3a>
buffer_info[next_to_clean]:
  time_stamp <102a40703>
  next_to_watch <3d>
  jiffies <102a40a6e>
  next_to_watch.status <0>
MAC Status <80083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
[147994.037539] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
  TDH <3d>
  TDT <46>
  next_to_use <46>
  next_to_clean <3a>
buffer_info[next_to_clean]:
  time_stamp <102a40703>
  next_to_watch <3d>
  jiffies <102a40cc6>
  next_to_watch.status <0>
MAC Status <80083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>

# lspci -v
...
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 31
    Memory at f1500000 (32-bit, non-prefetchable) [size=128K]
    Memory at f153b000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at 4080 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e
    Kernel modules: e1000e
...

Send me your syslog/dmesg, as there may be warnings from the driver that give further hints toward solving this problem.

Created attachment 245091 [details]
e1000e Detected Hardware Unit Hang when using VLAN and routing
Sorry for the delay, dmesg attached now. Still reliably reproducible on 4.8.8.
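The "Detected Hardware Unit Hang" messages quoted in this report all carry the same hexadecimal fields, so they can be compared mechanically. The following is a minimal sketch for doing that, not part of the driver or of any existing tool: the function names are hypothetical, and the ring size of 256 descriptors is an assumption (it is the e1000e default, but it can be changed with `ethtool -G`). TDH and TDT are the transmit descriptor head and tail, so their distance is the number of descriptors the hardware has not yet processed.

```python
import re

# Field names as printed in the e1000e hang message. The longer name
# "next_to_watch.status" is listed before "next_to_watch" so it wins.
_FIELD_RE = re.compile(
    r"(TDH|TDT|next_to_use|next_to_clean|time_stamp|"
    r"next_to_watch\.status|next_to_watch|jiffies)\s*<([0-9a-fA-F]+)>"
)


def parse_hang(message: str) -> dict:
    """Return the hex fields of one hang message as integers."""
    return {name: int(value, 16) for name, value in _FIELD_RE.findall(message)}


def pending_descriptors(fields: dict, ring_size: int = 256) -> int:
    """Descriptors between head (TDH) and tail (TDT).

    ring_size=256 is an assumption (the driver default), not read from the log.
    """
    return (fields["TDT"] - fields["TDH"]) % ring_size


def stalled_jiffies(fields: dict) -> int:
    """Jiffies elapsed since the stuck buffer was time-stamped."""
    return fields["jiffies"] - fields["time_stamp"]


if __name__ == "__main__":
    # First message from the description, flattened onto one line.
    sample = (
        "e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang: "
        "TDH <3d> TDT <46> next_to_use <46> next_to_clean <3a> "
        "buffer_info[next_to_clean]: time_stamp <102a40703> "
        "next_to_watch <3d> jiffies <102a40a6e> next_to_watch.status <0>"
    )
    fields = parse_hang(sample)
    print(pending_descriptors(fields), stalled_jiffies(fields))
```

For the first message in the description this reports 9 pending descriptors and 875 jiffies since the time stamp. Converting jiffies to seconds needs the kernel's HZ value, which the log does not contain. Identical TDH and TDT values across consecutive messages, as in the logs above, indicate that the head pointer is not advancing.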
I was bothered by this on a Skylake NUC recently, too. Eventually I had to turn off absolutely all forms of acceleration (TSO, checksumming, scatter/gather…) _and_ compile a kernel (4.8.1) with CONFIG_PM=n. Neither was enough on its own. Unfortunately I had to leave the site before I could collect enough data, but there were no other warnings before the hangs.

Created attachment 253981 [details]
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.8.16-300.fc25
I also see this issue when connected to an HPE 1820-8G (J9979A running Linux) or HP 1810-8G (J9802A running eCos) switch with a separate VLAN being routed by my Lenovo T440s running Fedora 25. For example, running git clone of OpenCV on an ARM target that sits on that VLAN behind the switch, with the notebook routing it to the Internet, reliably triggers the hang within a few seconds. I tried both the older 4.8.16-300.fc25 and the latest 4.9.6-200.fc25 kernels.
Created attachment 253991 [details]
Full Journal doing Routing with VLAN causing E1000 Hardware Unit Hang 4.9.6-200.fc25
And the log file from running the latest kernel.
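Several comments in this report describe turning off TSO, and in some cases other offloads, as a workaround, without showing the command. The sketch below builds the corresponding `ethtool -K` invocation. The helper name and the default feature list are hypothetical; `tso` and `gso` are short feature names accepted by `ethtool -K`, and which offloads need to be off on a given system is an assumption, not something this report settles.

```python
import shlex

# Assumption: TSO and GSO are the offloads to switch off. Other short names
# accepted by `ethtool -K` include "gro", "sg", "tx" and "rx".
DEFAULT_FEATURES = ("tso", "gso")


def offload_off_command(iface: str, features=DEFAULT_FEATURES) -> list:
    """Build the argv for `ethtool -K <iface> <feature> off ...`."""
    argv = ["ethtool", "-K", iface]
    for feature in features:
        argv += [feature, "off"]
    return argv


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    print(shlex.join(offload_off_command("eth0")))
```

The printed command has to be run as root, for example through `subprocess.run(argv, check=True)`. The setting does not survive a reboot, which matters because one reporter notes it has to be applied before the hangs start.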
Same issue here. Running debian 4.8 with 3.16.0-4-amd64 as a router with several VLANs. The hangs occur even when the system is not under any notable load. They occur on both a Supermicro X9SCM-F and an Intel Desktop Board. I've already upgraded the driver to 3.3.5.3-NAPI and disabled EEE and ASPM. This did not change the issue. Disabling TSO seems to be a working workaround, but I don't like the idea of keeping it disabled. I can supply further logs if helpful, but from my perspective they look similar to the ones already uploaded.

Probably the same here. Fedora 26, Intel 82579V adapter:

# uname -r
4.13.13-200.fc26.x86_64
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:7751]
    Flags: bus master, fast devsel, latency 0, IRQ 28
    Memory at f7c00000 (32-bit, non-prefetchable) [size=128K]
    Memory at f7c38000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at f080 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e
    Kernel modules: e1000e
# ethtool -i eno1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

I've disabled TSO to see if it helps for now.

Same here.
Ubuntu 18.04,

# uname -r
4.15.0-177-generic
# lspci -vnn -s 00:19.0
00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
    Subsystem: Super Micro Computer Inc 82579LM Gigabit Network Connection (Lewisville) [15d9:1502]
    Flags: bus master, fast devsel, latency 0, IRQ 26
    Memory at f7a00000 (32-bit, non-prefetchable) [size=128K]
    Memory at f7a23000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at f020 [size=32]
    Capabilities: <access denied>
    Kernel driver in use: e1000e
    Kernel modules: e1000e
# ethtool -i em1
driver: e1000e
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Disabling TSO seems to have fixed the problem for me. (I needed to set it after a fresh boot, *before* the interface starts bailing out continually.)

I also hit this issue. In case it helps others: reloading the module with the parameter Node=0 (the NUMA node my NIC is on; modprobe e1000e Node=0) appears to have worked around the issue.

I'm getting this on the stable 6.12.1 kernel, after upgrading from the previous LTS 6.6.63 on the same hardware. I never saw it before. With 6.12.1 it has happened twice, each time after around 5 days of uptime. The system becomes unresponsive and unreachable. I'll try a third time with TSO switched off, as mentioned earlier here. The kernels are self-built from kernel.org sources, with a config that I've been running for years. There is nothing else related to e1000e in dmesg prior to the issue appearing. Once it appears, the hang message gets logged every two seconds.
Dec 05 02:07:57 teller kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
  TDH <46>
  TDT <6b>
  next_to_use <6b>
  next_to_clean <45>
buffer_info[next_to_clean]:
  time_stamp <1023c6d0d>
  next_to_watch <46>
  jiffies <1023c6dd0>
  next_to_watch.status <0>
MAC Status <80083>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>

In case it matters or helps someone understand the issue, the hardware is this, with an Intel(R) Celeron(R) 2955U processor

> [ 0.000000] DMI: CompuLab Ltd. Intense-PC2 (IPC2)/Intense-PC2 (IPC2), BIOS IPC2_3.330.3 X64 09/03/2014

which has a mix of ethernet controller types

> 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I218-LM (rev 04)
> 02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

That second controller is using the igb driver. Both ethernet ports are configured in an active-backup bond, but only eth0, the one with the e1000e driver that hangs, is Up and active.

Here's the dmesg output of kernel 6.12.1 from boot

> teller:~ # dmesg | egrep '(igb|e1000)'
> [ 0.974808] e1000e: Intel(R) PRO/1000 Network Driver
> [ 0.974811] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
> [ 0.975004] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [ 1.047712] e1000e 0000:00:19.0 0000:00:19.0 (uninitialized): registered PHC clock
> [ 1.112485] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:01:c0:16:1d:76
> [ 1.112492] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
> [ 1.112518] e1000e 0000:00:19.0 eth0: MAC: 11, PHY: 12, PBA No: FFFFFF-0FF
> [ 1.112554] igb: Intel(R) Gigabit Ethernet Network Driver
> [ 1.112556] igb: Copyright (c) 2007-2014 Intel Corporation.
> [ 1.141437] igb 0000:02:00.0: added PHC on eth1
> [ 1.141457] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [ 1.141460] igb 0000:02:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 00:01:c0:16:1d:77
> [ 1.141464] igb 0000:02:00.0: eth1: PBA No: FFFFFF-0FF
> [ 1.141466] igb 0000:02:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
> [ 16.042536] e1000e 0000:00:19.0 eth0: NIC Link is Down
> [ 18.974728] e1000e 0000:00:19.0 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> [ 19.062699] e1000e 0000:00:19.0 eth0: entered promiscuous mode
> [ 19.062800] e1000e 0000:00:19.0 eth0: entered allmulticast mode
> [ 19.095345] e1000e 0000:00:19.0 eth0: left promiscuous mode
> [ 33.497246] e1000e 0000:00:19.0 eth0: entered promiscuous mode

Just in case it helps pinpoint the issue: I've just had this problem for the first time as well, similar to Patrick Schaaf (see comment #11). Yesterday, after upgrading from Fedora 40 to Fedora 41, which includes kernel 6.12.10, I started noticing Ethernet communication stopping completely at random times.

NIC hardware: Intel Corporation Ethernet Connection (16) I219-V [8086:1a1f] (rev 01)

(What seems to alleviate the issue is unplugging the cable and re-inserting it. I have not yet tried other workarounds.)
Anyway, this message kept being shown by dmesg every second or so:

[10164.837549] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
  TDH <c6>
  TDT <e4>
  next_to_use <e4>
  next_to_clean <c5>
buffer_info[next_to_clean]:
  time_stamp <10094e81e>
  next_to_watch <c6>
  jiffies <100968240>
  next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <7800>
PHY Extended Status <3000>
PCI Status <10>
[10166.822527] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
  TDH <c6>
  TDT <e4>
  next_to_use <e4>
  next_to_clean <c5>
buffer_info[next_to_clean]:
  time_stamp <10094e81e>
  next_to_watch <c6>
  jiffies <100968a01>
  next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <7800>
PHY Extended Status <3000>
PCI Status <10>

Is there a fix for this? I'm seeing it on a Dell Precision T5600 running Linux Mint 21.3.

ethtool -i enp0s25
driver: e1000e
version: 5.15.0-131-generic
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ethtool --show-features enp0s25 |egrep "gso|gro|tso"
tx-gso-robust: off [fixed]
tx-gso-partial: off [fixed]
tx-gso-list: off [fixed]
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off

I also added pcie_aspm=off to GRUB.

For me the issue seems to happen more often when I'm running Docker containers and/or libvirt-based VMs.

Does disabling TSO still work around this? It seems like this has re(?)appeared with a vengeance circa 6.11 / 6.12, so I'm wondering if the root cause has changed. Here are two more reports of what is likely the same issue: https://bugzilla.kernel.org/show_bug.cgi?id=219489

To answer my own question: yes, disabling TSO can still fix the problem. I thought this was a new issue in 6.12.x, but it was happening with the old kernel, too.
What changed is that in 6.12 the NIC is (apparently) no longer able to recover on its own. I could still see the hangs/resets in dmesg on the older kernel though. So, I disabled TSO, and the dmesg entries stopped. After a while I worked up the courage to reboot into 6.12. Even that has been running for a while now without issue (now that TSO is off).

(In reply to Jelle Geerts from comment #16)
> For me the issue seems to happen more often when I'm running either Docker
> containers and/or libvirt-based VMs.

Huh, funny you should mention that. I think that was happening to me too. I was running multiple VirtualBox VMs, with Docker inside of those.

In my case, the issue occurs despite TSO being disabled. Perhaps that has to do with other offloading settings such as 'generic-segmentation-offload', which was actually enabled here, even though 'tcp-segmentation-offload' was disabled. By 'the issue' I mean the endless e1000e 'Detected Hardware Unit Hang' kernel messages. The driver is unable to recover the NIC to a working state.

Kernel version: 6.12.10

The output below clearly shows that TSO (tcp-segmentation-offload) is disabled but GSO (generic-segmentation-offload) is *enabled*. Note that I did *not* change any settings. These are the defaults.
$ ethtool --show-features enp0s31f6 |egrep "gso|gro|tso|offload"
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
tx-gso-robust: off [fixed]
tx-gso-partial: off [fixed]
tx-gso-list: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
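The last comment turns on reading which offloads are on, which are off, and which are marked `[fixed]` and so cannot be changed. The following sketch parses `ethtool --show-features` text and lists the segmentation offloads that are still enabled and changeable. The function names are hypothetical, and the idea that GSO is the relevant remaining offload is that commenter's conjecture, not an established cause.

```python
def parse_features(text: str) -> dict:
    """Map feature name -> (enabled, fixed) from `ethtool --show-features` text."""
    features = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        words = rest.split()
        # Skip headers such as "Features for eth0:" and anything malformed.
        if not sep or not words or words[0] not in ("on", "off"):
            continue
        features[name.strip()] = (words[0] == "on", "[fixed]" in words)
    return features


def enabled_changeable(features: dict, keyword: str = "segmentation") -> list:
    """Names containing `keyword` that are on and not marked [fixed]."""
    return sorted(
        name
        for name, (enabled, fixed) in features.items()
        if keyword in name and enabled and not fixed
    )


if __name__ == "__main__":
    # Excerpt of the output quoted in the last comment.
    sample = """\
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
tx-gso-partial: off [fixed]
"""
    print(enabled_changeable(parse_features(sample)))
```

On the excerpt above this prints only `generic-segmentation-offload`, matching what the commenter observed by eye. Passing `keyword="offload"` widens the check to every offload that can still be toggled.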