Bug 216257 - igc: Detected Tx Unit Hang after losing carrier
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-17 09:15 UTC by xnoreq
Modified: 2023-11-02 07:57 UTC
CC List: 7 users

See Also:
Kernel Version: 6.0.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output at time of error (3.53 KB, text/plain)
2022-07-17 09:15 UTC, xnoreq
dmesg lts-5.15.70 (3.35 KB, text/plain)
2022-10-08 08:46 UTC, xnoreq
dmesg 6.0.1 (3.80 KB, text/plain)
2022-10-17 08:34 UTC, xnoreq
another crash with linux 6.0.7 (4.25 KB, text/plain)
2022-11-18 13:39 UTC, xnoreq
yet another crash with linux 6.0.7 (4.51 KB, text/plain)
2022-11-18 13:40 UTC, xnoreq

Description xnoreq 2022-07-17 09:15:04 UTC
Created attachment 301447 [details]
dmesg output at time of error

I have a router with multiple Intel I225-V ethernet controllers.
eth0, eth1, and eth2 are bridged (br-lan).

After powering down a computer connected through eth0, errors started showing up in dmesg (see attachment) and packet loss increased.
Comment 2 xnoreq 2022-10-08 08:46:24 UTC
Created attachment 302959 [details]
dmesg lts-5.15.70

It also happened with an LTS kernel:
Linux version 5.15.70-1-lts (linux-lts@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP Fri, 23 Sep 2022 16:05:15 +0000

I have attached the dmesg output from the time of the error again.
Comment 3 xnoreq 2022-10-17 08:34:10 UTC
Created attachment 303019 [details]
dmesg 6.0.1

Another driver crash with Linux 6.0.1
Comment 4 xnoreq 2022-11-18 13:39:54 UTC
Created attachment 303210 [details]
another crash with linux 6.0.7
Comment 5 xnoreq 2022-11-18 13:40:10 UTC
Created attachment 303211 [details]
yet another crash with linux 6.0.7
Comment 6 Stefan 2023-02-27 10:31:27 UTC
I see the same problem on several different NUC11TNKi3/NUC11TNBi3 devices.

root@09999103:~ uname -a
Linux 09999103 6.1.11 #1 SMP PREEMPT_DYNAMIC 2021-08-01T00:00:00+00:00 x86_64 GNU/Linux

57:00.0 Ethernet controller: Intel Corporation Intel(R) Ethernet Controller I225-LM (rev 03)
	Subsystem: Intel Corporation Device 3002
	Flags: bus master, fast devsel, latency 0, IRQ 17, IOMMU group 14
	Memory at 6a200000 (32-bit, non-prefetchable) [size=1M]
	Memory at 6a300000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 48-21-0b-ff-ff-32-3b-5f
	Capabilities: [1c0] Latency Tolerance Reporting
	Capabilities: [1f0] Precision Time Measurement
	Capabilities: [1e0] L1 PM Substates
	Kernel driver in use: igc


I ran a test on 5 devices over the weekend and all of them ended up broken.
What I did:
I ran the systems with Ethernet connected to a switch, then removed the uplink so that no DHCP was available.
When I came back on Monday and reconnected the uplink, none of them received an IP address.

Connected over Wi-Fi, I saw this error message when I disconnected the RJ45 cable and reconnected it.

Feb 27 07:26:42 09999101 systemd-networkd[501]: eth0: Lost carrier
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: NIC Link is Down
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: Register Dump
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: Register Name   Value
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: CTRL            181c0641
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: STATUS          40280691   
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: CTRL_EXT        10000040
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: MDIC            18017949
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: ICR             00000000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RCTL            04408022
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RDLEN[0-3]      00001000 00001000 00001000 00001000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RDH[0-3]        00000018 0000000c 00000082 00000029
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RDT[0-3]        00000017 0000000b 00000081 00000028
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RXDCTL[0-3]     02040808 02040808 02040808 02040808
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RDBAL[0-3]      ffffb000 ffffa000 ffff9000 ffff8000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: RDBAH[0-3]      00000000 00000000 00000000 00000000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TCTL            a503f0fa
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TDBAL[0-3]      fffff000 ffffe000 ffffd000 ffffc000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TDBAH[0-3]      00000000 00000000 00000000 00000000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TDLEN[0-3]      00001000 00001000 00001000 00001000
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TDH[0-3]        00000001 00000002 00000021 0000000c
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TDT[0-3]        00000001 00000002 00000025 0000000c
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: TXDCTL[0-3]     02100108 02100108 02100108 02100108
Feb 27 07:26:42 09999101 kernel: igc 0000:57:00.0 eth0: Reset adapter
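
One way to read the Tx part of the dump above: TDH is the ring head (advanced by the hardware) and TDT is the tail (advanced by the driver), so a queue where the two differ still has descriptors the NIC has not processed. The following is only a rough sketch for checking that per queue. The 16-byte descriptor size and the helper names `parse_dump` and `pending_descriptors` are assumptions made for this illustration, not something taken from the driver source.

```python
import re

# Tx excerpt of the register dump above, with the journal prefix stripped.
DUMP = """\
TDLEN[0-3]      00001000 00001000 00001000 00001000
TDH[0-3]        00000001 00000002 00000021 0000000c
TDT[0-3]        00000001 00000002 00000025 0000000c
"""

# Assumption: each Tx descriptor occupies 16 bytes, so TDLEN / 16 = ring size.
DESC_SIZE = 16


def parse_dump(text):
    """Map each per-queue register name to a list of integer values."""
    regs = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\[0-3\]\s+((?:[0-9a-f]{8}\s*)+)$", line)
        if m:
            regs[m.group(1)] = [int(v, 16) for v in m.group(2).split()]
    return regs


def pending_descriptors(regs):
    """Per queue: descriptors between head (TDH) and tail (TDT), modulo ring size."""
    result = []
    for tdh, tdt, tdlen in zip(regs["TDH"], regs["TDT"], regs["TDLEN"]):
        ring_size = tdlen // DESC_SIZE
        result.append((tdt - tdh) % ring_size)
    return result


regs = parse_dump(DUMP)
pending = pending_descriptors(regs)
for queue, count in enumerate(pending):
    print(f"queue {queue}: {count} pending descriptor(s)")
```

With the values from this dump, only queue 2 has TDH and TDT apart (0x21 versus 0x25). Whether that is related to the hang reports naming Tx queue 2 is not established here.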


I saw a lot of such messages:

Feb 27 07:01:02 09999101 kernel: igc 0000:57:00.0 eth0: Detected Tx Unit Hang
                                              Tx Queue             <2>
                                              TDH                  <0>
                                              TDT                  <0>
                                              next_to_use          <0>
                                              next_to_clean        <0>
                                            buffer_info[next_to_clean]
                                              time_stamp           <10c55f45b>
                                              next_to_watch        <0000000012555140>
                                              jiffies              <10dc6cbc0>
                                              desc.status          <280200>
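
To compare many of these hang reports, it can help to turn one into numbers. The sketch below parses the fields of the message above and computes how far `jiffies` is from the buffer's `time_stamp`. It assumes every value is hexadecimal (a single-digit queue index reads the same either way) and assumes HZ=1000 for the conversion to seconds; the real HZ depends on the kernel configuration. `parse_hang` is a hypothetical helper name.

```python
import re

# Fields of the "Detected Tx Unit Hang" message above.
HANG = """\
Tx Queue             <2>
TDH                  <0>
TDT                  <0>
next_to_use          <0>
next_to_clean        <0>
time_stamp           <10c55f45b>
next_to_watch        <0000000012555140>
jiffies              <10dc6cbc0>
desc.status          <280200>
"""

# Assumption: the kernel tick rate; check CONFIG_HZ on the affected system.
HZ = 1000


def parse_hang(text):
    """Return {field name: integer value}, reading every value as hex."""
    fields = {}
    for m in re.finditer(r"^\s*([\w. ]+?)\s+<([0-9a-f]+)>\s*$", text, re.M):
        fields[m.group(1)] = int(m.group(2), 16)
    return fields


fields = parse_hang(HANG)
elapsed = fields["jiffies"] - fields["time_stamp"]
print(f"queue {fields['Tx Queue']}: TDH={fields['TDH']:#x} TDT={fields['TDT']:#x}")
print(f"time since time_stamp: {elapsed} jiffies (~{elapsed / HZ:.0f} s if HZ={HZ})")
```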


I removed the device from the PCI bus and did a rescan --> same problem. The error messages from the kernel are gone,
but no packets are transmitted. Receiving packets works.
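
For anyone trying to reproduce the remove and rescan step: the exact commands used are not given above, but one common way is through the standard sysfs files `/sys/bus/pci/devices/<address>/remove` and `/sys/bus/pci/rescan`. A minimal sketch, assuming the device address 0000:57:00.0 from the logs; the function name and the `sysfs_root` parameter are hypothetical and only exist so the sketch can be tried against a scratch directory.

```python
from pathlib import Path


def pci_remove_and_rescan(bdf, sysfs_root="/sys"):
    """Detach one PCI device, then ask the kernel to rescan the bus.

    bdf is the PCI address, e.g. "0000:57:00.0". Writing to the real
    sysfs files needs root and takes the interface down immediately.
    """
    root = Path(sysfs_root)
    remove = root / "bus/pci/devices" / bdf / "remove"
    rescan = root / "bus/pci/rescan"
    remove.write_text("1")  # unbind the driver and remove the device
    rescan.write_text("1")  # rediscover devices, which re-probes igc
    return remove, rescan


# Example (as root, on the affected machine):
# pci_remove_and_rescan("0000:57:00.0")
```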
Comment 7 xnoreq 2023-02-27 11:04:11 UTC
A temporary workaround that I'm using at the moment is disabling "WoL link speed reduction" on the connected computer. This feature reduced the link speed to 10 Mbps every time that computer went into standby, which apparently (and randomly) triggered this bug.
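
To check on the router side whether a port is currently in that reduced-speed state, the `carrier` and `speed` attributes under `/sys/class/net/<iface>/` can be read. This is only a sketch: the helper names, the `sysfs_root` parameter, and the 10 Mbps threshold are assumptions for illustration, and reading `speed` can fail while the link is down, which is treated as "unknown" here.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys"):
    """Return carrier (0/1) and speed in Mbps, or None where unreadable."""
    base = Path(sysfs_root) / "class/net" / iface

    def read_int(name):
        try:
            return int((base / name).read_text().strip())
        except (OSError, ValueError):
            return None  # attribute missing, or link down

    return {"carrier": read_int("carrier"), "speed_mbps": read_int("speed")}


def is_reduced_speed(state, threshold_mbps=10):
    """True if the link is up and negotiated at or below the threshold."""
    speed = state["speed_mbps"]
    return state["carrier"] == 1 and speed is not None and 0 < speed <= threshold_mbps


# Example: print(is_reduced_speed(read_link_state("eth0")))
```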
Comment 8 Schebicky 2023-11-02 07:57:32 UTC
Is it fixed in the latest stable 6.5.9 kernel?
