Bug 217678 - Unexplainable packet drop starting at v6.4
Summary: Unexplainable packet drop starting at v6.4
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-17 17:44 UTC by hq.dev+kernel
Modified: 2023-10-27 07:20 UTC
CC List: 7 users

See Also:
Kernel Version: 6.4
Subsystem:
Regression: Yes
Bisected commit-id: e9031f2da1aef34b0b4c659ead613c335b46ae92


Attachments
dmesg on commit d42b1c47570eb2ed818dc3fe94b2678124af109d (94.74 KB, text/plain)
2023-07-18 03:28 UTC, hq.dev+kernel
lspci on commit d42b1c47570eb2ed818dc3fe94b2678124af109d (2.87 KB, text/plain)
2023-07-18 03:29 UTC, hq.dev+kernel
dmesg on commit 6e98b09da931a00bf4e0477d0fa52748bf28fcce (130.95 KB, text/plain)
2023-07-19 23:37 UTC, hq.dev+kernel
bisect log for getting bad commit e9031f2da1aef34b0b4c659ead613c335b46ae92 (2.26 KB, text/plain)
2023-08-05 04:54 UTC, hq.dev+kernel
Temp Patch to try out (1.38 KB, patch)
2023-09-18 07:10 UTC, Tirthendu Sarkar
Patch for using next_to_process for calculating unused descriptors (3.23 KB, patch)
2023-09-20 11:08 UTC, Tirthendu Sarkar
Patch with debug prints (1.42 KB, patch)
2023-09-28 14:52 UTC, Tirthendu Sarkar
dmesg from journalctl until 2 minutes after lost reachability to 1.1.1.1 (443.58 KB, application/gzip)
2023-09-28 17:35 UTC, hq.dev+kernel
Patch with temp fix and debug prints (1.86 KB, patch)
2023-09-29 11:21 UTC, Tirthendu Sarkar
dmesg from journalctl after applying `0001-i40e-fix-and-debug-prints.patch` (43.73 KB, application/gzip)
2023-10-03 02:36 UTC, hq.dev+kernel

Description hq.dev+kernel 2023-07-17 17:44:27 UTC
Hi,

After I updated to 6.4 through the Arch Linux kernel update, I suddenly noticed random packet loss on my router-like nodes. The networking-relevant configuration on these nodes is:

1. Arch Linux
2. Network configuration through systemd-networkd
3. bird2 for BGP routing, but that is not relevant to this bug
4. nftables for traffic control, but that also seems not relevant to this bug
5. No fail2ban-like dynamic filtering tools, at least at the L3/L4 level

After ruling out systemd-networkd- and nftables-related issues, I tracked the problem down to the kernel.

Here's the tcpdump output I'm seeing on one side, on my node:

```
sudo tcpdump -i fios_wan port 38851
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on fios_wan, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:33:06.073236 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:11.406607 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:16.739969 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:21.859856 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:27.193176 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
5 packets captured
5 packets received by filter
0 packets dropped by kernel
```

But on the other side, "[REDACTED_PUBLIC_IPv4_1]", tcpdump shows reply packets being sent in this WireGuard stream. So the packets are being lost somewhere along the link.

From the other side, I can run "mtr" to "[BOS1_NODE]"'s public IP, and the hop where the path dies is "[BOS1_NODE]" itself. That means "[BOS1_NODE]"'s networking stack is completely dropping inbound packets from specific IP addresses.

Some more digging:

1. The issue appears after boot with varying delays: sometimes it triggers 30 seconds after booting, sometimes only after 18 hours or more.
2. It can evolve into a worse state where "ip neigh show" reports the IPv4 ARP and IPv6 neighbor discovery entries as invalid, meaning internet connectivity is completely lost.
3. When this happens on the WAN-facing interface, the LAN-facing interfaces remain fine. The WAN interface is an Intel X710-T4L using i40e; the LAN side uses virtio.
4. I tried to bisect between 6.3 and 6.4, and the first bad commit reported was "a3efabee5878b8d7b1863debb78cb7129d07a346". But that commit is not related to networking at all, so it may be the wrong one to look at. Also, since I haven't found a way to trigger the issue 100% reliably, some commits marked "good" during the bisect may actually have been bad. (A sketch of how such a bisect round can be gated follows this list.)
5. I also looked at "dmesg"; nothing interesting popped up, but I'll make it available upon request.
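Since a "good" verdict is only as strong as the soak time, one round of such a bisect could be gated roughly like this; a minimal sketch, assuming the v6.3/v6.4 tags and a 2-hour soak window per candidate (building and rebooting into each candidate remains manual):

```
git bisect start v6.4 v6.3
# build, install, and reboot into the candidate kernel, start the usual
# traffic load, then decide only after the full soak window:
if timeout 7200 sh -c 'until ! ping -c 1 -W 2 1.1.1.1 >/dev/null; do sleep 10; done'; then
    git bisect bad    # reachability was lost within the soak window
else
    git bisect good   # survived the full 2 h (may still be a false negative)
fi
```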

This is my first bug report. Sorry for any confusion it may cause, and thanks for reading.
Comment 1 Bagas Sanjaya 2023-07-18 00:20:53 UTC
(In reply to hq.dev+kernel from comment #0)
> [...]

Can you repeat the bisection with a well-defined reproducer?
Comment 2 Bagas Sanjaya 2023-07-18 00:38:54 UTC
Can you also attach dmesg output?
Comment 3 Bagas Sanjaya 2023-07-18 00:39:31 UTC
(In reply to hq.dev+kernel from comment #0)
> [...]

What is your hardware (attach lspci please)?
Comment 4 hq.dev+kernel 2023-07-18 03:28:56 UTC
Created attachment 304648 [details]
dmesg on commit d42b1c47570eb2ed818dc3fe94b2678124af109d
Comment 5 hq.dev+kernel 2023-07-18 03:29:26 UTC
Created attachment 304649 [details]
lspci on commit d42b1c47570eb2ed818dc3fe94b2678124af109d
Comment 6 hq.dev+kernel 2023-07-18 03:36:30 UTC
Thanks for your reply. I'll redo the bisection as best I can.

Also, one more problem I missed in the original report.

When this bug happens, I cannot even ping well-known IP addresses such as 1.1.1.1. Some tcpdump output:

```
sudo tcpdump -i fios_wan icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on fios_wan, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:31:33.578502 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 563, length 64
23:31:34.591833 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 564, length 64
23:31:35.605189 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 565, length 64
23:31:36.618502 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 566, length 64
23:31:37.631837 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 567, length 64
23:31:38.645170 IP [BOS1_NODE] > one.one.one.one: ICMP echo request, id 2378, seq 568, length 64
^C
6 packets captured
53 packets received by filter
```

So I never receive a reply.

Meanwhile, I'm debugging behind double NAT, and the first NAT layer (closest to the WAN) can see the replies from 1.1.1.1, which means neither Cloudflare nor my ISP is causing the issue.

Below is the output of "ip neigh show dev fios_wan":

```
ip neigh show dev fios_wan
192.168.1.1 lladdr 78:67:0e:d6:14:1a REACHABLE 
fe80::7a67:eff:fed6:141a lladdr 78:67:0e:d6:14:1a router STALE
```

This happened roughly 1.5 hours after booting.

Thank you; I'll redo the bisection.
Comment 7 Bagas Sanjaya 2023-07-18 03:40:27 UTC
(In reply to hq.dev+kernel from comment #4)
> Created attachment 304648 [details]
> dmesg on commit d42b1c47570eb2ed818dc3fe94b2678124af109d

Are you in the middle of a bisection?
Comment 8 hq.dev+kernel 2023-07-18 03:43:43 UTC
(In reply to Bagas Sanjaya from comment #7)
> (In reply to hq.dev+kernel from comment #4)
> > Created attachment 304648 [details]
> > dmesg on commit d42b1c47570eb2ed818dc3fe94b2678124af109d
> 
> Are you in the middle of a bisection?

Yes. No conclusion yet.
Comment 9 hq.dev+kernel 2023-07-19 23:36:52 UTC
OK, I redid the bisection and this time it reports `6e98b09da931a00bf4e0477d0fa52748bf28fcce` as the first bad commit.

My test method:

1. Boot the system.
2. Open two mtr sessions, one to 1.1.1.1 and one to 2606:4700:4700::1111. Concurrently, the system spawns three WireGuard links to my servers.
3. Run "iperf3 -c [SERVER INTERNAL IP] -t0" to generate constant traffic.
4. Wait 1.5 hours and watch whether WireGuard disconnects or mtr breaks. Otherwise, mark the commit as good. (A scripted sketch of this loop follows below.)

I attached dmesg1.txt from a run on "6e98b09da931a00bf4e0477d0fa52748bf28fcce" as well. Again, nothing special in the log.
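A minimal scripted version of this watch loop, for reference (the server address placeholder is from step 3, the WireGuard links are assumed to come up from the system configuration, and the 1.5 h window matches step 4):

```
# steps 2/3: background path monitoring and constant traffic
mtr --report --no-dns -c 5400 1.1.1.1 > mtr4.log &
mtr --report --no-dns -c 5400 2606:4700:4700::1111 > mtr6.log &
iperf3 -c [SERVER INTERNAL IP] -t 0 &

# step 4: poll reachability for 1.5 hours; any failure marks the commit bad
end=$(( $(date +%s) + 5400 ))
while [ "$(date +%s)" -lt "$end" ]; do
    ping -c 1 -W 2 1.1.1.1 >/dev/null || { echo "bad: lost 1.1.1.1 at $(date)"; exit 1; }
    sleep 10
done
echo "good: survived the window"
```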
Comment 10 hq.dev+kernel 2023-07-19 23:37:31 UTC
Created attachment 304669 [details]
dmesg on commit 6e98b09da931a00bf4e0477d0fa52748bf28fcce
Comment 11 Bagas Sanjaya 2023-07-26 04:28:24 UTC
(In reply to hq.dev+kernel from comment #9)
> [...]

* Can you try 9b78d919632b71 (net-next tip before merge)?
* Please also attach your bisection log.
Comment 12 hq.dev+kernel 2023-08-05 04:53:04 UTC
Thanks. 

Because the false-negative rate was so high, I had to be more careful before declaring a commit good.

I did more rounds of bisection; before calling a commit good, I made sure that:

1. The system was booted two or more times.
2. None of the boots showed any sign of a corrupted networking stack after at least 2 hours of uptime.
3. "Bad status" is defined as: 3.1) any mtr to a well-known address (1.1.1.1, 2606:4700:4700::1111) returns "no route to host"; 3.2) unusual HTTP response times occur; 3.3) a WireGuard link goes down for no reason. (A sketch of this check follows below.)
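For reference, a minimal sketch of that pass/fail check; the peer name wg0 and the HTTP target are placeholders, and the exact commands are illustrative:

```
# 3.1) reachability of well-known addresses
ping -c 3 -W 2 1.1.1.1 >/dev/null                 || echo "bad: v4 unreachable"
ping -6 -c 3 -W 2 2606:4700:4700::1111 >/dev/null || echo "bad: v6 unreachable"
# 3.2) unusual HTTP response time (inspect manually; no fixed threshold)
curl -o /dev/null -s -w '%{time_total}\n' https://www.kernel.org/
# 3.3) WireGuard link health: a stale handshake means the link is down
sudo wg show wg0 latest-handshakes
```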

This time the bisection reports "e9031f2da1aef34b0b4c659ead613c335b46ae92" as the first bad commit.

After these rounds of bisection, some mysteries remain:

1. Why the issue cannot be observed on every boot.
2. Whether there is a reliable way of reproducing it other than waiting for several hours (2 hours in my experiments).
3. Why `dmesg` shows nothing strange across these runs.
Comment 13 hq.dev+kernel 2023-08-05 04:54:37 UTC
Created attachment 304781 [details]
bisect log for getting bad commit e9031f2da1aef34b0b4c659ead613c335b46ae92
Comment 14 Solomon Peachy 2023-08-15 15:09:09 UTC
I've run into the same problem.

 * Genuine Intel XXV710-DA2 (dual port 10/25GbE)
 * Single port utilized w/ 10GbE fiber
 * HP Z440 workstation (graphical desktop usage)

Currently on the Fedora 6.4.8-200.fc38 kernel; this problem did not occur with 6.3.X

I don't have a way to intentionally trigger the problem, but it happens reliably, multiple times a day, even under very light loads.

The problem manifests itself as intermittent DNS lookup failures or TCP connection hangs; IPv4 or IPv6 seems to make no difference. Sometimes the connections recover eventually, but they usually do not.

Removing the i40e module and reloading it fixes the problem immediately (the reload commands are sketched after the snippet below). Here is a representative dmesg snippet from a reload-to-reload cycle; the time-to-failure was a mere 37 minutes:

[320442.557276] i40e: Intel(R) Ethernet Connection XL710 Network Driver
[320442.557280] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
[320442.570051] i40e 0000:03:00.0: fw 6.0.48442 api 1.7 nvm 6.02 0x80003620 1.1747.0 [8086:158b] [8086:0002]
[320442.804016] i40e 0000:03:00.0: MAC address: 6c:b3:11:50:cc:bc
[320442.813914] i40e 0000:03:00.0 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[320442.816360] i40e 0000:03:00.0: PCI-Express: Speed 8.0GT/s Width x8
[320442.825608] i40e 0000:03:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 16 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[320442.837850] i40e 0000:03:00.1: fw 6.0.48442 api 1.7 nvm 6.02 0x80003620 1.1747.0 [8086:158b] [8086:0002]
[320442.842679] i40e 0000:03:00.0 ens5f0: renamed from eth0
[320443.070880] i40e 0000:03:00.1: MAC address: 6c:b3:11:50:cc:be
[320443.075746] i40e 0000:03:00.1: PCI-Express: Speed 8.0GT/s Width x8
[320443.076026] i40e 0000:03:00.1 ens5f1: renamed from eth0
[320443.076206] i40e 0000:03:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 16 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[320443.853617] IPv6: ADDRCONF(NETDEV_CHANGE): ens5f0: link becomes ready
[322691.723080] nfs: server taster.fl.shaftnet.org not responding, still trying
[322692.236454] call_decode: 2 callbacks suppressed
[322692.236460] nfs: server taster.fl.shaftnet.org OK
[324872.933163] i40e 0000:03:00.1: i40e_ptp_stop: removed PHC on ens5f1
[324873.218345] i40e 0000:03:00.0: i40e_ptp_stop: removed PHC on ens5f0
[324875.751716] i40e: Intel(R) Ethernet Connection XL710 Network Driver
[324875.751720] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
[324875.764070] i40e 0000:03:00.0: fw 6.0.48442 api 1.7 nvm 6.02 0x80003620 1.1747.0 [8086:158b] [8086:0002]
[324875.997322] i40e 0000:03:00.0: MAC address: 6c:b3:11:50:cc:bc
[324876.009212] i40e 0000:03:00.0 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[324876.011749] i40e 0000:03:00.0: PCI-Express: Speed 8.0GT/s Width x8
[324876.020999] i40e 0000:03:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 16 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[324876.033186] i40e 0000:03:00.1: fw 6.0.48442 api 1.7 nvm 6.02 0x80003620 1.1747.0 [8086:158b] [8086:0002]
[324876.070057] i40e 0000:03:00.0 ens5f0: renamed from eth0
[324876.266179] i40e 0000:03:00.1: MAC address: 6c:b3:11:50:cc:be
[324876.270802] i40e 0000:03:00.1 ens5f1: renamed from eth0
[324876.270892] i40e 0000:03:00.1: PCI-Express: Speed 8.0GT/s Width x8
[324876.271347] i40e 0000:03:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 16 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[324877.000839] IPv6: ADDRCONF(NETDEV_CHANGE): ens5f0: link becomes ready
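
For reference, the reload cycle above is simply the following sketch (it takes the links on all of the card's ports down, and interface configuration has to be re-applied by the network manager afterwards):

```
sudo modprobe -r i40e   # unload the driver (removes the PHCs, links go down)
sudo modprobe i40e      # reload; traffic recovers immediately
```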

lspci output for the device:

03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710-2
        Physical Slot: 5
        Flags: bus master, fast devsel, latency 0, IRQ 36
        Memory at f1000000 (64-bit, prefetchable) [size=16M]
        Memory at f3800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at f0200000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number bc-cc-50-ff-ff-11-b3-6c
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Capabilities: [1d0] Secondary PCI Express
        Kernel driver in use: i40e
        Kernel modules: i40e

03:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710
        Physical Slot: 5
        Flags: bus master, fast devsel, latency 0, IRQ 36
        Memory at f2000000 (64-bit, prefetchable) [size=16M]
        Memory at f3808000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at f0280000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number bc-cc-50-ff-ff-11-b3-6c
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

I have another system with an XXV710-DA2T (i.e. with external timesync support), but it's still on 6.3.12 because that machine is the site router and downtime needs to be scheduled.
Comment 15 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-08-29 14:47:59 UTC
Was this resolved, or does it still happen with 6.5? If it does: did anyone try what Kuba asked for in https://lore.kernel.org/lkml/20230725174712.66a809c4@kernel.org/ ?
Comment 16 hq.dev+kernel 2023-08-29 18:49:33 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #15)
> was this resolved, or does it still happen with 6.5? If it does: did anyone
> try what Kuba asked for in
> https://lore.kernel.org/lkml/20230725174712.66a809c4@kernel.org/ ?

Thanks.

> was this resolved, or does it still happen with 6.5?

It's not resolved, and it's still happening with 6.5. I tested commit 1c59d383390f970b891b503b7f79b63a02db2ec5 today and confirmed it still has the problem.

> did anyone try what Kuba asked for, which is testing whether commit
> "9b78d919632b71" (net-next tip before merge) is bad?

I did. 9b78d919632b71 has the issue. I redid several rounds of bisection, and the earliest bad commit I can identify is e9031f2da1aef34b0b4c659ead613c335b46ae92.
Comment 17 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-08-30 10:14:26 UTC
Forwarded the issue to the right people:

https://lore.kernel.org/regressions/e9644f38-57be-5d26-0c08-08a74eee7cb1@leemhuis.info/
Comment 18 Solomon Peachy 2023-09-08 13:46:44 UTC
I can confirm I am able to trigger this on a second system as well, albeit much more infrequently. It is currently running 6.4.13 with an Intel XXV710-DA2T card.

Motherboard: Supermicro X9SCI (Intel C200 chipset, Xeon E3-1230v2)

01:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Device 000b
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at dc000000 (64-bit, prefetchable) [size=16M]
        Memory at dd808000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at dfd80000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number d0-2e-a6-ff-ff-91-96-b4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Capabilities: [1d0] Secondary PCI Express
        Kernel driver in use: i40e
        Kernel modules: i40e
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at db000000 (64-bit, prefetchable) [size=16M]
        Memory at dd800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at dfd00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number d0-2e-a6-ff-ff-91-96-b4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

I haven't figured out a pattern to this failure yet, though high-traffic connections seem to fare better than ones that see only intermittent/light loads.

Further details upon request.
Comment 19 Tirthendu Sarkar 2023-09-18 07:10:55 UTC
Created attachment 305123 [details]
Temp Patch to try out

Hi,

Can you please try out the attached patch and check if you see any change in the behavior?

Thanks
Tirthendu Sarkar
Comment 20 hq.dev+kernel 2023-09-19 14:41:02 UTC
(In reply to Tirthendu Sarkar from comment #19)
> [...]

Thanks for the patch. Unfortunately the problem persists after patching.
Comment 21 Tirthendu Sarkar 2023-09-20 11:08:21 UTC
Created attachment 305132 [details]
Patch for using next_to_process for calculating unused descriptors

Hi,

Thanks for trying out the previous patch. Can you please try out the attached new patch instead?

Thanks.
Comment 22 hq.dev+kernel 2023-09-21 14:41:02 UTC
(In reply to Tirthendu Sarkar from comment #21)
> [...]

Thanks again for the patch. I applied it and tested/rebooted the VM that uses the i40e module several times. Your patch seems to fix the issue: I tested for about 20 hours and rebooted 4 times to make sure the patch actually works.

Let me know if you want me to help test anything else.
Comment 23 Indrek Järve 2023-09-22 07:25:50 UTC
I was hit by this with Fedora 6.4 kernels as well. Vanilla 6.5.4 plus the patch has been up for 14 hours and so far so good; before that, I couldn't last 10 minutes before NFS mounts went crazy and random things just stopped responding.
Comment 24 Tirthendu Sarkar 2023-09-22 15:28:38 UTC
Thanks for the confirmation. I will probably send an updated patch next week (still need to tie up a few loose ends) to try on the latest master.
Comment 25 Tirthendu Sarkar 2023-09-28 14:52:24 UTC
Created attachment 305158 [details]
Patch with debug prints

Hi,

Can you please try the attached patch with debug prints and provide the dmesg output when the issue occurs?

Thanks.
Comment 26 hq.dev+kernel 2023-09-28 17:35:17 UTC
Created attachment 305160 [details]
dmesg from journalctl until 2 minutes after lost reachability to 1.1.1.1

Hi,

I applied your patch `0001-i40e-debug-prints.patch`; please find attached the dmesg output up to roughly 2 minutes after 1.1.1.1 became unreachable via ping.
Comment 27 Tirthendu Sarkar 2023-09-29 11:21:20 UTC
Created attachment 305161 [details]
Patch with temp fix and debug prints

Hi,

Thanks for providing the log. Can you also try the new patch, check whether the issue occurs, and provide the dmesg output if it does?
 
Thanks.
Comment 28 Indrek Järve 2023-09-30 16:05:18 UTC
Finally had an opening to restart and test as well. Latest master + latest patch: no issues for the last 4 hours, and thus no logs to provide either.

Will keep monitoring, but as said, with my box (X710 on a TR 1950X) the issues on 6.4+ started very repeatably within minutes of booting up, so I'm feeling positive. The original 2nd patch has run fine since last Friday as well.

Thanks.
Comment 29 hq.dev+kernel 2023-10-03 02:36:57 UTC
Created attachment 305180 [details]
dmesg from journalctl after applying `0001-i40e-fix-and-debug-prints.patch`

Thanks for the patch.

I booted the system three times with `0001-i40e-fix-and-debug-prints.patch` applied, over about 10 hours of testing time. No issues found, and dmesg looks normal to me. Please find attached the dmesg from the third boot.

Thanks
Comment 30 Tirthendu Sarkar 2023-10-03 13:35:07 UTC
Thanks to both of you. I will send the patch for upstreaming.
Comment 31 Andrew Rodland 2023-10-12 16:04:57 UTC
I've also had what seems to be the same issue, running an ASUS OEI-10G/X710-2T, which is an OEM Intel X710.

I've applied the patch and I'm testing (looks good so far), but while debugging on my own I found an interesting workaround.

I had noticed that the damage was limited to specific 4-tuples, and suspected a hung queue, so I tried using only queue 0 (with "ethtool -X eth0 equal 1"). This had the predictable result that everything worked fine for a while and then the system went totally unreachable when *all* connections hung instead of just a few of them. A friend of mine said "what if it's only queue 0?" That didn't seem too likely to me but I figured I'd try it anyway, so I did "ethtool -X eth0 start 1 equal 24" to use anything *but* queue 0. And surprisingly, it seemed to work: I ran that way for about a week with no signs of hung connections.
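For reference, the two indirection-table settings described above are the following ("eth0" and the 24-queue count are specific to this system; the second command is the workaround):

```
# route all RSS traffic to queue 0 only -> everything eventually hung
sudo ethtool -X eth0 equal 1
# skip queue 0 and spread across the remaining 24 queues -> no hung connections
sudo ethtool -X eth0 start 1 equal 24
```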
Comment 32 Solomon Peachy 2023-10-15 21:22:36 UTC
I've had 0001-i40e-fix-and-debug-prints.patch applied on top of 6.5.5 for about a week on one system, and 6.5.6 for a couple of days on another.  So far so good.

Any idea when this will get pushed upstream?

Thanks!
Comment 33 Tirthendu Sarkar 2023-10-17 04:39:55 UTC
It is currently in next-queue. Since 6.6-rc6 is already out, I hope it makes it into rc7/8. Most likely it will land in 6.7.
Comment 34 Thorsten Leemhuis 2023-10-17 05:01:42 UTC
(In reply to Tirthendu Sarkar from comment #33)
> It is currently in next-queue. Since 6.6.-rc6 is already out, I hope it
> makes in rc7/8. Most likely in 6.7.

You might want to ask the maintainer to submit the patches to the net tree so they are merged this cycle, as this is a recent regression and thus should be handled roughly like a regression from the current cycle, according to Linus:

https://lore.kernel.org/all/CAHk-=wis_qQy4oDNynNKi5b7Qhosmxtoj1jxo5wmB6SRUwQUBQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgD98pmSK3ZyHk_d9kZ2bhgN6DuNZMAJaV0WTtbkf=RDw@mail.gmail.com/
Comment 35 Andrew Rodland 2023-10-20 15:49:54 UTC
I can also confirm that the patch has been solid for me for over a week (I used the version that went to the netdev list rather than the one in this thread).

I don't mind running a patched kernel for a little while but I do agree that this should be a perfectly good candidate for backporting or late-merging.
