Bug 217120 - e1000e - slow receive / rx - i219-LM (Alder Lake)
Summary: e1000e - slow receive / rx - i219-LM (Alder Lake)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-03 17:55 UTC by Scott Silverman
Modified: 2023-09-26 13:44 UTC

See Also:
Kernel Version: 5.15 and newer
Subsystem:
Regression: No
Bisected commit-id:



Description Scott Silverman 2023-03-03 17:55:10 UTC
When running kernels 5.15 and newer, poor rx speeds are observed on Alder Lake i219-LM (8086:1A1C) on a Dell Precision 3260 Compact.

To Reproduce:
Using iperf3 from the system under test:
`iperf3 -c <server on local LAN> -R`

Under kernels 5.7-5.14 (inclusive) performance is near line rate, ~990 Mbps.
From kernels 5.15 through 6.2 performance is approximately 10-20 Mbps.

Also of note, the issue does not appear with servers reached over the internet (i.e. higher-latency links). Over a link with approximately 20 ms of latency I was able to achieve about 600 Mbps.

Interesting workaround: on affected kernels (5.15+), I am able to resolve the issue if an additional device is installed in the system's PCIe slot (tested with an Intel i210 NIC, not connected to the LAN, and an Nvidia Quadro P620 GPU).
Comment 1 Scott Silverman 2023-03-14 21:24:26 UTC
I've been able to refine the initial report: there is no apparent relation to the RTT; the MSS is the trigger. The device under test is on a LAN segment with a 9000-byte MTU. iperf3 uses a default MSS of 8960 in this case, and the issue is present. If the MSS is manually reduced to around 7000 bytes or lower, performance returns to reasonable levels. As the MSS is increased from there, performance drops off fairly steadily, down to the worst case at 8960 bytes.

Sorry for the initial confusion around higher latency connections.
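
For anyone wanting to reproduce the MSS dependence, a sweep along the following lines may help. This is only a sketch, not the exact procedure used above: the function names, the 6960 start value, and the 500-byte step are illustrative assumptions, and SERVER is a placeholder for an iperf3 server on the local LAN. It relies on iperf3's `-M` (set MSS) option, and on the default MSS being the MTU minus 40 bytes of IPv4 and TCP headers (9000 - 40 = 8960).

```shell
#!/bin/sh
# Sketch: sweep the TCP MSS to find where receive throughput starts to drop.
# All names and values here are illustrative, not taken from the report.

# MSS for IPv4 TCP without options: MTU minus 20 (IP header) minus 20 (TCP header).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Print one reverse-mode iperf3 command per MSS value, from START up to END
# in STEP increments. Commands are printed, not run, so they can be reviewed.
mss_sweep_commands() {
    server=$1; start=$2; end=$3; step=$4
    mss=$start
    while [ "$mss" -le "$end" ]; do
        echo "iperf3 -c $server -R -M $mss"
        mss=$(( mss + step ))
    done
}

# Example (SERVER is a placeholder):
#   mss_sweep_commands "$SERVER" 6960 "$(mss_for_mtu 9000)" 500 | sh
```

Printing the commands instead of running them keeps the sweep easy to inspect; piping the output to `sh` runs the tests one after another.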
Comment 2 Paul Menzel 2023-05-10 15:22:24 UTC
Could you bisect the issue please?
Comment 3 Scott Silverman 2023-07-14 21:16:15 UTC
(In reply to Paul Menzel from comment #2)
> Could you bisect the issue please?

Sorry, I've been trying, and failing, to figure out how to do that properly.

I can say that 6.4 still presents the same issue.

I can also add that if a UDP receive test is performed instead of a TCP test, I can clearly see the traffic ending up as rx_missed_errors in `ethtool -S`.
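
One way to quantify this is to read the counter before and after a test and take the difference. The sketch below assumes the usual `name: value` layout of `ethtool -S` output; the function names, the interface name eth0, and the 1G UDP rate are hypothetical choices, not details from this report.

```shell
#!/bin/sh
# Sketch: measure how much rx_missed_errors grows while a command runs.

# Extract the rx_missed_errors value from `ethtool -S` output read on stdin.
rx_missed() {
    awk '$1 == "rx_missed_errors:" { print $2 }'
}

# Print the increase in rx_missed_errors on IFACE while COMMAND runs.
# Usage: missed_during IFACE COMMAND [ARGS...]
missed_during() {
    iface=$1; shift
    before=$(ethtool -S "$iface" | rx_missed)
    "$@" > /dev/null
    after=$(ethtool -S "$iface" | rx_missed)
    echo $(( after - before ))
}

# Example (eth0 and SERVER are placeholders):
#   missed_during eth0 iperf3 -c "$SERVER" -u -b 1G -R
```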

I've attempted a few workarounds I've seen for similar issues, but did not observe any differences with any of them:
- intel_idle.max_cstate=1
- pcie_aspm=off
- unload mei modules
- disable MEI/AMT in firmware
- set "/sys/bus/pci/devices/<device>/power/control" to "on" (from auto)
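
When retrying these workarounds, it can be worth confirming that each setting actually took effect. A minimal sketch follows, with hypothetical helper names: the first function checks whether a boot parameter is present on the kernel command line, and the second writes the runtime power management mode for a PCI device. The device address in the example is a placeholder, and SYSFS_ROOT is an assumed override that exists only so the helper can be tried outside /sys.

```shell
#!/bin/sh
# Sketch: verify boot parameters and set the PCI power/control knob.

# Succeed if PARAM appears as a whole word in FILE (default: /proc/cmdline).
cmdline_has() {
    param=$1; file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Write MODE ("on" or "auto") to the power/control file of a PCI DEVICE.
# Needs root on a real system; SYSFS_ROOT defaults to /sys.
set_pci_power_control() {
    device=$1; mode=$2
    echo "$mode" > "${SYSFS_ROOT:-/sys}/bus/pci/devices/$device/power/control"
}

# Example (0000:00:1f.6 is a placeholder address):
#   cmdline_has pcie_aspm=off && echo "ASPM disabled at boot"
#   set_pci_power_control 0000:00:1f.6 on
```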
Comment 4 Scott Silverman 2023-07-14 21:25:28 UTC
My hunch is that this worked when the "e1000_pch_tgp" workaround was added ("Subject: [PATCH v1 2/2] e1000e: Fixing packet loss issues on new platforms") and then was broken again when alder lake was split out from tgp, but the workaround wasn't included ("[net,1/2] e1000e: Separate ADP board type from TGP").

I'm not exactly sure how to test that theory, but on Monday I'll try to figure out how to build a module with a version of that workaround for e1000_pch_adp and see how it goes.
Comment 5 Scott Silverman 2023-07-17 18:05:02 UTC
My hunch was wrong.
Comment 6 Bagas Sanjaya 2023-09-26 00:56:47 UTC
(In reply to Scott Silverman from comment #4)
> My hunch is that this worked when the "e1000_pch_tgp" workaround was added
> ("Subject: [PATCH v1 2/2] e1000e: Fixing packet loss issues on new
> platforms") and then was broken again when alder lake was split out from
> tgp, but the workaround wasn't included ("[net,1/2] e1000e: Separate ADP
> board type from TGP").
> 
> I'm not exactly sure how to test that theory, but on Monday I'll try to
> figure out how to build a module with a version of that workaround for
> e1000_pch_adp and see how it goes.

Do you have the workaround (as a patch maybe)?
Comment 7 Scott Silverman 2023-09-26 13:44:04 UTC
(In reply to Bagas Sanjaya from comment #6)
> (In reply to Scott Silverman from comment #4)
> > My hunch is that this worked when the "e1000_pch_tgp" workaround was added
> > ("Subject: [PATCH v1 2/2] e1000e: Fixing packet loss issues on new
> > platforms") and then was broken again when alder lake was split out from
> > tgp, but the workaround wasn't included ("[net,1/2] e1000e: Separate ADP
> > board type from TGP").
> > 
> > I'm not exactly sure how to test that theory, but on Monday I'll try to
> > figure out how to build a module with a version of that workaround for
> > e1000_pch_adp and see how it goes.
> 
> Do you have the workaround (as a patch maybe)?

This was the patch. Assuming I did it correctly, I was just trying to have it also apply on Alder Lake, but either I did that wrong or it simply wasn't effective.

https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20210922065542.3780389-1-sasha.neftin@intel.com/#2755197

I'm curious, are you also seeing the same behavior?
