Bug 209725 - ASPM L1 Latency is calculated incorrectly
Summary: ASPM L1 Latency is calculated incorrectly
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2020-10-18 11:32 UTC by Ian Kumlien
Modified: 2020-12-15 23:42 UTC

Kernel Version: 5.9.0
Tree: Mainline
Regression: No


Attachments
lspci without my patches (118.49 KB, text/plain)
2020-10-18 11:32 UTC, Ian Kumlien
lspci with my patches (118.47 KB, text/plain)
2020-10-18 11:33 UTC, Ian Kumlien

Description Ian Kumlien 2020-10-18 11:32:10 UTC
I noticed that my desktop's network card was slow.

I was getting ~40 Mbit/s instead of the expected ~933 Mbit/s when connecting to machines on the internet (~6 hops); locally all was fine.

After some investigation, with help from Alexander Duyck, we discovered that disabling L1 ASPM on the network card fixed it.

This led me to dig around in the code, where I eventually discovered that the ASPM latency check is incorrect; correcting it also fixes my issue as a side effect.
Comment 1 Ian Kumlien 2020-10-18 11:32:48 UTC
Created attachment 293047
lspci without my patches
Comment 2 Ian Kumlien 2020-10-18 11:33:08 UTC
Created attachment 293049
lspci with my patches
Comment 3 Ian Kumlien 2020-10-18 11:34:45 UTC
With the fix, L1 ASPM is disabled on the 0000:01:00.0-0000:00:01.2 link when 04:00.0 is handled; both devices are connected through that link.
Comment 4 Ian Kumlien 2020-10-24 16:03:05 UTC
The PCIe paths for the devices are:
00:01.2/01:00.0/02:03.0/03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

and:
00:01.2/01:00.0/02:04.0/04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)

So they share 01:00.0 as a switch.
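
To make the shared part of the two paths explicit, here is a minimal sketch that compares the two path strings listed above. The helper name `shared_upstream` is hypothetical and only does string comparison; it does not query the system.

```python
def shared_upstream(path_a, path_b):
    """Return the leading devices that two slash-separated PCIe paths share."""
    shared = []
    for a, b in zip(path_a.split("/"), path_b.split("/")):
        if a != b:
            break
        shared.append(a)
    return shared


# Paths copied from the comment above.
i211 = "00:01.2/01:00.0/02:03.0/03:00.0"
realtek = "00:01.2/01:00.0/02:04.0/04:00.0"
print(shared_upstream(i211, realtek))  # ['00:01.2', '01:00.0']
```

Both devices therefore depend on the same 00:01.2 to 01:00.0 link, so changing its ASPM state affects both.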

            Exit latency        Acceptable latency
Tree:       L1       L0s        L1       L0s
----------  -------  ---------  -------  -------
00:01.2     <32 us   -
| 01:00.0   <32 us   -
|- 02:03.0  <32 us   -
| \03:00.0  <16 us   <2 us      <64 us   <512 ns
|
|- 02:04.0  <32 us   -
  \04:00.0  <64 us   unlimited  <64 us   <512 ns

04:00.0 has its own maximum acceptable latency. As we walk the path toward the root, the latency passes that limit at the first switch, which is currently not detected.
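
The walk described above can be sketched as follows. This is an illustration, not the kernel's code. It assumes that the L1 exit latency seen by an endpoint is the largest exit latency of any link crossed so far, plus 1 us for each switch on the way; that each "<N us" bound in the table is treated as N us; and that `use_path_max=False` models a per-link check, which is my reading of "currently not detected". All names are hypothetical.

```python
def links_exceeding_l1_budget(links, acceptable_us, use_path_max=True):
    """Return the links on which L1 would have to be disabled for one endpoint.

    links: (name, l1_exit_latency_us) pairs ordered from the endpoint's own
    link up toward the root port; each latency is the larger of the two
    ends of that link.
    """
    too_slow = []
    path_max_us = 0      # largest exit latency seen so far on the path
    switch_delay_us = 0  # assumed 1 us per switch already crossed
    for name, exit_us in links:
        path_max_us = max(path_max_us, exit_us)
        checked_us = path_max_us if use_path_max else exit_us
        if checked_us + switch_delay_us > acceptable_us:
            too_slow.append(name)
        switch_delay_us += 1
    return too_slow


# Values from the table above for the path to 04:00.0 (acceptable L1: 64 us).
path_to_04 = [("02:04.0-04:00.0", 64), ("00:01.2-01:00.0", 32)]
print(links_exceeding_l1_budget(path_to_04, 64, use_path_max=False))  # []
print(links_exceeding_l1_budget(path_to_04, 64))  # ['00:01.2-01:00.0']
```

With the path maximum carried along, the upstream link is checked as 64 us + 1 us against the 64 us limit and is flagged. The per-link variant only sees 32 us + 1 us there and flags nothing.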

For my system the unlimited L0s could also be a problem with bus stalls...
Comment 5 Bjorn Helgaas 2020-12-15 23:42:27 UTC
Ian's system:

  ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
  BIOS Version: 2206, Release Date: 08/13/2020

Apparently there is a newer BIOS available (Version 3003, 2020/12/07).

Ian was able to work around the I211 NIC performance issue with:

  # echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm

The I211 NIC is at 03:00.0 in his system, and the path to it is:

  00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0

The shell command above disables ASPM L1 on the link from 00:01.2 to 01:00.0.  The "l1_aspm" sysfs file was added in v5.5.
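
For reference, the sysfs path in that command follows the device hierarchy. The sketch below rebuilds it from the bridge chain; the helper name `l1_aspm_path` and its default domain and root bus values are assumptions for illustration. It only constructs the path and does not write to sysfs.

```python
from pathlib import PurePosixPath


def l1_aspm_path(bridge_chain, domain="0000", root_bus="00"):
    """Build the sysfs l1_aspm path for the last device in bridge_chain.

    bridge_chain: bus:dev.fn addresses from the root port down to the
    device whose upstream link is being controlled.
    """
    path = PurePosixPath("/sys/devices") / f"pci{domain}:{root_bus}"
    for bdf in bridge_chain:
        path = path / f"{domain}:{bdf}"
    return path / "link" / "l1_aspm"


print(l1_aspm_path(["00:01.2", "01:00.0"]))
# /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
```

Writing "0" to the resulting file as root, as in the shell command above, is what disables L1 on that link.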

More details in the thread at
https://lore.kernel.org/r/CAA85sZs8Li7+8BQWj0e+Qrxes1VF6K_Ukqrqgs1E3hHmaXqsbQ@mail.gmail.com
