Bug 209725

Summary: ASPM L1 Latency is calculated incorrectly
Product: Drivers
Reporter: Ian Kumlien (Ian.kumlien)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: bjorn
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.9.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: lspci without my patches
             lspci with my patches

Description Ian Kumlien 2020-10-18 11:32:10 UTC
I noticed that my desktop's network card was slow.

I was getting ~40 Mbit/s instead of the expected ~933 Mbit/s when connecting to machines on the internet (~6 hops away); locally all was fine.

After some investigation with help from Alexander Duyck, it was discovered that disabling L1 ASPM on the network card fixed it.

This caused me to dig around in the code, where I eventually discovered that the ASPM latency check was incorrect. Correcting the check has the side effect of fixing my issue.
Comment 1 Ian Kumlien 2020-10-18 11:32:48 UTC
Created attachment 293047 [details]
lspci without my patches
Comment 2 Ian Kumlien 2020-10-18 11:33:08 UTC
Created attachment 293049 [details]
lspci with my patches
Comment 3 Ian Kumlien 2020-10-18 11:34:45 UTC
With the fix, L1 ASPM is disabled on the 0000:01:00.0-0000:00:01.2 link when 04:00.0 is handled; both devices are connected through that link.
Comment 4 Ian Kumlien 2020-10-24 16:03:05 UTC
The PCIe paths for the devices are:
00:01.2/01:00.0/02:03.0/03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

and:
00:01.2/01:00.0/02:04.0/04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)

So they share 01:00.0 as a switch.

            Exit latency         Acceptable latency
Tree:       L1       L0s         L1       L0s
----------  -------  ---------   -------  -------
00:01.2     <32 us   -
| 01:00.0   <32 us   -
|- 02:03.0  <32 us   -
| \03:00.0  <16 us   <2 us       <64 us   <512 ns
|
|- 02:04.0  <32 us   -
  \04:00.0  <64 us   unlimited   <64 us   <512 ns

04:00.0 has its own maximum acceptable latency. As we walk the path towards the root, the accumulated latency passes that mark at the first switch, and this is currently not detected.

For my system the unlimited L0s could also be a problem with bus stalls...
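
To make the walk concrete, here is a rough shell sketch of the L1 check as I read it from the table above. It is not the actual patch. Assumptions: the "<N us" entries are taken as N us upper bounds, each link is charged the larger L1 exit latency of its two ends, the worst exit latency seen so far is carried along the path, and every switch already crossed adds 1 us. The function name check_l1_path and its "name:exit_ns" argument format are made up for illustration.

```shell
# check_l1_path ACCEPTABLE_NS LINK...
# Each LINK is "name:exit_ns", ordered from the endpoint towards the root.
# exit_ns is the larger L1 exit latency of the two ends of that link.
# Prints every link on which L1 would have to be disabled.
# (Hypothetical helper; the body runs in a subshell so no variables leak.)
check_l1_path() (
    acceptable=$1; shift
    max_exit=0        # worst L1 exit latency seen so far on the path
    switch_delay=0    # assumed 1 us (1000 ns) per switch already crossed
    for link in "$@"; do
        name=${link%:*}       # everything before the last ':'
        exit_ns=${link##*:}   # everything after the last ':'
        if [ "$exit_ns" -gt "$max_exit" ]; then
            max_exit=$exit_ns
        fi
        if [ $((max_exit + switch_delay)) -gt "$acceptable" ]; then
            echo "$name"
        fi
        switch_delay=$((switch_delay + 1000))
    done
)

# 04:00.0 accepts <64 us: 64 us + 1 us at the shared link is over the limit.
check_l1_path 64000 "02:04.0-04:00.0:64000" "00:01.2-01:00.0:32000"
# prints: 00:01.2-01:00.0

# 03:00.0 accepts <64 us: 32 us + 1 us stays under the limit.
check_l1_path 64000 "02:03.0-03:00.0:32000" "00:01.2-01:00.0:32000"
# prints nothing
```

If only each link's own exit latency is compared (32 us + 1 us for 00:01.2-01:00.0), the 04:00.0 path passes, which as far as I can tell is why the problem goes undetected today.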
Comment 5 Bjorn Helgaas 2020-12-15 23:42:27 UTC
Ian's system:

  ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
  BIOS Version: 2206, Release Date: 08/13/2020

Apparently there is a newer BIOS available (Version 3003, 2020/12/07).

Ian was able to work around the I211 NIC performance issue with:

  # echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm

The I211 NIC is at 03:00.0 in his system, and the path to it is:

  00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0

The shell command above disables ASPM L1 on the link from 00:01.2 to 01:00.0.  The "l1_aspm" sysfs file was added in v5.5.
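
For anyone repeating the workaround on a different link, a small sketch of building the attribute path and checking its current value. The helper l1_aspm_path is hypothetical, and it assumes PCI domain 0000 and root bus 00 as on Ian's system; the file should read 1 when L1 is enabled on the link and 0 when it is disabled.

```shell
# l1_aspm_path BDF...
# Builds the sysfs path of the l1_aspm attribute for the link that ends at
# the last device given. Devices are listed from the root port downwards.
# (Hypothetical helper; assumes domain 0000 and root bus 00.)
l1_aspm_path() (
    path=/sys/devices/pci0000:00
    for bdf in "$@"; do
        path=$path/0000:$bdf
    done
    echo "$path/link/l1_aspm"
)

attr=$(l1_aspm_path 00:01.2 01:00.0)
echo "$attr"

# Show the current setting if the attribute exists on this machine.
if [ -r "$attr" ]; then
    cat "$attr"
fi
```

Writing 1 to the same file as root should re-enable L1 on that link.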

More details in the thread at
https://lore.kernel.org/r/CAA85sZs8Li7+8BQWj0e+Qrxes1VF6K_Ukqrqgs1E3hHmaXqsbQ@mail.gmail.com