Bug 200527

Summary: Configurations with mismatched PCIe MPS settings
Product: Drivers Reporter: Myron Stowe (myron.stowe)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: okaya
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18 Subsystem:
Regression: No Bisected commit-id:
Attachments: Proposed patch for issue (will post to linux-pci mail list)

Description Myron Stowe 2018-07-17 18:48:11 UTC
Firmware typically configures the PCIe fabric with a consistent Max Payload Size (MPS) setting based on the devices present at boot.  A hot-added device typically has the power-on default MPS setting (128 bytes), which may not match the fabric.

In commit 27d868b5e6c ("PCI: Set MPS to match upstream bridge"), a new default setting was implemented - PCIE_BUS_DEFAULT - where we made sure every device's MPS setting matched its upstream bridge, making it more likely that hot-added devices would work in a system with an optimized MPS configuration.

Recently, I've encountered systems which exhibit the following -

  pci 0000:02:00.0: [144d:a822] type 00 class 0x010802
  pci 0000:02:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
  pci 0000:02:00.0: can't set Max Payload Size to 512; if necessary, use 
  "pci=pcie_bus_safe" and report a bug

Looking at the PCI hierarchy and capabilities of both the bridge (root port) and the endpoint device we see that a hot-added endpoint device's MPS Supported (MPSS) is not capable of matching its immediate upstream bridge's MPS setting (thus the 'dmesg' warning) -

  \-[0000:00]-+-00.0  AMD Family 17h (Models 00h-0fh) Root Complex
              +-01.0  AMD Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
              +-01.1-[01]--
              +-01.2-[02]----00.0  NVMe SSD Controller 172Xa
              +-01.3-[03]--

  00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) 
  PCIe GPP Bridge (prog-if 00 [Normal decode])
  [...]
  Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes

  02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD 
  Controller 172Xa (rev 01) (prog-if 02 [NVM Express])
  [...]
  Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 
                        unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ 
                        SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes

The kernel currently tries to set an MPS of 512 to the NVMe device which supports an MPSS of 256 and the operation fails, resulting in a default MPS of 128 being set on the endpoint.  With these settings in place, the upstream root port could send 512 byte packets (TLPs) to the endpoint, and the endpoint will consider them to be Malformed TLPs.

Booting with "pci=pcie_bus_safe", as suggested, does resolve the issue -

  pcieport 00:01.2: Max Payload Size set to  256/ 512 (was  512), Max Read Rq  512
  pci 02:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512

at the possible expense of other, non related, sub-topologies which could benefit from maintaining their larger MPS settings.
Comment 1 Myron Stowe 2018-07-18 18:14:47 UTC
Created attachment 277403 [details]
Proposed patch for issue (will post to linux-pci mail list)
Comment 2 Myron Stowe 2018-07-18 18:21:09 UTC
Seems like we can extend on the concept that commit 27d868b5e6c implemented in these situations, limiting to specific sub-topologies - root ports with downstream endpoint devices.  In such scenarios, we can obtain the largest MPSS value the endpoint supports and set both its root port and itself to that setting.

I've attached a patch that does such and with it applied to a 4.18.0-rc5 kernel, the afore mentioned scenario ends up with both the root port and endpoint devices getting set to a matching MPS value of 256 - 

  pciehp 0000:00:01.2:pcie004: Slot(162): Card present
  pciehp 0000:00:01.2:pcie004: Slot(162): Link Up
  pci 0000:02:00.0: [144d:a822] type 00 class 0x010802
  pci 0000:02:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
  pci 0000:02:00.0: Upstream bridge's Max Payload Size set to 256 (was 512, max 512)
  pci 0000:02:00.0: Max Payload Size set to 256 (was 128, max 256)
  iommu: Adding device 0000:02:00.0 to group 0
  pci 0000:02:00.0: BAR 0: assigned [mem 0xf7400000-0xf7403fff 64bit]
  pcieport 0000:00:01.2: PCI bridge to [bus 02]

I contemplated adding more error checking to the patch - checking the return value from the "pcie_set_mps(bridge, mpss)" call and possibly restoring the bridge's initial MPS setting if "rc = pcie_set_mps(dev, p_mps);" fails - but I could not imagine any situation where such would occur and decided against such.  If anyone can point out possible cases where errors could occur, or in topologies where things may fail with such an approach, please let me know.
Comment 3 Sinan Kaya 2018-07-18 18:24:53 UTC
Issue is there could be active traffic flowing from the other endpoints behind a switch while you are changing the root port's MPS. That's considered dangerous.

If you are planning to do something like this, you need to somehow stop all traffic.
Comment 4 Myron Stowe 2018-07-18 18:37:58 UTC
(In reply to Sinan Kaya from comment #3)
> Issue is there could be active traffic flowing from the other endpoints
> behind a switch while you are changing the root port's MPS. That's
> considered dangerous.
> 
> If you are planning to do something like this, you need to somehow stop all
> traffic.

Hay Sinan,

I'm intending to limit the case to boot and hot-plug of an endpoint device where the topology is limited to just a root port and the endpoint device.  As such, there should be no other traffic flowing through the root port in question.

I understand, and agree with, your concern but I don't intend for that type of topology to come into play with what I'm proposing - what am I missing?
Comment 5 Sinan Kaya 2018-07-18 18:45:25 UTC
Oh. I didn't realize you were looking for a reduced topology. I have no objections for such a use case, then.
Comment 6 Myron Stowe 2018-07-18 18:47:48 UTC
Sinan,

No problem at all, appreciate your "heads up" warning.  I'll post to the list and if you see anything suspicious then please reply there.
Comment 7 Myron Stowe 2018-08-13 16:30:46 UTC
From testing by Dongdong Liu -

I found a bug after applied the patch.

The topology is as below. The 82599 netcard with two functions connect to RP.
 +-[0000:80]-+-00.0-[81]--+-00.0  Device 8086:10fb
 |           |            \-00.1  Device 8086:10fb

1. lspci -s BDF -vvv  to get the value of device's MPSS , MPS and MRRS.
RP (80:00.0): MPSS=512 MPS=512 MRRS=512
EP PF0(81:00.0): MPSS=512 MPS=512 MRRS=512
   PF1(81:00.1): MPSS=512 MPS=512 MRRS=512

2. Enable SRIOV.
echo 1  > /sys/devices/pci0000\:80/0000\:80\:00.0/0000\:81\:00.0/sriov_numvfs
RP(80:00.0): MPSS=512 MPS=128 MRRS=512
                          ^^^
EP PF0(81:00.0): MPSS=512 MPS=512 MRRS=512
                              ^^^       
   PF1(81:00.1): MPSS=512 MPS=512 MRRS=512
                              ^^^       
   VF0(81:10.0): MPSS=128 MPS=128 MRRS=128
                              ^^^
The 82599 netcard PF (MPSS 512) and VF's MPSS (MPSS 128) are different.
Then RP (MPS 128) will report Malformed TLP when PF0/PF1 has memory write operation with MPS 512.

The 82599 netcard could work ok without the patch.
The values of MPSS, MPS, MRRS are as below without the patch.

RP(80:00.0): MPSS=512 MPS=512 MRRS=512
                          ^^^
EP PF0(81:00.0): MPSS=512 MPS=512 MRRS=512
                              ^^^       
   PF1(81:00.1): MPSS=512 MPS=512 MRRS=512
                              ^^^       
   VF0(81:10.0): MPSS=128 MPS=128 MRRS=128
                              ^^^
Comment 8 Myron Stowe 2019-04-03 17:07:19 UTC
  commit 9f0e89359775ee21fe1ea732e34edb52aef5addf
  Date:   Mon Aug 13 12:19:46 2018 -0600
    PCI: Match Root Port's MPS to endpoint's MPSS as necessary

  commit 3dbe97efe8bf450b183d6dee2305cbc032e6b8a4
  Date:   Mon Aug 13 12:19:39 2018 -0600
    PCI: Skip MPS logic for Virtual Functions (VFs)