Bug 203243

Summary: When using hpmemsize=nnM the MMIO can be assigned double space
Product: Drivers
Reporter: Nicholas Johnson (nicholas.johnson-opensource)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Unmodified kernel dmesg
Unmodified kernel "lspci -xxxx"
Patched kernel dmesg
Patched kernel "lspci -xxxx"

Description Nicholas Johnson 2019-04-10 00:26:58 UTC
Reproduction:
==============================================================================

You will need a PCI hotplug bridge, for example a Thunderbolt add-in 
card. For anyone working deep in the PCI parts of the kernel, I would 
recommend the Gigabyte GC-TITAN RIDGE Thunderbolt 3 add-in card. If you 
jump the correct two pins of the header, it stays awake and allows 
testing all kinds of things on an arbitrary system.
    ___
 __/   \__
|o o o o o| When looking into the receptacle on back of PCIe card.
|_________| Jump pins 3 and 5.

 1 2 3 4 5

The easiest way is to use the bundled cable and stick a paperclip in the 
other end of the cable to achieve the desired effect.

However, a proficient kernel developer can easily patch their kernel to 
treat one of the root ports in their system as a hotplug bridge, 
possibly in drivers/pci/probe.c if my memory serves me correctly. I have 
done it before but it has been a while.

if (dev->vendor == 0x???? && dev->device == 0x????)
	dev->is_hotplug_bridge = 1;

Most Intel root ports each have a unique dev->device ID for some reason. 
The root ports of my Z270 chipset are 8086:a290, 8086:a294, 8086:a298, 
etc.

Since most systems disable unused CPU and PCH PCIe ports in BIOS at boot 
time, you might want to pick one that is populated.

Boot with pci=hpmemsize=128M,realloc,nocrs pcie_ports=native

At least with some systems, you can observe that the MMIO (32-bit) has 
been assigned 256M whilst the MMIO_PREF window has been assigned 128M.
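
To check the window sizes without doing hex arithmetic by hand, a small 
helper like the following can be used. It is only a sketch: it assumes 
the "[mem 0xSTART-0xEND ...]" range notation that the kernel prints for 
bridge windows in dmesg, and the function name window_size_mib is my own 
invention, not anything from the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Return the size in MiB of a "[mem 0xSTART-0xEND ...]" range as printed
 * for bridge windows in dmesg, or 0 if the text cannot be parsed.
 * The end address is inclusive, hence the +1.
 * (Hypothetical helper for illustration; not part of the kernel.)
 */
static uint64_t window_size_mib(const char *text)
{
	unsigned long long start, end;

	if (sscanf(text, "[mem 0x%llx-0x%llx", &start, &end) != 2)
		return 0;
	if (end < start)
		return 0;

	return (uint64_t)(end - start + 1) >> 20;
}
```

Feeding it the MMIO and MMIO_PREF ranges of the hotplug bridge from 
dmesg should give 256 and 128 respectively on an affected system.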

The cause:
==============================================================================

File: drivers/pci/setup-bus.c
Function: find_free_bus_resource

find_free_bus_resource ignores resources that have already been 
assigned, when it should return them regardless of whether they have 
been assigned. The caller needs to know the difference between "nothing 
more is needed to be done" and "legitimately cannot find resource".

This function is called by pbus_size_io and pbus_size_mem. These 
functions use find_free_bus_resource to determine in which window (IO, 
MMIO, MMIO_PREF) to place a given resource. With pci=realloc, the kernel 
does multiple passes to re-attempt any failed assignments. In the 
__pci_bus_size_bridges function, if pbus_size_mem returns failure for 
the 64-bit MMIO_PREF window, the space those resources required is added 
to the 32-bit MMIO window instead.

Because find_free_bus_resource skips over resources that are already 
assigned, it returns NULL because the only suitable resource is already 
assigned. This causes pbus_size_mem to return -ENOSPC, when in fact, it 
should return 0 (success) because there is nothing more to do.

1st pass: MMIO_PREF = success, 128M
	  MMIO = fail (need to expand parent window)

2nd pass: MMIO_PREF = fail because it is successfully assigned
	  MMIO = tries to allocate 128M MMIO + 128M from above failure
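
The two passes above can be reproduced with a small user-space model. 
This is a deliberately simplified sketch, not the kernel code: struct 
window stands in for struct resource, the "assigned" flag stands in for 
a non-NULL parent, and all function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MiB(x) ((uint64_t)(x) << 20)

/* Simplified stand-in for struct resource. */
struct window {
	uint64_t size;
	int assigned;		/* models res->parent != NULL */
};

/* Models the unpatched lookup: already-assigned windows are skipped. */
static struct window *find_free_window(struct window *w)
{
	return w->assigned ? NULL : w;
}

/* Models the sizing step: 0 on success, -1 (think -ENOSPC) if no window. */
static int size_window(struct window *w, uint64_t request, uint64_t extra)
{
	struct window *res = find_free_window(w);

	if (!res)
		return -1;
	res->size = request + extra;
	return 0;
}

/* One pass: a failed MMIO_PREF request is added on top of MMIO. */
static void sizing_pass(struct window *pref, struct window *mmio,
			uint64_t hpmemsize)
{
	uint64_t extra = 0;

	if (size_window(pref, hpmemsize, 0) != 0)
		extra = hpmemsize;
	size_window(mmio, hpmemsize, extra);
}

/* Run both passes and report the final window sizes. */
static void simulate_two_passes(uint64_t hpmemsize,
				uint64_t *pref_size, uint64_t *mmio_size)
{
	struct window pref = { 0, 0 }, mmio = { 0, 0 };

	sizing_pass(&pref, &mmio, hpmemsize);
	pref.assigned = 1;	/* 1st pass: MMIO_PREF assignment succeeds */
				/* MMIO assignment fails, stays unassigned */
	sizing_pass(&pref, &mmio, hpmemsize);

	*pref_size = pref.size;
	*mmio_size = mmio.size;
}
```

Calling simulate_two_passes(MiB(128), ...) in this model leaves 
MMIO_PREF at 128M and MMIO at 256M, matching what is observed.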


Implications:
==============================================================================

If you modify the kernel so that the MMIO and MMIO_PREF sizes can be set 
independently (instead of both coming from pci=hpmemsize), it gets much 
worse. In the current circumstances, we get double the requested MMIO. 
But if we were to request 128M MMIO and 64G MMIO_PREF, the result would 
be:

MMIO_PREF = assigned 64G
MMIO = requested 128M + 64G = HARD FAIL (32-bit address space only 4G).
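
The arithmetic behind that hard fail, as a minimal sketch (the macro and 
function names are hypothetical, chosen only for this illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define MiB(x) ((uint64_t)(x) << 20)
#define GiB(x) ((uint64_t)(x) << 30)

/*
 * A 32-bit MMIO window can never be larger than the 4G address space it
 * lives in (in practice much less is available).
 */
static bool fits_in_32bit_space(uint64_t size)
{
	return size <= GiB(4);
}

/* Size the MMIO window ends up requesting after MMIO_PREF "fails". */
static uint64_t doubled_request(uint64_t mmio, uint64_t mmio_pref)
{
	return mmio + mmio_pref;
}
```

With 128M + 128M the inflated request (256M) still fits, which is why 
the bug only shows up as wasted space today. With 128M + 64G it cannot 
fit at all.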

Solution:
==============================================================================
Patch 3/4 in my patch series fixes this bug. It is found here:

https://lkml.org/lkml/2019/3/11/688

Apply all the patches in my series (or at least the first three) to 
solve the bug. Patch 3/4 can be applied without the preceding patches, 
but it will not apply cleanly - you will need to fix a hunk or two 
manually.

My patch changes find_free_bus_resource to not skip over resources with 
non-NULL parent (assigned). It adds checks in pbus_size_io and 
pbus_size_mem to check if the returned resource has a parent and if so, 
return 0 (success / nothing more to be done).
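
The effect of the change can be shown with a simplified user-space 
model. This is a sketch of the idea only and not the patch itself: 
struct window stands in for struct resource, the "assigned" flag stands 
in for a non-NULL parent, and all names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MiB(x) ((uint64_t)(x) << 20)
#define GiB(x) ((uint64_t)(x) << 30)

/* Simplified stand-in for struct resource. */
struct window {
	uint64_t size;
	int assigned;		/* models res->parent != NULL */
};

/* Patched lookup model: return the window whether or not it is assigned. */
static struct window *find_window(struct window *w)
{
	return w;
}

/* Patched sizing model: an already-assigned window means "nothing to do". */
static int size_window(struct window *w, uint64_t request, uint64_t extra)
{
	struct window *res = find_window(w);

	if (!res)
		return -1;	/* legitimately no window (think -ENOSPC) */
	if (res->assigned)
		return 0;	/* success, nothing more to be done */
	res->size = request + extra;
	return 0;
}

/* One pass: only a real MMIO_PREF failure spills over into MMIO. */
static void sizing_pass(struct window *pref, struct window *mmio,
			uint64_t mmio_req, uint64_t pref_req)
{
	uint64_t extra = 0;

	if (size_window(pref, pref_req, 0) != 0)
		extra = pref_req;
	size_window(mmio, mmio_req, extra);
}

/* Two passes; returns the final MMIO window size. */
static uint64_t patched_two_passes(uint64_t mmio_req, uint64_t pref_req)
{
	struct window pref = { 0, 0 }, mmio = { 0, 0 };

	sizing_pass(&pref, &mmio, mmio_req, pref_req);
	pref.assigned = 1;	/* 1st pass: MMIO_PREF assignment succeeds */
	sizing_pass(&pref, &mmio, mmio_req, pref_req);

	return mmio.size;
}
```

In this model the second pass no longer inflates MMIO: requesting 128M 
for both windows leaves MMIO at 128M, and so does requesting 128M MMIO 
with 64G MMIO_PREF.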

Evidence: 
============================================================================== 
Attached files show the difference between a mainline kernel and one 
manually patched with only patch 3/4 in my series.

Comment 1 Nicholas Johnson 2019-04-10 00:29:03 UTC
Created attachment 282253
Unmodified kernel dmesg

Comment 2 Nicholas Johnson 2019-04-10 00:29:35 UTC
Created attachment 282255
Unmodified kernel "lspci -xxxx"

Comment 3 Nicholas Johnson 2019-04-10 00:30:05 UTC
Created attachment 282257
Patched kernel dmesg

Comment 4 Nicholas Johnson 2019-04-10 00:30:46 UTC
Created attachment 282259
Patched kernel "lspci -xxxx"