Created attachment 250241 [details] lspci 4.9.0 pci=realloc

On a machine with complex PCI topology I am getting failures to assign BAR memory. This is incredibly machine specific so feel free to close. I attach a lspci -vvv and /var/log/messages.

Topology:

-[0000:00]-+-00.0
           +-01.0-[01-10]----00.0-[02-10]--+-08.0-[03-08]----00.0-[04-08]--+-00.0-[05]--+-00.0
           |                               |                               |            \-00.1
           |                               |                               +-09.0-[06]--
           |                               |                               +-10.0-[07]----00.0
           |                               |                               \-11.0-[08]--
           |                               +-09.0-[09]----00.0
           |                               +-10.0-[0a]----00.0
           |                               \-11.0-[0b-10]----00.0-[0c-10]--+-01.0-[0d]--
           |                                                               +-02.0-[0e]--
           |                                                               +-03.0-[0f]--
           |                                                               \-04.0-[10]--
           +-02.0
           +-14.0
           +-16.0
           +-16.3
           +-19.0
           +-1a.0
           +-1b.0
           +-1c.0-[11]--
           +-1c.4-[12]----00.0
           +-1c.6-[13]----00.0
           +-1c.7-[14-17]----00.0-[15-17]--+-01.0-[16]--
           |                               \-02.0-[17]----00.0
           +-1d.0
           +-1f.0
           +-1f.2
           \-1f.3
Created attachment 250251 [details] zipped messages log 4.9.0 pci=realloc
Created attachment 251231 [details] dmesg log (incomplete)

Hi Harry,

I'm attaching the last boot, which I extracted from your zipped messages log. It's actually not complete because it doesn't include the PCI enumeration information about individual devices. This would tell us about any initial resource assignments from the firmware. The output of the "dmesg" command or /var/log/dmesg should contain that.

This topology might be more complicated than some, but I do consider it a bug if complete assignment is theoretically possible but Linux doesn't do it.

I think it's also a bug if firmware gave us a working but incomplete assignment, and Linux makes it worse. It looks like that might be the case here: 05:00.0 probably had working BARs from firmware, but it looks like we broke it.

In your email you mentioned 05:00.0 differences (valid config space but disabled BARs vs. "unknown header type 7f"). That looks like possibly a different problem to be teased out separately. The "unknown header" case looks like possibly we're reading 0xff from its config space due to bridge misconfiguration or some other issue.

You also mentioned hotplug, which might be something we can look at separately.

The patches you use to get things working correctly would also be helpful. Even if they're hacky, they would give a clue as to what's going wrong.

As you say, it's pretty hard for us to debug issues in older kernels like CentOS 3.10.0, but if you can reproduce them on v4.9 as you did here, there's a chance we can make some progress.
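To make the "unknown header type 7f" point concrete, here is a minimal sketch of how an all-ones config read decodes. It assumes the standard layout of the Header Type register (bit 7 is the multi-function flag, bits 6:0 select the header layout); the function name decode_header_type and the KNOWN_LAYOUTS table are made up for illustration and are not taken from lspci or the kernel.

```python
# Sketch: why a config read of 0xff shows up as "unknown header type 7f".
# Assumption: standard Header Type register layout (config offset 0x0e),
# bit 7 = multi-function flag, bits 6:0 = header layout.

# Hypothetical lookup table of the three defined layouts.
KNOWN_LAYOUTS = {
    0x00: "endpoint",
    0x01: "PCI-to-PCI bridge",
    0x02: "CardBus bridge",
}


def decode_header_type(raw):
    """Split a raw Header Type byte into (layout, multifunction, name)."""
    layout = raw & 0x7f                 # strip the multi-function bit
    multifunction = bool(raw & 0x80)
    name = KNOWN_LAYOUTS.get(layout, "unknown header type %02x" % layout)
    return layout, multifunction, name


# A device that does not respond returns all ones, so the byte reads 0xff.
print(decode_header_type(0xff))
print(decode_header_type(0x01))
```

If every byte of config space reads back as 0xff, the masked layout is 0x7f, which matches none of the defined header types. That is consistent with the read never reaching the device, for example because an upstream bridge window or bus number is wrong.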
Hi Harry, if you have a chance to collect and attach the complete dmesg log, that would be great. It should include lines like this that are missing from the one you attached:

  pci 0000:00:00.0: [8086:xxxx] type 00 class ...
  pci 0000:00:01.0: [8086:xxxx] type 01 class ...
  pci 0000:00:02.0: [8086:xxxx] type 00 class ...
  pci 0000:00:02.0: reg 0x10: [mem 0xf6c00000-0xf6ffffff 64bit]
  pci 0000:00:02.0: reg 0x18: [mem 0xd0000000-0xdfffffff 64bit pref]
  pci 0000:00:02.0: reg 0x20: [io 0xf000-f03f]

This shows us what the BIOS initially assigned, so we can see where Linux went wrong.
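For anyone wanting to pull the firmware assignments out of a full log, a rough sketch along these lines may help. It only understands "reg 0x..: [mem ...]" and "[io ...]" lines in the shape quoted above; the names BAR_LINE and firmware_bars are hypothetical, and the pattern is an assumption about the log format rather than anything the kernel guarantees.

```python
import re

# Assumed line shape, e.g.:
#   pci 0000:00:02.0: reg 0x10: [mem 0xf6c00000-0xf6ffffff 64bit]
# The end address is accepted with or without a leading "0x".
BAR_LINE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"reg (?P<reg>0x[0-9a-f]+): "
    r"\[(?P<kind>mem|io)\s+(?P<start>0x[0-9a-f]+)-(?P<end>(?:0x)?[0-9a-f]+)"
    r"(?P<flags>[^\]]*)\]"
)


def firmware_bars(dmesg_text):
    """Return one dict per BAR line found in the given dmesg text."""
    bars = []
    for m in BAR_LINE.finditer(dmesg_text):
        start = int(m.group("start"), 16)
        end = int(m.group("end"), 16)
        bars.append({
            "dev": m.group("dev"),
            "reg": int(m.group("reg"), 16),   # config offset of the BAR
            "kind": m.group("kind"),          # "mem" or "io"
            "start": start,
            "size": end - start + 1,          # ranges are inclusive
            "flags": m.group("flags").split(),
        })
    return bars


sample = """\
pci 0000:00:02.0: [8086:xxxx] type 00 class ...
pci 0000:00:02.0: reg 0x10: [mem 0xf6c00000-0xf6ffffff 64bit]
pci 0000:00:02.0: reg 0x18: [mem 0xd0000000-0xdfffffff 64bit pref]
pci 0000:00:02.0: reg 0x20: [io 0xf000-f03f]
"""

for bar in firmware_bars(sample):
    print(bar["dev"], hex(bar["reg"]), bar["kind"],
          hex(bar["start"]), hex(bar["size"]), bar["flags"])
```

Running the same function over the dmesg from before and after the patches and comparing the results per device would show which BARs changed between the firmware assignment and what Linux ended up with.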
Created attachment 255401 [details] Full DMESG
I also attach a collection of our patches that fix this. The one with changes to setup-bus.c is VERY hardware specific. The one that changes pciehp_hpc.c will not apply to the tip of the kernel tree.
Created attachment 255403 [details] ZIP of kernel patches
Created attachment 255405 [details] DMESG after patches