We use "pci=use_crs" to support hot-add of a device directly below a PCI host bridge. In this case the BIOS doesn't configure the device, so Linux has to program its BARs. The host bridge is not a P2B bridge (most HP parisc and ia64 host bridges don't even appear in PCI config space), so we can't read its PCI config space to learn what addresses it routes downstream. The only way to discover its apertures is to rely on information from firmware or to write a host bridge-specific driver to read them from the device. Gary added the "pci=use_crs" option [1] for this purpose. Here is a sample "lspci -vv" from such a machine [2]. In Gary's situation, the BIOS did everything right and supplied all the information we need. The user should not have to specify "pci=use_crs" to get hotplug to work. Linux should be smart enough to figure this out on its own. We tried making "pci=use_crs" the default [3], but this caused trouble on some machines [4]. I want to investigate this more and figure out a better solution than requiring the "pci=use_crs" option. [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=62f420f8282 [2] https://bugzilla.redhat.com/attachment.cgi?id=327757 [3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9e9f46c44e487af0a82eb61b624553e2f7118f5b [4] http://lkml.org/lkml/2009/6/23/715
Created attachment 23106 [details] Output of dmesg
Created attachment 23107 [details] Output of lspci -vv
Created attachment 23108 [details] Output of cat /proc/iomem
Created attachment 23109 [details] Output of # cd /sys/devices/; grep . pnp*/*/{id,resources}
Created attachment 23144 [details] dmesg with pci=use_crs and pnp.debug
I think I can explain some of what's happening here. The host bridge resources we find with "pci=use_crs" are PRODUCER resources, i.e., windows of address space that are forwarded downstream. PNP doesn't know how to handle bridges, so it ignores these, which explains why they don't show up on the PNP devices.

pci_bus 0000:00: resource 0 io: [0x00-0xcf7]
pci_bus 0000:00: resource 1 io: [0xd00-0xffff]

The host bridge (pnp 00:01) CONSUMES the 0xcf8-0xcff ioport range for PCI config space access (this range IS shown in the PNP resources). The host bridge forwards everything else downstream to pci_bus 0000:00.

pci_bus 0000:00: resource 2 mem: [0x0a0000-0x0bffff]

Legacy VGA-related MMIO, routed to VGA device at 0000:00:12.0.

pci_bus 0000:00: resource 3 mem: [0x0c0000-0x0c3fff]
pci_bus 0000:00: resource 4 mem: [0x0c4000-0x0c7fff]
pci_bus 0000:00: resource 5 mem: [0x0c8000-0x0cbfff]
pci_bus 0000:00: resource 6 mem: [0x0cc000-0x0cffff]
pci_bus 0000:00: resource 7 mem: [0x0d4000-0x0d7fff]
pci_bus 0000:00: resource 8 mem: [0x0d8000-0x0dbfff]
pci_bus 0000:00: resource 9 mem: [0x0dc000-0x0dffff]
pci_bus 0000:00: resource 10 mem: [0x0e0000-0x0e3fff]
pci_bus 0000:00: resource 11 mem: [0x0e4000-0x0e7fff]
pci_bus 0000:00: resource 12 mem: [0x0e8000-0x0ebfff]
pci_bus 0000:00: resource 13 mem: [0x0ec000-0x0effff]
pci_bus 0000:00: resource 14 mem: [0x0f0000-0x0fffff]

I don't know about these. Some of this space is used for option ROM and shadowing, so maybe it's related to that. I guess ACPI is telling us that the bridge forwards this region downstream, but I don't think we'd want to put any downstream devices there.

pci_bus 0000:00: resource 15 mem: [0xc0000000-0xfebfffff]

Host bridge MMIO aperture from _CRS (this makes a lot more sense than the [0x000000-0xffffffffffffffff] aperture reported in the non-_CRS dmesg in comment #1).

pci_bus 0000:01: resource 1 mem: [0xfc100000-0xfc1fffff]

P2P bridge MMIO aperture (read from MEMORY_BASE in config space of bridge at 0000:00:08.0). Note that this is contained within the upstream host bridge MMIO aperture, as it should be.

pci_bus 0000:01: resource 3 io: [0x00-0xcf7]
pci_bus 0000:01: resource 4 io: [0xd00-0xffff]
pci_bus 0000:01: resource 5 mem: [0x0a0000-0x0bffff]
pci_bus 0000:01: resource 6 mem: [0x0c0000-0x0c3fff]
pci_bus 0000:01: resource 7 mem: [0x0c4000-0x0c7fff]
pci_bus 0000:01: resource 8 mem: [0x0c8000-0x0cbfff]
pci_bus 0000:01: resource 9 mem: [0x0cc000-0x0cffff]
pci_bus 0000:01: resource 10 mem: [0x0d4000-0x0d7fff]
pci_bus 0000:01: resource 11 mem: [0x0d8000-0x0dbfff]
pci_bus 0000:01: resource 12 mem: [0x0dc000-0x0dffff]
pci_bus 0000:01: resource 13 mem: [0x0e0000-0x0e3fff]
pci_bus 0000:01: resource 14 mem: [0x0e4000-0x0e7fff]
pci_bus 0000:01: resource 15 mem: [0x0e8000-0x0ebfff]

The bridge at 00:08.0 is a transparent (subtractive decode) bridge, so pci_read_bridge_bases() copied resources 3-15 (PCI_BUS_NUM_RESOURCES == 16) from the parent bus (0000:00). This makes a little bit of sense, because a subtractive decode bridge forwards any unclaimed transactions downstream, so in a sense, it can use any address space forwarded by the upstream bridge.

But this doesn't seem like a very clean way of handling a transparent bridge. For one thing, if we had to allocate space for a device below the bridge (on bus 0000:01), it's not enough to look at this list. In addition, we'd have to avoid space used by other devices on bus 0000:00, because space claimed by those devices will not be forwarded by the transparent bridge.
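To make the transparent-bridge part concrete, here is a rough sketch of what the pci_read_bridge_bases() path described above amounts to. It is a simplification written against the 2.6.3x-era struct pci_bus (which still had a fixed resource[PCI_BUS_NUM_RESOURCES] array of pointers), not a verbatim excerpt, and the helper name is made up.

#include <linux/pci.h>

/* Simplified illustration: for a transparent (subtractive decode)
 * bridge, the secondary bus does not get windows read from the
 * bridge's BASE/LIMIT registers; it simply inherits pointers to the
 * parent bus's resources.
 */
static void inherit_parent_resources(struct pci_bus *child)
{
	struct pci_bus *parent = child->parent;
	int i;

	if (!child->self->transparent)
		return;

	/* Resources 0-2 hold the bridge's own I/O, memory, and
	 * prefetchable windows; 3..PCI_BUS_NUM_RESOURCES-1 are filled
	 * from the parent bus, which is why bus 0000:01 above shows a
	 * copy of the 0000:00 windows.
	 */
	for (i = 3; i < PCI_BUS_NUM_RESOURCES; i++)
		child->resource[i] = parent->resource[i - 3];
}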
(In reply to comment #6)
...
> The bridge at 00:08.0 is a transparent (subtractive decode) bridge, so
> pci_read_bridge_bases() copied resources 3-15 (PCI_BUS_NUM_RESOURCES == 16)
> from the parent bus (0000:00). This makes a little bit of sense, because a
> subtractive decode bridge forwards any unclaimed transactions downstream, so
> in a sense, it can use any address space forwarded by the upstream bridge.
>
> But this doesn't seem like a very clean way of handling a transparent bridge.
> For one thing, if we had to allocate space for a device below the bridge (on
> bus 0000:01), it's not enough to look at this list. In addition, we'd have to
> avoid space used by other devices on bus 0000:00, because space claimed by
> those devices will not be forwarded by the transparent bridge.

Bjorn,

Doesn't the code already avoid the ranges used by other devices on a transparent bridge parent bus by reserving those ranges _before_ it reserves ranges for the devices on the transparent bridge child bus?
Maybe, but I'm dubious that the transparent bridge windows and its secondary bus resources are updated when siblings of the bridge are hotplugged. For example, if we add a new device next to the bridge, the new device's resources should be removed from the bridge and its secondary bus.
Yea, based on the headaches we have had with that code it is always good to be dubious about it. :) The code seems quite fragile and it has always seemed to me like the transparent bridge support was grafted on as kind of an afterthought. I wasn't trying to defend the code, just trying to tell you why I *think* it may work in the situation that you described. The hotplug slots on the systems I have been working on are located on secondary buses of siblingless transparent bridges so I have no way to actually test this assertion. BTW, it seems like this concern should also apply to a topology where a transparent bridge w/sibling(s) is located on the secondary bus of a non-transparent/non-root bridge where a default pci=use_crs would not be a factor. I believe pci_read_bridge_bases() would copy all resources from the transparent bridge primary to secondary bus with this kind of structure as well.
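For what it's worth, here is a minimal sketch of the overlap check implied by the concern above: before placing a hot-added device into a window that a transparent bridge inherited from its primary bus, you would have to verify that no sibling device on that bus already claims the range, since cycles positively decoded by a sibling never reach the subtractive decode bridge. The helper name and the open-coded start/end comparison are illustrative, not existing kernel API.

#include <linux/pci.h>
#include <linux/ioport.h>

/* Illustration only: does a candidate range for the transparent
 * bridge's secondary bus collide with a BAR of a sibling device on the
 * primary bus?
 */
static bool sibling_claims_range(struct pci_bus *parent,
				 struct pci_dev *bridge,
				 resource_size_t start, resource_size_t end)
{
	struct pci_dev *dev;
	int i;

	list_for_each_entry(dev, &parent->devices, bus_list) {
		if (dev == bridge)
			continue;	/* skip the transparent bridge itself */
		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
			struct resource *r = &dev->resource[i];

			if (!r->flags)
				continue;
			if (r->start <= end && start <= r->end)
				return true;	/* overlap: sibling decodes it */
		}
	}
	return false;
}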
Created attachment 23303 [details]
possible fix

Assuming that the non-response to http://lkml.org/lkml/2009/9/24/469 isn't an indication that the idea was stupid, here is a lightly tested patch that implements it. It at least provides some infrastructure that could possibly be used for the white/blacklisting that was being discussed.
Created attachment 23304 [details] possible fix, w/checkpatch.pl cleanups
I'm hesitant about this patch because I'd rather have a more general solution. We certainly have issues with the current handling of PCI host bridge apertures from _CRS, but I'm pretty sure Windows does use that aperture information, so I'm still hopeful that we can fix the Linux implementation. For example, in bug 14337, Windows shows the correct host bridge apertures. It's using pci.sys, and I don't see any chipset-specific driver, so I think the aperture information must be from _CRS. This whole area (e.g., the limit we impose on the number of root bus resources, the hacky handling of transparent bridges, amd_bus.c, the recent intel_bus.c stuff, the duplication of ia64/x86 _CRS parsing, etc.) just seems like a pile of kludges for special cases, and I hate to add more. I'd rather get some of the existing things working correctly than add workarounds for broken Linux code.
I also think that Windows on our systems uses root bridge _CRS information, at least for guidance during PCI hot-add. It is interesting that your bug 14337 investigation seems to indicate that Windows is either intentionally or unintentionally working around one case where _CRS may be returning bad data. This makes me wonder what else Windows might be doing to compensate for bad _CRS data on other systems. This seems to support my current feeling that the default use of root bridge _CRS needs to be initially limited to a small subset of systems to make it easier to deal with the problems that do turn up.

I agree that avoiding the use of _CRS that returns too many entries for the fixed-size resource array is a (hopefully temporary) workaround for broken Linux code. However, initially limiting root bridge _CRS exposure to a small subset of systems that includes those where it is critical for PCI hotplug does not seem like a workaround for broken Linux code to me. It just seems like a good way to introduce the function gradually while reducing the risk that it will get evicted from mainline again.
I see your point, and if this were an urgent bug fix and couldn't wait for a better solution, I'd probably agree. But it seems to me that it's more of an enhancement, and including the workaround makes it less likely that we'll ever get around to fixing the underlying Linux problems.
Not entirely sure this fits here, but I have a Toshiba laptop which had a weird problem where X didn't work (could only use VESA mode, not any of the Radeon drivers) if you had 4GB of memory in the machine (X works fine if you only have 2GB). Someone else finally tracked it down and it turns out that if we add pci=use_crs to the boot options it fixes the problem. I'd like to help get a fix for this so that people don't need the kernel option. Do you guys think this is related and I should try these patches?
Gary and Peter, can you attach dmidecode output for your machines that need "pci=use_crs"? I have some patches to try turning this on again, and I think I'll use a "BIOS date >= 2010 || whitelist" approach, so I'll add your machines to the whitelist to start.
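For illustration, here is a rough sketch of what a "BIOS date >= 2010 || whitelist" check could look like using the kernel's DMI helpers. The table entries, the helper names, and the pci_use_crs flag are assumptions made up for this sketch; this is not the patch set being discussed, and the command-line override is only mentioned in a comment.

#include <linux/dmi.h>
#include <linux/init.h>

static bool pci_use_crs;	/* hypothetical flag consumed by the PCI core */

static int __init set_use_crs(const struct dmi_system_id *id)
{
	pci_use_crs = true;
	return 0;
}

/* Machines known to need the host bridge _CRS windows, e.g. for
 * hot-add directly below the host bridge.  The match strings here are
 * guesses for illustration.
 */
static const struct dmi_system_id crs_whitelist[] __initconst = {
	{
		.callback = set_use_crs,
		.ident = "IBM System x3800",
		.matches = {
			DMI_MATCH(DMI_SYS_VENDOR, "IBM"),
			DMI_MATCH(DMI_PRODUCT_NAME, "x3800"),
		},
	},
	/* similar entries for the x3850, x3950, Toshiba Satellite, ... */
	{}
};

static void __init pci_crs_quirks(void)
{
	int year;

	/* Default to _CRS on machines with a sufficiently new BIOS. */
	if (dmi_get_date(DMI_BIOS_DATE, &year, NULL, NULL) && year >= 2010)
		pci_use_crs = true;

	/* Force it on for known machines that need it regardless of date. */
	dmi_check_system(crs_whitelist);

	/* A "pci=nocrs" / "pci=use_crs" command-line option would override
	 * both of the above (not shown here). */
}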
Created attachment 24853 [details]
dmidecode output from Toshiba Satellite A355-S6935

Here's the dmidecode output for my Toshiba Satellite A355-S6935. This is with the machine booted with pci=use_crs. I can compile kernels if you need me to test a patch.
Bjorn,

Can you just use the information that I put in can_skip_pciprobe_dmi_table[] in arch/x86/pci/common.c to whitelist something else unique to the same systems? If not, let me know and I will get you the dmidecode outputs. Ditto for me on Peter's patch testing offer.

I hope you also intend to include something like pci=no_crs that can be used to turn off the use of _CRS on the whitelisted boxes - just in case :)
Gary, are you saying that the can_skip_pciprobe_dmi_table[] lists the machines for which we should automatically use _CRS? The comments there talk about I/O resource alignment. But I'd be happy to use that information if those systems also need _CRS. Do you know the BIOS dates for those systems? Peter's is 2009, and if all of yours are 2009, too, maybe we should consider enabling on "BIOS date >= 2009" rather than 2010? It feels riskier in one sense, but on the other hand, there might be other systems like Peter's that would get fixed. Thanks for the testing offer. I'm definitely going to include a "pci=nocrs" option :-) My current patches break something in CardBus, but as soon as I track that down, I'll post them and you can try them out.
(In reply to comment #19)
> Gary, are you saying that the can_skip_pciprobe_dmi_table[] lists the machines
> for which we should automatically use _CRS?

Correct. All of our systems where an x3800, x3850, or x3950 product name appears in the dmidecode output need _CRS data for PCI hotplug to work correctly.

> The comments there talk about I/O resource alignment.

Yes, we also needed relief from the legacy I/O resource alignment constraints on the same systems. The header for the patch that added this included the following, which, as I read and think about it, makes me wonder if this could be needed on other large non-IBM systems when _CRS is enabled.

    To conserve limited PCI i/o resource on some IBM multi-node systems, the
    BIOS allocates (via _CRS) and expects the kernel to use addresses in
    ranges currently excluded by pcibios_align_resource() [i386/pci/i386.c].
    This change allows the kernel to use the currently excluded address
    ranges on the IBM x3800, x3850, and x3950.

> But I'd be happy to use that information if those systems
> also need _CRS.

Thanks! This will definitely be helpful to me.

> Do you know the BIOS dates for those systems? Peter's is 2009, and if all of
> yours are 2009, too, maybe we should consider enabling on "BIOS date >= 2009"
> rather than 2010? It feels riskier in one sense, but on the other hand, there
> might be other systems like Peter's that would get fixed.

It looks like we may need "BIOS date >= 2008" for some of the x3850 and x3950 systems, and possibly even an earlier year for the older x3800, which I'm not sure we still have in our lab.

> Thanks for the testing offer.

No problem.

> I'm definitely going to include a "pci=nocrs" option :-)

You don't seem like the kind of guy that would leave something important like that out, but I thought I'd better ask anyway.

> My current patches break something in CardBus, but as soon as I
> track that down, I'll post them and you can try them out.

OK, I will watch for them.
I've tried the patches and they seem to work. I'm getting this weird effect of flashing blue horizontal lines all over the screen, but I'm fairly certain that's related to some other problem with Radeon drivers in the 2.6.33 kernel (I've found other complaints of that).
Created attachment 24914 [details]
Output from dmesg command after patches applied

This is after applying the 7 patches to Jesse's linux-next tree; patch #2 was already applied. With the patches, the kernel booted with no problems.
Created attachment 24915 [details]
patched kernel acpiphp load problem

Bjorn,

Finally getting to your patches but ran into the attached problem loading 'acpiphp'. Probably unrelated to your patches but I haven't backed them out yet to make sure. Also a good chance for pilot error.
After reverting all 7 patches, acpiphp load still fails.
I think this issue is resolved as of 2.6.37, and I'd like to close this issue. Any objections?
(In reply to comment #25)
> I think this issue is resolved as of 2.6.37, and I'd like to close this
> issue. Any objections?

As the last complainer, there are no objections from me. Any issues I was aware of were resolved. Thanks for your nice work on this.
Resolved in 2.6.37.