Bug 14183

Summary: "pci=use_crs" required for resource allocation of hot-added devices
Product: Drivers
Reporter: Bjorn Helgaas (bjorn.helgaas)
Component: PCI
Assignee: Bjorn Helgaas (bjorn.helgaas)
Status: RESOLVED CODE_FIX
Severity: normal
CC: bjorn, garyhade, Larry.Finger, lcm, peterh, sreenivasa-reddy.berahalli
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Output of dmesg
Output of lspci -vv
Output of cat /proc/iomem
Output of # cd /sys/devices/; grep . pnp*/*/{id,resources}
dmesg with pci=use_crs and pnp.debug
possible fix
possible fix, w/checkpatch.pl cleanups
dmidecode output from Toshiba Satellite A355-S6935
Output from dmesg command after patches applied
patched kernel acpiphp load problem

Description Bjorn Helgaas 2009-09-16 23:07:42 UTC
We use "pci=use_crs" to support hot-add of a device directly below
a PCI host bridge.  In this case the BIOS doesn't configure the
device, so Linux has to program its BARs.  The host bridge is not
a P2P bridge (most HP parisc and ia64 host bridges don't even
appear in PCI config space), so we can't read its PCI config space
to learn what addresses it routes downstream.  The only way to
discover its apertures is to rely on information from firmware
or to write a host bridge-specific driver to read them from the
device.

Gary added the "pci=use_crs" option [1] for this purpose.  Here
is a sample "lspci -vv" from such a machine [2].

In Gary's situation, the BIOS did everything right and supplied all
the information we need.  The user should not have to specify
"pci=use_crs" to get hotplug to work.  Linux should be smart
enough to figure this out on its own.

We tried making "pci=use_crs" the default [3], but this caused
trouble on some machines [4].

I want to investigate this more and figure out a better solution
than requiring the "pci=use_crs" option.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=62f420f8282
[2] https://bugzilla.redhat.com/attachment.cgi?id=327757
[3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9e9f46c44e487af0a82eb61b624553e2f7118f5b
[4] http://lkml.org/lkml/2009/6/23/715
Comment 1 Larry Finger 2009-09-17 00:26:57 UTC
Created attachment 23106 [details]
Output of dmesg
Comment 2 Larry Finger 2009-09-17 00:27:52 UTC
Created attachment 23107 [details]
Output of lspci -vv
Comment 3 Larry Finger 2009-09-17 00:29:43 UTC
Created attachment 23108 [details]
Output of cat /proc/iomem
Comment 4 Larry Finger 2009-09-17 00:30:59 UTC
Created attachment 23109 [details]
Output of # cd /sys/devices/; grep . pnp*/*/{id,resources}
Comment 5 Larry Finger 2009-09-22 23:32:52 UTC
Created attachment 23144 [details]
dmesg with pci=use_crs and pnp.debug
Comment 6 Bjorn Helgaas 2009-09-25 20:36:38 UTC
I think I can explain some of what's happening here.

The host bridge resources we find with "pci=use_crs" are PRODUCER resources, i.e., windows of address space that are forwarded downstream.  PNP doesn't know how to handle bridges, so it ignores these, which explains why they don't show up on the PNP devices.

  pci_bus 0000:00: resource 0 io:  [0x00-0xcf7]
  pci_bus 0000:00: resource 1 io:  [0xd00-0xffff]

The host bridge (pnp 00:01) CONSUMES the 0xcf8-0xcff ioport range for PCI config space access (this range IS shown in the PNP resources).  The host bridge forwards everything else downstream to pci_bus 0000:00.
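
To make the producer/consumer split concrete, here is a minimal standalone sketch (hypothetical types, not the kernel's ACPI structures) of how a _CRS walker sorts the descriptors: consumer ranges stay with the bridge device, producer ranges become windows on the root bus.

  /* Illustration only: hypothetical types, not the kernel's ACPI structs. */
  #include <stdbool.h>
  #include <stdio.h>

  struct crs_entry {
      unsigned long start, end;
      bool io;        /* I/O port range vs. MMIO */
      bool producer;  /* ResourceProducer: a window forwarded downstream */
  };

  static void classify_crs(const struct crs_entry *e, int n)
  {
      int i;

      for (i = 0; i < n; i++) {
          if (e[i].producer)
              /* e.g. [io 0x0d00-0xffff]: becomes a root bus window */
              printf("pci_bus window:  [%s 0x%lx-0x%lx]\n",
                     e[i].io ? "io " : "mem", e[i].start, e[i].end);
          else
              /* e.g. [io 0x0cf8-0x0cff]: consumed by the bridge itself
               * for config access, so it shows up under the PNP device */
              printf("bridge resource: [%s 0x%lx-0x%lx]\n",
                     e[i].io ? "io " : "mem", e[i].start, e[i].end);
      }
  }

  int main(void)
  {
      const struct crs_entry crs[] = {
          { 0x0000,     0x0cf7,     true,  true  },
          { 0x0cf8,     0x0cff,     true,  false },
          { 0x0d00,     0xffff,     true,  true  },
          { 0xc0000000, 0xfebfffff, false, true  },
      };

      classify_crs(crs, 4);
      return 0;
  }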

  pci_bus 0000:00: resource 2 mem: [0x0a0000-0x0bffff]

Legacy VGA-related MMIO, routed to VGA device at 0000:00:12.0.

  pci_bus 0000:00: resource 3 mem: [0x0c0000-0x0c3fff]
  pci_bus 0000:00: resource 4 mem: [0x0c4000-0x0c7fff]
  pci_bus 0000:00: resource 5 mem: [0x0c8000-0x0cbfff]
  pci_bus 0000:00: resource 6 mem: [0x0cc000-0x0cffff]
  pci_bus 0000:00: resource 7 mem: [0x0d4000-0x0d7fff]
  pci_bus 0000:00: resource 8 mem: [0x0d8000-0x0dbfff]
  pci_bus 0000:00: resource 9 mem: [0x0dc000-0x0dffff]
  pci_bus 0000:00: resource 10 mem: [0x0e0000-0x0e3fff]
  pci_bus 0000:00: resource 11 mem: [0x0e4000-0x0e7fff]
  pci_bus 0000:00: resource 12 mem: [0x0e8000-0x0ebfff]
  pci_bus 0000:00: resource 13 mem: [0x0ec000-0x0effff]
  pci_bus 0000:00: resource 14 mem: [0x0f0000-0x0fffff]

I don't know about these.  Some of this space is used for option ROM and shadowing, so maybe it's related to that.  I guess ACPI is telling us that the bridge forwards this region downstream, but I don't think we'd want to put any downstream devices there.

  pci_bus 0000:00: resource 15 mem: [0xc0000000-0xfebfffff]

Host bridge MMIO aperture from _CRS (this makes a lot more sense than the [0x000000-0xffffffffffffffff] aperture reported in the non-_CRS dmesg in comment #1).

  pci_bus 0000:01: resource 1 mem: [0xfc100000-0xfc1fffff]

P2P bridge MMIO aperture (read from MEMORY_BASE in config space of bridge at 0000:00:08.0).  Note that this is contained within the upstream host bridge MMIO aperture, as it should be.
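
For contrast with the _CRS-derived windows above, here is a minimal standalone sketch (illustration, not the kernel's probe code) of how that window is decoded from the raw MEMORY_BASE/MEMORY_LIMIT registers; the kernel derives this when it reads the bridge bases.

  /* Decode a P2P bridge's non-prefetchable memory window from the raw
   * 16-bit MEMORY_BASE and MEMORY_LIMIT config registers (offsets
   * 0x20/0x22).  Standalone sketch, not the kernel source.            */
  #include <stdint.h>
  #include <stdio.h>

  static void decode_bridge_mmio(uint16_t mem_base, uint16_t mem_limit)
  {
      /* Bits 15:4 of each register hold address bits 31:20. */
      uint32_t base  = (uint32_t)(mem_base  & ~0x0fu) << 16;
      uint32_t limit = ((uint32_t)(mem_limit & ~0x0fu) << 16) | 0xfffff;

      if (base <= limit)
          printf("bridge window: mem [0x%08x-0x%08x]\n", base, limit);
      else
          printf("bridge window disabled\n");
  }

  int main(void)
  {
      /* Register values matching the 0000:00:08.0 example above:
       * should print [0xfc100000-0xfc1fffff].                    */
      decode_bridge_mmio(0xfc10, 0xfc1f);
      return 0;
  }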

  pci_bus 0000:01: resource 3 io:  [0x00-0xcf7]
  pci_bus 0000:01: resource 4 io:  [0xd00-0xffff]
  pci_bus 0000:01: resource 5 mem: [0x0a0000-0x0bffff]
  pci_bus 0000:01: resource 6 mem: [0x0c0000-0x0c3fff]
  pci_bus 0000:01: resource 7 mem: [0x0c4000-0x0c7fff]
  pci_bus 0000:01: resource 8 mem: [0x0c8000-0x0cbfff]
  pci_bus 0000:01: resource 9 mem: [0x0cc000-0x0cffff]
  pci_bus 0000:01: resource 10 mem: [0x0d4000-0x0d7fff]
  pci_bus 0000:01: resource 11 mem: [0x0d8000-0x0dbfff]
  pci_bus 0000:01: resource 12 mem: [0x0dc000-0x0dffff]
  pci_bus 0000:01: resource 13 mem: [0x0e0000-0x0e3fff]
  pci_bus 0000:01: resource 14 mem: [0x0e4000-0x0e7fff]
  pci_bus 0000:01: resource 15 mem: [0x0e8000-0x0ebfff]

The bridge at 00:08.0 is a transparent (subtractive decode) bridge, so pci_read_bridge_bases() copied resources 3-15 (PCI_BUS_NUM_RESOURCES == 16) from the parent bus (0000:00).  This makes a little bit of sense, because a subtractive decode bridge forwards any unclaimed transactions downstream, so in a sense, it can use any address space forwarded by the upstream bridge.

But this doesn't seem like a very clean way of handling a transparent bridge.  For one thing, if we had to allocate space for a device below the bridge (on bus 0000:01), it's not enough to look at this list.  In addition, we'd have to avoid space used by other devices on bus 0000:00, because space claimed by those devices will not be forwarded by the transparent bridge.
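
For reference, the transparent-bridge handling being questioned here boils down to something like this (simplified sketch of the 2.6.31-era pci_read_bridge_bases() logic, not a verbatim copy):

  #include <linux/pci.h>

  /* Simplified sketch: a transparent (subtractive-decode) bridge has no
   * base/limit windows of its own, so the child bus just inherits every
   * parent bus resource into slots 3..PCI_BUS_NUM_RESOURCES-1.          */
  static void sketch_read_transparent_bridge_bases(struct pci_bus *child)
  {
      struct pci_dev *bridge = child->self;
      int i;

      if (!bridge->transparent)
          return;

      for (i = 3; i < PCI_BUS_NUM_RESOURCES; i++)
          child->resource[i] = child->parent->resource[i - 3];
  }

Nothing in that copy subtracts the space already claimed by siblings on bus 0000:00, which is exactly the gap described above.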
Comment 7 Gary Hade 2009-09-29 23:47:03 UTC
(In reply to comment #6)
...
> The bridge at 00:08.0 is a transparent (subtractive decode) bridge, so
> pci_read_bridge_bases() copied resources 3-15 (PCI_BUS_NUM_RESOURCES == 16)
> from the parent bus (0000:00).  This makes a little bit of sense, because a
> subtractive decode bridge forwards any unclaimed transactions downstream,
> so in a sense, it can use any address space forwarded by the upstream bridge.
> 
> But this doesn't seem like a very clean way of handling a transparent bridge.
> For one thing, if we had to allocate space for a device below the bridge (on
> bus 0000:01), it's not enough to look at this list.  In addition, we'd have
> to avoid space used by other devices on bus 0000:00, because space claimed
> by those devices will not be forwarded by the transparent bridge.

Bjorn, Doesn't the code already avoid the ranges used by other
devices on a transparent bridge parent bus by reserving those
ranges _before_ it reserves ranges for the devices on the
transparent bridge child bus?
Comment 8 Bjorn Helgaas 2009-10-01 17:07:20 UTC
Maybe, but I'm dubious that the transparent bridge windows and its secondary bus resources are updated when siblings of the bridge are hotplugged.  For example, if we add a new device next to the bridge, the new device's resources should be removed from the bridge and its secondary bus.
Comment 9 Gary Hade 2009-10-01 20:43:49 UTC
Yea, based on the headaches we have had with that code it is always
good to be dubious about it. :)  The code seems quite fragile and it
has always seemed to me like the transparent bridge support was grafted
on as kind of an afterthought.  I wasn't trying to defend the code,
just trying to tell you why I *think* it may work in the situation that
you described.  The hotplug slots on the systems I have been working on
are located on secondary buses of siblingless transparent bridges so
I have no way to actually test this assertion.

BTW, it seems like this concern should also apply to a topology where 
a transparent bridge w/sibling(s) is located on the secondary bus
of a non-transparent/non-root bridge where a default pci=use_crs would
not be a factor.  I believe pci_read_bridge_bases() would copy all
resources from the transparent bridge primary to secondary bus 
with this kind of structure as well.
Comment 10 Gary Hade 2009-10-07 20:05:58 UTC
Created attachment 23303 [details]
possible fix

Assuming that the non-response to http://lkml.org/lkml/2009/9/24/469
isn't an indication that the idea was stupid, here is a lightly tested
patch that implements it.  It at least provides some infrastructure
that could possibly be used for the white/blacklisting that was being
discussed.
Comment 11 Gary Hade 2009-10-07 20:53:11 UTC
Created attachment 23304 [details]
possible fix, w/checkpatch.pl cleanups
Comment 12 Bjorn Helgaas 2009-10-08 15:58:43 UTC
I'm hesitant about this patch because I'd rather have a more general solution.

We certainly have issues with the current handling of PCI host bridge apertures from _CRS, but I'm pretty sure Windows does use that aperture information, so I'm still hopeful that we can fix the Linux implementation.  For example, in bug 14337, Windows shows the correct host bridge apertures.  It's using pci.sys, and I don't see any chipset-specific driver, so I think the aperture information must be from _CRS.

This whole area (e.g., the limit we impose on the number of root bus resources, the hacky handling of transparent bridges, amd_bus.c, the recent intel_bus.c stuff, the duplication of ia64/x86 _CRS parsing, etc.) just seems like a pile of kludges for special cases, and I hate to add more.  I'd rather get some of the existing things working correctly than add workarounds for broken Linux code.
Comment 13 Gary Hade 2009-10-08 19:52:50 UTC
I also think that Windows on our systems uses root bridge _CRS
information, at least for guidance during PCI hot-add.  It is
interesting that your bug 14337 investigation seems to indicate
that Windows is either intentionally or unintentionally working
around one case where _CRS may be returning bad data.  This makes
me wonder what else Windows might be doing to compensate for
bad _CRS returned data on other systems.  This seems to support
my current feeling that the default use of root bridge _CRS needs
to be initially limited to a small subset of systems to make it
easier to deal with the problems that do turn up.

I agree that avoiding the use of _CRS that returns too many
entries for the fixed size resource array is a (hopefully
temporary) workaround for broken Linux code.  However, initially
limiting root bridge _CRS exposure to a small subset of
systems that includes those where it is critical for PCI hotplug
does not seem like a workaround for broken Linux code to me.  It
just seems like a good way to introduce the function gradually
while reducing the risk that it will get evicted from mainline again.
Comment 14 Bjorn Helgaas 2009-10-08 22:03:36 UTC
I see your point, and if this were an urgent bug fix and couldn't wait for a better solution, I'd probably agree.  But it seems to me that it's more of an enhancement, and including the workaround makes it less likely that we'll ever get around to fixing the underlying Linux problems.
Comment 15 Peter Haight 2010-01-26 04:26:28 UTC
Not entirely sure this fits here, but I have a Toshiba laptop with a weird problem: X didn't work (it could only use VESA mode, not any of the Radeon drivers) when the machine had 4GB of memory, but worked fine with only 2GB. Someone else finally tracked it down, and it turns out that adding pci=use_crs to the boot options fixes the problem. I'd like to help get a fix for this so that people don't need the kernel option. Do you think this is related, and should I try these patches?
Comment 16 Bjorn Helgaas 2010-02-01 18:17:44 UTC
Gary and Peter, can you attach dmidecode output for your machines that need
"pci=use_crs"?  I have some patches to try turning this on again, and I think I'll use a "BIOS date >= 2010 || whitelist" approach, so I'll add your machines to the whitelist to start.
Comment 17 Peter Haight 2010-02-01 19:39:52 UTC
Created attachment 24853 [details]
dmidecode output from Toshiba Satellite A355-S6935

Here's the dmidecode output for my Toshiba Satellite A355-S6935. This is with the machine booted with pci=use_crs.

I can compile kernels if you need me to test a patch.
Comment 18 Gary Hade 2010-02-01 20:31:26 UTC
Bjorn, Can you just use the information that I put in
can_skip_pciprobe_dmi_table[] in arch/x86/pci/common.c to whitelist
something else unique to the same systems?  If not, let me know
and I will get you the dmidecode outputs.

Ditto for me on Peter's patch testing offer.

I hope you also intend to include something like pci=no_crs
that can be used to turn off the use of _CRS on the whitelisted
boxes - just in case :)
Comment 19 Bjorn Helgaas 2010-02-01 21:15:37 UTC
Gary, are you saying that the can_skip_pciprobe_dmi_table[] lists the machines for which we should automatically use _CRS?  The comments there talk about I/O resource alignment.  But I'd be happy to use that information if those systems also need _CRS.

Do you know the BIOS dates for those systems?  Peter's is 2009, and if all of yours are 2009, too, maybe we should consider enabling on "BIOS date >= 2009" rather than 2010?  It feels riskier in one sense, but on the other hand, there might be other systems like Peter's that would get fixed.

Thanks for the testing offer.  I'm definitely going to include a "pci=nocrs" option :-)  My current patches break something in CardBus, but as soon as I track that down, I'll post them and you can try them out.
Comment 20 Gary Hade 2010-02-01 23:01:03 UTC
(In reply to comment #19)
> Gary, are you saying that the can_skip_pciprobe_dmi_table[] lists the
> machines for which we should automatically use _CRS?

Correct.  All of our systems where an x3800, x3850, or x3950
product name appears in the dmidecode output need _CRS data
for PCI hotplug to work correctly.

> The comments there talk about I/O resource alignment.

Yes, we also needed relief from the legacy I/O
resource alignment constraints on the same systems.
The header for the patch that added this included the
following, which, as I read and think about it, makes me
wonder whether this could be needed on other large non-IBM
systems when _CRS is enabled.

To conserve limited PCI i/o resource on some IBM multi-node
systems, the BIOS allocates (via _CRS) and expects the kernel
to use addresses in ranges currently excluded by
pcibios_align_resource() [i386/pci/i386.c].  This change
allows the kernel to use the currently excluded address
ranges on the IBM x3800, x3850, and x3950.
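
For context, the legacy constraint being relaxed is the old i386 I/O alignment rule, which keeps allocations out of the ISA-alias region of each 1K block; roughly (simplified standalone sketch, not the exact pcibios_align_resource() source):

  #include <stdint.h>
  #include <stdio.h>

  /* Legacy i386 rule (sketch): if an I/O allocation would start in the
   * 0x100-0x3ff portion of a 1K block (addresses aliased by old ISA
   * cards), push it up to the next 1K boundary.  The IBM quirk quoted
   * above lets _CRS-assigned addresses in that region be used anyway.  */
  static uint32_t legacy_io_align(uint32_t start)
  {
      if (start & 0x300)
          start = (start + 0x3ff) & ~0x3ffu;
      return start;
  }

  int main(void)
  {
      printf("0x%x -> 0x%x\n", 0x1120u, legacy_io_align(0x1120));  /* 0x1400 */
      printf("0x%x -> 0x%x\n", 0x1040u, legacy_io_align(0x1040));  /* unchanged */
      return 0;
  }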

> But I'd be happy to use that information if those systems
> also need _CRS.

Thanks!  This will definitely be helpful to me.

> 
> Do you know the BIOS dates for those systems?  Peter's is 2009, and if all of
> yours are 2009, too, maybe we should consider enabling on "BIOS date >= 2009"
> rather than 2010?  It feels riskier in one sense, but on the other hand,
> there might be other systems like Peter's that would get fixed.

It looks like we may need "BIOS date >= 2008" for some
of the x3850 and x3950 systems and possibly even an earlier
year for the older x3800, which I'm not sure we still have
in our lab.

> 
> Thanks for the testing offer.

No problem.

> I'm definitely going to include a "pci=nocrs" option :-)

You don't seem like the kind of guy who would leave
something important like that out, but I thought I'd
better ask anyway.

> My current patches break something in CardBus, but as soon as I
> track that down, I'll post them and you can try them out.

OK, I will watch for them.
Comment 21 Peter Haight 2010-02-04 22:06:23 UTC
I've tried the patches and they seem to work. I'm getting this weird effect of flashing blue horizontal lines all over the screen, but I'm fairly certain that's related to some other problem with Radeon drivers in the 2.6.33 kernel (I've found other complaints of that).
Comment 22 Larry Finger 2010-02-04 22:30:29 UTC
Created attachment 24914 [details]
Output from dmesg command after patches applied

This is after applying the 7 patches to Jesse's linux-next tree; patch #2 was already applied.

With the patches, the kernel booted with no problems.
Comment 23 Gary Hade 2010-02-04 23:52:26 UTC
Created attachment 24915 [details]
patched kernel acpiphp load problem

Bjorn, Finally getting to your patches but ran into the
attached problem loading 'acpiphp'.  Probably unrelated to
your patches but I haven't backed them out yet to make sure.
Also a good chance for pilot error.
Comment 24 Gary Hade 2010-02-05 00:32:37 UTC
After reverting all 7 patches, acpiphp load still fails.
Comment 25 Bjorn Helgaas 2011-05-27 16:04:42 UTC
I think this issue is resolved as of 2.6.37, and I'd like to close this issue.  Any objections?
Comment 26 Gary Hade 2011-05-27 21:23:38 UTC
(In reply to comment #25)
> I think this issue is resolved as of 2.6.37, and I'd like to close this
> issue.  Any objections?

As the last complainer, I have no objections.  Any issues
I was aware of have been resolved.

Thanks for your nice work on this.
Comment 27 Bjorn Helgaas 2011-05-27 21:53:23 UTC
Resolved in 2.6.37.