Bug 15533

Summary: i915 fails to set mode in 2.6.34-rc1, ok in 2.6.33-rc8
Product: Drivers Reporter: Pete Zaitcev (zaitcev)
Component: PCI Assignee: Bjorn Helgaas (bjorn.helgaas)
Severity: normal CC: bjorn.helgaas, linux-bugs, maciej.rutecki, Matt_Domsch, rezwanul_kabir, rjw, sndirsch, trenn
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 15310    
Attachments: dmesg bad
lspci -vv
parse additional _CRS resource types
debug patch to dump _CRS resources
ignore Producer bit, parse additional _CRS resources
dmesg w/ CRS dumps
dmesg w/ pnp.debug

Description Pete Zaitcev 2010-03-13 20:37:05 UTC
When booting 2.6.34-rc1, i915 fails to set mode with the following message:

i915 0000:00:02.0: irq 33 for MSI/MSI-X
[drm] set up 127M of stolen space
[drm:i915_gem_init_ringbuffer] *ERROR* Ring head not reset to zero ctl ffffffff head ffffffff tail ffffffff start ffffffff

Unfortunately, it means that X cannot start (the "new" X depends on DRM
for modesetting).
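
A minimal sketch for triaging the error line above: it pulls the four ring register values out of the message and checks whether every one reads back as all-ones. The interpretation is an assumption, not something this report establishes: all-ones is what an MMIO read commonly returns when no device responds at the address. The helper name ring_registers is hypothetical.

```python
import re

# Error line copied from the dmesg excerpt above.
LINE = ("[drm:i915_gem_init_ringbuffer] *ERROR* Ring head not reset to zero "
        "ctl ffffffff head ffffffff tail ffffffff start ffffffff")


def ring_registers(line):
    """Return the name -> value pairs printed in the error line as integers."""
    pairs = re.findall(r"\b(ctl|head|tail|start) ([0-9a-f]{8})\b", line)
    return {name: int(value, 16) for name, value in pairs}


regs = ring_registers(LINE)
# True when every register read returned 0xffffffff.
all_ones = all(value == 0xFFFFFFFF for value in regs.values())
print(regs, "all ones:", all_ones)
```

If all_ones is true, it may be worth checking where the device's BARs ended up before suspecting the ring setup code itself.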

Bisecting fingered commit 7bc5e3f2be32ae6fb0c74cd0f707f986b3a01a26:
x86/PCI: use host bridge _CRS info by default on 2008 and newer machines

See also bug 15480. Filing a separate bug in case remedies are different
for this case and Yanko's, to be possibly duped later.

L-k thread: http://lkml.org/lkml/2010/3/12/558
Comment 1 Pete Zaitcev 2010-03-13 20:42:25 UTC
Created attachment 25504 [details]
dmesg bad
Comment 2 Pete Zaitcev 2010-03-13 20:44:47 UTC
Created attachment 25505 [details]
lspci -vv
Comment 3 Pete Zaitcev 2010-03-13 20:46:32 UTC
Created attachment 25506 [details]
Comment 4 Bjorn Helgaas 2010-03-15 17:35:12 UTC
Created attachment 25522 [details]
parse additional _CRS resource types

One problem with the current pci=use_crs code is that it ignores several types of memory and I/O descriptors, which could cause this problem.  This patch adds support for all the descriptors I know about.  Can you please:

  - Apply this patch
  - Boot with "acpi.debug_level=0x00010000 acpi.debug_layer=0x00000100"
  - Attach the resulting dmesg and whether the i915 works

Comment 5 Bjorn Helgaas 2010-03-15 17:48:27 UTC
Created attachment 25523 [details]
debug patch to dump _CRS resources

If the patch in comment #4 does not fix the problem, would you please apply this patch to collect more debug information?  This patch needs CONFIG_ACPI_DEBUG=y and the same "acpi.debug_level=0x00010000 acpi.debug_layer=0x00000100" boot parameters as before.  I'd like to see the entire dmesg log, which will be large, so you might need something like "log_buf_len=8M" in addition.
Comment 6 Bjorn Helgaas 2010-03-16 16:44:07 UTC
Created attachment 25549 [details]
ignore Producer bit, parse additional _CRS resources

Please test this patch instead.  In addition to parsing additional _CRS types, it ignores the Consumer/Producer bit, which BIOSes have used inconsistently.
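
To illustrate what ignoring the Consumer/Producer bit changes, here is a rough sketch. It is not the kernel patch. In an ACPI address space descriptor, bit 0 of the General Flags byte is the Consumer/Producer bit (1 = the device consumes the resource, 0 = it produces it for child devices). The AddressDescriptor type, the windows() helper, and the two sample descriptors are hypothetical and only show the filtering difference.

```python
from dataclasses import dataclass

CONSUMER_BIT = 0x01  # General Flags bit 0: 1 = consumer, 0 = producer


@dataclass(frozen=True)
class AddressDescriptor:
    """Hypothetical stand-in for one _CRS address space descriptor."""
    start: int
    end: int
    general_flags: int


def windows(descriptors, honor_producer_bit):
    """Return the ranges treated as host bridge windows."""
    return [
        (d.start, d.end)
        for d in descriptors
        if not honor_producer_bit or not (d.general_flags & CONSUMER_BIT)
    ]


# Made-up example: the second range is forwarded to PCI devices, but the
# BIOS marked it as "consumer".
crs = [
    AddressDescriptor(0x000A0000, 0x000BFFFF, 0x00),
    AddressDescriptor(0xB4000000, 0xF5BFFFFF, 0x01),
]

strict = windows(crs, honor_producer_bit=True)
lenient = windows(crs, honor_producer_bit=False)
print(len(strict), len(lenient))
```

With the bit honored, the mislabeled range is dropped; with it ignored, both ranges are kept as windows.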
Comment 7 Pete Zaitcev 2010-03-22 00:14:27 UTC
Created attachment 25635 [details]
dmesg w/ CRS dumps

The run includes patches from comment #6 (test fix, failed), and comment #5
(additional debugging). CONFIG_ACPI_DEBUG is on, options are on, see inside
the dmesg. Only 100KB in size, no need for 8MB log_buf. The required dump
seems to be inside, but please check.
Comment 8 Bjorn Helgaas 2010-03-22 03:43:24 UTC
Thanks for testing this.  What kind of system is this?  Is it shipping or a prototype?  Do you know whether it can run Windows or whether it has passed WHQL testing?

Would you mind turning on CONFIG_PNP_DEBUG_MESSAGES, booting with "pnp.debug", and attaching another dmesg log and an acpidump, please?

  pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
  pci_root PNP0A08:00: host bridge window [mem 0xb4000000-0xf5bfffff]
  pci 0000:00:02.0: reg 10: [mem 0xff000000-0xff3fffff 64bit]
  system 00:0a: [mem 0xff000000-0xffffffff] could not be reserved

The i915 device at 00:02.0 (and many others) is in the [mem 0xff000000-0xffffffff] range.  For some reason, the BIOS reported that range as a motherboard resource, but not as a PCI host bridge window.

I'm not sure how we are supposed to determine that this motherboard resource is connected with these PCI devices.  Maybe the acpidump will have a clue.

These are the BARs in the range in question, which all seem to be built-in devices (except a few which are behind built-in bridges):

  pci 0000:00:02.0: reg 10: [mem 0xff000000-0xff3fffff 64bit]
  pci 0000:00:16.0: reg 10: [mem 0xff6c0000-0xff6c000f 64bit]
  pci 0000:00:16.3: reg 14: [mem 0xff6a0000-0xff6a0fff]
  pci 0000:00:17.0: reg 10: [mem 0xff690000-0xff690fff 64bit]
  pci 0000:00:17.0: reg 18: [mem 0xff680000-0xff6800ff 64bit]
  pci 0000:00:19.0: reg 10: [mem 0xff600000-0xff61ffff]
  pci 0000:00:19.0: reg 14: [mem 0xff670000-0xff670fff]
  pci 0000:00:1a.0: reg 10: [mem 0xff660000-0xff6603ff]
  pci 0000:00:1b.0: reg 10: [mem 0xff650000-0xff653fff 64bit]
  pci 0000:00:1d.0: reg 10: [mem 0xff640000-0xff6403ff]
  pci 0000:00:1f.2: reg 24: [mem 0xff630000-0xff6307ff]
  pci 0000:00:1f.3: reg 10: [mem 0xff620000-0xff6200ff 64bit]
  pci 0000:00:1c.6:   bridge window [mem 0xff500000-0xff5fffff]
  pci 0000:06:00.0:   bridge window [mem 0xff500000-0xff5fffff]
  pci 0000:07:00.0: reg 10: [mem 0xff510000-0xff5107ff]
  pci 0000:07:00.0: reg 14: [mem 0xff500000-0xff503fff]
  pci 0000:00:1e.0:   bridge window [mem 0xff400000-0xff4fffff]
  pci 0000:09:01.0: reg 30: [mem 0xff400000-0xff40ffff pref]

In fact, the only memory BAR *not* in that range is this one, which I guess is the frame buffer:

  pci 0000:00:02.0: reg 18: [mem 0xd0000000-0xdfffffff 64bit pref]

I wonder if there's a parent/child relationship between the host bridge and the motherboard device, and we're supposed to infer that this range is passed through somehow.
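
The containment check described in this comment can be sketched as follows. The ranges are copied from the dmesg lines quoted above (only a few of the BARs are included); the helper names contains and classify are hypothetical, and this is only an illustration of the reasoning, not what the kernel does.

```python
# Host bridge windows reported by _CRS (from the dmesg lines above).
HOST_BRIDGE_WINDOWS = [
    (0x000A0000, 0x000BFFFF),
    (0xB4000000, 0xF5BFFFFF),
]
# Range reported by the motherboard device "system 00:0a".
MOTHERBOARD_RESOURCE = (0xFF000000, 0xFFFFFFFF)

# A subset of the BARs listed above.
BARS = {
    "0000:00:02.0 reg 10": (0xFF000000, 0xFF3FFFFF),
    "0000:00:02.0 reg 18": (0xD0000000, 0xDFFFFFFF),
    "0000:00:19.0 reg 10": (0xFF600000, 0xFF61FFFF),
    "0000:00:1b.0 reg 10": (0xFF650000, 0xFF653FFF),
}


def contains(outer, inner):
    """True if the inclusive range `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(bar):
    """Say which reported resource, if any, covers this BAR."""
    if any(contains(window, bar) for window in HOST_BRIDGE_WINDOWS):
        return "host bridge window"
    if contains(MOTHERBOARD_RESOURCE, bar):
        return "motherboard resource only"
    return "unclaimed"


for name, bar in BARS.items():
    print(f"{name}: [mem {bar[0]:#010x}-{bar[1]:#010x}] -> {classify(bar)}")
```

Only the frame buffer BAR (reg 18) lands in a host bridge window; the others fall in the motherboard range alone.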
Comment 9 Pete Zaitcev 2010-03-25 18:10:35 UTC
Created attachment 25713 [details]
dmesg w/ pnp.debug
Comment 10 Pete Zaitcev 2010-03-25 18:15:28 UTC
Created attachment 25714 [details]
Comment 11 Pete Zaitcev 2010-03-25 18:35:28 UTC
This is a pre-production motherboard intended for testing of i7 CPUs,
so the BIOS can be broken. But I'm stuck with it for a while yet, and
it worked before now, so... Worst comes to worst, I'll have to continue
with pci=nocrs, if we cannot find a workaround.
Comment 12 Bjorn Helgaas 2010-03-25 21:38:59 UTC
The motherboard (PNP0C01) device *is* inside the host bridge (PNP0A08) scope, i.e., the namespace looks like this:

    _HID PNP0A08
    _CRS ... [mem 0xb4000000-0xf5bfffff] ...
      _HID PNP0C01
      _CRS ... [mem 0xff000000-0xffffffff] ...

But I think the [mem 0xff000000-0xffffffff] range should be described on the bridge, not the motherboard device, because it is used by PCI devices:

  pci 0000:00:02.0: reg 10: [mem 0xff000000-0xff3fffff 64bit]

I built a similar "PNP0C01 inside PNP0A03, with PCI devices inside PNP0C01 resources" situation in SeaBIOS and booted Windows 2008 R2 on it with qemu.  Windows didn't complain, but it did move the PCI devices out of the PNP0C01 region and put them in one of the PNP0A03 windows, just like Linux does.

So I think this is a BIOS defect, and we don't need a Linux change to accommodate it because this is a pre-production system and the BIOS should be fixed before release.

However, I am concerned because the i915 device didn't work after we moved it.  We reassigned it this space:

  pci 0000:00:02.0: BAR 0: assigned [mem 0xb4000000-0xb43fffff 64bit]

and as far as I can tell, that *should* work fine, too.  We moved almost every other PCI device, too, and I don't see the other devices blowing up.  I wonder if there's some i915 driver bug, like a dependency on the device being in the same place it was when BIOS first configured it?
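
As a quick sanity check of the reassignment, here is a sketch that compares the original and reassigned BAR 0 ranges from the logs above. It assumes the usual PCI rule that a memory BAR is aligned to its own size; the variable names are hypothetical. Passing these checks says nothing about whether the driver copes with the move.

```python
def size(rng):
    """Size in bytes of an inclusive (start, end) range."""
    return rng[1] - rng[0] + 1


original = (0xFF000000, 0xFF3FFFFF)   # BAR 0 as configured by the BIOS
assigned = (0xB4000000, 0xB43FFFFF)   # BAR 0 after Linux moved it
window = (0xB4000000, 0xF5BFFFFF)     # host bridge window from _CRS

same_size = size(original) == size(assigned)
# Assumption: a memory BAR must be naturally aligned to its size.
aligned = assigned[0] % size(assigned) == 0
inside_window = window[0] <= assigned[0] and assigned[1] <= window[1]

print(f"size: {size(assigned) // (1024 * 1024)} MiB")
print("same size:", same_size, "aligned:", aligned, "inside window:", inside_window)
```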
Comment 13 Bjorn Helgaas 2010-04-05 17:26:55 UTC
Pete, I'm going to close this as "invalid" because I think this is a BIOS defect in an unreleased system.  Please report it as a BIOS defect if you can.  If the firmware folks believe this is *not* a BIOS defect, please re-open this with any explanation they provide.  In the meantime, I think booting with "pci=nocrs" is the right workaround.

I still think the fact that the i915 doesn't work after we move it is another issue, but I don't have time to debug that right now.  What I'd *like* to do there is figure out whether the device accidentally got disabled somewhere along the way.  That should probably be investigated as a separate bug report.
Comment 14 Stefan Dirsch 2010-04-21 17:35:59 UTC
I see the same issue on my Clarkdale (8086:0042) machine. It's also a preproduction system. Workaround "pci=nocrs" fixes the issue for me.
Comment 15 Bjorn Helgaas 2010-04-27 21:55:22 UTC
Huh.  It's worrisome that this happens on another prototype.  Is there any way you could engage a BIOS engineer to figure out if this is a bug that will be fixed?  Or find out whether Windows boots on that machine?  If we don't change *something*, users will get machines that will only boot with "pci=nocrs", and that would be very bad.
Comment 16 Pete Zaitcev 2010-04-27 23:19:24 UTC
There's no question that BIOS is trippy on my box. For example, it fails
USB handoff, which is clearly a bug. I am thinking, wait until Dell ships
Clarkdale and then see what happens.
Comment 17 Stefan Dirsch 2010-04-29 00:46:47 UTC
There is definitely no newer BIOS available for my machine. I'm trying to get in contact with the developers at the vendor of this preproduction machine.