Since some machines need the ACPI PCI Hotplug driver to get ExpressCard hotplugging working and I seems that the driver can't be auto-loaded the Debian kernel team decided to build-in this driver (CONFIG_HOTPLUG_PCI_ACPI=y instead of CONFIG_HOTPLUG_PCI_ACPI=m).
See debian kernel changelog for version 3.2.10-1 :
- [x86] Change HOTPLUG_PCI_ACPI to built-in (Closes: #663433)
As a consequence of this modification the kernel fails to boot on at least two recent HP notebooks (see downstream bugreport), including my HP Folio 13-2000. I can reproduce this problem always on a fresh install of debian wheezy (BIOS is up to date).
I can workaround this problem by using the kernel boot parameter pci=noacpi or by comiling the kernel with CONFIG_HOTPLUG_PCI_ACPI=m. I tried this with (at least) the following kernels:
- debian 3.2.23-1 (testing)
- debian 3.5.2-1 (experimental)
- kernel.org 3.2.28
- kernel.org 3.5.3
When I compile the ACPI PCI Hotplug driver as a module (CONFIG_HOTPLUG_PCI_ACPI=m) and add acpiphp to /etc/modules I can't reproduce this bug.
I'll attach the output of dmesg for four cases (with kernel 3.5.3):
- dmesg-3.5.3-acpiphp_buildin: CONFIG_HOTPLUG_PCI_ACPI=y *
- dmesg-3.5.3-acpiphp_buildin-pci=noacpi: kernel boot parameter pci=noacpi in use
- dmesg-3.5.3-acpiphp_not_buildin: CONFIG_HOTPLUG_PCI_ACPI=m
- dmesg-3.5.3-acpiphp_not_buildin-manually_loaded: acpiphp added to /etc/modules
Created attachment 78551 [details]
Created attachment 78561 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=y + kernel boot parameter pci=noacpi
Created attachment 78571 [details]
Created attachment 78581 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=m + acpiphp added to /etc/modules
Created attachment 78761 [details]
Created attachment 78771 [details]
Created attachment 78781 [details]
Disassembled ASL source code file (DSDT)
Created attachment 78791 [details]
I added some attachments which might be useful. Sorry if I also added some unnecessary information & forgot something relevant – if so, please just tell me.
I'd also like to mention that I asked the debian kernel team to reconsider their decision to build-in the ACPI PCI Hotplug driver. Ben Hutchings answered :
> I think it is the correct decision. Building it as a module effectively
> meant it was disabled on most machine, since it doesn't get auto-loaded.
> It may be preferable to return to building it as a module *if* we can
> arrange for it to be loaded automatically where it's useful. But that
> presumably would not fix this bug.
He suggested to open an upstream bug report.
HP released a BIOS update (version F.17). According to the changelog just one enhancement was added:
> Provides the HP tool to enable or disable the ability to boot the system
> from a USB disk on key.
However, I flashed my BIOS. As far as I can tell nothing really changed, but still I wanted to ask: Would it make any sense to update (some of) the attachements? (Which?)
Created attachment 82421 [details]
turn on debug
I don't think this is related to the BIOS. Don't bother updating any of the attachments.
Can you build a kernel with CONFIG_PCI_DEBUG=y and this patch applied and attach the dmesg log? If you can do this in a recent kernel, e.g., v3.6, that would be even better.
This feels like it might be memory corruption or something equally elusive, so I suspect we'll have to iterate on this a few times to find the problem.
Created attachment 82511 [details]
dmesg from kernel 3.6, with debugging patch
compiled kernel 3.6 with the debug patch from #11.
It boots without problems this time! Will double check with a different .config (which was known to fail before), and if that still works I'll try a reverse git bisect, to see where it started working.
For me at least the bug is fixed by dfb117b3e50c52c7b3416db4a4569224b8db80bb ,
PCI: acpiphp: check whether _ADR evaluation succeeded --
and indeed, the dmesg attached in #12 shows
[ 3.362314] acpiphp_glue: found PCI host-bus bridge with hot-pluggable slots
[ 3.362337] acpiphp_glue: can't evaluate _ADR (0x5)
[ 3.362731] acpiphp_glue: Bus 0000:00 has 0 slots
[ 3.362732] acpiphp_glue: Total 0 slots
- so which does seem to correlate with the problem, crashing on hot-plugging ACPI.
I'll try to see whether cherry-picking that commit on top of v3.2.30 works (for the backporting to Debian's release).
Cherry-picking dfb117 on top of v3.2.30 does the trick, indeed!
If Stefan comfirms that you can close the bug for the upstream kernel already; I'll add the patch to the Debian bug report.
I can confirm that this bug is fixed in kernel 3.6.
I also applied the patch "PCI: acpiphp: check whether _ADR evaluation succeeded" on top of v3.2.30 with success – it fixes the problem for me too.
Created attachment 82681 [details]
another debug patch
Glad we found something that influences this, but it seems like there's more to the story that I'd like to figure out.
Looking at v3.2, where Jeroen's page fault occurred, we have this call sequence:
drm_lastclose # error path when agp_init fails
2) dev->dev_private = (void *)dev_priv
Here's the intel_fb_restore_mode() code:
void intel_fb_restore_mode(struct drm_device *dev)
drm_i915_private_t *dev_priv = dev->dev_private;
ret = drm_fb_helper_restore_fbdev_mode(&dev_priv->fbdev->helper);
DRM_DEBUG("failed to restore crtc mode\n");
The page fault IP was intel_fb_restore_mode+0x9/0x3f. My v3.2 build (different config options than Jeroen's, of course) produced this code:
0xffffffff813bc5d0 <+0>: push %rbp
0xffffffff813bc5d1 <+1>: mov %rsp,%rbp
0xffffffff813bc5d4 <+4>: callq 0xffffffff81598cc0 <mcount>
0xffffffff813bc5d9 <+9>: mov 0x348(%rdi),%rax
0xffffffff813bc5e0 <+16>: mov 0x1980(%rax),%rdi
0xffffffff813bc5e7 <+23>: callq 0xffffffff813629d0 <drm_fb_helper_restore_fbdev_mode>
(gdb) p &((struct drm_device *)0)->dev_private
$2 = (void **) 0x348 <irq_stack_union+840>
(gdb) p &((struct drm_i915_private *)0)->fbdev
$3 = (struct intel_fbdev **) 0x1980 <irq_stack_union+6528>
I think we faulted when dereferencing dev->dev_private->fbdev because dev->dev_private == NULL. This occurred at 1) in the call trace above, but dev->dev_private isn't initialized until 2).
This looks like a DRM bug, independent of the _ADR fix (dfb117b3) you identified. I'm still not sure how applying the _ADR fix keeps us from hitting this DRM problem. I'd like to figure that out to make sure there's not a third bug here.
This patch is against v3.6 and adds a little more debug info. It would be really helpful if one or both of you could apply this and collect the dmesg log. It would also be helpful if you could keep this patch applied, revert dfb117b3 (might be some minor conflicts here), and collect another dmesg.
Created attachment 82731 [details]
dmesg (kernel 3.6 with debug patch from message #16 applied), HP Folio 13-2000
Here's the dmesg log from kernel 3.6 with the debug patch from message #16 applied.
Created attachment 82741 [details]
dmesg (kernel 3.6 with debug patch from message #16 applied and dfb117b3 reverted), HP Folio 13-2000
And here's the dmesg log from kernel 3.6 with the debug patch from message #16
applied and dfb117b3 reverted.
This time the kernel didn't fail to boot – but gnome-shell failed (so gnome started in fallback mode).
(In reply to comment #18)
> dmesg (kernel 3.6 with debug patch from message #16 applied and dfb117b3
> reverted), HP Folio 13-2000
Here's the effect of using garbage data from the stack after _ADR failed:
acpiphp_glue: found ACPI PCI Hotplug slot 1 at PCI 0000:00:4f00
Then we have this error:
[drm:i915_get_bridge_dev] *ERROR* bridge device not found
This means we couldn't find device 0000:00:00.0, which clearly does exist (we found it earlier), so the list of devices must be corrupted. Could you try this same kernel again with the following config options turned on? If that doesn't turn anything up, I'll have to add some more debug in the acpiphp path, I guess.
Created attachment 82751 [details]
dmesg from kernel 3.6, with debugging config/patches
I applied the debug patch, reverted dfb117b3 and set CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC -- but unfortunately I could not boot, not even in
single mode. In normal boot (quiet mode) I did see
[drm:i915_get_bridge_dev] *ERROR* bridge device not found, and not much else.
In single mode the last messages are the ACPI warnings:
0x0500-0x057f System IO conflicts with Region Region \_SB_.PCI0.PEG0.PEGP.GPIO 2 (20120711/utaddress-251) and similar,
lpc_ich: Resource conflict(s) found affecting gpio_ich
So to get some info out I rebooted with pci=noacpi, and the attached dmesg is from that.
I'll try with ACPI hotplug disabled next.
(but will first send a dmesg from 3.6 with just the debugging patch, without the revert and without the DEBUG_SLAB etc.).
Created attachment 82761 [details]
dmesg from kernel 3.6, with debugging patch only
Here is the dmesg from the 3.6 kernel with only the latest debug patch applied,
but without the revert, and without CONFIG_DEBUG_SLAB or CONFIG_DEBUG_PAGEALLOC.
3.6 with all the debugging disabled, dfb117b3 reverted but HOTPLUG_PCI_ACPI disabled gives almost exactly the same dmesg as the kernel with only the debugging patch from #16 , and dfb117b3 still aplied (apart from the "can't evaluate _ADR (0x5)" warning, obviously).
For what it's worth, the failing kernels never make it as far as the mode switch to the higher resolution console (from 80 to 132? characters), and timing wise seem to die just about when the mode switch should happen.
Created attachment 82771 [details]
debug patch 3
Thanks, Stefan and Jeroen. This patch adds more debug to try to figure out where the PCI device list is being corrupted. It applies on v3.6 (it includes the reversion of dfb117b3). If you can get a dmesg with CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC turned on, that would be ideal, but if they don't work, you can turn them off. If you're having trouble booting, one possibility is to boot with "ignore_loglevel" and capture a video.
Don't bother with CONFIG_HOTPLUG_PCI_ACPI=n or with dfb117b3 applied; those configs don't fail, so I don't think we'll learn anything more from them.
Created attachment 82791 [details]
dmesg (kernel 3.6 with debug patch from message #23 applied), HP Folio 13-2000
Here's the dmesg log from kernel 3.6 with the patch from message #23 applied and CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC and CONFIG_PCI_DEBUG turned on.
I had no troubles booting.
Stefan, can you try the same thing as comment #24, but with CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned off?
Your comment #18 and comment #24 logs *should* be functionally equivalent (v3.6 kernel with dfb117b3 reverted and a little debug added), but they aren't. So let's get back closer to the comment #18 failure, since the memory debugging didn't turn up anything.
Created attachment 82841 [details]
dmesg after patch #23
Here is a dmesg after the patch from #23 attached. DEBUG_PAGE_ALLOC and DEBUG_SLAB disabled. This time the kernel boots again! It could be some racing condition, of course, where the debug messages change the timing enough to hide the bug.
With the normal 'quiet' boot the message that still does appear on the screen is:
acpiphp: pci_hp_register failed with error -19
that also occurs in Stefan's boot log.
Created attachment 82851 [details]
extracted from patch of #23: kzalloc that makes the kernel boot again
I found why the kernel booted with the debug patch from #23: it adds a line
slot = kzalloc(sizeof(*slot), GFP_KERNEL);
in drivers/pci/hotplug/acpiphp_core.c, before the pci_hp_register() call,
as per the attached patch -- and
adding that line on top of v3.6 with dfb117b3 reverted (and no other changes)
makes it work again.
Created attachment 82871 [details]
debug patch 4
Thanks, Jeroen. I botched debug patch 3 by pasting that extra kzalloc line. Sorry for wasting your time testing that.
If you could try this replacement on top of v3.6, that would be great. Try it first with CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on, and if it doesn't fail, try again with them off. I think we'll only learn something if we see a failure.
Created attachment 82931 [details]
dmesg (kernel 3.6 with debug patch 4 from comment #28 applied)
This time the kernel failed to boot completely. I applied debug patch 4 (from comment #28) and turned CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC on.
Also for me the kernel with debug 4 from comment #28 fails to boot completely - to the extent that I cannot get a dmesg out.
It hangs while executing /etc/rcS.d/S02udev. That's udev version 175, by the way.
And the hang is asynchronously: the last line printed changes from boot to boot, so it seems that whatever udev did to trigger the crash takes a little while to cause the hang.
I can boot with init=/bin/sh . Would there be a way to emulate what udev does from the shell? That way I could try to do one bit after the other, with enough of a gap to see what is the fatal step.
I tried changing the log level of udev to info or debug, but than it prints so much that the interesting bit has probably disappeared from the screen -- and when it hangs it is really stuck, and I can't scroll back
(I would not be surprised if it gets stuck when it tries to do a mode switch
to the higher resolution).
Created attachment 82971 [details]
debug patch 5
This is perfect, thanks! The comment #29 kernel is good enough to boot and save the dmesg, while still showing the problem. The dmesg shows a valid 00:00.0 device, followed later by i915 not being able to find it:
[ 3.673912] init_acpi 3: bridge 0000:00:00.0 at 0xffff88014a0ef000
[ 15.504629] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0
[ 15.504634] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0
[ 15.504836] acpiphp_glue: acpi_device not found
[ 15.504840] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 1 disabled
[ 17.950082] [drm:i915_get_bridge_dev] *ERROR* bridge device not found
I'm suspicious of the acpiphp stuff in the middle. This patch adds similar debug to that path, so maybe we can narrow it down a bit.
Jeroen, I don't know enough to manually step through the udev startup. It's possible that if you just built in i915 statically, that would be enough to see the problem even without udev. If so, then booting with init=/bin/sh might be enough (you might have to "mount -o remount,rw /" or similar to get a writable filesystem to save the log).
But as long as Stefan can still test, we should be fine.
Created attachment 83021 [details]
dmesg (kernel 3.6 with debug patch 5 from comment #31 applied)
I applied debug patch 5 (from comment #31) and turned CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC on. This time again the kernel didn't fail – I was able to enter gnome fallback mode.
Since it didn't make any sense to me that the kernel failed with debug patch 4 but not with debug patch 5 applied I tried out patch 4 one more time (I did the same as I described in comment #29) and this time I was able to enter gnome fallback mode…
Sorry for the confusion. I really don't know what
…and sorry for that too :/
What I wanted to say was that I don't know what I did last time I applied debug patch 4 – I just can't reproduce the kernel fail I described in comment #29 anymore.
I hope the dmesg log from comment #32 helps nonetheless.
Created attachment 83061 [details]
screenshot of hang
Progress of sorts: I know where it hangs, when loading the radeon module.
What udev does is loading modules, so I booted with init=/bin/bash (with the kernel with patch 5 applied), and started modprobing modules.
In 'modprobe radeon' it gets stuck, on "Loading CAICOS Microde", see the
I can modprobe all the pre-requisites for radeon (i915, drm, ttm) etc. without problems. I'll attach a dmesg after 'modprobe i915' next.
Created attachment 83071 [details]
dmesg after modprobe i915
dmesg after booting with init=/bin/bash, plus 'modprobe i915'.
No dmesg after the subsequent 'modprobe radeon' because it got stuck there.
Not as bad as with the regular boot, though: I was still able to scroll back the console buffer, and typed characters appeared on the screen.
Created attachment 83081 [details]
debug patch 6
Sorry, meant to explain this... Both tests (comment #32 and comment #36) show that the bridge device went missing in disable_device(), so I added more debug output in that function. We should be getting close. Can one or both of you try patch 6 from comment #37 and post the results?
Created attachment 83091 [details]
dmesg after patch 6
Ok, here's the dmesg after applying patch 6
(again init=/bin/bash, modprobe i915)
Created attachment 83101 [details]
debug patch 7
Thanks, Jeroen, we're almost there. I have a theory that explains 99% of the situation, but there's one little piece I don't understand yet. Can you try this patch for that? Also, if it's not too much trouble, try building acpiphp as a module and loading that. The problem should not occur then; I just want to double-check the reason.
Created attachment 83111 [details]
dmesg (kernel 3.6 with debug patch 6 from comment #37 applied)
Here's my dmesg log from kernel 3.6 with debug patch 6 (from comment #31) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on.
Created attachment 83141 [details]
dmesg (kernel 3.6 with debug patch 7 from comment #40 applied)
Kernel 3.6 with debug patch 6 (from comment #40) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on. ACPI PCI Hotplug driver build-in.
Sould I also provide a log with CONFIG_HOTPLUG_PCI_ACPI=m?
Created attachment 83161 [details]
debug patch 8
Thanks, Stefan. This is slightly non-deterministic, and the additional debug I put in register_slot() changed the garbage values from the stack. Your previous boots from comments #32 and #41 had this:
acpiphp: _ADR status 0x5 device 0x4b00 function 0x11f0
while the one from comment #42 has this:
acpiphp: \_SB_.PCI0.DOCK _ADR status 0x5 device 0x8136 function 0x2435
The boot problem has to do with device 0000:00:00.0 being removed. That contains the AGP bridge for the i915 device at 0000:00:02.0, so removing it causes the DRM failure. 0000:00:00.0 was removed because the garbage 0x4b00 device number was truncated to 0x00, so it matched 0000:00:00.
But in the comment #42 boot, the garbage device number was 0x8136, which is truncated to 0x36, which *doesn't* match the 0000:00:00.0 bridge, so I think everything worked correctly. And this didn't exercise the path I wanted to learn about.
This patch plugs in the garbage device/function values from before, so it should fail the same way as before.
If it does fail, I would also appreciate a log with CONFIG_HOTPLUG_PCI_ACPI=m?
Thanks for all your patience; this has been a long effort.
Created attachment 83191 [details]
dmesg after patch 8
Here's the dmesg after patch 8 (init=/bin/bash, modprobe i915).
Still failing, as usual.
And building acpiphp modular makes it work, as in v3.0/v3.2. I'll send a dmesg from that next.
Created attachment 83201 [details]
dmesg after patch 8, with modular acpiphp
Here's the dmesg for the kernel with patch 8, but acpiphp modular.
I had to duplicate bh_check_bridge in slot.c (non-modular) and acpiphp_glue.c (for acpiphp.ko).
Even after modprobe acpiphp it is still fine, as expected (that already worked in v3.0 / v3.2).
Created attachment 83391 [details]
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied)
Kernel 3.6 with debug patch 8 (from comment #43) applied, CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC turned on. ACPI PCI Hotplug driver build-in. I'll add the log with acpiphp build as module later today or tomorrow.
Thanks for the explanation, Bjorn. Since I'd have to lern a lot to understand whats really going on and I don't have any spare time right now I'm not really using my brain that much and just try to provide the logs you request.
(In reply to comment #45)
> Even after modprobe acpiphp it is still fine, as expected (that already
> in v3.0 / v3.2).
Thanks. This looks as I expected -- when acpiphp is a module, we erroneously create a slot for device 0000:00:00.0, but we don't see a BUS CHECK notification (there probably *was* one but it occurred before acpiphp was loaded), so everything seems fine.
But I suspect that if we got another BUS CHECK, for example, if you dock or undock after loading acpiphp, we'd probably remove the 0000:00:00.0 device and things might start failing. The i915 init has already been done, so video will probably keep working (unless you unload and reload i915; I don't know whether that's possible), but maybe lspci won't show 0000:00:00.0 anymore.
I have problems building the ACPI PCI Hotplug driver as module with the debug patch applied… the build process stops with:
drivers/built-in.o: In function `pci_create_slot':
/home/stefan/linux-3.6/drivers/pci/slot.c:257: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:280: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:299: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:312: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:319: undefined reference to `bh_check_bridge'
drivers/built-in.o:/home/stefan/linux-3.6/drivers/pci/hotplug/pci_hotplug_core.c:461: more undefined references to `bh_check_bridge' follow
make: *** [vmlinux] Fehler 1
make: *** [deb-pkg] Fehler 2
make: *** [deb-pkg] Fehler 2
Sorry, I'm stuck here.
Created attachment 83451 [details]
debug patch - duplicate bh_check_bridge
That compilation failure is because the debug addition bh_check_bridge is defined in acpiphp_glue.c, so if acpiphp gets compiled as a module that is not available for the core kernel (e.g. pci/slot.c). If you apply the attached patch, to duplicate that bh_check_bridge in the core and in the module, it will compile again.
Sorry, but when I try to apply the debug patch from comment #49 (duplicate bh_check_bridge) I get this:
patching file drivers/pci/hotplug/acpiphp_core.c
patching file drivers/pci/hotplug/acpiphp_glue.c
Hunk #1 FAILED at 115.
Hunk #2 succeeded at 140 (offset 8 lines).
Hunk #3 succeeded at 165 (offset 8 lines).
Hunk #4 succeeded at 225 (offset 8 lines).
Hunk #5 succeeded at 243 (offset 8 lines).
Hunk #6 succeeded at 282 (offset 8 lines).
Hunk #7 succeeded at 770 (offset 8 lines).
Hunk #8 succeeded at 937 (offset 8 lines).
Hunk #9 succeeded at 948 (offset 8 lines).
Hunk #10 succeeded at 962 (offset 8 lines).
Hunk #11 succeeded at 1065 (offset 8 lines).
Hunk #12 succeeded at 1095 (offset 8 lines).
Hunk #13 succeeded at 1108 (offset 8 lines).
Hunk #14 succeeded at 1225 (offset 8 lines).
Hunk #15 succeeded at 1270 (offset 8 lines).
Hunk #16 succeeded at 1294 (offset 8 lines).
Hunk #17 succeeded at 1302 (offset 8 lines).
Hunk #18 succeeded at 1368 (offset 8 lines).
Hunk #19 succeeded at 1379 (offset 8 lines).
Hunk #20 succeeded at 1412 (offset 8 lines).
Hunk #21 succeeded at 1516 (offset 8 lines).
Hunk #22 succeeded at 1526 (offset 8 lines).
Hunk #23 succeeded at 1567 (offset 8 lines).
Hunk #24 succeeded at 1602 (offset 8 lines).
1 out of 24 hunks FAILED -- saving rejects to file drivers/pci/hotplug/acpiphp_glue.c.rej
Sorry for being unclear -- the patch from comment #49 was on top of debug patch 8 (form comment #43)
if I understand you correctly I should first apply debug patch 8 (from comment #43) and after that debug patch from comment #49 (duplicate bh_check_bridge) – both on top of kernel 3.6. Is this correct? Because this is exactly what I did before…
Sorry for my clumsiness, before I reported this bug upstream I never even built a custom kernel myself.
Created attachment 83571 [details]
debug patch - duplicate bh_check_bridge
Stefan, sorry, my mistake: my first attempt at fixing it failed, and the patch I sent was only the fix on top of that fix. The attached patch (on top of patch 8) should really work.
Created attachment 83581 [details]
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied, acpiphp build as module)
Thanks Jeroen!! Now I was able to build the kernel with debug patch 8 (from comment #43) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on & ACPI PCI Hotplug driver build as module. I'll attach the dmesg log.
BTW, Video didn't work – the screen went black but I was able to switch to a text terminal, load acpiphp and save the log.
Conclusion: dfb117b3e50c5 ("PCI: acpiphp: check whether _ADR evaluation succeeded") fixes the root cause of this issue. This commit is already upstream and included in v3.6.
When acpiphp is statically built into the kernel, we initialize it and erroneously register "Slot " for pci_bus 0000:00:00:
acpiphp_init # module_init
list_for_each_entry(root, &acpi_pci_roots, node)
driver->add # for \_SB_.PCI0
bridge = kzalloc(struct acpiphp_bridge)
bridge->handle = handle # \_SB_.PCI0
bridge->pci_bus = root_bus # pci_bus 0000:00
acpi_walk_namespace(..., register_slot, ...)
acpi_evaluate_integer("_ADR") # fails
# we use garbage from stack, e.g.,
# device = 0x5f00
# function = 0x1a40
func = kzalloc(struct acpiphp_func)
func->function = (u8) 0x1a40 = 0x40
slot = kzalloc(struct acpiphp_slot)
slot->device = (u8) 0x5f00 = 0x00
slot->flags |= SLOT_ENABLED
Now we have a struct acpiphp_bridge associated with \_SB_.PCI0 and pci_bus 0000:00. It has a "Slot " for device 00 with SLOT_ENABLED.
And we registered handle_hotplug_event_bridge() for SYSTEM notifications. Later we receive a BUS CHECK notification on \_SB_.PCI0. The cause of the BUS CHECK is unclear, but not important: it's legal and we should handle it correctly.
Here's what happens when we handle the BUS CHECK:
(notify handler for ACPI_SYSTEM_NOTIFY)
handle_hotplug_event_bridge # on \_SB_.PCI0
for (slot = bridge->slots; slot; slot = slot->next)
status = get_slot_status(acpiphp_slot) # \_SB_.PCI0.DOCK
list_for_each_entry(func, &slot->funcs, sibling)
if (func->flags & FUNC_HAS_STA) # true
acpi_evaluate_integer("_STA") # 0x0
if (slot->flags & SLOT_ENABLED && status != ACPI_STA_ALL)
while ((pdev = dev_in_slot(slot)))
pdev = "pci 0000:00:00.0"
We incorrectly removed PCI device 0000:00:00.0, which causes this error:
[drm:i915_get_bridge_dev] *ERROR* bridge device not found
because 0000:00:00.0 is the bridge upstream of the i915 device.
Probably the reason we don't see the issue when acpiphp is a module is that the BUS CHECK is delivered before acpiphp is loaded, so acpiphp never sees it. We still might see a similar issue if another BUS CHECK were delivered, e.g., for a dock or undock event.
Commit dfb117b3e50c5 fixes the problem because we won't register the slot, so the BUS CHECK notification won't cause us to remove the device.