Bug 46481 - Kernel fails to boot with build-in ACPI PCI Hotplug driver on HP Folio 13-2000
Kernel fails to boot with build-in ACPI PCI Hotplug driver on HP Folio 13-2000
Status: RESOLVED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: PCI
x86-64 Linux
: P1 normal
Assigned To: Bjorn Helgaas
http://bugs.debian.org/cgi-bin/bugrep...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-08-27 14:48 UTC by Stefan Nagy
Modified: 2012-10-25 18:10 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.2.28, 3.5.3
Tree: Mainline
Regression: No


Attachments
dmesg CONFIG_HOTPLUG_PCI_ACPI=y (58.04 KB, text/plain)
2012-08-27 14:49 UTC, Stefan Nagy
Details
dmesg CONFIG_HOTPLUG_PCI_ACPI=y + kernel boot parameter pci=noacpi (54.22 KB, text/plain)
2012-08-27 14:50 UTC, Stefan Nagy
Details
dmesg CONFIG_HOTPLUG_PCI_ACPI=m (55.03 KB, text/plain)
2012-08-27 14:51 UTC, Stefan Nagy
Details
dmesg CONFIG_HOTPLUG_PCI_ACPI=m + acpiphp added to /etc/modules (55.30 KB, text/plain)
2012-08-27 14:52 UTC, Stefan Nagy
Details
lscpi -vvnn (33.07 KB, text/plain)
2012-08-29 19:42 UTC, Stefan Nagy
Details
dmidecode (12.06 KB, text/plain)
2012-08-29 19:51 UTC, Stefan Nagy
Details
Disassembled ASL source code file (DSDT) (476.56 KB, text/plain)
2012-08-29 19:59 UTC, Stefan Nagy
Details
ACPI tables (290.05 KB, text/plain)
2012-08-29 20:17 UTC, Stefan Nagy
Details
turn on debug (1.97 KB, patch)
2012-10-05 21:55 UTC, Bjorn Helgaas
Details | Diff
dmesg from kernel 3.6, with debugging patch (61.37 KB, text/plain)
2012-10-06 12:00 UTC, Jeroen Nijhof
Details
another debug patch (5.30 KB, patch)
2012-10-08 23:30 UTC, Bjorn Helgaas
Details | Diff
dmesg (kernel 3.6 with debug patch from message #16 applied), HP Folio 13-2000 (58.50 KB, text/plain)
2012-10-09 13:18 UTC, Stefan Nagy
Details
dmesg (kernel 3.6 with debug patch from message #16 applied and dfb117b3 reverted), HP Folio 13-2000 (56.84 KB, text/plain)
2012-10-09 16:54 UTC, Stefan Nagy
Details
dmesg from kernel 3.6, with debugging config/patches (61.53 KB, text/plain)
2012-10-09 20:49 UTC, Jeroen Nijhof
Details
dmesg from kernel 3.6, with debugging patch only (56.28 KB, text/plain)
2012-10-09 20:51 UTC, Jeroen Nijhof
Details
debug patch 3 (13.50 KB, patch)
2012-10-09 22:41 UTC, Bjorn Helgaas
Details | Diff
dmesg (kernel 3.6 with debug patch from message #23 applied), HP Folio 13-2000 (62.19 KB, text/plain)
2012-10-10 11:30 UTC, Stefan Nagy
Details
dmesg after patch #23 (58.78 KB, text/plain)
2012-10-10 20:50 UTC, Jeroen Nijhof
Details
extracted from patch of #23: kzalloc that makes the kernel boot again (822 bytes, patch)
2012-10-10 22:28 UTC, Jeroen Nijhof
Details | Diff
debug patch 4 (13.46 KB, patch)
2012-10-10 22:46 UTC, Bjorn Helgaas
Details | Diff
dmesg (kernel 3.6 with debug patch 4 from comment #28 applied) (59.86 KB, text/plain)
2012-10-11 10:32 UTC, Stefan Nagy
Details
debug patch 5 (17.57 KB, patch)
2012-10-11 21:32 UTC, Bjorn Helgaas
Details | Diff
dmesg (kernel 3.6 with debug patch 5 from comment #31 applied) (58.77 KB, text/plain)
2012-10-12 13:15 UTC, Stefan Nagy
Details
screenshot of hang (161.90 KB, image/png)
2012-10-12 19:45 UTC, Jeroen Nijhof
Details
dmesg after modprobe i915 (49.61 KB, text/plain)
2012-10-12 19:47 UTC, Jeroen Nijhof
Details
debug patch 6 (19.20 KB, patch)
2012-10-12 20:24 UTC, Bjorn Helgaas
Details | Diff
dmesg after patch 6 (50.42 KB, text/plain)
2012-10-12 20:39 UTC, Jeroen Nijhof
Details
debug patch 7 (20.63 KB, patch)
2012-10-12 23:29 UTC, Bjorn Helgaas
Details | Diff
dmesg (kernel 3.6 with debug patch 6 from comment #37 applied) (56.77 KB, text/plain)
2012-10-12 23:30 UTC, Stefan Nagy
Details
dmesg (kernel 3.6 with debug patch 7 from comment #40 applied) (58.49 KB, text/plain)
2012-10-13 02:13 UTC, Stefan Nagy
Details
debug patch 8 (20.67 KB, patch)
2012-10-13 03:03 UTC, Bjorn Helgaas
Details | Diff
dmesg after patch 8 (50.70 KB, text/plain)
2012-10-13 10:52 UTC, Jeroen Nijhof
Details
dmesg after patch 8, with modular acpiphp (61.82 KB, text/plain)
2012-10-13 10:56 UTC, Jeroen Nijhof
Details
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied) (57.66 KB, text/plain)
2012-10-13 15:24 UTC, Stefan Nagy
Details
debug patch - duplicate bh_check_bridge (13.51 KB, patch)
2012-10-15 05:33 UTC, Jeroen Nijhof
Details | Diff
debug patch - duplicate bh_check_bridge (13.61 KB, patch)
2012-10-15 20:24 UTC, Jeroen Nijhof
Details | Diff
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied, acpiphp build as module) (57.39 KB, text/plain)
2012-10-16 00:30 UTC, Stefan Nagy
Details

Description Stefan Nagy 2012-08-27 14:48:05 UTC
Since some machines need the ACPI PCI Hotplug driver to get ExpressCard hotplugging working and I seems that the driver can't be auto-loaded the Debian kernel team decided to build-in this driver (CONFIG_HOTPLUG_PCI_ACPI=y instead of CONFIG_HOTPLUG_PCI_ACPI=m).

See debian kernel changelog for version 3.2.10-1 [1]:
- [x86] Change HOTPLUG_PCI_ACPI to built-in (Closes: #663433)


As a consequence of this modification the kernel fails to boot on at least two recent HP notebooks (see downstream bugreport), including my HP Folio 13-2000. I can reproduce this problem always on a fresh install of debian wheezy (BIOS is up to date).

I can workaround this problem by using the kernel boot parameter pci=noacpi or by comiling the kernel with CONFIG_HOTPLUG_PCI_ACPI=m. I tried this with (at least) the following kernels:
- debian 3.2.23-1 (testing)
- debian 3.5.2-1 (experimental)
- kernel.org 3.2.28
- kernel.org 3.5.3

When I compile the ACPI PCI Hotplug driver as a module (CONFIG_HOTPLUG_PCI_ACPI=m) and add acpiphp to /etc/modules I can't reproduce this bug.

I'll attach the output of dmesg for four cases (with kernel 3.5.3):
- dmesg-3.5.3-acpiphp_buildin: CONFIG_HOTPLUG_PCI_ACPI=y *
- dmesg-3.5.3-acpiphp_buildin-pci=noacpi: kernel boot parameter pci=noacpi in use
- dmesg-3.5.3-acpiphp_not_buildin: CONFIG_HOTPLUG_PCI_ACPI=m
- dmesg-3.5.3-acpiphp_not_buildin-manually_loaded: acpiphp added to /etc/modules



[1] http://packages.debian.org/changelogs/pool/main/l/linux/linux_3.2.23-1/changelog#version3.2.10-1
Comment 1 Stefan Nagy 2012-08-27 14:49:57 UTC
Created attachment 78551 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=y
Comment 2 Stefan Nagy 2012-08-27 14:50:55 UTC
Created attachment 78561 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=y + kernel boot parameter pci=noacpi
Comment 3 Stefan Nagy 2012-08-27 14:51:32 UTC
Created attachment 78571 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=m
Comment 4 Stefan Nagy 2012-08-27 14:52:16 UTC
Created attachment 78581 [details]
dmesg CONFIG_HOTPLUG_PCI_ACPI=m + acpiphp added to /etc/modules
Comment 5 Stefan Nagy 2012-08-29 19:42:47 UTC
Created attachment 78761 [details]
lscpi -vvnn
Comment 6 Stefan Nagy 2012-08-29 19:51:31 UTC
Created attachment 78771 [details]
dmidecode
Comment 7 Stefan Nagy 2012-08-29 19:59:25 UTC
Created attachment 78781 [details]
Disassembled ASL source code file (DSDT)
Comment 8 Stefan Nagy 2012-08-29 20:17:50 UTC
Created attachment 78791 [details]
ACPI tables
Comment 9 Stefan Nagy 2012-08-29 20:48:30 UTC
I added some attachments which might be useful. Sorry if I also added some unnecessary information & forgot something relevant – if so, please just tell me.

I'd also like to mention that I asked the debian kernel team to reconsider their decision to build-in the ACPI PCI Hotplug driver. Ben Hutchings answered [1]:

> I think it is the correct decision.  Building it as a module effectively
> meant it was disabled on most machine, since it doesn't get auto-loaded.

> It may be preferable to return to building it as a module *if* we can
> arrange for it to be loaded automatically where it's useful.  But that
> presumably would not fix this bug.

He suggested to open an upstream bug report.



[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665493#49
Comment 10 Stefan Nagy 2012-09-03 20:35:57 UTC
HP released a BIOS update (version F.17). According to the changelog just one enhancement was added:
> Provides the HP tool to enable or disable the ability to boot the system
> from a USB disk on key.

However, I flashed my BIOS. As far as I can tell nothing really changed, but still I wanted to ask: Would it make any sense to update (some of) the attachements? (Which?)
Comment 11 Bjorn Helgaas 2012-10-05 21:55:35 UTC
Created attachment 82421 [details]
turn on debug

I don't think this is related to the BIOS.  Don't bother updating any of the attachments.

Can you build a kernel with CONFIG_PCI_DEBUG=y and this patch applied and attach the dmesg log?  If you can do this in a recent kernel, e.g., v3.6, that would be even better.

This feels like it might be memory corruption or something equally elusive, so I suspect we'll have to iterate on this a few times to find the problem.
Comment 12 Jeroen Nijhof 2012-10-06 12:00:37 UTC
Created attachment 82511 [details]
dmesg from kernel 3.6, with debugging patch

compiled kernel 3.6 with the debug patch from #11.
It boots without problems this time! Will double check with a different .config (which was known to fail before), and if that still works I'll try a reverse git bisect, to see where it started working.
Comment 13 Jeroen Nijhof 2012-10-06 21:03:51 UTC
For me at least the bug is fixed by dfb117b3e50c52c7b3416db4a4569224b8db80bb ,
    PCI: acpiphp: check whether _ADR evaluation succeeded --
and indeed, the dmesg attached in #12 shows

[    3.362314] acpiphp_glue: found PCI host-bus bridge with hot-pluggable slots
[    3.362337] acpiphp_glue: can't evaluate _ADR (0x5)
[    3.362731] acpiphp_glue: Bus 0000:00 has 0 slots
[    3.362732] acpiphp_glue: Total 0 slots

- so which does seem to correlate with the problem, crashing on hot-plugging ACPI.

I'll try to see whether cherry-picking that commit on top of v3.2.30 works (for the backporting to Debian's release).
Comment 14 Jeroen Nijhof 2012-10-07 08:18:37 UTC
Cherry-picking dfb117 on top of v3.2.30 does the trick, indeed!
If Stefan comfirms that you can close the bug for the upstream kernel already; I'll add the patch to the Debian bug report.

Thanks,

Jeroen
Comment 15 Stefan Nagy 2012-10-08 14:40:14 UTC
I can confirm that this bug is fixed in kernel 3.6.

I also applied the patch "PCI: acpiphp: check whether _ADR evaluation succeeded" on top of v3.2.30 with success – it fixes the problem for me too.

Thanks!
Comment 16 Bjorn Helgaas 2012-10-08 23:30:22 UTC
Created attachment 82681 [details]
another debug patch

Glad we found something that influences this, but it seems like there's more to the story that I'd like to figure out.

Looking at v3.2, where Jeroen's page fault occurred, we have this call sequence:

    i915_pci_probe
        drm_get_pci_dev
            drm_fill_in_dev
                drm_pci_agp_init                (dev->driver->bus->agp_init)
                drm_lastclose                   # error path when agp_init fails
                    i915_driver_lastclose       (dev->driver->lastclose)
                        intel_fb_restore_mode
1)                          drm_fb_helper_restore_fbdev_mode(&dev->dev_private->fbdev->helper)
            i915_driver_load                    (dev->driver->load)
2)              dev->dev_private = (void *)dev_priv

Here's the intel_fb_restore_mode() code:

    void intel_fb_restore_mode(struct drm_device *dev)
    {
        int ret;
        drm_i915_private_t *dev_priv = dev->dev_private;

        ret = drm_fb_helper_restore_fbdev_mode(&dev_priv->fbdev->helper);
        if (ret)
                DRM_DEBUG("failed to restore crtc mode\n");

The page fault IP was intel_fb_restore_mode+0x9/0x3f.  My v3.2 build (different config options than Jeroen's, of course) produced this code:

    0xffffffff813bc5d0 <+0>:     push   %rbp
    0xffffffff813bc5d1 <+1>:     mov    %rsp,%rbp
    0xffffffff813bc5d4 <+4>:     callq  0xffffffff81598cc0 <mcount>
    0xffffffff813bc5d9 <+9>:     mov    0x348(%rdi),%rax
    0xffffffff813bc5e0 <+16>:    mov    0x1980(%rax),%rdi
    0xffffffff813bc5e7 <+23>:    callq  0xffffffff813629d0 <drm_fb_helper_restore_fbdev_mode>

    (gdb) p &((struct drm_device *)0)->dev_private
    $2 = (void **) 0x348 <irq_stack_union+840>
    (gdb) p &((struct drm_i915_private *)0)->fbdev
    $3 = (struct intel_fbdev **) 0x1980 <irq_stack_union+6528>

I think we faulted when dereferencing dev->dev_private->fbdev because dev->dev_private == NULL.  This occurred at 1) in the call trace above, but dev->dev_private isn't initialized until 2).

This looks like a DRM bug, independent of the _ADR fix (dfb117b3) you identified.  I'm still not sure how applying the _ADR fix keeps us from hitting this DRM problem.  I'd like to figure that out to make sure there's not a third bug here.

This patch is against v3.6 and adds a little more debug info.  It would be really helpful if one or both of you could apply this and collect the dmesg log.  It would also be helpful if you could keep this patch applied, revert dfb117b3 (might be some minor conflicts here), and collect another dmesg.
Comment 17 Stefan Nagy 2012-10-09 13:18:43 UTC
Created attachment 82731 [details]
dmesg (kernel 3.6 with debug patch from message #16 applied), HP Folio 13-2000

Here's the dmesg log from kernel 3.6 with the debug patch from message #16 applied.
Comment 18 Stefan Nagy 2012-10-09 16:54:09 UTC
Created attachment 82741 [details]
dmesg (kernel 3.6 with debug patch from message #16 applied and dfb117b3 reverted), HP Folio 13-2000

And here's the dmesg log from kernel 3.6 with the debug patch from message #16
applied and dfb117b3 reverted.

This time the kernel didn't fail to boot – but gnome-shell failed (so gnome started in fallback mode).
Comment 19 Bjorn Helgaas 2012-10-09 18:25:28 UTC
(In reply to comment #18)
> dmesg (kernel 3.6 with debug patch from message #16 applied and dfb117b3
> reverted), HP Folio 13-2000

Here's the effect of using garbage data from the stack after _ADR failed:

  acpiphp_glue: found ACPI PCI Hotplug slot 1 at PCI 0000:00:4f00

Then we have this error:

  [drm:i915_get_bridge_dev] *ERROR* bridge device not found

This means we couldn't find device 0000:00:00.0, which clearly does exist (we found it earlier), so the list of devices must be corrupted.  Could you try this same kernel again with the following config options turned on?  If that doesn't turn anything up, I'll have to add some more debug in the acpiphp path, I guess.

  CONFIG_DEBUG_SLAB
  CONFIG_DEBUG_PAGEALLOC
Comment 20 Jeroen Nijhof 2012-10-09 20:49:37 UTC
Created attachment 82751 [details]
dmesg from kernel 3.6, with debugging config/patches

I applied the debug patch, reverted dfb117b3 and set CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC -- but unfortunately I could not boot, not even in
single mode. In normal boot (quiet mode) I did see
 [drm:i915_get_bridge_dev] *ERROR* bridge device not found, and not much else.
In single mode the last messages are the ACPI warnings:

0x0500-0x057f System IO conflicts with Region Region \_SB_.PCI0.PEG0.PEGP.GPIO 2 (20120711/utaddress-251) and similar,
plus 

lpc_ich: Resource conflict(s) found affecting gpio_ich


So to get some info out I rebooted with pci=noacpi, and the attached dmesg is from that.

I'll try with ACPI hotplug disabled next.
(but will first send a dmesg from 3.6 with just the debugging patch, without the revert and without the DEBUG_SLAB etc.).
Comment 21 Jeroen Nijhof 2012-10-09 20:51:34 UTC
Created attachment 82761 [details]
dmesg from kernel 3.6, with debugging patch only

Here is the dmesg from the 3.6 kernel with only the latest debug patch applied,
but without the revert, and without CONFIG_DEBUG_SLAB or CONFIG_DEBUG_PAGEALLOC.
Comment 22 Jeroen Nijhof 2012-10-09 21:47:14 UTC
3.6 with all the debugging disabled, dfb117b3 reverted but HOTPLUG_PCI_ACPI disabled gives almost exactly the same dmesg as the kernel with only the debugging patch from #16 , and dfb117b3 still aplied (apart from the "can't evaluate _ADR (0x5)" warning, obviously).

For what it's worth, the failing kernels never make it as far as the mode switch to the higher resolution console (from 80 to 132? characters), and timing wise seem to die just about when the mode switch should happen.
Comment 23 Bjorn Helgaas 2012-10-09 22:41:00 UTC
Created attachment 82771 [details]
debug patch 3

Thanks, Stefan and Jeroen.  This patch adds more debug to try to figure out where the PCI device list is being corrupted.  It applies on v3.6 (it includes the reversion of dfb117b3).  If you can get a dmesg with CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC turned on, that would be ideal, but if they don't work, you can turn them off.  If you're having trouble booting, one possibility is to boot with "ignore_loglevel" and capture a video.

Don't bother with CONFIG_HOTPLUG_PCI_ACPI=n or with dfb117b3 applied; those configs don't fail, so I don't think we'll learn anything more from them.
Comment 24 Stefan Nagy 2012-10-10 11:30:30 UTC
Created attachment 82791 [details]
dmesg (kernel 3.6 with debug patch from message #23 applied), HP Folio 13-2000

Here's the dmesg log from kernel 3.6 with the patch from message #23 applied and CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC and CONFIG_PCI_DEBUG turned on.

I had no troubles booting.
Comment 25 Bjorn Helgaas 2012-10-10 14:45:29 UTC
Stefan, can you try the same thing as comment #24, but with CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned off?

Your comment #18 and comment #24 logs *should* be functionally equivalent (v3.6 kernel with dfb117b3 reverted and a little debug added), but they aren't.  So let's get back closer to the comment #18 failure, since the memory debugging didn't turn up anything.
Comment 26 Jeroen Nijhof 2012-10-10 20:50:22 UTC
Created attachment 82841 [details]
dmesg after patch #23

Here is a dmesg after the patch from #23 attached. DEBUG_PAGE_ALLOC and DEBUG_SLAB disabled. This time the kernel boots again! It could be some racing condition, of course, where the debug messages change the timing enough to hide the bug.

With the normal 'quiet' boot the message that still does appear on the screen is:

acpiphp: pci_hp_register failed with error -19

that also occurs in Stefan's boot log.
Comment 27 Jeroen Nijhof 2012-10-10 22:28:51 UTC
Created attachment 82851 [details]
extracted from patch of #23: kzalloc that makes the kernel boot again

I found why the kernel booted with the debug patch from #23: it adds a line 

       slot = kzalloc(sizeof(*slot), GFP_KERNEL);

in drivers/pci/hotplug/acpiphp_core.c, before the pci_hp_register() call,
as per the attached patch -- and
adding that line on top of v3.6 with dfb117b3 reverted (and no other changes)
makes it work again.
Comment 28 Bjorn Helgaas 2012-10-10 22:46:56 UTC
Created attachment 82871 [details]
debug patch 4

Thanks, Jeroen.  I botched debug patch 3 by pasting that extra kzalloc line.  Sorry for wasting your time testing that.

If you could try this replacement on top of v3.6, that would be great.  Try it first with CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on, and if it doesn't fail, try again with them off.  I think we'll only learn something if we see a failure.
Comment 29 Stefan Nagy 2012-10-11 10:32:28 UTC
Created attachment 82931 [details]
dmesg (kernel 3.6 with debug patch 4 from comment #28 applied)

This time the kernel failed to boot completely. I applied debug patch 4 (from comment #28) and turned CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC on.
Comment 30 Jeroen Nijhof 2012-10-11 20:35:26 UTC
Also for me the kernel with debug 4 from comment #28 fails to boot completely - to the extent that I cannot get a dmesg out.

It hangs while executing /etc/rcS.d/S02udev. That's udev version 175, by the way.
And the hang is asynchronously: the last line printed changes from boot to boot, so it seems that whatever udev did to trigger the crash takes a little while to cause the hang.

I can boot with init=/bin/sh . Would there be a way to emulate what udev does from the shell? That way I could try to do one bit after the other, with enough of a gap to see what is the fatal step.

I tried changing the log level of udev to info or debug, but than it prints so much that the interesting bit has probably disappeared from the screen -- and when it hangs it is really stuck, and I can't scroll back
(I would not be surprised if it gets stuck when it tries to do a mode switch
to the higher resolution).
Comment 31 Bjorn Helgaas 2012-10-11 21:32:03 UTC
Created attachment 82971 [details]
debug patch 5

This is perfect, thanks!  The comment #29 kernel is good enough to boot and save the dmesg, while still showing the problem.  The dmesg shows a valid 00:00.0 device, followed later by i915 not being able to find it:

[    3.673912] init_acpi 3: bridge 0000:00:00.0 at 0xffff88014a0ef000
[   15.504629] acpiphp_glue: _handle_hotplug_event_bridge: Bus check notify on \_SB_.PCI0
[   15.504634] acpiphp_glue: _handle_hotplug_event_bridge: re-enumerating slots under \_SB_.PCI0
[   15.504836] acpiphp_glue: acpi_device not found
[   15.504840] acpiphp_glue: acpiphp_check_bridge: 0 enabled, 1 disabled
[   17.950082] [drm:i915_get_bridge_dev] *ERROR* bridge device not found

I'm suspicious of the acpiphp stuff in the middle.  This patch adds similar debug to that path, so maybe we can narrow it down a bit.

Jeroen, I don't know enough to manually step through the udev startup.  It's possible that if you just built in i915 statically, that would be enough to see the problem even without udev.  If so, then booting with init=/bin/sh might be enough (you might have to "mount -o remount,rw /" or similar to get a writable filesystem to save the log).

But as long as Stefan can still test, we should be fine.
Comment 32 Stefan Nagy 2012-10-12 13:15:35 UTC
Created attachment 83021 [details]
dmesg (kernel 3.6 with debug patch 5 from comment #31 applied)

I applied debug patch 5 (from comment #31) and turned CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC on. This time again the kernel didn't fail – I was able to enter gnome fallback mode.
Comment 33 Stefan Nagy 2012-10-12 17:52:57 UTC
Since it didn't make any sense to me that the kernel failed with debug patch 4 but not with debug patch 5 applied I tried out patch 4 one more time (I did the same as I described in comment #29) and this time I was able to enter gnome fallback mode…

Sorry for the confusion. I really don't know what
Comment 34 Stefan Nagy 2012-10-12 17:59:56 UTC
…and sorry for that too :/

What I wanted to say was that I don't know what I did last time I applied debug patch 4 – I just can't reproduce the kernel fail I described in comment #29 anymore.

I hope the dmesg log from comment #32 helps nonetheless.
Comment 35 Jeroen Nijhof 2012-10-12 19:45:12 UTC
Created attachment 83061 [details]
screenshot of hang

Progress of sorts: I know where it hangs, when loading the radeon module.
What udev does is loading modules, so I booted with init=/bin/bash (with the kernel with patch 5 applied), and started modprobing modules.

In 'modprobe radeon' it gets stuck, on "Loading CAICOS Microde", see the
attached screenshot.

I can modprobe all the pre-requisites for radeon (i915, drm, ttm) etc. without problems. I'll attach a dmesg after 'modprobe i915' next.
Comment 36 Jeroen Nijhof 2012-10-12 19:47:54 UTC
Created attachment 83071 [details]
dmesg after modprobe i915

dmesg after booting with init=/bin/bash, plus 'modprobe i915'.

No dmesg after the subsequent 'modprobe radeon' because it got stuck there.
Not as bad as with the regular boot, though: I was still able to scroll back the console buffer, and typed characters appeared on the screen.
Comment 37 Bjorn Helgaas 2012-10-12 20:24:21 UTC
Created attachment 83081 [details]
debug patch 6
Comment 38 Bjorn Helgaas 2012-10-12 20:26:38 UTC
Sorry, meant to explain this...  Both tests (comment #32 and comment #36) show that the bridge device went missing in disable_device(), so I added more debug output in that function.  We should be getting close.  Can one or both of you try patch 6 from comment #37 and post the results?
Comment 39 Jeroen Nijhof 2012-10-12 20:39:49 UTC
Created attachment 83091 [details]
dmesg after patch 6

Ok, here's the dmesg after applying patch 6

(again init=/bin/bash, modprobe i915)
Comment 40 Bjorn Helgaas 2012-10-12 23:29:41 UTC
Created attachment 83101 [details]
debug patch 7

Thanks, Jeroen, we're almost there.  I have a theory that explains 99% of the situation, but there's one little piece I don't understand yet.  Can you try this patch for that?  Also, if it's not too much trouble, try building acpiphp as a module and loading that.  The problem should not occur then; I just want to double-check the reason.
Comment 41 Stefan Nagy 2012-10-12 23:30:59 UTC
Created attachment 83111 [details]
dmesg (kernel 3.6 with debug patch 6 from comment #37 applied)

Here's my dmesg log from kernel 3.6 with debug patch 6 (from comment #31) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on.
Comment 42 Stefan Nagy 2012-10-13 02:13:59 UTC
Created attachment 83141 [details]
dmesg (kernel 3.6 with debug patch 7 from comment #40 applied)

Kernel 3.6 with debug patch 6 (from comment #40) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on. ACPI PCI Hotplug driver build-in.

Sould I also provide a log with CONFIG_HOTPLUG_PCI_ACPI=m?
Comment 43 Bjorn Helgaas 2012-10-13 03:03:35 UTC
Created attachment 83161 [details]
debug patch 8

Thanks, Stefan.  This is slightly non-deterministic, and the additional debug I put in register_slot() changed the garbage values from the stack.  Your previous boots from comments #32 and #41 had this:

  acpiphp: _ADR status 0x5 device 0x4b00 function 0x11f0

while the one from comment #42 has this:

  acpiphp: \_SB_.PCI0.DOCK _ADR status 0x5 device 0x8136 function 0x2435

The boot problem has to do with device 0000:00:00.0 being removed.  That contains the AGP bridge for the i915 device at 0000:00:02.0, so removing it causes the DRM failure.  0000:00:00.0 was removed because the garbage 0x4b00 device number was truncated to 0x00, so it matched 0000:00:00.

But in the comment #42 boot, the garbage device number was 0x8136, which is truncated to 0x36, which *doesn't* match the 0000:00:00.0 bridge, so I think everything worked correctly.  And this didn't exercise the path I wanted to learn about.

This patch plugs in the garbage device/function values from before, so it should fail the same way as before.

If it does fail, I would also appreciate a log with CONFIG_HOTPLUG_PCI_ACPI=m?

Thanks for all your patience; this has been a long effort.
Comment 44 Jeroen Nijhof 2012-10-13 10:52:16 UTC
Created attachment 83191 [details]
dmesg after patch 8

Here's the dmesg after patch 8 (init=/bin/bash, modprobe i915).
Still failing, as usual.

And building acpiphp modular makes it work, as in v3.0/v3.2. I'll send a dmesg from that next.
Comment 45 Jeroen Nijhof 2012-10-13 10:56:46 UTC
Created attachment 83201 [details]
dmesg after patch 8, with modular acpiphp

Here's the dmesg for the kernel with patch 8, but acpiphp modular.
I had to duplicate bh_check_bridge in slot.c (non-modular) and acpiphp_glue.c (for acpiphp.ko).

Even after modprobe acpiphp it is still fine, as expected (that already worked in v3.0 / v3.2).
Comment 46 Stefan Nagy 2012-10-13 15:24:27 UTC
Created attachment 83391 [details]
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied)

Kernel 3.6 with debug patch 8 (from comment #43) applied, CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC turned on. ACPI PCI Hotplug driver build-in. I'll add the log with acpiphp build as module later today or tomorrow.

Thanks for the explanation, Bjorn. Since I'd have to lern a lot to understand whats really going on and I don't have any spare time right now I'm not really using my brain that much and just try to provide the logs you request.
Comment 47 Bjorn Helgaas 2012-10-13 17:32:50 UTC
(In reply to comment #45)
> Even after modprobe acpiphp it is still fine, as expected (that already worked
> in v3.0 / v3.2).

Thanks.  This looks as I expected -- when acpiphp is a module, we erroneously create a slot for device 0000:00:00.0, but we don't see a BUS CHECK notification (there probably *was* one but it occurred before acpiphp was loaded), so everything seems fine.

But I suspect that if we got another BUS CHECK, for example, if you dock or undock after loading acpiphp, we'd probably remove the 0000:00:00.0 device and things might start failing.  The i915 init has already been done, so video will probably keep working (unless you unload and reload i915; I don't know whether that's possible), but maybe lspci won't show 0000:00:00.0 anymore.
Comment 48 Stefan Nagy 2012-10-14 21:46:19 UTC
I have problems building the ACPI PCI Hotplug driver as module with the debug patch applied… the build process stops with:

drivers/built-in.o: In function `pci_create_slot':
/home/stefan/linux-3.6/drivers/pci/slot.c:257: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:280: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:299: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:312: undefined reference to `bh_check_bridge'
/home/stefan/linux-3.6/drivers/pci/slot.c:319: undefined reference to `bh_check_bridge'
drivers/built-in.o:/home/stefan/linux-3.6/drivers/pci/hotplug/pci_hotplug_core.c:461: more undefined references to `bh_check_bridge' follow
make[2]: *** [vmlinux] Fehler 1
make[1]: *** [deb-pkg] Fehler 2
make: *** [deb-pkg] Fehler 2

Sorry, I'm stuck here.
Comment 49 Jeroen Nijhof 2012-10-15 05:33:40 UTC
Created attachment 83451 [details]
debug patch - duplicate bh_check_bridge

Stefan,
That compilation failure is because the debug addition bh_check_bridge is defined in acpiphp_glue.c, so if acpiphp gets compiled as a module that is not available for the core kernel (e.g. pci/slot.c). If you apply the attached patch, to duplicate that bh_check_bridge in the core and in the module, it will compile again.
Comment 50 Stefan Nagy 2012-10-15 12:41:27 UTC
Sorry, but when I try to apply the debug patch from comment #49 (duplicate bh_check_bridge) I get this:

patching file drivers/pci/hotplug/acpiphp_core.c
patching file drivers/pci/hotplug/acpiphp_glue.c
Hunk #1 FAILED at 115.
Hunk #2 succeeded at 140 (offset 8 lines).
Hunk #3 succeeded at 165 (offset 8 lines).
Hunk #4 succeeded at 225 (offset 8 lines).
Hunk #5 succeeded at 243 (offset 8 lines).
Hunk #6 succeeded at 282 (offset 8 lines).
Hunk #7 succeeded at 770 (offset 8 lines).
Hunk #8 succeeded at 937 (offset 8 lines).
Hunk #9 succeeded at 948 (offset 8 lines).
Hunk #10 succeeded at 962 (offset 8 lines).
Hunk #11 succeeded at 1065 (offset 8 lines).
Hunk #12 succeeded at 1095 (offset 8 lines).
Hunk #13 succeeded at 1108 (offset 8 lines).
Hunk #14 succeeded at 1225 (offset 8 lines).
Hunk #15 succeeded at 1270 (offset 8 lines).
Hunk #16 succeeded at 1294 (offset 8 lines).
Hunk #17 succeeded at 1302 (offset 8 lines).
Hunk #18 succeeded at 1368 (offset 8 lines).
Hunk #19 succeeded at 1379 (offset 8 lines).
Hunk #20 succeeded at 1412 (offset 8 lines).
Hunk #21 succeeded at 1516 (offset 8 lines).
Hunk #22 succeeded at 1526 (offset 8 lines).
Hunk #23 succeeded at 1567 (offset 8 lines).
Hunk #24 succeeded at 1602 (offset 8 lines).
1 out of 24 hunks FAILED -- saving rejects to file drivers/pci/hotplug/acpiphp_glue.c.rej
Comment 51 Jeroen Nijhof 2012-10-15 16:44:43 UTC
Stefan,

Sorry for being unclear -- the patch from comment #49 was on top of debug patch 8 (form comment #43)
Comment 52 Stefan Nagy 2012-10-15 20:11:58 UTC
Hi Jeroen,

if I understand you correctly I should first apply debug patch 8 (from comment #43) and after that debug patch from comment #49 (duplicate bh_check_bridge) – both on top of kernel 3.6. Is this correct? Because this is exactly what I did before…

Sorry for my clumsiness, before I reported this bug upstream I never even built a custom kernel myself.
Comment 53 Jeroen Nijhof 2012-10-15 20:24:56 UTC
Created attachment 83571 [details]
debug patch - duplicate bh_check_bridge

Stefan, sorry, my mistake: my first attempt at fixing it failed, and the patch I sent was only the fix on top of that fix. The attached patch (on top of patch 8) should really work.
Comment 54 Stefan Nagy 2012-10-16 00:30:40 UTC
Created attachment 83581 [details]
dmesg (kernel 3.6 with debug patch 8 from comment #43 applied, acpiphp build as module)

Thanks Jeroen!! Now I was able to build the kernel with debug patch 8 (from comment #43) applied, CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGEALLOC turned on & ACPI PCI Hotplug driver build as module. I'll attach the dmesg log.

BTW, Video didn't work – the screen went black but I was able to switch to a text terminal, load acpiphp and save the log.
Comment 55 Bjorn Helgaas 2012-10-25 18:10:35 UTC
Conclusion: dfb117b3e50c5 ("PCI: acpiphp: check whether _ADR evaluation succeeded") fixes the root cause of this issue.  This commit is already upstream and included in v3.6.

Detailed analysis:

When acpiphp is statically built into the kernel, we initialize it and erroneously register "Slot [1]" for pci_bus 0000:00:00:

  acpiphp_init                                  # module_init
    init_acpi
      acpiphp_glue_init
        acpi_pci_register_driver(&acpi_pci_hp_driver)
          list_for_each_entry(root, &acpi_pci_roots, node)
            driver->add                         # for \_SB_.PCI0
              add_bridge
                add_host_bridge
                  bridge = kzalloc(struct acpiphp_bridge)
                  bridge->handle = handle               # \_SB_.PCI0
                  bridge->pci_bus = root_bus            # pci_bus 0000:00
                  init_bridge_misc
                    acpi_walk_namespace(..., register_slot, ...)
                      register_slot
                        acpi_evaluate_integer("_ADR")   # fails
                        # we use garbage from stack, e.g.,
                        # device   = 0x5f00
                        # function = 0x1a40
                        func = kzalloc(struct acpiphp_func)
                        func->function = (u8) 0x1a40 = 0x40
                        slot = kzalloc(struct acpiphp_slot)
                        slot->device = (u8) 0x5f00 = 0x00
                        slot->flags |= SLOT_ENABLED
                    acpi_install_notify_handler(handle_hotplug_event_bridge)

Now we have a struct acpiphp_bridge associated with \_SB_.PCI0 and pci_bus 0000:00.  It has a "Slot [1]" for device 00 with SLOT_ENABLED.

And we registered handle_hotplug_event_bridge() for SYSTEM notifications.  Later we receive a BUS CHECK notification on \_SB_.PCI0.  The cause of the BUS CHECK is unclear, but not important: it's legal and we should handle it correctly.

Here's what happens when we handle the BUS CHECK:

  (notify handler for ACPI_SYSTEM_NOTIFY)
    handle_hotplug_event_bridge                 # on \_SB_.PCI0
      _handle_hotplug_event_bridge
        acpiphp_check_bridge
          for (slot = bridge->slots; slot; slot = slot->next)
            status = get_slot_status(acpiphp_slot)      # \_SB_.PCI0.DOCK
              list_for_each_entry(func, &slot->funcs, sibling)
                if (func->flags & FUNC_HAS_STA) # true
                  acpi_evaluate_integer("_STA") # 0x0
            if (slot->flags & SLOT_ENABLED && status != ACPI_STA_ALL)
              acpiphp_disable_slot
                disable_device
                  while ((pdev = dev_in_slot(slot)))
                    pdev = "pci 0000:00:00.0"
                    pci_stop_bus_device
                      pci_stop_dev
                        device_unregister

We incorrectly removed PCI device 0000:00:00.0, which causes this error:

    [drm:i915_get_bridge_dev] *ERROR* bridge device not found

because 0000:00:00.0 is the bridge upstream of the i915 device.

Probably the reason we don't see the issue when acpiphp is a module is that the BUS CHECK is delivered before acpiphp is loaded, so acpiphp never sees it.  We still might see a similar issue if another BUS CHECK were delivered, e.g., for a dock or undock event.

Commit dfb117b3e50c5 fixes the problem because we won't register the slot, so the BUS CHECK notification won't cause us to remove the device.

Note You need to log in before you can comment on or make changes to this bug.