Most recent kernel where this bug did not occur: 2.6.20 Distribution: Whatever (I tried gentoo, debian and that one with 2.6.23 inside Problem Description: suspend2-sources and gentoo-sources >= 2.6.21 hangs on boot at "io scheduler cfg registered (default)" line. There is no kernel panic, the kernel just gets stuck during boot. I googled a bit and found some victims of the same problem. So, running with initcall_debug and without 'quiet' and 'splash' gave me this output(the last lines): calling initcall 0xc03c63a0 : populate_rootfs + 0x0/0x100() checking if image is initramfs ... it is With noapic and acpi=off: calling initcall 0xc01eae70 : pci_init + 0x0/0x30() Changing the io scheduler via elevator= option didn't help irqpoll and idle=poll didn't help either Previous tries to get help: http://bugs.gentoo.org/show_bug.cgi?id=184821 https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/155278 I am not a kernel hacker. So, please help me with the debugging process. I really want my linux back alive.
Anton, If you can provide a boot trace for both cases - for 2.6.20 and 2.6.21 then we could find some clues in there, especially if you add "debug" on the boot command line. In the failed case, you would have to use serial console. It is also useful to know what system it is happening on, so output of dmidecode and say lspci -v would be helpful. The most effective method would be to performe git bisection as explained in http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/ Please attach the traces above (attaching is preferable as opposed to pasting it into comments as text), and let me know if you're willing to try git bisection. I assume you're familiar with git and know how to use it?
Yes, I am willing to try git bisection. It looks simple to use so I think I can handle it. What to do if I a build fails? It stopped compiling because of some KVM function. I disabled KVM. I don't use it anyway. But what do I do if the next build fails on something important?
Created attachment 13402 [details] dmidecode.log
Created attachment 13403 [details] git log
Created attachment 13404 [details] lspci -v output
Created attachment 13405 [details] git bisect's verdict
Excellent work, and enough information to bring this to Bjorn's attention. Thanks!
Created attachment 13408 [details] debugging patch
(This goes with the patch above; bugzilla choked when I entered it as comments.) Thanks for all your debug work! I'm confused. If we hung inside acpi_reserve_resources() we should have seen something like: calling initcall 0x.... : acpi_reserve_resources + 0x0/0x...() So it looks more like we hung inside populate_rootfs(). Can you try the attached patch, booting with and without "acpi=off"?
With or without acpi: A lot of lines like: pci_init: fixup 0000:00:0a:1 done pci_init: fixup 0000:00:0b:1 How do I see more output? Something like 'less' or 'more'?
I found similar bugreport 2.6.17 hangs during boot on ASUS M2NPV-VM motherboard http://lkml.org/lkml/2006/7/14/117
The patch in question: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9a47cdb1bb85e7944fb7419e4078c46516ef7335 first appeared in 2.6.21, so the 2.6.17 hang must be a different problem with similar symptoms. I assume your kernel still hangs, right? What's the last thing it prints before it hangs? Do you see the "pci_init - done" message? Do you see any "acpi_reserve_resources" messages? The best thing would be if you could collect all the console output, using netconsole or a serial console. If you can't do that, maybe you could attach a digital photo of the screen. If the last thing printed is "pci_init: fixup 0000:00:0b:1", that would mean we hung somewhere inside a fixup for the PCI device at 00:0b.1, which looks like a USB controller.
Yes, the last line was "pci_init: fixup 0000:00:0b:1". And I couldn't see any other lines because all I saw was pci_init messages on my tiny laptop 13.1'' screen. So, I believe making a photo must be useless. I don't know how to use these tools to collect console output but I will see what I can do.
I commented your changes in drivers/pci/pci.c and I don't see any debug messages anymore. I read about netconsole. Is it really possible to use it? I mean my linux hasn't even started yet. I can't even do modprobe and the modules in autoload didn't start either.
You could build your NIC driver statically into the kernel, but you're probably right that netconsole won't help here, because I think this hang happens even before the built-in drivers are initialized. If you add "debug" to your kernel parameters, it looks like pci_do_fixups() should print the address of each fixup it calls. Then you can look up that address in your System.map to see what function it is.
netconsole does init fairly late, so Bjorn is probably correct. Some distros modify "debug" in their init scripts, so an alternative is to use "ignore_loglevel" on the kernel boot command line instead of "debug".
Thanks, I didn't know about "ignore_loglevel". And in order to get the PCI fixup debug, you have to turn on CONFIG_DEBUG_KERNEL and CONFIG_PCI_DEBUG.
PCI: Calling quirk c0294a70 for 0000:00:0b.0 PCI: Calling quirk c03d7a70 for 0000:00:0b.0 In System.map they are: t quirk_cardbus_legacy t quirk_usb_early_handoff
Created attachment 13431 [details] debug in usb quirk I guess we must be hanging in quirk_usb_early_handoff(). I don't know why the git bisection identified the ACPI patch, which seems pretty unrelated. But maybe the connection will become clear later. Here's a patch to add some more debugging in this USB quirk.
I don't know, maybe I mistyped some bisect-good-or-bad commands. Does the git log look reasonable? Or you can't say? I will attach the photo with acpi Without acpi (acpi=off) the only thing that was different is the 'base', which was 0xf880c000
Created attachment 13440 [details] screenshot
Created attachment 13443 [details] more usb quirk debug The git bisection log looked reasonable to me. Can you double-check by taking 2.6.21 and backing out just the patch identified by the bisection? Something like this, for example: $ cd linux-2.6.21 $ wget -Oacpi.patch "http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=9a47cdb1bb85e7944fb7419e4078c4651" $ patch -p1 -R < acpi.patch Based on your screenshot, we should have seen the "waiting" message next. Maybe one of the following CSR writes is hanging. But that code didn't change between 2.6.20 and 2.6.21. Here's another patch with a couple more printks to identify the exact CSR write.
Um, I can't double-check. I am having problems reverting the patch because motherboard.c doesn't exist in 2.6.21 What do I do? But I applied your debugging patch and here's the results: quirk_usb_early_handoff quirk_usb_handoff_ohci quirk_usb_handoff_ohci base 0xf8806000 (bus addr 0xc0004000) quirk_usb_handoff_ohci control 0x797 quirk_usb_handoff_ohci writing OHCI_INTRENABLE, 0x40000000 at 0xf8806010 quirk_usb_handoff_ohci writing OHCI_CMDSTATUS, 0x8 at 0xf8806008
I reverted the patch by hand after I had copied motherboard.c from my linux-git dir. Linux didn't boot. I will have to try git bisect again. I am very sorry for misleading you if it is really my fault.
Anton, Here is a good description for bisect: http://www.kernel.org/doc/local/git-quick.html
I run the bisect test again making sure I didn't make a mistake inputting good|bad commands. The same results. The same lines in the log.
Hmm, this appears to be something that git picked on. Removal of platform driver and other ACPI changes were cascading, so this is probably no single patch that will take you back to working system. Bjorn please correct me if I'm wrong.
Created attachment 13595 [details] turn USB quirk off does the bootup get any further if you apply the attached usb-quirk-off.patch, which seems to be the source of the lockup?
I applied patch "turn USB quirk off" on 2.6.21. It still stops booting. I even commented the body of the function. But it gets called - that's for sure(I put printk line inside).
Created attachment 13681 [details] revert the git-bisect verdict patch from 2.6.21 This applies to 2.6.21 and removes the patch identified by git bisect. I suppose this is the same test you did in comment #24, Anton. But maybe it'd be worth testing an independently generated patch.
Created attachment 13682 [details] turn OHCI USB quirk off Ingo's patch (comment #28) turned off the UHCI USB quirk. But the hang appears to be in the OHCI quirk, not the UHCI one, so I'm not surprised that it made no difference. This patch should turn off the OHCI patch. Does 2.6.21 + this patch boot?
Thanks for trying to help, Bjorn! I applied "revert the git-bisect verdict patch from 2.6.21" via git-apply. It is being compiled now. As I said before, I cleared the whole body of static void __devinit quirk_usb_early_handoff(struct pci_dev *pdev) function and added a printk with some bla bla text inside. It didn't show up. The bad news is that probably I will have to get rid of this laptop soon (after 9. December). Maybe, we can find the bug faster. I will do all I can to help you with debugging.
You know, Bjorn, your PATCH WORKED!!! Tell me please how to properly apply it to ubuntu kernel sources.
Manually patched 2.6.22 from kernel-sources ubuntu package. It worked!
So, how are you going to manage this situation? You just undo the patch and that's it, or you are going to do something else which is more clean?
In 2.6.20, acpi_reserve_resources() (an fs_initcall) happens before quirk_usb_handoff_ohci() (a device_initcall), but in 2.6.21, the order is reversed. I don't see anything in quirk_usb_handoff_ohci() that looks like it depends on acpi_reserve_resources(), but I'd like to know whether we run quirk_usb_handoff_ohci() in 2.6.20. Anton, do you still have this laptop? If so, can you add some printks in acpi_reserve_resources() and quirk_usb_handoff_ohci() in 2.6.20 and attach the resulting dmesg? Make sure you put several in quirk_usb_handoff_ohci() so we can tell if it bails out early. If you don't have the laptop anymore, I guess we'll have to wait for somebody else to reproduce the problem.
I told you something ambiguous. I am really sorry. I was too happy to have my new kernel back. When I said "your patch worked" I mean this one http://bugzilla.kernel.org/attachment.cgi?id=13681&action=view I guess the one that was before didn't work before because I didn't edit drivers/acpi/Makefile I still have this laptop and I run manually patched 2.6.24-rc5
Created attachment 14282 [details] instrument acpi_reserve_resources() and quirk_usb_handoff_ohci() No problem. If the problem is really something to do with the ordering of acpi_reserve_resources() vs. quirk_usb_handoff_ohci(), I would expect that patch to work because it restores the ordering we had in 2.6.20. But I don't want to push that patch upstream because it's not the right long-term solution. If you could try 2.6.20 with the attached patch, we could see whether we run quirk_usb_handoff_ohci(). Maybe we skip it for some reason or it behaves differently than in 2.6.21. If you can try this, please attach the entire dmesg log. Thanks!
Created attachment 14313 [details] dmesg output
Comment on attachment 14313 [details] dmesg output dmesg of vanilla kernel-2.6.20 with patch http://bugzilla.kernel.org/attachment.cgi?id=14282&action=view applied
Created attachment 14335 [details] move FADT reservations to acpi/motherboard.c (device_initcall to fs_initcall) Thanks. I don't see any useful difference in the way we run the USB quirk. My ACPI patch did two things that should have been split into different patches: - move the FADT reservations from the ACPI motherboard driver into osl.c, and - remove the ACPI motherboard driver so we'd rely on the PNP driver instead We know from comment #33 that starting with 2.6.21 and undoing both changes with attachment id 13681 makes it work again. So I'd like to figure out which change is important. Can you start with plain 2.6.21, try this patch, and attach the dmesg log?
Created attachment 14336 [details] add back ACPI motherboard driver (leaving FADT reservations in osl.c) Also, please start again with a fresh 2.6.21, apply this patch, and attach the dmesg log. (I guess at least one will likely hang, so attach a digital photo in that case.)
Created attachment 14369 [details] dmesg output in kernel-2.6.21 with applied patch http://bugzilla.kernel.org/attachment.cgi?id=14335&action=view
As for the second patch (http://bugzilla.kernel.org/attachment.cgi?id=14336&action=view), the kernel hanged. Unfortunately, I couldn't take a picture of the screen. But I didn't notice anything interesting - to me the output looked exactly the same as usual. But if you need it, I will try to do it.
I would be very pleased if you fixed this properly for the stable 2.6.*.x kernels before ubuntu hardy release. If you have time...
Linux 2.6.24.3 compiled from source (ubuntu hardy linux-source package) and their precompiled 2.6.24 work fine :-) https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/155278