Bug 9303 - >=kernel-2.6.21 hangs on boot at "io scheduler cfg registered (default)" line
Summary: >=kernel-2.6.21 hangs on boot at "io scheduler cfg registered (default)" line
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 blocking
Assignee: Bjorn Helgaas
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-04 03:27 UTC by Anton Kuzmin
Modified: 2008-04-26 06:32 UTC (History)
3 users (show)

See Also:
Kernel Version: >=2.6.21
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmidecode.log (7.76 KB, text/plain)
2007-11-05 11:38 UTC, Anton Kuzmin
Details
git log (2.05 KB, text/plain)
2007-11-05 11:39 UTC, Anton Kuzmin
Details
lspci -v output (7.41 KB, text/plain)
2007-11-05 11:39 UTC, Anton Kuzmin
Details
git bisect's verdict (1.11 KB, text/plain)
2007-11-05 11:40 UTC, Anton Kuzmin
Details
debugging patch (2.38 KB, patch)
2007-11-05 16:21 UTC, Bjorn Helgaas
Details | Diff
debug in usb quirk (3.41 KB, patch)
2007-11-06 21:03 UTC, Bjorn Helgaas
Details | Diff
screenshot (250.14 KB, image/jpg)
2007-11-07 09:19 UTC, Anton Kuzmin
Details
more usb quirk debug (3.64 KB, patch)
2007-11-07 12:26 UTC, Bjorn Helgaas
Details | Diff
turn USB quirk off (679 bytes, patch)
2007-11-17 18:39 UTC, Ingo Molnar
Details | Diff
revert the git-bisect verdict patch from 2.6.21 (8.30 KB, patch)
2007-11-21 14:49 UTC, Bjorn Helgaas
Details | Diff
turn OHCI USB quirk off (789 bytes, patch)
2007-11-21 15:00 UTC, Bjorn Helgaas
Details | Diff
instrument acpi_reserve_resources() and quirk_usb_handoff_ohci() (2.72 KB, patch)
2008-01-04 08:59 UTC, Bjorn Helgaas
Details | Diff
dmesg output (15.09 KB, text/plain)
2008-01-06 04:56 UTC, Anton Kuzmin
Details
move FADT reservations to acpi/motherboard.c (device_initcall to fs_initcall) (5.88 KB, patch)
2008-01-07 15:53 UTC, Bjorn Helgaas
Details | Diff
add back ACPI motherboard driver (leaving FADT reservations in osl.c) (5.15 KB, patch)
2008-01-07 15:56 UTC, Bjorn Helgaas
Details | Diff
dmesg output in kernel-2.6.21 with applied patch http://bugzilla.kernel.org/attachment.cgi?id=14335&action=view (15.10 KB, text/plain)
2008-01-08 13:59 UTC, Anton Kuzmin
Details

Description Anton Kuzmin 2007-11-04 03:27:46 UTC
Most recent kernel where this bug did not occur: 2.6.20
Distribution: Whatever (I tried gentoo, debian and that one with 2.6.23 inside

Problem Description:
suspend2-sources and gentoo-sources >= 2.6.21 hangs on boot at "io scheduler
cfg registered (default)" line.
There is no kernel panic, the kernel just gets stuck during boot.

I googled a bit and found some victims of the same problem. So, running with initcall_debug and without 'quiet' and 'splash' gave me this output(the last lines):
calling initcall 0xc03c63a0 : populate_rootfs + 0x0/0x100()
checking if image is initramfs ... it is

With noapic and acpi=off:
calling initcall 0xc01eae70 : pci_init + 0x0/0x30()

Changing the io scheduler via elevator= option didn't help
irqpoll and idle=poll didn't help either

Previous tries to get help:
http://bugs.gentoo.org/show_bug.cgi?id=184821
https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/155278

I am not a kernel hacker. So, please help me with the debugging process. I really want my linux back alive.
Comment 1 Natalie Protasevich 2007-11-04 22:29:49 UTC
Anton,
If you can provide a boot trace for both cases - for 2.6.20 and 2.6.21 then we could find some clues in there, especially if you add "debug" on the boot command line. In the failed case, you would have to use serial console. 

It is also useful to know what system it is happening on, so output of dmidecode and say lspci -v would be helpful.

The most effective method would be to performe git bisection as explained in http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/

Please attach the traces above (attaching is preferable as opposed to pasting it into comments as text), and let me know if you're willing to try git bisection. I assume you're familiar with git and know how to use it?
Comment 2 Anton Kuzmin 2007-11-05 07:03:55 UTC
Yes, I am  willing to try git bisection. It looks simple to use so I think I can handle it. What to do if I a build fails? It stopped compiling because of some KVM function. I disabled KVM. I don't use it anyway. But what do I do if the next build fails on something important?
Comment 3 Anton Kuzmin 2007-11-05 11:38:52 UTC
Created attachment 13402 [details]
dmidecode.log
Comment 4 Anton Kuzmin 2007-11-05 11:39:12 UTC
Created attachment 13403 [details]
git log
Comment 5 Anton Kuzmin 2007-11-05 11:39:38 UTC
Created attachment 13404 [details]
lspci -v output
Comment 6 Anton Kuzmin 2007-11-05 11:40:15 UTC
Created attachment 13405 [details]
git bisect's verdict
Comment 7 Natalie Protasevich 2007-11-05 12:26:08 UTC
Excellent work, and enough information to bring this to Bjorn's attention.
Thanks!
Comment 8 Bjorn Helgaas 2007-11-05 16:21:32 UTC
Created attachment 13408 [details]
debugging patch
Comment 9 Bjorn Helgaas 2007-11-05 16:22:19 UTC
(This goes with the patch above; bugzilla choked when I entered it as comments.)

Thanks for all your debug work!

I'm confused.  If we hung inside acpi_reserve_resources() we should have
seen something like:

  calling initcall 0x.... : acpi_reserve_resources + 0x0/0x...()

So it looks more like we hung inside populate_rootfs().

Can you try the attached patch, booting with and without "acpi=off"?
Comment 10 Anton Kuzmin 2007-11-06 05:53:15 UTC
With or without acpi:
   A lot of lines like:
      pci_init: fixup 0000:00:0a:1 done
      pci_init: fixup 0000:00:0b:1

How do I see more output? Something like 'less' or 'more'?
Comment 11 Anton Kuzmin 2007-11-06 05:55:02 UTC
I found similar bugreport 2.6.17 hangs during boot on ASUS M2NPV-VM motherboard
http://lkml.org/lkml/2006/7/14/117
Comment 12 Bjorn Helgaas 2007-11-06 09:19:08 UTC
The patch in question:
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9a47cdb1bb85e7944fb7419e4078c46516ef7335
first appeared in 2.6.21, so the 2.6.17 hang must be a different problem with similar symptoms.

I assume your kernel still hangs, right?  What's the last thing it prints before it hangs?  Do you see the "pci_init - done" message?  Do you see any "acpi_reserve_resources" messages?  The best thing would be if you could collect all the console output, using netconsole or a serial console.  If you can't do that, maybe you could attach a digital photo of the screen.

If the last thing printed is "pci_init: fixup 0000:00:0b:1", that would mean we hung somewhere inside a fixup for the PCI device at 00:0b.1, which looks like a USB controller.
Comment 13 Anton Kuzmin 2007-11-06 10:24:01 UTC
Yes, the last line was "pci_init: fixup 0000:00:0b:1".

And I couldn't see any other lines because all I saw was pci_init messages on my tiny laptop 13.1'' screen. So, I believe making a photo must be useless.

I don't know how to use these tools to collect console output but I will see what I can do.
Comment 14 Anton Kuzmin 2007-11-06 10:43:33 UTC
I commented your changes in drivers/pci/pci.c and I don't see any debug messages anymore.

I read about netconsole. Is it really possible to use it? I mean my linux hasn't even started yet. I can't even do modprobe and the modules in autoload didn't start either.
Comment 15 Bjorn Helgaas 2007-11-06 13:37:16 UTC
You could build your NIC driver statically into the kernel, but you're probably right that netconsole won't help here, because I think this hang happens even before the built-in drivers are initialized.

If you add "debug" to your kernel parameters, it looks like pci_do_fixups() should print the address of each fixup it calls.  Then you can look up that address in your System.map to see what function it is.
Comment 16 Randy Dunlap 2007-11-06 13:43:42 UTC
netconsole does init fairly late, so Bjorn is probably correct.

Some distros modify "debug" in their init scripts, so an alternative is to
use "ignore_loglevel" on the kernel boot command line instead of "debug".
Comment 17 Bjorn Helgaas 2007-11-06 14:19:42 UTC
Thanks, I didn't know about "ignore_loglevel".  And in order to get the PCI fixup debug, you have to turn on CONFIG_DEBUG_KERNEL and CONFIG_PCI_DEBUG.
Comment 18 Anton Kuzmin 2007-11-06 14:55:08 UTC
PCI: Calling quirk c0294a70 for 0000:00:0b.0
PCI: Calling quirk c03d7a70 for 0000:00:0b.0

In System.map they are:
t quirk_cardbus_legacy
t quirk_usb_early_handoff
Comment 19 Bjorn Helgaas 2007-11-06 21:03:30 UTC
Created attachment 13431 [details]
debug in usb quirk

I guess we must be hanging in quirk_usb_early_handoff().  I don't know why the git bisection identified the ACPI patch, which seems pretty unrelated.  But maybe the connection will become clear later.

Here's a patch to add some more debugging in this USB quirk.
Comment 20 Anton Kuzmin 2007-11-07 09:18:57 UTC
I don't know, maybe I mistyped some bisect-good-or-bad commands. Does the git log look reasonable? Or you can't say?

I will attach the photo with acpi
Without acpi (acpi=off) the only thing that was different is the 'base', which was 0xf880c000
Comment 21 Anton Kuzmin 2007-11-07 09:19:54 UTC
Created attachment 13440 [details]
screenshot
Comment 22 Bjorn Helgaas 2007-11-07 12:26:12 UTC
Created attachment 13443 [details]
more usb quirk debug

The git bisection log looked reasonable to me.  Can you double-check by taking 2.6.21 and backing out just the patch identified by the bisection?  Something like this, for example:

  $ cd linux-2.6.21
  $ wget -Oacpi.patch "http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=9a47cdb1bb85e7944fb7419e4078c4651"
  $ patch -p1 -R < acpi.patch

Based on your screenshot, we should have seen the "waiting" message next.  Maybe one of the following CSR writes is hanging.  But that code didn't change between 2.6.20 and 2.6.21.  Here's another patch with a couple more printks to identify the exact CSR write.
Comment 23 Anton Kuzmin 2007-11-07 16:07:57 UTC
Um, I can't double-check. I am having problems reverting the patch because motherboard.c doesn't exist in 2.6.21 What do I do?

But I applied your debugging patch and here's the results:
quirk_usb_early_handoff
quirk_usb_handoff_ohci
quirk_usb_handoff_ohci base 0xf8806000 (bus addr 0xc0004000)
quirk_usb_handoff_ohci control 0x797
quirk_usb_handoff_ohci writing OHCI_INTRENABLE, 0x40000000 at 0xf8806010
quirk_usb_handoff_ohci writing OHCI_CMDSTATUS, 0x8 at 0xf8806008
Comment 24 Anton Kuzmin 2007-11-11 05:30:33 UTC
I reverted the patch by hand after I had copied motherboard.c from my linux-git dir. Linux didn't boot. I will have to try git bisect again. I am very sorry for misleading you if it is really my fault.
Comment 25 Natalie Protasevich 2007-11-11 07:54:07 UTC
Anton,
Here is a good description for bisect:
http://www.kernel.org/doc/local/git-quick.html
Comment 26 Anton Kuzmin 2007-11-11 15:25:21 UTC
I run the bisect test again making sure I didn't make a mistake inputting good|bad commands. The same results. The same lines in the log.
Comment 27 Natalie Protasevich 2007-11-13 01:20:51 UTC
Hmm, this appears to be something that git picked on. Removal of platform driver and other ACPI changes were cascading, so this is probably no single patch that will take you back to working system. Bjorn please correct me if I'm wrong.
Comment 28 Ingo Molnar 2007-11-17 18:39:35 UTC
Created attachment 13595 [details]
turn USB quirk off

does the bootup get any further if you apply the attached usb-quirk-off.patch, which seems to be the source of the lockup?
Comment 29 Anton Kuzmin 2007-11-18 10:30:19 UTC
I applied patch "turn USB quirk off" on 2.6.21. It still stops booting. I even commented the body of the function. But it gets called - that's for sure(I put printk line inside).
Comment 30 Bjorn Helgaas 2007-11-21 14:49:50 UTC
Created attachment 13681 [details]
revert the git-bisect verdict patch from 2.6.21

This applies to 2.6.21 and removes the patch identified by git bisect.  I suppose this is the same test you did in comment #24, Anton.  But maybe it'd be worth testing an independently generated patch.
Comment 31 Bjorn Helgaas 2007-11-21 15:00:46 UTC
Created attachment 13682 [details]
turn OHCI USB quirk off

Ingo's patch (comment #28) turned off the UHCI USB quirk.  But the hang appears to be in the OHCI quirk, not the UHCI one, so I'm not surprised that it made no difference.  This patch should turn off the OHCI patch.  Does 2.6.21 + this patch boot?
Comment 32 Anton Kuzmin 2007-12-04 02:30:56 UTC
Thanks for trying to help, Bjorn!

I applied "revert the git-bisect verdict patch from 2.6.21" via git-apply. It is being compiled now.

As I said before, I cleared the whole body of static void __devinit quirk_usb_early_handoff(struct pci_dev *pdev) function and added a printk with some bla bla text inside. It didn't show up.

The bad news is that probably I will have to get rid of this laptop soon (after 9. December). Maybe, we can find the bug faster. I will do all I can to help you with debugging.
Comment 33 Anton Kuzmin 2007-12-04 02:47:40 UTC
You know, Bjorn, your PATCH WORKED!!!

Tell me please how to properly apply it to ubuntu kernel sources.
Comment 34 Anton Kuzmin 2007-12-04 05:05:20 UTC
Manually patched 2.6.22 from kernel-sources ubuntu package. It worked!
Comment 35 Anton Kuzmin 2007-12-08 05:57:19 UTC
So, how are you going to manage this situation? You just undo the patch and that's it, or you are going to do something else which is more clean?
Comment 36 Bjorn Helgaas 2008-01-03 15:26:25 UTC
In 2.6.20, acpi_reserve_resources() (an fs_initcall) happens before quirk_usb_handoff_ohci() (a device_initcall), but in 2.6.21, the order is reversed.

I don't see anything in quirk_usb_handoff_ohci() that looks like it depends on acpi_reserve_resources(), but I'd like to know whether we run quirk_usb_handoff_ohci() in 2.6.20.

Anton, do you still have this laptop?  If so, can you add some printks in acpi_reserve_resources() and quirk_usb_handoff_ohci() in 2.6.20 and attach the resulting dmesg?  Make sure you put several in quirk_usb_handoff_ohci() so we can tell if it bails out early.

If you don't have the laptop anymore, I guess we'll have to wait for somebody else to reproduce the problem.
Comment 37 Anton Kuzmin 2008-01-03 23:55:01 UTC
I told you something ambiguous. I am really sorry. I was too happy to have my new kernel back.

When I said "your patch worked" I mean this one http://bugzilla.kernel.org/attachment.cgi?id=13681&action=view I guess the one that was before didn't work before because I didn't edit drivers/acpi/Makefile

I still have this laptop and I run manually patched 2.6.24-rc5
Comment 38 Bjorn Helgaas 2008-01-04 08:59:21 UTC
Created attachment 14282 [details]
instrument acpi_reserve_resources() and quirk_usb_handoff_ohci()

No problem.  If the problem is really something to do with the ordering of acpi_reserve_resources() vs. quirk_usb_handoff_ohci(), I would expect that patch to work because it restores the ordering we had in 2.6.20.

But I don't want to push that patch upstream because it's not the right long-term solution.

If you could try 2.6.20 with the attached patch, we could see whether we run quirk_usb_handoff_ohci().  Maybe we skip it for some reason or it behaves differently than in 2.6.21.

If you can try this, please attach the entire dmesg log.  Thanks!
Comment 39 Anton Kuzmin 2008-01-06 04:56:03 UTC
Created attachment 14313 [details]
dmesg output
Comment 40 Anton Kuzmin 2008-01-06 04:59:21 UTC
Comment on attachment 14313 [details]
dmesg output

dmesg of vanilla kernel-2.6.20 with patch http://bugzilla.kernel.org/attachment.cgi?id=14282&action=view applied
Comment 41 Bjorn Helgaas 2008-01-07 15:53:28 UTC
Created attachment 14335 [details]
move FADT reservations to acpi/motherboard.c (device_initcall to fs_initcall)

Thanks.  I don't see any useful difference in the way we run the USB quirk.

My ACPI patch did two things that should have been split into different patches:
  - move the FADT reservations from the ACPI motherboard driver into osl.c, and
  - remove the ACPI motherboard driver so we'd rely on the PNP driver instead

We know from comment #33 that starting with 2.6.21 and undoing both changes with attachment id 13681 makes it work again.  So I'd like to figure out which change is important.

Can you start with plain 2.6.21, try this patch, and attach the dmesg log?
Comment 42 Bjorn Helgaas 2008-01-07 15:56:06 UTC
Created attachment 14336 [details]
add back ACPI motherboard driver (leaving FADT reservations in osl.c)

Also, please start again with a fresh 2.6.21, apply this patch, and attach the dmesg log.

(I guess at least one will likely hang, so attach a digital photo in that case.)
Comment 43 Anton Kuzmin 2008-01-08 13:59:15 UTC
Created attachment 14369 [details]
dmesg output in kernel-2.6.21 with applied patch http://bugzilla.kernel.org/attachment.cgi?id=14335&action=view
Comment 44 Anton Kuzmin 2008-01-08 14:05:18 UTC
As for the second patch (http://bugzilla.kernel.org/attachment.cgi?id=14336&action=view), the kernel hanged. Unfortunately, I couldn't take a picture of the screen. But I didn't notice anything interesting - to me the output looked exactly the same as usual. But if you need it, I will try to do it.
Comment 45 Anton Kuzmin 2008-02-10 11:28:05 UTC
I would be very pleased if you fixed this properly for the stable 2.6.*.x kernels before ubuntu hardy release. If you have time...
Comment 46 Anton Kuzmin 2008-04-26 06:32:54 UTC
Linux 2.6.24.3 compiled from source (ubuntu hardy linux-source package) and their precompiled 2.6.24 work fine :-)

https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/155278

Note You need to log in before you can comment on or make changes to this bug.