Bug 14244 - 2.6.31-1 hangs early during boot
Summary: 2.6.31-1 hangs early during boot
Status: CLOSED DUPLICATE of bug 15708
Alias: None
Product: ACPI
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: ykzhao
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-28 18:16 UTC by Stefan Krause
Modified: 2010-09-28 22:05 UTC
CC List: 6 users

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Screenshot of acpi=verbose (397.91 KB, image/jpeg)
2009-09-28 18:17 UTC, Stefan Krause
Dmesg of successful boot (60.00 KB, text/plain)
2009-09-28 18:17 UTC, Stefan Krause
dmidecode (7.55 KB, text/plain)
2009-09-28 18:17 UTC, Stefan Krause
lspci -vnvn (34.53 KB, text/plain)
2009-09-28 18:18 UTC, Stefan Krause
output of acpidump (354.88 KB, text/plain)
2009-09-29 18:14 UTC, Stefan Krause
output of cpuinfo (1.48 KB, text/plain)
2009-09-29 18:14 UTC, Stefan Krause
maybe related error message (377.44 KB, image/jpeg)
2010-01-03 12:11 UTC, Stefan Krause

Description Stefan Krause 2009-09-28 18:16:40 UTC
My notebook, an HP HDX 9300, has severe problems booting 2.6.31, which I didn't have with earlier versions (I've installed every Ubuntu and Fedora release since Ubuntu 8.04).
This happens on both Fedora 12 Alpha and Ubuntu 9.10 Alpha.
Booting with the default command line (quiet splash), my system hung about 80% of the time. The remaining 20% of the time it booted fine.
With the options acpi=noirq and without quiet it worked much better until Ubuntu was updated to 2.6.31-10. With that version it started hanging with acpi=noirq too.
The system starts booting and then halts with the output shown in the attached screenshot. After that the system is frozen and can only be powered off (Ctrl+Alt+Del does not work).

This weekend I decided to compile my own kernel to see whether I run into the same problem. Here's the result:
* With kernel debug information, booting works most of the time (without splash and without quiet)
* Without kernel debug information, it works only sometimes (about 50% of the time)
* acpi=noirq seems to help
* acpi=off seems to make the whole problem go away

I've reported this issue on both the ubuntu bugtracker (https://bugs.launchpad.net/ubuntu/+bug/410984) and fedora bugtracker (https://bugzilla.redhat.com/show_bug.cgi?id=521423), but there seems to be zero activity.

I'm absolutely willing to help finding the issue so let me know if I can help in any way.
Comment 1 Stefan Krause 2009-09-28 18:17:03 UTC
Created attachment 23197 [details]
Screenshot of acpi=verbose
Comment 2 Stefan Krause 2009-09-28 18:17:22 UTC
Created attachment 23198 [details]
Dmesg of successful boot
Comment 3 Stefan Krause 2009-09-28 18:17:50 UTC
Created attachment 23199 [details]
dmidecode
Comment 4 Stefan Krause 2009-09-28 18:18:48 UTC
Created attachment 23200 [details]
lspci -vnvn
Comment 5 Stefan Krause 2009-09-28 19:40:29 UTC
I just tried 2.6.32-rc2 and it has the same problem. I had to start my laptop three times before it booted successfully.
Comment 6 ykzhao 2009-09-29 05:56:38 UTC
Will you please try each of the following boot options and see whether the issue still exists?
   a. nolapic_timer
   b. idle=poll
   c. processor.max_cstate=1

Will you please also attach the output of acpidump?
Thanks.
Comment 7 ykzhao 2009-09-29 05:57:08 UTC
Will you please also attach the output of "cat /proc/cpuinfo"?
Thanks.
Comment 8 Stefan Krause 2009-09-29 18:14:01 UTC
Thanks for your reply.
I got the following results:
nolapic_timer: First attempt failed
idle=poll: Failed on the fifth boot attempt
processor.max_cstate=1: Booted 6 times without problems!

I've attached what you requested (I booted without quiet, without any acpi= option, and without any of the options above).
Comment 9 Stefan Krause 2009-09-29 18:14:28 UTC
Created attachment 23202 [details]
output of acpidump
Comment 10 Stefan Krause 2009-09-29 18:14:48 UTC
Created attachment 23203 [details]
output of cpuinfo
Comment 11 ykzhao 2009-09-30 00:58:25 UTC
Will you please double-check the "idle=poll" boot option? I find it hard to believe that the box boots with "processor.max_cstate=1" but fails with "idle=poll".
With "idle=poll", C-states are disabled entirely.

Thanks.
Comment 12 Stefan Krause 2009-09-30 18:04:29 UTC
I'm afraid you're right.

I ran the tests again with "idle=poll" and it failed on the eighth attempt.
Then I tried processor.max_cstate=1 again, and this time it failed on the second and third attempts.
The last message was something like "sata_sil24 0000:20:00.0 PCI INT A -> GSI 19 (level,low) -> IRQ 19"
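As a rough sanity check on how much a short run of clean boots proves, assume (as an idealized, hypothetical model) that each boot hangs independently with some fixed probability. The chance of n consecutive successful boots by luck alone is then (1 - p)^n. The function name below is hypothetical; this is only a minimal sketch of that arithmetic.

```python
def chance_of_clean_streak(failure_rate: float, boots: int) -> float:
    """Probability that `boots` independent boots all succeed when each
    boot hangs with probability `failure_rate` (assumed, idealized model)."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** boots


# ~50% hang rate, as reported for the self-built kernel without debug info
print(round(chance_of_clean_streak(0.5, 6), 4))  # 0.0156
# A hypothetical 20% hang rate: six clean boots in a row is still fairly likely
print(round(chance_of_clean_streak(0.2, 6), 4))  # 0.2621
```

Under this model, six clean boots is strong evidence only if the option being tested would otherwise hang often; with a lower underlying hang rate it is weak evidence, which is consistent with processor.max_cstate=1 passing six boots and then failing on a later run.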
Comment 13 Stefan Krause 2009-10-03 17:54:55 UTC
I checked a few more things:
* So far I could boot 25 times without any problems with acpi=off (but of course this is not a valid workaround)
* I also tried acpi=noirq, but it doesn't help. The last message before the system hung was 'ACPI Error: No handler or method for GPE[19] disabling event (20090903/evgpe-706)'.

All in all this is very bad. I'll probably switch back to Fedora 11 and kernel 2.6.30 where I haven't had any problems with booting so far.
Comment 14 Stefan Krause 2009-10-11 18:07:49 UTC
I tried a few more things. I switched from the 64-bit to the 32-bit version of Ubuntu 9.10. So far I've booted about 15 times and haven't had any problems yet. I'll post a comment if the 32-bit version hangs on boot too.
Comment 15 Stefan Krause 2009-10-22 20:28:19 UTC
Just to confirm my findings: the 32-bit version of Ubuntu 9.10 keeps working without any problems, while the 64-bit RC hangs on boot most of the time with ACPI enabled.
Comment 16 Fenrir 2009-11-18 20:19:04 UTC
I also have an HP HDX 9300. Each time I boot 64-bit Karmic Koala, I only have a 1 in 8 chance of a successful boot. I'm trying to find a solution as quickly as possible, and if I succeed I'll be sure to post it.
Comment 17 Stefan Krause 2010-01-03 12:10:41 UTC
Update:
I'm still running Ubuntu 9.10 32-bit and it boots fine.
Recently I decided to install OpenSuse 11.2 64-bit. The installed system boots fine so far (which is really odd, since the 64-bit versions of both Ubuntu 9.10 and Fedora 12 don't), but during installation I saw an error message that I'd like to post here since it might be related to this bug.

As you can see in the screenshot (titled "maybe related error message") the ata_piix message with some interrupt info is there (... -> IRQ 16), but the rest is something I haven't seen before - a hardware error for the CPUs.

Could this be the issue I'm seeing with the other 64 bit distributions? (If so, why is it printed just during installation and never during a normal boot?)
Comment 18 Stefan Krause 2010-01-03 12:11:11 UTC
Created attachment 24418 [details]
maybe related error message
Comment 19 ykzhao 2010-02-22 02:03:09 UTC
Hi, Stefan
    From the info in comment #18 it seems that a kernel panic occurs, triggered by the MCE (machine check exception) handling:
    >mce_machine_check
    >mce_panic

    I am not sure whether this is related to the BIOS.

Can you try the "maxcpus=1" boot option and see whether the system boots?

thanks.
Comment 20 Stefan Krause 2010-02-28 19:51:11 UTC
It seems as if I can boot with maxcpus=1 (6 out of 6 boot attempts were successful on Ubuntu lucid Alpha 3 with kernel 2.6.32).

The same Ubuntu alpha version boots fine with the 32 bit kernel.
Comment 21 Stefan Krause 2010-03-26 19:19:25 UTC
Bad news! I recently switched to Ubuntu lucid 32-bit and my system hung a few times during boot (uname -a: Linux hape 2.6.32-17-generic #26-Ubuntu SMP Fri Mar 19 23:58:53 UTC 2010 i686 GNU/Linux).
This is really bad news, since I booted 2.6.31 a few hundred times on 32-bit without problems, but now 2.6.32-17 is causing problems on the 32-bit version too.

The boot process stopped with something like:
uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16

In a successful boot, these are the next lines in dmesg:
uhci_hcd 0000:00:1a.0: setting latency timer to 64
uhci_hcd 0000:00:1a.0: UHCI Host Controller
uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1a.0: irq 16, io base 0x00007000

I'll remove the "64-bit only" remark from the title.
Comment 22 Stefan Krause 2010-03-26 19:23:08 UTC
One more comment: I've been using lucid since Alpha 2. I didn't notice any boot problems until the beginning of this week (I'd say I didn't notice any problems before 2.6.32-16).
Comment 23 Stefan Krause 2010-04-07 09:32:12 UTC
I currently suspect that this issue is (at least for the 32-bit version) not ACPI-related but might be an ata_piix issue. I've filed a new bug report for it at https://bugzilla.kernel.org/show_bug.cgi?id=15708 and will report back if that turns out to be the cause.
Comment 24 ykzhao 2010-06-12 06:42:48 UTC
HI, Stefan
    Thanks for the update.
   From the log info it seems that the hang is not related to ACPI, so this bug will be marked as a duplicate of bug 15708.

thanks.

*** This bug has been marked as a duplicate of bug 15708 ***
Comment 25 Tejun Heo 2010-06-12 17:14:37 UTC
ykzhao, it isn't clear where the issue is. At this point, I'm highly skeptical that this is caused by something in libata proper. It definitely looks like something at a much lower level is broken. It might not be ACPI, but we don't have any evidence that conclusively rules anything out at this point. I don't really mind which bug we use to track the issue, but it would be great if you could stick around.

Stefan, has the MCE error happened again? The problem could be hardware-related, and we might just be seeing lucky and unlucky code / data / whatever layouts depending on the specific build. How many memory sticks do you have in the machine? If there is more than one, can you please remove one, see whether the problem is still reproducible, and if so, swap it with the other one and see whether anything changes?

Thanks.
