Kernel Bug Tracker – Bug 8186
x86_64-only: system hang starting X unless "acpi=noirq" - HP dv9000
Last modified: 2007-08-07 14:32:45 UTC
Most recent kernel where this bug did *NOT* occur:
Has occurred in all kernels I've tried since 2.6.19
Distribution: Fedora Core 6
HP dv9000z laptop, AMD64x2, nvidia graphics, nforce MCP51 chipset, dual sata
drives, PCIE bus interface to the nvidia and bcm43xx cards.
nv driver (not proprietary one)
Fedora Core 6 boots and runs if I use kernel parameter pci=noacpi, but freezes
if I omit that parameter, at the point where it is starting Xorg. I have
looked with firescope at the boot log, but no additional log entries are printed
after starting X. If I boot with the init level set to 3 (for command line
interface), the system runs fine, but if I then try xinit, startx or telinit 5,
the freeze happens.
I am now running a version of 2.6.21-rc3, with an additional patch to print out
ACPI register accesses that conflict with reserved registers.
I will attach DSDT and the /proc/interrupts output for the case that works and
the case that does not work. I note that the acpi sci interrupt is different
in the two modes (and am skeptical about its being level triggered).
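For anyone who wants to compare the two /proc/interrupts captures mechanically rather than by eye, here is a minimal Python sketch. It is not part of the original report. It assumes the 2.6-era layout (one count column per CPU, then the controller/trigger type, then a comma-separated device list), and the function names parse_interrupts, device_map and diff_irqs are made up for illustration.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: {total, type, devices}}.

    Assumes the 2.6-era layout: per-CPU counts, then the controller/trigger
    type (e.g. IO-APIC-level), then the device list.
    """
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        kind = fields[0] if fields else ""
        devices = " ".join(fields[1:])
        table[label.strip()] = {"total": sum(counts), "type": kind,
                                "devices": devices}
    return table


def device_map(table):
    """Map each device name to the (irq label, type) it is attached to."""
    out = {}
    for irq, row in table.items():
        for dev in row["devices"].split(","):
            dev = dev.strip()
            if dev:
                out[dev] = (irq, row["type"])
    return out


def diff_irqs(text_a, text_b):
    """Return devices whose IRQ number or trigger type differs."""
    a = device_map(parse_interrupts(text_a))
    b = device_map(parse_interrupts(text_b))
    return {dev: (a.get(dev), b.get(dev))
            for dev in sorted(set(a) | set(b))
            if a.get(dev) != b.get(dev)}
```

Given the text of the working and the failing capture, diff_irqs should list entries such as the acpi SCI whose IRQ number or trigger type changes between the two boots.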
Created attachment 10725 [details]
dsdt.dsl for dv9000z laptop
Created attachment 10726 [details]
/proc/interrupts without and with pci=acpi
Created attachment 10727 [details]
/var/log/messages for boot without pci=acpi
note that the kernel parameter "debug" was used, and that the messages
referring to ACPI read and ACPI write to registers come from a test patch that
I added to my kernel that checks for potential conflicts between ACPI-directed
read/write activity and io registers that are marked in use.
Created attachment 10728 [details]
/var/log/messages for boot without pci=acpi
This is the correct /var/log/messages - somehow my emacs extract included too
much stuff the first time.
Has this worked properly without cmdline workarounds using
any known version of the Linux kernel? (eg. 2.6.16, or 2.6.18?)
Is it the display that freezes, or the entire machine?
ie. can you access it via the network ping or ssh?
Does a press of the power button shut the machine down --
or do you have to hold power down for more than 4 seconds
for an immediate poweroff.
is "acpi=noirq" sufficient to work around the failure?
is "noapic" sufficient to work around the failure?
If yes, please include the /proc/interrupts and dmesg for them.
oh, and is this SMP specific?
ie. does "maxcpus=1" make it go away?
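When collecting results for several boot options, it helps to record which workaround was actually active for each capture. The following is a small Python sketch along those lines, not something from the thread itself. The parameter list contains only options mentioned in this report, and the helper name active_workarounds is hypothetical.

```python
# Boot parameters discussed in this report as possible workarounds.
WORKAROUNDS = (
    "pci=noacpi", "acpi=noirq", "acpi=off", "noapic", "maxcpus=1",
    "nmi_watchdog=0", "iommu=off", "pci=nomsi", "pci=routeirq", "irqpoll",
)


def active_workarounds(cmdline):
    """Return the workaround parameters present on a kernel command line.

    Matching is done on whole whitespace-separated tokens, so a parameter
    only counts when it appears exactly as written in WORKAROUNDS.
    """
    tokens = cmdline.split()
    return [w for w in WORKAROUNDS if w in tokens]
```

On a running system the input would be the contents of /proc/cmdline, for example active_workarounds(open("/proc/cmdline").read()), and the result can be stored next to the matching dmesg and /proc/interrupts output.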
> Has this worked properly without cmdline workarounds using
> any known version of the Linux kernel? (eg. 2.6.16, or 2.6.18?)
I have only tested kernels starting with 2.6.19. Has not worked properly with
any of them without cmdline workarounds.
> re: freezes
> Is it the display that freezes, or the entire machine?
> ie. can you access it via the network ping or ssh?
> Does a press of the power button shut the machine down --
> or do you have to hold power down for more than 4 seconds
machine seems to freeze - none of the above work except the >4sec power button
> is "acpi=noirq" sufficient to work around the failure?
> is "noapic" sufficient to work around the failure?
acpi=noirq seems to work OK. Attaching files.
noapic is problematic. There seems to be a serious problem with interrupts -
the ERR: line is large, and after a while the USB devices stop handling
interrupts. I have thought to report that as a separate bug, because
apparently 2-cpu systems get interrupts on both processors and there is some
sort of race in the ehci interrupt code that causes it to say that an interrupt
is unhandled when it is being handled. If you really want the noapic output I
can try to get it for you in its full ugliness. I assume noapic is a workaround
for this config, so am not sure it should have bugs reported against it.
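One way to put numbers on the "ERR: line is large" observation is to take two /proc/interrupts snapshots a few seconds apart and look at which lines grew the most. Below is a minimal Python sketch of that idea. It is an illustration only and assumes the per-CPU count columns come first on every line; irq_totals and busiest are invented names.

```python
def irq_totals(text):
    """Sum the per-CPU counts on each /proc/interrupts line."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for field in rest.split():
            # Stop at the first non-numeric field or once every CPU is read.
            if len(counts) == ncpu or not field.isdigit():
                break
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def busiest(before, after, top=5):
    """Return the `top` lines with the largest count increase."""
    a, b = irq_totals(before), irq_totals(after)
    deltas = {label: b[label] - a.get(label, 0) for label in b}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

A steadily growing ERR entry, or one IRQ line far ahead of all others, would point at the interrupt problem described above rather than at ordinary device activity.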
Created attachment 10729 [details]
dmesg for acpi=noirq case
Created attachment 10730 [details]
/proc/interrupts for acpi=noirq case
maxcpus=1 does NOT fix the problem. (note that I had also tried turning off
irqbalance on the theory that interrupts might be moving from cpu to cpu, and
turning that off didn't fix things either).
I am happy to try patches to generate more debugging, and I have firescope
working from another laptop - the hang makes firescope stop working, but perhaps
there's a way to generate messages just prior to the hang...
Please try nmi_watchdog=0
If that doesn't help, it would be great if you could
try 2.6.16.stable and if that works, try 2.6.18.stable.
It may also be fruitful to try out a 32-bit kernel
if that is possible -- say with a CD-based distro.
Also, to dig into an interrupt issue (if this turns out to be one)
I need the complete output from acpidump, not just the DSDT.dsl,
plus the output from lspci -vv.
Finally, I'm no expert on X, but it would be good to try
starting X without nv in the system. I believe that you
should be able to do this by running X with just vesa
or an nvidia frame buffer driver. Indeed, if the binary
nvidia driver fixed this, that would also be a clue.
Created attachment 10743 [details]
Created attachment 10744 [details]
lspci -vv output
Created attachment 10745 [details]
booting with nmi_watchdog=0 messages: freezes after coming up, but before startx
Trying nmi_watchdog=0 doesn't boot sufficiently far to test Xorg. freezes.
(must hold power button for >4 sec, no other way to contact).
Changed xorg.conf to use "vesa" driver rather than "nv" driver (both are
framebuffer drivers). Freezes in the same way when acpi=noirq or pci=noacpi
are not specified, at point where startx or xinit happens (screen is blanked, no
Created attachment 10748 [details]
irq and io registers according to Windows XP
For interest, since the machine runs fine under Windows XP, I've attached what
Windows does to set up IRQ assignments and the IO register addresses for all
the peripherals. NOTE: a number of these IRQs are not mentioned in the
Linux /proc/interrupts attachments above. "NVIDIA Network Bus Enumerator" has
IRQ 16, and "NVIDIA nForce System Management Controller" has IRQ 10 (different
from "AMD ACPI-Compliant System" on IRQ 9).
nmi_watchdog=0 hang is not comforting -- as that is the new default
thanks for the interrupt info. Any chance to try 32-bit or find
if any previous kernel worked properly?
Tried 2.6.16.x stable and 2.6.18.x stable - both got kernel panics because
switchroot failed before trying init - probably because my root directory is
RAID1 LVM - I'm not sure what I need to do to fix that.
I did do MUCH better with a Knoppix 5.1.0 32-bit Live CD I had lying around. It
boots and runs X like a champ - no problems at all. uname tells me 2.6.19 is
the kernel version there, and SMP is running fine, all interrupts seem to be set
up by ACPI properly.
I can try to figure out how to run a more recent i686 kernel perhaps. But the
fact that x86_64 is where the problem seems to be is interesting.
Note that I have been having intermittent timeouts on the second sata_nv
drive when it is heavily used; that drive is not needed for the system to run.
continues to fail in 2.6.21-rc4
Is there anything left for me to try? I suspect the fact that 32-bit i686
boots puts the problem squarely in x86_64 specific code - in reading the two
architectures' acpi-based IRQ assignment code and IRQ dispatching code, I find
it odd that the two are so different rather than sharing common code (must be
historical forking due to diverged development teams, I guess).
One thing I did seem to note in my boot of Knoppix 5.1.1 32-bit is that there
seemed to be no mention of the "aperture". Perhaps the problem is not IRQ
related but IOMMU setup related?
iommu=off does NOT fix the problem, though.
> Tried 2.6.16.x stable and 2.6.18.x stable - both got kernel panics because
> switchroot failed before trying init - probably because my root directory is
> RAID1 LVM - I'm not sure what I need to do to fix that.
You need to build the LVM stuff into the kernel (CONFIG_MD=y etc)
and you also need to boot with an initrd.
> ... Knoppix 5.1.0 32-bit Live CD... - no problems at all.
> uname tells me 2.6.19 is the kernel version there,
> and SMP is running fine, all interrupts seem to be set
> up by ACPI properly.
Any chance you can grab the /proc/interrupts and dmesg from
the knoppix boot? The IRQ numbers will be different because
i386 still has a bogus "irq compression" patch in it, but
the dmesg will tell us if anything is different from
an ACPI irq allocation point of view.
> continues to fail in 2.6.21-rc4
> I suspect the fact that 32-bit i686
> boots puts the problem squarely in x86_64 specific code
Yes, though unless the kernel configs are really analogous
it is possible to get fooled. But it is good to know that
at least one modern shipping i386 kernel config works properly.
> Is there anything left for me to try?
2.6.21-rc5 I suppose:-)
Seriously, there have been some timer fixes, so it would be
worthwhile if you can check the state of -rc5.
Created attachment 11020 [details]
32-bit knoppix 5.1.1 /proc/interrupts
Knoppix 2.6.19 kernel boots correctly with acpi
Created attachment 11021 [details]
dmesg for 32-bit knoppix
Created attachment 11022 [details]
lspci -vv under 32-bit knoppix
checking the setup of the PCI bus
I have also tried 2.6.21-rc5 and it still fails the same way. Will explore
further with it, especially if someone has some suggestions for debugging steps
for me to explore (perhaps by checking out what differs that is relevant in the
Knoppix boot files added above).
> I have also tried 2.6.21-rc5 and it still fails the same way.
Previously the default was nmi_watchdog=2
and you said you got a (different) boot hang with nmi_watchdog=0.
In 2.6.21-rc4 and later, the default is nmi_watchdog=0.
What happens if you boot -rc5 with nmi_watchdog=2?
Is cpufreq running?
What are the contents of /sys/devices/system/cpu/cpu0/cpufreq/*?
What if you disable cpufreq with CONFIG_CPU_FREQ=n?
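To collect the cpufreq values asked for here in one go, something like the following Python sketch could be used. It is only an illustration: it reads every regular file in the given sysfs directory and records unreadable ones instead of stopping. The function name read_cpufreq is made up, and the default path is the one quoted in the question above.

```python
from pathlib import Path


def read_cpufreq(directory="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Return {file name: contents} for every file in a cpufreq directory.

    Returns an empty dict when the directory does not exist, which is what
    one would expect on a kernel built with CONFIG_CPU_FREQ=n.
    """
    values = {}
    base = Path(directory)
    if not base.is_dir():
        return values
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError as exc:
            # Some sysfs attributes are write-only or root-only.
            values[entry.name] = f"<unreadable: {exc.strerror}>"
    return values
```

Printing the result with a loop such as `for name, value in read_cpufreq().items(): print(name, value)` gives a listing that can be attached to the bug.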
Created attachment 11050 [details]
cpufreq values when pci=noacpi option is used.
here is the output of
on 2.6.21-rc5 with pci=noacpi.
2.6.21-rc5 with CONFIG_CPU_FREQ=n hangs during the booting process, no matter
whether pci=noacpi is included or not. If pci=noacpi is included as a
parameter, it seems to fail when a processor limit event happens. If it is not
included, the hang happens when ACPI probes the PCI bus.
So it looks like cpufreq is essential to make the x86_64 boot happen, strange.
Should I try other tests?
Booting with nmi_watchdog=2 works if pci=noacpi is there too, but the freeze
still happens if no pci=noacpi parameter is set (in rc5). nmi_watchdog=1 suffers
the same fates.
Correction to comment #28: I can boot if pci=noacpi and cpufreq is not
configured ... the hang at probing the PCI bus is apparently random, and quite
infrequent. I wonder if that is another symptom of a common underlying problem?
However, with pci=noacpi left off, building the kernel without cpufreq does
not help on my configuration.
I am starting to wonder if the fundamental issue is that some piece of chipset
IRQ-related state is just not set up on x86_64 machines properly via the ACPI
path, but is set by accident on the pci=noacpi path. Has anyone who knows the
details of the lspci -vv settings looked at the various ones included above for
issues? If I had a live cd kernel version of 2.6.21-rc5 to try in i386 mode,
one could see this. That's a lot of work for me to figure out how to make, but
I may try if I have to.
Just tested 2.6.21-rc6 to see if any of the changes fixed the problem observed
here. rc6 boots, but starting X still generates the same freeze. If I set
pci=noacpi, the system does not freeze when starting X.
I'd love to know if there is something I can try to debug this. Is there a way
to monitor the appropriate things (via firescope?) as the starting of X occurs?
Fooled around with .config for 22.214.171.124 until it boots properly on the machine.
It also requires pci=noacpi to get through the X startup without freezing.
I note that /proc/interrupts looks *very* different when booting with no pci=
argument, though. Instead of assigning ints up through IRQ 23 as has been true
for 2.6.20 and 2.6.21, it assigns numbers like 50, 58, 203, ...
I can provide the /proc/interrupts, dmesg, etc. for 126.96.36.199 if that will help
I'm seeing something that very much resembles this bug.
X freezes on startup exactly like described here, and it has happened
on every 2.6.20 and 2.6.21 kernel I've tried. Haven't yet tried a 2.6.19
kernel, but everything older (2.6.18 and down) works.
Like you, I have a dual-core x86_64 system (Athlon X2 3800+). Besides that,
an Asus A8N-SLI Premium motherboard (nForce4 CK804 chipset), and an NVIDIA
GeForce 7800 GT card. My distro is Slackware 11.0.
I haven't done as thorough testing as you (just tried NVIDIA's own
binary drivers), so I haven't submitted any bug reports until now. After
reading this, I tried acpi=noirq, but the kernel doesn't even boot then
(freezes printing info on my hard drive), so I couldn't test that.
One thing though: When X crashes on startup, it leaves the following
0: X(xf86SigHandler+0x8a) [0x8088b2a]
2: X(main+0x536) [0x80d47a6]
3: /lib/tls/libc.so.6(__libc_start_main+0xd4) [0xb7df7e14]
4: X [0x806ff61]
Fatal server error:
Caught signal 11. Server aborting
Please consult the The X.Org Foundation support
for help. Please also check the log file at "/var/log/Xorg.0.log" for
Does the same thing happen for you?
Forget about what I just posted, it appears that my problems were caused by the
evdev USB mouse driver. The physical device was hardcoded in xorg.conf, and the
device number changed after 2.6.18, resulting in X trying to use the powerbutton
as a mouse. :-) After updating the device path, X works again.
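Since a hardcoded event device in xorg.conf can silently point at the wrong hardware after a kernel upgrade, it may be worth checking which eventN node belongs to which device. The Python sketch below parses text in the /proc/bus/input/devices format (blank-line separated blocks with "N: Name=" and "H: Handlers=" lines). It is an illustration added here, not something the commenter used, and input_devices is a hypothetical helper name.

```python
def input_devices(text):
    """Return [(device name, [event handlers])] from /proc/bus/input/devices text."""
    devices = []
    for block in text.strip().split("\n\n"):
        name, handlers = None, []
        for line in block.splitlines():
            if line.startswith("N: Name="):
                name = line[len("N: Name="):].strip().strip('"')
            elif line.startswith("H: Handlers="):
                handlers = line[len("H: Handlers="):].split()
        if name is not None:
            # Keep only the evdev nodes (event0, event1, ...).
            events = [h for h in handlers if h.startswith("event")]
            devices.append((name, events))
    return devices
```

Running it on the contents of /proc/bus/input/devices before and after a kernel change would show whether the mouse moved to a different event number, which is the situation described above.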
I'd like to add that I'm encountering this bug as well on my HP dv9317cl. I
don't boot into X so the freeze happens on the command line for me. I can
stabilize the system by using noapic or pci=noacpi. Booting without either of
those options typically gets me to a login prompt, and I can use the system
briefly. It seems that the more interrupts that are being generated, the sooner
A sure fire way to make it freeze has been to load ehci-hcd and run dmesg
repeatedly; something about the scrolling text makes the system lock in a few
seconds. The freeze occurs without ehci-hcd also, it just takes longer, and I
think the ehci relation is a different bug.
I'm rather inexperienced in reporting bugs, so I'll attempt to follow Mr. Reed's
example. I'm also willing to provide any additional information desired, and
attempt any resolution/test that has an under 50% chance of destroying my hardware.
Well... maybe 25%...
Created attachment 11629 [details]
DSDT from dv9317cl
DSDT from my dv9317cl via acpidump -b -t DSDT and iasl -d
Created attachment 11630 [details]
/proc/interrupts without pci=noacpi option
Created attachment 11631 [details]
dmesg output without pci=noacpi option
Created attachment 11632 [details]
acpidump output on dv9317cl
Created attachment 11633 [details]
lspci -vv output without pci=noacpi option
I am now operating reasonably well, having done 3 things:
1) Running kernel 2.6.22-rc3 (in 2.6.20 versions, I was able to use
pci=routeirq as the only kernel option, once I did the below stuff, apparently
initializing the IOAPIC better than it was being set up without that option).
Also upgraded to F.28 BIOS - which has been released on HP website.
2) removing hwclock --systohc call from my shutdown scripts (causes hang at
3) using nvidia proprietary X screen driver.
Here's my theory: (1) F.28 BIOS apparently changes a lot in ACPI area. It has
more tables and different ones. It may also initialize IOAPIC differently when
ACPI is turned on, thus obviating need for pci=routeirq. 2.6.22 fixes lots of
sata-nv and timer related bugs (zillions of changes in that space, and I don't
want to bother doing git bisect). (2) relates to the *very* different code in
i386 and x86_64 for basic interrupts and rtc functions - maintainers should get
their act together, but who am I to criticize. (3) is incredibly hard to debug,
because it happens after screen is turned off, there is no serial port, and
firescope doesn't work, because of hard hang. I am hoping to play with
"nouveau" rather than "nv".
Given the timer problems, I am waiting for the tickless x86_64 updates before I
do more poking than I have been doing.
[For those who want a workaround, pull 2.6.22rc3, do the "make oldconfig" with
your old .config, and make sure you have the F.28 BIOS. This allowed me to
boot with no special boot parameters, leaving me still with problems with X and
/dev/rtc - but not showstoppers.]
In any case, for people running compatible x86_64
Upgrading BIOS resolved my issues as well.
hwclock freezes the system; easy to work around. Fast scrolling text in textmode
still freezes up, but that's an easy situation to avoid. X works flawlessly.
I think I have the same problem.
Hardware: HP dv9000 laptop, dv9000 GH769EA according to bios and dv9398eu according to hp update utility, bios F.38, AMD64x2 Turion, nvidia graphics, nforce MCP51 chipset
The computer freezes almost every time it boots. I use Ubuntu 7.04 with the linux-image-2.6.20-16-generic package. I have also tried the linux-image-2.6.22-9-generic package, which results in the same problem.
If the computer boots (maybe 1 time in 100) and gets to the KDE login, then it works without any problem. So it is probably some problem during the boot; I will explain below during which steps the computer freezes.
Sometimes I get the following error message on the console just before the crash:
error receiving uevent message: No buffer space available
I can make the computer boot if 'noapic' or 'acpi=off' or 'acpi=noirq' kernel option is used, but then it has other problems:
The noapic option makes the ehci-hcd driver behave very strangely. It takes a lot of CPU and receives a lot of interrupts. After a while it crashes unless the irqpoll option is specified.
The acpi=off or acpi=noirq options make the computer boot too, but then bcm43xx gives an error message when it tries to allocate IRQ 0(!), and the Xorg nvidia driver fails with an error message about a level-triggered IRQ. Nothing freezes, though.
I have tried to boot the computer with the init=/bin/bash option and stripped down the initrd to contain only thermal, processor, fan, fbcon, tileblit, font, bitblit, softcursor, vesafb, cfbcopyarea, cfbimgblt, cfbfillrect, capability, commoncap, sd_mod, ext3, jbd, mbcache, sata_nv, libata, scsi_mod. Of these, sd_mod, ext3, jbd, mbcache, sata_nv, libata and scsi_mod are necessary for boot, and the rest can't be disabled on Ubuntu.
Now the computer starts bash and everything works fine until:
* I run hwclock or udev, then the computer crashes most of the time.
* I run dmesg or any other program that produces a lot of output to the console while the ehci or ohci modules are loaded. The computer then freezes. This behavior disappears if the pci=nomsi option is added, but if udev is started (and doesn't crash) the computer becomes very sensitive to console output again.
Right now I am running the computer with the noapic option and with the ehci-hcd module disabled, but I want to make use of the full performance of the computer.
I heard that there is a bug in the hp bios. I asked the people in the hp linux forum. I got the answer "Using Linux certified hardware avoids these issues.".
What is the last console message when it freezes?
Most of the time the last console message is "* Loading hardware drivers...". It is printed from /etc/rcS.d/S10udev just before /sbin/udevtrigger is run.
And sometimes the last console message is "* Setting up consolefont and keymap... [OK]". This is printed from /etc/rcS.d/S49console-setup and is the last message before S50hwclock is run.
A few times the last console message is random, but I believe this is because a lot of output to the console also makes the computer freeze as I described above.
I forgot to mention that running hwclock works as long as no other modules (only the ones listed above) are loaded.
The problem with udevtrigger and hwclock disappears if I compile a kernel without RTC support.
The console problem is still there with or without RTC support.
I want to reopen this bug because I still consider it a problem. I can also open a new report if that is better.