Most recent kernel where this bug did *NOT* occur: Has occurred in all kernels I've tried since 2.6.19
Distribution: Fedora Core 6
Hardware Environment: HP dv9000z laptop, AMD64x2, nvidia graphics, nforce MCP51 chipset, dual sata drives, PCIE bus interface to the nvidia and bcm43xx cards.
Software Environment: nv driver (not proprietary one)
Problem Description: Fedora Core 6 boots and runs if I use kernel parameter pci=noacpi, but freezes if I omit that parameter, at the point where it is starting Xorg. I have looked with firescope at the boot log, but no additional log entries are printed after starting X. If I boot with the init level set to 3 (for command line interface), the system runs fine, but if I then try xinit, startx or telinit 5, the freeze happens.
I am now running a version of 2.6.21-rc3, with an additional patch to print out ACPI register accesses that conflict with reserved registers. I will attach DSDT and the /proc/interrupts output for the case that works and the case that does not work. I note that the acpi sci interrupt is different in the two modes (and am skeptical about its being level triggered).
Created attachment 10725 dsdt.dsl for dv9000z laptop
Created attachment 10726 /proc/interrupts without and with pci=acpi
Created attachment 10727 /var/log/messages for boot without pci=acpi
Note that the kernel parameter "debug" was used, and that the messages referring to ACPI reads and writes to registers come from a test patch I added to my kernel that checks for potential conflicts between ACPI-directed read/write activity and I/O registers that are marked in use.
Created attachment 10728 /var/log/messages for boot without pci=acpi
This is the correct /var/log/messages - somehow my emacs extract included too much stuff the first time.
Has this worked properly without cmdline workarounds using any known version of the Linux kernel? (eg. 2.6.16, or 2.6.18?)
re: freezes
Is it the display that freezes, or the entire machine? ie. can you access it via the network ping or ssh?
Does a press of the power button shut the machine down -- or do you have to hold power down for more than 4 seconds for an immediate poweroff?
Is "acpi=noirq" sufficient to work around the failure?
Is "noapic" sufficient to work around the failure?
If yes, please include the /proc/interrupts and dmesg for them.
Oh, and is this SMP-specific? I.e. does "maxcpus=1" make it go away?
> Has this worked properly without cmdline workarounds using
> any known version of the Linux kernel? (eg. 2.6.16, or 2.6.18?)
I have only tested kernels starting with 2.6.19. It has not worked properly with any of them without cmdline workarounds.
> re: freezes
> Is it the display that freezes, or the entire machine?
> ie. can you access it via the network ping or ssh?
> Does a press of the power button shut the machine down --
> or do you have to hold power down for more than 4 seconds
The machine seems to freeze - none of the above work except the >4sec power button hold-down.
> is "acpi=noirq" sufficient to work around the failure?
> is "noapic" sufficient to work around the failure?
acpi=noirq seems to work OK. Attaching files.
noapic is problematic. There seems to be a serious problem with interrupts - the ERR: line is large, and after a while the USB devices stop handling interrupts. I have thought of reporting that as a separate bug, because apparently 2-cpu systems get interrupts on both processors and there is some sort of race in the ehci interrupt code that causes it to say that an interrupt is unhandled when it is being handled. If you really want the noapic output I can try to get it for you in its full ugliness. I assume noapic is a workaround for this config, so I am not sure it should have bugs reported against it.
Created attachment 10729 dmesg for acpi=noirq case
Created attachment 10730 /proc/interrupts for acpi=noirq case
maxcpus=1 does NOT fix the problem. (note that I had also tried turning off irqbalance on the theory that interrupts might be moving from cpu to cpu, and turning that off didn't fix things either). I am happy to try patches to generate more debugging, and I have firescope working from another laptop - the hang makes firescope stop working, but perhaps there's a way to generate messages just prior to the hang...
Please try nmi_watchdog=0.
If that doesn't help, it would be great if you could try 2.6.16.stable and, if that works, try 2.6.18.stable. It may also be fruitful to try out a 32-bit kernel if that is possible -- say with a CD-based distro.
Also, to dig into an interrupt issue (if this turns out to be one) I need the complete output from acpidump, not just the DSDT.dsl, plus the output from lspci -vv.
Finally, I'm no expert on X, but it would be good to try starting X without nv in the system. I believe that you should be able to do this by running X with just vesa or an nvidia frame buffer driver. Indeed, if the binary nvidia driver fixed this, that would also be a clue.
Created attachment 10743 acpidump output
Created attachment 10744 lspci -vv output
Created attachment 10745 booting with nmi_watchdog=0 messages: freezes after coming up, but before startx
With nmi_watchdog=0 the system doesn't boot far enough to test Xorg; it freezes (I must hold the power button for >4 sec, and there is no other way to contact the machine).
Changed xorg.conf to use the "vesa" driver rather than the "nv" driver (both are framebuffer drivers). It freezes in the same way when neither acpi=noirq nor pci=noacpi is specified, at the point where startx or xinit happens (screen is blanked, no cursor shown).
Created attachment 10748 irq and io registers according to Windows XP
For interest, since the machine runs fine under Windows XP, I've attached what Windows does to set up IRQ assignments and the IO register addresses for all the peripherals.
NOTE: there are a number of IRQs that have no mention in the Linux /proc/interrupts attachments above. "NVIDIA Network Bus Enumerator" has an IRQ (16), as does "NVIDIA nForce System Management Controller" (10, which is different from "AMD ACPI-Compliant System" on 9).
The nmi_watchdog=0 hang is not comforting -- as that is the new default in 2.6.21.
Thanks for the interrupt info. Any chance you could try 32-bit, or find out whether any previous kernel worked properly?
Tried 2.6.16.46 stable and 2.6.18.8 stable - both got kernel panics because switchroot failed before trying init - probably because my root directory is RAID1 LVM - I'm not sure what I need to do to fix that.
I did do MUCH better with a Knoppix 5.1.0 32-bit Live CD I had lying around. It boots and runs X like a champ - no problems at all. uname tells me 2.6.19 is the kernel version there, and SMP is running fine, all interrupts seem to be set up by ACPI properly. I can try to figure out how to run a more recent i686 kernel perhaps. But the fact that x86_64 is where the problem seems to be is interesting.
Note that I have been having intermittent drive timeouts on the second sata_nv drive (which is not needed to run the system) when it is heavily used.
Continues to fail in 2.6.21-rc4. Is there anything left for me to try?
I suspect the fact that 32-bit i686 boots puts the problem squarely in x86_64-specific code. In reading the two architectures' ACPI-based IRQ assignment code and IRQ dispatching code, I find it odd that the two are so different rather than sharing common code (must be historical forking due to diverged development teams, I guess).
One thing I did seem to note in my boot of Knoppix 5.1.1 32-bit is that there seemed to be no mention of the "aperture". Perhaps the problem is not IRQ related but IOMMU setup related? iommu=off does NOT fix the problem, though.
> Tried 2.6.16.46 stable and 2.6.18.8 stable - both got kernel panics because
> switchroot failed before trying init - probably because my root directory is
> RAID1 LVM - I'm not sure what I need to do to fix that.
You need to build the LVM stuff into the kernel (CONFIG_MD=y etc) and you also need to boot with an initrd.
> ... Knoppix 5.1.0 32-bit Live CD... - no problems at all.
> uname tells me 2.6.19 is the kernel version there,
> and SMP is running fine, all interrupts seem to be set
> up by ACPI properly.
Any chance you can grab the /proc/interrupts and dmesg from the knoppix boot? The IRQ numbers will be different because i386 still has a bogus "irq compression" patch in it, but the dmesg will tell us if anything is different from an ACPI irq allocation point of view.
> continues to fail in 2.6.21-rc4
> I suspect the fact that 32-bit i686
> boots puts the problem squarely in x86_64 specific code
Yes, though unless the kernel configs are really analogous it is possible to get fooled. But it is good to know that at least one modern shipping i386 kernel config works properly.
> Is there anything left for me to try?
2.6.21-rc5 I suppose :-) Seriously, there have been some timer fixes, so it would be worthwhile if you can check the state of -rc5.
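Before rebuilding an older kernel for a RAID1 + LVM root, it may help to check the .config for options that are modular or unset. The following is a minimal sketch, not a definitive checklist: the function names `config_value` and `missing_builtins` are hypothetical, and apart from CONFIG_MD (named above) the option list is only a guess at what "etc" covers.

```python
import re

# Assumption: only CONFIG_MD comes from the thread; the other symbols are a
# guess at what a RAID1 + LVM root booted through an initrd would need.
WANTED = [
    "CONFIG_MD",
    "CONFIG_MD_RAID1",
    "CONFIG_BLK_DEV_DM",
    "CONFIG_BLK_DEV_INITRD",
]


def config_value(config_text, option):
    """Return the value of `option` ('y', 'm', ...) or None if it is unset or absent."""
    # "# CONFIG_FOO is not set" lines do not match, so they count as unset.
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


def missing_builtins(config_text, options):
    """List the options that are not built in (modular, unset, or absent)."""
    return [opt for opt in options if config_value(config_text, opt) != "y"]
```

Typical use would be to read the kernel tree's .config into a string and print `missing_builtins(text, WANTED)`; anything listed there would have to be provided by the initrd instead.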
Created attachment 11020 32-bit knoppix 5.1.1 /proc/interrupts
Knoppix 2.6.19 kernel boots correctly with acpi
Created attachment 11021 dmesg for 32-bit knoppix
Created attachment 11022 lspci -vv under 32-bit knoppix
checking the setup of the PCI bus
I have also tried 2.6.21-rc5 and it still fails the same way. Will explore further with it, especially if someone has some suggestions for debugging steps for me to explore (perhaps by checking out what differs that is relevant in the Knoppix boot files added above).
> I have also tried 2.6.21-rc5 and it still fails the same way.
Previously the default was nmi_watchdog=2 and you said you got a (different) boot hang with nmi_watchdog=0. In 2.6.21-rc4 and later, the default is nmi_watchdog=0. What happens if you boot -rc5 with nmi_watchdog=2?
Is cpufreq running? What are the contents of /sys/devices/system/cpu/cpu0/cpufreq/*? What if you disable cpufreq with CONFIG_CPU_FREQ=n?
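The cpufreq files asked about above can be collected in one pass, which makes it easier to attach them or compare them across boots. This is a minimal sketch; `read_cpufreq` is a hypothetical helper name, and the default path is simply the directory mentioned in the question.

```python
from pathlib import Path


def read_cpufreq(directory="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Return {file name: contents} for every regular file in the cpufreq directory.

    An empty dict means the directory does not exist, which is what one would
    expect when cpufreq is not configured or no driver is loaded.
    """
    values = {}
    base = Path(directory)
    if not base.is_dir():
        return values
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError as err:
            # Some sysfs attributes are root-only or write-only.
            values[entry.name] = "<unreadable: %s>" % err.strerror
    return values
```

Printing each `name: value` pair from the returned dict gives roughly the same information as `cat` over the directory, with the file names kept alongside the values.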
Created attachment 11050 cpufreq values when pci=noacpi option is used.
Here is the output of cat /sys/devices/system/cpu/cpu0/cpufreq/* on 2.6.21-rc5 with pci=noacpi.
2.6.21-rc5 with CONFIG_CPU_FREQ=n hangs during the booting process, no matter whether pci=noacpi is included or not. If pci=noacpi is included as a parameter, it seems to fail when a processor limit event happens. If it is not included, the hang happens when ACPI probes the PCI bus. So it looks like cpufreq is essential to make the x86_64 boot happen, strange. Should I try other tests?
Booting with nmi_watchdog=2 works if pci=noacpi is there too, but the freeze still happens if no pci=noacpi parameter is set (in rc5). nmi_watchdog=1 behaves the same way.
Correction to comment #28: I can boot if pci=noacpi is set and cpufreq is not configured ... the hang at probing the PCI bus is apparently random, and quite infrequent. I wonder if that is another symptom of a common underlying problem? However, with pci=noacpi left off, leaving cpufreq unconfigured does not help on my configuration.
I am starting to wonder if the fundamental issue is that some piece of chipset IRQ-related state is just not set up properly on x86_64 machines via the ACPI path, but is set by accident on the pci=noacpi path. Has anyone who knows the details of the lspci -vv settings looked at the various ones included above for issues? If I had a live CD kernel version of 2.6.21-rc5 to try in i386 mode, one could see this. That's a lot of work for me to figure out how to make, but I may try if I have to.
Just tested 2.6.21-rc6 to see if any of the changes fixed the problem observed here. rc6 boots, but starting X still generates the same freeze. If I set pci=noacpi, the system does not freeze when starting X. I'd love to know if there is something I can try to debug this. Is there a way to monitor the appropriate things (via firescope?) as the starting of X occurs?
Fooled around with .config for 2.6.18.8 until it boots properly on the machine. It also requires pci=noacpi to get through the X startup without freezing. I note that /proc/interrupts looks *very* different when booting with no pci= argument, though. Instead of assigning ints up through IRQ 23 as has been true for 2.6.20 and 2.6.21, it assigns numbers like 50, 58, 203, ... I can provide the /proc/interrupts, dmesg, etc. for 2.6.18.8 if that will help at all.
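To see exactly which devices move when the IRQ numbering changes between two boots, two saved /proc/interrupts captures can be compared by device name. The following is a rough sketch under stated assumptions: it expects the 2.6.2x-era layout (one counter column per CPU, then a single-token controller column such as IO-APIC-fasteoi, then a comma-separated device list), and the function names are hypothetical.

```python
def irq_by_device(text):
    """Map each device name in a /proc/interrupts capture to its IRQ number."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    mapping = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skip NMI:, LOC:, ERR: and blank lines
        fields = rest.split()
        # Assumed layout: ncpu counters, one controller token, then devices.
        devices = " ".join(fields[ncpu + 1:])
        for dev in devices.split(","):
            dev = dev.strip()
            if dev:
                mapping[dev] = int(label)
    return mapping


def irq_changes(before, after):
    """Return {device: (irq_before, irq_after)} for devices whose IRQ differs."""
    a, b = irq_by_device(before), irq_by_device(after)
    return {dev: (a[dev], b[dev]) for dev in sorted(a.keys() & b.keys()) if a[dev] != b[dev]}
```

Feeding it the captures from a boot with and without pci=noacpi (or from 2.6.18.8 and 2.6.21) would list only the devices whose IRQ differs. If a device name appears on more than one line, the last line wins.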
I'm seeing something that very much resembles this bug. X freezes on startup exactly like described here, and it has happened on every 2.6.20 and 2.6.21 kernel I've tried. Haven't yet tried a 2.6.19 kernel, but everything older (2.6.18 and down) works.
Like you, I have a dual-core x86_64 system (Athlon X2 3800+). Besides that, an Asus A8N-SLI Premium motherboard (nForce4 CK804 chipset), and an NVIDIA GeForce 7800 GT card. My distro is Slackware 11.0.
I haven't done as thorough testing as you (just tried NVIDIA's own binary drivers), so I haven't submitted any bug reports until now. After reading this, I tried acpi=noirq, but the kernel doesn't even boot then (freezes printing info on my hard drive), so I couldn't test that.
One thing though: When X crashes on startup, it leaves the following in /var/log/Xorg.0.log:
Backtrace:
0: X(xf86SigHandler+0x8a) [0x8088b2a]
1: [0xffffe420]
2: X(main+0x536) [0x80d47a6]
3: /lib/tls/libc.so.6(__libc_start_main+0xd4) [0xb7df7e14]
4: X [0x806ff61]
Fatal server error:
Caught signal 11. Server aborting
Please consult the The X.Org Foundation support at http://wiki.X.Org for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.
Does the same thing happen for you?
Forget what I just posted; it appears that my problems were caused by the evdev USB mouse driver. The physical device was hardcoded in xorg.conf, and the device number changed after 2.6.18, resulting in X trying to use the power button as a mouse. :-) After updating the device path, X works again.
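Since event device numbers can shift between kernels, one way to double-check a hardcoded path is to look the device up by name in /proc/bus/input/devices. Here is a small sketch, assuming the usual layout of that file (blank-line-separated blocks with `N: Name="..."` and `H: Handlers=...` lines); `event_nodes` is a hypothetical helper name, and the device names in any real output depend on the hardware.

```python
import re


def event_nodes(devices_text):
    """Map device name -> list of /dev/input/eventN paths.

    `devices_text` is the content of /proc/bus/input/devices. If two devices
    share a name, the later block overwrites the earlier one.
    """
    result = {}
    for block in devices_text.strip().split("\n\n"):
        name = re.search(r'^N: Name="(.*)"$', block, re.MULTILINE)
        handlers = re.search(r"^H: Handlers=(.*)$", block, re.MULTILINE)
        if not name or not handlers:
            continue
        events = [h for h in handlers.group(1).split() if h.startswith("event")]
        result[name.group(1)] = ["/dev/input/" + e for e in events]
    return result
```

Running this on the file's contents under the old and the new kernel should show whether the mouse and the power button swapped event numbers.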
I'd like to add that I'm encountering this bug as well on my HP dv9317cl. I don't boot into X so the freeze happens on the command line for me. I can stabilize the system by using noapic or pci=noacpi. Booting without either of those options typically gets me to a login prompt, and I can use the system briefly. It seems that the more interrupts that are being generated, the sooner it freezes. A sure fire way to make it freeze has been to load ehci-hcd and run dmesg repeatedly; something about the scrolling text makes the system lock in a few seconds. The freeze occurs without ehci-hcd also, it just takes longer, and I think the ehci relation is a different bug. I'm rather inexperienced in reporting bugs, so I'll attempt to follow Mr. Reed's example. I'm also willing to provide any additional information desired, and attempt any resolution/test that has an under 50% chance of destroying my hardware. Well... maybe 25%...
Created attachment 11629 DSDT from dv9317cl
DSDT from my dv9317cl via acpidump -b -t DSDT and iasl -d
Created attachment 11630 /proc/interrupts without pci=noacpi option
Created attachment 11631 dmesg output without pci=noacpi option
Created attachment 11632 acpidump output on dv9317cl
Created attachment 11633 lspci -vv output without pci=noacpi option
I am now operating reasonably well, having done 3 things:
1) Running kernel 2.6.22-rc3 (in 2.6.20 versions, I was able to use pci=routeirq as the only kernel option once I did the steps below, apparently initializing the IOAPIC better than it was being set up without that option). Also upgraded to the F.28 BIOS, which has been released on the HP website.
2) Removing the hwclock --systohc call from my shutdown scripts (it causes a hang at shutdown otherwise).
3) Using the nvidia proprietary X screen driver.
Here's my theory:
(1) The F.28 BIOS apparently changes a lot in the ACPI area. It has more tables and different ones. It may also initialize the IOAPIC differently when ACPI is turned on, thus obviating the need for pci=routeirq. 2.6.22 fixes lots of sata-nv and timer related bugs (zillions of changes in that space, and I don't want to bother doing git bisect).
(2) relates to the *very* different code in i386 and x86_64 for basic interrupts and rtc functions - maintainers should get their act together, but who am I to criticize.
(3) is incredibly hard to debug, because it happens after the screen is turned off, there is no serial port, and firescope doesn't work because of the hard hang. I am hoping to play with "nouveau" rather than "nv".
Given the timer problems, I am waiting for the tickless x86_64 updates before I do more poking than I have been doing.
[For those who want a workaround, pull 2.6.22-rc3, do the "make oldconfig" with your old .config, and make sure you have the F.28 BIOS. This allowed me to boot with no special boot parameters, leaving me still with problems with X and /dev/rtc - but not showstoppers.]
In any case, for people running compatible x86_64
Upgrading BIOS resolved my issues as well. hwclock freezes the system; easy to work around. Fast scrolling text in textmode still freezes up, but that's an easy situation to avoid. X works flawlessly. Thank you.
I think I have the same problem.
Hardware: HP dv9000 laptop, dv9000 GH769EA according to bios and dv9398eu according to hp update utility, bios F.38, AMD64x2 Turion, nvidia graphics, nforce MCP51 chipset
The computer freezes almost every time it boots. I use ubuntu 7.04 with the linux-image-2.6.20-16-generic package. I have also tried the linux-image-2.6.22-9-generic package, which results in the same problem. If the computer boots (maybe 1 time in 100) and gets to the kde login, then it works without any problem. So it is probably some problem during boot; I will explain later during which steps the computer freezes. Sometimes I get the following error message on the console just before the crash:
error receiving uevent message: No buffer space available
I can make the computer boot if the 'noapic' or 'acpi=off' or 'acpi=noirq' kernel option is used, but then it has other problems:
The noapic option makes the ehci-hcd driver behave very strangely. It takes a lot of cpu and receives a lot of interrupts. After a while it crashes unless the irqpoll option is specified.
The acpi=off or acpi=noirq options make the computer boot too, but now the bcm43xx gives an error message when it tries to allocate irq 0! and the Xorg nvidia driver fails and gives an error message about level-triggered irq, but nothing freezes.
I have tried to boot the computer with the init=/bin/bash option and stripped down the initrd to contain only thermal, processor, fan, fbcon, tileblit, font, bitblit, softcursor, vesafb, cfbcopyarea, cfbimgblt, cfbfillrect, capability, commoncap, sd_mod, ext3, jbd, mbcache, sata_nv, libata, scsi_mod. sd_mod, ext3, jbd, mbcache, sata_nv, libata and scsi_mod are necessary for boot, and the rest can't be disabled by ubuntu. Now the computer starts bash and everything works fine until:
* I run hwclock or udev; then the computer crashes most of the time.
* I run dmesg or any other program that produces a lot of output to the console while the ehci or ohci modules are loaded. The computer then freezes.
This behavior disappears if the pci=nomsi option is added, but if udev is started (and doesn't crash) the computer becomes very sensitive to console output again.
Right now I am running the computer with the noapic option and with the ehci-hcd module disabled. But I want to make use of the full performance of the computer.
I heard that there is a bug in the hp bios. I asked the people in the hp linux forum. I got the answer "Using Linux certified hardware avoids these issues.".
What is the last console message when it freezes?
The last console message is most of the time "* Loading hardware drivers...". It is printed from /etc/rcS.d/S10udev just before /sbin/udevtrigger is run. And sometimes the last console message is "* Setting up consolefont and keymap... [OK]". This is printed from /etc/rcS.d/S49console-setup and is the last message before S50hwclock is run. A few times the last console message is random, but I believe this is because a lot of output to the console also makes the computer freeze, as I described above. I forgot to mention that running hwclock works as long as no other modules (only the ones listed above) are loaded.
The problem with udevtrigger and hwclock disappears if I compile a kernel without RTC support. The console problem is still there with or without RTC.
I want to reopen this bug because I still consider it a problem. I can also open a new thread if that is better.