Bug 8723

Summary: Kernel reboots when booted with APIC enabled on an asus a7v8x-x motherboard
Product: Platform Specific/Hardware Reporter: warm blooded creature (warmbloodedcreature)
Component: i386Assignee: Ingo Molnar (mingo)
Status: CLOSED CODE_FIX    
Severity: normal CC: bunk, mingo, protasnb, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22-rc7 Subsystem:
Regression: --- Bisected commit-id:
Attachments: kernel 2.6.23-rc1 dmesg
kernel 2.6.23-rc1 dmidecode
kernel 2.6.23-rc1 lspci -vvnn
ubuntu-7.10-apic-verbose.log
ubuntu-7.10-apic-verbose2.log
ioapic debug patch
kernel log after ioapic debug patch was applied
ioapic debug patch, #2
kernel output with ioapic debug patch #2 applied
sync_arb_ids() workaround
kernel log with ioapic debug patch 2 and sync_arb_ids() workaround patch applied
successful boot to shell with sync_arb_ids() workaround
ioapic debug patch, #2, B
cfs-debug-info-2007.11.24-07.54.54
kernel log of successful boot to shell with ioapic debug patch #2 B
x86: fix sync_Arb_ids() related bootup crash
syslog from a debian 3.1r6 kernel 2.4.27-3-386 net install cd boot
crash the system artificially
kernel log with x86-crash-it-again.patch applied

Description warm blooded creature 2007-07-08 14:24:30 UTC
Most recent kernel where this bug did not occur:
Distribution: Ubuntu 7.04

Hardware Environment: Motherboard is an Asus A7V8X-X with the latest available BIOS installed.  CPU is an Athlon XP 1800+

Software Environment:
Problem Description:
When I try booting a distribution like Ubuntu 7.04 while APIC enabled in my BIOS, linux just reboots when I try to boot it.  If I turn off APIC either in my bios or by using the "noapic" boot command line option, linux boots normally.

Steps to reproduce:
1.  Enable APIC in the BIOS (which seems to be turned on by default in my bios whenever I reset it back to its defaults)

2.  Attempt to boot a kernel with APIC enabled.

A bunch of lines from the kernel outputs for a second or so (that I can't read fast enough before the reboot) then my system reboots.
Comment 1 warm blooded creature 2007-07-29 09:52:13 UTC
Created attachment 12192 [details]
kernel 2.6.23-rc1 dmesg

Kernel 2.6.23-rc1 dmesg output after turning off APIC
Comment 2 warm blooded creature 2007-07-29 09:53:11 UTC
Created attachment 12193 [details]
kernel 2.6.23-rc1 dmidecode

dmidecode done while APIC was turned off in kernel 2.6.23-rc1
Comment 3 warm blooded creature 2007-07-29 09:54:00 UTC
Created attachment 12194 [details]
kernel 2.6.23-rc1 lspci -vvnn

lspci -vvnn done in kernel 2.6.23-rc1 with APIC turned off.
Comment 4 Natalie Protasevich 2007-10-18 21:57:13 UTC
Warm blooded creature, (lovely name:)
Can you boot it with any other OS? any other kernel? Sometimes this can be caused by faulty hardware. (I just had the same happened to my dell (dual xeon), had to replace motherboard.)
Comment 5 warm blooded creature 2007-10-20 07:14:34 UTC
I tested Ubuntu 7.04, Ubuntu 7.10, Debian 4.0r1 (the net install iso), Fedora 7 Live, Slax 5.1.8, and OpenSuse 10.3.  Every single distribution I tested, except for Slax, reboots on my system within a second or so after attempting to boot their kernel.  Every single distribution I tested boots fine when booted with either "acpi=off", "noapic", or "nolapic" or when booted when I have "APIC" turned off in my BIOS settings.

Ubuntu 7.04's kernel 2.6.20-16-generic reboots.  Ubuntu 7.10's kernel 2.6.22-14-generic reboots.  Debian 4.0r1's kernel 2.6.18-5-486 in the net install iso reboots.  OpenSuse 10.3's install disc 1 with kernel 2.6.22.5-31-default reboots.  Fedora 7 Live with kernel 2.6.21-1.3194.fc7 reboots.

The only distribution that successfully booted with APIC turned on in my BIOS and without passing any special boot options is Slax 5.1.8 with its kernel 2.6.16, but looking at its kernel messages, it looks like its kernel was compiled with APIC support since I see no mention of "APIC" anywhere in its boot messages like I do with all the other distributions I tested.

Once booted with "acpi=off", "noapic", or "nolapic" or with "APIC" turned off in the BIOS, the system runs fine.
Comment 6 warm blooded creature 2007-10-20 07:16:14 UTC
argh.... For Slax 5.1.8, I meant.... it looks like its kernel was
compiled *without* APIC support since I see no mention of "APIC" anywhere in its
boot messages like I do with all the other distributions I tested.
Comment 7 Thomas Gleixner 2007-11-14 15:09:55 UTC
That's an odd one. Do you have a chance to capture the kernel output via serial console (i.e. does the box have a serial port and do you have a second box with a serial port to capture the output ?

If yes, please add

earlyprintk=serial,ttyS0,115200

to the kernel command line and connect the box to a second one with a null modem cable. On the second box please capture the output of the kernel.

Of course without the options, which make it boot :) Please add apic=verbose as well, so we might get some more info on that.

Thanks,

       tglx
Comment 8 warm blooded creature 2007-11-16 17:37:36 UTC
I don't have any way of connecting this machine to another machine via a serial port at this time.  Is there any other way of capturing output?  like maybe dumping it to my printer (via the parallel port) and I could scan it and upload it or something?
Comment 9 Ingo Molnar 2007-11-17 08:00:06 UTC
you could try the following: boot with boot_delay=1000, this causes a 1 
second pause after each printk'd line. Then do a digital camera picture 
of the last screen you get before the spontaneous reboot (hopefully 
there's such lines - also booting with 'debug' and apic=verbose will 
help get out as much info as possible). You probably have to try it 
twice to figure out the exact place to do the picture at.

Or you could try to make a digital movie if your camera can do that, and 
then pick out last frame before the reboot clears the screen.
Comment 10 warm blooded creature 2007-11-21 12:00:27 UTC
I got a null modem serial cable and was able to capture the output you needed from my machine with APIC turned on in my BIOS.  I used the following kernel command line:
earlyprintk=serial,ttyS0,115200 apic=verbose debug
in Ubuntu 7.10's latest "generic" kernel version 2.6.22-14-generic (according to "uname -a") which lists as kernel 2.6.22-14.46 in synaptic in Ubuntu 7.10.
Comment 11 warm blooded creature 2007-11-21 12:03:28 UTC
Created attachment 13676 [details]
ubuntu-7.10-apic-verbose.log

Serial port kernel boot log capture of Ubuntu 7.10 kernel 2.6.22-14 with APIC turned on in my BIOS.
Comment 12 Ingo Molnar 2007-11-21 12:07:54 UTC
your log ends with:

[   42.465564] console handover: boot [earlyser0] -> real [tty0]

you can 'keep' the early-printk serial console via adding 'keep' to the 
boot options, i.e. using:

  earlyprintk=serial,ttyS0,115200,keep

that way we should be able to see the last line before the kernel 
rebooted.
Comment 13 warm blooded creature 2007-11-21 12:49:33 UTC
Created attachment 13678 [details]
ubuntu-7.10-apic-verbose2.log

added "keep" to the earlyprintk part of the kernel command line and the log output stops at:
[  248.560422] ENABLING IO-APIC IRQs
Comment 14 Ingo Molnar 2007-11-21 23:18:52 UTC
> output stops at:
> [  248.560422] ENABLING IO-APIC IRQs

hm, debugging such spontaneous reboots is generally very hard. There's 
no crash in the log - just at a certain stage the kernel went poof and 
all we see is the last kernel message during bootup.

could you try the patch below? This wont resolve the crash but it might 
give us some new entries in the serial log, hopefully narrowing down the 
problem some more.

---
 arch/i386/kernel/io_apic.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Index: linux-cfs-2.6.22.13.q/arch/i386/kernel/io_apic.c
===================================================================
--- linux-cfs-2.6.22.13.q.orig/arch/i386/kernel/io_apic.c
+++ linux-cfs-2.6.22.13.q/arch/i386/kernel/io_apic.c
@@ -20,6 +20,8 @@
  *	Paul Diefenbaugh	:	Added full ACPI support
  */
 
+#define DD()	printk("%s:%d\n", __FILE__, __LINE__)
+
 #include <linux/mm.h>
 #include <linux/interrupt.h>
 #include <linux/init.h>
@@ -1787,13 +1789,16 @@ static void __init setup_ioapic_ids_from
 	unsigned char old_id;
 	unsigned long flags;
 
+	DD();
 	/*
 	 * Don't check I/O APIC IDs for xAPIC systems.  They have
 	 * no meaning without the serial APIC bus.
 	 */
 	if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
-		|| APIC_XAPIC(apic_version[boot_cpu_physical_apicid]))
+		|| APIC_XAPIC(apic_version[boot_cpu_physical_apicid])) {
+		DD();
 		return;
+	}
 	/*
 	 * This is broken; anything with a real cpu count has to
 	 * circumvent this idiocy regardless.
@@ -1805,10 +1810,12 @@ static void __init setup_ioapic_ids_from
 	 */
 	for (apic = 0; apic < nr_ioapics; apic++) {
 
+		DD();
 		/* Read the register 0 value */
 		spin_lock_irqsave(&ioapic_lock, flags);
 		reg_00.raw = io_apic_read(apic, 0);
 		spin_unlock_irqrestore(&ioapic_lock, flags);
+		DD();
 		
 		old_id = mp_ioapics[apic].mpc_apicid;
 
@@ -1848,6 +1855,7 @@ static void __init setup_ioapic_ids_from
 		}
 
 
+		DD();
 		/*
 		 * We need to adjust the IRQ routing table
 		 * if the ID changed.
@@ -1857,6 +1865,7 @@ static void __init setup_ioapic_ids_from
 				if (mp_irqs[i].mpc_dstapic == old_id)
 					mp_irqs[i].mpc_dstapic
 						= mp_ioapics[apic].mpc_apicid;
+		DD();
 
 		/*
 		 * Read the right value from the MPC table and
@@ -2329,14 +2338,22 @@ void __init setup_IO_APIC(void)
 	/*
 	 * Set up IO-APIC IRQ routing.
 	 */
-	if (!acpi_ioapic)
+	if (!acpi_ioapic) {
+		DD();
 		setup_ioapic_ids_from_mpc();
+		DD();
+	}
 	sync_Arb_IDs();
+	DD();
 	setup_IO_APIC_irqs();
+	DD();
 	init_IO_APIC_traps();
+	DD();
 	check_timer();
+	DD();
 	if (!acpi_ioapic)
 		print_IO_APIC();
+	DD();
 }
 
 static int __init setup_disable_8254_timer(char *s)
Comment 15 warm blooded creature 2007-11-22 04:07:36 UTC
I tried to apply the patch you sent with "patch -p1 < ../apic-debug.patch" (while I was in the kernel's extracted source folder) to Ubuntu's kernel 2.6.22 source and also tried with kernel 2.6.23.8's source and got several failures when trying to apply the patch.  I am not that good with applying kernel patches so maybe I am not doing it right.

Here is what I get when attempting to apply it to Ubuntu kernel 2.6.22's source and when I attempt to apply it to kernel 2.6.23.8's source:
patching file arch/i386/kernel/io_apic.c
Hunk #1 succeeded at 20 with fuzz 1.
Hunk #2 FAILED at 1789.
Hunk #3 FAILED at 1810.
Hunk #4 FAILED at 1855.
Hunk #5 FAILED at 1865.
Hunk #6 FAILED at 2338.
5 out of 6 hunks FAILED -- saving rejects to file arch/i386/kernel/io_apic.c.rej
Comment 16 Ingo Molnar 2007-11-22 07:37:01 UTC
Created attachment 13694 [details]
ioapic debug patch

I think it did not apply because when you got the patch from this bugzilla entry it got whitespace damaged.

I've created a bugzilla attachment (ioapic-debug.patch), could you try that one, does it apply? I used a 2.6.22.13 kernel as a base.
Comment 17 warm blooded creature 2007-11-22 11:57:01 UTC
I was successfully able to apply the patch you attached to kernel 2.6.22.13 source that I downloaded from www.kernel.org.
Comment 18 warm blooded creature 2007-11-22 11:57:39 UTC
Created attachment 13699 [details]
kernel log after ioapic debug patch was applied
Comment 19 Ingo Molnar 2007-11-23 04:48:08 UTC
Created attachment 13709 [details]
ioapic debug patch, #2

Hm, the new log shows the same "last line" as before - so we must have crashed in a really weird way. Could you try the attached ioapic-debug-2.patch (first unapply ioapic-debug.patch) and re-capture the serial log?

Spontaneous reboots are hideously hard to debug - if you are interested you might want to add DD()s to the code yourself too to narrow it down within io_apic.c.

But even then, the crash itself might happen in the first interrupt after we enable the IO-APIC - which makes the crash even harder to debug.
Comment 20 warm blooded creature 2007-11-23 09:47:54 UTC
I unapplied the first ioapic debug patch and applied the ioapic debug patch #2.
Comment 21 warm blooded creature 2007-11-23 09:49:14 UTC
Created attachment 13716 [details]
kernel output with ioapic debug patch #2 applied
Comment 22 Ingo Molnar 2007-11-23 09:55:22 UTC
Created attachment 13717 [details]
sync_arb_ids() workaround

thanks.

Could you now apply the attached sync-arb-ids-fix.patch in addition (ontop) of ioapic-debug-2.patch? Does it get any further in the bootup?

either the sync_arb_ids() call itself does something goofy - in which case the bootup might proceed far further than it did before - or a pure access to the local APIC is crashing the box - in which case you should see a few additional lines but not much more. Your next serial capture will give us some more ideas.
Comment 23 warm blooded creature 2007-11-23 11:26:03 UTC
With the sync_arb_ids() workaround, it no longer spontaneously reboots with APIC on and gets MUCH farther along in the boot process, but I now get a kernel panic shortly after mounting one of my filesystems only when APIC is on.  Comparing the kernel logs between a "noapic" boot and this APIC boot, the kernel panic on the APIC boot appears to happen right around the same time that the "noapic" boot would output "spurious 8259A interrupt: IRQ7".
Comment 24 warm blooded creature 2007-11-23 11:27:40 UTC
Created attachment 13722 [details]
kernel log with ioapic debug patch 2 and sync_arb_ids() workaround patch applied
Comment 25 warm blooded creature 2007-11-23 12:46:48 UTC
It turned out that the kernel panic "[   39.673201] Kernel panic - not syncing: Attempted to kill init!" that occured after applying the sync_arb_ids() workaround only occured if I booted using the "earlyprintk" kernel command line option.  Removing "earlyprintk" on the kernel with the sync_arb_ids() workaround patch successfully boots to a shell.   I guess something is loading shortly after mounting one of my filesystems that doesn't like the "earlyprintk" option.
Comment 26 warm blooded creature 2007-11-23 12:48:29 UTC
Created attachment 13723 [details]
successful boot to shell with sync_arb_ids() workaround
Comment 27 Ingo Molnar 2007-11-23 23:17:27 UTC
nice progress! I still dont see why we reboot spontaneously though.

to have a fuller picture about your system, could you run this script 
and send me the resulting file:

  http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

this will collect a good deal of hardware and environment information. 
Perhaps the CPU version gives us a clue.

the panic you got with earlyprintk is weird and might suggest hw 
flakiness. I've never seen serial-printk cause problems before. Has the 
system a history of good reliability?
Comment 28 Ingo Molnar 2007-11-23 23:33:07 UTC
Created attachment 13725 [details]
ioapic debug patch, #2, B

an additional thing to try: could you please apply the attached ioapic-debug-2b.patch, in addition to (ontop of) the already applied ioapic-debug-2.patch and sync-arb-ids.patch, and see what earlyprintk output you get?

this 2b debug patch tests whether we can safely query the APIC_LVR (Version Register) without crashing the CPU - and prints out its contents. This register should be unconditionally accessible. Then we test the contents APIC_ICR (Command Register) - this is another register that sync_Arb_ids() accesses. So via this patch we open-code most of sync_Arb_ids()s APIC accesses - we just do it in a more debuggable way. Lets see whether it crashes, and if it crashes, where.

(sidenote: if earlyprintk=...,keep is causing problems then you might want to try the console=ttyS0,115200 boot option - this way you can do serial logging after the early-serial console stops its output. If you use this boot option then remove the 'keep' parameter from the earlyprintk option - otherwise you'll get every line twice on the serial line.)
Comment 29 Ingo Molnar 2007-11-23 23:35:50 UTC
another question. You wrote this:

-------------------->
Most recent kernel where this bug did not occur:
Distribution: Ubuntu 7.04
<--------------------

could you specify which exact kernel version (kernel package version) the last working kernel (which booted fine with APIC enabled in the BIOS) was?

Does that kernel still boot fine if you enable APIC support in the BIOS? If yes then that would exclude hardware flakiness.
Comment 30 warm blooded creature 2007-11-24 05:28:51 UTC
My system has a kernel panic after mounting one of my partitions only when using the "keep" option of earlyprintk ("earlyprintk=serial,ttyS0,115200,keep").  Removing "keep" in earlyprintk (using "earlyprintk=serial,ttyS0,115200") and adding "console=ttyS0,115200" boots to a shell without having a kernel panic and console output all the way to the shell gets output without any problems over the serial cable.

I applied your ioapic debug patch #2 B in addition to ioapic debug patch #2 and the sync-arb-ids patch before testing earlyprintk without the "keep" but with the "console=" kernel command line options and before running your cfs-debug-info.sh script.  My system boots to a shell successfully with the ioapic debug patch #2 B applied without the need for a serial capture to get its kernel output, so I ran cfs-debug-info.sh without the "earlyprintk" and "console" kernel command line options.
Comment 31 warm blooded creature 2007-11-24 05:31:24 UTC
Created attachment 13727 [details]
cfs-debug-info-2007.11.24-07.54.54

cfs-debug-info.sh run on a boot of kernel 2.6.22.13 with ioapic debug patch #2, ioapic debug patch #2 B, and sync-arb-ids patch applied to the kernel but without the earlyprintk or console kernel command line options and with APIC enabled in my BIOS.
Comment 32 warm blooded creature 2007-11-24 05:42:56 UTC
Created attachment 13728 [details]
kernel log of successful boot to shell with ioapic debug patch #2 B

kernel log of successful boot to shell with ioapic debug patch #2 B, ioapic debug patch #2, and sync_arb_ids() patch.
Comment 33 warm blooded creature 2007-11-24 05:52:56 UTC
(In reply to comment #27)
> nice progress! I still dont see why we reboot spontaneously though.
> 
> to have a fuller picture about your system, could you run this script 
> and send me the resulting file:
> 
>   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> 
> this will collect a good deal of hardware and environment information. 
> Perhaps the CPU version gives us a clue.
> 
> the panic you got with earlyprintk is weird and might suggest hw 
> flakiness. I've never seen serial-printk cause problems before. Has the 
> system a history of good reliability?
> 

Yes, this system is very reliable in linux as long as I boot with "noapic" or turn off APIC in my bios before attempting to boot Linux.
Comment 34 Ingo Molnar 2007-11-24 07:50:05 UTC
Created attachment 13731 [details]
x86: fix sync_Arb_ids() related bootup crash

great. Your latest bootlog confirms that on your CPU the local APIC version is 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU.

could you please unapply all the extra patches you got from this bugzilla and apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is hopefully the real and final fix for this bug.

What i dont understand yet is why 2.6.20 worked fine on your box before with APIC enabled in the BIOS. (my understanding is that Ubuntu 7.04 used the 2.6.20 kernel) The sync_Arb_ids() call was in Linux for an eternity.
Comment 35 warm blooded creature 2007-11-24 08:17:47 UTC
(In reply to comment #34)
> Created an attachment (id=13731) [details]
> x86: fix sync_Arb_ids() related bootup crash
> 
> great. Your latest bootlog confirms that on your CPU the local APIC version
> is
> 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall
> into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems
> and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU.
> 
> could you please unapply all the extra patches you got from this bugzilla and
> apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is
> hopefully the real and final fix for this bug.
> 
> What i dont understand yet is why 2.6.20 worked fine on your box before with
> APIC enabled in the BIOS. (my understanding is that Ubuntu 7.04 used the
> 2.6.20
> kernel) The sync_Arb_ids() call was in Linux for an eternity.
> 

Sorry for the confusion.  The "Most recent kernel where this bug did not occur:" line was meant to be a blank line.  I skipped over it since I didn't know which kernel worked with APIC enabled on this machine (I've practically always had to use "noapic" with 2.6.x kernels of almost every distribution I tried out).  The "Distribution:" line was the most current distribution where I still encountered the failure at the time.  Maybe I should have just completely deleted the other line if it was going to be left blank instead of leaving it in.  But, I did eventually find a kernel that did work fine with APIC enabled on my machine and it is....2.4.27-3-386 from Debian 3.1r6.  The 2.4.x series kernels in older distributions appears to boot fine without spontaneously rebooting with APIC turned on, but in their logs it looks like the "ACPI interpreter" is disabled (if I am reading them right).

I am going to unapply all the other patches and try only the new one and see if it fixes this bug.
Comment 36 warm blooded creature 2007-11-24 09:57:23 UTC
(In reply to comment #34)
> Created an attachment (id=13731) [details]
> x86: fix sync_Arb_ids() related bootup crash
> 
> great. Your latest bootlog confirms that on your CPU the local APIC version
> is
> 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall
> into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems
> and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU.
> 
> could you please unapply all the extra patches you got from this bugzilla and
> apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is
> hopefully the real and final fix for this bug.

With all other patches removed and the x86-fix-sync-arb-ids.patch applied, I successfully was able to boot my machine with APIC on in my BIOS settings! :)
Comment 37 Ingo Molnar 2007-11-24 11:30:07 UTC
> But, I did eventually find a kernel that did work fine with APIC 
> enabled on my machine and it is....2.4.27-3-386 from Debian 3.1r6.  
> The 2.4.x series kernels in older distributions appears to boot fine 
> without spontaneously rebooting with APIC turned on, but in their logs 
> it looks like the "ACPI interpreter" is disabled (if I am reading them 
> right).

ok, that makes a lot more sense. I suspect due to having no ACPI there's 
no IO-APIC support either? If you boot a 2.4 kernel successfully, how 
does /proc/interrupts (and dmesg) look like? IRQ numbers above 15 would 
mean IO-APIC support is present.

so my guess is that this was a showstopper bug in Linux IO-APIC support 
ever since the beginning of times (i see the same sync-arb-ids problem 
in the 2.4 code too), so is the first time in history an Athlon-XP was 
booted with Linux in APIC+IO-APIC mode. (they are not MP capable so APIC 
support was rarely required) Congratulations! :-)
Comment 38 warm blooded creature 2007-11-24 12:07:52 UTC
Created attachment 13738 [details]
syslog from a debian 3.1r6 kernel 2.4.27-3-386 net install cd boot

log of older kernel 2.4.27-3-386 in debian 3.1r6 booting with APIC enabled.
Comment 39 warm blooded creature 2007-11-24 12:15:12 UTC
(In reply to comment #37)
> ok, that makes a lot more sense. I suspect due to having no ACPI there's 
> no IO-APIC support either? If you boot a 2.4 kernel successfully, how 
> does /proc/interrupts (and dmesg) look like? IRQ numbers above 15 would 
> mean IO-APIC support is present.

/proc/interrupts appears to stop at 15, but I didn't go very far in debian 3.1r6's net install before copying /proc/interrupts from its shell.

/proc/interrupts:
           CPU0       
  0:      10874          XT-PIC  timer
  1:        656          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:         42          XT-PIC  usb-uhci, usb-uhci, usb-uhci, ehci_hcd
  6:          9          XT-PIC  floppy
 14:         13          XT-PIC  ide0
 15:        273          XT-PIC  ide1
NMI:          0 
LOC:      10837 
ERR:          0
MIS:          0
Comment 40 Ingo Molnar 2007-11-25 05:49:59 UTC
> /proc/interrupts:
>            CPU0       
>   0:      10874          XT-PIC  timer
>   1:        656          XT-PIC  keyboard
>   2:          0          XT-PIC  cascade
>   3:         42          XT-PIC  usb-uhci, usb-uhci, usb-uhci, ehci_hcd
>   6:          9          XT-PIC  floppy
>  14:         13          XT-PIC  ide0
>  15:        273          XT-PIC  ide1

thx - that's a clearly non-APIC bootup - all lines say "XT-PIC".

Your IOAPIC bootup has this:

           CPU0
  0:         64   IO-APIC-edge      timer
  1:         27   IO-APIC-edge      i8042
  6:          3   IO-APIC-edge      floppy
  7:          0   IO-APIC-edge      parport0
  8:          3   IO-APIC-edge      rtc
  9:          0   IO-APIC-fasteoi   acpi
 12:        102   IO-APIC-edge      i8042
 14:      13748   IO-APIC-edge      ide0
 15:         51   IO-APIC-edge      ide1
 16:        128   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ehci_hcd:usb4
 18:          0   IO-APIC-fasteoi   EMU10K1

which is a nice IO-APIC setup. I think it's all explained now - this 
type of hardware never ran in true IOAPIC mode under Linux in the past, 
due to this bug - the 2.4 kernel hid this fact by virtue of not having 
good enough ACPI support. Congratulations :-)

We've queued up this fix in the x86 tree for 2.6.24 - and later on it 
should become part of 2.6.23 (and 2.6.22) as well.
Comment 41 Ingo Molnar 2007-11-25 09:39:10 UTC
Created attachment 13745 [details]
crash the system artificially

one final thing: could you please try the attached x86-crash-it-again.patch and capture an earlyprintk serial log? The expectation would be a spontaneous reboot, with the last printout of "before APIC ICR write".
Comment 42 warm blooded creature 2007-11-25 11:09:47 UTC
Created attachment 13746 [details]
kernel log with x86-crash-it-again.patch applied

serial port capture of kernel 2.6.22.13 output.  All other patches were removed and only "x86-crash-it-again.patch" was applied to this kernel.  APIC was enabled in the BIOS.
Comment 43 warm blooded creature 2007-11-25 11:13:27 UTC
It rebooted right after outputting "before APIC ICR write".
Comment 44 Ingo Molnar 2007-11-25 12:20:07 UTC
> It rebooted right after outputting "before APIC ICR write".

great! I think we've now successfully verified it from various angles 
that this is indeed causing this problem and that the fix does the right 
thing.
Comment 45 Natalie Protasevich 2007-11-25 12:58:21 UTC
(This was spectacular, Ingo - and warm blooded creature... Congratulations to the team!)
Comment 46 Ingo Molnar 2007-11-27 05:02:32 UTC
the fix is now upstream:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f44d9efd3510776216938fef84adc99cc0e12412

and should thus show up in 2.6.24-rc4.