Bug 8723
Description
warm blooded creature
2007-07-08 14:24:30 UTC
Created attachment 12192 [details]
kernel 2.6.23-rc1 dmesg
Kernel 2.6.23-rc1 dmesg output after turning off APIC
Created attachment 12193 [details]
kernel 2.6.23-rc1 dmidecode
dmidecode done while APIC was turned off in kernel 2.6.23-rc1
Created attachment 12194 [details]
kernel 2.6.23-rc1 lspci -vvnn
lspci -vvnn done in kernel 2.6.23-rc1 with APIC turned off.
Warm blooded creature, (lovely name:) Can you boot it with any other OS? any other kernel? Sometimes this can be caused by faulty hardware. (I just had the same happened to my dell (dual xeon), had to replace motherboard.) I tested Ubuntu 7.04, Ubuntu 7.10, Debian 4.0r1 (the net install iso), Fedora 7 Live, Slax 5.1.8, and OpenSuse 10.3. Every single distribution I tested, except for Slax, reboots on my system within a second or so after attempting to boot their kernel. Every single distribution I tested boots fine when booted with either "acpi=off", "noapic", or "nolapic" or when booted when I have "APIC" turned off in my BIOS settings. Ubuntu 7.04's kernel 2.6.20-16-generic reboots. Ubuntu 7.10's kernel 2.6.22-14-generic reboots. Debian 4.0r1's kernel 2.6.18-5-486 in the net install iso reboots. OpenSuse 10.3's install disc 1 with kernel 2.6.22.5-31-default reboots. Fedora 7 Live with kernel 2.6.21-1.3194.fc7 reboots. The only distribution that successfully booted with APIC turned on in my BIOS and without passing any special boot options is Slax 5.1.8 with its kernel 2.6.16, but looking at its kernel messages, it looks like its kernel was compiled with APIC support since I see no mention of "APIC" anywhere in its boot messages like I do with all the other distributions I tested. Once booted with "acpi=off", "noapic", or "nolapic" or with "APIC" turned off in the BIOS, the system runs fine. argh.... For Slax 5.1.8, I meant.... it looks like its kernel was compiled *without* APIC support since I see no mention of "APIC" anywhere in its boot messages like I do with all the other distributions I tested. That's an odd one. Do you have a chance to capture the kernel output via serial console (i.e. does the box have a serial port and do you have a second box with a serial port to capture the output ? If yes, please add earlyprintk=serial,ttyS0,115200 to the kernel command line and connect the box to a second one with a null modem cable. On the second box please capture the output of the kernel. Of course without the options, which make it boot :) Please add apic=verbose as well, so we might get some more info on that. Thanks, tglx I don't have any way of connecting this machine to another machine via a serial port at this time. Is there any other way of capturing output? like maybe dumping it to my printer (via the parallel port) and I could scan it and upload it or something? you could try the following: boot with boot_delay=1000, this causes a 1 second pause after each printk'd line. Then do a digital camera picture of the last screen you get before the spontaneous reboot (hopefully there's such lines - also booting with 'debug' and apic=verbose will help get out as much info as possible). You probably have to try it twice to figure out the exact place to do the picture at. Or you could try to make a digital movie if your camera can do that, and then pick out last frame before the reboot clears the screen. I got a null modem serial cable and was able to capture the output you needed from my machine with APIC turned on in my BIOS. I used the following kernel command line: earlyprintk=serial,ttyS0,115200 apic=verbose debug in Ubuntu 7.10's latest "generic" kernel version 2.6.22-14-generic (according to "uname -a") which lists as kernel 2.6.22-14.46 in synaptic in Ubuntu 7.10. Created attachment 13676 [details]
ubuntu-7.10-apic-verbose.log
Serial port kernel boot log capture of Ubuntu 7.10 kernel 2.6.22-14 with APIC turned on in my BIOS.
your log ends with: [ 42.465564] console handover: boot [earlyser0] -> real [tty0] you can 'keep' the early-printk serial console via adding 'keep' to the boot options, i.e. using: earlyprintk=serial,ttyS0,115200,keep that way we should be able to see the last line before the kernel rebooted. Created attachment 13678 [details]
ubuntu-7.10-apic-verbose2.log
added "keep" to the earlyprintk part of the kernel command line and the log output stops at:
[ 248.560422] ENABLING IO-APIC IRQs
> output stops at:
> [ 248.560422] ENABLING IO-APIC IRQs
hm, debugging such spontaneous reboots is generally very hard. There's
no crash in the log - just at a certain stage the kernel went poof and
all we see is the last kernel message during bootup.
could you try the patch below? This wont resolve the crash but it might
give us some new entries in the serial log, hopefully narrowing down the
problem some more.
---
arch/i386/kernel/io_apic.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
Index: linux-cfs-2.6.22.13.q/arch/i386/kernel/io_apic.c
===================================================================
--- linux-cfs-2.6.22.13.q.orig/arch/i386/kernel/io_apic.c
+++ linux-cfs-2.6.22.13.q/arch/i386/kernel/io_apic.c
@@ -20,6 +20,8 @@
* Paul Diefenbaugh : Added full ACPI support
*/
+#define DD() printk("%s:%d\n", __FILE__, __LINE__)
+
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/init.h>
@@ -1787,13 +1789,16 @@ static void __init setup_ioapic_ids_from
unsigned char old_id;
unsigned long flags;
+ DD();
/*
* Don't check I/O APIC IDs for xAPIC systems. They have
* no meaning without the serial APIC bus.
*/
if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
- || APIC_XAPIC(apic_version[boot_cpu_physical_apicid]))
+ || APIC_XAPIC(apic_version[boot_cpu_physical_apicid])) {
+ DD();
return;
+ }
/*
* This is broken; anything with a real cpu count has to
* circumvent this idiocy regardless.
@@ -1805,10 +1810,12 @@ static void __init setup_ioapic_ids_from
*/
for (apic = 0; apic < nr_ioapics; apic++) {
+ DD();
/* Read the register 0 value */
spin_lock_irqsave(&ioapic_lock, flags);
reg_00.raw = io_apic_read(apic, 0);
spin_unlock_irqrestore(&ioapic_lock, flags);
+ DD();
old_id = mp_ioapics[apic].mpc_apicid;
@@ -1848,6 +1855,7 @@ static void __init setup_ioapic_ids_from
}
+ DD();
/*
* We need to adjust the IRQ routing table
* if the ID changed.
@@ -1857,6 +1865,7 @@ static void __init setup_ioapic_ids_from
if (mp_irqs[i].mpc_dstapic == old_id)
mp_irqs[i].mpc_dstapic
= mp_ioapics[apic].mpc_apicid;
+ DD();
/*
* Read the right value from the MPC table and
@@ -2329,14 +2338,22 @@ void __init setup_IO_APIC(void)
/*
* Set up IO-APIC IRQ routing.
*/
- if (!acpi_ioapic)
+ if (!acpi_ioapic) {
+ DD();
setup_ioapic_ids_from_mpc();
+ DD();
+ }
sync_Arb_IDs();
+ DD();
setup_IO_APIC_irqs();
+ DD();
init_IO_APIC_traps();
+ DD();
check_timer();
+ DD();
if (!acpi_ioapic)
print_IO_APIC();
+ DD();
}
static int __init setup_disable_8254_timer(char *s)
I tried to apply the patch you sent with "patch -p1 < ../apic-debug.patch" (while I was in the kernel's extracted source folder) to Ubuntu's kernel 2.6.22 source and also tried with kernel 2.6.23.8's source and got several failures when trying to apply the patch. I am not that good with applying kernel patches so maybe I am not doing it right. Here is what I get when attempting to apply it to Ubuntu kernel 2.6.22's source and when I attempt to apply it to kernel 2.6.23.8's source: patching file arch/i386/kernel/io_apic.c Hunk #1 succeeded at 20 with fuzz 1. Hunk #2 FAILED at 1789. Hunk #3 FAILED at 1810. Hunk #4 FAILED at 1855. Hunk #5 FAILED at 1865. Hunk #6 FAILED at 2338. 5 out of 6 hunks FAILED -- saving rejects to file arch/i386/kernel/io_apic.c.rej Created attachment 13694 [details]
ioapic debug patch
I think it did not apply because when you got the patch from this bugzilla entry it got whitespace damaged.
I've created a bugzilla attachment (ioapic-debug.patch), could you try that one, does it apply? I used a 2.6.22.13 kernel as a base.
I was successfully able to apply the patch you attached to kernel 2.6.22.13 source that I downloaded from www.kernel.org. Created attachment 13699 [details]
kernel log after ioapic debug patch was applied
Created attachment 13709 [details]
ioapic debug patch, #2
Hm, the new log shows the same "last line" as before - so we must have crashed in a really weird way. Could you try the attached ioapic-debug-2.patch (first unapply ioapic-debug.patch) and re-capture the serial log?
Spontaneous reboots are hideously hard to debug - if you are interested you might want to add DD()s to the code yourself too to narrow it down within io_apic.c.
But even then, the crash itself might happen in the first interrupt after we enable the IO-APIC - which makes the crash even harder to debug.
I unapplied the first ioapic debug patch and applied the ioapic debug patch #2. Created attachment 13716 [details]
kernel output with ioapic debug patch #2 applied
Created attachment 13717 [details]
sync_arb_ids() workaround
thanks.
Could you now apply the attached sync-arb-ids-fix.patch in addition (ontop) of ioapic-debug-2.patch? Does it get any further in the bootup?
either the sync_arb_ids() call itself does something goofy - in which case the bootup might proceed far further than it did before - or a pure access to the local APIC is crashing the box - in which case you should see a few additional lines but not much more. Your next serial capture will give us some more ideas.
With the sync_arb_ids() workaround, it no longer spontaneously reboots with APIC on and gets MUCH farther along in the boot process, but I now get a kernel panic shortly after mounting one of my filesystems only when APIC is on. Comparing the kernel logs between a "noapic" boot and this APIC boot, the kernel panic on the APIC boot appears to happen right around the same time that the "noapic" boot would output "spurious 8259A interrupt: IRQ7". Created attachment 13722 [details]
kernel log with ioapic debug patch 2 and sync_arb_ids() workaround patch applied
It turned out that the kernel panic "[ 39.673201] Kernel panic - not syncing: Attempted to kill init!" that occured after applying the sync_arb_ids() workaround only occured if I booted using the "earlyprintk" kernel command line option. Removing "earlyprintk" on the kernel with the sync_arb_ids() workaround patch successfully boots to a shell. I guess something is loading shortly after mounting one of my filesystems that doesn't like the "earlyprintk" option. Created attachment 13723 [details]
successful boot to shell with sync_arb_ids() workaround
nice progress! I still dont see why we reboot spontaneously though. to have a fuller picture about your system, could you run this script and send me the resulting file: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh this will collect a good deal of hardware and environment information. Perhaps the CPU version gives us a clue. the panic you got with earlyprintk is weird and might suggest hw flakiness. I've never seen serial-printk cause problems before. Has the system a history of good reliability? Created attachment 13725 [details]
ioapic debug patch, #2, B
an additional thing to try: could you please apply the attached ioapic-debug-2b.patch, in addition to (ontop of) the already applied ioapic-debug-2.patch and sync-arb-ids.patch, and see what earlyprintk output you get?
this 2b debug patch tests whether we can safely query the APIC_LVR (Version Register) without crashing the CPU - and prints out its contents. This register should be unconditionally accessible. Then we test the contents APIC_ICR (Command Register) - this is another register that sync_Arb_ids() accesses. So via this patch we open-code most of sync_Arb_ids()s APIC accesses - we just do it in a more debuggable way. Lets see whether it crashes, and if it crashes, where.
(sidenote: if earlyprintk=...,keep is causing problems then you might want to try the console=ttyS0,115200 boot option - this way you can do serial logging after the early-serial console stops its output. If you use this boot option then remove the 'keep' parameter from the earlyprintk option - otherwise you'll get every line twice on the serial line.)
another question. You wrote this: --------------------> Most recent kernel where this bug did not occur: Distribution: Ubuntu 7.04 <-------------------- could you specify which exact kernel version (kernel package version) the last working kernel (which booted fine with APIC enabled in the BIOS) was? Does that kernel still boot fine if you enable APIC support in the BIOS? If yes then that would exclude hardware flakiness. My system has a kernel panic after mounting one of my partitions only when using the "keep" option of earlyprintk ("earlyprintk=serial,ttyS0,115200,keep"). Removing "keep" in earlyprintk (using "earlyprintk=serial,ttyS0,115200") and adding "console=ttyS0,115200" boots to a shell without having a kernel panic and console output all the way to the shell gets output without any problems over the serial cable. I applied your ioapic debug patch #2 B in addition to ioapic debug patch #2 and the sync-arb-ids patch before testing earlyprintk without the "keep" but with the "console=" kernel command line options and before running your cfs-debug-info.sh script. My system boots to a shell successfully with the ioapic debug patch #2 B applied without the need for a serial capture to get its kernel output, so I ran cfs-debug-info.sh without the "earlyprintk" and "console" kernel command line options. Created attachment 13727 [details]
cfs-debug-info-2007.11.24-07.54.54
cfs-debug-info.sh run on a boot of kernel 2.6.22.13 with ioapic debug patch #2, ioapic debug patch #2 B, and sync-arb-ids patch applied to the kernel but without the earlyprintk or console kernel command line options and with APIC enabled in my BIOS.
Created attachment 13728 [details]
kernel log of successful boot to shell with ioapic debug patch #2 B
kernel log of successful boot to shell with ioapic debug patch #2 B, ioapic debug patch #2, and sync_arb_ids() patch.
(In reply to comment #27) > nice progress! I still dont see why we reboot spontaneously though. > > to have a fuller picture about your system, could you run this script > and send me the resulting file: > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh > > this will collect a good deal of hardware and environment information. > Perhaps the CPU version gives us a clue. > > the panic you got with earlyprintk is weird and might suggest hw > flakiness. I've never seen serial-printk cause problems before. Has the > system a history of good reliability? > Yes, this system is very reliable in linux as long as I boot with "noapic" or turn off APIC in my bios before attempting to boot Linux. Created attachment 13731 [details]
x86: fix sync_Arb_ids() related bootup crash
great. Your latest bootlog confirms that on your CPU the local APIC version is 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU.
could you please unapply all the extra patches you got from this bugzilla and apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is hopefully the real and final fix for this bug.
What i dont understand yet is why 2.6.20 worked fine on your box before with APIC enabled in the BIOS. (my understanding is that Ubuntu 7.04 used the 2.6.20 kernel) The sync_Arb_ids() call was in Linux for an eternity.
(In reply to comment #34) > Created an attachment (id=13731) [details] > x86: fix sync_Arb_ids() related bootup crash > > great. Your latest bootlog confirms that on your CPU the local APIC version > is > 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall > into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems > and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU. > > could you please unapply all the extra patches you got from this bugzilla and > apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is > hopefully the real and final fix for this bug. > > What i dont understand yet is why 2.6.20 worked fine on your box before with > APIC enabled in the BIOS. (my understanding is that Ubuntu 7.04 used the > 2.6.20 > kernel) The sync_Arb_ids() call was in Linux for an eternity. > Sorry for the confusion. The "Most recent kernel where this bug did not occur:" line was meant to be a blank line. I skipped over it since I didn't know which kernel worked with APIC enabled on this machine (I've practically always had to use "noapic" with 2.6.x kernels of almost every distribution I tried out). The "Distribution:" line was the most current distribution where I still encountered the failure at the time. Maybe I should have just completely deleted the other line if it was going to be left blank instead of leaving it in. But, I did eventually find a kernel that did work fine with APIC enabled on my machine and it is....2.4.27-3-386 from Debian 3.1r6. The 2.4.x series kernels in older distributions appears to boot fine without spontaneously rebooting with APIC turned on, but in their logs it looks like the "ACPI interpreter" is disabled (if I am reading them right). I am going to unapply all the other patches and try only the new one and see if it fixes this bug. (In reply to comment #34) > Created an attachment (id=13731) [details] > x86: fix sync_Arb_ids() related bootup crash > > great. Your latest bootlog confirms that on your CPU the local APIC version > is > 0x10 - which is lower than the 0x14 check that modern_apic() uses. So we fall > into the ICR writes of sync_Arb_ids() - which was never needed on AMD systems > and crashes it. So lets just return from sync_Arb_ids() if we see an AMD CPU. > > could you please unapply all the extra patches you got from this bugzilla and > apply only the new one i've attached: x86-fix-sync-arb-ids.patch. This is > hopefully the real and final fix for this bug. With all other patches removed and the x86-fix-sync-arb-ids.patch applied, I successfully was able to boot my machine with APIC on in my BIOS settings! :)
> But, I did eventually find a kernel that did work fine with APIC
> enabled on my machine and it is....2.4.27-3-386 from Debian 3.1r6.
> The 2.4.x series kernels in older distributions appears to boot fine
> without spontaneously rebooting with APIC turned on, but in their logs
> it looks like the "ACPI interpreter" is disabled (if I am reading them
> right).
ok, that makes a lot more sense. I suspect due to having no ACPI there's
no IO-APIC support either? If you boot a 2.4 kernel successfully, how
does /proc/interrupts (and dmesg) look like? IRQ numbers above 15 would
mean IO-APIC support is present.
so my guess is that this was a showstopper bug in Linux IO-APIC support
ever since the beginning of times (i see the same sync-arb-ids problem
in the 2.4 code too), so is the first time in history an Athlon-XP was
booted with Linux in APIC+IO-APIC mode. (they are not MP capable so APIC
support was rarely required) Congratulations! :-)
Created attachment 13738 [details]
syslog from a debian 3.1r6 kernel 2.4.27-3-386 net install cd boot
log of older kernel 2.4.27-3-386 in debian 3.1r6 booting with APIC enabled.
(In reply to comment #37) > ok, that makes a lot more sense. I suspect due to having no ACPI there's > no IO-APIC support either? If you boot a 2.4 kernel successfully, how > does /proc/interrupts (and dmesg) look like? IRQ numbers above 15 would > mean IO-APIC support is present. /proc/interrupts appears to stop at 15, but I didn't go very far in debian 3.1r6's net install before copying /proc/interrupts from its shell. /proc/interrupts: CPU0 0: 10874 XT-PIC timer 1: 656 XT-PIC keyboard 2: 0 XT-PIC cascade 3: 42 XT-PIC usb-uhci, usb-uhci, usb-uhci, ehci_hcd 6: 9 XT-PIC floppy 14: 13 XT-PIC ide0 15: 273 XT-PIC ide1 NMI: 0 LOC: 10837 ERR: 0 MIS: 0
> /proc/interrupts:
> CPU0
> 0: 10874 XT-PIC timer
> 1: 656 XT-PIC keyboard
> 2: 0 XT-PIC cascade
> 3: 42 XT-PIC usb-uhci, usb-uhci, usb-uhci, ehci_hcd
> 6: 9 XT-PIC floppy
> 14: 13 XT-PIC ide0
> 15: 273 XT-PIC ide1
thx - that's a clearly non-APIC bootup - all lines say "XT-PIC".
Your IOAPIC bootup has this:
CPU0
0: 64 IO-APIC-edge timer
1: 27 IO-APIC-edge i8042
6: 3 IO-APIC-edge floppy
7: 0 IO-APIC-edge parport0
8: 3 IO-APIC-edge rtc
9: 0 IO-APIC-fasteoi acpi
12: 102 IO-APIC-edge i8042
14: 13748 IO-APIC-edge ide0
15: 51 IO-APIC-edge ide1
16: 128 IO-APIC-fasteoi uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ehci_hcd:usb4
18: 0 IO-APIC-fasteoi EMU10K1
which is a nice IO-APIC setup. I think it's all explained now - this
type of hardware never ran in true IOAPIC mode under Linux in the past,
due to this bug - the 2.4 kernel hid this fact by virtue of not having
good enough ACPI support. Congratulations :-)
We've queued up this fix in the x86 tree for 2.6.24 - and later on it
should become part of 2.6.23 (and 2.6.22) as well.
Created attachment 13745 [details]
crash the system artificially
one final thing: could you please try the attached x86-crash-it-again.patch and capture an earlyprintk serial log? The expectation would be a spontaneous reboot, with the last printout of "before APIC ICR write".
Created attachment 13746 [details]
kernel log with x86-crash-it-again.patch applied
serial port capture of kernel 2.6.22.13 output. All other patches were removed and only "x86-crash-it-again.patch" was applied to this kernel. APIC was enabled in the BIOS.
It rebooted right after outputting "before APIC ICR write".
> It rebooted right after outputting "before APIC ICR write".
great! I think we've now successfully verified it from various angles
that this is indeed causing this problem and that the fix does the right
thing.
(This was spectacular, Ingo - and warm blooded creature... Congratulations to the team!) the fix is now upstream: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f44d9efd3510776216938fef84adc99cc0e12412 and should thus show up in 2.6.24-rc4. |