Distribution: RH9
Hardware Environment: dual-PII 450MHz, Intel 440BX, PIIX4
Software Environment:
Problem Description: pre6, pre7 and pre8 fail to boot past Uncompressing kernel. Reverting the ACPI changes from pre6 allows it to boot.
Created attachment 1232 [details] output from dmidecode
Created attachment 1233 [details] output from acpidmp
Created attachment 1234 [details] dmesg log
Kernel pre8 boots with this patch from Marcelo.

NOTE: I have also tested 2.4.22 with ACPI enabled and it fails in the same way. I guess that's why I disabled it. It must have been quite a few kernels back, as I only have a vague memory of booting problems with ACPI enabled.

> --- drivers/acpi/Config.in.orig 2003-10-27 14:33:03.000000000 -0200
> +++ drivers/acpi/Config.in 2003-10-27 14:33:16.000000000 -0200
> @@ -32,10 +32,6 @@
>  tristate ' Toshiba Laptop Extras' CONFIG_ACPI_TOSHIBA
>  bool ' Debug Statements' CONFIG_ACPI_DEBUG
>  bool ' Relaxed AML Checking' CONFIG_ACPI_RELAXED_AML
> - else
> - if [ "$CONFIG_SMP" = "y" ]; then
> - define_bool CONFIG_ACPI_BOOT y
> - fi
>  fi
>
>  endmenu
>
Created attachment 1289 [details] patch to make dmi year cutoff effective for HT part of ACPI

There are 2 bugs here.

#1: This box is crashing early on, apparently in ACPI initialization. We started including this code on SMP configs in -pre6, so that must be why you noticed the regression then.

Did any previous kernels work on this box with CONFIG_ACPI enabled?
Do you have a serial console?
Do you have a 7-segment display or a port-80 debug card we can write numbers to, so we can figure out where the crash is?

Please build a kernel like so:

CONFIG_SMP=n
CONFIG_ACPI=y
CONFIG_X86_LOCAL_APIC=n

and see if it boots with acpi=force. If it works, then the generic table parsing code is okay and the trouble is in the LAPIC part.

#2: By default this box should not be running any ACPI code.

Manufacturer: Dell Computer Corporation
Product Name: Precision WorkStation 410 MT
Version: A14
Release Date: 08/16/00

This is the latest BIOS, but it is older than the DMI ACPI cutoff date of 2001.

Please test the attached patch (by itself) and verify that it addresses the symptom.
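The year cutoff described in #2 can be pictured with a small sketch. This is only an illustration under stated assumptions, not the kernel's DMI code: the names `bios_year`, `acpi_allowed_by_dmi` and `ACPI_CUTOFF_YEAR` are hypothetical, and so is the rule used here for two-digit years.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant: BIOSes dated before this year get no ACPI by default. */
#define ACPI_CUTOFF_YEAR 2001

/*
 * Extract the year from a DMI release date such as "08/16/00" or
 * "08/16/2000". Returns 0 if no year can be found.
 * Assumption of this sketch: a two-digit year 00..79 is read as 20xx
 * and 80..99 as 19xx.
 */
static int bios_year(const char *date)
{
    const char *slash = strrchr(date, '/');
    char *end;
    long year;

    if (slash == NULL || slash[1] == '\0')
        return 0;
    year = strtol(slash + 1, &end, 10);
    if (end == slash + 1 || *end != '\0' || year < 0)
        return 0;
    if (year < 100)
        year += (year < 80) ? 2000 : 1900;
    return (int)year;
}

/*
 * ACPI is used if acpi=force was given or the BIOS is new enough.
 * An unparseable date (year 0) is treated as too old in this sketch.
 */
static int acpi_allowed_by_dmi(const char *date, int acpi_force)
{
    if (acpi_force)
        return 1;
    return bios_year(date) >= ACPI_CUTOFF_YEAR;
}
```

With the release date reported for this box, `acpi_allowed_by_dmi("08/16/00", 0)` is false, while passing a non-zero `acpi_force` mirrors booting with acpi=force.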
Created attachment 1299 [details] dmesg output for test case #1

The system boots for test case #1. I had to enable uniprocessor local APIC support, which I assume is what you wanted. I'll attach the .config next so you can check it if required. This is with acpi=force.

I don't have a 7-seg display or debug card. Could do a serial console.
Created attachment 1300 [details] .config file for test case #1
Created attachment 1302 [details] dmesg output for test case #2

The system boots normally with the patch applied.
Thanks for verifying that the patch for issue #2 (SMP kernel crash) is correct. I'm glad to see that out of the box the SMP kernel will "do the right thing" for you now.

Thanks also for using acpi=force to test issue #1 (root cause) on this system. You showed that ACPI can indeed boot on this box on the UP kernel and that the generic table parsing code is not the problem.

Thanks for attaching the .config. Please update it to say =y for these:

# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
# CONFIG_ACPI_BUTTON is not set
# CONFIG_ACPI_THERMAL is not set

I believe that, as requested, the test for issue #1 excluded CONFIG_X86_LOCAL_APIC; otherwise we'd see lines like this in the dmesg:

ACPI: APIC (v001 DELL WS 410 0x00000002 ASL 0x00000061) @ 0x(nil)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4])

I think that, with your inclusion of CONFIG_X86_LOCAL_APIC=y, if you run "make oldconfig" (and then make dep), CONFIG_X86_LOCAL_APIC will be added to the config and to the uni-processor build. And that is what we want to try next: acpi=force on a UP kernel with uni-processor LOCAL_APIC.

If that succeeds, then the next step is to add the IO_APIC to the mix. Add this to .config:

CONFIG_X86_UP_IOAPIC=y

Run make oldconfig and that will add CONFIG_X86_IO_APIC=y.

If this succeeds, then the only difference between the success and failure case is enabling CONFIG_SMP itself.

Yes, if you can enable the serial console to capture the console during the failure, that will be useful.

---
BTW.
"ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])"
tells us that the ACPI SCI has non-standard polarity on this system in APIC mode (it specifies active high instead of active low).

I'd be interested if, when you boot the ACPI configs, you can attach /proc/interrupts. If the acpi interrupt shows zero, please press the power button and check /proc/interrupts again (this is why I asked for CONFIG_ACPI_BUTTON=y above). RH9 has no acpid, so it should just register an acpi event, but no shutdown or poweroff action should be taken.

thanks,
-Len
Created attachment 1327 [details] .config with LOCAL_APIC enabled. Fails to boot

System fails to boot with LOCAL_APIC enabled.
Created attachment 1331 [details] test patch to return after parsing the MADT

Thanks for running that test, Tony. It narrows the cause down to either parsing the MADT itself, or the LAPIC entries themselves.

Please repeat the test with this patch applied -- it will parse the MADT, but not process the LAPIC entries. If successful, the system will boot and will display some additional messages about what it found when it parsed the APIC tables.
Created attachment 1332 [details] test patch to return after parsing the MADT oops, gave you that patch against 2.6, here it is again against 2.4.23
Created attachment 1334 [details] patch to return error if ACPI_NMI asks for LINT != 1

> ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4])

Hmm, 1 LAPIC_NMI entry for a 2-processor system, and it requests LINT _4_.

If the previous patch boots, please attach its dmesg, remove it and try this one. If this one boots, please attach the dmesg.

thanks,
-Len
Created attachment 1339 [details] dmesg with MADT only parsing patch
Created attachment 1340 [details] /proc/interrupts with MADT only patch applied. Count started at zero, then I hit the power button twice.
The system fails to boot with just the LINT patch applied. There is no serial console output.
Created attachment 1346 [details] test: return after MADT parse, fix LAPIC != 1, leave SCI level sensitive

Thanks for testing the ACPI SCI, Tony. Please apply this patch to the clean source tree and see if the power button still gives us acpi interrupts when we leave the ACPI SCI at edge triggered. (Attach /proc/interrupts after pressing the power button n times.)
Created attachment 1347 [details] test: four returns to isolate failure Thanks for verifying that we get past the MADT parse. Here's a patch to return from acpi_boot_init() a little later -- in 4 places. If the 1st one works, then comment it out to see if we get to the 2nd one, then 3rd & 4th. Please attach the dmesg from the one that gets the farthest. TIA! -Len
Created attachment 1359 [details] dmesg output with edge triggered patch
Created attachment 1360 [details] /proc/interrupts, edge triggered

Count started at 0, hit power button twice, went to two.
Ref: four returns patch

It doesn't get past the first one. I did a bit of testing and it appears not to get past the while() loop in function acpi_table_parse_madt_family. I added this 'patch' (this isn't a real patch, just a cut'n'paste):

  entry = (acpi_table_entry_header *)
          ((unsigned long) madt + madt_size);
+ printk(KERN_ERR PREFIX "entry: %p\n", entry);
+ printk(KERN_ERR PREFIX "madt: %p\n", madt);
+ printk(KERN_ERR PREFIX "madt_end: %p\n ", madt_end);
+ return -ENODEV;
  while (((unsigned long) entry) < madt_end) {

The kernel outputs the following during boot:

ACPI: RSDP (v000 DELL ) @ 0x000fdee0
ACPI: RSDT (v001 DELL WS 410 0x00000002 ASL 0x00000061) @ 0x000fdef4
ACPI: FADT (v001 DELL WS 410 0x00000002 ASL 0x00000061) @ 0x000fdf20
ACPI: MADT (v001 DELL WS 410 0x00000002 ASL 0x00000061) @ 0x000fdf94
ACPI: DSDT (v001 DELL dt_ex 0x00001000 MSFT 0x0100000b) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: entry: c00fdfc0
ACPI: madt: c00fdf94
ACPI: madt_end: c00fdffa

It appears these entries differ from the above table entries, i.e. 0xc00fdf94 != 0x000fdf94. I assume that is a problem, but don't know for sure.
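For readers comparing the two sets of addresses: on 32-bit x86 the kernel reaches low physical memory through a fixed offset, so printed kernel pointers would be expected to differ from the physical addresses in the table listing. A minimal sketch, assuming the default 3G/1G split (offset 0xC0000000); the helper names `phys_to_virt32` and `virt_to_phys32` are made up for this example.

```c
#include <stdint.h>

/* Assumed default on 32-bit x86: low memory is mapped starting at 3 GiB. */
#define PAGE_OFFSET 0xC0000000u

/* Kernel virtual address for a low-memory physical address. */
static uint32_t phys_to_virt32(uint32_t phys)
{
    return phys + PAGE_OFFSET;
}

/* Physical address behind a low-memory kernel virtual address. */
static uint32_t virt_to_phys32(uint32_t virt)
{
    return virt - PAGE_OFFSET;
}
```

Under that assumption `phys_to_virt32(0x000fdf94)` gives 0xc00fdf94, the `madt` pointer printed above. The same output also places the first entry 0x2c (44) bytes into the table and the end of the table 0x66 (102) bytes from its start.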
Thanks for verifying that the ACPI/SCI Edge Trigger patch works. Nice to have made it to a point where we get output on a failure.

0xc00fdf94 is okay -- it is the virtual mapping for 0x000fdf94 physical.

The root cause of the crash is that the MADT, while having a correct checksum, has garbled entries at the end.

461514 is LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4]). It has reserved bits set in the flags, and lint=4 is invalid.

62 starts the next entry: 6 is an IO-SAPIC!, and length 2 means there is no data after the header...

51 is the next "entry": 5 is a local APIC address override. Size 1 is bogus, as it is smaller than the 2-byte header it includes...

This data will cause the loop to go right off the end of the table.

It would be interesting to see whether the acpidmp output changes if you reload your BIOS. It would also be interesting to see what Windows 2000 or WinXP do with this. BIOS date is 08/16/00, so maybe the box was never tested with win2k before it shipped?

I'll think about a way to make the kernel more resilient to bad tables. I think in this case the goal would be for the kernel to recognize that the MADT is garbled and to cleanly disable ACPI.
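The reading of the LAPIC_NMI bytes above can be reproduced with a short decoder. This is a sketch, not kernel code: the struct and function names are hypothetical, and it assumes "461514" stands for the six bytes 04 06 01 05 01 04, with the 16-bit flags stored little-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded view of a 6-byte MADT LAPIC_NMI entry (type 4). */
struct lapic_nmi {
    uint8_t  type;     /* 4 for LAPIC_NMI */
    uint8_t  length;   /* 6 for LAPIC_NMI */
    uint8_t  acpi_id;  /* processor the NMI applies to */
    uint16_t flags;    /* bits 0-1 polarity, bits 2-3 trigger, rest reserved */
    uint8_t  lint;     /* local APIC LINT pin, 0 or 1 */
};

/* Decode raw bytes; returns 0 on success, -1 if this is not a LAPIC_NMI. */
static int parse_lapic_nmi(const uint8_t *p, size_t avail, struct lapic_nmi *out)
{
    if (avail < 6 || p[0] != 4 || p[1] != 6)
        return -1;
    out->type = p[0];
    out->length = p[1];
    out->acpi_id = p[2];
    out->flags = (uint16_t)(p[3] | (p[4] << 8));
    out->lint = p[5];
    return 0;
}

static unsigned lapic_nmi_polarity(const struct lapic_nmi *e)
{
    return e->flags & 0x3u;
}

static unsigned lapic_nmi_trigger(const struct lapic_nmi *e)
{
    return (e->flags >> 2) & 0x3u;
}

/* Sane means: no reserved flag bits set and a LINT pin that exists. */
static int lapic_nmi_is_sane(const struct lapic_nmi *e)
{
    if (e->flags & ~0x000Fu)
        return 0;
    return e->lint <= 1;
}
```

Fed the six bytes above, the decoder yields flags 0x0105, which gives polarity 1 and trigger 1 with a reserved bit (bit 8) set, plus lint 4, so `lapic_nmi_is_sane` rejects the entry on both counts.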
Reloaded the A14 BIOS and acpidmp outputs exactly the same info. Haven't tried WinXP on this box. I'll try and give it a go when my new box arrives. Any particular install/setup options?
Created attachment 2223 [details] acpidmp from DELL Workstation 610
Created attachment 2328 [details] acpidmp from Dell Workstation 610
Looks like the 610 has the same problem that the 410 has.

acpixtract APIC acpidmp.610 | madt
ACPI: APIC (v001 DELL WS 610 0x00000002 ASL 0x00000061) @ 0x(nil)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4])
fread: Success
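The polarity[] and trigger[] values in the INT_SRC_OVR lines use the two-bit MPS INTI encodings from the ACPI specification. A minimal sketch of how to read them follows; the function and enum names are made up for this example, and the flags value 0x000D used in the note below is simply polarity 1 and trigger 3 packed together, not a value read from the dump.

```c
#include <stdint.h>

/* Two-bit polarity field (flags bits 0-1). */
enum { POL_BUS_DEFAULT = 0, POL_ACTIVE_HIGH = 1, POL_RESERVED = 2, POL_ACTIVE_LOW = 3 };

/* Two-bit trigger mode field (flags bits 2-3). */
enum { TRIG_BUS_DEFAULT = 0, TRIG_EDGE = 1, TRIG_RESERVED = 2, TRIG_LEVEL = 3 };

static unsigned inti_polarity(uint16_t flags)
{
    return flags & 0x3u;
}

static unsigned inti_trigger(uint16_t flags)
{
    return (flags >> 2) & 0x3u;
}

static const char *polarity_name(unsigned p)
{
    static const char *const names[] = {
        "conforms to bus", "active high", "reserved", "active low"
    };
    return names[p & 0x3u];
}

static const char *trigger_name(unsigned t)
{
    static const char *const names[] = {
        "conforms to bus", "edge", "reserved", "level"
    };
    return names[t & 0x3u];
}
```

For the irq 9 override shown above (polarity[0x1] trigger[0x3], flags 0x000D under the packing assumed here) this decodes to "active high" and "level".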
Hi Tony and Mark,

Could you please try the patch below for the 2.6 kernel and attach your dmesg if your box boots successfully?

--- linux-2.6.2-orig/drivers/acpi/tables.c.orig 2004-04-01 14:09:36.000000000 +0800
+++ linux-2.6.2/drivers/acpi/tables.c 2004-04-01 15:04:16.000000000 +0800
@@ -348,6 +348,9 @@
 		    (!max_entries || count++ < max_entries))
 			handler(entry);
 
+		if (entry->length <= sizeof(acpi_table_entry_header))
+			return -EINVAL;
+
 		entry = (acpi_table_entry_header *)
 			((unsigned long) entry + entry->length);
 	}

Thanks,
-yi
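The idea behind the length check can be tried outside the kernel with a small user-space walker. This is a sketch under assumptions, not the patch itself: `walk_madt_entries` and `ENTRY_HDR_LEN` are hypothetical names, the check runs before an entry is counted rather than after a handler call, and it additionally rejects an entry that would run past the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRY_HDR_LEN 2u /* one type byte plus one length byte */

/*
 * Walk the sub-table entries in buf[0..len).
 * Returns the number of well-formed entries, or -1 if a malformed entry
 * is found, in which case *bad_off is set to its offset.
 */
static int walk_madt_entries(const uint8_t *buf, size_t len, size_t *bad_off)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        size_t left = len - off;
        uint8_t entry_len;

        /* Not even room for a header: the table is truncated. */
        if (left < ENTRY_HDR_LEN) {
            *bad_off = off;
            return -1;
        }
        entry_len = buf[off + 1];
        /* Header-only or shorter (as in the patch), or past the end. */
        if (entry_len <= ENTRY_HDR_LEN || entry_len > left) {
            *bad_off = off;
            return -1;
        }
        count++;
        off += entry_len;
    }
    return count;
}
```

Run over the table tail described earlier in this report (taken here as the bytes 04 06 01 05 01 04, 06 02, 05 01), the walker accepts the 6-byte LAPIC_NMI and then stops at offset 6, where the length-2 entry begins, instead of stepping on into the bogus length-1 entry.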
more folks interested in this failure https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109550
Created attachment 2640 [details] Intensive check for garbled MADT

Hi,

For anyone who can reproduce the bug, would you please help test this patch? It performs an intensive check for a garbled MADT and is supposed to let the system survive one. This patch is against the 2.6.5 kernel. Let me know if you need a 2.4 kernel patch.

I really need your help, and thanks in advance!
-yi
Fix is in 2.6.7. Not fixing in 2.4. Closing.