Bug 1434

Summary: kernel crash reading garbled ACPI table - Dell Precision 410, 610
Product: ACPI
Reporter: Tony Gale (gale)
Component: Config-Tables
Assignee: Zhu Yi (yi.zhu)
Status: CLOSED CODE_FIX
Severity: normal
CC: acpi-bugzilla, Matt_Domsch
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.4.23pre6
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: output from dmidecode
output from acpidmp
dmesg log
patch to make dmi year cutoff effective for HT part of ACPI
dmesg output for test case #1
.config file for test case #1
dmesg output for test case #2
.config with LOCAL_APIC enabled. Fails to boot
test patch to return after parsing the MADT
test patch to return after parsing the MADT
patch to return error if ACPI_NMI asks for LINT != 1
dmesg with MADT only parsing patch
/proc/interrupts with MADT only patch applied.
test: return after MADT parse, fix LAPIC != 1, leave SCI level sensitive
test: four returns to isolate failure
dmesg output with edge triggered patch
/proc/interrupts, edge triggered
acpidmp from DELL Workstation 610
acpidmp from Dell Workstation 610
Intensive check for garbled MADT

Description Tony Gale 2003-10-27 05:10:56 UTC
Distribution: RH9
Hardware Environment: dual-PII 450MHz, Intel 440BX, PIIX4
Software Environment:
Problem Description:

2.4.23-pre6, -pre7 and -pre8 fail to boot past "Uncompressing kernel". Reverting
the ACPI changes from pre6 allows it to boot.
Comment 1 Tony Gale 2003-10-28 01:54:25 UTC
Created attachment 1232 [details]
output from dmidecode
Comment 2 Tony Gale 2003-10-28 01:55:08 UTC
Created attachment 1233 [details]
output from acpidmp
Comment 3 Tony Gale 2003-10-28 02:34:25 UTC
Created attachment 1234 [details]
dmesg log
Comment 4 Tony Gale 2003-10-28 02:38:22 UTC
Kernel pre8 boots with this patch from Marcelo.

NOTE: I have also tested 2.4.22 with ACPI enabled and it fails in the same way. I
guess that's why I disabled it. It must have been quite a few kernels back, as I
only have a vague memory of booting problems with ACPI enabled.

> --- drivers/acpi/Config.in.orig       2003-10-27 14:33:03.000000000 -0200
> +++ drivers/acpi/Config.in    2003-10-27 14:33:16.000000000 -0200
> @@ -32,10 +32,6 @@
>      tristate     '  Toshiba Laptop Extras'   CONFIG_ACPI_TOSHIBA
>      bool         '  Debug Statements'        CONFIG_ACPI_DEBUG
>      bool         '  Relaxed AML Checking'    CONFIG_ACPI_RELAXED_AML
> -  else 
> -    if [ "$CONFIG_SMP" = "y" ]; then
> -      define_bool CONFIG_ACPI_BOOT           y
> -    fi
>    fi
>  
>    endmenu
> 
Comment 5 Len Brown 2003-10-30 08:55:36 UTC
Created attachment 1289 [details]
patch to make dmi year cutoff effective for HT part of ACPI

There are 2 bugs here.

#1: this box is crashing early on, apparently in ACPI initialization.
We started including this code on SMP configs in -pre6, so that must be
why you noticed the regression then.  Did any previous kernels work on
this box with CONFIG_ACPI enabled?  Do you have a serial console?
Do you have a 7-segment display or a port-80 debug card that we can
write numbers to in order to figure out where the crash is?

Please build a kernel like so
CONFIG_SMP=n
CONFIG_ACPI=y
CONFIG_X86_LOCAL_APIC=n
and see if it boots with acpi=force.
If it works, then the generic table parsing code is okay and
the trouble is in the LAPIC part.

#2: by default this box should not be running any ACPI code.
		Manufacturer: Dell Computer Corporation
		Product Name: Precision WorkStation 410 MT
		Version: A14
		Release Date: 08/16/00
This is the latest BIOS, but it is older than the DMI ACPI cutoff date of 2001.
Please test the attached patch (by itself) and verify that it addresses the
symptom.
Comment 6 Tony Gale 2003-10-31 01:41:08 UTC
Created attachment 1299 [details]
dmesg output for test case #1


The system boots for test case #1. I had to enable uniprocessor local APIC
support, which I assume is what you wanted. I'll attach the .config next so you
can check it if required.

This is with acpi=force 

I don't have a 7-segment display or debug card, but I could do a serial console.
Comment 7 Tony Gale 2003-10-31 01:42:12 UTC
Created attachment 1300 [details]
.config file for test case #1
Comment 8 Tony Gale 2003-10-31 02:03:53 UTC
Created attachment 1302 [details]
dmesg output for test case #2


The system boots normally with the patch applied.
Comment 9 Len Brown 2003-10-31 10:32:45 UTC
Thanks for verifying that the patch for issue #2 (SMP kernel crash) is correct. 
I'm glad to see that out of the box the SMP kernel will "do the right thing" for you now. 
 
Thanks also for using acpi=force to test issue#1 (root cause) on this system. 
You showed that ACPI can indeed boot on this box on the UP kernel 
and that the generic table parsing code is not the problem. 
 
Thanks for attaching the .config, please update it to say =y to these: 
# CONFIG_ACPI_AC is not set 
# CONFIG_ACPI_BATTERY is not set 
# CONFIG_ACPI_BUTTON is not set 
# CONFIG_ACPI_THERMAL is not set 
 
I believe that, as requested, the test for issue#1 excluded CONFIG_X86_LOCAL_APIC,
otherwise we'd see lines like this in the dmesg: 
 
ACPI: APIC (v001 DELL    WS 410  0x00000002 ASL  0x00000061) @ 0x(nil) 
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) 
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) 
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0]) 
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0]) 
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3]) 
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4]) 
 
I think that, with your inclusion of CONFIG_X86_LOCAL_APIC=y, if you run
"make oldconfig" (and then make dep), CONFIG_X86_LOCAL_APIC will
be added to the config and to the uni-processor build.  That is what
we want to try next:  acpi=force on a UP kernel with uni-processor LOCAL_APIC.
 
If that succeeds, then the next step is to add the IO_APIC to the mix: 
add this to .config: CONFIG_X86_UP_IOAPIC=y 
run make oldconfig 
and that will add 
CONFIG_X86_IO_APIC=y 
 
If this succeeds, then the only difference between the success and failure case 
is enabling CONFIG_SMP itself. 
 
Yes, if you can enable the serial console to capture the console output during
the failure, that will be useful.
--- 
BTW. "ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])" 
tells us that the ACPI SCI has non-standard polarity on this system in APIC mode. 
(specifies active high instead of active low).  When you boot the ACPI configs,
please attach /proc/interrupts.  If the acpi interrupt count shows
zero, please press the power button and check /proc/interrupts again.
(this is why I asked for CONFIG_ACPI_BUTTON=y above) 
RH9 has no acpid, so it should just register an acpi event but no shutdown 
or poweroff action should be taken. 
 
thanks, 
-Len 
 
 
 
Comment 10 Tony Gale 2003-11-03 04:27:17 UTC
Created attachment 1327 [details]
.config with LOCAL_APIC enabled. Fails to boot


System fails to boot with LOCAL_APIC enabled.
Comment 11 Len Brown 2003-11-03 10:30:37 UTC
Created attachment 1331 [details]
test patch to return after parsing the MADT

Thanks for running that test Tony, it narrows the cause down to either
parsing the MADT itself, or the LAPIC entries themselves.

Please repeat the test with this patch applied -- it will parse the MADT,
but not process the LAPIC entries.
If successful, the system will boot and will display some additional messages
about what it found when it parsed the APIC tables.
Comment 12 Len Brown 2003-11-03 11:04:46 UTC
Created attachment 1332 [details]
test patch to return after parsing the MADT

oops, gave you that patch against 2.6, here it is again against 2.4.23
Comment 13 Len Brown 2003-11-03 11:17:17 UTC
Created attachment 1334 [details]
patch to return error if ACPI_NMI asks for LINT != 1

> ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4]) 

Hmm, 1 LAPIC_NMI entry for a 2-processor system, and it requests LINT _4_.
If the previous patch boots, please attach its dmesg, remove it and try this
one.
If this one boots, please attach the dmesg.

thanks,
-Len
Comment 14 Tony Gale 2003-11-04 03:04:51 UTC
Created attachment 1339 [details]
dmesg with MADT only parsing patch
Comment 15 Tony Gale 2003-11-04 03:14:36 UTC
Created attachment 1340 [details]
/proc/interrupts with MADT only patch applied.

Count started at zero, then I hit the power button twice.
Comment 16 Tony Gale 2003-11-04 03:23:17 UTC
The system fails to boot with just the LINT patch applied. There is no serial
console output.
Comment 17 Len Brown 2003-11-04 11:58:32 UTC
Created attachment 1346 [details]
test: return after MADT parse, fix LAPIC != 1, leave SCI level sensitive 

Thanks for testing the ACPI SCI Tony.
Please apply this patch to the clean source tree and see
if the power button still gives us acpi interrupts when we
leave the ACPI SCI at edge triggered.
(attach /proc/interrupts after pressing the power button a few times)
Comment 18 Len Brown 2003-11-04 12:11:10 UTC
Created attachment 1347 [details]
test: four returns to isolate failure

Thanks for verifying that we get past the MADT parse.
Here's a patch to return from acpi_boot_init()
a little later -- in 4 places.  If the 1st one works,
then comment it out to see if we get to the 2nd one,
then 3rd & 4th.  Please attach the dmesg from
the one that gets the farthest.   TIA! -Len
Comment 19 Tony Gale 2003-11-05 04:09:28 UTC
Created attachment 1359 [details]
dmesg output with edge triggered patch
Comment 20 Tony Gale 2003-11-05 04:10:35 UTC
Created attachment 1360 [details]
/proc/interrupts, edge triggered


The count started at 0; I hit the power button twice and it went to two.
Comment 21 Tony Gale 2003-11-05 05:33:50 UTC
Ref: four returns patch

It doesn't get past the first one.

I did a bit of testing and it appears not to get past the while() loop in
function acpi_table_parse_madt_family. I added this 'patch' (this isn't a real
patch, just a cut'n'paste):

        entry = (acpi_table_entry_header *)
                ((unsigned long) madt + madt_size);

+        printk(KERN_ERR PREFIX "entry: %p\n", entry);
+        printk(KERN_ERR PREFIX "madt: %p\n", madt);
+        printk(KERN_ERR PREFIX "madt_end: %p\n ", madt_end);
+        return -ENODEV;

        while (((unsigned long) entry) < madt_end) {


The kernel outputs the following during boot:

ACPI: RSDP (v000 DELL                                      ) @ 0x000fdee0
ACPI: RSDT (v001 DELL    WS 410  0x00000002 ASL  0x00000061) @ 0x000fdef4
ACPI: FADT (v001 DELL    WS 410  0x00000002 ASL  0x00000061) @ 0x000fdf20
ACPI: MADT (v001 DELL    WS 410  0x00000002 ASL  0x00000061) @ 0x000fdf94
ACPI: DSDT (v001   DELL    dt_ex 0x00001000 MSFT 0x0100000b) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: entry: c00fdfc0
ACPI: madt: c00fdf94
ACPI: madt_end: c00fdffa

It appears these addresses differ from the table addresses above, i.e.

0xc00fdf94 != 0x000fdf94

I assume that is a problem, but don't know for sure.
Comment 22 Len Brown 2003-11-06 00:15:21 UTC
Thanks for verifying that the ACPI/SCI Edge Trigger patch works. 
 
Nice to have made it to a point where we get output on a failure. 
0xc00fdf94 is okay -- it is the kernel virtual mapping for physical address 0x000fdf94.
 
The root cause of the crash is that the MADT, while having correct checksum, 
has garbled entries at the end. 
 
461514 is LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4]) 
it has reserved bits set in the flags, and lint=4 is invalid. 
62 starts the next entry, 6 is an IO-SAPIC!, and length 2 means 
there is no data after the header... 
51 is the next "entry", 5 is a local APIC address override.
size 1 is bogus as it is smaller than the 2-byte header it includes... 
This data will cause the loop to go right off the end of the table. 
 
It would be interesting to see if you reload your BIOS if the 
acpidmp output changes.  It would also be interesting to 
see what Windows 2000 or WinXP do with this. 
BIOS date is 08/16/00, so maybe the box was never 
tested with win2k before it shipped? 
 
I'll think about a way to make the kernel more resilient to bad tables.
I think in this case the goal would be for the kernel to recognize 
that the MADT is garbled and to cleanly disable ACPI. 
 
Comment 23 Tony Gale 2003-11-07 07:01:04 UTC
Reloaded the A14 BIOS and acpidmp outputs exactly the same info.

Haven't tried WinXP on this box. I'll try and give it a go when my new box
arrives. Any particular install/setup options?
Comment 24 Mark 2004-02-24 15:40:22 UTC
Created attachment 2223 [details]
acpidmp from DELL Workstation 610
Comment 25 Mark 2004-03-14 09:27:11 UTC
Created attachment 2328 [details]
acpidmp from Dell Workstation 610
Comment 26 Len Brown 2004-03-30 20:15:26 UTC
Looks like the 610 has the same problem that the 410 has. 
 
acpixtract APIC acpidmp.610 | madt 
ACPI: APIC (v001 DELL    WS 610  0x00000002 ASL  0x00000061) @ 0x(nil) 
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) 
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) 
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0]) 
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0]) 
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3]) 
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x4]) 
fread: Success 
 
Comment 27 Zhu Yi 2004-04-01 01:26:57 UTC
Hi Tony and Mark,

Could you please try the patch below for the 2.6 kernel and attach your dmesg if
your box boots successfully?

--- linux-2.6.2-orig/drivers/acpi/tables.c.orig 2004-04-01 14:09:36.000000000 +0800
+++ linux-2.6.2/drivers/acpi/tables.c   2004-04-01 15:04:16.000000000 +0800
@@ -348,6 +348,9 @@
                    (!max_entries || count++ < max_entries))
                        handler(entry);

+               if (entry->length <= sizeof(acpi_table_entry_header))
+                       return -EINVAL;
+
                entry = (acpi_table_entry_header *)
                        ((unsigned long) entry + entry->length);
        }

Thanks,
-yi
Comment 28 Len Brown 2004-04-14 18:31:06 UTC
More folks are interested in this failure:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109550
Comment 29 Zhu Yi 2004-04-21 20:57:17 UTC
Created attachment 2640 [details]
Intensive check for garbled MADT

Hi,

For anyone who can reproduce the bug, would you please help to test this patch?

This patch does intensive checking for a garbled MADT and is supposed to let
the system survive one.

This patch is against 2.6.5 kernel. Let me know if you need a 2.4 kernel patch.


I really need your help; thanks in advance!

-yi
Comment 30 Len Brown 2004-06-17 22:27:01 UTC
The fix is in 2.6.7.
Not fixing in 2.4.
Closing.