Bug 5930 - 2.6.15 regression - 2nd CPU unused - Serverworks OSB4/Supermicro 370DER
Summary: 2.6.15 regression - 2nd CPU unused - Serverworks OSB4/Supermicro 370DER
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Processors
Hardware: i386 Linux
Importance: P2 normal
Assignee: Venkatesh Pallipadi
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-20 18:38 UTC by Ronald Hummelink
Modified: 2006-02-10 19:02 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.15
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
acpidump from affected system (95.83 KB, text/plain)
2006-01-20 18:40 UTC, Ronald Hummelink
debug patch (698 bytes, patch)
2006-01-22 18:00 UTC, Shaohua
dmesg for the buggy case (15.68 KB, text/plain)
2006-01-23 04:49 UTC, Diego Calleja
dmesg for the buggy case (16.07 KB, text/plain)
2006-01-23 06:03 UTC, Diego Calleja
debug patch (718 bytes, patch)
2006-01-23 19:48 UTC, Shaohua
working dmesg (15.92 KB, text/plain)
2006-01-23 20:10 UTC, Diego Calleja
Debug patch number 3 (1.42 KB, patch)
2006-01-24 13:22 UTC, Venkatesh Pallipadi
debug dmesg (16.40 KB, text/plain)
2006-01-24 14:41 UTC, Diego Calleja
Dont record disabled lapic values to avoid conflict in some BIOS's (1.38 KB, patch)
2006-01-25 08:14 UTC, Ashok Raj

Description Ronald Hummelink 2006-01-20 18:38:33 UTC
Most recent kernel where this bug did not occur: 
Vanilla release 2.6.14 is good, 2.6.15 is affected. git bisected it down to
(with the help of dsd@gentoo irc):

# git bisect bad
cd8e2b48daee891011a4f21e2c62b210d24dcc9e is first bad commit
diff-tree cd8e2b48daee891011a4f21e2c62b210d24dcc9e (from
d2149b542382bfc206cb28485108f6470c979566)
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date:   Fri Oct 21 19:22:00 2005 -0400

    [ACPI] fix 2.6.13 boot hang regression on HT box w/ broken BIOS

    http://bugzilla.kernel.org/show_bug.cgi?id=5452

    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

:040000 040000 9cb687b77dcd64bf82e9a73214db467c964c1266
b1bde4a4ad91720daa6645c60bdc123b824c39b2 M      drivers

Distribution: Gentoo
Hardware Environment: Supermicro 370DER mainboard, dual P3 1 GHz Coppermine,
512 MB ECC registered PC133, SCSI drive
Software Environment: Gentoo
Problem Description: CPU0 and CPU1 are both detected by the kernel and show up in
a program like top, but on any kernel after the given commit CPU1 won
Comment 1 Ronald Hummelink 2006-01-20 18:40:35 UTC
Created attachment 7085
acpidump from affected system

dump made using acpidump
Comment 2 Diego Calleja 2006-01-22 11:12:51 UTC
I have seen this (and reported it to the kernel mailing lists), but I wasn't able
to bisect the commit.

The machine is a dual P3 1 GHz with a Supermicro 370 DE6 (same chipset).


On my box, I can make it work again by setting CONFIG_ACPI_PROCESSOR to "m". The
problem is only reproducible when CONFIG_ACPI_PROCESSOR=y.
Comment 3 Shaohua 2006-01-22 18:00:33 UTC
Created attachment 7094
debug patch

Does attached patch help? Thanks!
Comment 4 Shaohua 2006-01-22 18:14:37 UTC
And please also provide the dmesg from the buggy case!
Comment 5 Daniel Drake 2006-01-23 02:39:44 UTC
Not sure if it is required but it can do no harm: Probably a good idea to turn
on CONFIG_ACPI_DEBUG as well as applying that patch.
Comment 6 Diego Calleja 2006-01-23 04:49:20 UTC
Created attachment 7100
dmesg for the buggy case
Comment 7 Diego Calleja 2006-01-23 06:03:54 UTC
Created attachment 7101
dmesg for the buggy case

David, the patch does not fix the problem for me. dmesg attached.
Comment 8 Shaohua 2006-01-23 19:48:32 UTC
Created attachment 7110
debug patch

Does this one help a little? When ACPI returns a wrong ID, we might wrongly free
some info.
Sorry for making you try so many; I don't have a system to reproduce it on.
Comment 9 Diego Calleja 2006-01-23 20:10:04 UTC
Created attachment 7111
working dmesg

Yes, this one definitely works on my box. Attached working dmesg.
Comment 10 Shaohua 2006-01-23 22:00:40 UTC
Mark this one as resolved. Ronald, does the patch work for you?
Comment 11 Venkatesh Pallipadi 2006-01-24 13:21:37 UTC
I am not sure whether the patch in comment #8 is the right fix.

It is not clear to me why we are ending up in this error case in the first
place. Ideally, we should not get here unless the BIOS has messed up acpi_id.

I want to get more information on this failure. 

Diego/Ronald, can you attach the dmesg from your system after you apply the
patch below?

Thx.
Comment 12 Venkatesh Pallipadi 2006-01-24 13:22:39 UTC
Created attachment 7122
Debug patch number 3
Comment 13 Diego Calleja 2006-01-24 14:15:34 UTC
The patch doesn't seem to apply on top of Linus's current git tree.
Comment 14 Diego Calleja 2006-01-24 14:32:10 UTC
OK, I applied it by hand. It was just an extra space.
Comment 15 Diego Calleja 2006-01-24 14:41:03 UTC
Created attachment 7123
debug dmesg

This is the corresponding dmesg
Comment 16 Venkatesh Pallipadi 2006-01-24 15:43:33 UTC
Diego,

Thanks for promptly checking all the debug patches. I have narrowed
down the problem.

As per my theory, backing out this patch 
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fbe83e209ad9c8281e29ac17a60f91119d86fa8c
should also make your system work as before.

Now that I have understood the problem, I will work with Shaohua and Ashok and
get to a clean solution.

Description of the problem:
- This particular BIOS (on both Ronald's and Diego's systems) is unusual in that
its disabled ACPI MADT entries carry the same lapic_id as one of the enabled MADT
entries. This is not strictly outside the ACPI spec, but it is uncommon. Typically
BIOSes give these disabled CPUs some lapic_id like 0x80, 0x81, etc.

ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)

- With the above change from Ashok, we now look at disabled CPUs as well and
store their lapic_id. As this ID is the same as that of one of the enabled CPUs,
the ACPI code gets confused while adding CPUs, and that results in an error at a
later point.

- Though this issue was exposed by 5452, that is not the real cause for the problem.
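
The conflict described above can be illustrated with a small standalone C sketch. This is not the kernel code and not any of the attached patches: the struct, the function names, the table size, and the 0xFF "nothing recorded" marker are all hypothetical and chosen only for illustration. The one assumption taken from the ACPI spec is that bit 0 of the MADT Local APIC flags field means "enabled". The sample table mirrors the four LAPIC lines from the dmesg excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ACPI_IDS  8      /* hypothetical table size for this sketch */
#define LAPIC_ENABLED 0x1u   /* bit 0 of the MADT Local APIC flags field */
#define NO_LAPIC      0xFFu  /* hypothetical "nothing recorded" marker */

/* Simplified stand-in for one MADT Local APIC entry (not the kernel's struct). */
struct lapic_entry {
    uint8_t  acpi_id;
    uint8_t  lapic_id;
    uint32_t flags;
};

/* The four entries reported in the dmesg excerpt: two enabled, two disabled. */
static const struct lapic_entry sample_madt[] = {
    { 0x00, 0x00, LAPIC_ENABLED },
    { 0x01, 0x01, LAPIC_ENABLED },
    { 0x02, 0x00, 0 },
    { 0x03, 0x00, 0 },
};

/* Hypothetical acpi_id -> lapic_id map. */
struct cpu_map {
    uint8_t lapic_of[MAX_ACPI_IDS];
};

/* Mark every slot as "nothing recorded". */
void cpu_map_init(struct cpu_map *map)
{
    assert(map != NULL);
    for (size_t i = 0; i < MAX_ACPI_IDS; i++)
        map->lapic_of[i] = NO_LAPIC;
}

/* Record lapic_id per acpi_id; optionally ignore entries that are not enabled. */
void record_lapics(struct cpu_map *map, const struct lapic_entry *entries,
                   size_t count, int skip_disabled)
{
    assert(map != NULL && entries != NULL);
    for (size_t i = 0; i < count; i++) {
        const struct lapic_entry *e = &entries[i];

        if (e->acpi_id >= MAX_ACPI_IDS)
            continue;                       /* out of range for this sketch */
        if (skip_disabled && !(e->flags & LAPIC_ENABLED))
            continue;                       /* do not record disabled entries */
        map->lapic_of[e->acpi_id] = e->lapic_id;
    }
}

/* Count how many acpi_ids claim the given lapic_id. */
int count_owners(const struct cpu_map *map, uint8_t lapic_id)
{
    int owners = 0;

    assert(map != NULL);
    for (size_t i = 0; i < MAX_ACPI_IDS; i++)
        if (map->lapic_of[i] == lapic_id)
            owners++;
    return owners;
}
```

Calling record_lapics() on sample_madt with skip_disabled set to 0 leaves lapic_id 0x00 claimed by three acpi_ids (0, 2 and 3), which is the kind of ambiguity described above. With skip_disabled set to 1, each lapic_id has exactly one owner.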
Comment 17 Ronald Hummelink 2006-01-24 16:13:26 UTC
These are busy days for me; I will try to get some testing in by tomorrow
evening CET. My apologies, normally I have more time on my hands to respond as
swiftly as you guys did in working on this bug. Thanks for that!
Comment 18 Diego Calleja 2006-01-24 17:15:25 UTC
Yes, backing out that change from Linus's git tree also solved the problem
Comment 19 Ashok Raj 2006-01-25 08:14:54 UTC
Created attachment 7139
Dont record disabled lapic values to avoid conflict in some BIOS's

Could you please apply and let me know if this fixes the problem?

Thanks
ashok
Comment 20 Ronald Hummelink 2006-01-25 13:51:30 UTC
Patch from comment 19 fixes the problem for me.

Applies cleanly to and fixes both gentoo patched 2.6.15-gentoo-r1 and vanilla
2.6.15.1
Comment 21 Diego Calleja 2006-01-25 15:14:10 UTC
ditto for me
Comment 22 Daniel Drake 2006-01-31 11:47:01 UTC
Ashok, is this patch OK to apply in the Gentoo kernel, or is a final patch in
the works?
Comment 23 Ashok Raj 2006-02-02 09:29:27 UTC
Sorry for the delay...

Yes, this is the final patch. I just got email from Andi Kleen that he pushed
it to Linus, so it should be showing up in git trees pretty soon.
Comment 24 Len Brown 2006-02-07 02:09:42 UTC
shipped in linux-2.6.16-rc2, closing.
Comment 25 Ashok Raj 2006-02-10 19:02:57 UTC
Also available in 2.6.15.4 stable release.
