Most recent kernel where this bug did not occur: Vanilla release 2.6.14 is good, 2.6.15 is affected.

git bisected it down to (with the help of dsd@gentoo irc):

# git bisect bad
cd8e2b48daee891011a4f21e2c62b210d24dcc9e is first bad commit
diff-tree cd8e2b48daee891011a4f21e2c62b210d24dcc9e (from d2149b542382bfc206cb28485108f6470c979566)
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Fri Oct 21 19:22:00 2005 -0400

    [ACPI] fix 2.6.13 boot hang regression on HT box w/ broken BIOS

    http://bugzilla.kernel.org/show_bug.cgi?id=5452

    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

:040000 040000 9cb687b77dcd64bf82e9a73214db467c964c1266 b1bde4a4ad91720daa6645c60bdc123b824c39b2 M drivers

Distribution: Gentoo
Hardware Environment: Supermicro 370DER mainboard, dual p3 1ghz coppermine, 512mb ecc reg pc 133, scsi drive
Software Environment: Gentoo
Problem Description: CPU0 and CPU1 are both detected by the kernel, and show in a program like top. but on any kernel after given commit CPU1 won
Created attachment 7085 [details] acpidump from affected system dump made using acpidump
I have seen this (and reported it to the kernel mailing lists), but I wasn't able to bisect the commit. The machine is a dual P3 1 GHz with a Supermicro 370 DE6 (same chipset). In my box, I can make it work again by setting CONFIG_ACPI_PROCESSOR to "m". The problem is only reproducible when CONFIG_ACPI_PROCESSOR=y.
Created attachment 7094 [details] debug patch Does attached patch help? Thanks!
And please also provide the dmesg from the buggy case!
Not sure if it is required but it can do no harm: Probably a good idea to turn on CONFIG_ACPI_DEBUG as well as applying that patch.
Created attachment 7100 [details] dmesg for the buggy case
Created attachment 7101 [details] dmesg for the buggy case David, the patch does not fix the problem for me. dmesg attached.
Created attachment 7110 [details] debug patch Does this one help a little? When ACPI returns wrong ID, we might wrongly free some info. Sorry for letting you try so many, I haven't a system to reproduce it.
Created attachment 7111 [details] working dmesg Yes, this one definitely works in my box. Attached working dmesg.
Mark this one as resolved. Ronald, does the patch work for you?
I am not sure whether the patch in comment #8 is the right fix. It is not clear to me why we are ending up in this error case in the first place. Ideally, we should not get here unless the BIOS has messed up acpi_id. I want to get more information on this failure. Diego/Ronald, can you attach the dmesg from your system after you apply the patch below? Thx.
Created attachment 7122 [details] Debug patch number 3
the patch doesn't seem to apply on top of current Linus's git tree
ok I applied it by hand - it was just an extra space
Created attachment 7123 [details] debug dmesg This is the corresponding dmesg
Diego, thanks for the prompt and quick check of all the debug patches. I have narrowed down the problem here. As per my theory, backing out this patch
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fbe83e209ad9c8281e29ac17a60f91119d86fa8c
should also make your system work as before. Now that I have understood the problem, I will work with Shaohua and Ashok to get to a clean solution.

Description of the problem:

- This particular BIOS (on both Ronald's and Diego's systems) is unusual in that its disabled ACPI MADT entries map to the lapic_id of one of the enabled MADT entries. Though this is not strictly outside the ACPI spec, it is uncommon. Typically BIOSes give these disabled CPUs some lapic_id like 0x80, 0x81, etc.

ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)

- With the above change from Ashok, we now look at disabled CPUs as well and store their lapic_id. As this id is the same as that of one of the enabled CPUs, ACPI gets confused while adding CPUs, and that results in an error at a later point.
- Though this issue was exposed by 5452, that is not the real cause of the problem.
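
For readers following along, the aliasing described above can be modelled in a few lines of C. This is only an illustrative sketch and not kernel code: `struct lapic_entry`, `madt_from_dmesg` and `count_disabled_aliases` are hypothetical names, and the table simply transcribes the four LAPIC lines from the dmesg in this comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one MADT local APIC entry. */
struct lapic_entry {
    uint8_t acpi_id;
    uint8_t lapic_id;
    bool    enabled;
};

/* The four LAPIC entries as printed in the dmesg quoted above. */
static const struct lapic_entry madt_from_dmesg[] = {
    { 0x00, 0x00, true  },
    { 0x01, 0x01, true  },
    { 0x02, 0x00, false },
    { 0x03, 0x00, false },
};

/*
 * Count disabled entries whose lapic_id is also used by an enabled entry.
 * A non-zero result is the BIOS layout described in this comment.
 */
size_t count_disabled_aliases(const struct lapic_entry *e, size_t n)
{
    size_t aliases = 0;

    assert(e != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (e[i].enabled)
            continue;
        for (size_t j = 0; j < n; j++) {
            if (e[j].enabled && e[j].lapic_id == e[i].lapic_id) {
                aliases++;
                break; /* count each disabled entry at most once */
            }
        }
    }
    return aliases;
}
```

On the table above, both disabled entries (acpi_id 0x02 and 0x03) alias lapic_id 0x00 of the first enabled CPU. A BIOS that used ids such as 0x80 and 0x81 for its disabled entries would produce no aliases.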
These are busy days for me; I will try to get some testing in by tomorrow evening CET. My apologies, normally I have more time on my hands to respond as swiftly as you guys did in working on this bug. Thanks for that!
Yes, backing out that change from Linus's git tree also solved the problem
Created attachment 7139 [details] Dont record disabled lapic values to avoid conflict in some BIOS's Could you please apply and let me know if this fixes the problem? Thanks ashok
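
A rough sketch of the idea named in the attachment title, assuming the fix amounts to skipping disabled entries when the acpi_id to lapic_id mapping is recorded. The attachment itself is not reproduced here, so this is a standalone model rather than the actual patch: `record_lapic_ids`, `count_owners`, the `skip_disabled` flag and `MAX_ACPI_ID` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ACPI_ID 256   /* acpi_id is modelled as an 8-bit value */
#define UNRECORDED  (-1)  /* marker for "no lapic_id stored" */

/* Hypothetical, simplified model of one MADT local APIC entry. */
struct lapic_entry {
    uint8_t acpi_id;
    uint8_t lapic_id;
    bool    enabled;
};

/*
 * Fill map[acpi_id] = lapic_id for the given entries.
 * With skip_disabled set, disabled entries are left unrecorded,
 * which is the behaviour this sketch assumes the patch restores.
 */
void record_lapic_ids(const struct lapic_entry *e, size_t n,
                      bool skip_disabled, int map[MAX_ACPI_ID])
{
    assert(map != NULL);

    for (size_t i = 0; i < MAX_ACPI_ID; i++)
        map[i] = UNRECORDED;

    for (size_t i = 0; i < n; i++) {
        if (skip_disabled && !e[i].enabled)
            continue;
        map[e[i].acpi_id] = e[i].lapic_id;
    }
}

/* Number of acpi_ids that claim the given lapic_id. */
size_t count_owners(const int map[MAX_ACPI_ID], int lapic_id)
{
    size_t owners = 0;

    for (size_t i = 0; i < MAX_ACPI_ID; i++)
        if (map[i] == lapic_id)
            owners++;
    return owners;
}
```

With the four entries from the earlier dmesg, recording everything leaves lapic_id 0x00 claimed by three acpi_ids, while skipping disabled entries leaves exactly one owner per enabled CPU.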
Patch from comment 19 fixes the problem for me. Applies cleanly to and fixes both gentoo patched 2.6.15-gentoo-r1 and vanilla 2.6.15.1
ditto for me
Ashok, is this patch OK to apply in the Gentoo kernel, or is a final patch in the works?
Sorry for the delay. Yes, this is the final patch. I just got email from Andi Kleen that he pushed it to Linus, so it should be showing up in git trees pretty soon.
shipped in linux-2.6.16-rc2, closing.
Also available in 2.6.15.4 stable release.