Summary: 2.6.15 regression - 2nd CPU unused - Serverworks OSB4/Supermicro 370DER
Product: ACPI
Reporter: Ronald Hummelink (ronald)
Component: Config-Processors
Assignee: Venkatesh Pallipadi (venki)
Severity: normal
CC: acpi-bugzilla, again, ashok.raj, diegocg, dsd, hallbw
acpidump from affected system
dmesg for the buggy case
dmesg for the buggy case
Debug patch number 3
Dont record disabled lapic values to avoid conflict in some BIOS's
Description Ronald Hummelink 2006-01-20 18:38:33 UTC
Most recent kernel where this bug did not occur: Vanilla release 2.6.14 is good, 2.6.15 is affected. git bisected it down to (with the help of dsd@gentoo irc):

# git bisect bad
cd8e2b48daee891011a4f21e2c62b210d24dcc9e is first bad commit
diff-tree cd8e2b48daee891011a4f21e2c62b210d24dcc9e (from d2149b542382bfc206cb28485108f6470c979566)
Author: Venkatesh Pallipadi <firstname.lastname@example.org>
Date: Fri Oct 21 19:22:00 2005 -0400

[ACPI] fix 2.6.13 boot hang regression on HT box w/ broken BIOS
http://bugzilla.kernel.org/show_bug.cgi?id=5452
Signed-off-by: Venkatesh Pallipadi <email@example.com>
Signed-off-by: Len Brown <firstname.lastname@example.org>

:040000 040000 9cb687b77dcd64bf82e9a73214db467c964c1266 b1bde4a4ad91720daa6645c60bdc123b824c39b2 M drivers

Distribution: Gentoo
Hardware Environment: Supermicro 370DER mainboard, dual p3 1ghz coppermine, 512mb ecc reg pc 133, scsi drive
Software Environment: Gentoo
Problem Description: CPU0 and CPU1 are both detected by the kernel, and show in a program like top. but on any kernel after given commit CPU1 won
Comment 1 Ronald Hummelink 2006-01-20 18:40:35 UTC
Created attachment 7085 [details] acpidump from affected system dump made using acpidump
Comment 2 Diego Calleja 2006-01-22 11:12:51 UTC
I have seen this (and reported it to the kernel mailing lists), but I wasn't able to bisect the commit. The machine is a dual P3 1 GHz with a Supermicro 370 DE6 (same chipset). On my box, I can make it work again by setting CONFIG_ACPI_PROCESSOR to "m". The problem is only reproducible when CONFIG_ACPI_PROCESSOR=y.
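To check which setting a given kernel build uses, the config file can be scanned for the symbol. The following is a minimal sketch, not a kernel tool: the function name `config_value` and the `SAMPLE` text are hypothetical, and it assumes the usual `.config` conventions (`SYMBOL=value` lines and `# SYMBOL is not set` comments).

```python
def config_value(config_text, symbol):
    """Return the value of a kernel config symbol ('y', 'm', ...) or None if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled symbols are recorded as a comment, not as SYMBOL=n.
        if line == f"# {symbol} is not set":
            return None
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical config excerpt, for illustration only.
SAMPLE = """\
CONFIG_ACPI=y
CONFIG_ACPI_PROCESSOR=m
# CONFIG_ACPI_DEBUG is not set
"""

print(config_value(SAMPLE, "CONFIG_ACPI_PROCESSOR"))
```

In this thread, a value of "y" is the case that reproduces the bug and "m" is the workaround.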
Comment 3 Shaohua 2006-01-22 18:00:33 UTC
Created attachment 7094 [details] debug patch Does attached patch help? Thanks!
Comment 4 Shaohua 2006-01-22 18:14:37 UTC
And please also provide the dmesg from the buggy case!
Comment 5 Daniel Drake 2006-01-23 02:39:44 UTC
Not sure if it is required but it can do no harm: Probably a good idea to turn on CONFIG_ACPI_DEBUG as well as applying that patch.
Comment 6 Diego Calleja 2006-01-23 04:49:20 UTC
Created attachment 7100 [details] dmesg for the buggy case
Comment 7 Diego Calleja 2006-01-23 06:03:54 UTC
Created attachment 7101 [details] dmesg for the buggy case David, the patch does not fix the problem for me. dmesg attached.
Comment 8 Shaohua 2006-01-23 19:48:32 UTC
Created attachment 7110 [details] debug patch Does this one help a little? When ACPI returns a wrong ID, we might wrongly free some info. Sorry for making you try so many patches; I don't have a system to reproduce it on.
Comment 9 Diego Calleja 2006-01-23 20:10:04 UTC
Created attachment 7111 [details] working dmesg Yes, this one definitely works on my box. Attached working dmesg.
Comment 10 Shaohua 2006-01-23 22:00:40 UTC
Mark this one as resolved. Ronald, does the patch work for you?
Comment 11 Venkatesh Pallipadi 2006-01-24 13:21:37 UTC
I am not sure whether the patch in comment #8 is the right fix. It is not clear to me why we are ending up in this error case in the first place. Ideally, we should not get here unless the BIOS has messed up acpi_id. I want to get more information on this failure. Diego/Ronald, can you attach the dmesg from your system after you apply the patch below? Thx.
Comment 12 Venkatesh Pallipadi 2006-01-24 13:22:39 UTC
Created attachment 7122 [details] Debug patch number 3
Comment 13 Diego Calleja 2006-01-24 14:15:34 UTC
The patch doesn't seem to apply on top of Linus's current git tree.
Comment 14 Diego Calleja 2006-01-24 14:32:10 UTC
OK, I applied it by hand - it was just an extra space.
Comment 15 Diego Calleja 2006-01-24 14:41:03 UTC
Created attachment 7123 [details] debug dmesg This is the corresponding dmesg
Comment 16 Venkatesh Pallipadi 2006-01-24 15:43:33 UTC
Diego, thanks for promptly checking all the debug patches. I have narrowed down the problem. If my theory is right, backing out this patch
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fbe83e209ad9c8281e29ac17a60f91119d86fa8c
should also make your system work as before. Now that I understand the problem, I will work with Shaohua and Ashok to get to a clean solution.

Description of the problem:

- This particular BIOS (on both Ronald's and Diego's systems) is unusual in that its disabled ACPI MADT entries map to the lapic_id of one of the enabled MADT entries. Strictly speaking this is not outside the ACPI spec, but it is uncommon. Typically BIOSes give these disabled CPUs a lapic_id like 0x80, 0x81, etc.

ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)

- With the above change from Ashok, we now look at disabled CPUs as well and store their lapic_id. Because this ID is the same as that of one of the enabled CPUs, ACPI gets confused while adding CPUs, which results in an error at a later point.
- Though this issue was exposed by 5452, that is not the real cause of the problem.
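To see whether another machine has the same BIOS quirk, the LAPIC lines in its dmesg can be checked for a lapic_id claimed by more than one MADT entry. The following is a rough sketch in Python, not kernel code: the helper names `parse_lapic_entries` and `shared_lapic_ids` are made up for illustration, and the regular expression assumes the exact log format quoted in this comment.

```python
import re
from collections import defaultdict

# LAPIC lines as quoted in this comment.
BOOT_LOG = """\
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)
"""

LAPIC_RE = re.compile(
    r"ACPI: LAPIC \(acpi_id\[(0x[0-9a-fA-F]+)\] "
    r"lapic_id\[(0x[0-9a-fA-F]+)\] (enabled|disabled)\)"
)


def parse_lapic_entries(log):
    """Return (acpi_id, lapic_id, enabled) tuples, one per LAPIC line."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3) == "enabled")
        for m in LAPIC_RE.finditer(log)
    ]


def shared_lapic_ids(entries):
    """Map each lapic_id used by more than one MADT entry to its acpi_ids."""
    owners = defaultdict(list)
    for acpi_id, lapic_id, _enabled in entries:
        owners[lapic_id].append(acpi_id)
    return {lid: ids for lid, ids in owners.items() if len(ids) > 1}


entries = parse_lapic_entries(BOOT_LOG)
conflicts = shared_lapic_ids(entries)
print(conflicts)  # {0: [0, 2, 3]}
```

Here lapic_id 0x00 belongs to the enabled acpi_id 0x00 and to the two disabled entries 0x02 and 0x03, which is the overlap described above.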
Comment 17 Ronald Hummelink 2006-01-24 16:13:26 UTC
These are busy days for me; I will try to get some testing in by tomorrow evening CET. My apologies, normally I have more time on my hands to respond as swiftly as you guys did in working on this bug. Thanks for that!
Comment 18 Diego Calleja 2006-01-24 17:15:25 UTC
Yes, backing out that change from Linus's git tree also solved the problem
Comment 19 Ashok Raj 2006-01-25 08:14:54 UTC
Created attachment 7139 [details] Dont record disabled lapic values to avoid conflict in some BIOS's Could you please apply and let me know if this fixes the problem? Thanks ashok
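The idea in the attachment title can be modelled roughly as follows. This is only an illustration in Python, not the attached patch: `build_acpiid_to_apicid`, `acpi_ids_per_lapic`, and the `record_disabled` flag are hypothetical names, and the entries are the ones reported in comment 16.

```python
# (acpi_id, lapic_id, enabled) for the four MADT LAPIC entries from comment 16.
MADT_LAPIC = [
    (0x00, 0x00, True),
    (0x01, 0x01, True),
    (0x02, 0x00, False),
    (0x03, 0x00, False),
]


def build_acpiid_to_apicid(entries, record_disabled):
    """Build an acpi_id -> lapic_id table, optionally including disabled CPUs."""
    table = {}
    for acpi_id, lapic_id, enabled in entries:
        if enabled or record_disabled:
            table[acpi_id] = lapic_id
    return table


def acpi_ids_per_lapic(table):
    """Group acpi_ids by the lapic_id they resolve to."""
    groups = {}
    for acpi_id, lapic_id in sorted(table.items()):
        groups.setdefault(lapic_id, []).append(acpi_id)
    return groups


with_disabled = acpi_ids_per_lapic(build_acpiid_to_apicid(MADT_LAPIC, True))
enabled_only = acpi_ids_per_lapic(build_acpiid_to_apicid(MADT_LAPIC, False))

print(with_disabled)  # {0: [0, 2, 3], 1: [1]}
print(enabled_only)   # {0: [0], 1: [1]}
```

When disabled entries are recorded, three acpi_ids resolve to lapic_id 0; when they are skipped, each lapic_id maps back to exactly one processor.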
Comment 20 Ronald Hummelink 2006-01-25 13:51:30 UTC
Patch from comment 19 fixes the problem for me. Applies cleanly to and fixes both gentoo patched 2.6.15-gentoo-r1 and vanilla 18.104.22.168
Comment 21 Diego Calleja 2006-01-25 15:14:10 UTC
ditto for me
Comment 22 Daniel Drake 2006-01-31 11:47:01 UTC
Ashok, is this patch OK to apply in the Gentoo kernel, or is a final patch in the works?
Comment 23 Ashok Raj 2006-02-02 09:29:27 UTC
Sorry for the delay... Yes, this is the final patch. I just got an email from Andi Kleen saying that he pushed it to Linus, so it should be showing up in git trees pretty soon.
Comment 24 Len Brown 2006-02-07 02:09:42 UTC
shipped in linux-2.6.16-rc2, closing.
Comment 25 Ashok Raj 2006-02-10 19:02:57 UTC
Also available in 22.214.171.124 stable release.