Bug 5452
Summary: | regression: 2.6.13 boot hang if HT enabled | ||
---|---|---|---|
Product: | ACPI | Reporter: | Joerg Platte (bugzilla) |
Component: | BIOS | Assignee: | Venkatesh Pallipadi (venki) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | acpi-bugzilla |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13 and newer | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | W600 DSDT; acpidump output; dmesg 2.6.12.5; dmesg 2.6.13.3; Patch to add a check for faulty acpiid reported by BIOS |
Description
Joerg Platte
2005-10-16 09:51:05 UTC
Created attachment 6313 [details]
W600 DSDT
Looks like a duplicate of bug #5165. Can you please try the patches there?

I tried the following patches:
Don't use P_LVL when there is a valid _CST
Watchout for P_LVL2_UP flag in fadt, before using C2 and beyond on SMP systems
The problem still remains. The computer hangs after loading the processor module, right after printing the following messages:
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU0 (power states: C1[C1])

A couple more inputs I need (sorry, I don't have this specific system to reproduce it locally):
1) Full acpidump output using the pmtools here - http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/
2) Can you look for "acpi_processor_set_pdc" in drivers/acpi/processor_idle.c, comment out that particular line, and try a 2.6.13 kernel (or any other kernel that has the bug)? Let me know whether that changes anything.
Thanks.

Created attachment 6326 [details]
acpidump output
Will test the patched kernel tomorrow.
Commenting out acpi_processor_set_pdc doesn't help. Kernels prior to 2.6.13.X detected the real and the "virtual" second CPU as CPU0 and CPU1. Kernel 2.6.13 detects CPU0 twice. Is this behaviour expected? Or does the second CPU0 confuse the kernel?

Please attach the dmesg from the (working) 2.6.12.

Any chance you can get a console capture for the failing >= 2.6.13 boot? If not, perhaps the dmesg from a working >= 2.6.13 boot, say by using "maxcpus=1" to work around the issue, or by disabling HT in the BIOS.

Created attachment 6337 [details]
dmesg 2.6.12.5
This kernel has most required drivers built statically.
Created attachment 6338 [details]
dmesg 2.6.13.3
This kernel is fully modular. I invoked bash before loading the processor
module in the generated initramfs to copy all log messages. After loading the
module the next two kernel messages are:
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU0 (power states: C1[C1])
In 2.6.12 we used to disable C-states on SMP systems altogether and we were not using acpi_processor_idle(). In 2.6.13, we enable C-states and use acpi_processor_idle() when CPUs are idling. In both cases, for C1, we use the underlying idle routine (mwait_idle() in this case) for the actual idle. So it is still a mystery to me why we are hanging here. Can I ask you for one more favor? Enable the magic SysRq keys and get a register dump at the hang. While I scratch my head on this, you should be able to use "idle=halt" as a workaround for your regular system usage. Thanks.

Magic SysRq keys were already compiled into the kernel. Unfortunately, I can't get a register dump. After the hang the key combination doesn't work... I was able to print the help (Alt-SysRq-h) until the processor hangs. It looks like an endless loop, because the fan gets louder and louder. Hence, the CPU seems to work very hard... "idle=halt" is working fine. Thanks for the hint.

Are you sure the output
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU0 (power states: C1[C1])
is correct? On my W620 (a newer model), I get the following output:
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU1 (power states: C1[C1])
Here, both CPUs get a different number...

Indeed... The problem is due to this..
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU0 (power states: C1[C1])
Congratulations. You yourself have debugged your issue here :).

Details about the problem:
In the MADT this is what the BIOS is saying about the two CPUs and their ACPI and APIC ids
[17179569.184000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[17179569.184000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
But when the BIOS describes each of these CPUs in the DSDT _PR namespace, it says (from acpidump->disassembly)
Scope (\_PR)
{
    Processor (CPU0, 0x00, 0x0000F010, 0x06) {}
    Processor (CPUA, 0x00, 0x00000000, 0x00) {}
}
The first field (CPU0, CPUA) is the processor name and the second field is the ACPI id.
And the BIOS assigns an ACPI_ID of 0 to both CPUs!!! Our kernel code does not detect this condition and, as a result, both CPUs here happen to point to the same structures, possibly clobbering each other's structures during halt and ending up in some infinite loop. I still don't know the exact place where we are hanging. But that is not very relevant, I think.

Solution:
1) Make sure you are running the latest BIOS. And complain to your BIOS vendor, providing the above info.
2) Make the processor driver a bit more intelligent, so that it can catch this bug and raise a red flag earlier.
Workaround with current kernels:
3) Use the "idle=halt" boot parameter.

For 2) above I will attach a patch real soon.

Created attachment 6358 [details]
Patch to add a check for faulty acpiid reported by BIOS
Let me know what happens with this patch on 2.6.13. If it works as expected, I
will push the patch towards base.
Thanks.
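The condition described above (two Processor() objects in the DSDT carrying the same ACPI id) can be illustrated with a small user-space sketch. This is not the attached patch and not kernel code: the struct, the function and the table names are hypothetical and exist only for illustration. The id values are the ones quoted in this report from the DSDT and from the MADT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one Processor() object found under \_PR. */
struct sketch_processor {
    const char *name;     /* first field of Processor(), e.g. "CPU0" */
    unsigned int acpi_id; /* second field of Processor() */
};

/*
 * Return the index of the first entry whose ACPI id repeats the id of an
 * earlier entry, or -1 if every id in the table is unique.
 */
static int sketch_find_duplicate_acpi_id(const struct sketch_processor *p,
                                         size_t n)
{
    size_t i, j;

    assert(p != NULL || n == 0);
    for (i = 1; i < n; i++)
        for (j = 0; j < i; j++)
            if (p[i].acpi_id == p[j].acpi_id)
                return (int)i;
    return -1;
}

/* The two Processor() declarations from the W600 DSDT quoted above. */
static const struct sketch_processor sketch_dsdt_ids[] = {
    { "CPU0", 0x00 },
    { "CPUA", 0x00 },
};

/* The same two CPUs with the ACPI ids that the MADT reports for them. */
static const struct sketch_processor sketch_madt_ids[] = {
    { "CPU0", 0x00 },
    { "CPUA", 0x01 },
};
```

Calling sketch_find_duplicate_acpi_id() on sketch_dsdt_ids flags the second entry (CPUA), while the MADT-style table passes. A driver that ran such a check before setting up per-CPU state could refuse the second object instead of letting both CPUs share one structure.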
The patch works as expected. The ACPI part detects only one CPU0 and doesn't hang any more. Thanks for the patch! :-)

Thanks for verifying. Also, check with your BIOS provider for any updates that fix the issue. I will push this patch towards base.

moved to BIOS category

applied to acpi-test

shipped in linux-2.6.15-rc5 -- closing.

+ ACPI_DEBUG_PRINT((ACPI_DB_ERROR, "BIOS reporting wrong ACPI id"
+ "for the processor\n"));
Sorry, but could you get used to reporting errors with ACPI_REPORT_ERROR(("")) or ACPI_REPORT_WARNING(("")), so that people can see possible culprits without needing to compile with ACPI_DEBUG=y? Bob wants to change all critical ACPI_DB_ERRORs to use the ACPI_DEBUG-independent interface soon, AFAIK. This one probably belongs to those that should always be written to dmesg.

This is not a critical error or warning in most cases. We encounter this message even when, say, someone limits the CPUs by using maxcpus. I have always seen this message only in the maxcpus case. In fact, Andi also told me a while back to remove this message, as it is a kind of false alarm for the end user to see it as an error message.

Oops, sorry.. I agree with you. This message should be changed to ACPI_REPORT_ERROR, as it is critical and should appear in dmesg irrespective of the ACPI_DEBUG setting. I got confused with some other error message and replied in a hurry earlier. I will fix this with a patch. Thx.

0eacee585a89ce5827b572a73a024931506bef48 shipped in 2.6.17-git9
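The difference discussed above, between a message that only appears in debug builds and one that is always logged, can be sketched in user-space C. The SKETCH_* macros and helper functions below are hypothetical stand-ins, not the real ACPI_DEBUG_PRINT or ACPI_REPORT_ERROR implementations. They only mimic the double-parenthesis calling style seen in the quoted patch, and the SKETCH_DEBUG switch is an assumed analogue of building with ACPI_DEBUG=y.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Number of messages that actually reached the log (stderr here). */
static int sketch_messages_logged;

/* printf-style logger shared by both stand-in macros. */
static void sketch_log(const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    sketch_messages_logged++;
}

/* Stand-in for a debug-only print: compiled out unless SKETCH_DEBUG is set. */
#ifdef SKETCH_DEBUG
#define SKETCH_DEBUG_PRINT(args) sketch_log args
#else
#define SKETCH_DEBUG_PRINT(args) do { } while (0)
#endif

/* Stand-in for an always-on error report, independent of SKETCH_DEBUG. */
#define SKETCH_REPORT_ERROR(args) sketch_log args

/* Emit the same complaint both ways; return how many messages were logged. */
static int sketch_report_bad_acpi_id(void)
{
    int before = sketch_messages_logged;

    SKETCH_DEBUG_PRINT(("debug only: BIOS reported a wrong ACPI id\n"));
    SKETCH_REPORT_ERROR(("always: BIOS reported a wrong ACPI id\n"));
    return sketch_messages_logged - before;
}
```

In a build without SKETCH_DEBUG, sketch_report_bad_acpi_id() logs only the second message, which is the behaviour requested in the thread for errors that end users need to see in dmesg.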