Bug 5452

Summary:	regression: 2.6.13 boot hang if HT enabled
Product:	ACPI	Reporter:	Joerg Platte (bugzilla)
Component:	BIOS	Assignee:	Venkatesh Pallipadi (venki)
Status:	CLOSED CODE_FIX
Severity:	high	CC:	acpi-bugzilla
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.13 and newer	Subsystem:
Regression:	---	Bisected commit-id:
Attachments:	W600 DSDT; acpidump output; dmesg 2.6.12.5; dmesg 2.6.13.3; Patch to add a check for faulty acpiid reported by BIOS

Description Joerg Platte 2005-10-16 09:51:05 UTC

Most recent kernel where this bug did not occur: 2.6.12.X
Distribution: Debian (vanilla kernel)
Hardware Environment: Fujitsu Siemens Scenic W600
Software Environment:
Problem Description:
The problem exists on various Fujitsu-Siemens computers (W600 series). Kernels
up to 2.6.12 boot without ACPI-related problems. All newer kernels tested
(2.6.13, 2.6.13.3 and 2.6.14-rc3) hang after ACPI CPU detection. I
worked around the problem temporarily by disabling Hyper-Threading in the
BIOS, but there should be a better solution. The DSDT can be found here:
http://www-ds.e-technik.uni-dortmund.de/~jplatte/dsdt.w600 
   
Steps to reproduce:   
Boot with Hyper Threading enabled...

Comment 1 Joerg Platte 2005-10-16 09:52:13 UTC

Created attachment 6313 [details]
W600 DSDT

Comment 2 Venkatesh Pallipadi 2005-10-17 06:02:58 UTC

Looks like a duplicate of bug #5165. Can you please try the patches there?

Comment 3 Joerg Platte 2005-10-17 08:14:47 UTC

I tried the following patches: 
Don't use P_LVL when there is a valid _CST 
Watchout for P_LVL2_UP flag in fadt, before using C2 and beyond on SMP systems 
 
The problem still remains. The computer hangs after loading the processor
module, right after printing the following messages:
ACPI: CPU0 (power states: C1[C1]) 
ACPI: CPU0 (power states: C1[C1])

Comment 4 Venkatesh Pallipadi 2005-10-17 08:24:24 UTC

I need a couple more inputs (sorry, I don't have this specific system to
reproduce it locally):

1) Full acpidump output using the pmtools here - 
http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/

2) Can you look for "acpi_processor_set_pdc" in drivers/acpi/processor_idle.c,
comment out that particular line, and try a 2.6.13 kernel (or any other kernel
that has the bug)? Let me know whether that changes anything.

Thanks.

Comment 5 Joerg Platte 2005-10-17 08:59:29 UTC

Created attachment 6326 [details]
acpidump output

Will test the patched kernel tomorrow.

Comment 6 Joerg Platte 2005-10-17 09:12:02 UTC

Commenting out acpi_processor_set_pdc doesn't help. Kernels prior to 2.6.13.X 
detected the real and the "virtual" second CPU as CPU0 and CPU1. Kernel 2.6.13 
detects CPU0 two times. Is this behaviour expected? Or does the second CPU0 
confuse the kernel?

Comment 7 Len Brown 2005-10-19 18:36:41 UTC

Please attach the dmesg from the (working) 2.6.12.
Any chance you can get a console capture for the failing >= 2.6.13 boot?
If not, perhaps the dmesg from a working >= 2.6.13 boot, say by using
"maxcpus=1" to work around the issue, or by disabling HT in the BIOS.

Comment 8 Joerg Platte 2005-10-19 23:30:26 UTC

Created attachment 6337 [details]
dmesg 2.6.12.5

This kernel has most of the required drivers built statically.

Comment 9 Joerg Platte 2005-10-19 23:33:03 UTC

Created attachment 6338 [details]
dmesg 2.6.13.3

This kernel is fully modular. I invoked bash in the generated initramfs before
loading the processor module so that I could copy all log messages. After
loading the module, the next two kernel messages are:
ACPI: CPU0 (power states: C1[C1]) 
ACPI: CPU0 (power states: C1[C1])

Comment 10 Venkatesh Pallipadi 2005-10-21 06:34:22 UTC

In 2.6.12 we used to disable C-states on SMP systems altogether and we were
not using acpi_processor_idle(). In 2.6.13, we enable C-states and use
acpi_processor_idle() when CPUs are idling. In both cases, for C1, we use
the underlying idle routine, mwait_idle() in this case, for the actual idle.

So, it is still a mystery to me why we are hanging here. Can I ask you for one
more favor? Please enable magic SysRq keys and get a register dump at the
hang.

While I scratch my head on this, you should be able to use "idle=halt" as a 
workaround for your regular system usage.

Thanks.

Comment 11 Joerg Platte 2005-10-21 08:17:48 UTC

Magic SysRq keys were already compiled into the kernel. Unfortunately, I can't
get a register dump. After the hang the key combination doesn't work... I was
able to print the help (Alt+SysRq+h) until the processor hangs. It looks like
an endless loop, because the fan gets louder and louder. Hence, the CPU seems
to be working very hard...
  
"idle=halt" is working fine. Thanks for the hint.  
  
Are you sure the output
ACPI: CPU0 (power states: C1[C1])  
ACPI: CPU0 (power states: C1[C1]) 
is correct? On my W620 (a newer model), I get the following output: 
ACPI: CPU0 (power states: C1[C1]) 
ACPI: CPU1 (power states: C1[C1]) 
Here, the two CPUs get different numbers...

Comment 12 Venkatesh Pallipadi 2005-10-21 18:58:10 UTC

Indeed...
The problem is due to this..
ACPI: CPU0 (power states: C1[C1]) 
ACPI: CPU0 (power states: C1[C1])

Congratulations. You yourself have debugged your issue here :).

Details about the problem:
In the MADT, this is what the BIOS says about the two CPUs and their ACPI and
APIC ids:
[17179569.184000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[17179569.184000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)

But, when the BIOS describes each of these CPUs in DSDT _PR namespace, it says
(from acpidump->disassembly)
    Scope (\_PR)
    {
        Processor (CPU0, 0x00, 0x0000F010, 0x06) {}
        Processor (CPUA, 0x00, 0x00000000, 0x00) {}
    }

The first field (CPU0, CPUA) is the processor name and the second field is the
ACPI id. And the BIOS assigns an ACPI id of 0 to both CPUs!!!

Our kernel code does not detect this condition. As a result, both CPUs
end up pointing to the same structures, possibly clobbering each other's
data during halt and ending up in some infinite loop.
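
To illustrate the sharing problem, here is a toy model only. It is not the
kernel's processor driver; every name in it (demo_processor, demo_slots,
demo_register) is made up for this sketch, and it assumes that the
per-processor slot is chosen from the ACPI id alone:

```c
#include <stddef.h>

#define DEMO_MAX_CPUS 8

/* Hypothetical per-processor state; not a real kernel structure. */
struct demo_processor {
    unsigned int acpi_id;
    int users;   /* number of Processor() objects registered into this slot */
};

static struct demo_processor demo_slots[DEMO_MAX_CPUS];

/*
 * Register one Processor() object. The slot is picked from the ACPI id
 * alone (an assumption of this sketch), so two objects that carry the
 * same id silently end up sharing one structure.
 */
struct demo_processor *demo_register(unsigned int acpi_id)
{
    struct demo_processor *pr;

    if (acpi_id >= DEMO_MAX_CPUS)
        return NULL;

    pr = &demo_slots[acpi_id];
    pr->acpi_id = acpi_id;
    pr->users++;
    return pr;
}
```

With the W600 DSDT above, registering CPU0 (id 0x00) and then CPUA (also id
0x00) returns the same pointer twice in this model, so anything one CPU
writes there is visible to, and can be overwritten by, the other.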

I still don't know exactly where we are hanging, but I don't think that is
very relevant.

Solution:
1) Make sure you are running the latest BIOS. And complain to your BIOS vendor,
providing the above info.
2) Make the processor driver a bit more intelligent, so that it can catch this
bug and raise a red flag earlier.

Workaround with current kernels:
3) use "idle=halt" boot parameter.

For 2) above I will attach a patch real soon.
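
As a rough idea of what such a check could look like, here is a standalone
sketch. It is not the patch attached to this bug; struct proc_decl and
find_duplicate_acpi_id() are hypothetical names, and the sketch assumes the
Processor() declarations have already been collected into an array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of one DSDT Processor() declaration. */
struct proc_decl {
    const char *name;       /* first field, e.g. "CPU0" */
    unsigned int acpi_id;   /* second field of Processor() */
};

#define MAX_ACPI_ID 256     /* the id in Processor() is a single byte */

/*
 * Return the index of the first declaration whose ACPI id was already
 * claimed by an earlier declaration, or -1 if all ids are unique.
 */
int find_duplicate_acpi_id(const struct proc_decl *decls, size_t count)
{
    const struct proc_decl *owner[MAX_ACPI_ID] = { NULL };
    size_t i;

    for (i = 0; i < count; i++) {
        unsigned int id = decls[i].acpi_id;

        assert(id < MAX_ACPI_ID);
        if (owner[id] != NULL) {
            fprintf(stderr, "BIOS bug: %s reuses ACPI id 0x%02x of %s\n",
                    decls[i].name, id, owner[id]->name);
            return (int)i;
        }
        owner[id] = &decls[i];
    }
    return -1;
}
```

Fed the two W600 declarations (CPU0 and CPUA, both with id 0x00), this flags
the second one, which a driver could then refuse to bind instead of letting
it share state with the first.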

Comment 13 Venkatesh Pallipadi 2005-10-21 19:22:39 UTC

Created attachment 6358 [details]
Patch to add a check for faulty acpiid reported by BIOS

Let me know what happens with this patch on 2.6.13. If it works as expected, I
will push the patch towards base.

Thanks.

Comment 14 Joerg Platte 2005-10-24 00:14:53 UTC

The patch works as expected. The ACPI part detects only one CPU0 and doesn't 
hang any more. Thanks for the patch! :-)

Comment 15 Venkatesh Pallipadi 2005-10-24 04:45:58 UTC

Thanks for verifying. Also, check with your BIOS provider for any updates that
fix the issue.

I will push this patch towards base.

Comment 16 Len Brown 2005-11-30 20:18:35 UTC

moved to BIOS category
applied to acpi-test

Comment 17 Len Brown 2005-12-05 14:22:31 UTC

shipped in linux-2.6.15-rc5 -- closing.

Comment 18 Thomas Renninger 2005-12-22 05:19:58 UTC

+		ACPI_DEBUG_PRINT((ACPI_DB_ERROR, "BIOS reporting wrong ACPI id"
+			"for the processor\n"));

Sorry, but could you get used to reporting errors with ACPI_REPORT_ERROR((""))
or ACPI_REPORT_WARNING(("")), so that people can see possible culprits without
needing to compile with ACPI_DEBUG=y?
Bob wants to change all critical ACPI_DB_ERRORs to use the ACPI_DEBUG-independent
interface soon, AFAIK; this one probably belongs to those that should
always be written to dmesg.
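
A small standalone sketch of the difference, for illustration. The DEMO_*
macros below are hypothetical stand-ins, not the real ACPICA macros, and
DEMO_ACPI_DEBUG is an invented switch that plays the role of ACPI_DEBUG=y:

```c
#include <stdio.h>

/*
 * Build with -DDEMO_ACPI_DEBUG to mimic a debug build. Without it the
 * debug print compiles to nothing, so the user never sees the message.
 */
#ifdef DEMO_ACPI_DEBUG
#define DEMO_DEBUG_PRINT(msg) (fprintf(stderr, "debug: %s\n", (msg)), 1)
#else
#define DEMO_DEBUG_PRINT(msg) ((void)(msg), 0)
#endif

/* Always compiled in, independent of the debug switch. */
#define DEMO_REPORT_ERROR(msg) (fprintf(stderr, "error: %s\n", (msg)), 1)

/* Each helper returns 1 if a message was actually emitted, 0 otherwise. */
int report_bad_acpi_id_debug_only(void)
{
    return DEMO_DEBUG_PRINT("BIOS reporting wrong ACPI id for the processor");
}

int report_bad_acpi_id_always(void)
{
    return DEMO_REPORT_ERROR("BIOS reporting wrong ACPI id for the processor");
}
```

In a default build of this sketch only report_bad_acpi_id_always() prints
anything, which is the behavior being asked for here.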

Comment 19 Venkatesh Pallipadi 2006-01-19 11:38:55 UTC

This is not a critical error or warning in most cases. We encounter this message
even when, say, someone limits the CPUs by using maxcpus. I have only ever seen
this message in the maxcpus case. In fact, Andi also told me a while back to
remove this message, as it is a kind of false alarm for the end user to see it
reported as an error.

Comment 20 Venkatesh Pallipadi 2006-01-24 10:30:45 UTC

Oops Sorry.. I agree with you. This message should be changed to
ACPI_REPORT_ERROR as it is critical and should appear in dmesg irrespective of
ACPI_DEBUG setting. I got confused with some other error message and replied in
a hurry earlier. I will fix this with a patch.

Thx.

Comment 21 Len Brown 2006-06-25 21:16:55 UTC

0eacee585a89ce5827b572a73a024931506bef48 
shipped in 2.6.17-git9