Bug 4359

Summary: With HT off in BIOS, /proc/cpuinfo shows 2 processors, each with 2 siblings
Product: ACPI
Reporter: Jay Hilliard (jay.hilliard)
Component: BIOS
Assignee: Suresh B Siddha (suresh.b.siddha)
Status: CLOSED CODE_FIX
Severity: normal
CC: acpi-bugzilla, akpm, suresh.b.siddha
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.6 - 2.6.11
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: dmesg, Patch

Description Jay Hilliard 2005-03-17 12:25:10 UTC
Distribution: RHEL4 AS, Fedora Core 2
Hardware Environment: Dual Xeon (HP DL-360 or Sun Fire V60x)
Software Environment: kernel-smp
Problem Description: inconsistent sibling count when ht is disabled

Steps to reproduce:

On an HP DL-360 or Sun Fire V60x (dual xeon systems)
  1. disable HT in the bios
  2. boot any smp kernel after 2.6.5 (didn't try UP kernel)
  3. /proc/cpuinfo shows 2 processors (phys ID 0 and 3) with 2 siblings

Expected behavior:
 /proc/cpuinfo should show 2 processors with 1 sibling each.

I don't see this behavior on any of my other dual xeon systems (HP x4000,
xw8000, or xw8200).
I don't see this behavior on ANY systems running kernel 2.6.5 or earlier.
I've tried updating the firmware and that didn't help.
Comment 1 Venkatesh Pallipadi 2005-03-17 14:22:18 UTC
Can you please post the complete dmesg (dmesg -s 400000000) and /proc/cpuinfo 
output for this one, with any kernel later than 2.6.5?

This looks like a BIOS issue, as it should set number of siblings to 1 when HT 
is disabled. I can confirm it after seeing the logs.

Before 2.6.5, the kernel had a check that handled this BIOS bug. It detected
the case where the BIOS reports two siblings when the actual number of
siblings is one. That check seems to have been dropped from recent kernels.
Comment 2 Jay Hilliard 2005-03-17 14:37:40 UTC
Created attachment 4745 [details]
dmesg

dmesg output for:
HP DL360 G3, dual xeon, vanilla 2.6.11smp kernel
Comment 3 Jay Hilliard 2005-03-17 14:39:11 UTC
Here's cpuinfo from that same system.  THANKS!!

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3066.530
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6078.46

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3066.530
cache size      : 512 KB
physical id     : 3
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 6127.61
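
The inconsistency in the output above (each "physical id" appears once, yet
"siblings" says 2) can be checked mechanically. The following is a minimal
sketch, assuming /proc/cpuinfo text with "physical id" and "siblings" lines in
that order within each processor block, as in the listing above. The function
name count_sibling_mismatches and the MAX_PACKAGES limit are made up for
illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PACKAGES 64  /* arbitrary limit chosen for this sketch */

/*
 * Scan /proc/cpuinfo-style text and return the number of physical packages
 * whose "siblings" value differs from the number of processor entries that
 * list that "physical id". Returns -1 if there are too many packages.
 */
static int count_sibling_mismatches(const char *cpuinfo)
{
    int ids[MAX_PACKAGES];      /* physical id of each package seen */
    int seen[MAX_PACKAGES];     /* processor entries listing that id */
    int claimed[MAX_PACKAGES];  /* last "siblings" value for that id */
    int npkg = 0;
    int cur = -1;               /* package index of the current entry */
    const char *line = cpuinfo;

    while (*line != '\0') {
        int val;
        const char *nl;

        if (sscanf(line, "physical id : %d", &val) == 1) {
            /* Find this package, or add it if it is new. */
            for (cur = 0; cur < npkg && ids[cur] != val; cur++)
                ;
            if (cur == npkg) {
                if (npkg == MAX_PACKAGES)
                    return -1;
                ids[npkg] = val;
                seen[npkg] = 0;
                claimed[npkg] = 0;
                npkg++;
            }
            seen[cur]++;
        } else if (cur >= 0 && sscanf(line, "siblings : %d", &val) == 1) {
            claimed[cur] = val;
        }

        /* Advance to the start of the next line, if any. */
        nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }

    int mismatches = 0;
    for (int i = 0; i < npkg; i++)
        if (claimed[i] != seen[i])
            mismatches++;
    return mismatches;
}
```

Fed the two processor entries shown above, this would flag both packages,
since each claims 2 siblings but only one logical processor lists its
physical id.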
Comment 4 Venkatesh Pallipadi 2005-03-18 17:30:59 UTC
This is indeed a BIOS bug.

The IA-32 Intel Architecture Software Developer's Manual (vol 3)
(http://developer.intel.com/design/pentium4/manuals/index_new.htm#sdm_vol3)
in section 7.6.3 "Detecting Hyper-Threading technology", says that when
cpuid(1) returns the HT flag in bit 28 of EDX _and_ bits 16:23 of EBX contain
a number of logical processors > 1, then HT is enabled.
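
The rule quoted above can be written as a short C sketch. The function names
are hypothetical, and the register values are passed in as plain integers
rather than read with the cpuid instruction, so this only illustrates the bit
test itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1: EDX bit 28 is the HT feature flag. */
#define CPUID1_EDX_HT_BIT 28u

/* CPUID leaf 1: EBX bits 16..23 hold the logical processor count. */
static unsigned int cpuid1_logical_count(uint32_t ebx)
{
    return (ebx >> 16) & 0xffu;
}

/* True if the HT feature flag is set in the EDX value from CPUID leaf 1. */
static bool cpuid1_has_ht_flag(uint32_t edx)
{
    return ((edx >> CPUID1_EDX_HT_BIT) & 1u) != 0;
}

/* The rule as quoted: HT flag set AND logical processor count > 1. */
static bool cpuid1_reports_ht(uint32_t ebx, uint32_t edx)
{
    return cpuid1_has_ht_flag(edx) && cpuid1_logical_count(ebx) > 1;
}
```

A check like this trusts whatever cpuid returns, so it reports HT as enabled
whenever the flag is set and the count is 2, regardless of the BIOS setting.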

On this particular system, we have HT flag enabled and Number of siblings as 2 
even when HT is disabled in BIOS.

This should not cause any issue within the kernel as kernel finds out that 
there are no HT siblings. But, /proc/cpuinfo still shows what BIOS reports 
(siblings=2 in this case).

Before 2.6.5 there was a special check for this, and the kernel reset its
sibling count variable to 1 in such cases. The check is not there in recent
kernels. We can add that check back to the kernel to fix the wrong number of
siblings reported in /proc/cpuinfo. But that may not really solve the problem
fully, as some other user program can execute cpuid on its own and find the
same _wrong_ information provided by the BIOS. So the ideal fix is to get it
fixed in the BIOS.
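
As an illustration of the kind of check described above, here is a minimal
sketch in C. It is not the code from any kernel version or from a patch on
this bug. The names cpus_in_package and effective_siblings are hypothetical,
and the input is assumed to be the physical package id of every logical CPU
that was actually brought up.

```c
#include <stddef.h>

/* Count the logical CPUs in phys_id[0..n-1] that belong to package pkg. */
static unsigned int cpus_in_package(const int *phys_id, size_t n, int pkg)
{
    unsigned int count = 0;

    for (size_t i = 0; i < n; i++)
        if (phys_id[i] == pkg)
            count++;
    return count;
}

/*
 * Return the sibling count to report: the value claimed via cpuid, lowered
 * to the largest number of logical CPUs actually found in any one package.
 * If no CPUs are listed, the claimed value is returned unchanged.
 */
static unsigned int effective_siblings(const int *phys_id, size_t n,
                                       unsigned int cpuid_siblings)
{
    unsigned int found = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned int c = cpus_in_package(phys_id, n, phys_id[i]);
        if (c > found)
            found = c;
    }
    if (found == 0)
        return cpuid_siblings;
    return found < cpuid_siblings ? found : cpuid_siblings;
}
```

With the two CPUs from this report (physical ids 0 and 3) and a claimed count
of 2, this yields 1. It only corrects what the kernel reports, so a program
that runs cpuid itself would still see the BIOS-provided value.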

Can you contact the BIOS provider with this information?
Do you think an additional check in kernel will help you in short term?

Comment 5 Jay Hilliard 2005-03-18 17:59:56 UTC
Thank you very much for clarifying this.  I have opened trouble tickets with
both Sun and HP regarding the BIOS issues.  Hopefully, reference to this bug
will encourage them to fix the problem.

I have 1000 of these systems I need to fix, so is a kernel patch an option?
We're currently running Fedora's 2.6.5-1.358, but I can tweak a 2.6.6 patch to
fit. I just need to know what the basic code changes are.

Thanks again!
Comment 6 Venkatesh Pallipadi 2005-03-19 12:29:34 UTC
Are you sure you have the problem with Fedora's 2.6.5-1.358 kernel? From the
source code I see the problem is not there in 2.6.5-1.358, but it is there in
2.6.6-1.435.

Comment 7 Venkatesh Pallipadi 2005-03-19 12:32:37 UTC
Created attachment 4760 [details]
Patch

The attached patch is against the 2.6.6-1.435* kernel and fixes this
/proc/cpuinfo issue in the i386 part of the code.

I am assuming these are plain i386 processors. If these are EM64T capable
processors and you are having problems with the x86-64 kernel, let me know. I
will send a separate patch for that.
Comment 8 Jay Hilliard 2005-03-21 00:40:59 UTC
I'm sorry, you are correct, the 2.6.5 kernel works.  We're actually using the
2.6.6 kernel.  The patch is greatly appreciated.  I wonder how many more
manufacturers still have this problem.  Maybe keeping this "fix" in the kernel
isn't such a bad idea.  If I deploy on EM64T with X86_64 (as I'm planning), and
the BIOS still has this problem, will I need a different patch?  What do I tell
vendors they need to fix in their BIOS in that case?
Thank you again for your help!
Comment 9 Venkatesh Pallipadi 2005-03-21 10:44:08 UTC
Looking at the linux-2.6.6* code, x86-64 kernel has a workaround for this BIOS 
bug. It will only report 1 sibling in this case. So no patch required there. I 
will try to push a patch onto upstream kernel 2.6.12-rc*, to have this kernel 
workaround. For all the earlier kernels, you can use the patch attached.

Again, the kernel patch is just a band-aid. The real problem is in the BIOS,
and if some user program executes its own 'cpuid' instruction to find out the
HT information, instead of looking at the kernel interface in /proc/cpuinfo,
then those user programs are still going to have problems.

Comment 10 Andrew Morton 2005-03-21 18:37:22 UTC
Thanks, Venkatesh.

Yes, a little workaround like that is appropriate, IMO.  I'll be
watching my inbox ;)
Comment 11 Jay Hilliard 2005-03-22 10:04:25 UTC
Thank you Venkatesh. The patch you attached did work around this problem for
us. I'll continue to pursue getting the vendors to fix their BIOS.  Just
curious, is the workaround in x86-64 the same?  In other words, cpuid would
still report incorrectly on x86-64?
Comment 12 Venkatesh Pallipadi 2005-03-22 10:12:31 UTC
Yes. It is the same workaround in x86-64 as well.
Comment 13 Andrew Morton 2005-05-25 16:13:54 UTC
Is this problem fixed in 2.6.12-rc5?
Comment 14 Venkatesh Pallipadi 2005-05-25 16:35:33 UTC
2.6.12-rc has this fixed in x86-64 but not in i386. Suresh will be sending a 
patch for that one.

Looks like this code has changed in 2.6.12-rc4-mm2, probably from the CPU
hotplug changes. This code seems to have been removed there. Suresh is looking
more into it.

Thanks,
Venki
Comment 15 Jon Tollefson 2005-05-26 15:44:17 UTC
renamed account
Comment 16 Suresh B Siddha 2005-06-03 17:05:17 UTC
This is now in Linus's tree:

http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=49f384b82b03416dd7e4fc77847a959fe3247362