Distribution: RHEL4 AS, Fedora Core 2
Hardware Environment: Dual Xeon (HP DL-360 or Sun Fire V60x)
Software Environment: kernel-smp
Problem Description: Inconsistent sibling count when HT is disabled
Steps to reproduce: On an HP DL-360 or Sun Fire V60x (dual Xeon systems):
1. Disable HT in the BIOS.
2. Boot any SMP kernel after 2.6.5 (I didn't try a UP kernel).
3. /proc/cpuinfo shows 2 processors (phys ID 0 and 3) with 2 siblings.
Expected behavior: /proc/cpuinfo should show 2 processors with 1 sibling each.
I don't see this behavior on any of my other dual Xeon systems (HP x4000, xw8000, or xw8200). I don't see this behavior on ANY systems running kernel 2.6.5 or earlier. I've tried updating the firmware and that didn't help.
Can you please post the complete dmesg (dmesg -s 400000000) and /proc/cpuinfo output for this one, with any kernel later than 2.6.5? This looks like a BIOS issue, as the BIOS should set the number of siblings to 1 when HT is disabled. I can confirm that after seeing the logs. Before 2.6.5, the kernel had a check that handled this BIOS bug: it caught the case where the BIOS reports two siblings when the actual number of siblings is one. That check seems to have been dropped from recent kernels.
Created attachment 4745: dmesg
dmesg output for: HP DL360 G3, dual Xeon, vanilla 2.6.11smp kernel
Here's the cpuinfo from that same system. THANKS!!
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 3.06GHz
stepping : 9
cpu MHz : 3066.530
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6078.46

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 3.06GHz
stepping : 9
cpu MHz : 3066.530
cache size : 512 KB
physical id : 3
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6127.61
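The inconsistency in this output (one processor entry per physical id, yet siblings is 2) can be checked mechanically. The following is only a rough sketch in C: the names pkg_info, scan_cpuinfo and count_inconsistent are made up for illustration, and it assumes that every logical CPU of a package is online and listed, which need not hold on every system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: one record per distinct "physical id". */
struct pkg_info {
    int physical_id;  /* value of the "physical id" field */
    int reported;     /* value of the "siblings" field */
    int seen;         /* processor entries carrying this physical id */
};

/* Scan cpuinfo-style text; returns the number of packages found. */
int scan_cpuinfo(const char *text, struct pkg_info *pkgs, int max_pkgs)
{
    int npkgs = 0;
    int cur = -1;  /* package index of the entry being parsed */

    assert(text != NULL && pkgs != NULL && max_pkgs > 0);

    for (const char *p = text; *p != '\0'; ) {
        size_t len = strcspn(p, "\n");
        char line[256];
        size_t n = len < sizeof line - 1 ? len : sizeof line - 1;
        int val;

        memcpy(line, p, n);
        line[n] = '\0';

        if (sscanf(line, " physical id : %d", &val) == 1) {
            cur = -1;
            for (int i = 0; i < npkgs; i++)
                if (pkgs[i].physical_id == val)
                    cur = i;
            if (cur < 0 && npkgs < max_pkgs) {
                cur = npkgs++;
                pkgs[cur].physical_id = val;
                pkgs[cur].reported = 0;
                pkgs[cur].seen = 0;
            }
            if (cur >= 0)
                pkgs[cur].seen++;
        } else if (cur >= 0 && sscanf(line, " siblings : %d", &val) == 1) {
            pkgs[cur].reported = val;
        }

        p += len;          /* skip the line itself */
        if (*p == '\n')
            p++;           /* and its terminating newline */
    }
    return npkgs;
}

/* Packages whose "siblings" value exceeds the entries actually listed. */
int count_inconsistent(const struct pkg_info *pkgs, int npkgs)
{
    int bad = 0;

    for (int i = 0; i < npkgs; i++)
        if (pkgs[i].reported > pkgs[i].seen)
            bad++;
    return bad;
}

/* Usage with the two entries from this report (other fields omitted). */
void example_check(void)
{
    struct pkg_info pkgs[8];
    int n = scan_cpuinfo("processor : 0\nphysical id : 0\nsiblings : 2\n\n"
                         "processor : 1\nphysical id : 3\nsiblings : 2\n",
                         pkgs, 8);

    assert(n == 2);
    assert(count_inconsistent(pkgs, n) == 2);
}
```

A real tool would read /proc/cpuinfo into a buffer and pass it to scan_cpuinfo; with HT actually enabled, each physical id would be listed twice and no package would be flagged.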
This is indeed a BIOS bug. The IA-32 Intel Architecture Software Developer's Manual (vol 3) (http://developer.intel.com/design/pentium4/manuals/index_new.htm#sdm_vol3), in section 7.6.3 "Detecting Hyper-Threading technology", says that HT is enabled when cpuid(1) returns the HT flag in bit 28 of EDX _and_ bits 23:16 of EBX contain a logical processor count greater than 1. On this particular system, the HT flag is set and the number of siblings is 2 even when HT is disabled in the BIOS. This should not cause any issue within the kernel, as the kernel finds out that there are no HT siblings. But /proc/cpuinfo still shows what the BIOS reports (siblings=2 in this case). Before 2.6.5 there was a special check for this, and the kernel reset its sibling count variable to 1 in such cases. That check is not there in recent kernels. We can add the check back to fix the wrong number of siblings reported in /proc/cpuinfo, but that may not solve the problem fully, because some other user program can execute cpuid on its own and find the same wrong information provided by the BIOS. So the ideal fix is to get it fixed in the BIOS. Can you contact the BIOS provider with this information? Do you think an additional check in the kernel will help you in the short term?
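For anyone who wants to see what a user program gets from cpuid directly, here is a minimal sketch of the rule quoted from the manual. The function names are invented for this example, and read_cpuid1 assumes a GCC or Clang toolchain on x86 that ships <cpuid.h> with __get_cpuid; the decoding functions themselves work on plain register values.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_EDX_HTT (1u << 28)   /* HT flag: bit 28 of EDX, leaf 1 */

/* 1 if the HT flag is set in the EDX value returned by cpuid(1). */
int cpuid1_ht_flag(uint32_t edx)
{
    return (edx & CPUID1_EDX_HTT) != 0;
}

/* Logical processor count per package: bits 23:16 of EBX, leaf 1. */
unsigned cpuid1_logical_count(uint32_t ebx)
{
    return (ebx >> 16) & 0xffu;
}

/* Rule from the thread: HT flag set AND logical count greater than 1. */
int cpuid1_reports_ht(uint32_t ebx, uint32_t edx)
{
    return cpuid1_ht_flag(edx) && cpuid1_logical_count(ebx) > 1;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
/* Reads leaf 1 on the calling CPU; returns 0 if leaf 1 is unsupported. */
int read_cpuid1(uint32_t *ebx, uint32_t *edx)
{
    unsigned int a, b, c, d;

    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    *ebx = b;
    *edx = d;
    return 1;
}
#endif

/* Example with made-up register values (not taken from the dmesg). */
void cpuid1_example(void)
{
    uint32_t edx = CPUID1_EDX_HTT;  /* HT flag set */
    uint32_t ebx = 2u << 16;        /* two logical processors reported */

    assert(cpuid1_reports_ht(ebx, edx));        /* looks like HT is on */
    assert(!cpuid1_reports_ht(1u << 16, edx));  /* count of 1: HT off */
}
```

On a machine with the BIOS bug described here, a program using only this rule would conclude that HT is enabled even though it is disabled, which is why a kernel-side check cannot cover every consumer.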
Thank you very much for clarifying this. I have opened trouble tickets with both Sun and HP regarding the BIOS issues. Hopefully, reference to this bug will encourage them to fix the problem. I have 1000 of these systems I need to fix, so is a kernel patch an option? We're currently running Fedora's 2.6.5-1.358, but I can tweak a 2.6.6 patch to fit. I just need to know what the basic code changes are. Thanks again!
Are you sure you have the problem with Fedora's 2.6.5-1.358 kernel? From the source code I see the problem is not there in 2.6.5-1.358, but it is there in 2.6.6-1.435.
Created attachment 4760: Patch
The attached patch is against the 2.6.6-1.435* kernel and fixes this /proc/cpuinfo issue in the i386 part of the code. I am assuming these are plain i386 processors. If these are EM64T-capable processors and you are having problems with the x86-64 kernel, let me know. I will send a separate patch for that.
I'm sorry, you are correct, the 2.6.5 kernel works. We're actually using the 2.6.6 kernel. The patch is greatly appreciated. I wonder how many more manufacturers still have this problem. Maybe keeping this "fix" in the kernel isn't such a bad idea. If I deploy on EM64T with X86_64 (as I'm planning), and the BIOS still has this problem, will I need a different patch? What do I tell vendors they need to fix in their BIOS in that case? Thank you again for your help!
Looking at the linux-2.6.6* code, the x86-64 kernel has a workaround for this BIOS bug. It will only report 1 sibling in this case, so no patch is required there. I will try to push a patch into upstream kernel 2.6.12-rc* to add this workaround for i386. For all the earlier kernels, you can use the attached patch. Again, the kernel patch is just a band-aid. The real problem is in the BIOS: if a user program executes the 'cpuid' instruction itself to find out the HT information, instead of looking at the kernel interface in /proc/cpuinfo, that program is still going to have problems.
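To make the idea of the workaround concrete, here is a rough sketch of the check as it is described in this thread: when only one logical CPU is actually found on a package but cpuid claims more, report 1. This is not the attached patch or the x86-64 code; the function names and the flat phys_id array are assumptions made purely for illustration.

```c
#include <assert.h>

/* Count how many of the detected CPUs share the given physical id. */
int cpus_on_package(const int *phys_id, int ncpus, int pkg)
{
    int count = 0;

    for (int i = 0; i < ncpus; i++)
        if (phys_id[i] == pkg)
            count++;
    return count;
}

/*
 * cpuid_siblings: count taken from cpuid(1), EBX bits 23:16.
 * siblings_found: logical CPUs actually detected on the package.
 * Returns the value to show as "siblings".
 */
int reported_siblings(int cpuid_siblings, int siblings_found)
{
    assert(siblings_found >= 1);
    if (siblings_found == 1 && cpuid_siblings > 1)
        return 1;   /* BIOS claims HT siblings that do not exist */
    return cpuid_siblings;
}

/* Fill out[i] with the sibling count to report for each CPU. */
void fill_siblings(const int *phys_id, int ncpus, int cpuid_siblings, int *out)
{
    for (int i = 0; i < ncpus; i++)
        out[i] = reported_siblings(cpuid_siblings,
                                   cpus_on_package(phys_id, ncpus, phys_id[i]));
}

/* The system from this report: physical ids 0 and 3, cpuid says 2. */
void workaround_example(void)
{
    int phys_id[2] = { 0, 3 };
    int out[2];

    fill_siblings(phys_id, 2, 2, out);
    assert(out[0] == 1 && out[1] == 1);
}
```

With HT really enabled, each physical id appears twice in the array, so the reported value stays at 2. The raw cpuid result is untouched either way, which is the limitation pointed out above.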
Thanks, Venkatesh. Yes, a little workaround like that is appropriate, IMO. I'll be watching my inbox ;)
Thank you, Venkatesh. The patch you attached did work around this problem for us. I'll continue to pursue getting the vendors to fix their BIOS. Just curious, is the workaround in x86-64 the same? In other words, would cpuid still report incorrectly on x86-64?
Yes. It is the same workaround in x86-64 as well.
Is this problem fixed in 2.6.12-rc5?
2.6.12-rc has this fixed in x86-64 but not in i386. Suresh will be sending a patch for that one. It looks like this code has changed in 2.6.12-rc4-mm2, probably because of the CPU hotplug changes, and the check seems to have been removed there. Suresh is looking into it further. Thanks, Venki
renamed account
This is now in Linus's tree: http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=49f384b82b03416dd7e4fc77847a959fe3247362