Bug 1706

Summary: problem with SMP boot with 2.4.22 and acpi-20031203
Product: ACPI Reporter: Sergey Vlasov (vsu)
Component: Config-ProcessorsAssignee: Luming Yu (luming.yu)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.4.22 Subsystem:
Regression: --- Bisected commit-id:
Attachments: dmesg from booting with 2.4.22 and acpi-20031203 (ACPI mode)
dmesg from booting wit 2.4.22 and acpi-20031203 (acpi=off)
dmesg from booting with 2.4.22 and previous version of the ACPI patch
a proposal patch

Description Sergey Vlasov 2003-12-19 08:07:56 UTC
After updating ACPI patch version from 21-Nov-2003 to 11-Dec-2003
(acpi-20031203-2.4.22.diff), a problem with booting on an SMP (P4 HT) machine
was reported: booting takes about 3 minutes, and kernel complains about not
responding processors:

Booting processor 2/32 eip 3000
Not responding.
Booting processor 2/33 eip 3000
Not responding.
Booting processor 2/64 eip 3000
Not responding.
Booting processor 2/65 eip 3000
Not responding.
Booting processor 2/96 eip 3000
Not responding.
Booting processor 2/97 eip 3000
Not responding.
Booting processor 2/128 eip 3000
Not responding.
Booting processor 2/129 eip 3000
Not responding.
Booting processor 2/160 eip 3000
Not responding.
Booting processor 2/161 eip 3000
Not responding.
Booting processor 2/192 eip 3000
Not responding.
Booting processor 2/193 eip 3000
Not responding.
Booting processor 2/224 eip 3000
Not responding.
Booting processor 2/225 eip 3000
Not responding.
Total of 2 processors activated (9620.68 BogoMIPS).

(full dmesg output is attached separately).

Even acpi=off does not fix this problem (so the common SMP boot code seems to be
broken by this patch).  With maxcpus=2 the kernel boots as expected.

It seems that the problem is caused by this chunk of the patch:

diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Fri Dec 12 00:22:07 2003
+++ b/arch/i386/kernel/smpboot.c	Fri Dec 12 00:22:07 2003
@@ -1106,7 +1106,7 @@
 	 */
 	Dprintk("CPU present map: %lx\n", phys_cpu_present_map);
 
-	for (bit = 0; bit < NR_CPUS; bit++) {
+	for (bit = 0; bit < MAX_APICS; bit++) {
 		apicid = cpu_present_to_apicid(bit);
 		/*
 		 * Don't even attempt to start the boot CPU!

MAX_APICS is 256 (the kernel was compiled with CONFIG_X86_CLUSTERED_APIC), but
the hardware does not use clustered APIC mode, therefore
cpu_present_to_apicid(bit) == bit, and later in the loop (1ul << bit) value
becomes undefined when bit >= 32. Actually the CPU calculates (1UL << (bit &
0x1F)), therefore the code attempts to activate processors 32 and 33, 64 and 65,
etc.

In 2.4.23 the code was slightly changed; however, if the clustered APIC mode is
not used, the same shift with out-of-range number of bits would be performed in
apicid_to_phys_cpu_present(), with the same results (not tested actually).

The latest ACPI patch (acpi-20031203-2.4.22.diff from 18-Dec-2003) was not
tested too, but the problematic part of code was not changed...
Comment 1 Sergey Vlasov 2003-12-19 08:09:59 UTC
Created attachment 1699 [details]
dmesg from booting with 2.4.22 and acpi-20031203 (ACPI mode)
Comment 2 Sergey Vlasov 2003-12-19 08:10:43 UTC
Created attachment 1700 [details]
dmesg from booting wit 2.4.22 and acpi-20031203 (acpi=off)
Comment 3 Sergey Vlasov 2003-12-19 08:12:11 UTC
Created attachment 1701 [details]
dmesg from booting with 2.4.22 and previous version of the ACPI patch

With the previous ACPI patch the problem does not happen
Comment 4 Luming Yu 2004-01-08 01:38:21 UTC
Created attachment 1814 [details]
a proposal patch 

This problem seems to be introduced by GCC:

#cat foo.c
main()
{
	int bit;

	for (bit=28; bit < 34; bit++){
		printf("%.4x \n", (1ul << bit)));
	}
}

#gcc foo.c
#./a.out
10000000 
20000000 
40000000 
80000000 

0001	 ( 1<< 34 , It's overflowed, but it should be 0 )
0002	 ( 1<< 35 , It's overflowed, but it should be 0 )

# gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --host=i386-redhat-linux --with-system-zlib
--enable-__cxa_atexit
Thread model: posix
gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
Comment 5 Sergey Vlasov 2004-01-08 02:13:58 UTC
sizeof(unsigned long) in the patch should probably be BITS_PER_LONG.

The behavior of gcc-generated code is correct - according to the C standard, the
result of a shift operation with an out-of-range number of bits is undefined.
Comment 6 Len Brown 2004-01-09 02:45:25 UTC
woops, this one got away... 
 
now fixed: 
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/patches/test/2.4.22/20040109024148-smpboot.patch 
 
Luming -- please mark this bug closed. 
 
thanks, 
-Len