Bug 13444

Summary: SparcServer 1000E SMP can cause kernel-nullpointer with some hw configurations
Product: Platform Specific/Hardware Reporter: Kjetil Oftedal (oftedal)
Component: SPARC32Assignee: Pete Zaitcev (zaitcev)
Status: CLOSED OBSOLETE    
Severity: normal CC: alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.4.37 Subsystem:
Regression: No Bisected commit-id:
Attachments: SUN4D-SMP mode patch

Description Kjetil Oftedal 2009-06-03 16:45:33 UTC
Created attachment 21732 [details]
SUN4D-SMP mode patch

SparcServer 1000E/SUN4D machines will cause a kernel-nullpointer when running in SMP-mode with certain hw configurations.
According to Sun documentation Slot A on the systemboards should be filled with cpu modules first. This will fill physical cpu-slots 0,2,4,8 with cpus first. 
The current smp code will then fill the init_tasks-array using the physical cpu-mapping. 
The scheduler on the other hand uses the logical cpu-mapping.
So in a two cpu system with two systemboards cpu-slots 0 and 2 will be occupied and init_tasks-array position 0 and 2 will contain a idle_task. But the scheduler will access init_tasks position 0 and 1 which will cause an error.
Any sparcserver 1000(E) system with 2 systemboards and 2 cpus, with 1 cpu per board will cause this error.
Comment 1 Andrew Morton 2009-06-03 21:42:44 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 3 Jun 2009 16:45:33 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=13444
> 
>            Summary: SparcServer 1000E SMP can cause kernel-nullpointer
>                     with some hw configurations
>            Product: Platform Specific/Hardware
>            Version: 2.5
>     Kernel Version: 2.4.37
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SPARC32
>         AssignedTo: zaitcev@yahoo.com
>         ReportedBy: oftedal@gmail.com
>         Regression: No
> 
> 
> Created an attachment (id=21732)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=21732)
> SUN4D-SMP mode patch
> 
> SparcServer 1000E/SUN4D machines will cause a kernel-nullpointer when running
> in SMP-mode with certain hw configurations.
> According to Sun documentation Slot A on the systemboards should be filled
> with
> cpu modules first. This will fill physical cpu-slots 0,2,4,8 with cpus first. 
> The current smp code will then fill the init_tasks-array using the physical
> cpu-mapping. 
> The scheduler on the other hand uses the logical cpu-mapping.
> So in a two cpu system with two systemboards cpu-slots 0 and 2 will be
> occupied
> and init_tasks-array position 0 and 2 will contain a idle_task. But the
> scheduler will access init_tasks position 0 and 1 which will cause an error.
> Any sparcserver 1000(E) system with 2 systemboards and 2 cpus, with 1 cpu per
> board will cause this error.
> 

Thanks, but handling patches via bugzilla is far from preferred.

Please send the patch via email, as a reply-to-all to this email.  The
patch should include a full description and a Signed-off-by:, as per
Documentation/SubmittingPatches.
Comment 2 Kjetil Oftedal 2009-06-03 23:28:47 UTC
On Wed, 3 Jun 2009, Andrew Morton wrote:

>
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Wed, 3 Jun 2009 16:45:33 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> http://bugzilla.kernel.org/show_bug.cgi?id=13444
>>
>>            Summary: SparcServer 1000E SMP can cause kernel-nullpointer
>>                     with some hw configurations
>>            Product: Platform Specific/Hardware
>>            Version: 2.5
>>     Kernel Version: 2.4.37
>>           Platform: All
>>         OS/Version: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: SPARC32
>>         AssignedTo: zaitcev@yahoo.com
>>         ReportedBy: oftedal@gmail.com
>>         Regression: No
>>
>>
>> Created an attachment (id=21732)
>>  --> (http://bugzilla.kernel.org/attachment.cgi?id=21732)
>> SUN4D-SMP mode patch
>>
>> SparcServer 1000E/SUN4D machines will cause a kernel-nullpointer when
>> running
>> in SMP-mode with certain hw configurations.
>> According to Sun documentation Slot A on the systemboards should be filled
>> with
>> cpu modules first. This will fill physical cpu-slots 0,2,4,8 with cpus
>> first.
>> The current smp code will then fill the init_tasks-array using the physical
>> cpu-mapping.
>> The scheduler on the other hand uses the logical cpu-mapping.
>> So in a two cpu system with two systemboards cpu-slots 0 and 2 will be
>> occupied
>> and init_tasks-array position 0 and 2 will contain a idle_task. But the
>> scheduler will access init_tasks position 0 and 1 which will cause an error.
>> Any sparcserver 1000(E) system with 2 systemboards and 2 cpus, with 1 cpu
>> per
>> board will cause this error.
>>
>
> Thanks, but handling patches via bugzilla is far from preferred.
>
> Please send the patch via email, as a reply-to-all to this email.  The
> patch should include a full description and a Signed-off-by:, as per
> Documentation/SubmittingPatches.
>
>

As mentioned in the bug-report(#13444) the sun4d-SMP code on 
2.4-series kernels(2.4.37) uses the physical cpu-mapping when inserting 
tasks into the init_tasks array.
And the scheduler uses the logical cpu-mapping which will cause a
kernel nullpointer when there are gaps in the physical cpu-mapping.

The attached patch uses the cpucount variable, which can be used to find 
the next logical slot in the init_tasks array. As it is only incremented
when a cpu is successfully started.

(I hope this was more compliant with kernel-bugfixing standards. I am new 
to this)

Signed-off-by: Kjetil Oftedal <oftedal@gmail.com>
---
diff --git a/arch/sparc/kernel/sun4d_smp.c b/arch/sparc/kernel/sun4d_smp.c
index 9bb7f78..5b9b34e 100644
--- a/arch/sparc/kernel/sun4d_smp.c
+++ b/arch/sparc/kernel/sun4d_smp.c
@@ -221,7 +221,9 @@ void __init smp4d_boot_cpus(void)
 			cpucount++;

 			p = init_task.prev_task;
-			init_tasks[i] = p;
+
+			/* The scheduler uses the logical cpu mapping when accessing this array */
+			init_tasks[cpucount] = p;

 			p->processor = i;
 			p->cpus_runnable = 1 << i; /* we schedule the first task manually */
Comment 3 David S. Miller 2009-06-04 04:23:16 UTC
From: oftedal <oftedal@gmail.com>
Date: Thu, 4 Jun 2009 01:14:33 +0200 (CEST)

> As mentioned in the bug-report(#13444) the sun4d-SMP code on
> 2.4-series kernels(2.4.37) uses the physical cpu-mapping when
> inserting tasks into the init_tasks array.
> And the scheduler uses the logical cpu-mapping which will cause a
> kernel nullpointer when there are gaps in the physical cpu-mapping.
> 
> The attached patch uses the cpucount variable, which can be used to
> find the next logical slot in the init_tasks array. As it is only
> incremented
> when a cpu is successfully started.
> 
> (I hope this was more compliant with kernel-bugfixing standards. I am
> new to this)
> 
> Signed-off-by: Kjetil Oftedal <oftedal@gmail.com>

I don't think we're handling 2.4.x patches at this point. :-)

If the bug exists in 2.6.x as well I'd be happy to review such
a patch and apply it for you.
Comment 4 Kjetil Oftedal 2009-06-04 08:54:04 UTC
On Wed, 3 Jun 2009, David Miller wrote:

> From: oftedal <oftedal@gmail.com>
> Date: Thu, 4 Jun 2009 01:14:33 +0200 (CEST)
>
>> As mentioned in the bug-report(#13444) the sun4d-SMP code on
>> 2.4-series kernels(2.4.37) uses the physical cpu-mapping when
>> inserting tasks into the init_tasks array.
>> And the scheduler uses the logical cpu-mapping which will cause a
>> kernel nullpointer when there are gaps in the physical cpu-mapping.
>>
>> The attached patch uses the cpucount variable, which can be used to
>> find the next logical slot in the init_tasks array. As it is only
>> incremented
>> when a cpu is successfully started.
>>
>> (I hope this was more compliant with kernel-bugfixing standards. I am
>> new to this)
>>
>> Signed-off-by: Kjetil Oftedal <oftedal@gmail.com>
>
> I don't think we're handling 2.4.x patches at this point. :-)
>
> If the bug exists in 2.6.x as well I'd be happy to review such
> a patch and apply it for you.
>

Yeah. 2.4.x is a "bit" old. But it was easier to fix. As the 2.6.x-series
is giving me all kinds of trouble. The SMP-code was redesigned for 2.6, 
and I havn't been able to confirm if it works or not.
I could have a go at it if someone could give me a small pointer in how to 
get the piggyback_32 utility working with 2.6s System.map.
(looks like "A end" is missing from it)
Comment 5 Sam Ravnborg 2009-06-04 09:20:36 UTC
> 
> Yeah. 2.4.x is a "bit" old. But it was easier to fix. As the 2.6.x-series
> is giving me all kinds of trouble. The SMP-code was redesigned for 2.6, 
> and I havn't been able to confirm if it works or not.
> I could have a go at it if someone could give me a small pointer in how to 
> get the piggyback_32 utility working with 2.6s System.map.
> (looks like "A end" is missing from it)

Just a quick guess..

Try to add:

	end = .;

to arch/sparc/kernel/vmlinux.lds.S

Do this above the definition of _end.

Or better change piggyback_32 to support _end.

	Sam