Bug 15786

Summary: 2.6.34 RC3 and RC4: BUG: unable to handle kernel NULL pointer dereference at 0000001c at apbt_cpuhp_notify+0x52/0x130
Product: Platform Specific/Hardware
Reporter: Andreas Jaeger (jaegerandi)
Component: i386
Assignee: platform_i386
Status: CLOSED CODE_FIX
Severity: normal
CC: akpm, jacob.jun.pan, maciej.rutecki, rjw, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.34-rc
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 15310
Attachments:
  dmesg output showing oops
  patch to conditionally register apbt cpu hotplug notifier

Description Andreas Jaeger 2010-04-15 07:34:22 UTC
Created attachment 26007: dmesg output showing oops

Running halt gives a kernel OOPS.  

It can be triggered reliably as well with:
# echo 0 > /sys/devices/system/cpu/cpu1/online

This happens with both the 2.6.34 RC3 and RC4 kernels from openSUSE (2.6.33 worked fine). Rafael Wysocki suggested (see http://bugzilla.novell.com/show_bug.cgi?id=595904) that I report it here.

The OOPS is at arch/x86/kernel/apb_timer.c:415.

I'm attaching my full dmesg output up to the point where I run the echo command.

Hardware is an Intel Atom based HP Mini 5101 netbook.
Comment 1 Andrew Morton 2010-04-16 19:46:23 UTC
On Thu, 15 Apr 2010 07:34:30 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15786

An oops in apbt_cpuhp_notify().  This is a post-2.6.33 regression and
is hence more-urgent-than-anything, please.


btw, why does apbt_cpuhp_notify() test system_state?  system_state is a
nasty hack with poorly-defined and historically-changing semantics and
it would be really really good to minimise any dependencies upon it. 
Can we even ever _get_ hotplug events when the system is in any state
other than SYSTEM_RUNNING?
Comment 2 H. Peter Anvin 2010-04-16 21:57:59 UTC
On 04/16/2010 12:45 PM, Andrew Morton wrote:
> On Thu, 15 Apr 2010 07:34:30 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=15786
> 
> An oops in apbt_cpuhp_notify().  This is a post-2.6.33 regression and
> is hence more-urgent-than-anything, please.
> 
> btw, why does apbt_cpuhp_notify() test system_state?  system_state is a
> nasty hack with poorly-defined and historically-changing semantics and
> it would be really really good to minimise any dependencies upon it. 
> Can we even ever _get_ hotplug events when the system is in any state
> other than SYSTEM_RUNNING?
> 

FWIW, Jacob is on vacation today, but he'll be back by Monday.  He's
still the best person to look at this.

	-hpa
Comment 3 Tejun Heo 2010-04-18 22:23:53 UTC
AJ, can you please build kernel w/ debug info and ask gdb at which line the oops is happening?  The function seems a bit strange.

	struct apbt_dev *adev = &per_cpu(cpu_apbt_dev, cpu);
                        ^^^^^ here, adev can't be NULL as it's
                              taking address of an lvalue.

	switch (action & 0xf) {
	case CPU_DEAD:
		apbt_disable_int(cpu);
		if (system_state == SYSTEM_RUNNING)
			pr_debug("skipping APBT CPU %lu offline\n", cpu);
		else if (adev) {
                         ^^^^^
                        so, this cond is always true. maybe it's testing the
                        wrong thing?

Thanks.
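
To illustrate the point about `adev` above, here is a minimal userspace sketch in C. It is not the kernel code: `apbt_dev_model`, `cpu_dev_model`, `dev_for_cpu` and the `initialized` flag are hypothetical stand-ins, and a plain array plays the role of the per-CPU variable. It only shows that taking the address of an lvalue always yields a non-NULL pointer, so a check such as `if (adev)` cannot distinguish an initialized device from an uninitialized one, whereas a check on state stored in the object can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4 /* hypothetical: size of the userspace stand-in */

/* Hypothetical stand-in for struct apbt_dev; the real layout differs. */
struct apbt_dev_model {
	bool initialized; /* hypothetical flag: set once the timer is set up */
	int irq;
};

/* Userspace stand-in for a per-CPU variable: one slot per CPU. */
static struct apbt_dev_model cpu_dev_model[NR_CPUS_MODEL];

/* Same shape as &per_cpu(var, cpu): the address of an lvalue. */
static struct apbt_dev_model *dev_for_cpu(unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	return &cpu_dev_model[cpu];
}

/* A pointer test like "if (adev)" is true for every valid CPU. */
static bool pointer_test(unsigned int cpu)
{
	return dev_for_cpu(cpu) != NULL;
}

/* Testing state kept inside the object can actually be false. */
static bool state_test(unsigned int cpu)
{
	return dev_for_cpu(cpu)->initialized;
}
```

With the zero-initialized array, `pointer_test(1)` is true while `state_test(1)` is false until something sets `cpu_dev_model[1].initialized`. Which field the real code ought to test is not settled in this thread.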
Comment 4 Andreas Jaeger 2010-04-19 06:00:20 UTC
Mmmh, seeing the above, I wonder whether gcc 4.5 plays into this as well.  I'll try a kernel compiled with gcc 4.4 first.
Comment 5 Andreas Jaeger 2010-04-19 14:51:23 UTC
Fails the same way with kernel compiled by gcc 4.4.1
Comment 6 Andreas Jaeger 2010-04-19 15:32:44 UTC
Mmmh, it does not make sense at all what I see in gdb.  All lines of apbt_cpuhp_notify are 0.
Comment 7 Jacob Pan 2010-04-19 16:48:16 UTC
Created attachment 26051: patch to conditionally register apbt cpu hotplug notifier
Comment 8 Anonymous Emailer 2010-04-19 16:50:43 UTC
Reply-To: jacob.jun.pan@intel.com

Sorry for the late reply, I am looking into this right now.

I test system_state because the Moorestown PM code often takes the non-boot CPUs online and offline, so I was trying to avoid the overhead of request_irq/free_irq when the system is in the SYSTEM_RUNNING state.

Comment 9 Jacob Pan 2010-04-19 16:57:50 UTC
AJ, could you try the patch I just attached? The bug is that apbt_late_init,
an initcall, does not check whether the timer block is enabled before
registering the notifier. So when you boot the kernel on a PC, the APB timer is
not initialized but the notifier is still registered, which causes the oops.
Comment 10 Andreas Jaeger 2010-04-20 02:46:54 UTC
Jacob, the patch works fine and solves all problems I had: the offline via echo works, halt works and suspend to ram works again!

thanks!  Hope the patch makes it into 2.6.34.
Comment 11 Jacob Pan 2010-04-20 14:46:54 UTC
Thanks for the update. The patch has been sent to the x86 maintainers and lkml. I will follow up if there are any issues.
Again, sorry for all the trouble.
Comment 12 Rafael J. Wysocki 2010-04-21 05:44:55 UTC
Handled-By : Jacob Pan <jacob.jun.pan@linux.intel.com>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=26051
Comment 13 Rafael J. Wysocki 2010-05-04 19:40:35 UTC
Fixed by commit ae7c9b70dcb4313ea3dbcc9a2f240dae6c2b50c0.
Comment 14 Rafael J. Wysocki 2010-05-04 21:23:09 UTC
*** Bug 15820 has been marked as a duplicate of this bug. ***