Created attachment 26007: dmesg output showing oops

Running halt gives a kernel OOPS. It can also be triggered reliably with:

# echo 0 > /sys/devices/system/cpu/cpu1/online

This happens with both the 2.6.34 RC3 and RC4 kernels from openSUSE (2.6.33 worked fine). Rafael Wysocki suggested reporting it here (see http://bugzilla.novell.com/show_bug.cgi?id=595904). The OOPS is at arch/x86/kernel/apb_timer.c:415. I'm attaching my full dmesg output up to the point where I ran the echo command. The hardware is an Intel Atom based HP Mini 5101 netbook.
On Thu, 15 Apr 2010 07:34:30 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=15786

An oops in apbt_cpuhp_notify(). This is a post-2.6.33 regression and is hence more-urgent-than-anything, please.

btw, why does apbt_cpuhp_notify() test system_state? system_state is a nasty hack with poorly-defined and historically-changing semantics and it would be really really good to minimise any dependencies upon it. Can we even ever _get_ hotplug events when the system is in any state other than SYSTEM_RUNNING?
On 04/16/2010 12:45 PM, Andrew Morton wrote:
> On Thu, 15 Apr 2010 07:34:30 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=15786
>
> An oops in apbt_cpuhp_notify(). This is a post-2.6.33 regression and
> is hence more-urgent-than-anything, please.
>
> btw, why does apbt_cpuhp_notify() test system_state? system_state is a
> nasty hack with poorly-defined and historically-changing semantics and
> it would be really really good to minimise any dependencies upon it.
> Can we even ever _get_ hotplug events when the system is in any state
> other than SYSTEM_RUNNING?
>

FWIW, Jacob is on vacation today, but he'll be back by Monday. He's still the best person to look at this.

-hpa
AJ, can you please build the kernel w/ debug info and ask gdb at which line the oops is happening? The function seems a bit strange.

    struct apbt_dev *adev = &per_cpu(cpu_apbt_dev, cpu);
    ^^^^^ here, adev can't be NULL as it's taking the address of an lvalue.

    switch (action & 0xf) {
    case CPU_DEAD:
        apbt_disable_int(cpu);
        if (system_state == SYSTEM_RUNNING)
            pr_debug("skipping APBT CPU %lu offline\n", cpu);
        else if (adev) {
    ^^^^^ so, this cond is always true. Maybe it's testing the wrong thing?

Thanks.
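To illustrate the point about the always-true test, here is a minimal user-space sketch. It is not kernel code: `timer_devs`, `NR_CPUS` as defined here, and all three helper functions are hypothetical stand-ins, and a plain static array is assumed as a rough model of a per-CPU variable. It shows that a pointer formed by taking the address of a statically allocated slot is never NULL, so a check on it says nothing about whether the device was ever set up, whereas a check on state stored inside the slot can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for this model */

/* Hypothetical stand-in for a per-CPU timer device. */
struct timer_dev {
    void *regs; /* NULL until the hardware has been set up */
};

/* Statically allocated, like a per-CPU variable: every slot has an address. */
static struct timer_dev timer_devs[NR_CPUS];

/* Mirrors the questioned test: the address of a slot is never NULL. */
bool slot_pointer_is_nonnull(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    struct timer_dev *dev = &timer_devs[cpu];
    return dev != NULL; /* always true, whatever the slot contains */
}

/* A test that can actually fail: inspect state stored in the slot. */
bool slot_is_initialized(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    return timer_devs[cpu].regs != NULL;
}

/* Record that a slot's hardware has been set up. */
void slot_init(unsigned int cpu, void *regs)
{
    assert(cpu < NR_CPUS);
    timer_devs[cpu].regs = regs;
}
```

In this model an untouched slot still passes the pointer test, which matches the observation that the `else if (adev)` branch cannot guard against an uninitialized timer.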
Mmmh, seeing the above, I wonder whether gcc 4.5 plays into this as well. I'll try a kernel compiled with gcc 4.4 first.
Fails the same way with kernel compiled by gcc 4.4.1
Mmmh, what I see in gdb does not make sense at all: all lines of apbt_cpuhp_notify are 0.
Created attachment 26051: patch to conditionally register apbt cpu hotplug notifier
Reply-To: jacob.jun.pan@intel.com

Sorry for the late reply; I am looking into this right now. I test system_state because the Moorestown PM code often takes the non-boot CPUs online and offline, so I was trying to avoid the overhead of request_irq/free_irq when the system is in the SYSTEM_RUNNING state.
AJ, could you try the patch I just attached? The bug is that apbt_late_init, which is an initcall, does not check whether the timer block is enabled before registering the notifier. So when you boot the kernel on a PC, the APB timer is not initialized but the notifier is still registered, which causes the oops.
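The failure mode and the shape of the fix can be sketched in user space as follows. This is a simplified model under stated assumptions, not the attached patch: `struct timer_model` and every function name below are hypothetical, and the real kernel code registers a CPU hotplug notifier rather than setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these names are not the kernel's. */
struct timer_model {
    bool enabled;             /* true only if the timer block was set up */
    bool notifier_registered; /* true once the hotplug callback is installed */
};

/* Unguarded late init: registers the callback no matter what. */
void late_init_unguarded(struct timer_model *t)
{
    assert(t != NULL);
    t->notifier_registered = true;
}

/* Guarded late init: skips registration when the timer was never enabled. */
void late_init_guarded(struct timer_model *t)
{
    assert(t != NULL);
    if (!t->enabled)
        return;
    t->notifier_registered = true;
}

/*
 * A CPU-offline event reaches uninitialized timer state exactly when the
 * callback is registered but the timer was never set up.
 */
bool offline_would_touch_uninit_state(const struct timer_model *t)
{
    assert(t != NULL);
    return t->notifier_registered && !t->enabled;
}
```

With `enabled` false (the PC case described above), the unguarded variant leaves the model in the state where an offline event would hit uninitialized data, while the guarded variant never registers the callback at all.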
Jacob, the patch works fine and solves all the problems I had: offlining via echo works, halt works, and suspend to RAM works again! Thanks! I hope the patch makes it into 2.6.34.
Thanks for the update. The patch has been sent to the x86 maintainers and lkml. I will follow up if there are any issues. Again, sorry for all the trouble.
Handled-By : Jacob Pan <jacob.jun.pan@linux.intel.com>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=26051
Fixed by commit ae7c9b70dcb4313ea3dbcc9a2f240dae6c2b50c0.
*** Bug 15820 has been marked as a duplicate of this bug. ***