Bug 16991 - divide by zero bug in find_busiest_group (actually inlined update_sg_lb_stats )
Summary: divide by zero bug in find_busiest_group (actually inlined update_sg_lb_sta...
Status: RESOLVED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler (show other bugs)
Hardware: All Linux
: P1 blocking
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-24 22:47 UTC by Chetan
Modified: 2017-04-27 08:46 UTC (History)
19 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
screen capture of div-zero crash in find_busiest_group (12.68 KB, image/gif)
2010-08-24 22:48 UTC, Chetan
Details
proposed patch against 2.6.32 (2.14 KB, patch)
2011-08-05 22:31 UTC, Daniel Kahn Gillmor
Details | Diff
3x Kernel Panic Stack Traces (21.16 KB, text/plain)
2011-08-09 20:55 UTC, Gavin McDonald
Details
Kernel crash screen photo (732.39 KB, image/jpeg)
2011-08-24 09:47 UTC, Bas van der Doorn
Details

Description Chetan 2010-08-24 22:47:59 UTC
Hi,
 
  We upgraded to the "official" 2.6.32 kernel on our small cluster of linux boxes a few months ago. Recently, within the last week or two, we've started seeing this divide-by-zero crash on multiple of these servers at seemingly random times. We haven't been able to catch one of these while attached on our console servers but luckily we can capture an image of the last screen from the stack trace, one of which I'm adding as an attachment:

  Using the RIP in that trace, I was able to pin down the line of code where this crash is happening to the following (in the 2.6.32 treee):


The function
 /usr/src/linux/kernel/sched.c:3769
*balance = 0;
return;
}

/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) /
group->cpu_power;
aff5: 48 c1 e0 0a shl $0xa,%rax
aff9: 48 f7 f6 div %rsi


  
  So it looks like group->cpu_power happens to be zero in these crashes. All the crashes we've examined have been at the exact same adddress. The same code structure is present in 2.6.35 series kernels except it seems to have moved to the sched_fair.c file now. 

Note:
 The above is a snippet from an objdump of sched.o built with -g flag. The instruction address in this dump doesn't match the RIP from the screenshot since these are unrelcaoted addresses. I did match the assembly here to that in the full vmlinux objdump and I'm quite sure of the location (The -g build of the entire kernel is too huge for me to work with.)
 

btw, here's a report of another divide-by-zero in find_busiest_group in 2.6.35 kernel here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135


 Looking at the code, what strikes me is that group->cpu_power fields are initt'd to 0  in build_xxx_sched_groups. But I don't see any synchronization between these places - where group->cpu_power starts off as zero -- and the usage of that variable (as a denominator) in find_busiest_group and find_idlest_group functions. It's possible that there's some logical "can't happen" type synchronization inherent in the design that's not visible in the code. If so, I'd love to understand the argument that makes it ok.
 

Thanks
Chetan
Comment 1 Chetan 2010-08-24 22:48:53 UTC
Created attachment 27921 [details]
screen capture of div-zero crash in find_busiest_group
Comment 2 Andrew Morton 2010-08-24 23:00:39 UTC
cc'ed Peter, marked as a regression.

Nice bug report - thanks!  I don't immediately see anything in later kernels which addressed this.
Comment 3 Chetan 2010-08-24 23:06:25 UTC
Thanks Andrew.

  btw, I left out some information. The line that says:

The function

 should read:

 The function this line of code belongs to is  update_sg_lb_stats (inlined into find_busiest_group)
Comment 4 Anonymous Emailer 2010-08-25 07:39:50 UTC
Reply-To: peterz@infradead.org

On Tue, 2010-08-24 at 23:00 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
> 
> 
> Andrew Morton <akpm@linux-foundation.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |a.p.zijlstra@chello.nl,
>                    |                            |akpm@linux-foundation.org
>          Regression|No                          |Yes
> 
> 
> 
> 
> --- Comment #2 from Andrew Morton <akpm@linux-foundation.org>  2010-08-24
> 23:00:39 ---
> cc'ed Peter, marked as a regression.
> 
> Nice bug report - thanks!  I don't immediately see anything in later kernels
> which addressed this.
> 

Chetan writes:

> Looking at the code, what strikes me is that group->cpu_power fields are
> initt'd to 0  in build_xxx_sched_groups. But I don't see any synchronization
> between these places - where group->cpu_power starts off as zero -- and the
> usage of that variable (as a denominator) in find_busiest_group and
> find_idlest_group functions. It's possible that there's some logical "can't
> happen" type synchronization inherent in the design that's not visible in the
> code. If so, I'd love to understand the argument that makes it ok.

Look at __build_sched_domains(); we first construct the groups using the
build_*_groups() functions, which as you note initialize ->cpu_power to
0. However, after that we run through init_*_groups_power() which will
build up the group power (using +=, hence the need for the initial 0).

So group->cpu_power should not be 0 after that, which is before we
attach the sched domain tree to the cpu runqueues, which makes them
visible to the load-balancer code.

Now all this is slightly complicated by the fact that we dynamically
update ->cpu_power these days -- see update_cpu_power() and its callers.
But again, that should never result in group->cpu_power being 0.

See update_cpu_power(), which updates the singleton (cpu) groups, it
ensures ->cpu_power never becomes 0:

	if (!power)
		power = 1;

	cpu_rq(cpu)->cpu_power = power;
	sdg->cpu_power = power;

It's caller: update_group_power(), then constructs the ->cpu_power of
aggregate groups by computing the direct sum of its subgroups, which
given that the power of each subgroup should be strictly >0, can never
be 0.

Now clearly there's some hole in all this since your kernel is indeed
exploding, but I'm not seeing it (in my .36-rc2 code).
Comment 5 Chetan 2010-08-25 22:51:34 UTC
(In reply to comment #4)
> Reply-To: peterz@infradead.org
> 

Hi Peter,

     Thanks for the detailed response. My comments are inline:

> 
> Chetan writes:
> 
> > Looking at the code, what strikes me is that group->cpu_power fields are
> > initt'd to 0  in build_xxx_sched_groups. But I don't see any
> synchronization
> > between these places - where group->cpu_power starts off as zero -- and the
> > usage of that variable (as a denominator) in find_busiest_group and
> > find_idlest_group functions. It's possible that there's some logical "can't
> > happen" type synchronization inherent in the design that's not visible in
> the
> > code. If so, I'd love to understand the argument that makes it ok.
> 
> Look at __build_sched_domains(); we first construct the groups using the
> build_*_groups() functions, which as you note initialize ->cpu_power to
> 0. However, after that we run through init_*_groups_power() which will
> build up the group power (using +=, hence the need for the initial 0).
> 
> So group->cpu_power should not be 0 after that, which is before we
> attach the sched domain tree to the cpu runqueues, which makes them
> visible to the load-balancer code.


  Sounds good. That's what I meant by "by design" synchronization.  Even though I'd like to express caution about assuming ordering of init_group_power and attach_domain operations without explicit synchronization here,  I don't think this is the problem I'm seeing. From your description, it sounds like the init_group and attach_domains all run only once at boot time (is that correct?).   But  in each of the cases of this crash I've seen, the systems had been up for months.  

> 
> Now all this is slightly complicated by the fact that we dynamically
> update ->cpu_power these days -- see update_cpu_power() and its callers.
> But again, that should never result in group->cpu_power being 0.
> 
> See update_cpu_power(), which updates the singleton (cpu) groups, it
> ensures ->cpu_power never becomes 0:
> 
>     if (!power)
>         power = 1;
> 
>     cpu_rq(cpu)->cpu_power = power;
>     sdg->cpu_power = power;
> 


   If this is the only place group->cpu_power can be updated (and that's what it looks like from my code-read... which might be less that 100% exhaustive), then here's my long-shot theory:

     Given  that a store  into sdg->cpu_power   is not guaranteed to be atomic in a portable manner, and specifically on the x86_64 architecture -- unless some alignment requirements are met. ( 8byte alignment if I remember correctly) -- I would have converted this into an atomic write of some sort. Now as I said, it *is* a long shot and the likelihood of this particular race resulting in this particular crash is pretty low. But at least the code can be "more correct" in this one place. 
  
   As for the extremely low probability thing, fwiw, we *have* been running this kernel in a couple hundred 8 core servers for a few months now. And we've seen this bug only recently (all within a small window of time, curiously enough).
 
> It's caller: update_group_power(), then constructs the ->cpu_power of
> aggregate groups by computing the direct sum of its subgroups, which
> given that the power of each subgroup should be strictly >0, can never
> be 0.
> 
> Now clearly there's some hole in all this since your kernel is indeed
> exploding, but I'm not seeing it (in my .36-rc2 code).

 Have we missed anywhere cpu_power is being updated or the group structs are being freed or re-inited  in any way ???


Chetan
Comment 6 Chetan 2010-08-26 01:13:21 UTC
Some information about the system and the kernel build for the record. Let me know any other information you need:


a)   kernel version (2.6.32 release from kernel.org. Though a similar divide-by-zero has been 
   reported as recently as 2.6.35 in a Ubuntu distribution kernel here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135

b) Hardware :  8 core nehalem (Intel E5520).. /proc/cpuinfo shows 16 "hyperthreaded" cores.
 
some relevant CONFIG settings:
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
.
.
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y

CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_SCHED_TRACER is not set
Comment 7 Anonymous Emailer 2010-08-27 08:59:13 UTC
Reply-To: peterz@infradead.org

On Wed, 2010-08-25 at 22:51 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:

> > So group->cpu_power should not be 0 after that, which is before we
> > attach the sched domain tree to the cpu runqueues, which makes them
> > visible to the load-balancer code.
> 
> 
>   Sounds good. That's what I meant by "by design" synchronization.  Even
>   though
> I'd like to express caution about assuming ordering of init_group_power and
> attach_domain operations without explicit synchronization here,  I don't
> think
> this is the problem I'm seeing. From your description, it sounds like the
> init_group and attach_domains all run only once at boot time (is that
> correct?).   But  in each of the cases of this crash I've seen, the systems
> had
> been up for months.  

There's a few places where we can change the domain tree once its
attached, these include cpu-hotplug and cpuset creation/destruction.

If you look at cpu_attach_domain() you'll find that we do the actual
attach using rcu_assign_pointer() which includes a write-memory-barrier
before doing the actual assignment.

Code traversing the domain tree (e.g. for_each_domain()) will use
rcu_derefence() which includes the matching read-memory-barrier.

If we change the tree, we first do detach_destroy_domains() which will
attach the NULL tree and synchronize_sched() (machine wide preemption
sync, ensures each cpu has scheduled at least once between call and
return), which ensures there's no remaining user left (load-balancing is
non-preemptable)

> > Now all this is slightly complicated by the fact that we dynamically
> > update ->cpu_power these days -- see update_cpu_power() and its callers.
> > But again, that should never result in group->cpu_power being 0.
> > 
> > See update_cpu_power(), which updates the singleton (cpu) groups, it
> > ensures ->cpu_power never becomes 0:
> > 
> >     if (!power)
> >         power = 1;
> > 
> >     cpu_rq(cpu)->cpu_power = power;
> >     sdg->cpu_power = power;
> > 
> 
> 
>    If this is the only place group->cpu_power can be updated (and that's what
> it looks like from my code-read... which might be less that 100% exhaustive),
> then here's my long-shot theory:
> 
>      Given  that a store  into sdg->cpu_power   is not guaranteed to be
>      atomic
> in a portable manner, and specifically on the x86_64 architecture -- unless
> some alignment requirements are met. ( 8byte alignment if I remember
> correctly)
> -- I would have converted this into an atomic write of some sort. Now as I
> said, it *is* a long shot and the likelihood of this particular race
> resulting
> in this particular crash is pretty low. But at least the code can be "more
> correct" in this one place. 

The compiler is assumed to naturally align stuff, in particular, have a
look at atomic_read()/atomic_set() ;-)

The Linux kernel code assumes variables are naturally aligned and word
sized reads and stores (32bits on 32bit machines, 32bit and 64bit on a
64bit machine) are machine atomic.

>    As for the extremely low probability thing, fwiw, we *have* been running
> this kernel in a couple hundred 8 core servers for a few months now. And
> we've
> seen this bug only recently (all within a small window of time, curiously
> enough).

Most odd, Venki's patch might be worth a shot. Any idea if your
workloads changed recently? Does your workload include any RT tasks?

They crash after boot right, not on boot? 

I've once seen a creative BIOS cause the domain construction to fail,
resulting in a similar explosion on boot (the BIOS messed up the machine
topology detection code somewhere in the low-level arch/x86 bootup code,
resulting in the scheduler not getting the right cpu masks and
constructing wonky things).

> > It's caller: update_group_power(), then constructs the ->cpu_power of
> > aggregate groups by computing the direct sum of its subgroups, which
> > given that the power of each subgroup should be strictly >0, can never
> > be 0.
> > 
> > Now clearly there's some hole in all this since your kernel is indeed
> > exploding, but I'm not seeing it (in my .36-rc2 code).
> 
>  Have we missed anywhere cpu_power is being updated or the group structs are
> being freed or re-inited  in any way ???

I can't think of any, and a quick grep through the source leads me to
the same few functions, domain construction and the
update_{cpu,group}_power() functions.
Comment 8 joe williams 2010-09-14 17:26:59 UTC
I believe I am hitting the same regression and have documented it here:

https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853

A fix for this would be greatly appreciated, this bug has caused a few near sleepless nights.
Comment 9 Anonymous Emailer 2010-09-16 15:56:32 UTC
Reply-To: peterz@infradead.org

Can someone who can reproduce this try with something like the below
stuck in?

---
 kernel/sched_fair.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a171138..dd992f9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2283,6 +2283,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
+	unsigned long scale_rt;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 		if (sched_feat(ARCH_POWER))
@@ -2302,12 +2303,16 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 
 	power >>= SCHED_LOAD_SHIFT;
 
-	power *= scale_rt_power(cpu);
+	scale_rt = scale_rt_power(cpu);
+	power *= scale_rt;
 	power >>= SCHED_LOAD_SHIFT;
 
 	if (!power)
 		power = 1;
 
+	if (WARN_ON((long)power <= 0))
+		printk(KERN_ERR "cpu_power = %ld; scale_rt = %ld\n", power, scale_rt);
+
 	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }
@@ -2332,6 +2337,10 @@ static void update_group_power(struct sched_domain *sd, int cpu)
 	} while (group != child->groups);
 
 	sdg->cpu_power = power;
+
+	if (WARN_ON((long)power <= 0))
+		printk(KERN_ERR "cpu_power = %ld\n", power);
+
 }
 
 /*
Comment 10 Chetan 2010-09-17 07:07:34 UTC
Hi Joe,
    I'm the original reporter of this bug. One salient feature of this bug in our case is that it has 


On Sep 14, 2010, at 5:27 PM, bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16991
> 
> 
> joe williams <joe@joetify.com> changed:
> 
>           What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                 CC|                            |joe@joetify.com
> 
> 
> 
> 
> --- Comment #8 from joe williams <joe@joetify.com>  2010-09-14 17:26:59 ---
> I believe I am hitting the same regression and have documented it here:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853
> 
> A fix for this would be greatly appreciated, this bug has caused a few near
> sleepless nights.
> 
> -- 
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You reported the bug.
Comment 11 Chetan 2010-09-17 07:24:28 UTC
(Oops fat-fingered the send button midway thru the mail)

So the salient feature is that we have *only* seen this crash on systems with longish uptimes... of the order of 6 months or so. At that time frame, the kernel seems to be hitting some sort of a wall and we saw a bunch of those boxes ( with around the same uptime ) start crashing within a few days of each other. None of those boxes has crashed again (and this started a few weeks ago for us) since the crash induced reboots. 

  Have you noticed a similar pattern? 

Chetan

On Sep 17, 2010, at 12:06 AM, Chetan Ahuja <chetan.ahuja@gmail.com> wrote:

> Hi Joe,
>    I'm the original reporter of this bug. One salient feature of this bug in
>    our case is that it has 
> 
> 
> On Sep 14, 2010, at 5:27 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>> 
>> 
>> joe williams <joe@joetify.com> changed:
>> 
>>          What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>                CC|                            |joe@joetify.com
>> 
>> 
>> 
>> 
>> --- Comment #8 from joe williams <joe@joetify.com>  2010-09-14 17:26:59 ---
>> I believe I am hitting the same regression and have documented it here:
>> 
>> https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853
>> 
>> A fix for this would be greatly appreciated, this bug has caused a few near
>> sleepless nights.
>> 
>> -- 
>> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
>> ------- You are receiving this mail because: -------
>> You reported the bug.
Comment 12 joe williams 2010-09-17 14:38:00 UTC
In my case it has been shorter, panics on a roughly weekly basis. Additionally it seemed to effect machines on Amazon EC2 more frequently than on physical hardware.
Comment 13 Chetan 2010-09-17 23:20:49 UTC
Hi Joe,
 
   Since you have higher reproducibility on this bug, it would be great if you could apply the patch suggested by peterz. The only caveat is that his patch applies to a later version of the kernel, not 2.6.32 (the update_cpu_power etc. functions have been moved the sched_fair.c in recent versions of the kernel. They used to be sched.c before). If you must use 2.6.32 kernel for any reason, here's the patch that'll apply to 2.6.32. (peterz, could you confirm that it follows what you'd intetended):


diff -u -r -p a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2010-09-17 23:15:29.000000000 +0000
+++ b/kernel/sched.c	2010-09-17 23:15:36.000000000 +0000
@@ -3640,6 +3640,7 @@ static void update_cpu_power(struct sche
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
+	unsigned long scale_rt;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -3657,11 +3658,16 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
-	power *= scale_rt_power(cpu);
+
+	scale_rt = scale_rt_power(cpu);
+	power *= scale_rt;
 	power >>= SCHED_LOAD_SHIFT;
 
 	if (!power)
 		power = 1;
+	if (WARN_ON((long)power <= 0)) {
+         	printk(KERN_ERR "cpu_power = %ld; scale_rt = %ld\n", power, scale_rt);
+	}
 
 	sdg->cpu_power = power;
 }
@@ -3686,6 +3692,10 @@ static void update_group_power(struct sc
 	} while (group != child->groups);
 
 	sdg->cpu_power = power;
+
+	if (WARN_ON((long)power <= 0))
+	       printk(KERN_ERR "cpu_power = %ld\n", power);
+
 }
 
 /**
Comment 14 Chetan 2010-09-17 23:22:27 UTC
And oh, this bug has had a fair bit of discussion on lkml in this thread:

http://lkml.indiana.edu/hypermail//linux/kernel/1008.3/01238.html
Comment 15 joe williams 2010-09-17 23:37:06 UTC
Unfortunately the machines that produce this bug with any regularity are live and production. Even on these it's hard to really tell when it's going to happen and on which machine, so I would probably need to apply it to all of them and I'm not sure I'm willing to do that. I hope you understand.
Comment 16 Chetan 2010-09-18 00:39:46 UTC
I understand. Similar situation here... (though I might have been more willing to spend the effort if I had seen even one case of this happening on a < 6month uptime). Well I guess the patch is out there if someone who can reproduce this cares to try.

 In the meantime, you could try out this (non-invasive) solution. It might affect the performance of the machines (especially if they have large number of cores and run lots of OS threads):

 sudo  mount -t cgroup -ocpuset cpuset /dev/cpuset
 sudo echo 0 > cpuset.sched_load_balance


 This has the effect of bypassing the section of the code that's crashing (it turns off load-balancing of the scheduler queues over all the cores in the system.)  We've had mixed results with this, with some of our codebase seeing adverse performance impact and some other parts doing ok. 

Hope that helps.
Chetan
Comment 17 Andrew Dickinson 2010-10-20 07:33:41 UTC
I submitted this patch the linux-kernel@vger.kernel.org:

It doesn't solve the root cause, but will prevent the divide_error from happening by cleanly handling the case where sds.total_pwr == 0

diff -ruwp a/kernel/sched_fair.c b/kernel/sched_fair.c
--- a/kernel/sched_fair.c	2010-10-19 23:47:51.000000000 -0700
+++ b/kernel/sched_fair.c	2010-10-20 00:08:17.000000000 -0700
@@ -1344,7 +1344,9 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+		avg_load = (avg_load * SCHED_LOAD_SCALE);
+		if (group->cpu_power) 
+			avg_load /= group->cpu_power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -2409,7 +2411,9 @@ static inline void update_sg_lb_stats(st
 	update_group_power(sd, this_cpu);
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE);
+	if (group->cpu_power)
+		sgs->avg_load /= group->cpu_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2692,7 +2696,7 @@ find_busiest_group(struct sched_domain *
 	if (!(*balance))
 		goto ret;
 
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds.busiest || sds.busiest_nr_running == 0 || sds.total_pwr == 0)
 		goto out_balanced;
 
 	if (sds.this_load >= sds.max_load)
@@ -2757,7 +2761,9 @@ find_busiest_queue(struct sched_group *g
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_LOAD_SCALE) / power;
+		wl = (wl * SCHED_LOAD_SCALE);
+		if (power)
+			wl /= power;
 
 		if (wl > max_load) {
 			max_load = wl;
Comment 18 Toralf Förster 2010-10-20 10:39:26 UTC
(In reply to comment #17)
> I submitted this patch the linux-kernel@vger.kernel.org:
> 
> It doesn't solve the root cause, but will prevent the divide_error from
> happening by cleanly handling the case where sds.total_pwr == 0
> 
I'm wondering whether at  least a BUG_ON or sth. similar should be added to debug this issue.
Comment 19 joe williams 2010-11-08 22:28:34 UTC
I have applied the patch in Comment #17 to some of my machines this morning, I'll report back when I can say if it fixes the issue or not.
Comment 20 Andrew Dickinson 2010-11-08 22:33:25 UTC
Thanks joe.

Do you have a way to reproduce this quickly?  In all of my cases it was only after many months of uptime.

-Andrew
Comment 21 joe williams 2010-11-08 22:35:54 UTC
I don't but in my case I would see this issue multiple times a week. If I don't get a panic in the next week or two I'm satisfied.
Comment 22 joe williams 2010-11-19 20:09:02 UTC
The patch seems solid, the panics don't seem to happen any longer on my machines.
Comment 23 Rudolfs Osins 2011-01-06 09:06:50 UTC
Were still hitting this bug in 2.6.32-309-ec2, I've documented the crashes (in comment #29):
https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853

In reply to #22:
Did you get any side-effects from the workaround patch?
Comment 24 joe williams 2011-02-06 07:25:12 UTC
#23 No side effects that I have been able to detect, as I mentioned I installed this patch on a group of servers on 2010-11-08 and they have running solid ever since.
Comment 25 Hervé COMMOWICK 2011-04-28 12:21:06 UTC
I think we have encounter the same bug, on 2 similar servers, which were connected on the same NFS server. The panic on the second server occured one hour after the first one.
Those servers are both debian servers with 2.6.32 kernel from lenny-backports.

here is the trace :

kernel: : [341496.279049] divide error: 0000 [#1] SMP 
kernel: : [341496.279081] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:04:03.0/class
kernel: : [341496.279125] CPU 0 
kernel: : [341496.279147] Modules linked in: nfsd exportfs tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext3 jbd m
ore snd_page_alloc psmouse dcdbas evdev pcspkr serio_raw power_meter button processor reiserfs dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid1 md_mod sd_mod crc_
 mptbase scsi_transport_sas scsi_mod usbcore nls_base bnx2 thermal fan thermal_sys [last unloaded: scsi_wait_scan]
kernel: : [341496.279445] Pid: 9844, comm: apache2 Not tainted 2.6.32-bpo.5-amd64 #1 PowerEdge R410
kernel: : [341496.279489] RIP: 0010:[<ffffffff81044c4b>]  [<ffffffff81044c4b>] find_busiest_group+0x3cf/0x875
kernel: : [341496.279544] RSP: 0018:ffff88009550f608  EFLAGS: 00010046
kernel: : [341496.279570] RAX: 0000000000000000 RBX: ffffffffffffffff RCX: ffffffff8103a501
kernel: : [341496.279613] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000200
kernel: : [341496.279656] RBP: ffff8800c300fd50 R08: 0000000000000000 R09: ffff8801afa46050
kernel: : [341496.279699] R10: ffff8801ad7fdcd0 R11: ffff8801ad7fdaf0 R12: 0000000000015780
kernel: : [341496.279743] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8801afa46000
kernel: : [341496.279786] FS:  00007fa312819750(0000) GS:ffff8800c3000000(0000) knlGS:0000000000000000
kernel: : [341496.279831] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: : [341496.279858] CR2: 0000000001360000 CR3: 00000000809ee000 CR4: 00000000000006f0
kernel: : [341496.279902] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: : [341496.279945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: : [341496.279988] Process apache2 (pid: 9844, threadinfo ffff88009550e000, task ffff880008b27810)
kernel: : [341496.280033] Stack:
kernel: : [341496.280053]  0000000000015788 0000000000015780 0000000000000008 0000000000015780
kernel: : [341496.280089] <0> 0000000000015780 0000000000015780 ffff8801ad4053b8 ffffffff8123fef5
kernel: : [341496.280142] <0> ffff8801adf7a000 ffffffff8124f16b 0000000000000000 ffff8800c300f9e0
kernel: : [341496.280211] Call Trace:
kernel: : [341496.280237]  [<ffffffff8123fef5>] ? sock_sendmsg+0xa3/0xbb
kernel: : [341496.280268]  [<ffffffff8124f16b>] ? dev_queue_xmit+0x35b/0x38d
kernel: : [341496.280297]  [<ffffffff812f92d7>] ? schedule+0x2cc/0x7da
kernel: : [341496.280326]  [<ffffffff81261c0b>] ? sch_direct_xmit+0x7f/0x14c
kernel: : [341496.280360]  [<ffffffffa0245b9b>] ? rpc_wait_bit_killable+0x0/0x31 [sunrpc]
kernel: : [341496.280393]  [<ffffffffa0245bc8>] ? rpc_wait_bit_killable+0x2d/0x31 [sunrpc]
kernel: : [341496.280424]  [<ffffffff812f9e45>] ? __wait_on_bit+0x41/0x70
kernel: : [341496.280454]  [<ffffffffa024170e>] ? xprt_timer+0x0/0x85 [sunrpc]
kernel: : [341496.280485]  [<ffffffffa0245b9b>] ? rpc_wait_bit_killable+0x0/0x31 [sunrpc]
kernel: : [341496.280515]  [<ffffffff812f9edf>] ? out_of_line_wait_on_bit+0x6b/0x77
kernel: : [341496.280545]  [<ffffffff81064a08>] ? wake_bit_function+0x0/0x23
kernel: : [341496.280575]  [<ffffffffa023f9fe>] ? call_transmit_status+0x3c/0x57 [sunrpc]
kernel: : [341496.280608]  [<ffffffffa0246310>] ? __rpc_execute+0xf2/0x24d [sunrpc]
kernel: : [341496.280639]  [<ffffffffa02404d4>] ? rpc_run_task+0x53/0x5b [sunrpc]
kernel: : [341496.280670]  [<ffffffffa02405c3>] ? rpc_call_sync+0x3d/0x5a [sunrpc]
kernel: : [341496.280706]  [<ffffffffa0290e76>] ? nfs3_rpc_wrapper+0x28/0x59 [nfs]
kernel: : [341496.280739]  [<ffffffffa02915ef>] ? nfs3_proc_getattr+0x55/0x79 [nfs]
kernel: : [341496.280771]  [<ffffffffa028457e>] ? __nfs_revalidate_inode+0xb1/0x1b6 [nfs]
kernel: : [341496.280802]  [<ffffffff81047e29>] ? finish_task_switch+0x3a/0xaf
kernel: : [341496.280833]  [<ffffffffa028372f>] ? nfs_attribute_timeout+0xe/0x58 [nfs]
kernel: : [341496.280865]  [<ffffffffa027f813>] ? nfs_lookup_revalidate+0x1d3/0x387 [nfs]
kernel: : [341496.280899]  [<ffffffff810b372a>] ? find_get_pages_tag+0x46/0xdd
kernel: : [341496.280928]  [<ffffffff810bb92e>] ? pagevec_lookup_tag+0x1a/0x21
kernel: : [341496.280961]  [<ffffffffa024760e>] ? rpcauth_lookup_credcache+0x1c0/0x1d2 [sunrpc]
kernel: : [341496.281009]  [<ffffffffa0281713>] ? nfs_do_access+0x78/0x3f0 [nfs]
kernel: : [341496.281038]  [<ffffffff810f59c9>] ? do_revalidate+0x12/0x45
kernel: : [341496.281066]  [<ffffffff810f5b2f>] ? do_lookup+0x133/0x15d
kernel: : [341496.281093]  [<ffffffff810f64fc>] ? __link_path_walk+0x5a5/0x6f5
kernel: : [341496.281121]  [<ffffffff810f687a>] ? path_walk+0x66/0xc9
kernel: : [341496.281148]  [<ffffffff810f7ce4>] ? do_path_lookup+0x20/0x77
kernel: : [341496.281175]  [<ffffffff810f7e70>] ? do_filp_open+0xe5/0x94b
kernel: : [341496.281207]  [<ffffffffa028f536>] ? nfs_write_mapping+0x69/0x8e [nfs]
kernel: : [341496.281237]  [<ffffffff810f1583>] ? cp_new_stat+0xe9/0xfc
kernel: : [341496.281264]  [<ffffffff81101279>] ? alloc_fd+0x67/0x10c
kernel: : [341496.281292]  [<ffffffff810ec8af>] ? do_sys_open+0x55/0xfc
kernel: : [341496.281319]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
kernel: : [341496.281347] Code: bc 24 a0 01 00 00 00 74 10 48 8b 94 24 a0 01 00 00 c7 02 00 00 00 00 eb 65 41 8b 77 08 48 8b 84 24 38 01 00 00 31 d2 48 c1 e0 0a <48> f7 f6 48 8b b4 24 40 01 00 00 48 89 84 24 30 01 00 00 31 c0 
kernel: : [341496.281553] RIP  [<ffffffff81044c4b>] find_busiest_group+0x3cf/0x875
kernel: : [341496.281584]  RSP <ffff88009550f608>
kernel: : [341496.281929] ---[ end trace 74b0eafab087280b ]---
Comment 26 nik@linuxbox.cz 2011-05-14 18:52:45 UTC
> Most odd, Venki's patch might be worth a shot. Any idea if your
> workloads changed recently? Does your workload include any RT tasks?
> 
> They crash after boot right, not on boot? 

Hello Peter, 
we're trying to hunt (I believe) same bug. To answer your question, on all four of my crashed machines, I was using pacemaker/corosync, ie clustering software, which has lots of RT priority tasks. So I think this might be important. I'll ask others if they're running similar workloads.

In case You're interested, please have a look on whole thread: http://www.gossamer-threads.com/lists/linux/kernel/1371841
Comment 27 Daniel Kahn Gillmor 2011-08-05 22:31:19 UTC
Created attachment 67672 [details]
proposed patch against 2.6.32

We were running into what appears to be this same issue with debian's stock 2.6.32 during periods of intensive I/O.

  https://support.mayfirst.org/ticket/4342

The patch i'm attaching here (nodbz-2.6.32.patch) has been running in production for over a month on some servers with no crashes.
Comment 28 Gavin McDonald 2011-08-09 20:55:54 UTC
Created attachment 68252 [details]
3x Kernel Panic Stack Traces

I think we have been hit by this bug 3x, in our production ec2 farm;

The first instance runs membase, and had been up for 20.86 days when it threw the kernel panic.

The second instance was a mongoDB node, which had been up for 42.48 days.

The third crash was on the same membase instance as the first panic, after only 4.04 days uptime.

Both servers are m2.2xlarge, running Ubuntu 10.04 LTS, kernel 2.6.32-312-ec2.  They are single-purpose machines, built from the same Canonical base image, plus monitoring tools, plus [membase | mongodb].  Both are under moderate to heavy production loads, MongoDB was handling ~400 connections/sec at time of panic, membase ~1900 conn/sec both times.

I am attaching 3 complete stack traces in kpanic.txt
Comment 29 Bas van der Doorn 2011-08-24 09:47:51 UTC
Created attachment 69842 [details]
Kernel crash screen photo
Comment 30 Bas van der Doorn 2011-08-24 09:55:39 UTC
Crash occurred on Ubuntu 10.04 LTS with kernel version amd64 2.6.32-32-generic. The machine runs KVM with 9 light-load virtual machine guests consuming about 10GB RAM of the total of 24GB physical RAM. Load normally sits between 1-2 on the quad core 5620 processor with hyperthreading enabled (8 virtual cores). The host environment also runs nagios3 to check the guests, but I/O load from this is quite low as it only concerns about 90 services to check.
Comment 31 Yuriy Yevtukhov 2011-09-04 14:07:15 UTC
Address to kernel developers: have this bug been fixed in some later kernels? I see 3.0.3 still has divisions like 

avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;

(find_idlest_group, update_sg_lb_stats)
without additional checks of group->sgp->power. 

Was the root cause of the problem found?

I faced this bug on 2.6.32.23 two or three times, having such kernel on about 70 highly loaded web-servers. It happened on servers with uptime > 6 months.

I'd like to know if I should update kernel to new one or aply patch manually.
Comment 32 Andrew Morton 2011-09-07 21:52:43 UTC
It's possible this bug was secretly fixed by e9e9250bc78e7f6342517214c0178a529807964b ("sched: Scale down cpu_power due to RT tasks") which added the 

  if (!power)
     power = 1;

to update_cpu_power().

However update_group_power() doesn't have that check and can perhaps set -> power back to zero if the child has no groups.  That's probably wrong, but you could try adding

--- a/kernel/sched_fair.c~a
+++ a/kernel/sched_fair.c
@@ -2681,6 +2681,9 @@ static void update_group_power(struct sc
                group = group->next;
        } while (group != child->groups);
 
+       if (!power)
+               power = 1;
+
        sdg->sgp->power = power;
 }
 
_
Comment 33 Ian Corbett 2012-05-23 20:48:38 UTC
Has there been any progress on this? We appear to still be experiencing it on 6 xeon machines running 2.6.34 while e9e9250bc78e7f6342517214c0178a529807964b appears to have rolled up in 2.6.32.
Comment 34 Lenny Shovsky 2012-07-18 13:45:51 UTC
Is there a patch that can be used for 2.6.18 kernel ?

Having issues with AMD Opteron 6234 systems ( dell c6145 ).
Can't boot up at all. 


full crash output is here.

http://pastebin.com/AtvtCwXf

sample is below.

thanks a lot in advance !



AMD Opteron(TM) Processor 6234                  stepping 02
Brought up 48 CPUs
testing NMI watchdog ... OK.
time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 2400.187 MHz processor.
divide error: 0000 [1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>]
find_busiest_group+0x23a/0x621
RSP: 0018:ffff81102805fdb8  EFLAGS: 00010046
RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff81102805fea8 R08: 0000000000000006 R09: 000000000000003a
R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
FS:  0000000000000000(0000) GS:ffff81010e95eec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810828060000, task ffff8104280537a0)
Stack:  0000000000000000 ffff81102805fee8 ffff81102805ff10 0000000000000000
 ffff81102805ff08 000000010100caa0 ffff81000100e260 0000000000000000
 0000000000000000 0000000000000000 0000000000000080 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7


Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
 RSP <ffff81102805fdb8>
 <0>Kernel panic - not syncing: Fatal exception
divide error: 0000 [2] SMP
last sysfs file:
CPU 5
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>]
find_busiest_group+0x23a/0x621
RSP: 0018:ffff81083616bdb8  EFLAGS: 00010046
RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff81083616bea8 R08: 0000000000000006 R09: 000000000000003a
R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
FS:  0000000000000000(0000) GS:ffff81183611a2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81010e9d0000, task ffff811c36155080)
Stack:  0000000000000000 ffff81083616bee8 ffff81083616bf10 0000000000000000
 ffff81083616bf08 0000000500000000 ffff81000102fc60 0000000000000000
 0000000000000000 0000000000000000 0000000000000080 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7


Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
 RSP <ffff81083616bdb8>
 <0>Kernel panic - not syncing: Fatal exception
 <0>divide error: 0000 [3] SMP
last sysfs file:
CPU 8
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>]
find_busiest_group+0x23a/0x621
RSP: 0018:ffff811036173db8  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff811036173ea8 R08: 0000000000000012 R09: 000000000000002e
R10: ffff8104363251c8 R11: 0000000000000048 R12: ffff8104363251c0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff810436286940(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81083618e000, task ffff810436229820)
Stack:  0000000000000000 ffff811036173ee8 ffff811036173f10 0000000000000000
 ffff811036173f08 0000000800000000 ffff8104360b38e0 0000000000000000
 ffff810436325180 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7
Comment 35 Yogesh Gaur 2013-09-05 06:43:00 UTC
Hello All,

Marking this bug as open, we are able to hit same problem on Kernel-3.8.x
Reproducibility : Very Low
Test case : Left system up, issur occurs after few days.
crash log
------------------------------------------------------------------------------------------------------
[    0.156521] Brought up 3 CPUs
[    0.158313] SMP: Total of 3 processors activated (4783.10 BogoMIPS).
[    0.160101] NET: Registered protocol family 16
[    0.160795] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.161529] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.162036] Division by zero in kernel.
[    0.162037] Backtrace: 
[    0.162041] [<c0016bbc>] (dump_backtrace+0x0/0x114) from [<c03126cc>] (dump_stack+0x20/0x24)
[    0.162043]  r6:00000000 r5:ffffffff r4:00000000 r3:00000000
[    0.162045] [<c03126ac>] (dump_stack+0x0/0x24) from [<c0016cf4>] (__div0+0x24/0x28)
[    0.162047] [<c0016cd0>] (__div0+0x0/0x28) from [<c01d70a0>] (Ldiv0+0x8/0x10)
[    0.162050] [<c00624e4>] (find_busiest_group+0x0/0xac4) from [<c0063078>] (load_balance+0xd0/0x7cc)
[    0.162051] [<c0062fa8>] (load_balance+0x0/0x7cc) from [<c0063824>] (rebalance_domains+0xb0/0x19c)
[    0.162053] [<c0063774>] (rebalance_domains+0x0/0x19c) from [<c0063960>] (run_rebalance_domains+0x50/0x148)
[    0.162055] [<c0063910>] (run_rebalance_domains+0x0/0x148) from [<c002eb7c>] (__do_softirq+0x134/0x328)
[    0.162057] [<c002ea48>] (__do_softirq+0x0/0x328) from [<c002f2a4>] (irq_exit+0xa8/0xb0)
[    0.162060] [<c002f1fc>] (irq_exit+0x0/0xb0) from [<c001332c>] (handle_IRQ+0x68/0xe0)
[    0.162061]  r4:c0475154 r3:db8c0638
[    0.162062] [<c00132c4>] (handle_IRQ+0x0/0xe0) from [<c00085f0>] (gic_handle_irq+0x50/0x98)
[    0.162065]  r8:c047e8b0 r7:db8c1f60 r6:c047b2dc r5:c0497898 r4:f8200100
[    0.162065] r3:00000000
------------------------------------------------------------------------------------------------------

I have checked that changes as proposed by Venkat are already present in mainline code i.e.
Function : update_cpu_power()

    if (!power)
        power = 1;

Function : scale_rt_power()
    if (unlikely(total < avg)) {
        /* Ensures that power won't end up being negative */
        available = 0;
    } else {
        available = total - avg;
    }

Can anyone please explain that what might gone wrong which leads to this crash apart from reasons mentioned by Chetan.
Can any one please let me know about exact reason for this crash apart from 2 reasons explained by Chetan?

--
Yogesh.
Comment 36 Yogesh Gaur 2013-09-05 07:00:17 UTC
This bug has been marked as Resolved_closed_fix .. can anyone please share actual fixed Patch and marked this bug in open state as I can't change its state.

--
Yogesh.
Comment 37 Daniel Kahn Gillmor 2013-09-05 14:14:02 UTC
in debian kernels, this was addressed with a patch from Ben Hutchings and Bjoern Boschman:

 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636797#46
Comment 38 Yogesh Gaur 2013-09-10 02:59:15 UTC
Thanks Daneil,

Can you please point me to the release in which this feature has been taken in debian mainline kernel release.
I checked source code from link
http://packages.debian.org/wheezy/linux-source-3.2
but in this source code,linux-source-3.2, and patch file,linux-patch-3.2-rt.patch , I didn't find this patch added.
Comment 39 Daniel Kahn Gillmor 2013-09-10 04:07:54 UTC
Yogesh, i don't undersatnd your question.  the debian kernel is (kind of by definition) not the mainline kernel -- it contains debian-specific patches.  The link i provided in comment 37 here points to a patch stored in the debian bug tracking system, which applies to 2.6.32 (which is in squeeze, not wheezy).  I don't see any associated patch in debian's 3.2 branch of the kernel, i'm afraid.  And the patch i linked to definitely doesn't apply to 3.2 directly.

(however, i also haven't run into the bug with the 3.2 kernel yet)

sorry to not be more helpful.
Comment 40 Yogesh Gaur 2013-09-10 05:37:46 UTC
Thanks Daniel,

Actually I am looking for latest kernel (Debian, Ubuntu, Redhat or Vanilla) release which has included this patch or done solution for this issue.

I checked latest vanilla kernel and found that check for power == 0 is being done in function update_cpu_power()
  + if (!power)
  + power = 1;

But as per my understanding power can also become zero in function update_group_power(), In patch at #37 was having check for this function too.

As you know this issue is very difficult to reproduce so I was looking for advice or solution applicable and tested on any latest release like 3.x.x.
Comment 41 James 2017-04-27 08:46:05 UTC
Hi all,
 Use this patch, 
 https://access.redhat.com/solutions/68466
 From: Salman Qazi <sqazi@google.com>
Date: Tue, 15 Nov 2011 14:12:06 -0800
Subject: sched, x86: Avoid unnecessary overflow in sched_clock

(Added the missing signed-off-by line)

In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow.  cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero.  We can solve this without losing
any precision.

We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111115221121.7262.88871.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fa7b917..431793e 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -32,6 +32,22 @@ extern int no_timer_check;
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ *
+ * In:
+ *
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * Although we may still have enough bits to store the value of ns,
+ * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
+ * leading to an incorrect result.
+ *
+ * To avoid this, we can decompose 'cycles' into quotient and remainder
+ * of division by SC.  Then,
+ *
+ * ns = (quot * SC + rem) * cyc2ns_scale / SC
+ *    = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
+ *
+ *			- sqazi@google.com
  */
 
 DECLARE_PER_CPU(unsigned long, cyc2ns);
@@ -41,9 +57,14 @@ DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
 
 static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
+	unsigned long long quot;
+	unsigned long long rem;
 	int cpu = smp_processor_id();
 	unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
-	ns += cyc * per_cpu(cyc2ns, cpu) >> CYC2NS_SCALE_FACTOR;
+	quot = (cyc >> CYC2NS_SCALE_FACTOR);
+	rem = cyc & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
+	ns += quot * per_cpu(cyc2ns, cpu) +
+		((rem * per_cpu(cyc2ns, cpu)) >> CYC2NS_SCALE_FACTOR);
 	return ns;
 }
 
-- 
 but the installation of patch, restart the system, more than and 240 days after, the os stack restart again

Note You need to log in before you can comment on or make changes to this bug.