Hi,

We upgraded to the "official" 2.6.32 kernel on our small cluster of Linux boxes a few months ago. Recently, within the last week or two, we've started seeing a divide-by-zero crash on multiple of these servers at seemingly random times. We haven't been able to catch one of these while attached to our console servers, but luckily we can capture an image of the last screen of the stack trace, one of which I'm adding as an attachment.

Using the RIP in that trace, I was able to pin down the line of code where this crash is happening to the following (in the 2.6.32 tree):

The function
/usr/src/linux/kernel/sched.c:3769

		*balance = 0;
		return;
	}

	/* Adjust by relative CPU power of the group */
	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

    aff5:	48 c1 e0 0a	shl    $0xa,%rax
    aff9:	48 f7 f6	div    %rsi

So it looks like group->cpu_power happens to be zero in these crashes. All the crashes we've examined have been at the exact same address. The same code structure is present in the 2.6.35 series kernels, except it has moved to sched_fair.c.

Note: the above is a snippet from an objdump of sched.o built with the -g flag. The instruction address in this dump doesn't match the RIP from the screenshot since these are unrelocated addresses. I did match the assembly here against that in the full vmlinux objdump, and I'm quite sure of the location. (The -g build of the entire kernel is too large for me to work with.)

Btw, here's a report of another divide-by-zero in find_busiest_group in a 2.6.35 kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135

Looking at the code, what strikes me is that the group->cpu_power fields are init'd to 0 in the build_xxx_sched_groups functions. But I don't see any synchronization between these places -- where group->cpu_power starts off as zero -- and the usage of that variable (as a denominator) in the find_busiest_group and find_idlest_group functions.
It's possible that there's some logical "can't happen" type synchronization inherent in the design that's not visible in the code. If so, I'd love to understand the argument that makes it ok.

Thanks,
Chetan
Created attachment 27921 [details] screen capture of div-zero crash in find_busiest_group
cc'ed Peter, marked as a regression. Nice bug report - thanks! I don't immediately see anything in later kernels which addressed this.
Thanks Andrew. Btw, I left out some information. The line that says:

    The function

should read:

    The function this line of code belongs to is update_sg_lb_stats (inlined into find_busiest_group).
Reply-To: peterz@infradead.org

On Tue, 2010-08-24 at 23:00 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>
> Andrew Morton <akpm@linux-foundation.org> changed:
>
>            What    |Removed |Added
> ----------------------------------------------------------------------------
>                  CC|        |a.p.zijlstra@chello.nl,
>                    |        |akpm@linux-foundation.org
>          Regression|No      |Yes
>
> --- Comment #2 from Andrew Morton <akpm@linux-foundation.org> 2010-08-24 23:00:39 ---
> cc'ed Peter, marked as a regression.
>
> Nice bug report - thanks! I don't immediately see anything in later kernels
> which addressed this.

Chetan writes:

> Looking at the code, what strikes me is that group->cpu_power fields are
> initt'd to 0 in build_xxx_sched_groups. But I don't see any synchronization
> between these places - where group->cpu_power starts off as zero -- and the
> usage of that variable (as a denominator) in find_busiest_group and
> find_idlest_group functions. It's possible that there's some logical "can't
> happen" type synchronization inherent in the design that's not visible in the
> code. If so, I'd love to understand the argument that makes it ok.

Look at __build_sched_domains(); we first construct the groups using the build_*_groups() functions, which as you note initialize ->cpu_power to 0. However, after that we run through init_*_groups_power(), which will build up the group power (using +=, hence the need for the initial 0).

So group->cpu_power should not be 0 after that, which is before we attach the sched domain tree to the cpu runqueues, which makes them visible to the load-balancer code.

Now all this is slightly complicated by the fact that we dynamically update ->cpu_power these days -- see update_cpu_power() and its callers. But again, that should never result in group->cpu_power being 0.
See update_cpu_power(), which updates the singleton (cpu) groups; it ensures ->cpu_power never becomes 0:

	if (!power)
		power = 1;

	cpu_rq(cpu)->cpu_power = power;
	sdg->cpu_power = power;

Its caller, update_group_power(), then constructs the ->cpu_power of aggregate groups by computing the direct sum of its subgroups, which, given that the power of each subgroup should be strictly >0, can never be 0.

Now clearly there's some hole in all this since your kernel is indeed exploding, but I'm not seeing it (in my .36-rc2 code).
(In reply to comment #4)
> Reply-To: peterz@infradead.org

Hi Peter,

Thanks for the detailed response. My comments are inline:

> Chetan writes:
>
> > Looking at the code, what strikes me is that group->cpu_power fields are
> > initt'd to 0 in build_xxx_sched_groups. But I don't see any synchronization
> > between these places - where group->cpu_power starts off as zero -- and the
> > usage of that variable (as a denominator) in find_busiest_group and
> > find_idlest_group functions. It's possible that there's some logical "can't
> > happen" type synchronization inherent in the design that's not visible in the
> > code. If so, I'd love to understand the argument that makes it ok.
>
> Look at __build_sched_domains(); we first construct the groups using the
> build_*_groups() functions, which as you note initialize ->cpu_power to
> 0. However, after that we run through init_*_groups_power() which will
> build up the group power (using +=, hence the need for the initial 0).
>
> So group->cpu_power should not be 0 after that, which is before we
> attach the sched domain tree to the cpu runqueues, which makes them
> visible to the load-balancer code.

Sounds good. That's what I meant by "by design" synchronization. Even though I'd like to express caution about assuming ordering of the init_group_power and attach_domain operations without explicit synchronization here, I don't think this is the problem I'm seeing. From your description, it sounds like the init_groups and attach_domains steps run only once, at boot time (is that correct?). But in each of the cases of this crash I've seen, the systems had been up for months.

> Now all this is slightly complicated by the fact that we dynamically
> update ->cpu_power these days -- see update_cpu_power() and its callers.
> But again, that should never result in group->cpu_power being 0.
> See update_cpu_power(), which updates the singleton (cpu) groups, it
> ensures ->cpu_power never becomes 0:
>
>         if (!power)
>                 power = 1;
>
>         cpu_rq(cpu)->cpu_power = power;
>         sdg->cpu_power = power;

If this is the only place group->cpu_power can be updated (and that's what it looks like from my code-read... which might be less than 100% exhaustive), then here's my long-shot theory:

A store into sdg->cpu_power is not guaranteed to be atomic in a portable manner, and specifically not on the x86_64 architecture, unless some alignment requirements are met (8-byte alignment, if I remember correctly). So I would have converted this into an atomic write of some sort. Now, as I said, it *is* a long shot and the likelihood of this particular race resulting in this particular crash is pretty low. But at least the code can be "more correct" in this one place.

As for the extremely low probability thing, fwiw, we *have* been running this kernel on a couple hundred 8-core servers for a few months now. And we've seen this bug only recently (all within a small window of time, curiously enough).

> It's caller: update_group_power(), then constructs the ->cpu_power of
> aggregate groups by computing the direct sum of its subgroups, which
> given that the power of each subgroup should be strictly >0, can never
> be 0.
>
> Now clearly there's some hole in all this since your kernel is indeed
> exploding, but I'm not seeing it (in my .36-rc2 code).

Have we missed any place where cpu_power is updated, or where the group structs are freed or re-initialized in any way?

Chetan
Some information about the system and the kernel build, for the record. Let me know any other information you need:

a) Kernel version: 2.6.32 release from kernel.org. (Though a similar divide-by-zero has been reported as recently as 2.6.35 in an Ubuntu distribution kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135)

b) Hardware: 8-core Nehalem (Intel E5520); /proc/cpuinfo shows 16 "hyperthreaded" cores.

Some relevant CONFIG settings:

CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
...
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_SCHED_TRACER is not set
Reply-To: peterz@infradead.org

On Wed, 2010-08-25 at 22:51 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > So group->cpu_power should not be 0 after that, which is before we
> > attach the sched domain tree to the cpu runqueues, which makes them
> > visible to the load-balancer code.
>
> Sounds good. That's what I meant by "by design" synchronization. Even though
> I'd like to express caution about assuming ordering of init_group_power and
> attach_domain operations without explicit synchronization here, I don't think
> this is the problem I'm seeing. From your description, it sounds like the
> init_group and attach_domains all run only once at boot time (is that
> correct?). But in each of the cases of this crash I've seen, the systems had
> been up for months.

There's a few places where we can change the domain tree once it's attached; these include cpu-hotplug and cpuset creation/destruction.

If you look at cpu_attach_domain() you'll find that we do the actual attach using rcu_assign_pointer(), which includes a write-memory-barrier before doing the actual assignment. Code traversing the domain tree (e.g. for_each_domain()) will use rcu_dereference(), which includes the matching read-memory-barrier.

If we change the tree, we first do detach_destroy_domains(), which will attach the NULL tree and synchronize_sched() (a machine-wide preemption sync, ensuring each cpu has scheduled at least once between call and return), which ensures there's no remaining user left (load-balancing is non-preemptable).

> > Now all this is slightly complicated by the fact that we dynamically
> > update ->cpu_power these days -- see update_cpu_power() and its callers.
> > But again, that should never result in group->cpu_power being 0.
> > See update_cpu_power(), which updates the singleton (cpu) groups, it
> > ensures ->cpu_power never becomes 0:
> >
> >         if (!power)
> >                 power = 1;
> >
> >         cpu_rq(cpu)->cpu_power = power;
> >         sdg->cpu_power = power;
>
> If this is the only place group->cpu_power can be updated (and that's what
> it looks like from my code-read... which might be less that 100% exhaustive),
> then here's my long-shot theory:
>
> Given that a store into sdg->cpu_power is not guaranteed to be atomic
> in a portable manner, and specifically on the x86_64 architecture -- unless
> some alignment requirements are met. ( 8byte alignment if I remember correctly)
> -- I would have converted this into an atomic write of some sort. Now as I
> said, it *is* a long shot and the likelihood of this particular race resulting
> in this particular crash is pretty low. But at least the code can be "more
> correct" in this one place.

The compiler is assumed to naturally align stuff; in particular, have a look at atomic_read()/atomic_set() ;-)

The Linux kernel code assumes variables are naturally aligned and that word-sized reads and stores (32 bits on 32-bit machines, 32 and 64 bits on a 64-bit machine) are machine-atomic.

> As for the extremely low probability thing, fwiw, we *have* been running
> this kernel in a couple hundred 8 core servers for a few months now. And we've
> seen this bug only recently (all within a small window of time, curiously
> enough).

Most odd, Venki's patch might be worth a shot. Any idea if your workloads changed recently? Does your workload include any RT tasks?

They crash after boot, right, not on boot? I've once seen a creative BIOS cause the domain construction to fail, resulting in a similar explosion on boot (the BIOS messed up the machine topology detection code somewhere in the low-level arch/x86 bootup code, resulting in the scheduler not getting the right cpu masks and constructing wonky things).
> > It's caller: update_group_power(), then constructs the ->cpu_power of
> > aggregate groups by computing the direct sum of its subgroups, which
> > given that the power of each subgroup should be strictly >0, can never
> > be 0.
> >
> > Now clearly there's some hole in all this since your kernel is indeed
> > exploding, but I'm not seeing it (in my .36-rc2 code).
>
> Have we missed anywhere cpu_power is being updated or the group structs are
> being freed or re-inited in any way ???

I can't think of any, and a quick grep through the source leads me to the same few functions: domain construction and the update_{cpu,group}_power() functions.
I believe I am hitting the same regression and have documented it here: https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853 A fix for this would be greatly appreciated, this bug has caused a few near sleepless nights.
Reply-To: peterz@infradead.org

Can someone who can reproduce this try with something like the below stuck in?

---
 kernel/sched_fair.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a171138..dd992f9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2283,6 +2283,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
+	unsigned long scale_rt;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 		if (sched_feat(ARCH_POWER))
@@ -2302,12 +2303,16 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 
 	power >>= SCHED_LOAD_SHIFT;
 
-	power *= scale_rt_power(cpu);
+	scale_rt = scale_rt_power(cpu);
+	power *= scale_rt;
 	power >>= SCHED_LOAD_SHIFT;
 
 	if (!power)
 		power = 1;
 
+	if (WARN_ON((long)power <= 0))
+		printk(KERN_ERR "cpu_power = %ld; scale_rt = %ld\n", power, scale_rt);
+
 	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }
@@ -2332,6 +2337,10 @@ static void update_group_power(struct sched_domain *sd, int cpu)
 	} while (group != child->groups);
 
 	sdg->cpu_power = power;
+
+	if (WARN_ON((long)power <= 0))
+		printk(KERN_ERR "cpu_power = %ld\n", power);
+
 }
 
 /*
Hi Joe, I'm the original reporter of this bug. One salient feature of this bug in our case is that it has On Sep 14, 2010, at 5:27 PM, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=16991 > > > joe williams <joe@joetify.com> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |joe@joetify.com > > > > > --- Comment #8 from joe williams <joe@joetify.com> 2010-09-14 17:26:59 --- > I believe I am hitting the same regression and have documented it here: > > https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853 > > A fix for this would be greatly appreciated, this bug has caused a few near > sleepless nights. > > -- > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You reported the bug.
(Oops, fat-fingered the send button midway through the mail.)

So the salient feature is that we have *only* seen this crash on systems with longish uptimes, on the order of 6 months or so. At that time frame the kernel seems to be hitting some sort of wall, and we saw a bunch of those boxes (with around the same uptime) start crashing within a few days of each other. None of those boxes has crashed again since the crash-induced reboots (and this started a few weeks ago for us).

Have you noticed a similar pattern?

Chetan
In my case it has been shorter: panics on a roughly weekly basis. Additionally, it seemed to affect machines on Amazon EC2 more frequently than physical hardware.
Hi Joe,

Since you have higher reproducibility on this bug, it would be great if you could apply the patch suggested by peterz. The only caveat is that his patch applies to a later version of the kernel, not 2.6.32 (the update_cpu_power etc. functions have been moved to sched_fair.c in recent kernels; they used to be in sched.c). If you must use a 2.6.32 kernel for any reason, here's the patch that'll apply to 2.6.32 (peterz, could you confirm that it follows what you'd intended):

diff -u -r -p a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2010-09-17 23:15:29.000000000 +0000
+++ b/kernel/sched.c	2010-09-17 23:15:36.000000000 +0000
@@ -3640,6 +3640,7 @@ static void update_cpu_power(struct sche
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
+	unsigned long scale_rt;
 
 	if (sched_feat(ARCH_POWER))
 		power *= arch_scale_freq_power(sd, cpu);
@@ -3657,11 +3658,16 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
-	power *= scale_rt_power(cpu);
+
+	scale_rt = scale_rt_power(cpu);
+	power *= scale_rt;
 	power >>= SCHED_LOAD_SHIFT;
 
 	if (!power)
 		power = 1;
 
+	if (WARN_ON((long)power <= 0)) {
+		printk(KERN_ERR "cpu_power = %ld; scale_rt = %ld\n", power, scale_rt);
+	}
 	sdg->cpu_power = power;
 }
 
@@ -3686,6 +3692,10 @@ static void update_group_power(struct sc
 	} while (group != child->groups);
 
 	sdg->cpu_power = power;
+
+	if (WARN_ON((long)power <= 0))
+		printk(KERN_ERR "cpu_power = %ld\n", power);
+
 }
 
 /**
And oh, this bug has had a fair bit of discussion on lkml in this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1008.3/01238.html
Unfortunately, the machines that produce this bug with any regularity are live, production machines. Even on these it's hard to tell when it's going to happen and on which machine, so I would probably need to apply the patch to all of them, and I'm not sure I'm willing to do that. I hope you understand.
I understand. Similar situation here... (though I might have been more willing to spend the effort if I had seen even one case of this happening at under 6 months of uptime). Well, I guess the patch is out there if someone who can reproduce this cares to try.

In the meantime, you could try out this (non-invasive) workaround. It might affect the performance of the machines (especially if they have a large number of cores and run lots of OS threads):

sudo mount -t cgroup -o cpuset cpuset /dev/cpuset
echo 0 | sudo tee /dev/cpuset/cpuset.sched_load_balance

(Note that "sudo echo 0 > cpuset.sched_load_balance" would not work: the redirect is performed by the unprivileged shell, not by sudo.)

This has the effect of bypassing the section of the code that's crashing (it turns off load-balancing of the scheduler queues over all the cores in the system). We've had mixed results with this, with some of our codebase seeing an adverse performance impact and other parts doing OK.

Hope that helps.
Chetan
I submitted this patch to linux-kernel@vger.kernel.org. It doesn't solve the root cause, but it will prevent the divide_error from happening by cleanly handling the case where sds.total_pwr == 0:

diff -ruwp a/kernel/sched_fair.c b/kernel/sched_fair.c
--- a/kernel/sched_fair.c	2010-10-19 23:47:51.000000000 -0700
+++ b/kernel/sched_fair.c	2010-10-20 00:08:17.000000000 -0700
@@ -1344,7 +1344,9 @@ find_idlest_group(struct sched_domain *s
 	}
 
 	/* Adjust by relative CPU power of the group */
-	avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	avg_load = (avg_load * SCHED_LOAD_SCALE);
+	if (group->cpu_power)
+		avg_load /= group->cpu_power;
 
 	if (local_group) {
 		this_load = avg_load;
@@ -2409,7 +2411,9 @@ static inline void update_sg_lb_stats(st
 		update_group_power(sd, this_cpu);
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
+	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE);
+	if (group->cpu_power)
+		sgs->avg_load /= group->cpu_power;
 
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -2692,7 +2696,7 @@ find_busiest_group(struct sched_domain *
 	if (!(*balance))
 		goto ret;
 
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds.busiest || sds.busiest_nr_running == 0 || sds.total_pwr == 0)
 		goto out_balanced;
 
 	if (sds.this_load >= sds.max_load)
@@ -2757,7 +2761,9 @@ find_busiest_queue(struct sched_group *g
 	 * the load can be moved away from the cpu that is potentially
 	 * running at a lower capacity.
 	 */
-	wl = (wl * SCHED_LOAD_SCALE) / power;
+	wl = (wl * SCHED_LOAD_SCALE);
+	if (power)
+		wl /= power;
 
 	if (wl > max_load) {
 		max_load = wl;
(In reply to comment #17)
> I submitted this patch the linux-kernel@vger.kernel.org:
>
> It doesn't solve the root cause, but will prevent the divide_error from
> happening by cleanly handling the case where sds.total_pwr == 0

I'm wondering whether at least a BUG_ON or something similar should be added to help debug this issue.
I have applied the patch in Comment #17 to some of my machines this morning, I'll report back when I can say if it fixes the issue or not.
Thanks joe. Do you have a way to reproduce this quickly? In all of my cases it was only after many months of uptime. -Andrew
I don't, but in my case I would see this issue multiple times a week. If I don't get a panic in the next week or two, I'll be satisfied.
The patch seems solid; the panics no longer happen on my machines.
We're still hitting this bug in 2.6.32-309-ec2; I've documented the crashes (in comment #29): https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/614853

In reply to #22: did you get any side effects from the workaround patch?
#23: No side effects that I have been able to detect. As I mentioned, I installed this patch on a group of servers on 2010-11-08 and they have been running solid ever since.
I think we have encountered the same bug on 2 similar servers, which were connected to the same NFS server. The panic on the second server occurred one hour after the first one. Both are Debian servers with the 2.6.32 kernel from lenny-backports.

Here is the trace:

kernel: : [341496.279049] divide error: 0000 [#1] SMP
kernel: : [341496.279081] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:04:03.0/class
kernel: : [341496.279125] CPU 0
kernel: : [341496.279147] Modules linked in: nfsd exportfs tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext3 jbd m ore snd_page_alloc psmouse dcdbas evdev pcspkr serio_raw power_meter button processor reiserfs dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid1 md_mod sd_mod crc_ mptbase scsi_transport_sas scsi_mod usbcore nls_base bnx2 thermal fan thermal_sys [last unloaded: scsi_wait_scan]
kernel: : [341496.279445] Pid: 9844, comm: apache2 Not tainted 2.6.32-bpo.5-amd64 #1 PowerEdge R410
kernel: : [341496.279489] RIP: 0010:[<ffffffff81044c4b>] [<ffffffff81044c4b>] find_busiest_group+0x3cf/0x875
kernel: : [341496.279544] RSP: 0018:ffff88009550f608 EFLAGS: 00010046
kernel: : [341496.279570] RAX: 0000000000000000 RBX: ffffffffffffffff RCX: ffffffff8103a501
kernel: : [341496.279613] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000200
kernel: : [341496.279656] RBP: ffff8800c300fd50 R08: 0000000000000000 R09: ffff8801afa46050
kernel: : [341496.279699] R10: ffff8801ad7fdcd0 R11: ffff8801ad7fdaf0 R12: 0000000000015780
kernel: : [341496.279743] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8801afa46000
kernel: : [341496.279786] FS: 00007fa312819750(0000) GS:ffff8800c3000000(0000) knlGS:0000000000000000
kernel: : [341496.279831] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: : [341496.279858] CR2: 0000000001360000 CR3: 00000000809ee000 CR4: 00000000000006f0
kernel: : [341496.279902] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: : [341496.279945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: : [341496.279988] Process apache2 (pid: 9844, threadinfo ffff88009550e000, task ffff880008b27810)
kernel: : [341496.280033] Stack:
kernel: : [341496.280053]  0000000000015788 0000000000015780 0000000000000008 0000000000015780
kernel: : [341496.280089] <0> 0000000000015780 0000000000015780 ffff8801ad4053b8 ffffffff8123fef5
kernel: : [341496.280142] <0> ffff8801adf7a000 ffffffff8124f16b 0000000000000000 ffff8800c300f9e0
kernel: : [341496.280211] Call Trace:
kernel: : [341496.280237]  [<ffffffff8123fef5>] ? sock_sendmsg+0xa3/0xbb
kernel: : [341496.280268]  [<ffffffff8124f16b>] ? dev_queue_xmit+0x35b/0x38d
kernel: : [341496.280297]  [<ffffffff812f92d7>] ? schedule+0x2cc/0x7da
kernel: : [341496.280326]  [<ffffffff81261c0b>] ? sch_direct_xmit+0x7f/0x14c
kernel: : [341496.280360]  [<ffffffffa0245b9b>] ? rpc_wait_bit_killable+0x0/0x31 [sunrpc]
kernel: : [341496.280393]  [<ffffffffa0245bc8>] ? rpc_wait_bit_killable+0x2d/0x31 [sunrpc]
kernel: : [341496.280424]  [<ffffffff812f9e45>] ? __wait_on_bit+0x41/0x70
kernel: : [341496.280454]  [<ffffffffa024170e>] ? xprt_timer+0x0/0x85 [sunrpc]
kernel: : [341496.280485]  [<ffffffffa0245b9b>] ? rpc_wait_bit_killable+0x0/0x31 [sunrpc]
kernel: : [341496.280515]  [<ffffffff812f9edf>] ? out_of_line_wait_on_bit+0x6b/0x77
kernel: : [341496.280545]  [<ffffffff81064a08>] ? wake_bit_function+0x0/0x23
kernel: : [341496.280575]  [<ffffffffa023f9fe>] ? call_transmit_status+0x3c/0x57 [sunrpc]
kernel: : [341496.280608]  [<ffffffffa0246310>] ? __rpc_execute+0xf2/0x24d [sunrpc]
kernel: : [341496.280639]  [<ffffffffa02404d4>] ? rpc_run_task+0x53/0x5b [sunrpc]
kernel: : [341496.280670]  [<ffffffffa02405c3>] ? rpc_call_sync+0x3d/0x5a [sunrpc]
kernel: : [341496.280706]  [<ffffffffa0290e76>] ? nfs3_rpc_wrapper+0x28/0x59 [nfs]
kernel: : [341496.280739]  [<ffffffffa02915ef>] ? nfs3_proc_getattr+0x55/0x79 [nfs]
kernel: : [341496.280771]  [<ffffffffa028457e>] ? __nfs_revalidate_inode+0xb1/0x1b6 [nfs]
kernel: : [341496.280802]  [<ffffffff81047e29>] ? finish_task_switch+0x3a/0xaf
kernel: : [341496.280833]  [<ffffffffa028372f>] ? nfs_attribute_timeout+0xe/0x58 [nfs]
kernel: : [341496.280865]  [<ffffffffa027f813>] ? nfs_lookup_revalidate+0x1d3/0x387 [nfs]
kernel: : [341496.280899]  [<ffffffff810b372a>] ? find_get_pages_tag+0x46/0xdd
kernel: : [341496.280928]  [<ffffffff810bb92e>] ? pagevec_lookup_tag+0x1a/0x21
kernel: : [341496.280961]  [<ffffffffa024760e>] ? rpcauth_lookup_credcache+0x1c0/0x1d2 [sunrpc]
kernel: : [341496.281009]  [<ffffffffa0281713>] ? nfs_do_access+0x78/0x3f0 [nfs]
kernel: : [341496.281038]  [<ffffffff810f59c9>] ? do_revalidate+0x12/0x45
kernel: : [341496.281066]  [<ffffffff810f5b2f>] ? do_lookup+0x133/0x15d
kernel: : [341496.281093]  [<ffffffff810f64fc>] ? __link_path_walk+0x5a5/0x6f5
kernel: : [341496.281121]  [<ffffffff810f687a>] ? path_walk+0x66/0xc9
kernel: : [341496.281148]  [<ffffffff810f7ce4>] ? do_path_lookup+0x20/0x77
kernel: : [341496.281175]  [<ffffffff810f7e70>] ? do_filp_open+0xe5/0x94b
kernel: : [341496.281207]  [<ffffffffa028f536>] ? nfs_write_mapping+0x69/0x8e [nfs]
kernel: : [341496.281237]  [<ffffffff810f1583>] ? cp_new_stat+0xe9/0xfc
kernel: : [341496.281264]  [<ffffffff81101279>] ? alloc_fd+0x67/0x10c
kernel: : [341496.281292]  [<ffffffff810ec8af>] ? do_sys_open+0x55/0xfc
kernel: : [341496.281319]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
kernel: : [341496.281347] Code: bc 24 a0 01 00 00 00 74 10 48 8b 94 24 a0 01 00 00 c7 02 00 00 00 00 eb 65 41 8b 77 08 48 8b 84 24 38 01 00 00 31 d2 48 c1 e0 0a <48> f7 f6 48 8b b4 24 40 01 00 00 48 89 84 24 30 01 00 00 31 c0
kernel: : [341496.281553] RIP  [<ffffffff81044c4b>] find_busiest_group+0x3cf/0x875
kernel: : [341496.281584]  RSP <ffff88009550f608>
kernel: : [341496.281929] ---[ end trace 74b0eafab087280b ]---
> Most odd, Venki's patch might be worth a shot. Any idea if your
> workloads changed recently? Does your workload include any RT tasks?
>
> They crash after boot right, not on boot?

Hello Peter, we're trying to hunt down (I believe) the same bug. To answer your question: on all four of my crashed machines I was using pacemaker/corosync, i.e. clustering software, which has lots of RT-priority tasks. So I think this might be important. I'll ask the others if they're running similar workloads.

In case you're interested, please have a look at the whole thread: http://www.gossamer-threads.com/lists/linux/kernel/1371841
Created attachment 67672 [details] proposed patch against 2.6.32

We were running into what appears to be this same issue with Debian's stock 2.6.32 during periods of intensive I/O: https://support.mayfirst.org/ticket/4342

The patch I'm attaching here (nodbz-2.6.32.patch) has been running in production for over a month on some servers with no crashes.
Created attachment 68252 [details] 3x Kernel Panic Stack Traces

I think we have been hit by this bug 3 times in our production EC2 farm. The first instance runs membase and had been up for 20.86 days when it threw the kernel panic. The second instance was a MongoDB node, which had been up for 42.48 days. The third crash was on the same membase instance as the first panic, after only 4.04 days of uptime.

Both servers are m2.2xlarge, running Ubuntu 10.04 LTS, kernel 2.6.32-312-ec2. They are single-purpose machines, built from the same Canonical base image, plus monitoring tools, plus [membase | mongodb]. Both are under moderate to heavy production loads; MongoDB was handling ~400 connections/sec at the time of panic, membase ~1900 conn/sec both times.

I am attaching 3 complete stack traces in kpanic.txt.
Created attachment 69842 [details] Kernel crash screen photo
The crash occurred on Ubuntu 10.04 LTS with the amd64 2.6.32-32-generic kernel. The machine runs KVM with 9 light-load virtual machine guests consuming about 10GB of the total 24GB of physical RAM. Load normally sits between 1 and 2 on the quad-core 5620 processor with hyperthreading enabled (8 virtual cores). The host environment also runs nagios3 to check the guests, but I/O load from this is quite low as it only covers about 90 service checks.
A question for the kernel developers: has this bug been fixed in some later kernel? I see 3.0.3 still has divisions like

	avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;

(in find_idlest_group and update_sg_lb_stats) without additional checks on group->sgp->power. Was the root cause of the problem ever found?

I faced this bug on 2.6.32.23 two or three times, running that kernel on about 70 highly loaded web servers. It happened on servers with uptime > 6 months. I'd like to know whether I should update to a newer kernel or apply the patch manually.
It's possible this bug was secretly fixed by e9e9250bc78e7f6342517214c0178a529807964b ("sched: Scale down cpu_power due to RT tasks"), which added the

	if (!power)
		power = 1;

to update_cpu_power(). However update_group_power() doesn't have that check and can perhaps set ->power back to zero if the child has no groups. That's probably wrong, but you could try adding:

--- a/kernel/sched_fair.c~a
+++ a/kernel/sched_fair.c
@@ -2681,6 +2681,9 @@ static void update_group_power(struct sc
 		group = group->next;
 	} while (group != child->groups);
 
+	if (!power)
+		power = 1;
+
 	sdg->sgp->power = power;
 }
_
Has there been any progress on this? We appear to still be experiencing it on 6 Xeon machines running 2.6.34, even though e9e9250bc78e7f6342517214c0178a529807964b appears to have been rolled into 2.6.32.
Is there a patch that can be used for the 2.6.18 kernel? Having issues with AMD Opteron 6234 systems (Dell C6145) — they can't boot up at all. The full crash output is here: http://pastebin.com/AtvtCwXf A sample is below. Thanks a lot in advance!

AMD Opteron(TM) Processor 6234 stepping 02
Brought up 48 CPUs
testing NMI watchdog ... OK.
time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
time.c: Detected 2400.187 MHz processor.
divide error: 0000 [1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
RSP: 0018:ffff81102805fdb8  EFLAGS: 00010046
RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff81102805fea8 R08: 0000000000000006 R09: 000000000000003a
R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
FS:  0000000000000000(0000) GS:ffff81010e95eec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810828060000, task ffff8104280537a0)
Stack:  0000000000000000 ffff81102805fee8 ffff81102805ff10 0000000000000000
 ffff81102805ff08 000000010100caa0 ffff81000100e260 0000000000000000
 0000000000000000 0000000000000000 0000000000000080 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7

Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
 RSP <ffff81102805fdb8>
<0>Kernel panic - not syncing: Fatal exception
divide error: 0000 [2] SMP
last sysfs file:
CPU 5
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
RSP: 0018:ffff81083616bdb8  EFLAGS: 00010046
RAX: 0000000000004000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff81083616bea8 R08: 0000000000000006 R09: 000000000000003a
R10: ffff810836279e08 R11: 0000000000000048 R12: ffff810836279e00
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000004000
FS:  0000000000000000(0000) GS:ffff81183611a2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81010e9d0000, task ffff811c36155080)
Stack:  0000000000000000 ffff81083616bee8 ffff81083616bf10 0000000000000000
 ffff81083616bf08 0000000500000000 ffff81000102fc60 0000000000000000
 0000000000000000 0000000000000000 0000000000000080 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7

Code: 48 f7 f6 49 c1 ee 07 83 7d cc 00 74 1c 48 8b 55 d0 4c 89 a5
RIP  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
 RSP <ffff81083616bdb8>
<0>Kernel panic - not syncing: Fatal exception
<0>divide error: 0000 [3] SMP
last sysfs file:
CPU 8
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.1.el5_lustre.1.8.7 #1
RIP: 0010:[<ffffffff8008bb03>]  [<ffffffff8008bb03>] find_busiest_group+0x23a/0x621
RSP: 0018:ffff811036173db8  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000c0
RBP: ffff811036173ea8 R08: 0000000000000012 R09: 000000000000002e
R10: ffff8104363251c8 R11: 0000000000000048 R12: ffff8104363251c0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff810436286940(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81083618e000, task ffff810436229820)
Stack:  0000000000000000 ffff811036173ee8 ffff811036173f10 0000000000000000
 ffff811036173f08 0000000800000000 ffff8104360b38e0 0000000000000000
 ffff810436325180 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8008dba0>] rebalance_tick+0x183/0x3cb
 [<ffffffff8009829f>] update_process_times+0x68/0x78
 [<ffffffff80077bc3>] smp_local_timer_interrupt+0x2f/0x66
 [<ffffffff800781ff>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff80057018>] mwait_idle+0x0/0x4a
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff8005704e>] mwait_idle+0x36/0x4a
 [<ffffffff80049206>] cpu_idle+0x95/0xb8
 [<ffffffff8007796b>] start_secondary+0x498/0x4a7
Hello all, marking this bug as open — we are able to hit the same problem on kernel 3.8.x.
Reproducibility: very low.
Test case: leave the system up; the issue occurs after a few days.

crash log
------------------------------------------------------------------------------------------------------
[ 0.156521] Brought up 3 CPUs
[ 0.158313] SMP: Total of 3 processors activated (4783.10 BogoMIPS).
[ 0.160101] NET: Registered protocol family 16
[ 0.160795] DMA: preallocated 256 KiB pool for atomic coherent allocations
[ 0.161529] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[ 0.162036] Division by zero in kernel.
[ 0.162037] Backtrace:
[ 0.162041] [<c0016bbc>] (dump_backtrace+0x0/0x114) from [<c03126cc>] (dump_stack+0x20/0x24)
[ 0.162043] r6:00000000 r5:ffffffff r4:00000000 r3:00000000
[ 0.162045] [<c03126ac>] (dump_stack+0x0/0x24) from [<c0016cf4>] (__div0+0x24/0x28)
[ 0.162047] [<c0016cd0>] (__div0+0x0/0x28) from [<c01d70a0>] (Ldiv0+0x8/0x10)
[ 0.162050] [<c00624e4>] (find_busiest_group+0x0/0xac4) from [<c0063078>] (load_balance+0xd0/0x7cc)
[ 0.162051] [<c0062fa8>] (load_balance+0x0/0x7cc) from [<c0063824>] (rebalance_domains+0xb0/0x19c)
[ 0.162053] [<c0063774>] (rebalance_domains+0x0/0x19c) from [<c0063960>] (run_rebalance_domains+0x50/0x148)
[ 0.162055] [<c0063910>] (run_rebalance_domains+0x0/0x148) from [<c002eb7c>] (__do_softirq+0x134/0x328)
[ 0.162057] [<c002ea48>] (__do_softirq+0x0/0x328) from [<c002f2a4>] (irq_exit+0xa8/0xb0)
[ 0.162060] [<c002f1fc>] (irq_exit+0x0/0xb0) from [<c001332c>] (handle_IRQ+0x68/0xe0)
[ 0.162061] r4:c0475154 r3:db8c0638
[ 0.162062] [<c00132c4>] (handle_IRQ+0x0/0xe0) from [<c00085f0>] (gic_handle_irq+0x50/0x98)
[ 0.162065] r8:c047e8b0 r7:db8c1f60 r6:c047b2dc r5:c0497898 r4:f8200100
[ 0.162065] r3:00000000
------------------------------------------------------------------------------------------------------

I have checked that the changes proposed by Venkat are already present in the mainline code, i.e.

Function: update_cpu_power()
	if (!power)
		power = 1;

Function: scale_rt_power()
	if (unlikely(total < avg)) {
		/* Ensures that power won't end up being negative */
		available = 0;
	} else {
		available = total - avg;
	}

Can anyone please explain what might have gone wrong to cause this crash, apart from the reasons already explained by Chetan? -- Yogesh.
This bug has been marked as "Resolved_closed_fix". Can anyone please share the actual fix patch, and mark this bug as open again, since I can't change its state? -- Yogesh.
In Debian kernels, this was addressed with a patch from Ben Hutchings and Bjoern Boschman: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636797#46
Thanks Daniel. Can you please point me to the Debian kernel release in which this patch was included? I checked the source from http://packages.debian.org/wheezy/linux-source-3.2, but in that source (linux-source-3.2) and in the patch file linux-patch-3.2-rt.patch, I didn't find this patch.
Yogesh, I don't understand your question. The Debian kernel is (kind of by definition) not the mainline kernel — it contains Debian-specific patches. The link I provided in comment 37 points to a patch stored in the Debian bug tracking system, which applies to 2.6.32 (which is in squeeze, not wheezy). I don't see any associated patch in Debian's 3.2 branch of the kernel, I'm afraid, and the patch I linked to definitely doesn't apply to 3.2 directly. (However, I also haven't run into the bug with the 3.2 kernel yet.) Sorry not to be more helpful.
Thanks Daniel. Actually, I am looking for the latest kernel release (Debian, Ubuntu, Red Hat, or vanilla) which has included this patch or some other fix for this issue. I checked the latest vanilla kernel and found that the power == 0 check is done in update_cpu_power():

+	if (!power)
+		power = 1;

But as per my understanding, power can also become zero in update_group_power(); the patch in comment #37 had a check for that function too. As you know, this issue is very difficult to reproduce, so I was looking for advice or a solution applicable to, and tested on, a recent release such as 3.x.x.
Hi all, try this patch: https://access.redhat.com/solutions/68466

From: Salman Qazi <sqazi@google.com>
Date: Tue, 15 Nov 2011 14:12:06 -0800
Subject: sched, x86: Avoid unnecessary overflow in sched_clock

(Added the missing signed-off-by line)

In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero. We can solve this without losing
any precision.

We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111115221121.7262.88871.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fa7b917..431793e 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -32,6 +32,22 @@ extern int no_timer_check;
  * (mathieu.desnoyers@polymtl.ca)
  *
  * -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ *
+ * In:
+ *
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * Although we may still have enough bits to store the value of ns,
+ * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
+ * leading to an incorrect result.
+ *
+ * To avoid this, we can decompose 'cycles' into quotient and remainder
+ * of division by SC.  Then,
+ *
+ * ns = (quot * SC + rem) * cyc2ns_scale / SC
+ *    = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
+ *
+ * - sqazi@google.com
  */

 DECLARE_PER_CPU(unsigned long, cyc2ns);
@@ -41,9 +57,14 @@ DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);

 static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
+	unsigned long long quot;
+	unsigned long long rem;
 	int cpu = smp_processor_id();
 	unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
-	ns += cyc * per_cpu(cyc2ns, cpu) >> CYC2NS_SCALE_FACTOR;
+	quot = (cyc >> CYC2NS_SCALE_FACTOR);
+	rem = cyc & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
+	ns += quot * per_cpu(cyc2ns, cpu) +
+		((rem * per_cpu(cyc2ns, cpu)) >> CYC2NS_SCALE_FACTOR);
 	return ns;
 }
--

However, after installing the patch and restarting the system, the OS crashed and restarted again after more than 240 days of uptime.