Bug 13631
Summary: BUG/panic - update_curr

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Process Management | Reporter | Brad Plant (bplant) |
| Component | Scheduler | Assignee | Ingo Molnar (mingo) |
| Status | CLOSED OBSOLETE | | |
| Severity | normal | CC | a.p.zijlstra, akpm, alan, jeremy, mark.fasheh, penberg |
| Priority | P1 | | |
| Hardware | All | | |
| OS | Linux | Kernel Version | 2.6.30 |
| Regression | No | Bisected commit-id | |

Attachments:
- First oops
- Second oops
- Third oops
- .config
- BUG kmalloc-16: Redzone overwritten
- Thread overran stack, or stack corrupted
Created attachment 22114 [details]
Second oops
Created attachment 22115 [details]
Third oops
On Sat, 2009-06-27 at 01:36 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631

    RIP: e030:[<ffffffff8022c1ab>] [<ffffffff8022c1ab>] update_curr+0x19/0xf0

Could you post your .config and possibly rebuild the kernel with debug information and provide the output of

    # addr2line -e vmlinux $RIP

Created attachment 22128 [details]
.config
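For reference, the `update_curr+0x19/0xf0` in the RIP line means the fault is at offset 0x19 into a function 0xf0 bytes long; running `addr2line -e vmlinux` on the absolute address maps it back to a source line when CONFIG_DEBUG_INFO is enabled. A minimal sketch of the arithmetic (the base address here is only illustrative, derived by subtracting the offset from the RIP in the oops; on a live system it would come from /proc/kallsyms or System.map):

```shell
# Resolve a symbol+offset from an oops back to the absolute RIP.
# 0xffffffff8022c192 is update_curr's assumed start address (RIP
# 0xffffffff8022c1ab minus offset 0x19) -- illustrative values only.
base=0xffffffff8022c192
offset=0x19
printf '0x%x\n' $(( base + offset ))
# The absolute address is then fed to addr2line:
#   addr2line -e vmlinux 0xffffffff8022c1ab
# which in this report resolved to kernel/sched_fair.c:480.
```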
(In reply to comment #3)
> RIP: e030:[<ffffffff8022c1ab>] [<ffffffff8022c1ab>] update_curr+0x19/0xf0
>
> Could you post your .config and possibly rebuild the kernel with debug

I already have CONFIG_DEBUG_INFO and CONFIG_FRAME_POINTER enabled. Are any other options required?

> information and provide the output of
>
> # addr2line -e vmlinux $RIP

kernel/sched_fair.c:480

(In reply to comment #0)
> -- Background info --
> I am running a 3 node ocfs2 xen (paravirt_ops) setup on quad core xeons. The
> guest VMs have 2 CPUs and 1.5GB of memory. To trigger the bug, I do the
> following:
> - Repeatedly untar/delete kernel sources on the ocfs2 fs
> - Benchmark a php application on 2 of the nodes which uses flock (on the
> ocfs2 fs) to control access to a cache. The php application is being
> benchmarked with a concurrency of 7.

I have been able to repeat the BUG without repeatedly untarring/deleting the kernel sources.

Can you reproduce running natively (i.e., not Xen)?

(In reply to comment #7)
> Can you reproduce running natively (i.e., not Xen)?

I will have to see what spare x86_64 hardware we have on Monday. I think we might have something that will work.

I tried moving the php application off the ocfs2 FS onto a reiser FS (so that the flock wasn't clustered) and ran the benchmarking using the same parameters as before, but no BUG occurred. I ran the benchmarking overnight for about 10 hours; it would usually be triggered within a few hours.

I have since moved the php application back onto the ocfs2 FS and am running the benchmarking application on only 1 node as opposed to 2, so there shouldn't be any lock contention between the different nodes. Will see how it goes.

I've been running more tests and discovered something I believe is important. My configuration has 3 nodes: www1, www2 and backup1. backup1 has the ocfs2 FS mounted read-only. When I was benchmarking the www{1,2} nodes, one of these was crashing (and xen was automatically rebooting it).
The other www node would always do the journal recovery of the crashed node, and upon starting or completing the recovery it would crash and print the stack traces I have previously uploaded.

Knowing this, I have been able to reproduce the BUG with a 100% success rate by simply destroying one of the www nodes. A couple of minutes later, the other www node will start the recovery and then panic. Sometimes the following would be printed in the stack trace:

    Thread overran stack, or stack corrupted

I am unsure as to why the first node that crashes does not produce a stack trace. In any case, the remaining nodes shouldn't panic when trying to do a recovery.

Joel Becker suggested trying slab instead of slub. This appears to have resolved the issue of the second node crashing a couple of minutes after the first.

For interest's sake, the issue of the first node crashing is being worked on in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631

Cc Pekka :)

(In reply to comment #10)
> For interest's sake, the issue of the first node crashing is being worked on
> in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631

Oops, that should be: http://bugzilla.kernel.org/show_bug.cgi?id=13632

On Fri, 2009-07-03 at 00:42 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631
>
> Andrew Morton <akpm@linux-foundation.org> changed:
>
>            What    |Removed |Added
> ----------------------------------------------------------------------------
>                 CC |        |penberg@cs.helsinki.fi
>
> --- Comment #11 from Andrew Morton <akpm@linux-foundation.org> 2009-07-03 00:42:40 ---
> Cc Pekka :)

Looking at the bug report, I'd be pretty surprised if this were a SLUB bug. It seems more likely that there's some memory corruption going on under heavy load and SLAB just happens to have a different layout of slab objects or something.

Did you run the test with CONFIG_SLAB_DEBUG, btw?
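As background on the debugging being discussed here: SLUB's allocation debugging does not require a rebuild; it is switched on from the kernel command line, which is how a "Redzone overwritten" report like the one attached to this bug is produced. A hedged sketch of the relevant boot options (flags per Documentation/vm/slub.txt; the cache name matches the kmalloc-16 cache implicated below):

```sh
# Kernel command-line fragments (not a runnable script):
#
#   slub_debug=FZP              enable sanity checks (F), red zones (Z)
#                               and object poisoning (P) for all caches
#   slub_debug=FZP,kmalloc-16   restrict the overhead to one cache
#
# With red zones active, writing past the end of a kmalloc-16 object is
# detected at kfree() time and reported as "Redzone overwritten".
# Whether redzoning is active for a cache can be checked at runtime:
#
#   cat /sys/kernel/slab/kmalloc-16/red_zone
```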
Pekka

Created attachment 22193 [details]
BUG kmalloc-16: Redzone overwritten

(In reply to comment #13)
> Looking at the bug report, I'd be pretty surprised if this would be a
> SLUB bug. It seems more likely that there's some memory corruption going
> on under heavy load and SLAB just happens to have a different layout of
> slab objects or something.
>
> Did you run the test with CONFIG_SLAB_DEBUG, btw?

I tried slub debugging first. I tried to make it crash for a while, but of course it wouldn't do it when I wanted it to. I had given up on trying to crash slub and was just rebooting the node to change the kernel when I hit the jackpot.

Does this suggest ocfs2 is corrupting the memory?

Hi Brad,

bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #14 from Brad Plant <bplant@iinet.net.au> 2009-07-03 15:23:06 ---
> Created an attachment (id=22193)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=22193)
> BUG kmalloc-16: Redzone overwritten
>
> I tried slub debugging first. I tried to make it crash for a while but of
> course it wouldn't do it when I wanted it to. I had given up on trying to
> crash slub and was just rebooting the node to change the kernel when I hit
> the jackpot.
>
> Does this suggest ocfs2 is corrupting the memory?

Yup, that would be the prime suspect here. Let's cc the ocfs2 developers and LKML. The corruption can be found here: http://bugzilla.kernel.org/attachment.cgi?id=22193

Pekka

Created attachment 22214 [details]
Thread overran stack, or stack corrupted
I was going over some of the other stack traces that I'd collected over the last week or so and found that 2 of them had the following message:
Thread overran stack, or stack corrupted
Both stack traces containing the above message appear to be nearly identical, apart from a few different register values.
Looks like things have gone a bit quiet - what happens now? Is there maybe someone else familiar with ocfs2 that can assist?

On Thu, 2009-07-23 at 08:13 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #17 from Brad Plant <bplant@iinet.net.au> 2009-07-23 08:13:37 ---
> Looks like things have gone a bit quiet - what happens now?
>
> Is there maybe someone else familiar with ocfs2 that can assist?

Do you have CONFIG_LATENCYTOP enabled? If so, could you try without? I think I just spotted a corruption bug in there.

No, LATENCYTOP is not enabled.

    xen2.dev src # grep LATENCYTOP */.config
    linux-2.6.27.25/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
    linux-2.6.27.25/.config:# CONFIG_LATENCYTOP is not set
    linux-2.6.28.10/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
    linux-2.6.28.10/.config:# CONFIG_LATENCYTOP is not set
    linux-2.6.30/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
    linux-2.6.30/.config:# CONFIG_LATENCYTOP is not set

Closing as obsolete. If this is incorrect, please re-open this bug and update the kernel version.
Created attachment 22113 [details]
First oops

I am not sure if this is a regression because I haven't heavily tested previous kernels. Please see the attachments for the oops. I am able to trigger these by putting the machines under high load, but it usually takes an hour or so to trigger.

-- Background info --
I am running a 3 node ocfs2 xen (paravirt_ops) setup on quad core xeons. The guest VMs have 2 CPUs and 1.5GB of memory. To trigger the bug, I do the following:
- Repeatedly untar/delete kernel sources on the ocfs2 fs
- Benchmark a php application on 2 of the nodes which uses flock (on the ocfs2 fs) to control access to a cache. The php application is being benchmarked with a concurrency of 7.

The above scenario puts the VMs under both high IO and CPU load. uptime will report a system load of around 7-8. The kernel was compiled using gcc 4.3.3 from ubuntu 9.04.

Please let me know what further info you require, tests you need performed or patches you need tested.
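The cache-serialization pattern described above can be sketched with the flock(1) utility: every worker takes an exclusive lock on a lock file before touching the shared cache, which is what clusters the lock across nodes when the file lives on ocfs2. The path and command are hypothetical; the real benchmark does this from PHP via flock().

```shell
#!/bin/sh
# Sketch of the benchmark's cache-access pattern (paths hypothetical).
# On ocfs2, flock() on a shared file is arbitrated cluster-wide, so
# concurrent workers on different nodes serialize on this lock.
lockfile="${TMPDIR:-/tmp}/cache.lock"

# -x: take an exclusive lock; the command runs only once the lock is held.
flock -x "$lockfile" -c 'echo "cache updated"'
```

Running several of these concurrently (the report used a concurrency of 7) is what generates the lock traffic that, on ocfs2, involves the cluster DLM on every acquisition.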