Created attachment 22113 [details]
First oops

I am not sure if this is a regression because I have not heavily tested previous kernels. Please see attachments for oops. I am able to trigger these by putting the machines under high load, but it usually takes an hour or so to trigger.

-- Background info --
I am running a 3 node ocfs2 xen (paravirt_ops) setup on quad core xeons. The guest VMs have 2 CPUs and 1.5GB of memory. To trigger the bug, I do the following:
- Repeatedly untar/delete kernel sources on the ocfs2 fs
- Benchmark a php application on 2 of the nodes which uses flock (on the ocfs2 fs) to control access to a cache. The php application is being benchmarked with a concurrency of 7.

The above scenario puts the VMs under both high IO and CPU load. uptime will report a system load of around 7-8.

The kernel was compiled using gcc 4.3.3 from ubuntu 9.04.

Please let me know what further info you require, tests you need performed or patches you need tested.
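For anyone who wants to approximate the flock part of this workload without the php application, here is a minimal sketch. It is not the application from the report: the file name `cache.lock`, the counter-in-a-file "cache", and the function names are all assumptions made for illustration. It forks a number of workers (7 by default, matching the benchmark concurrency above) that each take an exclusive flock, update the file and release the lock.

```python
import fcntl
import os


def worker(path, iterations):
    """Repeatedly take an exclusive flock, bump a counter in the file, release."""
    for _ in range(iterations):
        with open(path, "r+") as f:
            fcntl.flock(f, fcntl.LOCK_EX)      # blocks until the lock is granted
            try:
                value = int(f.read() or "0")
                f.seek(0)
                f.truncate()
                f.write(str(value + 1))
                f.flush()                      # make the update visible before unlocking
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)


def run(directory, concurrency=7, iterations=200):
    """Fork `concurrency` workers contending on one lock file; return the final count."""
    path = os.path.join(directory, "cache.lock")  # hypothetical file name
    with open(path, "w") as f:
        f.write("0")

    pids = []
    for _ in range(concurrency):
        pid = os.fork()
        if pid == 0:
            # Child process: do the work, then exit without returning to the caller.
            status = 1
            try:
                worker(path, iterations)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    for pid in pids:
        os.waitpid(pid, 0)

    with open(path) as f:
        return int(f.read())
```

Calling `run()` with a directory on the ocfs2 mount on two nodes at the same time should produce cross-node lock contention. If the locking works, the final count equals concurrency times iterations for a single-node run.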
Created attachment 22114 [details] Second oops
Created attachment 22115 [details] Third oops
On Sat, 2009-06-27 at 01:36 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631

RIP: e030:[<ffffffff8022c1ab>] [<ffffffff8022c1ab>] update_curr+0x19/0xf0

Could you post your .config and possibly rebuild the kernel with debug information and provide the output of

# addr2line -e vmlinux $RIP
Created attachment 22128 [details] .config
(In reply to comment #3)
> RIP: e030:[<ffffffff8022c1ab>] [<ffffffff8022c1ab>] update_curr+0x19/0xf0
>
> Could you post your .config and possibly rebuild the kernel with debug

I already have CONFIG_DEBUG_INFO and CONFIG_FRAME_POINTER enabled. Are any other options required?

> information and provide the output of
>
> # addr2line -e vmlinux $RIP

kernel/sched_fair.c:480
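When there are several oopses to go through, pulling the address out of the RIP line by hand gets tedious. Below is a small sketch that extracts the address, symbol, offset and function size from a RIP line in the format quoted above and builds the matching addr2line command. The function names and the regular expression are my own and have only been checked against the RIP line in this report; other kernel versions may print the line differently.

```python
import re

# Matches e.g.:
# RIP: e030:[<ffffffff8022c1ab>] [<ffffffff8022c1ab>] update_curr+0x19/0xf0
RIP_RE = re.compile(
    r"RIP:\s+\w+:\[<([0-9a-f]+)>\]\s+"   # segment and instruction address
    r"(?:\[<[0-9a-f]+>\]\s+)?"           # optional repeated address
    r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"  # symbol+offset/size
)


def parse_rip(line):
    """Return address, symbol, offset and size from an oops RIP line, or None."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    addr, symbol, offset, size = m.groups()
    return {
        "address": "0x" + addr,
        "symbol": symbol,
        "offset": int(offset, 16),
        "size": int(size, 16),
    }


def addr2line_command(line, vmlinux="vmlinux"):
    """Build the argument list for `addr2line -e vmlinux <address>`."""
    info = parse_rip(line)
    if info is None:
        return None
    return ["addr2line", "-e", vmlinux, info["address"]]
```

The returned list can be passed to `subprocess.run()` from the directory holding a vmlinux built with CONFIG_DEBUG_INFO.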
(In reply to comment #0)
> -- Background info --
> I am running 3 node ocfs2 xen (paravirt_ops) setup on quad core xeons. The
> guest VMs have 2 CPU's and 1.5GB of memory. To trigger the bug, I do the
> following:
> - Repeatedly untar/delete kernel sources on the ocfs2 fs
> - Benchmark a php application on 2 of the nodes which uses flock (on the
> ocfs2
> fs) to control access to a cache. The php application is being benchmarked
> with
> a concurrency of 7.

I have been able to reproduce the BUG without repeatedly untarring/deleting the kernel sources.
Can you reproduce running natively (ie, not Xen)?
(In reply to comment #7)
> Can you reproduce running natively (ie, not Xen)?

I will have to see what spare x86_64 hardware we have on Monday. I think we might have something that will work.

I tried moving the php application off the ocfs2 FS onto a reiser FS (so that the flock wasn't clustered) and ran the benchmarking using the same parameters as before, but no BUG occurred. I ran the benchmarking overnight for about 10 hours. It would usually be triggered within a few hours.

I have since moved the php application back onto the ocfs2 FS and am running the benchmarking application on only 1 node as opposed to 2, so there shouldn't be any lock contention between the different nodes. Will see how it goes.
I've been running more tests and discovered something I believe is important.

My configuration has 3 nodes: www1, www2 and backup1. backup1 has the ocfs2 FS mounted read-only. When I was benchmarking the www{1,2} nodes, one of these was crashing (and xen was automatically rebooting it). The other www node would always do the journal recovery of the crashed node and upon/during starting/completing the recovery it would crash and print the stack traces I have previously uploaded.

Knowing this, I have been able to reproduce the BUG with a 100% success rate by simply destroying one of the www nodes. A couple of minutes later, the other www node will start the recovery and then panic.

Sometimes, the following would be printed in the stack trace:

Thread overran stack, or stack corrupted

I am unsure as to why the first node that crashes does not produce a stack trace. In any case, the remaining nodes shouldn't panic when trying to do a recovery.
Joel Becker suggested trying slab instead of slub. This appears to have resolved the issue of the second node crashing a couple of minutes after the first.

For interest's sake, the issue of the first node crashing is being worked on in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631
Cc Pekka :)
(In reply to comment #10)
> For interests sake, the issue of the first node crashing is being worked on
> in
> this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631

Oops, that should be: http://bugzilla.kernel.org/show_bug.cgi?id=13632
On Fri, 2009-07-03 at 00:42 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631
>
> Andrew Morton <akpm@linux-foundation.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |penberg@cs.helsinki.fi
>
> --- Comment #11 from Andrew Morton <akpm@linux-foundation.org> 2009-07-03
> 00:42:40 ---
> Cc Pekka :)

Looking at the bug report, I'd be pretty surprised if this would be a SLUB bug. It seems more likely that there's some memory corruption going on under heavy load and SLAB just happens to have a different layout of slab objects or something.

Did you run the test with CONFIG_SLAB_DEBUG, btw?

Pekka
Created attachment 22193 [details]
BUG kmalloc-16: Redzone overwritten

(In reply to comment #13)
> Looking at the bug report, I'd be pretty surprised if this would be a
> SLUB bug. It seems more likely that there's some memory corruption going
> on under heavy load and SLAB just happens to have a different layout of
> slab objects or something.
>
> Did you run the test with CONFIG_SLAB_DEBUG, btw?

I tried slub debugging first. I tried to make it crash for a while but of course it wouldn't do it when I wanted it to. I had given up on trying to crash slub and was just rebooting the node to change the kernel when I hit the jackpot.

Does this suggest ocfs2 is corrupting the memory?
Hi Brad,

bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #14 from Brad Plant <bplant@iinet.net.au> 2009-07-03 15:23:06
> ---
> Created an attachment (id=22193)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=22193)
> BUG kmalloc-16: Redzone overwritten
>
> (In reply to comment #13)
>> Looking at the bug report, I'd be pretty surprised if this would be a
>> SLUB bug. It seems more likely that there's some memory corruption going
>> on under heavy load and SLAB just happens to have a different layout of
>> slab objects or something.
>>
>> Did you run the test with CONFIG_SLAB_DEBUG, btw?
>
> I tried slub debugging first. I tried to make it crash for a while but of
> course it wouldn't do it when I wanted it to. I had given up on trying to
> crash
> slub and was just rebooting the node to change the kernel when I hit the
> jackpot.
>
> Does this suggest ocfs2 is corrupting the memory?

Yup, that would be the prime suspect here. Let's cc ocfs2 developers and LKML. The corruption can be found here:

http://bugzilla.kernel.org/attachment.cgi?id=22193

Pekka
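For readers unfamiliar with what "Redzone overwritten" means, here is a toy model of the idea. It is a sketch of the general redzoning technique only, not SLUB's implementation: the marker byte, the redzone length and all function names are arbitrary choices for this illustration. The allocator places a known byte pattern directly after each object, and a later check that finds the pattern changed shows that some caller wrote past the end of its object.

```python
REDZONE_BYTE = 0xCC   # arbitrary marker value chosen for this sketch
REDZONE_LEN = 8       # arbitrary redzone length chosen for this sketch


def allocate(size):
    """Return a buffer of `size` usable bytes followed by a redzone."""
    return bytearray(size) + bytearray([REDZONE_BYTE]) * REDZONE_LEN


def redzone_intact(buf, size):
    """Check that every redzone byte after the object still holds the marker."""
    return all(b == REDZONE_BYTE for b in buf[size:size + REDZONE_LEN])


def unchecked_write(buf, offset, data):
    """Stand-in for a buggy caller that ignores the object size."""
    buf[offset:offset + len(data)] = data


obj = allocate(16)                     # a 16-byte object, as in kmalloc-16
unchecked_write(obj, 0, b"A" * 16)     # fits: redzone untouched
ok_after_fit = redzone_intact(obj, 16)
unchecked_write(obj, 0, b"A" * 18)     # 2 bytes too long: redzone clobbered
ok_after_overflow = redzone_intact(obj, 16)
```

In this model the check only says that the bytes after the object changed, not who changed them, which is why the report makes ocfs2 a suspect without proving it.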
Created attachment 22214 [details]
Thread overran stack, or stack corrupted

I was going over some of the other stack traces that I'd collected over the last week or so and found 2 of them had the following message:

Thread overran stack, or stack corrupted

Both stack traces which contain the above message appear to be very similar, as in identical bar a few different register values.
Looks like things have gone a bit quiet - what happens now? Is there maybe someone else familiar with ocfs2 that can assist?
On Thu, 2009-07-23 at 08:13 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #17 from Brad Plant <bplant@iinet.net.au> 2009-07-23 08:13:37
> ---
> Looks like things have gone a bit quiet - what happens now?
>
> Is there maybe someone else familiar with ocfs2 that can assist?

Do you have CONFIG_LATENCYTOP enabled? If so, could you try without? I think I just spotted a corruption bug in there.
No, LATENCYTOP is not enabled.

xen2.dev src # grep LATENCYTOP */.config
linux-2.6.27.25/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.27.25/.config:# CONFIG_LATENCYTOP is not set
linux-2.6.28.10/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.28.10/.config:# CONFIG_LATENCYTOP is not set
linux-2.6.30/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.30/.config:# CONFIG_LATENCYTOP is not set
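The grep above matches any line containing LATENCYTOP, so it also reports CONFIG_HAVE_LATENCYTOP_SUPPORT. As a sketch of how to read the state of exactly one option from a .config, the helper below (the function name and return convention are my own) treats the "# ... is not set" comment as disabled and otherwise returns the assigned value.

```python
def config_state(config_text, option):
    """Return the value of `option` in .config text.

    "n" means the option appears as "# OPTION is not set";
    None means the option is not mentioned at all.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            # Exact option name followed by "=", so longer names do not match.
            return line.split("=", 1)[1]
    return None


sample = """\
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
# CONFIG_LATENCYTOP is not set
"""

latencytop = config_state(sample, "CONFIG_LATENCYTOP")
have_support = config_state(sample, "CONFIG_HAVE_LATENCYTOP_SUPPORT")
```

With the two lines quoted from the .config files above, the option itself is off while the architecture support flag is on, which agrees with the answer given.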
Closing as obsolete. If this is incorrect, please re-open this bug and update the kernel version.