Bug 13631 - BUG/panic - update_curr
Summary: BUG/panic - update_curr
Status: CLOSED OBSOLETE
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All Linux
Importance: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-27 00:59 UTC by Brad Plant
Modified: 2012-06-12 10:01 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.30
Subsystem:
Regression: No
Bisected commit-id:


Attachments
First oops (3.27 KB, text/plain)
2009-06-27 00:59 UTC, Brad Plant
Second oops (4.39 KB, text/plain)
2009-06-27 01:00 UTC, Brad Plant
Third oops (3.32 KB, text/plain)
2009-06-27 01:00 UTC, Brad Plant
.config (34.65 KB, application/octet-stream)
2009-06-27 13:44 UTC, Brad Plant
BUG kmalloc-16: Redzone overwritten (1.84 KB, text/plain)
2009-07-03 15:23 UTC, Brad Plant
Thread overran stack, or stack corrupted (4.60 KB, text/plain)
2009-07-05 02:44 UTC, Brad Plant

Description Brad Plant 2009-06-27 00:59:50 UTC
Created attachment 22113 [details]
First oops

I am not sure if this is a regression because I heavily tested previous kernels. Please see the attachments for the oopses. I am able to trigger them by putting the machines under high load, but it usually takes an hour or so.

-- Background info --
I am running a 3-node ocfs2 Xen (paravirt_ops) setup on quad-core Xeons. The guest VMs have 2 CPUs and 1.5GB of memory. To trigger the bug, I do the following:
 - Repeatedly untar/delete kernel sources on the ocfs2 fs
 - Benchmark a php application on 2 of the nodes which uses flock (on the ocfs2 fs) to control access to a cache. The php application is being benchmarked with a concurrency of 7.

The above scenario puts the VMs under both high IO and CPU load. uptime will report a system load of around 7-8.
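
The flock part of the workload described above can be approximated with a small script. This is only a sketch under stated assumptions: the original load came from a PHP application, whereas this uses Python threads; the function names `locked_increment` and `run_flock_load` and the cache file path are hypothetical. It exercises cluster locking only if the path lives on the ocfs2 mount.

```python
import fcntl
import threading


def locked_increment(path, rounds):
    """Increment the integer stored in `path` `rounds` times under an exclusive flock."""
    for _ in range(rounds):
        # Each open() creates its own open file description, so flock()
        # contends between workers even inside a single process.
        with open(path, "r+") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                value = int(f.read() or "0")
                f.seek(0)
                f.truncate()
                f.write(str(value + 1))
                f.flush()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)


def run_flock_load(path, concurrency=7, rounds=50):
    """Run `concurrency` workers against one lock file and return the final counter."""
    with open(path, "w") as f:
        f.write("0")
    workers = [
        threading.Thread(target=locked_increment, args=(path, rounds))
        for _ in range(concurrency)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    with open(path) as f:
        return int(f.read())
```

If the lock serialises access correctly, the returned counter equals concurrency multiplied by rounds.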

The kernel was compiled using gcc 4.3.3 from ubuntu 9.04.

Please let me know what further info you require, tests you need performed or patches you need tested.
Comment 1 Brad Plant 2009-06-27 01:00:23 UTC
Created attachment 22114 [details]
Second oops
Comment 2 Brad Plant 2009-06-27 01:00:48 UTC
Created attachment 22115 [details]
Third oops
Comment 3 Peter Zijlstra 2009-06-27 08:32:40 UTC
On Sat, 2009-06-27 at 01:36 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631

RIP: e030:[<ffffffff8022c1ab>]  [<ffffffff8022c1ab>] update_curr+0x19/0xf0

Could you post your .config and possibly rebuild the kernel with debug
information and provide the output of

# addr2line -e vmlinux $RIP
Comment 4 Brad Plant 2009-06-27 13:44:06 UTC
Created attachment 22128 [details]
.config
Comment 5 Brad Plant 2009-06-27 13:48:59 UTC
(In reply to comment #3)
> RIP: e030:[<ffffffff8022c1ab>]  [<ffffffff8022c1ab>] update_curr+0x19/0xf0
> 
> Could you post your .config and possibly rebuild the kernel with debug

I already have CONFIG_DEBUG_INFO and CONFIG_FRAME_POINTER enabled. Are any other options required?

> information and provide the output of
> 
> # addr2line -e vmlinux $RIP

kernel/sched_fair.c:480
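
For reference, the value to pass as $RIP is the address inside the [<...>] brackets of the oops line. A minimal sketch of pulling it out and building the addr2line invocation follows; the helper names `parse_rip` and `addr2line_command` are hypothetical, and the regular expression assumes the 2.6.30-era x86_64 oops format quoted in comment #3.

```python
import re

# Matches e.g.:
# RIP: e030:[<ffffffff8022c1ab>]  [<ffffffff8022c1ab>] update_curr+0x19/0xf0
RIP_RE = re.compile(
    r"RIP:\s+\S+?:\[<(?P<addr>[0-9a-f]+)>\]"
    r".*?\s(?P<sym>[A-Za-z_][\w.]*)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_rip(line):
    """Return address, symbol, offset and function size from an oops RIP line."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    return {
        "address": "0x" + m.group("addr"),
        "symbol": m.group("sym"),
        "offset": int(m.group("off"), 16),
        "size": int(m.group("size"), 16),
    }


def addr2line_command(line, vmlinux="vmlinux"):
    """Build the argument list for `addr2line -e vmlinux <address>`."""
    info = parse_rip(line)
    if info is None:
        raise ValueError("no RIP found in line")
    return ["addr2line", "-e", vmlinux, info["address"]]
```

The resulting list can be handed to subprocess.run() against a vmlinux built with CONFIG_DEBUG_INFO.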
Comment 6 Brad Plant 2009-06-28 00:50:25 UTC
(In reply to comment #0)
> -- Background info --
> I am running 3 node ocfs2 xen (paravirt_ops) setup on quad core xeons. The
> guest VMs have 2 CPU's and 1.5GB of memory. To trigger the bug, I do the
> following:
>  - Repeatedly untar/delete kernel sources on the ocfs2 fs
>  - Benchmark a php application on 2 of the nodes which uses flock (on the
>    ocfs2 fs) to control access to a cache. The php application is being
>    benchmarked with a concurrency of 7.

I have been able to reproduce the BUG without repeatedly untarring/deleting the kernel sources.
Comment 7 Jeremy Fitzhardinge 2009-06-28 02:03:38 UTC
Can you reproduce running natively (ie, not Xen)?
Comment 8 Brad Plant 2009-06-28 02:24:45 UTC
(In reply to comment #7)
> Can you reproduce running natively (ie, not Xen)?

I will have to see what spare x86_64 hardware we have on Monday. I think we might have something that will work.

I tried moving the php application off the ocfs2 FS onto a reiser FS (so that the flock wasn't clustered) and ran the benchmarking using the same parameters as before but no BUG occurred. I ran the benchmarking overnight for about 10 hours. It would usually be triggered within a few hours.

I have since moved the php application back onto the ocfs2 FS and am running the benchmarking application on only 1 node as opposed to 2, so there shouldn't be any lock contention between the different nodes. Will see how it goes.
Comment 9 Brad Plant 2009-06-28 12:36:28 UTC
I've been running more tests and discovered something I believe is important. My configuration has 3 nodes: www1, www2 and backup1. backup1 has the ocfs2 FS mounted read-only. When I was benchmarking the www{1,2} nodes, one of them was crashing (and xen was automatically rebooting it). The other www node would always perform the journal recovery for the crashed node, and at some point during that recovery it would crash and print the stack traces I have previously uploaded.

Knowing this, I have been able to reproduce the BUG with a 100% success rate by simply destroying one of the www nodes. A couple of minutes later, the other www node starts the recovery and then panics. Sometimes the following is printed in the stack trace:

Thread overran stack, or stack corrupted

I am unsure as to why the first node that crashes does not produce a stack trace. In any case, the remaining nodes shouldn't panic when trying to do a recovery.
Comment 10 Brad Plant 2009-07-03 00:24:15 UTC
Joel Becker suggested trying slab instead of slub. This appears to have resolved the issue of the second node crashing a couple of minutes after the first.

For interest's sake, the issue of the first node crashing is being worked on in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631
Comment 11 Andrew Morton 2009-07-03 00:42:40 UTC
Cc Pekka :)
Comment 12 Brad Plant 2009-07-03 00:44:33 UTC
(In reply to comment #10)
> For interests sake, the issue of the first node crashing is being worked on
> in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13631

Oops, that should be: http://bugzilla.kernel.org/show_bug.cgi?id=13632
Comment 13 Pekka Enberg 2009-07-03 10:41:44 UTC
On Fri, 2009-07-03 at 00:42 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13631
> 
> 
> Andrew Morton <akpm@linux-foundation.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |penberg@cs.helsinki.fi
> 
> 
> 
> 
> --- Comment #11 from Andrew Morton <akpm@linux-foundation.org>  2009-07-03
> 00:42:40 ---
> Cc Pekka :)

Looking at the bug report, I'd be pretty surprised if this would be a
SLUB bug. It seems more likely that there's some memory corruption going
on under heavy load and SLAB just happens to have a different layout of
slab objects or something.

Did you run the test with CONFIG_SLAB_DEBUG, btw?

			Pekka
Comment 14 Brad Plant 2009-07-03 15:23:06 UTC
Created attachment 22193 [details]
BUG kmalloc-16: Redzone overwritten

(In reply to comment #13)
> Looking at the bug report, I'd be pretty surprised if this would be a
> SLUB bug. It seems more likely that there's some memory corruption going
> on under heavy load and SLAB just happens to have a different layout of
> slab objects or something.
> 
> Did you run the test with CONFIG_SLAB_DEBUG, btw?

I tried slub debugging first. I tried to make it crash for a while but of course it wouldn't do it when I wanted it to. I had given up on trying to crash slub and was just rebooting the node to change the kernel when I hit the jackpot.

Does this suggest ocfs2 is corrupting the memory?
Comment 15 Pekka Enberg 2009-07-04 13:03:45 UTC
Hi Brad,

bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #14 from Brad Plant <bplant@iinet.net.au>  2009-07-03 15:23:06
> ---
> Created an attachment (id=22193)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=22193)
> BUG kmalloc-16: Redzone overwritten
> 
> (In reply to comment #13)
>> Looking at the bug report, I'd be pretty surprised if this would be a
>> SLUB bug. It seems more likely that there's some memory corruption going
>> on under heavy load and SLAB just happens to have a different layout of
>> slab objects or something.
>>
>> Did you run the test with CONFIG_SLAB_DEBUG, btw?
> 
> I tried slub debugging first. I tried to make it crash for a while but of
> course it wouldn't do it when I wanted it to. I had given up on trying to
> crash slub and was just rebooting the node to change the kernel when I hit
> the jackpot.
> 
> Does this suggest ocfs2 is corrupting the memory?

Yup, that would be the prime suspect here. Let's cc ocfs2 developers and
LKML. The corruption can be found here:

   http://bugzilla.kernel.org/attachment.cgi?id=22193

			Pekka
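
As background on what a "Redzone overwritten" report means, the idea can be shown with a toy model: the allocator's debug mode places guard bytes after each object and later checks that they are unchanged. The sketch below is not kernel code; the names `place_object` and `redzone_intact`, the guard value and the redzone length are all illustrative assumptions.

```python
GUARD_BYTE = 0xCC   # illustrative guard value, assumed for this sketch
REDZONE_LEN = 8     # illustrative redzone length in bytes


def place_object(heap, offset, size):
    """Reserve `size` bytes at `offset` and fill the redzone that follows them."""
    end = offset + size
    heap[end:end + REDZONE_LEN] = bytes([GUARD_BYTE]) * REDZONE_LEN


def redzone_intact(heap, offset, size):
    """Return True if every guard byte after the object still has its fill value."""
    end = offset + size
    return all(b == GUARD_BYTE for b in heap[end:end + REDZONE_LEN])


heap = bytearray(64)
place_object(heap, 0, 16)

heap[0:16] = b"A" * 16            # write that stays inside the 16-byte object
in_bounds_ok = redzone_intact(heap, 0, 16)

heap[0:17] = b"B" * 17            # write that runs one byte past the object
overrun_ok = redzone_intact(heap, 0, 16)
```

In this model the in-bounds write leaves the guard bytes untouched, while the one-byte overrun is detected at the next check, which is the kind of evidence the kmalloc-16 attachment provides.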
Comment 16 Brad Plant 2009-07-05 02:44:02 UTC
Created attachment 22214 [details]
Thread overran stack, or stack corrupted

I was going over some of the other stack traces that I'd collected over the last week or so and found 2 of them had the following message:

Thread overran stack, or stack corrupted

Both stack traces which contain the above message appear to be very similar, as in identical bar a few different register values.
Comment 17 Brad Plant 2009-07-23 08:13:37 UTC
Looks like things have gone a bit quiet - what happens now?

Is there maybe someone else familiar with ocfs2 that can assist?
Comment 18 Peter Zijlstra 2009-07-23 17:58:43 UTC
On Thu, 2009-07-23 at 08:13 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:

> --- Comment #17 from Brad Plant <bplant@iinet.net.au>  2009-07-23 08:13:37
> ---
> Looks like things have gone a bit quiet - what happens now?
> 
> Is there maybe someone else familiar with ocfs2 that can assist?

Do you have CONFIG_LATENCYTOP enabled? If so, could you try without? I
think I just spotted a corruption bug in there.
Comment 19 Brad Plant 2009-07-23 21:28:20 UTC
No, LATENCYTOP is not enabled.

xen2.dev src # grep LATENCYTOP */.config
linux-2.6.27.25/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.27.25/.config:# CONFIG_LATENCYTOP is not set
linux-2.6.28.10/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.28.10/.config:# CONFIG_LATENCYTOP is not set
linux-2.6.30/.config:CONFIG_HAVE_LATENCYTOP_SUPPORT=y
linux-2.6.30/.config:# CONFIG_LATENCYTOP is not set
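
Note that a plain grep for LATENCYTOP also matches CONFIG_HAVE_LATENCYTOP_SUPPORT, which only says the architecture supports the feature. A small sketch that reads a .config and reports the state of one exact option is shown below; the function name `parse_kconfig` is hypothetical, and it assumes the usual `CONFIG_X=value` and `# CONFIG_X is not set` line formats seen in the output above.

```python
import re


def parse_kconfig(text):
    """Map each CONFIG_ option in .config text to its value, or None if 'is not set'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = None
            continue
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


sample = """\
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
# CONFIG_LATENCYTOP is not set
"""
config = parse_kconfig(sample)
latencytop_enabled = config.get("CONFIG_LATENCYTOP") == "y"
```

Looking up the exact key avoids confusing the capability flag with the option Peter asked about.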
Comment 20 Alan 2012-06-12 10:01:46 UTC
Closing as obsolete. If this is incorrect, please re-open this bug and update the kernel version.
