Bug 13632 - BUG at fs/ocfs2/dlm/dlmmaster.c:3266
Summary: BUG at fs/ocfs2/dlm/dlmmaster.c:3266
Status: CLOSED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ocfs2
Hardware: All Linux
Importance: P1 high
Assignee: fs_ocfs2
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-27 01:05 UTC by Brad Plant
Modified: 2012-06-13 13:18 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.30
Subsystem:
Regression: No
Bisected commit-id:


Attachments
oops (4.25 KB, text/plain) - 2009-06-27 01:05 UTC, Brad Plant
objdump -DSl fs/ocfs2/dlm/ocfs2_dlm.o (41 bytes, text/plain) - 2009-06-29 21:53 UTC, Brad Plant
Debug patch. (812 bytes, text/plain) - 2009-06-29 23:47 UTC, Sunil Mushran
flock stress tester mentioned in comment #5 (1.00 KB, text/plain) - 2009-07-03 00:36 UTC, Brad Plant
stack trace mentioned in comment #5 (3.71 KB, text/plain) - 2009-07-03 00:37 UTC, Brad Plant
dlm_mig_lockres_worker+0x574/0xe6f (4.34 KB, text/plain) - 2009-07-03 15:27 UTC, Brad Plant
dlm_clean_master_list+0x3b9/0x400 (3.92 KB, text/plain) - 2009-07-03 15:31 UTC, Brad Plant
BUG kmalloc-16: Redzone overwritten (1.84 KB, text/plain) - 2009-07-03 15:33 UTC, Brad Plant
kernel BUG at fs/ocfs2/dlmglue.c:2991 (4.20 KB, text/plain) - 2009-07-05 06:05 UTC, Brad Plant
flock-10s.py - take an flock for 10 seconds (387 bytes, text/plain) - 2009-07-05 06:06 UTC, Brad Plant

Description Brad Plant 2009-06-27 01:05:36 UTC
Created attachment 22116 [details]
oops

I am running a 3-node ocfs2 xen (paravirt_ops) setup on quad-core Xeons. The
guest VMs have 2 CPUs and 1.5 GB of memory. When the BUG occurred, I was load testing the system by doing the following:
 - Repeatedly untar/delete kernel sources on the ocfs2 fs
 - Benchmark a PHP application on 2 of the nodes; the application uses flock (on the
ocfs2 fs) to control access to a cache and is being benchmarked with a
concurrency of 7.

The above scenario puts the VMs under both high IO and CPU load. uptime will
report a system load of around 7-8.

The kernel was compiled using gcc 4.3.3 from Ubuntu 9.04.

Please let me know what further info you require.
Comment 1 Sunil Mushran 2009-06-29 18:37:09 UTC
Please attach the object dump of ocfs2_dlm.ko.

$ objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko >/tmp/ocfs2_dlm.out

Use the module that generated this oops.
Comment 2 Brad Plant 2009-06-29 21:53:54 UTC
Created attachment 22144 [details]
objdump -DSl fs/ocfs2/dlm/ocfs2_dlm.o

I only use a static kernel, so I did the following instead:

objdump -DSl fs/ocfs2/dlm/ocfs2_dlm.o >ocfs2_dlm.out

Does that give you what you require?
Comment 3 Sunil Mushran 2009-06-29 23:47:26 UTC
Created attachment 22145 [details]
Debug patch.

Could you please run with the above patch. What seems to be happening is that the o2hb heartbeat sends a node-down event to the dlm. As part of that, the dlm
scans the mle hash to clear all stale mles (stale because of the node death). The oops occurs because an mle has an incorrect type.

This patch prints the mles found. Hopefully it will tell us more about what is going on.
Comment 4 Brad Plant 2009-06-30 12:34:32 UTC
I have applied the patch and recompiled. Now I just have to reproduce the stack trace (not something I've managed to reliably do yet). I will report back when I have something.
Comment 5 Brad Plant 2009-07-03 00:34:24 UTC
-- Progress update --

I've been working with Joel Becker and Sunil Mushran at Oracle on the issue of crashing ocfs2 nodes. The issue appears to be related to the flock system call, since only heavy use of that feature seems to cause the crash.

What we know:
 - constant untar/delete of kernel sources on 2 nodes at the same time produces no crash
 - using flock on the www nodes on the ocfs2 FS will crash a node in a random amount of time (5 min - 2 hours usually)
 - doing flock on the same nodes, but on a local FS instead does not produce a crash
 - the issue exists across multiple kernel versions (tested 2.6.27.25, 2.6.28.10, 2.6.30 and 2.6.27.25 openSUSE sources with _NO_ paravirt_ops, i.e. xen sub-arch)
 - the issue exists with or without the use of paravirt_ops.

Unfortunately, when the ocfs2 node crashes, no stack trace is produced. The xen hypervisor only says that the VM rebooted itself.

I have written a python script to try to replicate the operations performed by the php web application. Since running it, I have managed to get the same stack trace twice. Hopefully this sheds some light on things.
Comment 6 Brad Plant 2009-07-03 00:36:16 UTC
Created attachment 22182 [details]
flock stress tester mentioned in comment #5

It takes 2 arguments: the number of processes and the number of flocks each process should do. E.g.

./flock.py 5 10000

will fork 5 processes and each will do 10000 flocks. You must create a file called "flock.test" in the current directory first. The children will print a '.' to stdout when they attempt an exclusive lock.
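For reference, here is a minimal sketch of what such a stress tester might look like. It is an assumption based only on the description above (fork count, per-process flock count, the "flock.test" file and the '.' output); the actual flock.py attachment may differ.

#!/usr/bin/env python
# Hypothetical sketch of the flock stress tester described above; the real
# flock.py attachment may differ.  Usage: ./flock.py <num_procs> <num_flocks>
import fcntl
import os
import sys

def worker(num_flocks):
    # Each child repeatedly takes and releases an exclusive flock on flock.test,
    # printing a '.' each time it attempts the exclusive lock.
    with open("flock.test", "r+") as f:
        for _ in range(num_flocks):
            sys.stdout.write(".")
            sys.stdout.flush()
            fcntl.flock(f, fcntl.LOCK_EX)
            fcntl.flock(f, fcntl.LOCK_UN)

if __name__ == "__main__":
    num_procs, num_flocks = int(sys.argv[1]), int(sys.argv[2])
    children = []
    for _ in range(num_procs):
        pid = os.fork()
        if pid == 0:
            worker(num_flocks)
            os._exit(0)
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)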
Comment 7 Brad Plant 2009-07-03 00:37:42 UTC
Created attachment 22183 [details]
stack trace mentioned in comment #5

This stack trace has been produced twice. Both times by running the "flock stress tester" python script attached to this bug.
Comment 8 Brad Plant 2009-07-03 15:27:18 UTC
Created attachment 22194 [details]
dlm_mig_lockres_worker+0x574/0xe6f

I got this when I killed the node that owned the flock in the lockres (determined using dlm_locks) while other nodes were still trying to flock the same file.
Comment 9 Brad Plant 2009-07-03 15:31:09 UTC
Created attachment 22195 [details]
dlm_clean_master_list+0x3b9/0x400

I killed a node and one of the remaining nodes produced this stack trace. This is a similar location to attachment #1 [details] (3260 vs. 3266). The difference is that attachment #1 [details] was at a BUG() whereas this was at:

hlist_for_each(list, bucket) {
Comment 10 Brad Plant 2009-07-03 15:33:26 UTC
Created attachment 22196 [details]
BUG kmalloc-16: Redzone overwritten

After Pekka's comment in http://bugzilla.kernel.org/show_bug.cgi?id=13631, I turned on slub/slab debugging. It would appear there is some memory corruption going on.
Comment 11 Brad Plant 2009-07-05 06:05:21 UTC
Created attachment 22216 [details]
kernel BUG at fs/ocfs2/dlmglue.c:2991

I've been trying to trigger the bug using the flock.py script I wrote, but I've let it run for hours without a hitch. What I did notice, however, is that one time when I did a "killall flock.py", the system locked up.

Further testing revealed that the following procedure would, on almost all occasions, trigger the attached BUG.

1. Run a script on node A that takes an EX lock for 10s before releasing. According to "debugfs.ocfs2 -R dlm_locks", the "F" lock is owned by node A.
2. Start the same script on node B multiple times (I was starting 3) so that they all queue waiting to get the lock.
3. When the 10s on node A is up, one of the scripts on node B will get the lock, but "debugfs.ocfs2 -R dlm_locks" still shows that node A owns the lock.
4. While the first script on node B still has the lock (i.e. there are still 2 scripts on node B waiting to get the lock), run "killall flock-10s.py".
5. Node B will crash with the attached stack trace.

Comments:
- I've only been able to trigger the BUG when the lock is owned by the node other than the one I run killall on.
- This is a different stack trace to the previous ones, but I'm hoping they are related and that together, they provide some clues as to what is going wrong.

It's only a few lines, but I'll attach the flock-10s.py script for your convenience after this. Hopefully you are also able to reproduce.
Comment 12 Brad Plant 2009-07-05 06:06:35 UTC
Created attachment 22217 [details]
flock-10s.py - take an flock for 10 seconds

Python script to take an flock for 10 seconds before releasing it
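For reference, a minimal sketch of what flock-10s.py might look like; this is an assumption based on the description above, the lock file name "flock.test" is hypothetical, and the attached script may differ.

#!/usr/bin/env python
# Hypothetical sketch of flock-10s.py as described above; the attached
# script may differ.  The file name "flock.test" is assumed.
import fcntl
import time

f = open("flock.test", "r+")
fcntl.flock(f, fcntl.LOCK_EX)   # take an exclusive flock
time.sleep(10)                  # hold it for 10 seconds
fcntl.flock(f, fcntl.LOCK_UN)   # release the lock
f.close()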
Comment 13 Sunil Mushran 2009-07-07 00:34:09 UTC
Thanks for the investigation. I will look into this when I get some free time.
Comment 14 Joel Becker 2009-07-23 17:08:53 UTC
We're a bit busy, but I'm hoping to get some testing going in the near future.  It really seems like something we'll have an easier time debugging when we can reproduce it locally.
Comment 15 Brad Plant 2009-07-23 21:33:08 UTC
I reckon you're probably right. Let me know if you want me to test something.
