Created attachment 22116 [details] oops

I am running a 3-node ocfs2 xen (paravirt_ops) setup on quad-core Xeons. The guest VMs have 2 CPUs and 1.5GB of memory.

When the BUG occurred, I was load testing the system by doing the following:
- Repeatedly untarring/deleting kernel sources on the ocfs2 fs
- Benchmarking a php application on 2 of the nodes, which uses flock (on the ocfs2 fs) to control access to a cache. The php application is being benchmarked with a concurrency of 7.

The above scenario puts the VMs under both high IO and CPU load; uptime reports a system load of around 7-8.

The kernel was compiled using gcc 4.3.3 from Ubuntu 9.04. Please let me know what further info you require.
Please attach the object dump of ocfs2_dlm.ko:

$ objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko >/tmp/ocfs2_dlm.out

Use the module that generated this oops.
Created attachment 22144 [details] objdump -DSl fs/ocfs2/dlm/ocfs2_dlm.o

I only use a static kernel, so I did the following instead:

objdump -DSl fs/ocfs2/dlm/ocfs2_dlm.o >ocfs2_dlm.out

Does that give you what you require?
Created attachment 22145 [details] Debug patch

Could you please run with the above patch. What seems to be happening is that the o2hb heartbeat sends a node-down event to the dlm. As part of that, the dlm scans the mle hash to clear all stale mles (now stale because of the node death). The oops occurs because an mle has an incorrect type. This patch prints the mles found; hopefully it will tell us more.
I have applied the patch and recompiled. Now I just have to reproduce the stack trace (not something I've managed to do reliably yet). I will report back when I have something.
-- Progress update --

I've been working with Joel Becker and Sunil Mushran at Oracle on the issue of crashing ocfs2 nodes. The issue appears to be related to the flock system call, since only heavy use of this feature appears to cause the crash.

What we know:
- Constant untar/delete of kernel sources on 2 nodes at the same time produces no crash.
- Using flock on the www nodes on the ocfs2 FS will crash a node within a random amount of time (usually 5 min - 2 hours).
- Doing the flock on the same nodes, but on a local FS instead, does not produce a crash.
- The issue exists across multiple kernel versions (tested 2.6.27.25, 2.6.28.10, 2.6.30 and 2.6.27.25 openSUSE sources with _NO_ paravirt_ops, i.e. the xen sub-arch).
- The issue exists with or without the use of paravirt_ops.

Unfortunately, when the ocfs2 node crashes, no stack trace is produced; the xen hypervisor only says that the VM rebooted itself.

I have written a python script to try and replicate the operations performed by the php web application. Since running this I have managed to get the same stack trace twice. Hopefully this sheds some light on things.
Created attachment 22182 [details] flock stress tester mentioned in comment #5

It takes 2 arguments: the number of processes and the number of flocks to do for each process. I.e. "./flock.py 5 10000" will fork 5 processes, each of which will do 10000 flocks. You must first create a file called "flock.test" in the current directory. The children print a '.' to stdout when they attempt an exclusive lock.
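For readers without access to the attachment, a minimal sketch of what such a stress tester might look like (this is a hypothetical reconstruction, not the attached script; only the "flock.test" file name, the fork/iteration arguments and the dot-per-attempt behaviour are taken from the description above):

```python
import fcntl
import os
import sys

def worker(path, iterations):
    # Each child repeatedly takes and drops an exclusive flock on the
    # shared file, printing a '.' per lock attempt as described above.
    with open(path, "r+") as f:
        for _ in range(iterations):
            sys.stdout.write(".")
            sys.stdout.flush()
            fcntl.flock(f, fcntl.LOCK_EX)  # blocks until granted
            fcntl.flock(f, fcntl.LOCK_UN)  # release immediately

def run(nprocs, iterations, path="flock.test"):
    # Fork nprocs children hammering flock(2) on the shared file;
    # returns the number of children reaped.
    pids = []
    for _ in range(nprocs):
        pid = os.fork()
        if pid == 0:
            worker(path, iterations)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return len(pids)

if __name__ == "__main__" and len(sys.argv) == 3:
    run(int(sys.argv[1]), int(sys.argv[2]))
```

Since all processes contend on the same inode, this keeps the dlm constantly granting and revoking the EX lock, which matches the high-contention pattern of the php cache.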
Created attachment 22183 [details] stack trace mentioned in comment #5

This stack trace has been produced twice, both times by running the "flock stress tester" python script attached to this bug.
Created attachment 22194 [details] dlm_mig_lockres_worker+0x574/0xe6f I got this when I killed the node that owned the flock in the lockres (determined using dlm_locks) while other nodes were still trying to flock the same file.
Created attachment 22195 [details] dlm_clean_master_list+0x3b9/0x400

I killed a node and one of the remaining nodes produced this stack trace. This is a similar location to attachment #1 [details] (line 3260 vs. 3266). The difference is that attachment #1 [details] was at a BUG(), whereas this was at:

hlist_for_each(list, bucket) {
Created attachment 22196 [details] BUG kmalloc-16: Redzone overwritten After Pekka's comment in http://bugzilla.kernel.org/show_bug.cgi?id=13631, I turned on slub/slab debugging. It would appear there is some memory corruption going on.
Created attachment 22216 [details] kernel BUG at fs/ocfs2/dlmglue.c:2991

I've been trying to trigger the bug using the flock.py script I wrote, but I've let it run for hours without a hitch. What I did notice, however, is that one time when I did a "killall flock.py", the system locked up. Further testing revealed that the following procedure will, on almost all occasions, trigger the attached BUG:

1. Run a script on node A that takes an EX lock for 10s before releasing it. According to "debugfs.ocfs2 -R dlm_locks", the "F" lock is owned by node A.
2. Start the same script on node B multiple times (I was starting 3) so that they all queue waiting to get the lock.
3. When the 10s is up on node A, one of the scripts on node B will get the lock, but "debugfs.ocfs2 -R dlm_locks" still shows that node A owns the lock.
4. While the first script on node B still has the lock (i.e. there are still 2 scripts waiting on node B to get the lock), run "killall flock-10s.py".
5. Node B will crash with the attached stack trace.

Comments:
- I've only been able to trigger the BUG when the lock is owned by the opposite node to the one I run killall on.
- This is a different stack trace to the previous ones, but I'm hoping they are related and that, together, they provide some clues as to what is going wrong.

The script is only a few lines, but I'll attach flock-10s.py after this for your convenience. Hopefully you are also able to reproduce.
Created attachment 22217 [details] flock-10s.py - take an flock for 10 seconds

Python script that takes an flock for 10 seconds before releasing it.
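A hedged sketch of what flock-10s.py likely does, based on the description in the previous comment (the attached script is authoritative; the function name and "flock.test" path here are assumptions):

```python
import fcntl
import time

def hold_lock(path, seconds=10):
    # Take an exclusive flock on the shared test file, hold it for
    # `seconds`, then release it. Returns the time actually held.
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until we own the lock
        start = time.time()
        time.sleep(seconds)
        fcntl.flock(f, fcntl.LOCK_UN)
        return time.time() - start
```

The 10-second hold is what lets several waiters queue up on node B while node A owns the lock, which is the state the killall then disturbs.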
Thanks for the investigation. I will look into this when I get some free time.
We're a bit busy, but I'm hoping to get some testing going in the near future. It really seems like something we'll have an easier time debugging when we can reproduce it locally.
I reckon you're probably right. Let me know if you want me to test something.