Bug 25782
Summary: | kernel BUG at linux-2.6.36/mm/slub.c:2874! | ||
---|---|---|---|
Product: | Memory Management | Reporter: | Pawel Sikora (pluto) |
Component: | Slab Allocator | Assignee: | Pekka Enberg (penberg) |
Status: | RESOLVED OBSOLETE | ||
Severity: | high | CC: | alan |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.36.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | my modular kernel config. |
Description
Pawel Sikora
2010-12-28 22:18:03 UTC
I can't see a BUG statement at line 2874 of 2.6.36.2's mm/slub.c. Please let us know what additional patches have been applied and please let us know what your slub.c:2874 looks like.

(In reply to comment #1)
> I can't see a BUG statement at line 2874 of 2.6.36.2's mm/slub.c. Please let
> us know what additional patches have been applied and please let us know what
> your slub.c:2874 looks like.

there are grsec and vserver patches applied (by vendor; not used on this machine). on pure 2.6.36.2 the corresponding BUG_ON() is at line 2833:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.36.y.git;a=blob;f=mm/slub.c;h=13fffe1f0f3dc5992471ffd590326b9517b0cc3a;hb=a1346c99fc89f2b3d35c7d7e2e4aef8ea4124342#l2833
should i build a pure 2.6.36.2 from git and stress it again?

Yes, please. I doubt you will be able to trigger the problem with mainline, though. The oops looks like slab corruption where vfs_rename() passes an invalid pointer to kfree(). If the problem was in mainline, I suspect a lot more people would be triggering it.

ok, i've built the pure 2.6.36.2 and booted both machines with slub_debug=FZPU.

first test machine (server, hostname: odra) has 2 raid-10 devices exported via AoE (ggaoed daemon) as e92.0/e93.0 with linear lvm and ocfs2 on top. on this machine the ocfs2 is mounted in /remote/cluster and it's locally unused.

second test machine (client, hostname: hal) imports lvm on AoE devices and mounts in /remote/cluster too.

now, on the client machine in /remote/cluster i'm running 'svn up' on a few GB repo from an external svn server and observing svn logs. after a few minutes the client machine just reboots. there's no oops nor panic on console. i've checked the bios and there are no logs from health monitoring, so it looks like a total kernel memory corruption.

on the other side, on the server machine i've logged an oops from the slab checker. here's the dmesg fragment:
(...)
Dec 29 16:09:22 odra kernel: [ 4307.772621] o2net: no longer connected to node hal (num 0) at 10.0.2.24:7777
^^^^^^^^^ here, the server node noticed the client reboot.
Dec 29 16:09:22 odra kernel: [ 4307.772685] (o2hb-56CBFCA7B6,4162,10):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 56CBFCA7B60B4C7BBA86D0ABE068A1FD
Dec 29 16:09:22 odra kernel: [ 4307.828455] (ocfs2rec,7005,10):ocfs2_replay_journal:1605 Recovering node 0 from slot 1 on device (253,0)
Dec 29 16:09:23 odra kernel: [ 4309.206338] (dlm_reco_thread,4177,12):dlm_get_lock_resource:836 56CBFCA7B60B4C7BBA86D0ABE068A1FD:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Dec 29 16:09:23 odra kernel: [ 4309.206345] (dlm_reco_thread,4177,12):dlm_get_lock_resource:870 56CBFCA7B60B4C7BBA86D0ABE068A1FD: recovery map is not empty, but must master $RECOVERY lock now
Dec 29 16:09:23 odra kernel: [ 4309.206374] (dlm_reco_thread,4177,12):dlm_do_recovery:523 (4177) Node 2 is the Recovery Master for the Dead Node 0 for Domain 56CBFCA7B60B4C7BBA86D0ABE068A1FD
Dec 29 16:09:46 odra kernel: [ 4332.206553] (ocfs2rec,7005,10):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
Dec 29 16:09:46 odra kernel: [ 4332.233272] (kworker/u:0,6651,10):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1
Dec 29 16:23:53 odra kernel: [ 5178.862323] =============================================================================
Dec 29 16:23:53 odra kernel: [ 5178.862327] BUG kmalloc-16: Redzone overwritten
Dec 29 16:23:53 odra kernel: [ 5178.862329] -----------------------------------------------------------------------------
Dec 29 16:23:53 odra kernel: [ 5178.862331]
Dec 29 16:23:53 odra kernel: [ 5178.862334] INFO: 0xffff88080d41e4e0-0xffff88080d41e4e3. First byte 0x0 instead of 0xcc
Dec 29 16:23:53 odra kernel: [ 5178.862366] INFO: Allocated in ocfs2_recovery_init+0x66/0xf0 [ocfs2] age=1478919 cpu=4 pid=4161
Dec 29 16:23:53 odra kernel: [ 5178.862373] INFO: Freed in dev_uevent+0x133/0x160 age=1497427 cpu=4 pid=3928
Dec 29 16:23:53 odra kernel: [ 5178.862376] INFO: Slab 0xffffea001c2e6690 objects=46 used=40 fp=0xffff88080d41e580 flags=0x6000000000000c1
Dec 29 16:23:53 odra kernel: [ 5178.862379] INFO: Object 0xffff88080d41e4d0 @offset=1232 fp=0xffff88080d41e528
Dec 29 16:23:53 odra kernel: [ 5178.862380]
Dec 29 16:23:53 odra kernel: [ 5178.862382] Bytes b4 0xffff88080d41e4c0: 0a b3 fe ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Dec 29 16:23:53 odra kernel: [ 5178.862391] Object 0xffff88080d41e4d0: 00 00 00 00 00 00 00 00 e0 e4 41 0d 08 88 ff ff ..........A.....
Dec 29 16:23:53 odra kernel: [ 5178.862399] Redzone 0xffff88080d41e4e0: 00 00 00 00 cc cc cc cc ........
Dec 29 16:23:53 odra kernel: [ 5178.862406] Padding 0xffff88080d41e520: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Dec 29 16:23:53 odra kernel: [ 5178.862415] Pid: 7619, comm: umount Not tainted 2.6.36.2 #3
Dec 29 16:23:53 odra kernel: [ 5178.862417] Call Trace:
Dec 29 16:23:53 odra kernel: [ 5178.862427] [<ffffffff811168ee>] print_trailer+0xfe/0x160
Dec 29 16:23:53 odra kernel: [ 5178.862431] [<ffffffff81116f44>] check_bytes_and_report+0xf4/0x130
Dec 29 16:23:53 odra kernel: [ 5178.862435] [<ffffffff81116fe3>] check_object+0x63/0x260
Dec 29 16:23:53 odra kernel: [ 5178.862449] [<ffffffffa05b724d>] ? ocfs2_recovery_exit+0xbd/0xd0 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862453] [<ffffffff81118355>] __slab_free+0x225/0x390
Dec 29 16:23:53 odra kernel: [ 5178.862457] [<ffffffff81118707>] kfree+0xb7/0x110
Dec 29 16:23:53 odra kernel: [ 5178.862469] [<ffffffffa05b724d>] ocfs2_recovery_exit+0xbd/0xd0 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862473] [<ffffffff81140f7c>] ? iput+0x2c/0x280
Dec 29 16:23:53 odra kernel: [ 5178.862482] [<ffffffffa057ee2d>] ? ocfs2_truncate_log_shutdown+0x7d/0x1c0 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862498] [<ffffffffa05e38b2>] ocfs2_dismount_volume+0xa2/0x430 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862510] [<ffffffffa05e3c90>] ocfs2_put_super+0x50/0x110 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862514] [<ffffffff8112c3a1>] generic_shutdown_super+0x51/0xd0
Dec 29 16:23:53 odra kernel: [ 5178.862517] [<ffffffff8112c44c>] kill_block_super+0x2c/0x50
Dec 29 16:23:53 odra kernel: [ 5178.862529] [<ffffffffa05e1872>] ocfs2_kill_sb+0x72/0x80 [ocfs2]
Dec 29 16:23:53 odra kernel: [ 5178.862533] [<ffffffff8112c755>] deactivate_locked_super+0x45/0x60
Dec 29 16:23:53 odra kernel: [ 5178.862536] [<ffffffff8112d4a5>] deactivate_super+0x45/0x60
Dec 29 16:23:53 odra kernel: [ 5178.862540] [<ffffffff81144e86>] mntput_no_expire+0x86/0xf0
Dec 29 16:23:53 odra kernel: [ 5178.862544] [<ffffffff81145997>] sys_umount+0x67/0x370
Dec 29 16:23:53 odra kernel: [ 5178.862550] [<ffffffff81002dab>] system_call_fastpath+0x16/0x1b
Dec 29 16:23:53 odra kernel: [ 5178.862553] FIX kmalloc-16: Restoring 0xffff88080d41e4e0-0xffff88080d41e4e3=0xcc
Dec 29 16:23:53 odra kernel: [ 5178.862554]
Dec 29 16:23:58 odra kernel: [ 5183.637198] ocfs2: Unmounting device (253,0) on (node 2)

i suppose that AoE kernel module could overwrite something on both nodes but i don't see any debugging options in `modinfo aoe`.

Created attachment 41882
my modular kernel config.
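
The addresses in the slab report above can be cross-checked by hand. The following is a minimal sketch of that arithmetic, using only values copied from the report; the constant names are made up for illustration, and the 4 KiB page size (with the slab starting on a page boundary) is an assumption, not something the log states.

```python
# Values copied from the "BUG kmalloc-16: Redzone overwritten" report.
OBJECT_ADDR = 0xffff88080d41e4d0      # "INFO: Object 0x... @offset=1232"
OBJECT_SIZE = 16                      # the cache is kmalloc-16
REPORTED_FIRST = 0xffff88080d41e4e0   # first overwritten byte
REPORTED_LAST = 0xffff88080d41e4e3    # last overwritten byte
OBJECT_BYTES = bytes.fromhex("0000000000000000e0e4410d0888ffff")

# Assumption: 4 KiB pages and a slab that starts on a page boundary.
PAGE_SIZE = 4096

# The red zone sits directly behind the 16 object bytes.
redzone_start = OBJECT_ADDR + OBJECT_SIZE

# Number of red-zone bytes that no longer hold 0xcc.
overwritten = REPORTED_LAST - REPORTED_FIRST + 1

# Offset of the object inside its slab page; the report prints 1232.
offset_in_slab = OBJECT_ADDR % PAGE_SIZE

# The last eight object bytes, read as a little-endian pointer.
tail_pointer = int.from_bytes(OBJECT_BYTES[8:], "little")

print(hex(redzone_start), overwritten, offset_in_slab, hex(tail_pointer))
```

Under these assumptions the damaged range is exactly the first four bytes after the object, and the pointer stored in the second half of the object has the same value as the red zone's start address.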
The oops says that "osb->recovery_map" passed to kfree() in ocfs2_recovery_exit() has its red zone overwritten (data written below or above the struct). Looks like an OCFS2 bug to me.

This patch in mainline probably cures the problem:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=75d9bbc73804285020aa4d99bd2a9600edea8945
That should be tagged for stable. Andrew?

(In reply to comment #7)
> This patch in mainline probably cures the problem:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=75d9bbc73804285020aa4d99bd2a9600edea8945

is it a cure for the client reboot too, or only for the server redzone damage?

Just server, probably. Might be worthwhile to try out 2.6.37-rc8 on both client and server side.

(In reply to comment #9)
> Just server, probably. Might be worth-while to try out 2.6.37-rc8 on both
> client and server side.

on 2.6.37-rc8 there's no redzone corruption but both machines reboot after a few minutes of svn i/o. now, i'll try to isolate the faulty layer on a 1-node ocfs2 cluster with and without ata-over-ethernet...

i've isolated from this combo problem one more testcase for redzone corruption (only AoE+LVM) and filed it as PR26012.

Closing as obsolete, if this is still seen with modern kernels please re-open and update
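
For readers unfamiliar with red zoning, here is a toy model of the check that produced the report in this bug. It is a sketch only, not the kernel's implementation: the function names are hypothetical, the 8-byte red zone and the 0xcc fill byte are taken from the dump in the log, and the 4-byte zero store is an invented stand-in for whatever wrote past the object.

```python
RED_ACTIVE = 0xCC  # fill byte the report expects in the red zone


def make_slot(object_size, redzone_size):
    """Build a zeroed object followed by a red zone filled with RED_ACTIVE."""
    return bytearray(object_size) + bytearray([RED_ACTIVE] * redzone_size)


def check_redzone(slot, object_size):
    """Return (first_bad, last_bad, first_bad_byte) or None if intact."""
    bad = [i for i in range(object_size, len(slot)) if slot[i] != RED_ACTIVE]
    if not bad:
        return None
    return bad[0], bad[-1], slot[bad[0]]


def restore_redzone(slot, object_size):
    """Refill the red zone, like the 'FIX ... Restoring' line in the log."""
    for i in range(object_size, len(slot)):
        slot[i] = RED_ACTIVE


# A 16-byte object with an 8-byte red zone, as in the kmalloc-16 dump.
slot = make_slot(16, 8)

# Hypothetical overrun: a 4-byte zero store at index 16, one past the object.
slot[16:20] = (0).to_bytes(4, "little")

report = check_redzone(slot, 16)
print(report)
```

The point of the model is that the damage is detected only when the slot is next checked (here on demand, in the log at kfree() time during umount), which can be long after the stray write happened.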