Bug 12569 - ocfs2 dlmmaster kernel bug
Summary: ocfs2 dlmmaster kernel bug
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 blocking
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-28 17:27 UTC by Oleg Gawriloff
Modified: 2012-10-30 15:48 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.25.18-0.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Oleg Gawriloff 2009-01-28 17:27:37 UTC
Latest working kernel version: none
Earliest failing kernel version: 2.6.25.18-0.2
Distribution: openSUSE 11.0
Hardware Environment: HP DL 140G4/6Gb RAM/amd64
Software Environment:
Problem Description:

I'have 3 identical hardware servers with LAMP. Two with openSUSE 11.0 and one with Gentoo which using some SAN storage through FC cards with help of OCFS2. Randomly, but at least 2 times per days all servers fail and all operations on ocfs2 filesystem blocks with "Permission denied" error. Only hardware reboot at least 2 nodes helps. Today I've tracked some messages regarding this problem.
At the begin I've saw that all my servers begin to answer error 500 at any requests. I've stopped all web stuff (apache/nginx) using ocfs2 and unmount all ocfs2 shares. At one node during unmount of one of the ocfs2 shares I've saw following:
falcon-cl3:~ # umount /home/apache/users/D1
Segmentation fault
Jan 29 01:31:49 falcon-cl3 kernel: (19786,3):dlm_empty_lockres:2783 ERROR: lockres W0000000000000002ad047e83635789 still has local locks!
Jan 29 01:31:49 falcon-cl3 kernel: ------------[ cut here ]------------
Jan 29 01:31:49 falcon-cl3 kernel: kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2784!
Jan 29 01:31:49 falcon-cl3 kernel: invalid opcode: 0000 [1] SMP
Jan 29 01:31:49 falcon-cl3 kernel: last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Jan 29 01:31:49 falcon-cl3 kernel: CPU 3
Jan 29 01:31:49 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456 async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock
Jan 29 01:31:49 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G        N 2.6.25.18-0.2-default #1
Jan 29 01:31:49 falcon-cl3 kernel: RIP: 0010:[<ffffffff88328f78>]  [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:49 falcon-cl3 kernel: RSP: 0018:ffff81010e865b28  EFLAGS: 00010292
Jan 29 01:31:49 falcon-cl3 kernel: RAX: 000000000000006a RBX: 000000000000001f RCX: ffff81006d0d3800
Jan 29 01:31:49 falcon-cl3 kernel: RDX: 0000000100000000 RSI: 0000000000000096 RDI: 0000000000000296
Jan 29 01:31:49 falcon-cl3 kernel: RBP: ffff81010e865c58 R08: ffffffff80607ee0 R09: ffff810001031280
Jan 29 01:31:49 falcon-cl3 kernel: R10: 0000000000000000 R11: 00001243dc075bdd R12: 0000000000000001
Jan 29 01:31:49 falcon-cl3 kernel: R13: ffff81006d0d3800 R14: ffff81006d0d3800 R15: ffff81006d0d3898
Jan 29 01:31:49 falcon-cl3 kernel: FS:  00007f18e6ace6f0(0000) GS:ffff8101ef0d9a40(0000) knlGS:0000000000000000
Jan 29 01:31:49 falcon-cl3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 29 01:31:49 falcon-cl3 kernel: CR2: 00007f019d6a74a8 CR3: 0000000104898000 CR4: 00000000000006a0
Jan 29 01:31:49 falcon-cl3 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 29 01:31:49 falcon-cl3 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 29 01:31:49 falcon-cl3 kernel: Process umount (pid: 19786, threadinfo ffff81010e864000, task ffff8101650080c0)
Jan 29 01:31:49 falcon-cl3 kernel: Stack:  ffff81010d125320 ffffffff00000001 0000000000000000 ffff81010e865bf8
Jan 29 01:31:49 falcon-cl3 kernel:  0000000000000092 ffff81010e865c10 ffff8101ec18b400 ffffffd900000000
Jan 29 01:31:50 falcon-cl3 kernel:  ffff8101ec18b45c 0000000000000000 ffff81006d0d3828 000000038022e7eb
Jan 29 01:31:50 falcon-cl3 kernel: Call Trace:
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8831f6b6>] :ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88404c17>] :ocfs2:ocfs2_dlm_shutdown+0x78/0x94
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8841e412>] :ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8842084e>] :ocfs2:ocfs2_put_super+0x1b/0x20
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a2558>] generic_shutdown_super+0x63/0xf1
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a25fb>] kill_block_super+0x15/0x2a
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a26dc>] deactivate_super+0x69/0x7e
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b6e9b>] mntput_no_expire+0x55/0x85
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b7626>] sys_umount+0x311/0x343
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f
Jan 29 01:31:50 falcon-cl3 kernel:  [<00007f18e6438147>]
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: Code: df 0a 00 00 45 8b 4d 14 48 c7 c1 b0 fa 32 88 48 c7 c7 98 20 33 88 65 8b 14 25 24 00 00 00 48 89 04 24 89 d2 31 c0 e8 37 e4 11 f8 <0f> 0b eb fe bf 64 00 00 00 e8 b1 79 f1 f7 e9 f3 f3 ff ff 48 8b
Jan 29 01:31:50 falcon-cl3 kernel: RIP  [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:50 falcon-cl3 kernel:  RSP <ffff81010e865b28>
Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]---
Jan 29 01:31:50 falcon-cl3 kernel: ------------[ cut here ]------------
Jan 29 01:31:50 falcon-cl3 kernel: WARNING: at kernel/exit.c:892 do_exit+0x41/0x6ec()
Jan 29 01:31:50 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456 async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock
Jan 29 01:31:50 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G      D N 2.6.25.18-0.2-default #1
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: Call Trace:
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020d696>] dump_trace+0xc4/0x576
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020db88>] show_trace+0x40/0x57
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80447249>] dump_stack+0x72/0x7b
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80237527>] warn_on_slowpath+0x58/0x80
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8023ade6>] do_exit+0x41/0x6ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8044a37f>] oops_begin+0x0/0xa0
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020dfd9>] die+0x5d/0x66
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8044a8aa>] do_trap+0x110/0x11f
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020e70c>] do_invalid_op+0xa0/0xa9
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80449ca9>] error_exit+0x0/0x60
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8831f6b6>] :ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88404c17>] :ocfs2:ocfs2_dlm_shutdown+0x78/0x94
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8841e412>] :ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8842084e>] :ocfs2:ocfs2_put_super+0x1b/0x20
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a2558>] generic_shutdown_super+0x63/0xf1
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a25fb>] kill_block_super+0x15/0x2a
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a26dc>] deactivate_super+0x69/0x7e
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b6e9b>] mntput_no_expire+0x55/0x85
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b7626>] sys_umount+0x311/0x343
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f
Jan 29 01:31:50 falcon-cl3 kernel:  [<00007f18e6438147>]
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]---

And all operation on this share on other nodes fails. Software reboot using reboot command failed, so I rebooted used power switch. During initial mount in logs I've saw following:
Jan 29 02:27:07 falcon-cl1 kernel: o2net: connection to node falcon-cl3 (num 1) at 172.16.2.127:7777 has been idle for 30.0 seconds, shutting it down.
Jan 29 02:27:07 falcon-cl1 kernel: (10235,3):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1233188797.269265 now 1233188827.266240 dr 1233188797.269248 adv 1233188797.269268:1233188797.269268 func (9df3a87d:513) 1233187854.268609:1233187854.268622)
Jan 29 02:27:07 falcon-cl1 kernel: o2net: no longer connected to node falcon-cl3 (num 1) at 172.16.2.127:7777
Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_do_master_request:1409 ERROR: link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_do_master_request:1409 ERROR: link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:986 ERROR: status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:986 ERROR: status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_do_master_request:1409 ERROR: link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_get_lock_resource:986 ERROR: status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_do_master_request:1409 ERROR: link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:986 ERROR: status = -107
Jan 29 02:27:39 falcon-cl1 kernel:  rport-8:0-2: blocked FC remote port time out: removing rport
Jan 29 02:27:40 falcon-cl1 kernel: (9273,1):ocfs2_dlm_eviction_cb:108 device (8,64): dlm has evicted node 1
Jan 29 02:27:40 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:913 558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one node (1) to recover before lock mastery can begin
Jan 29 02:27:41 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one node (1) to recover before lock mastery can begin
Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_restart_lock_mastery:1292 ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_wait_for_lock_mastery:1109 ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_restart_lock_mastery:1292 ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_wait_for_lock_mastery:1109 ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_restart_lock_mastery:1292 ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_wait_for_lock_mastery:1109 ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_restart_lock_mastery:1292 ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_wait_for_lock_mastery:1109 ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (3451,3):ocfs2_dlm_eviction_cb:108 device (8,64): dlm has evicted node 1
Jan 29 02:27:43 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000a7761100000000: at least one node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:M00000000000000022b919100000000: at least one node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (10233,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000d5bf2100000000: at least one node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000b9c6f100000000: at least one node (1) to recover before lock mastery can begin






Steps to reproduce:
Comment 1 Alan 2012-10-30 15:48:47 UTC
If this is still seen on modern kernels then please re-open/update

Note You need to log in before you can comment on or make changes to this bug.