Latest working kernel version: none Earliest failing kernel version: 2.6.25.18-0.2 Distribution: openSUSE 11.0 Hardware Environment: HP DL 140G4/6Gb RAM/amd64 Software Environment: Problem Description: I'have 3 identical hardware servers with LAMP. Two with openSUSE 11.0 and one with Gentoo which using some SAN storage through FC cards with help of OCFS2. Randomly, but at least 2 times per days all servers fail and all operations on ocfs2 filesystem blocks with "Permission denied" error. Only hardware reboot at least 2 nodes helps. Today I've tracked some messages regarding this problem. At the begin I've saw that all my servers begin to answer error 500 at any requests. I've stopped all web stuff (apache/nginx) using ocfs2 and unmount all ocfs2 shares. At one node during unmount of one of the ocfs2 shares I've saw following: falcon-cl3:~ # umount /home/apache/users/D1 Segmentation fault Jan 29 01:31:49 falcon-cl3 kernel: (19786,3):dlm_empty_lockres:2783 ERROR: lockres W0000000000000002ad047e83635789 still has local locks! Jan 29 01:31:49 falcon-cl3 kernel: ------------[ cut here ]------------ Jan 29 01:31:49 falcon-cl3 kernel: kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2784! Jan 29 01:31:49 falcon-cl3 kernel: invalid opcode: 0000 [1] SMP Jan 29 01:31:49 falcon-cl3 kernel: last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map Jan 29 01:31:49 falcon-cl3 kernel: CPU 3 Jan 29 01:31:49 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456 async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock Jan 29 01:31:49 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G N 2.6.25.18-0.2-default #1 Jan 29 01:31:49 falcon-cl3 kernel: RIP: 0010:[<ffffffff88328f78>] [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2 Jan 29 01:31:49 falcon-cl3 kernel: RSP: 0018:ffff81010e865b28 EFLAGS: 00010292 Jan 29 01:31:49 falcon-cl3 kernel: RAX: 000000000000006a RBX: 000000000000001f RCX: ffff81006d0d3800 Jan 29 01:31:49 falcon-cl3 kernel: RDX: 0000000100000000 RSI: 0000000000000096 RDI: 0000000000000296 Jan 29 01:31:49 falcon-cl3 kernel: RBP: ffff81010e865c58 R08: ffffffff80607ee0 R09: ffff810001031280 Jan 29 01:31:49 falcon-cl3 kernel: R10: 0000000000000000 R11: 00001243dc075bdd R12: 0000000000000001 Jan 29 01:31:49 falcon-cl3 kernel: R13: ffff81006d0d3800 R14: ffff81006d0d3800 R15: ffff81006d0d3898 Jan 29 01:31:49 falcon-cl3 kernel: FS: 00007f18e6ace6f0(0000) GS:ffff8101ef0d9a40(0000) knlGS:0000000000000000 Jan 29 01:31:49 falcon-cl3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 29 01:31:49 falcon-cl3 kernel: CR2: 00007f019d6a74a8 CR3: 0000000104898000 CR4: 00000000000006a0 Jan 29 01:31:49 falcon-cl3 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 29 01:31:49 falcon-cl3 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 29 01:31:49 falcon-cl3 kernel: Process umount (pid: 19786, threadinfo ffff81010e864000, task ffff8101650080c0) Jan 29 01:31:49 falcon-cl3 kernel: Stack: ffff81010d125320 ffffffff00000001 0000000000000000 ffff81010e865bf8 Jan 29 01:31:49 falcon-cl3 kernel: 0000000000000092 ffff81010e865c10 ffff8101ec18b400 ffffffd900000000 Jan 29 01:31:50 falcon-cl3 kernel: ffff8101ec18b45c 0000000000000000 ffff81006d0d3828 000000038022e7eb Jan 29 01:31:50 falcon-cl3 kernel: Call Trace: Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8831f6b6>] :ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff88404c17>] :ocfs2:ocfs2_dlm_shutdown+0x78/0x94 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8841e412>] :ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8842084e>] :ocfs2:ocfs2_put_super+0x1b/0x20 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a2558>] generic_shutdown_super+0x63/0xf1 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a25fb>] kill_block_super+0x15/0x2a Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a26dc>] deactivate_super+0x69/0x7e Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802b6e9b>] mntput_no_expire+0x55/0x85 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802b7626>] sys_umount+0x311/0x343 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f Jan 29 01:31:50 falcon-cl3 kernel: [<00007f18e6438147>] Jan 29 01:31:50 falcon-cl3 kernel: Jan 29 01:31:50 falcon-cl3 kernel: Jan 29 01:31:50 falcon-cl3 kernel: Code: df 0a 00 00 45 8b 4d 14 48 c7 c1 b0 fa 32 88 48 c7 c7 98 20 33 88 65 8b 14 25 24 00 00 00 48 89 04 24 89 d2 31 c0 e8 37 e4 11 f8 <0f> 0b eb fe bf 64 00 00 00 e8 b1 79 f1 f7 e9 f3 f3 ff ff 48 8b Jan 29 01:31:50 falcon-cl3 kernel: RIP [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2 Jan 29 01:31:50 falcon-cl3 kernel: RSP <ffff81010e865b28> Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]--- Jan 29 01:31:50 falcon-cl3 kernel: ------------[ cut here ]------------ Jan 29 01:31:50 falcon-cl3 kernel: WARNING: at kernel/exit.c:892 do_exit+0x41/0x6ec() Jan 29 01:31:50 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456 async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock Jan 29 01:31:50 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G D N 2.6.25.18-0.2-default #1 Jan 29 01:31:50 falcon-cl3 kernel: Jan 29 01:31:50 falcon-cl3 kernel: Call Trace: Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020d696>] dump_trace+0xc4/0x576 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020db88>] show_trace+0x40/0x57 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff80447249>] dump_stack+0x72/0x7b Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff80237527>] warn_on_slowpath+0x58/0x80 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8023ade6>] do_exit+0x41/0x6ec Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8044a37f>] oops_begin+0x0/0xa0 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020dfd9>] die+0x5d/0x66 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8044a8aa>] do_trap+0x110/0x11f Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020e70c>] do_invalid_op+0xa0/0xa9 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff80449ca9>] error_exit+0x0/0x60 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8831f6b6>] :ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff88404c17>] :ocfs2:ocfs2_dlm_shutdown+0x78/0x94 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8841e412>] :ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8842084e>] :ocfs2:ocfs2_put_super+0x1b/0x20 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a2558>] generic_shutdown_super+0x63/0xf1 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a25fb>] kill_block_super+0x15/0x2a Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802a26dc>] deactivate_super+0x69/0x7e Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802b6e9b>] mntput_no_expire+0x55/0x85 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff802b7626>] sys_umount+0x311/0x343 Jan 29 01:31:50 falcon-cl3 kernel: [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f Jan 29 01:31:50 falcon-cl3 kernel: [<00007f18e6438147>] Jan 29 01:31:50 falcon-cl3 kernel: Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]--- And all operation on this share on other nodes fails. Software reboot using reboot command failed, so I rebooted used power switch. During initial mount in logs I've saw following: Jan 29 02:27:07 falcon-cl1 kernel: o2net: connection to node falcon-cl3 (num 1) at 172.16.2.127:7777 has been idle for 30.0 seconds, shutting it down. Jan 29 02:27:07 falcon-cl1 kernel: (10235,3):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1233188797.269265 now 1233188827.266240 dr 1233188797.269248 adv 1233188797.269268:1233188797.269268 func (9df3a87d:513) 1233187854.268609:1233187854.268622) Jan 29 02:27:07 falcon-cl1 kernel: o2net: no longer connected to node falcon-cl3 (num 1) at 172.16.2.127:7777 Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_do_master_request:1409 ERROR: link to 1 went down! Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_do_master_request:1409 ERROR: link to 1 went down! Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:986 ERROR: status = -112 Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:986 ERROR: status = -112 Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_do_master_request:1409 ERROR: link to 1 went down! Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_get_lock_resource:986 ERROR: status = -112 Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_do_master_request:1409 ERROR: link to 1 went down! Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:986 ERROR: status = -107 Jan 29 02:27:39 falcon-cl1 kernel: rport-8:0-2: blocked FC remote port time out: removing rport Jan 29 02:27:40 falcon-cl1 kernel: (9273,1):ocfs2_dlm_eviction_cb:108 device (8,64): dlm has evicted node 1 Jan 29 02:27:40 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:913 558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one node (1) to recover before lock mastery can begin Jan 29 02:27:41 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one node (1) to recover before lock mastery can begin Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_restart_lock_mastery:1292 ERROR: node down! 1 Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_wait_for_lock_mastery:1109 ERROR: status = -11 Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_restart_lock_mastery:1292 ERROR: node down! 1 Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_wait_for_lock_mastery:1109 ERROR: status = -11 Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_restart_lock_mastery:1292 ERROR: node down! 1 Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_wait_for_lock_mastery:1109 ERROR: status = -11 Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_restart_lock_mastery:1292 ERROR: node down! 1 Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_wait_for_lock_mastery:1109 ERROR: status = -11 Jan 29 02:27:42 falcon-cl1 kernel: (3451,3):ocfs2_dlm_eviction_cb:108 device (8,64): dlm has evicted node 1 Jan 29 02:27:43 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000a7761100000000: at least one node (1) to recover before lock mastery can begin Jan 29 02:27:43 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:M00000000000000022b919100000000: at least one node (1) to recover before lock mastery can begin Jan 29 02:27:43 falcon-cl1 kernel: (10233,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000d5bf2100000000: at least one node (1) to recover before lock mastery can begin Jan 29 02:27:43 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:967 558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000b9c6f100000000: at least one node (1) to recover before lock mastery can begin Steps to reproduce:
If this is still seen on modern kernels then please re-open/update