Created attachment 114991 [details]
Screenshot of panic

Two-node cluster configured using the latest corosync (also DRBD 8.4.4, LVM2, and GFS2, but these are inessential).

Steps to reproduce:

1. Start corosync on both nodes.

2. Start dlm_controld (version 4.0.2) on both nodes (using the SCTP protocol, since TCP cannot be used on multi-homed hosts). This adds the following lines to kern.log:

kernel: [  580.428664] sctp: Hash tables configured (established 65536 bind 65536)
kernel: [  580.441779] DLM installed

3. Start clvmd on either node. This adds the following lines to kern.log:

kernel: [ 1345.259502] dlm: Using SCTP for communications
kernel: [ 1345.260699] dlm: clvmd: joining the lockspace group...
kernel: [ 1345.262962] dlm: clvmd: dlm_recover 1
kernel: [ 1345.262968] dlm: clvmd: group event done 0 0
kernel: [ 1345.262992] dlm: clvmd: add member 1024
kernel: [ 1345.262995] dlm: clvmd: dlm_recover_members 1 nodes
kernel: [ 1345.262996] dlm: clvmd: join complete
kernel: [ 1345.262998] dlm: clvmd: generation 1 slots 1 1:1024
kernel: [ 1345.262999] dlm: clvmd: dlm_recover_directory
kernel: [ 1345.263000] dlm: clvmd: dlm_recover_directory 0 in 0 new
kernel: [ 1345.263002] dlm: clvmd: dlm_recover_directory 0 out 0 messages
kernel: [ 1345.263019] dlm: clvmd: dlm_recover 1 generation 1 done: 0 ms

4. Start clvmd on the second node. With high probability, one or both nodes panic in a similar way. See the screenshot in the attachment. The stack trace can differ slightly above the EOI line, but the RIP was always the same.

I suppose the dumped Code bytes correspond to one of the BUG_ON macros inside sctp_cmd_interpreter. So this is a bug, and right now it completely prevents me from using my cluster, since DLM refuses to use TCP on multi-homed hosts.
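A note on step 2: with dlm 4.x the protocol can also be pinned in the daemon's config file rather than on the command line. A minimal sketch, assuming the stock /etc/dlm/dlm.conf location and the protocol option documented for dlm_controld (adjust if your build or distribution differs):

```ini
# /etc/dlm/dlm.conf -- minimal sketch, not a complete config
# Force SCTP so multi-homed hosts work (the TCP transport binds
# to a single address only).
protocol=sctp
```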
Created attachment 115001 [details] Screenshot of panic on 3.11.2
Can you get the machine into a high video resolution so you can see all of the panic? It's probably best to send a report to netdev@vger.kernel.org, especially as you've got a reproducible case.
I'll try. I've also discovered that kernel 3.10.19 does not have this bug. At least there is no panic, and GFS2 works across the cluster.
After finishing the pacemaker config I cannot reproduce that bug. But there is another one, during unmounting of GFS2:

Dec 2 13:48:39 s0 kernel: [267678.264167] GFS2: fsid=s0s1024:udata.0: recover generation 8 done
Dec 2 13:55:07 s0 kernel: [268067.030408] BUG: Dentry ffff880409a85a80{i=19a0043,n=111111_13858195261001095222.png} still in use (-1) [unmount of gfs2 dm-0]
Dec 2 13:55:07 s0 kernel: [268067.094800] ------------[ cut here ]------------
Dec 2 13:55:08 s0 kernel: [268067.127117] kernel BUG at fs/dcache.c:917!
Dec 2 13:55:08 s0 kernel: [268067.158837] invalid opcode: 0000 [#1] SMP
Dec 2 13:55:08 s0 kernel: [268067.190065] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink gfs2 dlm sctp configfs sha256_ssse3 sha256_generic dm_mod drbd(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_tcpudp xt_multiport xt_CT nf_conntrack xt_limit iptable_raw iptable_filter ip_tables x_tables psmouse serio_raw i2c_i801 i2c_core evdev button usbhid e1000e ptp ehci_pci ehci_hcd pps_core
Dec 2 13:55:08 s0 kernel: [268067.289233] CPU: 1 PID: 29333 Comm: umount Tainted: G O 3.11.8 #1
Dec 2 13:55:08 s0 kernel: [268067.322549] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 4.6.4 06/30/2011
Dec 2 13:55:08 s0 kernel: [268067.356042] task: ffff88042d1e4d40 ti: ffff8803d0ff2000 task.ti: ffff8803d0ff2000
Dec 2 13:55:08 s0 kernel: [268067.389389] RIP: 0010:[<ffffffff810fb7fe>] [<ffffffff810fb7fe>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec 2 13:55:08 s0 kernel: [268067.423364] RSP: 0018:ffff8803d0ff3ea0 EFLAGS: 00010292
Dec 2 13:55:08 s0 kernel: [268067.457222] RAX: 0000000000000072 RBX: ffff880409a85a80 RCX: 0000000000000000
Dec 2 13:55:08 s0 kernel: [268067.491051] RDX: ffff88043fc4d938 RSI: ffff88043fc4d098 RDI: ffff88043fc4d098
Dec 2 13:55:08 s0 kernel: [268067.524581] RBP: 0000000000000083 R08: 0000000000000000 R09: 000000000000043d
Dec 2 13:55:08 s0 kernel: [268067.558371] R10: 0000000000000003 R11: 00000000ad55ad55 R12: ffffffffa02730e0
Dec 2 13:55:08 s0 kernel: [268067.591506] R13: ffff8803d44bb000 R14: ffff8803d44bb020 R15: ffff8803d44bb040
Dec 2 13:55:08 s0 kernel: [268067.624131] FS: 00007f811d521840(0000) GS:ffff88043fc40000(0000) knlGS:0000000000000000
Dec 2 13:55:08 s0 kernel: [268067.657315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 2 13:55:08 s0 kernel: [268067.690541] CR2: 00007ff125e87750 CR3: 00000002055e4000 CR4: 00000000000407e0
Dec 2 13:55:08 s0 kernel: [268067.724225] Stack:
Dec 2 13:55:08 s0 kernel: [268067.757698]  ffff88041d1c86f0 ffff88041d1c8400 0000000000000083 ffffffff810fba26
Dec 2 13:55:08 s0 kernel: [268067.792171]  ffff88041d1c8400 ffffffff810e8585 ffff88042ec29d40 0000000000000083
Dec 2 13:55:08 s0 kernel: [268067.826646]  ffff88042d1e4d40 ffffffff810e8667 ffff88041d1c8400 ffffffffa02779e0
Dec 2 13:55:08 s0 kernel: [268067.860776] Call Trace:
Dec 2 13:55:08 s0 kernel: [268067.894397]  [<ffffffff810fba26>] ? shrink_dcache_for_umount+0x26/0x60
Dec 2 13:55:08 s0 kernel: [268067.928632]  [<ffffffff810e8585>] ? generic_shutdown_super+0x25/0xe0
Dec 2 13:55:08 s0 kernel: [268067.962770]  [<ffffffff810e8667>] ? kill_block_super+0x27/0x80
Dec 2 13:55:08 s0 kernel: [268067.996629]  [<ffffffff810e88ab>] ? deactivate_locked_super+0x4b/0x80
Dec 2 13:55:08 s0 kernel: [268068.030067]  [<ffffffff81103144>] ? SyS_umount+0xa4/0x3a0
Dec 2 13:55:08 s0 kernel: [268068.063367]  [<ffffffff814f5ad2>] ? system_call_fastpath+0x16/0x1b
Dec 2 13:55:08 s0 kernel: [268068.096225] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 38 48 89 34 24 48 c7 c7 c0 c9 5e 81 48 89 de 31 c0 e8 32 1d 3f 00 <0f> 0b 31 d2 eb e5 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 41 57 b8
Dec 2 13:55:08 s0 kernel: [268068.164001] RIP [<ffffffff810fb7fe>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec 2 13:55:08 s0 kernel: [268068.196938] RSP <ffff8803d0ff3ea0>
Dec 2 13:55:08 s0 kernel: [268068.229258] ---[ end trace 6bc50c748677e108 ]---