Bug 65131 - kernel panic (BUG_ON raised) in SCTP function sctp_cmd_interpreter
Summary: kernel panic (BUG_ON raised) in SCTP function sctp_cmd_interpreter
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All
OS: Linux
Importance: P1 blocking
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-18 03:38 UTC by Yuriy
Modified: 2016-02-15 15:05 UTC
CC: 2 users

See Also:
Kernel Version: 3.11.8 custom build, repeated on 3.11.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Screenshot of panic (177.89 KB, image/jpeg)
2013-11-18 03:38 UTC, Yuriy
Screenshot of panic on 3.11.2 (173.56 KB, image/jpeg)
2013-11-18 03:44 UTC, Yuriy

Description Yuriy 2013-11-18 03:38:48 UTC
Created attachment 114991 [details]
Screenshot of panic

Two-node cluster configured using latest corosync (also DRBD 8.4.4, LVM2, and GFS2 but this is unessential).
Steps to reproduce:
1. Start corosync on both nodes.
2. Start dlm_controld (version 4.0.2) on both nodes, using the SCTP protocol since TCP cannot be used on multi-homed hosts. This adds the following lines to kern.log:
    kernel: [  580.428664] sctp: Hash tables configured (established 65536 bind 65536)
    kernel: [  580.441779] DLM installed
3. Start clvmd on either node. This adds the following lines to kern.log:
    kernel: [ 1345.259502] dlm: Using SCTP for communications
    kernel: [ 1345.260699] dlm: clvmd: joining the lockspace group...
    kernel: [ 1345.262962] dlm: clvmd: dlm_recover 1
    kernel: [ 1345.262968] dlm: clvmd: group event done 0 0
    kernel: [ 1345.262992] dlm: clvmd: add member 1024
    kernel: [ 1345.262995] dlm: clvmd: dlm_recover_members 1 nodes
    kernel: [ 1345.262996] dlm: clvmd: join complete
    kernel: [ 1345.262998] dlm: clvmd: generation 1 slots 1 1:1024
    kernel: [ 1345.262999] dlm: clvmd: dlm_recover_directory
    kernel: [ 1345.263000] dlm: clvmd: dlm_recover_directory 0 in 0 new
    kernel: [ 1345.263002] dlm: clvmd: dlm_recover_directory 0 out 0 messages
    kernel: [ 1345.263019] dlm: clvmd: dlm_recover 1 generation 1 done: 0 ms
4. Start clvmd on the second node. With high probability, one or both nodes panic in a similar way. Screenshot in attachment.

The stack trace can differ slightly above the EOI line, but the RIP was always the same. I suppose the CPU code bytes shown correspond to one of the BUG_ON macros inside sctp_cmd_interpreter. So, this is a kernel bug.

This bug currently prevents me from using my cluster at all, since DLM refuses to use TCP on multi-homed hosts.
Comment 1 Yuriy 2013-11-18 03:44:53 UTC
Created attachment 115001 [details]
Screenshot of panic on 3.11.2
Comment 2 Alan 2013-11-18 14:41:02 UTC
Can you get the machine into a high video resolution so you can see all of the panic?

Probably best to send a report to netdev@vger.kernel.org, especially as you've got a reproducible case.
Comment 3 Yuriy 2013-11-19 23:39:25 UTC
I'll try. I've also discovered that kernel 3.10.19 does not have this bug: at least there is no panic, and GFS2 works across the cluster.
Comment 4 Yuriy 2013-12-04 10:50:56 UTC
After finishing the pacemaker configuration I can no longer reproduce that bug. But there is another one during unmounting of GFS2:
Dec  2 13:48:39 s0 kernel: [267678.264167] GFS2: fsid=s0s1024:udata.0: recover generation 8 done
Dec  2 13:55:07 s0 kernel: [268067.030408] BUG: Dentry ffff880409a85a80{i=19a0043,n=111111_13858195261001095222.png} still in use (-1) [unmount of gfs2 dm-0]
Dec  2 13:55:07 s0 kernel: [268067.094800] ------------[ cut here ]------------
Dec  2 13:55:08 s0 kernel: [268067.127117] kernel BUG at fs/dcache.c:917!
Dec  2 13:55:08 s0 kernel: [268067.158837] invalid opcode: 0000 [#1] SMP
Dec  2 13:55:08 s0 kernel: [268067.190065] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink gfs2 dlm sctp configfs sha256_ssse3 sha256_generic dm_mod drbd(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_tcpudp xt_multiport xt_CT nf_conntrack xt_limit iptable_raw iptable_filter ip_tables x_tables psmouse serio_raw i2c_i801 i2c_core evdev button usbhid e1000e ptp ehci_pci ehci_hcd pps_core
Dec  2 13:55:08 s0 kernel: [268067.289233] CPU: 1 PID: 29333 Comm: umount Tainted: G           O 3.11.8 #1
Dec  2 13:55:08 s0 kernel: [268067.322549] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 4.6.4 06/30/2011
Dec  2 13:55:08 s0 kernel: [268067.356042] task: ffff88042d1e4d40 ti: ffff8803d0ff2000 task.ti: ffff8803d0ff2000
Dec  2 13:55:08 s0 kernel: [268067.389389] RIP: 0010:[<ffffffff810fb7fe>]  [<ffffffff810fb7fe>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  2 13:55:08 s0 kernel: [268067.423364] RSP: 0018:ffff8803d0ff3ea0  EFLAGS: 00010292
Dec  2 13:55:08 s0 kernel: [268067.457222] RAX: 0000000000000072 RBX: ffff880409a85a80 RCX: 0000000000000000
Dec  2 13:55:08 s0 kernel: [268067.491051] RDX: ffff88043fc4d938 RSI: ffff88043fc4d098 RDI: ffff88043fc4d098
Dec  2 13:55:08 s0 kernel: [268067.524581] RBP: 0000000000000083 R08: 0000000000000000 R09: 000000000000043d
Dec  2 13:55:08 s0 kernel: [268067.558371] R10: 0000000000000003 R11: 00000000ad55ad55 R12: ffffffffa02730e0
Dec  2 13:55:08 s0 kernel: [268067.591506] R13: ffff8803d44bb000 R14: ffff8803d44bb020 R15: ffff8803d44bb040
Dec  2 13:55:08 s0 kernel: [268067.624131] FS:  00007f811d521840(0000) GS:ffff88043fc40000(0000) knlGS:0000000000000000
Dec  2 13:55:08 s0 kernel: [268067.657315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  2 13:55:08 s0 kernel: [268067.690541] CR2: 00007ff125e87750 CR3: 00000002055e4000 CR4: 00000000000407e0
Dec  2 13:55:08 s0 kernel: [268067.724225] Stack:
Dec  2 13:55:08 s0 kernel: [268067.757698]  ffff88041d1c86f0 ffff88041d1c8400 0000000000000083 ffffffff810fba26
Dec  2 13:55:08 s0 kernel: [268067.792171]  ffff88041d1c8400 ffffffff810e8585 ffff88042ec29d40 0000000000000083
Dec  2 13:55:08 s0 kernel: [268067.826646]  ffff88042d1e4d40 ffffffff810e8667 ffff88041d1c8400 ffffffffa02779e0
Dec  2 13:55:08 s0 kernel: [268067.860776] Call Trace:
Dec  2 13:55:08 s0 kernel: [268067.894397]  [<ffffffff810fba26>] ? shrink_dcache_for_umount+0x26/0x60
Dec  2 13:55:08 s0 kernel: [268067.928632]  [<ffffffff810e8585>] ? generic_shutdown_super+0x25/0xe0
Dec  2 13:55:08 s0 kernel: [268067.962770]  [<ffffffff810e8667>] ? kill_block_super+0x27/0x80
Dec  2 13:55:08 s0 kernel: [268067.996629]  [<ffffffff810e88ab>] ? deactivate_locked_super+0x4b/0x80
Dec  2 13:55:08 s0 kernel: [268068.030067]  [<ffffffff81103144>] ? SyS_umount+0xa4/0x3a0
Dec  2 13:55:08 s0 kernel: [268068.063367]  [<ffffffff814f5ad2>] ? system_call_fastpath+0x16/0x1b
Dec  2 13:55:08 s0 kernel: [268068.096225] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 38 48 89 34 24 48 c7 c7 c0 c9 5e 81 48 89 de 31 c0 e8 32 1d 3f 00 <0f> 0b 31 d2 eb e5 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 41 57 b8
Dec  2 13:55:08 s0 kernel: [268068.164001] RIP  [<ffffffff810fb7fe>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  2 13:55:08 s0 kernel: [268068.196938]  RSP <ffff8803d0ff3ea0>
Dec  2 13:55:08 s0 kernel: [268068.229258] ---[ end trace 6bc50c748677e108 ]---
