Kernel version: v3.0-7347-g288d5ab (most recent commit was 288d5abec8314ae50fe6692f324b0444acae8486 with date August 3, 2011). While syncing a DRBD volume (on top of an LVM logical volume) for the first time the following complaint appeared in the kernel log: block drbd0: Implicitly set pdsk from Inconsistent to UpToDate block drbd0: conn( Connected -> VerifyS ) disk( Inconsistent -> UpToDate ) pdsk( Inconsistent -> UpToDate ) block drbd0: Starting Online Verify from sector 0 ========================================================= [ INFO: possible irq lock inversion dependency detected ] 3.0.0-scst-debug+ #6 --------------------------------------------------------- drbd0_receiver/5871 just changed the state of lock: (&(&mdev->req_lock)->rlock){..-...}, at: [<ffffffffa023e072>] drbd_endio_read_sec_final+0x42/0x330 [drbd] but this lock took another, SOFTIRQ-unsafe lock in the past: (&(&mdev->peer_seq_lock)->rlock){+.+...} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&mdev->peer_seq_lock)->rlock); local_irq_disable(); lock(&(&mdev->req_lock)->rlock); lock(&(&mdev->peer_seq_lock)->rlock); <Interrupt> lock(&(&mdev->req_lock)->rlock); *** DEADLOCK *** no locks held by drbd0_receiver/5871. the shortest dependencies between 2nd lock and 1st lock: -> (&(&mdev->peer_seq_lock)->rlock){+.+...} ops: 3 { HARDIRQ-ON-W at: [<ffffffff81088b14>] __lock_acquire+0x754/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff8140438b>] _raw_spin_lock+0x3b/0x50 [<ffffffffa0263604>] drbd_nl_syncer_conf+0x524/0xbe0 [drbd] [<ffffffffa02689e9>] drbd_connector_callback+0x139/0x750 [drbd] [<ffffffff81298a7b>] cn_rx_skb+0xdb/0x100 [<ffffffff81357603>] netlink_unicast+0x2a3/0x2f0 [<ffffffff8135787b>] netlink_sendmsg+0x22b/0x360 [<ffffffff8131f971>] sock_sendmsg+0xe1/0x120 [<ffffffff813227bd>] sys_sendto+0x10d/0x140 [<ffffffff814059ab>] system_call_fastpath+0x16/0x1b SOFTIRQ-ON-W at: [<ffffffff81088b4d>] __lock_acquire+0x78d/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff8140438b>] _raw_spin_lock+0x3b/0x50 [<ffffffffa0263604>] drbd_nl_syncer_conf+0x524/0xbe0 [drbd] [<ffffffffa02689e9>] drbd_connector_callback+0x139/0x750 [drbd] [<ffffffff81298a7b>] cn_rx_skb+0xdb/0x100 [<ffffffff81357603>] netlink_unicast+0x2a3/0x2f0 [<ffffffff8135787b>] netlink_sendmsg+0x22b/0x360 [<ffffffff8131f971>] sock_sendmsg+0xe1/0x120 [<ffffffff813227bd>] sys_sendto+0x10d/0x140 [<ffffffff814059ab>] system_call_fastpath+0x16/0x1b INITIAL USE at: [<ffffffff810887a4>] __lock_acquire+0x3e4/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff8140438b>] _raw_spin_lock+0x3b/0x50 [<ffffffffa0263604>] drbd_nl_syncer_conf+0x524/0xbe0 [drbd] [<ffffffffa02689e9>] drbd_connector_callback+0x139/0x750 [drbd] [<ffffffff81298a7b>] cn_rx_skb+0xdb/0x100 [<ffffffff81357603>] netlink_unicast+0x2a3/0x2f0 [<ffffffff8135787b>] netlink_sendmsg+0x22b/0x360 [<ffffffff8131f971>] sock_sendmsg+0xe1/0x120 [<ffffffff813227bd>] sys_sendto+0x10d/0x140 [<ffffffff814059ab>] system_call_fastpath+0x16/0x1b } ... key at: [<ffffffffa0275a5c>] __key.46241+0x0/0xffffffffffff35a4 [drbd] ... acquired at: [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff814044a6>] _raw_spin_lock_irqsave+0x46/0x60 [<ffffffff81048429>] try_to_wake_up+0x39/0x300 [<ffffffff81048702>] default_wake_function+0x12/0x20 [<ffffffff810721cd>] autoremove_wake_function+0x1d/0x50 [<ffffffff81034b49>] __wake_up_common+0x59/0x90 [<ffffffff81037d98>] __wake_up+0x48/0x70 [<ffffffffa02421b2>] got_RqSReply+0x42/0x90 [drbd] [<ffffffffa024d957>] drbd_asender+0x3b7/0x5f0 [drbd] [<ffffffffa0255262>] drbd_thread_setup+0x72/0x100 [drbd] [<ffffffff81071c06>] kthread+0x96/0xa0 [<ffffffff81407ad4>] kernel_thread_helper+0x4/0x10 -> (&(&mdev->req_lock)->rlock){..-...} ops: 145 { IN-SOFTIRQ-W at: [<ffffffff81088ad5>] __lock_acquire+0x715/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff814044a6>] _raw_spin_lock_irqsave+0x46/0x60 [<ffffffffa023e072>] drbd_endio_read_sec_final+0x42/0x330 [drbd] [<ffffffffa023e3d9>] drbd_endio_sec+0x79/0x4d0 [drbd] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffffa02232c4>] dec_pending+0x1f4/0x3c0 [dm_mod] [<ffffffffa0223533>] clone_endio+0xa3/0xe0 [dm_mod] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffff811c147b>] req_bio_endio.clone.41+0xab/0xf0 [<ffffffff811c3640>] blk_update_request+0x100/0x570 [<ffffffff811c3ae4>] blk_update_bidi_request+0x34/0xa0 [<ffffffff811c3ccf>] blk_end_bidi_request+0x2f/0x80 [<ffffffff811c3d60>] blk_end_request+0x10/0x20 [<ffffffff812b3f7f>] scsi_io_completion+0xaf/0x630 [<ffffffff812aace8>] scsi_finish_command+0xc8/0x130 [<ffffffff812b3dcf>] scsi_softirq_done+0x13f/0x160 [<ffffffff811ca4a0>] blk_done_softirq+0x80/0xa0 [<ffffffff81055839>] __do_softirq+0xc9/0x230 [<ffffffff81407bcc>] call_softirq+0x1c/0x30 [<ffffffff81004c0d>] do_softirq+0x8d/0xc0 [<ffffffff81055d0e>] irq_exit+0x9e/0xc0 [<ffffffff81408176>] do_IRQ+0x66/0xe0 [<ffffffff81405070>] ret_from_intr+0x0/0x13 [<ffffffff812d945c>] mptspi_qcmd+0x8c/0xd0 [<ffffffff812ac05e>] scsi_dispatch_cmd+0x13e/0x340 [<ffffffff812b30e7>] scsi_request_fn+0x397/0x490 [<ffffffff811c545f>] __make_request+0x32f/0x350 [<ffffffff811c2ea9>] generic_make_request+0x1f9/0x5c0 [<ffffffffa024943a>] drbd_submit_ee+0x13a/0x230 [drbd] [<ffffffffa024ab96>] receive_DataRequest+0x4b6/0x850 [drbd] [<ffffffffa024d366>] drbdd_init+0xf16/0x1150 [drbd] [<ffffffffa0255262>] drbd_thread_setup+0x72/0x100 [drbd] [<ffffffff81071c06>] kthread+0x96/0xa0 [<ffffffff81407ad4>] kernel_thread_helper+0x4/0x10 INITIAL USE at: [<ffffffff810887a4>] __lock_acquire+0x3e4/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff814044a6>] _raw_spin_lock_irqsave+0x46/0x60 [<ffffffffa025d554>] drbd_req_state+0x74/0x4a0 [drbd] [<ffffffffa025d9b2>] _drbd_request_state+0x32/0xe0 [drbd] [<ffffffffa026674c>] drbd_nl_disk_conf+0x72c/0x11c0 [drbd] [<ffffffffa02689e9>] drbd_connector_callback+0x139/0x750 [drbd] [<ffffffff81298a7b>] cn_rx_skb+0xdb/0x100 [<ffffffff81357603>] netlink_unicast+0x2a3/0x2f0 [<ffffffff8135787b>] netlink_sendmsg+0x22b/0x360 [<ffffffff8131f971>] sock_sendmsg+0xe1/0x120 [<ffffffff813227bd>] sys_sendto+0x10d/0x140 [<ffffffff814059ab>] system_call_fastpath+0x16/0x1b } ... key at: [<ffffffffa0275a64>] __key.46240+0x0/0xffffffffffff359c [drbd] ... acquired at: [<ffffffff81087564>] check_usage_forwards+0x94/0x110 [<ffffffff8108816b>] mark_lock+0x19b/0x3f0 [<ffffffff81088ad5>] __lock_acquire+0x715/0x1d70 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffff814044a6>] _raw_spin_lock_irqsave+0x46/0x60 [<ffffffffa023e072>] drbd_endio_read_sec_final+0x42/0x330 [drbd] [<ffffffffa023e3d9>] drbd_endio_sec+0x79/0x4d0 [drbd] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffffa02232c4>] dec_pending+0x1f4/0x3c0 [dm_mod] [<ffffffffa0223533>] clone_endio+0xa3/0xe0 [dm_mod] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffff811c147b>] req_bio_endio.clone.41+0xab/0xf0 [<ffffffff811c3640>] blk_update_request+0x100/0x570 [<ffffffff811c3ae4>] blk_update_bidi_request+0x34/0xa0 [<ffffffff811c3ccf>] blk_end_bidi_request+0x2f/0x80 [<ffffffff811c3d60>] blk_end_request+0x10/0x20 [<ffffffff812b3f7f>] scsi_io_completion+0xaf/0x630 [<ffffffff812aace8>] scsi_finish_command+0xc8/0x130 [<ffffffff812b3dcf>] scsi_softirq_done+0x13f/0x160 [<ffffffff811ca4a0>] blk_done_softirq+0x80/0xa0 [<ffffffff81055839>] __do_softirq+0xc9/0x230 [<ffffffff81407bcc>] call_softirq+0x1c/0x30 [<ffffffff81004c0d>] do_softirq+0x8d/0xc0 [<ffffffff81055d0e>] irq_exit+0x9e/0xc0 [<ffffffff81408176>] do_IRQ+0x66/0xe0 [<ffffffff81405070>] ret_from_intr+0x0/0x13 [<ffffffff812d945c>] mptspi_qcmd+0x8c/0xd0 [<ffffffff812ac05e>] scsi_dispatch_cmd+0x13e/0x340 [<ffffffff812b30e7>] scsi_request_fn+0x397/0x490 [<ffffffff811c545f>] __make_request+0x32f/0x350 [<ffffffff811c2ea9>] generic_make_request+0x1f9/0x5c0 [<ffffffffa024943a>] drbd_submit_ee+0x13a/0x230 [drbd] [<ffffffffa024ab96>] receive_DataRequest+0x4b6/0x850 [drbd] [<ffffffffa024d366>] drbdd_init+0xf16/0x1150 [drbd] [<ffffffffa0255262>] drbd_thread_setup+0x72/0x100 [drbd] [<ffffffff81071c06>] kthread+0x96/0xa0 [<ffffffff81407ad4>] kernel_thread_helper+0x4/0x10 stack backtrace: Pid: 5871, comm: drbd0_receiver Not tainted 3.0.0-scst-debug+ #6 Call Trace: <IRQ> [<ffffffff81087486>] print_irq_inversion_bug+0x1b6/0x200 [<ffffffff810874d0>] ? print_irq_inversion_bug+0x200/0x200 [<ffffffff81087564>] check_usage_forwards+0x94/0x110 [<ffffffff8108816b>] mark_lock+0x19b/0x3f0 [<ffffffff81404d25>] ? _raw_spin_unlock_irqrestore+0x65/0x80 [<ffffffff81088ad5>] __lock_acquire+0x715/0x1d70 [<ffffffff81125b97>] ? __slab_free+0x377/0x440 [<ffffffff8112500a>] ? free_debug_processing+0x1ca/0x350 [<ffffffff810e4f00>] ? mempool_free_pages+0x10/0x10 [<ffffffff8108b1be>] ? trace_hardirqs_on_caller+0xde/0x1a0 [<ffffffff8108a725>] lock_acquire+0x85/0x130 [<ffffffffa023e072>] ? drbd_endio_read_sec_final+0x42/0x330 [drbd] [<ffffffff8140447c>] ? _raw_spin_lock_irqsave+0x1c/0x60 [<ffffffff814044a6>] _raw_spin_lock_irqsave+0x46/0x60 [<ffffffffa023e072>] ? drbd_endio_read_sec_final+0x42/0x330 [drbd] [<ffffffffa023e072>] drbd_endio_read_sec_final+0x42/0x330 [drbd] [<ffffffffa023e3d9>] drbd_endio_sec+0x79/0x4d0 [drbd] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffffa02232c4>] dec_pending+0x1f4/0x3c0 [dm_mod] [<ffffffffa022311c>] ? dec_pending+0x4c/0x3c0 [dm_mod] [<ffffffffa0223533>] clone_endio+0xa3/0xe0 [dm_mod] [<ffffffff811611bd>] bio_endio+0x1d/0x40 [<ffffffff811c147b>] req_bio_endio.clone.41+0xab/0xf0 [<ffffffff811c3640>] blk_update_request+0x100/0x570 [<ffffffff811c38c2>] ? blk_update_request+0x382/0x570 [<ffffffff811c3ae4>] blk_update_bidi_request+0x34/0xa0 [<ffffffff811c3ccf>] blk_end_bidi_request+0x2f/0x80 [<ffffffff811c3d60>] blk_end_request+0x10/0x20 [<ffffffff812b3f7f>] scsi_io_completion+0xaf/0x630 [<ffffffff812aace8>] scsi_finish_command+0xc8/0x130 [<ffffffff812b3dcf>] scsi_softirq_done+0x13f/0x160 [<ffffffff811ca4a0>] blk_done_softirq+0x80/0xa0 [<ffffffff81055839>] __do_softirq+0xc9/0x230 [<ffffffff81407bcc>] call_softirq+0x1c/0x30 [<ffffffff81004c0d>] do_softirq+0x8d/0xc0 [<ffffffff81055d0e>] irq_exit+0x9e/0xc0 [<ffffffff81408176>] do_IRQ+0x66/0xe0 [<ffffffff812ab860>] ? perf_trace_scsi_cmd_done_timeout_template+0x1b0/0x1b0 [<ffffffff81405070>] common_interrupt+0x70/0x70 <EOI> [<ffffffff81404d27>] ? _raw_spin_unlock_irqrestore+0x67/0x80 [<ffffffff812d945c>] mptspi_qcmd+0x8c/0xd0 [<ffffffff812ac05e>] scsi_dispatch_cmd+0x13e/0x340 [<ffffffff812b30e7>] scsi_request_fn+0x397/0x490 [<ffffffff811c545f>] __make_request+0x32f/0x350 [<ffffffffa0224193>] ? dm_request+0x33/0x280 [dm_mod] [<ffffffff811c2ea9>] generic_make_request+0x1f9/0x5c0 [<ffffffff81161a64>] ? __bio_add_page+0x114/0x220 [<ffffffffa024943a>] drbd_submit_ee+0x13a/0x230 [drbd] [<ffffffff81043609>] ? sub_preempt_count+0xa9/0xe0 [<ffffffffa024ab96>] receive_DataRequest+0x4b6/0x850 [drbd] [<ffffffffa024d366>] drbdd_init+0xf16/0x1150 [drbd] [<ffffffffa02551f0>] ? we_should_drop_the_connection+0xd0/0xd0 [drbd] [<ffffffffa0255262>] drbd_thread_setup+0x72/0x100 [drbd] [<ffffffffa02551f0>] ? we_should_drop_the_connection+0xd0/0xd0 [drbd] [<ffffffff81071c06>] kthread+0x96/0xa0 [<ffffffff81407ad4>] kernel_thread_helper+0x4/0x10 [<ffffffff81038383>] ? finish_task_switch+0x83/0xe0 [<ffffffff81404c9b>] ? _raw_spin_unlock_irq+0x3b/0x60 [<ffffffff81405121>] ? retint_restore_args+0xe/0xe [<ffffffff81071b70>] ? __init_kthread_worker+0x70/0x70 [<ffffffff81407ad0>] ? gs_change+0xb/0xb block drbd0: Out of sync: start=8, size=712 (sectors)
A similar complaint was logged in the kernel log of the second system (B). Synchronization stalled after some time. The contents of /proc/drbd on system A stalled at this state: 0: cs:VerifyS ro:Secondary/Secondary ds:UpToDate/UpToDate C r----- ns:0 nr:0 dw:0 dr:414660 al:0 bm:0 lo:1 pe:9610 ua:2048 ap:0 ep:1 wo:f oos:2092956 [==>.................] verified: 19.6% (1686488/2092956)K finish: 4:50:49 speed: 96 (180) want: 40,960 K/sec
A previous and similar report of this issue can be found here: http://lists.linbit.com/pipermail/drbd-user/2011-February/015725.html.
hi Bart, We are aware of this issue. We have a fix for that in the drbd-8.4 code. We expect that 8.4 will be merged in the 3.2 merge-window. You quick way to a solution is to use the out of tree code. Best, Phil
(In reply to comment #3) > We are aware of this issue. We have a fix for that in the drbd-8.4 > code. We expect that 8.4 will be merged in the 3.2 merge-window. > > You quick way to a solution is to use the out of tree code. Thanks for the quick reply. If you can tell me the commit ID(s) of the patches that fix this issue in the DRBD git repo, I'll try to port these myself to Linus' kernel tree and retest.
We moved to rcu lockless updates of the protected object. This would be the relevant commit: http://git.drbd.org/?p=linux-2.6-drbd.git;a=commitdiff;h=1dedcbdda6b89a1b82650b4c65b8ee8b506b2ef0