Bug 192731 - LVM snapshot hangs during copy, under load
Summary: LVM snapshot hangs during copy, under load
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-16 23:30 UTC by Glenn Enright
Modified: 2017-05-26 11:02 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.4.32
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Glenn Enright 2017-01-16 23:30:22 UTC
This may be similar or the same issue as reported in 119841.

Running a xen 4.6 dom0, on CentOS6 with a 4.4.32 vanilla kernel. 

We have a process that copies an lvm snapshot to a file for backup purposes (using dd). Doesn't appear to be hardware related, as we see it occur on multiple platforms (Dell, Supermicro). Only seems to occur under highish IO load. We have also seem the same behaviour when using cp instead. The tool used to copy off the snapshot doesn't seem to make any difference.

From the time the hang occurs any further lv related commands tend to get stuck in D state also, and eventually the guest running on the source lv (and other guests) stops responding. 

Usually a cold reboot of the host is required to clear things. This is ugly if there are multiple users on the host.

This is a pretty significant issue. At my day job we really like the 4.x series for various reasons but are seriously considering reverting to 3.18 kernels so we are no longer hit by this.

We created a small reporting script to read from /proc/$pid/io. We can try to grab other metrics as well, let us know what would be helpful. eg (as an example from one host) ....

#1 8402 call_rwsem_down_write_failed [blkback.123.xvd]
rchar: 0
wchar: 0
syscr: 0
syscw: 0
read_bytes: 32731325440
write_bytes: 0
cancelled_write_bytes: 0

#2 19849 msleep [kworker/1:2]
rchar: 0
wchar: 0
syscr: 0
syscw: 0
read_bytes: 29716480
write_bytes: 0
cancelled_write_bytes: 0

#3 21170 call_rwsem_down_write_failed dd if=/dev/users/obfuscated.ext3-snapshot-20170114-1507 of=obfuscated.ext3 bs=1M conv=sparse
rchar: 30708600292
wchar: 30582767616
syscr: 29294
syscw: 29166
read_bytes: 30711742464
write_bytes: 30582767616
cancelled_write_bytes: 0

#4 23145 call_rwsem_down_write_failed lvdisplay
rchar: 49687
wchar: 7126
syscr: 121
syscw: 186
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0

I'll try to get some sysrq dumps next time we see this, if anyone is interested.
Comment 1 john 2017-01-17 03:11:32 UTC
duplicate of #119841 ?
Comment 2 Glenn Enright 2017-01-25 22:34:56 UTC
It may well be a duplicate, just did not want to clobber that report. Anyway promised traces below...

I've included details from a 'w' sysrq dump on a recently stuck machine. New processes needing io to lvm end up at the entry_SYSCALL_64_fastpath point. Most useful seems to be the kworker bit at the end?

<6>1 2017-01-25T21:47:16.503278+00:00 host kernel  - - dmeventd        D ffff880059d9ba38     0 21089      1 0x00000000
<4>1 2017-01-25T21:47:16.503283+00:00 host kernel  - -  ffff880059d9ba38 ffffffff818bd500 ffff880048a04580 0000000000000000
<4>1 2017-01-25T21:47:16.503286+00:00 host kernel  - -  ffff88014020a210 0000000000000000 ffff880059d9b9f0 ffffffff810061a9
<4>1 2017-01-25T21:47:16.503288+00:00 host kernel  - -  ffff88014020a210 ffff88014020aa10 ffff88014020a210 ffff88014020aa10
<4>1 2017-01-25T21:47:16.503293+00:00 host kernel  - - Call Trace:
<4>1 2017-01-25T21:47:16.503296+00:00 host kernel  - -  [<ffffffff810061a9>] ? xen_load_sp0+0xc9/0x1d0
<4>1 2017-01-25T21:47:16.503302+00:00 host kernel  - -  [<ffffffff81006cbd>] ? xen_mc_flush+0xad/0x1b0
<4>1 2017-01-25T21:47:16.503305+00:00 host kernel  - -  [<ffffffff8100d605>] ? xen_clocksource_read+0x15/0x20
<4>1 2017-01-25T21:47:16.503308+00:00 host kernel  - -  [<ffffffff810a0c24>] ? finish_task_switch+0xa4/0x240
<4>1 2017-01-25T21:47:16.503311+00:00 host kernel  - -  [<ffffffff81683e70>] schedule+0x40/0x90
<4>1 2017-01-25T21:47:16.503314+00:00 host kernel  - -  [<ffffffff816865fd>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2017-01-25T21:47:16.503316+00:00 host kernel  - -  [<ffffffff816835d6>] ? __schedule+0x306/0xa30
<4>1 2017-01-25T21:47:16.503321+00:00 host kernel  - -  [<ffffffff81340693>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2017-01-25T21:47:16.503325+00:00 host kernel  - -  [<ffffffff81685dd4>] ? down_write+0x24/0x40
<4>1 2017-01-25T21:47:16.503328+00:00 host kernel  - -  [<ffffffffa04a525f>] snapshot_status+0x2f/0x190 [dm_snapshot]
<4>1 2017-01-25T21:47:16.503334+00:00 host kernel  - -  [<ffffffffa00091c3>] retrieve_status+0xb3/0x1d0 [dm_mod]
<4>1 2017-01-25T21:47:16.503339+00:00 host kernel  - -  [<ffffffffa00092e0>] ? retrieve_status+0x1d0/0x1d0 [dm_mod]
<4>1 2017-01-25T21:47:16.503343+00:00 host kernel  - -  [<ffffffffa000933e>] table_status+0x5e/0x90 [dm_mod]
<4>1 2017-01-25T21:47:16.503347+00:00 host kernel  - -  [<ffffffffa000aa33>] ctl_ioctl+0x1d3/0x410 [dm_mod]
<4>1 2017-01-25T21:47:16.503351+00:00 host kernel  - -  [<ffffffffa000ac83>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
<4>1 2017-01-25T21:47:16.503354+00:00 host kernel  - -  [<ffffffff811e34a8>] do_vfs_ioctl+0x88/0x4b0
<4>1 2017-01-25T21:47:16.503360+00:00 host kernel  - -  [<ffffffff811ecddc>] ? __fget_light+0x2c/0x70
<4>1 2017-01-25T21:47:16.503363+00:00 host kernel  - -  [<ffffffff81088425>] ? __set_current_blocked+0x55/0x60
<4>1 2017-01-25T21:47:16.503368+00:00 host kernel  - -  [<ffffffff811e3962>] SyS_ioctl+0x92/0xa0
<4>1 2017-01-25T21:47:16.503372+00:00 host kernel  - -  [<ffffffff81003615>] ? syscall_return_slowpath+0x65/0x70
<4>1 2017-01-25T21:47:16.503374+00:00 host kernel  - -  [<ffffffff8168772e>] entry_SYSCALL_64_fastpath+0x12/0x71

<6>1 2017-01-25T21:47:16.503379+00:00 host kernel  - - dd              D ffff8800334936b8     0 21296  20055 0x00000000
<4>1 2017-01-25T21:47:16.503383+00:00 host kernel  - -  ffff8800334936b8 ffff880135440140 ffff880075ad13c0 ffff880140256380
<4>1 2017-01-25T21:47:16.503386+00:00 host kernel  - -  0000000000000000 ffff8800334935f8 ffffffff810ac6c4 0014918f609acad7
<4>1 2017-01-25T21:47:16.503388+00:00 host kernel  - -  000000000031661d ffff8800334935d8 ffff880140256380 ffff8800191a33c0
<4>1 2017-01-25T21:47:16.503390+00:00 host kernel  - - Call Trace:
<4>1 2017-01-25T21:47:16.503393+00:00 host kernel  - -  [<ffffffff810ac6c4>] ? check_preempt_wakeup+0x194/0x240
<4>1 2017-01-25T21:47:16.503398+00:00 host kernel  - -  [<ffffffff810a1a17>] ? check_preempt_curr+0x87/0x90
<4>1 2017-01-25T21:47:16.503402+00:00 host kernel  - -  [<ffffffff810a1a3d>] ? ttwu_do_wakeup+0x1d/0xf0
<4>1 2017-01-25T21:47:16.503407+00:00 host kernel  - -  [<ffffffff8168741f>] ? _raw_spin_lock_irqsave+0x1f/0x50
<4>1 2017-01-25T21:47:16.503411+00:00 host kernel  - -  [<ffffffff81687146>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2017-01-25T21:47:16.503416+00:00 host kernel  - -  [<ffffffff81683e70>] schedule+0x40/0x90
<4>1 2017-01-25T21:47:16.503420+00:00 host kernel  - -  [<ffffffff816865fd>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2017-01-25T21:47:16.503422+00:00 host kernel  - -  [<ffffffff811b5d32>] ? kmem_cache_alloc+0x72/0x160
<4>1 2017-01-25T21:47:16.503425+00:00 host kernel  - -  [<ffffffff810bac6d>] ? rwsem_wake+0x6d/0xa0
<4>1 2017-01-25T21:47:16.503428+00:00 host kernel  - -  [<ffffffff81340693>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2017-01-25T21:47:16.503431+00:00 host kernel  - -  [<ffffffff81685dd4>] ? down_write+0x24/0x40
<4>1 2017-01-25T21:47:16.503435+00:00 host kernel  - -  [<ffffffffa04a6c20>] snapshot_map+0x90/0x490 [dm_snapshot]
<4>1 2017-01-25T21:47:16.503438+00:00 host kernel  - -  [<ffffffffa00042aa>] __map_bio+0x4a/0x130 [dm_mod]
<4>1 2017-01-25T21:47:16.503442+00:00 host kernel  - -  [<ffffffffa0004897>] __split_and_process_bio+0x327/0x3f0 [dm_mod]
<4>1 2017-01-25T21:47:16.503445+00:00 host kernel  - -  [<ffffffffa00049d4>] dm_make_request+0x74/0xe0 [dm_mod]
<4>1 2017-01-25T21:47:16.503448+00:00 host kernel  - -  [<ffffffff813099af>] generic_make_request+0xff/0x1d0
<4>1 2017-01-25T21:47:16.503451+00:00 host kernel  - -  [<ffffffff81309af0>] submit_bio+0x70/0x140
<4>1 2017-01-25T21:47:16.503454+00:00 host kernel  - -  [<ffffffff8116b84f>] ? __lru_cache_add+0x5f/0x80
<4>1 2017-01-25T21:47:16.503457+00:00 host kernel  - -  [<ffffffff8120f364>] mpage_bio_submit+0x34/0x50
<4>1 2017-01-25T21:47:16.503460+00:00 host kernel  - -  [<ffffffff8120fd2d>] mpage_readpages+0x13d/0x160
<4>1 2017-01-25T21:47:16.503463+00:00 host kernel  - -  [<ffffffff81209b80>] ? I_BDEV+0x20/0x20
<4>1 2017-01-25T21:47:16.503466+00:00 host kernel  - -  [<ffffffff81209b80>] ? I_BDEV+0x20/0x20
<4>1 2017-01-25T21:47:16.503469+00:00 host kernel  - -  [<ffffffff81208692>] ? block_write_end+0x42/0x90
<4>1 2017-01-25T21:47:16.503472+00:00 host kernel  - -  [<ffffffff8115b26b>] ? __page_cache_alloc+0xcb/0x110
<4>1 2017-01-25T21:47:16.503475+00:00 host kernel  - -  [<ffffffff8154efbc>] ? mddev_congested+0x2c/0x40
<4>1 2017-01-25T21:47:16.503533+00:00 host kernel  - -  [<ffffffff8120a46d>] blkdev_readpages+0x1d/0x20
<4>1 2017-01-25T21:47:16.503538+00:00 host kernel  - -  [<ffffffff81169260>] __do_page_cache_readahead+0x1a0/0x240
<4>1 2017-01-25T21:47:16.503541+00:00 host kernel  - -  [<ffffffff8116944d>] ondemand_readahead+0x14d/0x250
<4>1 2017-01-25T21:47:16.503544+00:00 host kernel  - -  [<ffffffff811f9464>] ? inode_congested+0xa4/0x100
<4>1 2017-01-25T21:47:16.503547+00:00 host kernel  - -  [<ffffffff811695c2>] page_cache_async_readahead+0x72/0x80
<4>1 2017-01-25T21:47:16.503550+00:00 host kernel  - -  [<ffffffff8115cb2e>] generic_file_read_iter+0x40e/0x5e0
<4>1 2017-01-25T21:47:16.503553+00:00 host kernel  - -  [<ffffffff8115c721>] ? generic_file_read_iter+0x1/0x5e0
<4>1 2017-01-25T21:47:16.503555+00:00 host kernel  - -  [<ffffffff81209f57>] blkdev_read_iter+0x37/0x40
<4>1 2017-01-25T21:47:16.503558+00:00 host kernel  - -  [<ffffffff811d13dc>] __vfs_read+0xcc/0xf0
<4>1 2017-01-25T21:47:16.503561+00:00 host kernel  - -  [<ffffffff811d168e>] vfs_read+0x8e/0xe0
<4>1 2017-01-25T21:47:16.503564+00:00 host kernel  - -  [<ffffffff811ed262>] ? __fdget_pos+0x12/0x50
<4>1 2017-01-25T21:47:16.503567+00:00 host kernel  - -  [<ffffffff811d1f46>] SyS_read+0x56/0xc0
<4>1 2017-01-25T21:47:16.503570+00:00 host kernel  - -  [<ffffffff811d1d33>] ? SyS_lseek+0x93/0xa0
<4>1 2017-01-25T21:47:16.503573+00:00 host kernel  - -  [<ffffffff8168772e>] entry_SYSCALL_64_fastpath+0x12/0x71

<6>1 2017-01-25T21:47:16.503575+00:00 host kernel  - - kworker/u4:1    D ffff8801039c30e8     0 19641      2 0x00000000
<6>1 2017-01-25T21:47:16.503577+00:00 host kernel  - - Workqueue: writeback wb_workfn (flush-253:61)
<4>1 2017-01-25T21:47:16.503580+00:00 host kernel  - -  ffff8801039c30e8 ffff880135440140 ffff880134e2e440 ffff88013bc01540
<4>1 2017-01-25T21:47:16.503583+00:00 host kernel  - -  ffff88013bc00700 0000000000000000 ffffea0004243c00 ffff88013bc01540
<4>1 2017-01-25T21:47:16.503587+00:00 host kernel  - -  ffff8801090f0000 ffff8801039c3058 ffffffff811b53ab 0000000000000000
<4>1 2017-01-25T21:47:16.503589+00:00 host kernel  - - Call Trace:
<4>1 2017-01-25T21:47:16.503592+00:00 host kernel  - -  [<ffffffff811b53ab>] ? cache_grow+0x15b/0x250
<4>1 2017-01-25T21:47:16.503595+00:00 host kernel  - -  [<ffffffff810061a9>] ? xen_load_sp0+0xc9/0x1d0
<4>1 2017-01-25T21:47:16.503598+00:00 host kernel  - -  [<ffffffff810ac020>] ? update_curr+0x110/0x1a0
<4>1 2017-01-25T21:47:16.503601+00:00 host kernel  - -  [<ffffffff810b01bf>] ? enqueue_entity+0x46f/0x940
<4>1 2017-01-25T21:47:16.503605+00:00 host kernel  - -  [<ffffffff81683e70>] schedule+0x40/0x90
<4>1 2017-01-25T21:47:16.503608+00:00 host kernel  - -  [<ffffffff816865fd>] rwsem_down_write_failed+0x1fd/0x360
<4>1 2017-01-25T21:47:16.503611+00:00 host kernel  - -  [<ffffffff813025ac>] ? bio_alloc_bioset+0xbc/0x1f0
<4>1 2017-01-25T21:47:16.503615+00:00 host kernel  - -  [<ffffffff81340693>] call_rwsem_down_write_failed+0x13/0x20
<4>1 2017-01-25T21:47:16.503618+00:00 host kernel  - -  [<ffffffff81685dd4>] ? down_write+0x24/0x40
<4>1 2017-01-25T21:47:16.503621+00:00 host kernel  - -  [<ffffffffa04a489d>] __origin_write+0x6d/0x2d0 [dm_snapshot]
<4>1 2017-01-25T21:47:16.503624+00:00 host kernel  - -  [<ffffffff8115d3fc>] ? mempool_alloc+0x5c/0x160
<4>1 2017-01-25T21:47:16.503628+00:00 host kernel  - -  [<ffffffffa04a4b66>] do_origin+0x66/0x90 [dm_snapshot]
<4>1 2017-01-25T21:47:16.503631+00:00 host kernel  - -  [<ffffffffa04a4fbf>] origin_map+0x6f/0x90 [dm_snapshot]
<4>1 2017-01-25T21:47:16.503634+00:00 host kernel  - -  [<ffffffffa00042aa>] __map_bio+0x4a/0x130 [dm_mod]
<4>1 2017-01-25T21:47:16.503637+00:00 host kernel  - -  [<ffffffffa0004897>] __split_and_process_bio+0x327/0x3f0 [dm_mod]
<4>1 2017-01-25T21:47:16.503640+00:00 host kernel  - -  [<ffffffffa00049d4>] dm_make_request+0x74/0xe0 [dm_mod]
<4>1 2017-01-25T21:47:16.503643+00:00 host kernel  - -  [<ffffffff813099af>] generic_make_request+0xff/0x1d0
<4>1 2017-01-25T21:47:16.503646+00:00 host kernel  - -  [<ffffffff81092b24>] ? insert_work+0x74/0xc0
<4>1 2017-01-25T21:47:16.503649+00:00 host kernel  - -  [<ffffffff81309af0>] submit_bio+0x70/0x140
<4>1 2017-01-25T21:47:16.503653+00:00 host kernel  - -  [<ffffffff8120755e>] submit_bh_wbc+0x13e/0x180
<4>1 2017-01-25T21:47:16.503656+00:00 host kernel  - -  [<ffffffff8120813d>] __block_write_full_page.clone.0+0x18d/0x310
<4>1 2017-01-25T21:47:16.503659+00:00 host kernel  - -  [<ffffffff81209b80>] ? I_BDEV+0x20/0x20
<4>1 2017-01-25T21:47:16.503661+00:00 host kernel  - -  [<ffffffff81209b80>] ? I_BDEV+0x20/0x20
<4>1 2017-01-25T21:47:16.503664+00:00 host kernel  - -  [<ffffffff812084ad>] block_write_full_page+0xcd/0x120
<4>1 2017-01-25T21:47:16.503668+00:00 host kernel  - -  [<ffffffff8120a4a8>] blkdev_writepage+0x18/0x20
<4>1 2017-01-25T21:47:16.503670+00:00 host kernel  - -  [<ffffffff811654b7>] __writepage+0x17/0x40
<4>1 2017-01-25T21:47:16.503674+00:00 host kernel  - -  [<ffffffff811670ac>] write_cache_pages+0x24c/0x530
<4>1 2017-01-25T21:47:16.503677+00:00 host kernel  - -  [<ffffffff811654a0>] ? set_page_dirty+0x70/0x70
<4>1 2017-01-25T21:47:16.503680+00:00 host kernel  - -  [<ffffffff811673e1>] generic_writepages+0x51/0x80
<4>1 2017-01-25T21:47:16.503683+00:00 host kernel  - -  [<ffffffff81315543>] ? __blk_mq_run_hw_queue+0x293/0x3b0
<4>1 2017-01-25T21:47:16.503685+00:00 host kernel  - -  [<ffffffff81167430>] do_writepages+0x20/0x30
<4>1 2017-01-25T21:47:16.503688+00:00 host kernel  - -  [<ffffffff811fab0f>] __writeback_single_inode+0x4f/0x3b0
<4>1 2017-01-25T21:47:16.503691+00:00 host kernel  - -  [<ffffffff81165bff>] ? __wb_calc_thresh+0x2f/0x140
<4>1 2017-01-25T21:47:16.503693+00:00 host kernel  - -  [<ffffffff8131571c>] ? blk_mq_run_hw_queue+0x6c/0x80
<4>1 2017-01-25T21:47:16.503696+00:00 host kernel  - -  [<ffffffff811feb12>] ? wbc_attach_and_unlock_inode+0x122/0x140
<4>1 2017-01-25T21:47:16.503699+00:00 host kernel  - -  [<ffffffff811ff07e>] writeback_sb_inodes+0x2be/0x4a0
<4>1 2017-01-25T21:47:16.503702+00:00 host kernel  - -  [<ffffffff811669fc>] ? wb_over_bg_thresh+0x15c/0x260
<4>1 2017-01-25T21:47:16.503704+00:00 host kernel  - -  [<ffffffff811ff2ec>] __writeback_inodes_wb+0x8c/0xc0
<4>1 2017-01-25T21:47:16.503707+00:00 host kernel  - -  [<ffffffff811ff614>] wb_writeback+0x244/0x3b0
<4>1 2017-01-25T21:47:16.503710+00:00 host kernel  - -  [<ffffffff81346e2b>] ? find_next_bit+0xb/0x10
<4>1 2017-01-25T21:47:16.503713+00:00 host kernel  - -  [<ffffffff811f9210>] ? get_nr_dirty_pages+0x30/0x60
<4>1 2017-01-25T21:47:16.503716+00:00 host kernel  - -  [<ffffffff811ff923>] wb_do_writeback+0x1a3/0x280
<4>1 2017-01-25T21:47:16.503719+00:00 host kernel  - -  [<ffffffff811ffa7d>] wb_workfn+0x7d/0x1e0
<4>1 2017-01-25T21:47:16.503721+00:00 host kernel  - -  [<ffffffff816835d6>] ? __schedule+0x306/0xa30
<4>1 2017-01-25T21:47:16.503724+00:00 host kernel  - -  [<ffffffff8109381a>] ? queue_delayed_work_on+0x3a/0x50
<4>1 2017-01-25T21:47:16.503727+00:00 host kernel  - -  [<ffffffff81093d15>] process_one_work+0x165/0x500
<4>1 2017-01-25T21:47:16.503730+00:00 host kernel  - -  [<ffffffff81687146>] ? _raw_spin_unlock_irqrestore+0x16/0x20
<4>1 2017-01-25T21:47:16.503732+00:00 host kernel  - -  [<ffffffff810da515>] ? mod_timer+0x105/0x1d0
<4>1 2017-01-25T21:47:16.503734+00:00 host kernel  - -  [<ffffffff81094ae3>] worker_thread+0x133/0x610
<4>1 2017-01-25T21:47:16.503738+00:00 host kernel  - -  [<ffffffff810a2f32>] ? default_wake_function+0x12/0x20
<4>1 2017-01-25T21:47:16.503741+00:00 host kernel  - -  [<ffffffff810b66a6>] ? __wake_up_common+0x56/0x90
<4>1 2017-01-25T21:47:16.503744+00:00 host kernel  - -  [<ffffffff810949b0>] ? create_worker+0x1e0/0x1e0
<4>1 2017-01-25T21:47:16.503747+00:00 host kernel  - -  [<ffffffff810949b0>] ? create_worker+0x1e0/0x1e0
<4>1 2017-01-25T21:47:16.503750+00:00 host kernel  - -  [<ffffffff810994fc>] kthread+0xcc/0xf0
<4>1 2017-01-25T21:47:16.503753+00:00 host kernel  - -  [<ffffffff810a1c8e>] ? schedule_tail+0x1e/0xc0
<4>1 2017-01-25T21:47:16.503807+00:00 host kernel  - -  [<ffffffff81099430>] ? kthread_freezable_should_stop+0x70/0x70
<4>1 2017-01-25T21:47:16.503812+00:00 host kernel  - -  [<ffffffff81687a8f>] ret_from_fork+0x3f/0x70
<4>1 2017-01-25T21:47:16.503815+00:00 host kernel  - -  [<ffffffff81099430>] ? kthread_freezable_should_stop+0x70/0x70
Comment 3 Tomas Mozes 2017-05-26 11:02:31 UTC
Is it fixed for you in 4.4.55?

Note You need to log in before you can comment on or make changes to this bug.