Bug 48851
| Summary: | Unable to access data on dm-thin device after machine reset | | |
| --- | --- | --- | --- |
| Product: | IO/Storage | Reporter: | Kurk (kurk) |
| Component: | LVM2/DM | Assignee: | Alasdair G Kergon (agk) |
| Status: | CLOSED INVALID | | |
| Severity: | low | CC: | agk, kurk, snitzer |
| Priority: | P5 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.4.10 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg after echo t > /proc/sysrq-trigger; Dump of info from proc (see long comment) | | |
Description
Kurk
2012-10-15 14:53:00 UTC
(In reply to comment #0)
> This is a bug report for reproducible complete data loss with dm-thin.
>
> Please note that this bug (I believe it's the same) happened to me even in
> modes different from below (such as: no loop-mounted file), however this is
> the only way I found to replicate it reliably and in a short time.

OK, can you elaborate on the setup you had that prompted you to try to
reproduce using this questionable configuration?

> Kernel is 3.4.10 vanilla
> This was on a KVM virtual machine with 2 vCPU but I hope it's reproducible
> even on hardware hosts. I cannot check right now.

Any reason you're using 3.4? Various dm-thinp improvements have been made in
subsequent releases. And what version of lvm2 are you using?

Steps will be taken to reproduce with your sequence using the latest kernel
(v3.6) and latest lvm2 (v2_02_98).

> root@host1:~# dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.0003177 s, 0.0 kB/s
> [the bug won't reproduce reliably if the backend file is preallocated or
> too small. Host filesystem is ext4]
>
> root@host1:~# losetup /dev/loop0 ttest10backend

I say "questionable configuration" above because using a sparse loopback file
says to me there is potential for dm-thin's underlying storage to be
unreliable. What mount options were you using for ext4?

> root@host1:~# pvcreate /dev/loop0
>   Writing physical volume data to disk "/dev/loop0"
>   Physical volume "/dev/loop0" successfully created
>
> root@host1:~# vgcreate vgt10 /dev/loop0
>   Volume group "vgt10" successfully created
>
> root@host1:~# lvcreate --size 900M --thinpool vgt10/pool10
>   Rounding up size to full physical extent 4.00 MiB
>   Logical volume "pool10" created
>
> root@host1:~# lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
>   Logical volume "lvt10_1" created
>
> root@host1:~# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M
> [hangs in a few seconds and write speed drops to 40K/sec seen from iotop.
> Uninterruptible. Needs hard reset.]

Did you see anything in the kernel log that indicated there was a problem? You
have an open-ended dd there, so eventually you'll run out of space (once you
write 400M to fill the thin volume).

If there isn't anything useful in the kernel log then we'll need traces of the
relevant tasks (e.g. the dd task) via: echo t > /proc/sysrq-trigger

> [after hard reset:]

We need to understand what was going wrong with the uninterruptible dd. Again,
please check your kernel log.

> root@host1:~# losetup /dev/loop0 ttest10backend
> root@host1:~# lvs
>   LV      VG    Attr     LSize   Pool   Origin Data% Move Log Copy% Convert
>   ...other nonthin vols... -wi-a--- 60.00g
>   lvt10_1 vgt10 Vwi---tz 400.00m pool10
>   pool10  vgt10 twi---tz 900.00m
> root@host1:~# vgs
>   VG           #PV #LV #SN Attr   VSize   VFree
>   anothergroup   1   1   0 wz--n-  87.64g 27.64g
>   vgt10          1   3   0 wz--n- 996.00m 92.00m
> root@host1:~# vgchange -ay vgt10
>   Check of thin pool vgt10/pool10 failed (status:1). Manual repair required
>   (thin_dump --repair /dev/mapper/vgt10-pool10_tmeta)!
>   0 logical volume(s) in volume group "vgt10" now active

Again, we need to know the version of lvm2. There may have been a bug in
lvm2's interface with the thin_check utility.

> ...complete data loss! Apparently irrecoverable. Whole thin pool is affected.

While I appreciate the bug report I'm having a hard time with your leap to an
unqualified "complete data loss!". Regardless, you have our attention and
we'll work through it.
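A minimal sketch of capturing the task dump requested above, assuming sysrq is enabled in the guest (the output file name is illustrative, not from this report):

# Dump the state of every task to the kernel log and save the ring buffer.
echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
echo t > /proc/sysrq-trigger         # write all task stack traces to the kernel log
dmesg > sysrq-t-dump.txt             # save the ring buffer; a larger log_buf_len= boot parameter may be needed to avoid truncation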
(In reply to comment #1)
> (In reply to comment #0)
> > root@host1:~# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M
> > [hangs in a few seconds and write speed drops to 40K/sec seen from iotop.
> > Uninterruptible. Needs hard reset.]
>
> Did you see anything in the kernel log that indicated there was a problem?
> You have an open-ended dd there, so eventually you'll run out of space (once
> you write 400M to fill the thin volume).

I cannot reproduce an uninterruptible hang with dd (and so I cannot reproduce
this bug); I get what I expected:

# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M
dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device
401+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 10.4799 s, 40.0 MB/s

I'd appreciate it if you could please answer the various questions I had in
comment #1.

Created attachment 83611 [details]
dmesg after echo t > /proc/sysrq-trigger
Hello Mike,

I still can reproduce the bug on 3.6.2; what is the kernel you tried?

> OK, can you elaborate on the setup you had that prompted you to try to
> reproduce using this questionable configuration?

I had another nonstandard configuration: KVM virtual disk partition --> nonthin
PV/VG/LV --> using one LV as PV for another VG --> LV as thin pool --> thin LVs.
I was using this config because I didn't trust thin yet: I thought that if data
loss happened I risked losing nonthin data as well, so I thought layering it
inside a nonthin LV was safer. And also because vgcfgbackup/restore do not work
if there is at least one thin pool, so even now layering is needed for safety.

It took a few weeks of work (not stress test) to see the first full data loss in
real life. After that, trying to reproduce the data loss in a zillion ways, one
attempt was with a loop device. That was found to be reliable, so I reported the
bug with that. I hope it's the same bug.

> Any reason you're using 3.4? Various dm-thinp improvements have been made in
> subsequent releases.

Yes, because it is the most stable kernel (aka oldest) that contains the feature
I need, which is thin discard. In fact at first I thought it was a problem of
discard, but then I was able to reproduce the data loss without the discard.
Dm-thin "improvements" in terms of new features surely go to 3.5 and 3.6, but I
assumed bugfixes would be backported to 3.4, so I didn't expect 3.4 to hang and
lose data more than 3.5 or 3.6; I expected just fewer features and less
performance, but more stability. Is that a wrong assumption? I have now tried
3.6.2 and the bug is still there.

> And what version of lvm2 are you using?

It is
  LVM version:     2.02.95(2) (2012-03-06)
  Library version: 1.02.74 (2012-03-06)
from Debian unstable repositories. Do you think LVM can be the problem? It looks
like a kernel problem to me. I have serious problems upgrading this package
because standard lvm compiled from source doesn't work well with the
Debian/Ubuntu distro (some problems in volume activation AFAIR, maybe related to
udev, which I have not tracked down yet), while trying to merge the new sources
into the Debian unstable source package results in compile errors. If possible I
would like to try to reproduce the bug without upgrading the LVM package.
Note that thin-provisioning-tools (in particular thin_check) is compiled from
source git and updated as of today.

> Steps will be taken to reproduce with your sequence using the latest kernel
> (v3.6) and latest lvm2 (v2_02_98).
> I say "questionable configuration" above because using a sparse loopback file
> says to me there is potential for dm-thin's underlying storage to be
> unreliable.

What do you mean by unreliable? Sure, I have enough space in the FS for the
backing file to be completely written out (that's just 1GB). It's just that if I
write it out before the test, the bug does not trigger easily. It's probably a
race condition with some other parts of the kernel. It definitely was not a case
of out-of-space in the volume holding the sparse backing file, if this was what
you meant to say.

> What mount options were you using for ext4?

rw,relatime,errors=remount-ro,stripe=1024,data=ordered 0 0

It is on a partition of the virtual disk of KVM.
At the host side this virtual disk it is backed again to a sparse file (and again there is enough space to fully write it out), sorry for that, that could indeed contribute to causing the race condition (many in-flight commands waiting for completion maybe). This sparse backend file for KVM never gave problems for other filesystems, logical volumes etc. The sparse file at host side is on ext4, the mount options are: rw,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered 0 0 . This host filesystem is on a full MD raid1 without LVM, the raid1 is on the partitions of two HDDs. Also please note that the virtualization-host machine was never rebooted, so even if e.g. barriers were not working (which they do AFAICS), the host and page cache stay up so there would be still no data loss expected from that side. > Did you see anything in the kernel log that indicated there was a problem? Just hung tasks: [ 10.016484] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 10.395259] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 10.758234] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 11.128985] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 11.508384] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 11.877132] Rounding down aligned max_sectors from 4294967295 to 8388600 [ 360.724075] INFO: task kworker/u:0:6 blocked for more than 120 seconds. [ 360.726236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 360.728875] kworker/u:0 D 0000000000000001 0 6 2 0x00000000 [ 360.728881] ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0 [ 360.728885] ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0 [ 360.728889] ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660 [ 360.728893] Call Trace: [ 360.728905] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 360.728956] [<ffffffff8169e129>] schedule+0x29/0x70 [ 360.728973] [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0 [ 360.728980] [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio] [ 360.728986] [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90 [ 360.728990] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 360.728994] [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90 [ 360.729022] [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40 [ 360.729029] [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio] [ 360.729036] [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data] [ 360.729052] [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data] [ 360.729060] [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool] [ 360.729065] [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool] [ 360.729070] [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool] [ 360.729074] [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool] [ 360.729084] [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool] [ 360.729089] [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool] [ 360.729094] [<ffffffff81076176>] process_one_work+0x136/0x550 [ 360.729103] [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool] [ 360.729107] [<ffffffff81077375>] worker_thread+0x165/0x3c0 [ 360.729114] [<ffffffff81077210>] ? manage_workers+0x190/0x190 [ 360.729118] [<ffffffff8107c3a3>] kthread+0x93/0xa0 [ 360.729134] [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10 [ 360.729139] [<ffffffff8107c310>] ? 
flush_kthread_worker+0xb0/0xb0 [ 360.729142] [<ffffffff816a8940>] ? gs_change+0x13/0x13 [ 480.728077] INFO: task kworker/u:0:6 blocked for more than 120 seconds. [ 480.730176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 480.732832] kworker/u:0 D 0000000000000001 0 6 2 0x00000000 [ 480.732838] ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0 [ 480.732842] ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0 [ 480.732846] ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660 [ 480.732850] Call Trace: [ 480.732861] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 480.732868] [<ffffffff8169e129>] schedule+0x29/0x70 [ 480.732871] [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0 [ 480.732875] [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio] [ 480.732883] [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90 [ 480.732894] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 480.732900] [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90 [ 480.732906] [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40 [ 480.732911] [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio] [ 480.732918] [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data] [ 480.732923] [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data] [ 480.732929] [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool] [ 480.732934] [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool] [ 480.732938] [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool] [ 480.732942] [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool] [ 480.732946] [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool] [ 480.732950] [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool] [ 480.732955] [<ffffffff81076176>] process_one_work+0x136/0x550 [ 480.732959] [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool] [ 480.732963] [<ffffffff81077375>] worker_thread+0x165/0x3c0 [ 480.732973] [<ffffffff81077210>] ? manage_workers+0x190/0x190 [ 480.732983] [<ffffffff8107c3a3>] kthread+0x93/0xa0 [ 480.732988] [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10 [ 480.732991] [<ffffffff8107c310>] ? flush_kthread_worker+0xb0/0xb0 [ 480.732994] [<ffffffff816a8940>] ? gs_change+0x13/0x13 [ 480.733030] INFO: task dd:2808 blocked for more than 120 seconds. [ 480.735085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 480.737749] dd D ffffffff8180cd00 0 2808 2541 0x00000006 [ 480.737753] ffff8800795df9f8 0000000000000046 0000000000006446 ffffffff81c20ac0 [ 480.737758] ffff8800795dffd8 ffff8800795dffd8 ffff8800795dffd8 0000000000013dc0 [ 480.737761] ffffffff81c13440 ffff88007bbbc470 ffff8800795df9f8 ffff88007fc14660 [ 480.737765] Call Trace: [ 480.737779] [<ffffffff8112c680>] ? __lock_page+0x70/0x70 [ 480.737790] [<ffffffff8169e129>] schedule+0x29/0x70 [ 480.737793] [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0 [ 480.737797] [<ffffffff8112c68e>] sleep_on_page+0xe/0x20 [ 480.737800] [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90 [ 480.737804] [<ffffffff8112d29b>] ? find_get_pages_tag+0xcb/0x170 [ 480.737807] [<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80 [ 480.737811] [<ffffffff8107ccc0>] ? 
autoremove_wake_function+0x40/0x40 [ 480.737814] [<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0 [ 480.737821] [<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30 [ 480.737828] [<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60 [ 480.737845] [<ffffffff811be9e1>] __sync_blockdev+0x21/0x40 [ 480.737850] [<ffffffff811bea13>] sync_blockdev+0x13/0x20 [ 480.737853] [<ffffffff811bea89>] __blkdev_put+0x69/0x1c0 [ 480.737869] [<ffffffff81178414>] ? kfree+0x114/0x140 [ 480.737873] [<ffffffff811bec3b>] blkdev_put+0x5b/0x160 [ 480.737876] [<ffffffff811bed65>] blkdev_close+0x25/0x30 [ 480.737884] [<ffffffff81186d4e>] __fput+0xbe/0x240 [ 480.737892] [<ffffffff81186ede>] ____fput+0xe/0x10 [ 480.737896] [<ffffffff81078fd2>] task_work_run+0x72/0x90 [ 480.737902] [<ffffffff8105d780>] do_exit+0x410/0x470 [ 480.737920] [<ffffffff8130c578>] ? blk_finish_plug+0x18/0x50 [ 480.737924] [<ffffffff8106ae4a>] ? __dequeue_signal+0x6a/0xb0 [ 480.737927] [<ffffffff8105d984>] do_group_exit+0x44/0xa0 [ 480.737931] [<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480 [ 480.737947] [<ffffffff810147f9>] do_signal+0x29/0x130 [ 480.737951] [<ffffffff81185e70>] ? vfs_write+0x110/0x180 [ 480.737954] [<ffffffff810149b0>] do_notify_resume+0x90/0xd0 [ 480.737958] [<ffffffff816a7b62>] int_signal+0x12/0x17 [ 600.736070] INFO: task kworker/u:0:6 blocked for more than 120 seconds. [ 600.738168] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 600.740817] kworker/u:0 D 0000000000000001 0 6 2 0x00000000 [ 600.740823] ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0 [ 600.740827] ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0 [ 600.740831] ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660 [ 600.740835] Call Trace: [ 600.740847] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 600.740854] [<ffffffff8169e129>] schedule+0x29/0x70 [ 600.740857] [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0 [ 600.740861] [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio] [ 600.740867] [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90 [ 600.740884] [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio] [ 600.740894] [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90 [ 600.740900] [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40 [ 600.740905] [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio] [ 600.740912] [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data] [ 600.740917] [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data] [ 600.740923] [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool] [ 600.740928] [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool] [ 600.740933] [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool] [ 600.740937] [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool] [ 600.740941] [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool] [ 600.740945] [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool] [ 600.740951] [<ffffffff81076176>] process_one_work+0x136/0x550 [ 600.740955] [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool] [ 600.740959] [<ffffffff81077375>] worker_thread+0x165/0x3c0 [ 600.740974] [<ffffffff81077210>] ? manage_workers+0x190/0x190 [ 600.740985] [<ffffffff8107c3a3>] kthread+0x93/0xa0 [ 600.740990] [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10 [ 600.740994] [<ffffffff8107c310>] ? flush_kthread_worker+0xb0/0xb0 [ 600.740997] [<ffffffff816a8940>] ? 
gs_change+0x13/0x13 [ 600.741032] INFO: task dd:2808 blocked for more than 120 seconds. [ 600.742759] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 600.745145] dd D ffffffff8180cd00 0 2808 2541 0x00000006 [ 600.745149] ffff8800795df9f8 0000000000000046 0000000000006446 ffffffff81c20ac0 [ 600.745152] ffff8800795dffd8 ffff8800795dffd8 ffff8800795dffd8 0000000000013dc0 [ 600.745155] ffffffff81c13440 ffff88007bbbc470 ffff8800795df9f8 ffff88007fc14660 [ 600.745158] Call Trace: [ 600.745164] [<ffffffff8112c680>] ? __lock_page+0x70/0x70 [ 600.745167] [<ffffffff8169e129>] schedule+0x29/0x70 [ 600.745169] [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0 [ 600.745172] [<ffffffff8112c68e>] sleep_on_page+0xe/0x20 [ 600.745175] [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90 [ 600.745178] [<ffffffff8112d29b>] ? find_get_pages_tag+0xcb/0x170 [ 600.745181] [<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80 [ 600.745196] [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40 [ 600.745215] [<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0 [ 600.745219] [<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30 [ 600.745222] [<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60 [ 600.745226] [<ffffffff811be9e1>] __sync_blockdev+0x21/0x40 [ 600.745229] [<ffffffff811bea13>] sync_blockdev+0x13/0x20 [ 600.745231] [<ffffffff811bea89>] __blkdev_put+0x69/0x1c0 [ 600.745235] [<ffffffff81178414>] ? kfree+0x114/0x140 [ 600.745238] [<ffffffff811bec3b>] blkdev_put+0x5b/0x160 [ 600.745240] [<ffffffff811bed65>] blkdev_close+0x25/0x30 [ 600.745246] [<ffffffff81186d4e>] __fput+0xbe/0x240 [ 600.745260] [<ffffffff81186ede>] ____fput+0xe/0x10 [ 600.745264] [<ffffffff81078fd2>] task_work_run+0x72/0x90 [ 600.745269] [<ffffffff8105d780>] do_exit+0x410/0x470 [ 600.745274] [<ffffffff8130c578>] ? blk_finish_plug+0x18/0x50 [ 600.745277] [<ffffffff8106ae4a>] ? __dequeue_signal+0x6a/0xb0 [ 600.745280] [<ffffffff8105d984>] do_group_exit+0x44/0xa0 [ 600.745284] [<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480 [ 600.745294] [<ffffffff810147f9>] do_signal+0x29/0x130 [ 600.745301] [<ffffffff81185e70>] ? vfs_write+0x110/0x180 [ 600.745304] [<ffffffff810149b0>] do_notify_resume+0x90/0xd0 [ 600.745307] [<ffffffff816a7b62>] int_signal+0x12/0x17 Excerpt from ps: root 2808 0.0 0.0 0 0 pts/1 D+ 12:42 0:00 [dd] root 2679 0.0 0.0 0 0 ? D< 12:42 0:00 [loop0] note that I have already sent kill -9 dd, that's why it appears as [dd] Excerpt from iostat -x 1: dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10.00 0.00 0.00 0.00 0.00 100.00 dm-11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6895.00 0.00 0.00 0.00 0.00 100.00 dm-12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6895.00 0.00 0.00 0.00 0.00 100.00 dm-13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 51190.00 0.00 0.00 0.00 0.00 100.00 the numbers in dm-11, dm-12 go down very very slowly after killing dd, such as 1 point per second I don't see dm-14 number 51190.0 going down at all in 10 minutes. 
For reference here are the dm numbers: lrwxrwxrwx 1 root root 8 Oct 16 12:42 vgt10-lvt10_1 -> ../dm-14 lrwxrwxrwx 1 root root 8 Oct 16 12:42 vgt10-pool10 -> ../dm-13 lrwxrwxrwx 1 root root 8 Oct 16 12:42 vgt10-pool10_tdata -> ../dm-11 lrwxrwxrwx 1 root root 8 Oct 16 12:42 vgt10-pool10_tmeta -> ../dm-10 lrwxrwxrwx 1 root root 8 Oct 16 12:42 vgt10-pool10-tpool -> ../dm-12 Here are the stack traces for loop0 and dd: (note I have already sent kill -9 dd but its console is still hanged) cat /proc/2808/stack (dd) [<ffffffff8112c68e>] sleep_on_page+0xe/0x20 [<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80 [<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0 [<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30 [<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60 [<ffffffff811be9e1>] __sync_blockdev+0x21/0x40 [<ffffffff811bea13>] sync_blockdev+0x13/0x20 [<ffffffff811bea89>] __blkdev_put+0x69/0x1c0 [<ffffffff811bec3b>] blkdev_put+0x5b/0x160 [<ffffffff811bed65>] blkdev_close+0x25/0x30 [<ffffffff81186d4e>] __fput+0xbe/0x240 [<ffffffff81186ede>] ____fput+0xe/0x10 [<ffffffff81078fd2>] task_work_run+0x72/0x90 [<ffffffff8105d780>] do_exit+0x410/0x470 [<ffffffff8105d984>] do_group_exit+0x44/0xa0 [<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480 [<ffffffff810147f9>] do_signal+0x29/0x130 [<ffffffff810149b0>] do_notify_resume+0x90/0xd0 [<ffffffff816a7b62>] int_signal+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff cat /proc/2679/stack (loop0) [<ffffffff81689468>] balance_dirty_pages.isra.18+0x491/0x56d [<ffffffff81137f8b>] balance_dirty_pages_ratelimited_nr+0xeb/0x110 [<ffffffff8112cb5f>] generic_perform_write+0x16f/0x210 [<ffffffff8112cc5d>] generic_file_buffered_write+0x5d/0x90 [<ffffffff8112e746>] __generic_file_aio_write+0x1b6/0x3b0 [<ffffffff8112e9bf>] generic_file_aio_write+0x7f/0x100 [<ffffffff8121c2a3>] ext4_file_write+0x93/0xe0 [<ffffffff81185563>] do_sync_write+0xa3/0xe0 [<ffffffff8143e624>] __do_lo_send_write+0x54/0xa0 [<ffffffff8143e943>] do_lo_send_direct_write+0x73/0xa0 [<ffffffff8143fff0>] lo_send+0xa0/0x100 [<ffffffff8144024e>] do_bio_filebacked+0xde/0x150 [<ffffffff81440c42>] loop_thread+0xc2/0x250 [<ffffffff8107c3a3>] kthread+0x93/0xa0 [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10 [<ffffffffffffffff>] 0xffffffffffffffff Here is top: top - 12:58:34 up 18 min, 4 users, load average: 3.24, 2.99, 2.00 Tasks: 165 total, 1 running, 164 sleeping, 0 stopped, 0 zombie Cpu(s): 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 2051176k total, 638444k used, 1412732k free, 258452k buffers Swap: 0k total, 0k used, 0k free, 146980k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2824 root 20 0 17340 1292 904 S 0 0.1 0:00.61 top 1 root 20 0 24336 2264 1280 S 0 0.1 0:01.39 init 2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0 0.0 0:00.08 ksoftirqd/0 5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H as you can see we are at 0% CPU usage > You have an open ended dd there, so eventually you'll run out of space (once > you write 400M to fill the thin volume). At 40KB/sec it would have taken 2.7 hours to complete, and I didn't wait that much time. It might be possible that after 2.7 hours the system could recover without data loss but that's just because the volume is just 400MB, if it was 40GB it would be unthinkable to wait. > If there isn't anything useful in the kernel log then we'll need traces of > the relevant tasks (e.g. 
the dd task) via: echo t > /proc/sysrq-trigger

I'm attaching the file, it's called dmesg3-post2sysrq.txt; unfortunately the
output seems bigger than the kernel buffer, so it's truncated at the top. I am
also noticing that I/O writes to the root filesystem (outside thin) are terribly
slowed down; it took something like 5 seconds to dump this dmesg to a file.

> Again, we need to know the version of lvm2.

See above.

> There may have been a bug in lvm2's interface with the thin_check utility.

thin_check doesn't complain if the machine is rebooted in any situation other
than during this bug. It doesn't complain after clean unmounts, so I believe the
metadata really are corrupt. thin_check is the latest version as of today,
compiled from source git. Maybe thin_check could be improved so as to recover
data even in this situation, but I would suggest fixing the kernel so that it
does not cause the bug in the first place. Note that there is also the hangup of
thinp, which by itself would cause problems in production even if there were no
data loss.

I have checked that also with 3.6.2, after "reboot -nf" ("reboot -f" would hang,
but please note that there shouldn't be I/O commands lost with this procedure
because the virtualization host is never rebooted) the pool is corrupt after
reboot and the latest version of thin_check is not able to recover it. So I
would say there is (also?) a bug with the order of writes or barrier usage in
dm-thin.

Thanks for your work

Created attachment 83621 [details]
Dump of info from proc. See long comment
Since the output of echo t > /proc/sysrq-trigger was too long for the kernel buffer, I decided to gather info from /proc instead. This is a dump of all processes, with stack traces and status, taken from /proc.
Please note that this is a *different reboot* from my previous post and the sysrq-trigger dump so PIDs won't match to that one. Also, this was captured before sending kill -9 to dd.
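A minimal sketch of one way such a per-process dump can be collected from /proc (the exact fields captured in the attachment may differ; reading /proc/<pid>/stack requires root and a kernel with stack tracing enabled):

# Record command name, status and kernel stack for every task into one file.
for d in /proc/[0-9]*; do
    echo "=== PID ${d#/proc/} ==="
    cat "$d/comm" "$d/status" "$d/stack" 2>/dev/null
    echo
done > proc-dump.txt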
Hi Mike, if the bug it doesn't reproduce on your system, probably the easiest thing you can try is to increase the size of the thin pool and lv. I have specified a size adequate to cause the problem reliably on my system, but I have noticed a smaller size does not cause it, so I might have stayed too low with the sizes for the general case. Try with all sizes multiplied by 10: dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=10000 losetup /dev/loop0 ttest10backend pvcreate /dev/loop0 vgcreate vgt10 /dev/loop0 lvcreate --size 9000M --thinpool vgt10/pool10 lvcreate --name lvt10_1 --virtualsize 4000M --thinpool vgt10/pool10 dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M This is much easier to try than the next approach that comes to my mind which is a KVM virtual machine with a sparse file for virtual disk backend... (In reply to comment #6) > Hi Mike, if the bug it doesn't reproduce on your system, probably the easiest > thing you can try is to increase the size of the thin pool and lv. > I have specified a size adequate to cause the problem reliably on my system, > but I have noticed a smaller size does not cause it, so I might have stayed > too > low with the sizes for the general case. > > Try with all sizes multiplied by 10: My test system has 12G of physical memory so I went with the following: dd if=/dev/zero of=/mnt/ttest10backend bs=1M count=0 seek=15000 losetup /dev/loop0 /mnt/ttest10backend pvcreate /dev/loop0 vgcreate vgt10 /dev/loop0 lvcreate --size 14000M --thinpool vgt10/pool10 lvcreate --name lvt10_1 --virtualsize 12000M --thinpool vgt10/pool10 dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M But the dd still completed: dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device 12001+0 records in 12000+0 records out 12582912000 bytes (13 GB) copied, 347.912 s, 36.2 MB/s Though there was a significant stall, before dd completed, due to the VM needing to free memory via shrinking: dd D ffffffff8160f420 0 25374 25091 0x00000000 ffff880280d8d3a8 0000000000000086 ffff880280d8dfd8 00000000000128c0 ffff880280d8c010 00000000000128c0 00000000000128c0 00000000000128c0 ffff880280d8dfd8 00000000000128c0 ffff8803324b8ad0 ffff8802ef65ca90 Call Trace: [<ffffffff814d58e9>] schedule+0x29/0x70 [<ffffffff814d5bde>] schedule_preempt_disabled+0xe/0x10 [<ffffffff814d4353>] __mutex_lock_slowpath+0xd3/0x150 [<ffffffff814d425b>] mutex_lock+0x2b/0x50 [<ffffffff81138dd0>] page_referenced_file+0x50/0x120 [<ffffffff81139839>] page_referenced+0x79/0x1f0 [<ffffffff81118e9f>] shrink_active_list+0x1cf/0x310 [<ffffffff8111915f>] shrink_lruvec+0x17f/0x3a0 [<ffffffff8111947b>] shrink_zones+0xfb/0x200 [<ffffffff8111961e>] do_try_to_free_pages+0x9e/0x340 [<ffffffff81119a6a>] try_to_free_pages+0xda/0x170 [<ffffffff8110e8a2>] __alloc_pages_slowpath+0x392/0x7c0 [<ffffffff8110eeea>] __alloc_pages_nodemask+0x21a/0x230 [<ffffffff81149c66>] alloc_pages_current+0xb6/0x120 [<ffffffff81105d27>] __page_cache_alloc+0xa7/0xc0 [<ffffffff811069ac>] grab_cache_page_write_begin+0x7c/0xe0 [<ffffffff81196040>] ? 
blkdev_max_block+0x40/0x40 [<ffffffff81193348>] block_write_begin+0x38/0x90 [<ffffffff81197093>] blkdev_write_begin+0x23/0x30 [<ffffffff81104e2a>] generic_perform_write+0xca/0x200 [<ffffffff81104fc4>] generic_file_buffered_write+0x64/0xa0 [<ffffffff81107a6f>] __generic_file_aio_write+0x1bf/0x3b0 [<ffffffff811965f7>] blkdev_aio_write+0x57/0xc0 [<ffffffff8115f9da>] do_sync_write+0xaa/0xf0 [<ffffffff8115ff48>] vfs_write+0xc8/0x190 [<ffffffff81160111>] sys_write+0x51/0x90 [<ffffffff814de8e9>] system_call_fastpath+0x16/0x1b It should be noted that the dm-bufio layer does interface with the shrinker, so its memory will be reclaimed should memory get tight. But I still couldn't reproduce the dm_pool_commit_metadata hang you illustrated with dm_bufio_write_dirty_buffers at the end of the trace. I have 2GB virtual RAM, 2 vCPUs, and no swap. free -m total used free shared buffers cached Mem: 2003 259 1743 0 51 58 -/+ buffers/cache: 149 1854 Swap: 0 0 0 It *might* be easier and more effective to reduce RAM in your VM, than it is to enlarge the sparse file further... (In reply to comment #8) > I have 2GB virtual RAM, 2 vCPUs, and no swap. > free -m > total used free shared buffers cached > Mem: 2003 259 1743 0 51 58 > -/+ buffers/cache: 149 1854 > Swap: 0 0 0 > It *might* be easier and more effective to reduce RAM in your VM, than it is > to > enlarge the sparse file further... I tested with mem=2048M and it worked fine too. Have you tried this procedure on bare metal? Is it possible that the KVM guest is somehow enabling you to hit this? (In reply to comment #9) > (In reply to comment #8) > > I have 2GB virtual RAM, 2 vCPUs, and no swap. > > free -m > > total used free shared buffers cached > > Mem: 2003 259 1743 0 51 58 > > -/+ buffers/cache: 149 1854 > > Swap: 0 0 0 > > It *might* be easier and more effective to reduce RAM in your VM, than it > is to > > enlarge the sparse file further... > > I tested with mem=2048M and it worked fine too. I disabled swap and I now see: kworker/u:0 D ffffffff8160f1a0 0 16590 2 0x00000000 ffff880018e93ae0 0000000000000046 ffff880018e93fd8 00000000000128c0 ffff880018e92010 00000000000128c0 00000000000128c0 00000000000128c0 ffff880018e93fd8 00000000000128c0 ffffffff81a13420 ffff880050232100 Call Trace: [<ffffffffa01c26a0>] ? read_endio+0x40/0x40 [dm_bufio] [<ffffffff814d58e9>] schedule+0x29/0x70 [<ffffffff814d59bc>] io_schedule+0x8c/0xd0 [<ffffffffa01c26ae>] do_io_schedule+0xe/0x20 [dm_bufio] [<ffffffff814d3eef>] __wait_on_bit+0x5f/0x90 [<ffffffffa01c26a0>] ? read_endio+0x40/0x40 [dm_bufio] [<ffffffff814d3f98>] out_of_line_wait_on_bit+0x78/0x90 [<ffffffff81065990>] ? autoremove_wake_function+0x50/0x50 [<ffffffffa01c30c1>] ? __write_dirty_buffers_async+0x71/0xe0 [dm_bufio] [<ffffffffa01c3266>] dm_bufio_write_dirty_buffers+0x86/0x1e0 [dm_bufio] [<ffffffffa030a218>] dm_bm_flush_and_unlock+0x48/0x70 [dm_persistent_data] [<ffffffffa030d422>] dm_tm_commit+0x32/0x40 [dm_persistent_data] [<ffffffffa04e523d>] __commit_transaction+0x22d/0x240 [dm_thin_pool] [<ffffffff8122292a>] ? generic_make_request+0xca/0x100 [<ffffffffa04e1235>] ? issue+0x35/0xe0 [dm_thin_pool] [<ffffffffa04e52b0>] dm_pool_commit_metadata+0x60/0x80 [dm_thin_pool] [<ffffffffa04e0611>] commit_or_fallback+0x41/0x70 [dm_thin_pool] [<ffffffffa04e2eb4>] process_deferred_bios+0x174/0x230 [dm_thin_pool] [<ffffffffa04e2f70>] ? 
process_deferred_bios+0x230/0x230 [dm_thin_pool] [<ffffffffa04e2fc1>] do_worker+0x51/0x60 [dm_thin_pool] [<ffffffff8105dac9>] process_one_work+0x179/0x4b0 [<ffffffff8105f93e>] worker_thread+0x12e/0x330 [<ffffffff8105f810>] ? manage_workers+0x190/0x190 [<ffffffff8106529e>] kthread+0x9e/0xb0 [<ffffffff814df984>] kernel_thread_helper+0x4/0x10 [<ffffffff81065200>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814df980>] ? gs_change+0x13/0x13

But progress was still being made with periodic bursts of IO. The dd completed, albeit slower:

dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device
12001+0 records in
12000+0 records out
12582912000 bytes (13 GB) copied, 583.837 s, 21.6 MB/s

You said "I am also noticing that I/O writes to root filesystem (outside thin) is terribly slowed down, it took like 5 seconds for dumping this dmesg to file." This seems like a generic VM scalability limitation under low memory conditions. Your system is particularly incapable of handling the load; consider: I can successfully write 12G to a thin volume with the same memory config (2G, no swap), yet you cannot even write 400M.

Hello Mike,

To answer your question, no, I have not tried to reproduce the problem on hardware hosts. I don't really have hardware hosts available for testing right now. What hypervisor do you use? And what CPU do you have on the host machine? Mine is a rather old E5420, which does not have very good virtualization performance.

Anyway: I remembered I have nonstandard virtual memory and scheduler settings in the VM. Playing with a few virtual memory and scheduler settings, sometimes it did not reproduce on my VM. But even with default settings, sometimes the bug did reproduce. And with my custom settings it seems to reproduce more often. So see if you can reproduce it like this:

Set the VM memory at 2GB RAM, 2 vCPU, no swap.

# Reboot and login. It appears easier to reproduce the bug with a VM which has just rebooted.

# Virtual memory and scheduler settings
echo 0 > /proc/sys/vm/swappiness
echo 10000000 > /proc/sys/vm/dirty_background_bytes
echo 200000000 > /proc/sys/vm/dirty_bytes
echo 350 > /proc/sys/vm/dirty_expire_centisecs
echo 200 > /proc/sys/vm/dirty_writeback_centisecs
echo cfq > /sys/block/vda/queue/scheduler   # Your VM has vda, i.e. kvm virtio?
echo 4 > /sys/block/vda/queue/nr_requests

# Do not remove the previous sparse file; create a new one
mkdir thin_bug_48851
cd thin_bug_48851
cd $(mktemp --tmpdir=. -d)
sleep 5
dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
losetup /dev/loop0 ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 900M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
sync
sleep 5
sync
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M

I have noticed that the earliest symptom of the bug happening is from iostat -x 1:

dm-10  0.00  0.00  0.00  0.00  0.00  0.00  0.00     11.00  0.00  0.00  0.00  0.00  100.00
dm-11  0.00  0.00  0.00  0.00  0.00  0.00  0.00   9432.00  0.00  0.00  0.00  0.00  100.00
dm-12  0.00  0.00  0.00  0.00  0.00  0.00  0.00   9432.00  0.00  0.00  0.00  0.00  100.00
dm-13  0.00  0.00  0.00  0.00  0.00  0.00  0.00      0.00  0.00  0.00  0.00  0.00    0.00
dm-14  0.00  0.00  0.00  0.00  0.00  0.00  0.00  48814.00  0.00  0.00  0.00  0.00  100.00

When you see numbers stuck for at least 4-5 seconds on the _tdata, _tmeta and -tpool lines (the 11.00 and the two 9432.00 in the example above), you are into the bug and it will never resolve. If numbers appear on those lines for just a few seconds, it can still resolve.
Instead, a number stuck on the thin LV line (dm-14 in the example above) is not significant.

For reference:
lrwxrwxrwx 1 root root 8 Oct 17 19:06 vgt10-lvt10_1 -> ../dm-14
lrwxrwxrwx 1 root root 8 Oct 17 19:06 vgt10-pool10 -> ../dm-13
lrwxrwxrwx 1 root root 8 Oct 17 19:06 vgt10-pool10_tdata -> ../dm-11
lrwxrwxrwx 1 root root 8 Oct 17 19:06 vgt10-pool10_tmeta -> ../dm-10
lrwxrwxrwx 1 root root 8 Oct 17 19:06 vgt10-pool10-tpool -> ../dm-12

Let me know if you can reproduce it with the sequence above.

The bug is reproducible also with 1 vCPU, with the sequence in my comment #11.

Mike, it turns out that my hardware node had a partially defective disk which was delaying I/O, and this allowed me to trigger the race more easily than normal. I have been trying to find a better reproducer. Here is a new version of the reproducer which works so reliably for me that I hope it can work even on good hardware. (Tested with 1..4 vCPU and 1..2 GB RAM, KVM virtual machine.) You need 2 shells, at least if you want to look at "iostat -x 1", which is *very* interesting...

#--- Shell#1:
rm -f ttest10backend   # deleting old attempts
dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
losetup /dev/loop0 ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 900M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
sync

#--- Shell#2
iostat -x 1

#--- Shell#1:
# Paste all this text *simultaneously*
TIME_DD_TO_RESUME=5
TIME_RESUME_TO_REBOOT=20
echo suspending
dmsetup suspend vgt10-pool10_tmeta
echo suspend finished
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M &
echo sleeping $TIME_DD_TO_RESUME
sleep $TIME_DD_TO_RESUME
echo resuming
dmsetup resume vgt10-pool10_tmeta
echo resume finished
#
# Paste also the following part if you want to see the metadata corruption bug
#
echo sleeping $TIME_RESUME_TO_REBOOT
sleep $TIME_RESUME_TO_REBOOT
echo rebooting
reboot -nf

#-- Upon reboot --
losetup /dev/loop0 ttest10backend
vgchange -ay vgt10
# outputs:
#   Check of thin pool vgt10/pool10 failed (status:1). Manual repair required
#   (thin_dump --repair /dev/mapper/vgt10-pool10_tmeta)!
#   0 logical volume(s) in volume group "vgt10" now active

Please note: at least the variable TIME_DD_TO_RESUME will require manual tweaking. If you set it very high, like 15 seconds, you are certain to hit the hangup bug, but you might not see the corruption bug (i.e. upon reboot the metadata won't be corrupt). If you set it low, like 5 seconds, you *should* hit the corruption bug rather easily, but the hangup bug (supposing you don't do reboot -nf) might not be very apparent. But if you set it too low, like 2 seconds, you might hit neither of the two bugs. The right value probably depends on your hardware.

It is more difficult to hit the corruption bug than the hangup bug. Please make sure you fix the corruption bug before the hangup bug, because it will be very hard to reproduce the metadata corruption bug once you have fixed the hangup bug. I am not convinced that the two bugs are the same bug. Thanks

You're queuing a bunch of in-flight buffered IO and then forcibly killing the machine. It is likely that the "Check of thin pool vgt10/pool10 failed (status:1)." message is a side effect of reference count mismatches that could easily be ignored, but lvm2 isn't currently able to know that the error is safe to continue from.
Please provide the output from the following once you've reproduced the so-called "corruption":

/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta

Hi Mike,
have you tried whether my last reproducer works for you?

Anyway, to answer your question, indeed it's reference count mismatches; in one case it was just:

-------------
/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
Errors in metadata
Errors in data block reference counts
0: was 1, expected 0
-------------

in another case it was much longer:

-------------
/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
Errors in metadata
Errors in data block reference counts
4016: was 1, expected 0
4017: was 1, expected 0
4018: was 1, expected 0
.....
..... ALL numbers up to 4415 ....
.....
4415: was 1, expected 0
-------------

So do you think it can be ignored safely, and all is still semantically correct? Please note there is still the hangup problem. Were you able to reproduce that? Thank you

Hello Mike, is there any news on this bug? We would badly need to use these cool thinp features... Thank you

(In reply to comment #15)
> Hi Mike,
> have you tried whether my last reproducer works for you?
>
> Anyway, to answer your question, indeed it's reference count mismatches;
> in one case it was just:
>
> -------------
> /usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
> Errors in metadata
> Errors in data block reference counts
> 0: was 1, expected 0
> -------------
>
> in another case it was much longer:
>
> -------------
> /usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
> Errors in metadata
> Errors in data block reference counts
> 4016: was 1, expected 0
> 4017: was 1, expected 0
> 4018: was 1, expected 0
> .....
> ..... ALL numbers up to 4415 ....
> .....
> 4415: was 1, expected 0
> -------------
>
> So do you think it can be ignored safely, and all is still semantically correct?

thin_check doesn't actually repair, it just reports. These refcount mismatches are expected, but you'll need to repair (to eliminate the mismatches) by doing a thin_dump and then a thin_restore, e.g.:

thin_dump /dev/mapper/vgt10-pool10_tmeta > thin-pool-metadata
thin_restore -i thin-pool-metadata -o /dev/mapper/vgt10-pool10_tmeta

So this isn't a bug. The messages mean it was releasing blocks that were preallocated for the I/O that got interrupted and so never completed. The userspace tools and/or documentation need to be improved to separate 'normal' metadata corrections from genuine errors. And we need to offer a trivial way to perform this sort of 'repair' in userspace without causing undue worry.

Hello, sorry for the delay. I confirm that in my cases the pool was again accessible after exporting and reimporting the metadata. So all is good, and thanks; the main problem is that this kind of fix is not intuitive, especially after seeing "Manual repair required"; my interpretation was that I had to fix the XML by hand.
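For reference, a consolidated sketch of the repair sequence described above, assuming /dev/mapper/vgt10-pool10_tmeta is accessible while the pool itself is not active (device names follow this report; the temporary file name is illustrative):

thin_check /dev/mapper/vgt10-pool10_tmeta                               # report-only: lists the refcount mismatches
thin_dump /dev/mapper/vgt10-pool10_tmeta > thin-pool-metadata           # export the metadata as XML
thin_restore -i thin-pool-metadata -o /dev/mapper/vgt10-pool10_tmeta    # write back consistent metadata
vgchange -ay vgt10                                                      # the pool should now activate cleanly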