Bug 48851 - Unable to access data on dm-thin device after machine reset
Summary: Unable to access data on dm-thin device after machine reset
Status: CLOSED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All
OS: Linux
Importance: P5 low
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-15 14:53 UTC by Kurk
Modified: 2015-01-06 18:56 UTC
CC List: 3 users

See Also:
Kernel Version: 3.4.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg after echo t > /proc/sysrq-trigger (244.91 KB, text/plain)
2012-10-16 12:19 UTC, Kurk
Dump of info from proc. See long comment (17.30 KB, application/x-gzip)
2012-10-16 12:57 UTC, Kurk

Description Kurk 2012-10-15 14:53:00 UTC
This is a bug report for reproducible complete data loss with dm-thin.

Please note that this bug (I believe it's the same one) has also happened to me in setups different from the one below (for example, with no loop-mounted file); however, this is the only way I found to replicate it reliably and quickly.

Kernel is 3.4.10 vanilla
This was on a KVM virtual machine with 2 vCPU but I hope it's reproducible even on hardware hosts. I cannot check right now.


root@host1:~# dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
0+0 records in                                                                                                                                    
0+0 records out                                                                                                                                   
0 bytes (0 B) copied, 0.0003177 s, 0.0 kB/s                                                                                                       
	[the bug won't reproduce reliably if the backend file is preallocated or too small. Host filesystem is ext4]
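
	[For clarity: one quick way to confirm the backend file really is sparse is to compare the apparent size with the blocks actually allocated; while the file stays sparse, du reports close to zero:]

ls -lh ttest10backend     # apparent size, ~1000 MiB
du -h  ttest10backend     # allocated blocks, near 0 while the file is sparse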

root@host1:~# losetup /dev/loop0 ttest10backend                                                                                            

root@host1:~# pvcreate /dev/loop0
  Writing physical volume data to disk "/dev/loop0"                                                                                               
  Physical volume "/dev/loop0" successfully created                                                                                               

root@host1:~# vgcreate vgt10 /dev/loop0
  Volume group "vgt10" successfully created                                                                                                       

root@host1:~# lvcreate --size 900M --thinpool vgt10/pool10
  Rounding up size to full physical extent 4.00 MiB                                                                                               
  Logical volume "pool10" created                                                                                                                 

root@host1:~# lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
  Logical volume "lvt10_1" created                                                                                                                

root@host1:~# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M 
	[hangs in a few seconds and write speed drops to 40K/sec seen from iotop. Uninterruptible. Needs hard reset.]

[after hard reset:]

root@host1:~# losetup /dev/loop0 ttest10backend 
root@host1:~# lvs
  LV            VG         Attr     LSize   Pool  Origin Data%  Move Log Copy%  Convert
  ...other nonthin vols...  -wi-a---  60.00g 
  lvt10_1       vgt10       Vwi---tz 400.00m pool10                                    
  pool10        vgt10       twi---tz 900.00m                                            
root@host1:~# vgs
  VG         #PV #LV #SN Attr   VSize   VFree 
  anothergroup 1   1   0 wz--n-  87.64g 27.64g
  vgt10        1   3   0 wz--n- 996.00m 92.00m
root@host1:~# vgchange -ay vgt10
  Check of thin pool vgt10/pool10 failed (status:1). Manual repair required (thin_dump --repair /dev/mapper/vgt10-pool10_tmeta)!
  0 logical volume(s) in volume group "vgt10" now active

...complete data loss! Apparently irrecoverable. Whole thin pool is affected.
Comment 1 Mike Snitzer 2012-10-15 15:39:30 UTC
(In reply to comment #0)
> This is a bug report for reproducible complete data loss with dm-thin.
> 
> Please note that this bug (I believe it's the same one) has also happened to me
> in setups different from the one below (for example, with no loop-mounted file);
> however, this is the only way I found to replicate it reliably and quickly.

OK, can you elaborate on the setup you had that prompted you to try to reproduce using this questionable configuration?

> Kernel is 3.4.10 vanilla
> This was on a KVM virtual machine with 2 vCPU but I hope it's reproducible even
> on hardware hosts. I cannot check right now.

Any reason you're using 3.4?  Various dm-thinp improvements have been made in subsequent releases.

And what version of lvm2 are you using?

Steps will be taken to reproduce with your sequence using the latest kernel (v3.6) and latest lvm2 (v2_02_98).

> root@host1:~# dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
> 0+0 records in                                                                
> 0+0 records out                                                               
> 0 bytes (0 B) copied, 0.0003177 s, 0.0 kB/s                                   
>     [the bug won't reproduce reliably if the backend file is preallocated or
> too small. Host filesystem is ext4]
>
> root@host1:~# losetup /dev/loop0 ttest10backend                               
 
I say "questionable configuration" above because using a sparse loopback file says to me is there is potential for dm-thin's underlying storage to be unreliable.

What mount options were you using for ext4?

> root@host1:~# pvcreate /dev/loop0
>   Writing physical volume data to disk "/dev/loop0"                           
>   Physical volume "/dev/loop0" successfully created                           
> 
> root@host1:~# vgcreate vgt10 /dev/loop0
>   Volume group "vgt10" successfully created                                   
> 
> root@host1:~# lvcreate --size 900M --thinpool vgt10/pool10
>   Rounding up size to full physical extent 4.00 MiB                           
>   Logical volume "pool10" created                                             
> 
> root@host1:~# lvcreate --name lvt10_1 --virtualsize 400M --thinpool
> vgt10/pool10
>   Logical volume "lvt10_1" created                                            
> 
> root@host1:~# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M 
>     [hangs in a few seconds and write speed drops to 40K/sec seen from iotop.
> Uninterruptible. Needs hard reset.]

Did you see anything in the kernel log that indicated there was a problem?  You have an open-ended dd there, so eventually you'll run out of space (once you write 400M to fill the thin volume).

If there isn't anything useful in the kernel log then we'll need traces of the relevant tasks (e.g. the dd task) via: echo t > /proc/sysrq-trigger
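
For example, something along these lines should capture them (booting with a larger log buffer, e.g. log_buf_len=16M, helps if the full dump overflows the default ring buffer):

echo t > /proc/sysrq-trigger
dmesg > sysrq-t-traces.txt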

> [after hard reset:]

We need to understand what was going wrong with the Uninterruptible dd.  Again, please check your kernel log.

> root@host1:~# losetup /dev/loop0 ttest10backend 
> root@host1:~# lvs
>   LV            VG         Attr     LSize   Pool  Origin Data%  Move Log Copy%  Convert
>   ...other nonthin vols...  -wi-a---  60.00g 
>   lvt10_1       vgt10       Vwi---tz 400.00m pool10                           
>   pool10        vgt10       twi---tz 900.00m                                  
> root@host1:~# vgs
>   VG         #PV #LV #SN Attr   VSize   VFree 
>   anothergroup 1   1   0 wz--n-  87.64g 27.64g
>   vgt10        1   3   0 wz--n- 996.00m 92.00m
> root@host1:~# vgchange -ay vgt10
>   Check of thin pool vgt10/pool10 failed (status:1). Manual repair required
> (thin_dump --repair /dev/mapper/vgt10-pool10_tmeta)!
>   0 logical volume(s) in volume group "vgt10" now active

Again, we need to know the version of lvm2.  There may have been a bug in lvm2's interface with the thin_check utility.

> ...complete data loss! Apparently irrecoverable. Whole thin pool is affected.

While I appreciate the bug report, I'm having a hard time with your leap to an unqualified "complete data loss!".  Regardless, you have our attention and we'll work through it.
Comment 2 Mike Snitzer 2012-10-15 20:42:21 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > root@host1:~# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M 
> >     [hangs in a few seconds and write speed drops to 40K/sec seen from
> > iotop. Uninterruptible. Needs hard reset.]
> 
> Did you see anything in the kernel log that indicated there was a problem? You
> have an open-ended dd there, so eventually you'll run out of space (once you
> write 400M to fill the thin volume).

I cannot reproduce an uninterruptible hang with dd (and so I cannot reproduce this bug); I get what I expected:

# dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M
dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device
401+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 10.4799 s, 40.0 MB/s

I'd appreciate it if you could please answer the various questions I had in comment #1.
Comment 3 Kurk 2012-10-16 12:19:13 UTC
Created attachment 83611 [details]
dmesg after echo t > /proc/sysrq-trigger
Comment 4 Kurk 2012-10-16 12:19:39 UTC
Hello Mike

I can still reproduce the bug on 3.6.2; which kernel did you try?

> OK, can you elaborate on the setup you had that prompted you to try to
> reproduce using this questionable configuration?

I had another nonstandard configuration: KVM virtual disk partition --> nonthin PV VG LV --> using one LV as PV for another VG --> LV as thin-pool --> thin LVs.

I was using this config because I didn't trust thin provisioning yet: I thought that if data loss happened, I also risked losing non-thin data, so layering it inside a non-thin LV seemed safer. Another reason is that vgcfgbackup/vgcfgrestore do not work if there is at least one thin pool, so as of now layering is still needed for safety.

It took a few weeks of regular work (not a stress test) to see the first full data loss in real life. After that, while trying to reproduce the data loss in a zillion ways, one attempt was with a loop device. That turned out to be reliable, so I reported the bug with it. I hope it's the same bug.


> Any reason you're using 3.4?  Various dm-thinp improvements have been made in
> subsequent releases.

Yes, because it is the most stable (i.e. oldest) kernel that contains the feature I need, which is thin discard.
In fact, at first I thought it was a discard problem, but then I was able to reproduce the data loss without using discard.

Dm-thin "improvements" in terms of new features surely go to 3.5 and 3.6 but I assumed bugfixes would be backported to 3.4 so I didn't expect 3.4 to hangup and lose data more than 3.5 or 3.6 , I expected just fewer features and less performance but more stability. Is that a wrong assumption?

I have now tried 3.6.2 and the bug is still there.


> And what version of lvm2 are you using?
It is
  LVM version:     2.02.95(2) (2012-03-06)
  Library version: 1.02.74 (2012-03-06)
from Debian unstable repositories.

Do you think LVM can be the problem? It looks like a kernel problem to me.

I have serious problems upgrading this package: standard lvm compiled from source doesn't work well with the Debian/Ubuntu distro (some problems with volume activation, AFAIR, maybe related to udev, which I have not tracked down yet), and trying to merge the new sources into the Debian unstable source package results in compile errors. If possible, I would like to try to reproduce the bug without upgrading the LVM package, please.

Note that thin-provisioning-tools (in particular thin_check) is compiled from git source and is up to date as of today.

> Steps will be taken to reproduce with your sequence using the latest kernel
> (v3.6) and latest lvm2 (v2_02_98).

> I say "questionable configuration" above because using a sparse loopback file
> says to me is there is potential for dm-thin's underlying storage to be
unreliable. 

What do you mean by unreliable?
I certainly have enough space in the FS for the backing file to be completely written out (it's just 1 GB). It's just that if I write it out before the test, the bug does not trigger easily. It's probably a race condition with some other parts of the kernel.
It definitely was not a case of running out of space in the volume holding the sparse backing file, if that is what you meant.

> What mount options were you using for ext4?

rw,relatime,errors=remount-ro,stripe=1024,data=ordered 0 0

It is on a partition of the KVM virtual disk.

On the host side this virtual disk is itself backed by a sparse file (and again there is enough space to fully write it out); sorry for that, it could indeed contribute to causing the race condition (many in-flight commands waiting for completion, maybe).
This sparse backend file for KVM never gave problems for other filesystems, logical volumes etc. 
The sparse file on the host side is on ext4; the mount options are: rw,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered 0 0. This host filesystem is on a plain MD RAID1 without LVM; the RAID1 is on partitions of two HDDs. Also please note that the virtualization host was never rebooted, so even if barriers were not working (which they are, AFAICS), the host and its page cache stay up, so no data loss would be expected from that side.

> Did you see anything in the kernel log that indicated there was a problem?  

Just hung tasks:

	[   10.016484] Rounding down aligned max_sectors from 4294967295 to 8388600
	[   10.395259] Rounding down aligned max_sectors from 4294967295 to 8388600
	[   10.758234] Rounding down aligned max_sectors from 4294967295 to 8388600
	[   11.128985] Rounding down aligned max_sectors from 4294967295 to 8388600
	[   11.508384] Rounding down aligned max_sectors from 4294967295 to 8388600
	[   11.877132] Rounding down aligned max_sectors from 4294967295 to 8388600
	[  360.724075] INFO: task kworker/u:0:6 blocked for more than 120 seconds.
	[  360.726236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[  360.728875] kworker/u:0     D 0000000000000001     0     6      2 0x00000000
	[  360.728881]  ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0
	[  360.728885]  ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0
	[  360.728889]  ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660
	[  360.728893] Call Trace:
	[  360.728905]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  360.728956]  [<ffffffff8169e129>] schedule+0x29/0x70
	[  360.728973]  [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0
	[  360.728980]  [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio]
	[  360.728986]  [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90
	[  360.728990]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  360.728994]  [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90
	[  360.729022]  [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40
	[  360.729029]  [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio]
	[  360.729036]  [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data]
	[  360.729052]  [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data]
	[  360.729060]  [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool]
	[  360.729065]  [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool]
	[  360.729070]  [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool]
	[  360.729074]  [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool]
	[  360.729084]  [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool]
	[  360.729089]  [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool]
	[  360.729094]  [<ffffffff81076176>] process_one_work+0x136/0x550
	[  360.729103]  [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
	[  360.729107]  [<ffffffff81077375>] worker_thread+0x165/0x3c0
	[  360.729114]  [<ffffffff81077210>] ? manage_workers+0x190/0x190
	[  360.729118]  [<ffffffff8107c3a3>] kthread+0x93/0xa0
	[  360.729134]  [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10
	[  360.729139]  [<ffffffff8107c310>] ? flush_kthread_worker+0xb0/0xb0
	[  360.729142]  [<ffffffff816a8940>] ? gs_change+0x13/0x13
	[  480.728077] INFO: task kworker/u:0:6 blocked for more than 120 seconds.
	[  480.730176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[  480.732832] kworker/u:0     D 0000000000000001     0     6      2 0x00000000
	[  480.732838]  ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0
	[  480.732842]  ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0
	[  480.732846]  ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660
	[  480.732850] Call Trace:
	[  480.732861]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  480.732868]  [<ffffffff8169e129>] schedule+0x29/0x70
	[  480.732871]  [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0
	[  480.732875]  [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio]
	[  480.732883]  [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90
	[  480.732894]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  480.732900]  [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90
	[  480.732906]  [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40
	[  480.732911]  [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio]
	[  480.732918]  [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data]
	[  480.732923]  [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data]
	[  480.732929]  [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool]
	[  480.732934]  [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool]
	[  480.732938]  [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool]
	[  480.732942]  [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool]
	[  480.732946]  [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool]
	[  480.732950]  [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool]
	[  480.732955]  [<ffffffff81076176>] process_one_work+0x136/0x550
	[  480.732959]  [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
	[  480.732963]  [<ffffffff81077375>] worker_thread+0x165/0x3c0
	[  480.732973]  [<ffffffff81077210>] ? manage_workers+0x190/0x190
	[  480.732983]  [<ffffffff8107c3a3>] kthread+0x93/0xa0
	[  480.732988]  [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10
	[  480.732991]  [<ffffffff8107c310>] ? flush_kthread_worker+0xb0/0xb0
	[  480.732994]  [<ffffffff816a8940>] ? gs_change+0x13/0x13
	[  480.733030] INFO: task dd:2808 blocked for more than 120 seconds.
	[  480.735085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[  480.737749] dd              D ffffffff8180cd00     0  2808   2541 0x00000006
	[  480.737753]  ffff8800795df9f8 0000000000000046 0000000000006446 ffffffff81c20ac0
	[  480.737758]  ffff8800795dffd8 ffff8800795dffd8 ffff8800795dffd8 0000000000013dc0
	[  480.737761]  ffffffff81c13440 ffff88007bbbc470 ffff8800795df9f8 ffff88007fc14660
	[  480.737765] Call Trace:
	[  480.737779]  [<ffffffff8112c680>] ? __lock_page+0x70/0x70
	[  480.737790]  [<ffffffff8169e129>] schedule+0x29/0x70
	[  480.737793]  [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0
	[  480.737797]  [<ffffffff8112c68e>] sleep_on_page+0xe/0x20
	[  480.737800]  [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90
	[  480.737804]  [<ffffffff8112d29b>] ? find_get_pages_tag+0xcb/0x170
	[  480.737807]  [<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80
	[  480.737811]  [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40
	[  480.737814]  [<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0
	[  480.737821]  [<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30
	[  480.737828]  [<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60
	[  480.737845]  [<ffffffff811be9e1>] __sync_blockdev+0x21/0x40
	[  480.737850]  [<ffffffff811bea13>] sync_blockdev+0x13/0x20
	[  480.737853]  [<ffffffff811bea89>] __blkdev_put+0x69/0x1c0
	[  480.737869]  [<ffffffff81178414>] ? kfree+0x114/0x140
	[  480.737873]  [<ffffffff811bec3b>] blkdev_put+0x5b/0x160
	[  480.737876]  [<ffffffff811bed65>] blkdev_close+0x25/0x30
	[  480.737884]  [<ffffffff81186d4e>] __fput+0xbe/0x240
	[  480.737892]  [<ffffffff81186ede>] ____fput+0xe/0x10
	[  480.737896]  [<ffffffff81078fd2>] task_work_run+0x72/0x90
	[  480.737902]  [<ffffffff8105d780>] do_exit+0x410/0x470
	[  480.737920]  [<ffffffff8130c578>] ? blk_finish_plug+0x18/0x50
	[  480.737924]  [<ffffffff8106ae4a>] ? __dequeue_signal+0x6a/0xb0
	[  480.737927]  [<ffffffff8105d984>] do_group_exit+0x44/0xa0
	[  480.737931]  [<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480
	[  480.737947]  [<ffffffff810147f9>] do_signal+0x29/0x130
	[  480.737951]  [<ffffffff81185e70>] ? vfs_write+0x110/0x180
	[  480.737954]  [<ffffffff810149b0>] do_notify_resume+0x90/0xd0
	[  480.737958]  [<ffffffff816a7b62>] int_signal+0x12/0x17
	[  600.736070] INFO: task kworker/u:0:6 blocked for more than 120 seconds.
	[  600.738168] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[  600.740817] kworker/u:0     D 0000000000000001     0     6      2 0x00000000
	[  600.740823]  ffff88007d3ebb20 0000000000000046 000000000000338c ffffffff81c20ac0
	[  600.740827]  ffff88007d3ebfd8 ffff88007d3ebfd8 ffff88007d3ebfd8 0000000000013dc0
	[  600.740831]  ffff880036c2c470 ffff88007d3f0000 ffff88007d3ebb20 ffff88007fd14660
	[  600.740835] Call Trace:
	[  600.740847]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  600.740854]  [<ffffffff8169e129>] schedule+0x29/0x70
	[  600.740857]  [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0
	[  600.740861]  [<ffffffffa02a73ee>] do_io_schedule+0xe/0x20 [dm_bufio]
	[  600.740867]  [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90
	[  600.740884]  [<ffffffffa02a73e0>] ? submit_io+0x80/0x80 [dm_bufio]
	[  600.740894]  [<ffffffff8169cb8c>] out_of_line_wait_on_bit+0x7c/0x90
	[  600.740900]  [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40
	[  600.740905]  [<ffffffffa02a7d96>] dm_bufio_write_dirty_buffers+0x76/0x1c0 [dm_bufio]
	[  600.740912]  [<ffffffffa02b4cf3>] dm_bm_flush_and_unlock+0x33/0x70 [dm_persistent_data]
	[  600.740917]  [<ffffffffa02b7497>] dm_tm_commit+0x37/0x50 [dm_persistent_data]
	[  600.740923]  [<ffffffffa02cd49d>] __commit_transaction.part.12+0x11d/0x130 [dm_thin_pool]
	[  600.740928]  [<ffffffffa02cd4ed>] __commit_transaction+0x3d/0x50 [dm_thin_pool]
	[  600.740933]  [<ffffffffa02cdff2>] dm_pool_commit_metadata+0x42/0x70 [dm_thin_pool]
	[  600.740937]  [<ffffffffa02c981d>] commit_or_fallback+0x2d/0x60 [dm_thin_pool]
	[  600.740941]  [<ffffffffa02c9c44>] process_deferred_bios+0x174/0x230 [dm_thin_pool]
	[  600.740945]  [<ffffffffa02c9d50>] do_worker+0x50/0x60 [dm_thin_pool]
	[  600.740951]  [<ffffffff81076176>] process_one_work+0x136/0x550
	[  600.740955]  [<ffffffffa02c9d00>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
	[  600.740959]  [<ffffffff81077375>] worker_thread+0x165/0x3c0
	[  600.740974]  [<ffffffff81077210>] ? manage_workers+0x190/0x190
	[  600.740985]  [<ffffffff8107c3a3>] kthread+0x93/0xa0
	[  600.740990]  [<ffffffff816a8944>] kernel_thread_helper+0x4/0x10
	[  600.740994]  [<ffffffff8107c310>] ? flush_kthread_worker+0xb0/0xb0
	[  600.740997]  [<ffffffff816a8940>] ? gs_change+0x13/0x13
	[  600.741032] INFO: task dd:2808 blocked for more than 120 seconds.
	[  600.742759] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[  600.745145] dd              D ffffffff8180cd00     0  2808   2541 0x00000006
	[  600.745149]  ffff8800795df9f8 0000000000000046 0000000000006446 ffffffff81c20ac0
	[  600.745152]  ffff8800795dffd8 ffff8800795dffd8 ffff8800795dffd8 0000000000013dc0
	[  600.745155]  ffffffff81c13440 ffff88007bbbc470 ffff8800795df9f8 ffff88007fc14660
	[  600.745158] Call Trace:
	[  600.745164]  [<ffffffff8112c680>] ? __lock_page+0x70/0x70
	[  600.745167]  [<ffffffff8169e129>] schedule+0x29/0x70
	[  600.745169]  [<ffffffff8169e1ff>] io_schedule+0x8f/0xd0
	[  600.745172]  [<ffffffff8112c68e>] sleep_on_page+0xe/0x20
	[  600.745175]  [<ffffffff8169cadf>] __wait_on_bit+0x5f/0x90
	[  600.745178]  [<ffffffff8112d29b>] ? find_get_pages_tag+0xcb/0x170
	[  600.745181]  [<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80
	[  600.745196]  [<ffffffff8107ccc0>] ? autoremove_wake_function+0x40/0x40
	[  600.745215]  [<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0
	[  600.745219]  [<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30
	[  600.745222]  [<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60
	[  600.745226]  [<ffffffff811be9e1>] __sync_blockdev+0x21/0x40
	[  600.745229]  [<ffffffff811bea13>] sync_blockdev+0x13/0x20
	[  600.745231]  [<ffffffff811bea89>] __blkdev_put+0x69/0x1c0
	[  600.745235]  [<ffffffff81178414>] ? kfree+0x114/0x140
	[  600.745238]  [<ffffffff811bec3b>] blkdev_put+0x5b/0x160
	[  600.745240]  [<ffffffff811bed65>] blkdev_close+0x25/0x30
	[  600.745246]  [<ffffffff81186d4e>] __fput+0xbe/0x240
	[  600.745260]  [<ffffffff81186ede>] ____fput+0xe/0x10
	[  600.745264]  [<ffffffff81078fd2>] task_work_run+0x72/0x90
	[  600.745269]  [<ffffffff8105d780>] do_exit+0x410/0x470
	[  600.745274]  [<ffffffff8130c578>] ? blk_finish_plug+0x18/0x50
	[  600.745277]  [<ffffffff8106ae4a>] ? __dequeue_signal+0x6a/0xb0
	[  600.745280]  [<ffffffff8105d984>] do_group_exit+0x44/0xa0
	[  600.745284]  [<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480
	[  600.745294]  [<ffffffff810147f9>] do_signal+0x29/0x130
	[  600.745301]  [<ffffffff81185e70>] ? vfs_write+0x110/0x180
	[  600.745304]  [<ffffffff810149b0>] do_notify_resume+0x90/0xd0
	[  600.745307]  [<ffffffff816a7b62>] int_signal+0x12/0x17


Excerpt from ps:
	root      2808  0.0  0.0      0     0 pts/1    D+   12:42   0:00 [dd]
	root      2679  0.0  0.0      0     0 ?        D<   12:42   0:00 [loop0]

Note that I have already sent kill -9 to dd; that's why it appears as [dd].

Excerpt from iostat -x 1:

dm-9              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-10             0.00     0.00    0.00    0.00     0.00     0.00     0.00    10.00    0.00    0.00    0.00   0.00 100.00
dm-11             0.00     0.00    0.00    0.00     0.00     0.00     0.00  6895.00    0.00    0.00    0.00   0.00 100.00
dm-12             0.00     0.00    0.00    0.00     0.00     0.00     0.00  6895.00    0.00    0.00    0.00   0.00 100.00
dm-13             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-14             0.00     0.00    0.00    0.00     0.00     0.00     0.00 51190.00    0.00    0.00    0.00   0.00 100.00

The numbers for dm-11 and dm-12 go down very slowly after killing dd, roughly one point per second.
I don't see the dm-14 number (51190.00) going down at all in 10 minutes.

For reference here are the dm numbers:
	lrwxrwxrwx  1 root root       8 Oct 16 12:42 vgt10-lvt10_1 -> ../dm-14
	lrwxrwxrwx  1 root root       8 Oct 16 12:42 vgt10-pool10 -> ../dm-13
	lrwxrwxrwx  1 root root       8 Oct 16 12:42 vgt10-pool10_tdata -> ../dm-11
	lrwxrwxrwx  1 root root       8 Oct 16 12:42 vgt10-pool10_tmeta -> ../dm-10
	lrwxrwxrwx  1 root root       8 Oct 16 12:42 vgt10-pool10-tpool -> ../dm-12
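
(For what it's worth, the same device stack can also be inspected directly with dmsetup, for example:)

dmsetup ls --tree      # shows how the thin LV, tpool, tdata and tmeta devices stack
dmsetup status         # one status line per dm device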

Here are the stack traces for loop0 and dd (note that I have already sent kill -9 to dd, but its console is still hung):

cat /proc/2808/stack  (dd)
	[<ffffffff8112c68e>] sleep_on_page+0xe/0x20
	[<ffffffff8112c7e8>] wait_on_page_bit+0x78/0x80
	[<ffffffff8112c8fc>] filemap_fdatawait_range+0x10c/0x1a0
	[<ffffffff8112c9bb>] filemap_fdatawait+0x2b/0x30
	[<ffffffff8112ecf4>] filemap_write_and_wait+0x44/0x60
	[<ffffffff811be9e1>] __sync_blockdev+0x21/0x40
	[<ffffffff811bea13>] sync_blockdev+0x13/0x20
	[<ffffffff811bea89>] __blkdev_put+0x69/0x1c0
	[<ffffffff811bec3b>] blkdev_put+0x5b/0x160
	[<ffffffff811bed65>] blkdev_close+0x25/0x30
	[<ffffffff81186d4e>] __fput+0xbe/0x240
	[<ffffffff81186ede>] ____fput+0xe/0x10
	[<ffffffff81078fd2>] task_work_run+0x72/0x90
	[<ffffffff8105d780>] do_exit+0x410/0x470
	[<ffffffff8105d984>] do_group_exit+0x44/0xa0
	[<ffffffff8106d97b>] get_signal_to_deliver+0x23b/0x480
	[<ffffffff810147f9>] do_signal+0x29/0x130
	[<ffffffff810149b0>] do_notify_resume+0x90/0xd0
	[<ffffffff816a7b62>] int_signal+0x12/0x17
	[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/2679/stack  (loop0)
	[<ffffffff81689468>] balance_dirty_pages.isra.18+0x491/0x56d
	[<ffffffff81137f8b>] balance_dirty_pages_ratelimited_nr+0xeb/0x110
	[<ffffffff8112cb5f>] generic_perform_write+0x16f/0x210
	[<ffffffff8112cc5d>] generic_file_buffered_write+0x5d/0x90
	[<ffffffff8112e746>] __generic_file_aio_write+0x1b6/0x3b0
	[<ffffffff8112e9bf>] generic_file_aio_write+0x7f/0x100
	[<ffffffff8121c2a3>] ext4_file_write+0x93/0xe0
	[<ffffffff81185563>] do_sync_write+0xa3/0xe0
	[<ffffffff8143e624>] __do_lo_send_write+0x54/0xa0
	[<ffffffff8143e943>] do_lo_send_direct_write+0x73/0xa0
	[<ffffffff8143fff0>] lo_send+0xa0/0x100
	[<ffffffff8144024e>] do_bio_filebacked+0xde/0x150
	[<ffffffff81440c42>] loop_thread+0xc2/0x250
	[<ffffffff8107c3a3>] kthread+0x93/0xa0
	[<ffffffff816a8944>] kernel_thread_helper+0x4/0x10
	[<ffffffffffffffff>] 0xffffffffffffffff


Here is top:
top - 12:58:34 up 18 min,  4 users,  load average: 3.24, 2.99, 2.00
Tasks: 165 total,   1 running, 164 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.0%sy,  0.0%ni,  0.0%id, 99.7%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2051176k total,   638444k used,  1412732k free,   258452k buffers
Swap:        0k total,        0k used,        0k free,   146980k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                              
 2824 root      20   0 17340 1292  904 S    0  0.1   0:00.61 top                                                                                  
    1 root      20   0 24336 2264 1280 S    0  0.1   0:01.39 init                                                                                 
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                             
    3 root      20   0     0    0    0 S    0  0.0   0:00.08 ksoftirqd/0                                                                          
    5 root       0 -20     0    0    0 S    0  0.0   0:00.00 kworker/0:0H               

As you can see, we are at 0% CPU usage.


> You have an open ended dd there, so eventually you'll run out of space (once
> you write 400M to fill the thin volume).

At 40 KB/sec it would have taken about 2.7 hours to complete, and I didn't wait that long. It is possible that after those 2.7 hours the system could have recovered without data loss, but only because the volume is just 400 MB; if it were 40 GB it would be unthinkable to wait.
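
(The rough arithmetic behind that estimate, using the figures above:)

echo $(( 400 * 1000 / 40 ))    # 400 MB at ~40 KB/s -> 10000 seconds, roughly 2.7-2.8 hours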


> If there isn't anything useful in the kernel log then we'll need traces of
> the relevant tasks (e.g. the dd task) via: echo t > /proc/sysrq-trigger

I'm attaching the file, called dmesg3-post2sysrq.txt; unfortunately the output seems to be bigger than the kernel buffer, so it's truncated at the top.

I am also noticing that I/O writes to the root filesystem (outside thin) are terribly slowed down; it took about 5 seconds to dump this dmesg to a file.

> Again, we need to know the version of lvm2. 

see above

> There may have been a bug in lvm2's interface with the thin_check utility.

thin_check doesn't complain if the machine is rebooted in any situation other than during this bug. It doesn't complain after clean unmounts, so I believe the metadata really is corrupt. thin_check is the latest version as of today, compiled from git source.

Maybe thin_check could be improved so as to recover data even in this situation, but I would rather suggest fixing the kernel so that it does not cause the bug in the first place. Note that there is also the thinp hangup, which by itself would cause problems in production even if there were no data loss.

I have checked that, also with 3.6.2, the pool is corrupt after "reboot -nf" ("reboot -f" would hang; please note that no I/O commands should be lost with this procedure, because the virtualization host is never rebooted), and the latest version of thin_check is not able to recover it. So I would say there is (also?) a bug in the ordering of writes or the use of barriers in dm-thin.

Thanks for your work
Comment 5 Kurk 2012-10-16 12:57:37 UTC
Created attachment 83621 [details]
Dump of info from proc. See long comment

Since the output of echo t > /proc/sysrq-trigger was too long for the kernel buffer, I decided to gather info from /proc instead. This is a dump of all processes, their stack traces, and their status from /proc.
Please note that this is a *different reboot* from my previous post and the sysrq-trigger dump, so PIDs won't match that one. Also, this was captured before sending kill -9 to dd.
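
For reference, a loop along these lines is enough to gather such a dump (a sketch, not the exact script used for the attachment):

for pid in /proc/[0-9]*; do
    echo "===== $pid ====="
    cat "$pid/status" "$pid/stack" 2>/dev/null
done > proc-dump.txt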
Comment 6 Kurk 2012-10-16 17:42:06 UTC
Hi Mike, if the bug doesn't reproduce on your system, probably the easiest thing you can try is to increase the size of the thin pool and the LV.
I specified a size adequate to cause the problem reliably on my system, but I have noticed that a smaller size does not cause it, so the sizes I chose might be too small for the general case.

Try with all sizes multiplied by 10:

dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=10000
losetup /dev/loop0 ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 9000M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 4000M --thinpool vgt10/pool10
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M

This is much easier to try than the next approach that comes to my mind, which is a KVM virtual machine with a sparse file as the virtual disk backend...
Comment 7 Mike Snitzer 2012-10-16 19:01:37 UTC
(In reply to comment #6)
> Hi Mike, if the bug doesn't reproduce on your system, probably the easiest
> thing you can try is to increase the size of the thin pool and the LV.
> I specified a size adequate to cause the problem reliably on my system, but I
> have noticed that a smaller size does not cause it, so the sizes I chose might
> be too small for the general case.
> 
> Try with all sizes multiplied by 10:

My test system has 12G of physical memory so I went with the following:

dd if=/dev/zero of=/mnt/ttest10backend bs=1M count=0 seek=15000
losetup /dev/loop0 /mnt/ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 14000M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 12000M --thinpool vgt10/pool10
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M

But the dd still completed:
dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device
12001+0 records in
12000+0 records out
12582912000 bytes (13 GB) copied, 347.912 s, 36.2 MB/s

Though there was a significant stall, before dd completed, due to the VM needing to free memory via shrinking:

dd              D ffffffff8160f420     0 25374  25091 0x00000000
 ffff880280d8d3a8 0000000000000086 ffff880280d8dfd8 00000000000128c0
 ffff880280d8c010 00000000000128c0 00000000000128c0 00000000000128c0
 ffff880280d8dfd8 00000000000128c0 ffff8803324b8ad0 ffff8802ef65ca90
Call Trace:
 [<ffffffff814d58e9>] schedule+0x29/0x70
 [<ffffffff814d5bde>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff814d4353>] __mutex_lock_slowpath+0xd3/0x150
 [<ffffffff814d425b>] mutex_lock+0x2b/0x50
 [<ffffffff81138dd0>] page_referenced_file+0x50/0x120
 [<ffffffff81139839>] page_referenced+0x79/0x1f0
 [<ffffffff81118e9f>] shrink_active_list+0x1cf/0x310
 [<ffffffff8111915f>] shrink_lruvec+0x17f/0x3a0
 [<ffffffff8111947b>] shrink_zones+0xfb/0x200
 [<ffffffff8111961e>] do_try_to_free_pages+0x9e/0x340
 [<ffffffff81119a6a>] try_to_free_pages+0xda/0x170
 [<ffffffff8110e8a2>] __alloc_pages_slowpath+0x392/0x7c0
 [<ffffffff8110eeea>] __alloc_pages_nodemask+0x21a/0x230
 [<ffffffff81149c66>] alloc_pages_current+0xb6/0x120
 [<ffffffff81105d27>] __page_cache_alloc+0xa7/0xc0
 [<ffffffff811069ac>] grab_cache_page_write_begin+0x7c/0xe0
 [<ffffffff81196040>] ? blkdev_max_block+0x40/0x40
 [<ffffffff81193348>] block_write_begin+0x38/0x90
 [<ffffffff81197093>] blkdev_write_begin+0x23/0x30
 [<ffffffff81104e2a>] generic_perform_write+0xca/0x200
 [<ffffffff81104fc4>] generic_file_buffered_write+0x64/0xa0
 [<ffffffff81107a6f>] __generic_file_aio_write+0x1bf/0x3b0
 [<ffffffff811965f7>] blkdev_aio_write+0x57/0xc0
 [<ffffffff8115f9da>] do_sync_write+0xaa/0xf0
 [<ffffffff8115ff48>] vfs_write+0xc8/0x190
 [<ffffffff81160111>] sys_write+0x51/0x90
 [<ffffffff814de8e9>] system_call_fastpath+0x16/0x1b

It should be noted that the dm-bufio layer does interface with the shrinker, so its memory will be reclaimed should memory get tight.

But I still couldn't reproduce the dm_pool_commit_metadata hang you illustrated with dm_bufio_write_dirty_buffers at the end of the trace.
Comment 8 Kurk 2012-10-16 22:11:04 UTC
I have 2GB virtual RAM, 2 vCPUs, and no swap.
free -m
             total       used       free     shared    buffers     cached
Mem:          2003        259       1743          0         51         58
-/+ buffers/cache:        149       1854
Swap:            0          0          0
It *might* be easier and more effective to reduce RAM in your VM, than it is to enlarge the sparse file further...
Comment 9 Mike Snitzer 2012-10-16 22:52:16 UTC
(In reply to comment #8)
> I have 2GB virtual RAM, 2 vCPUs, and no swap.
> free -m
>              total       used       free     shared    buffers     cached
> Mem:          2003        259       1743          0         51         58
> -/+ buffers/cache:        149       1854
> Swap:            0          0          0
> It *might* be easier and more effective to reduce RAM in your VM, than it is to
> enlarge the sparse file further...

I tested with mem=2048M and it worked fine too.

Have you tried this procedure on bare metal?
Is it possible that the KVM guest is somehow enabling you to hit this?
Comment 10 Mike Snitzer 2012-10-16 23:14:40 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I have 2GB virtual RAM, 2 vCPUs, and no swap.
> > free -m
> >              total       used       free     shared    buffers     cached
> > Mem:          2003        259       1743          0         51         58
> > -/+ buffers/cache:        149       1854
> > Swap:            0          0          0
> > It *might* be easier and more effective to reduce RAM in your VM, than it is to
> > enlarge the sparse file further...
> 
> I tested with mem=2048M and it worked fine too.

I disabled swap and I now see:

kworker/u:0     D ffffffff8160f1a0     0 16590      2 0x00000000
 ffff880018e93ae0 0000000000000046 ffff880018e93fd8 00000000000128c0
 ffff880018e92010 00000000000128c0 00000000000128c0 00000000000128c0
 ffff880018e93fd8 00000000000128c0 ffffffff81a13420 ffff880050232100
Call Trace:
 [<ffffffffa01c26a0>] ? read_endio+0x40/0x40 [dm_bufio]
 [<ffffffff814d58e9>] schedule+0x29/0x70
 [<ffffffff814d59bc>] io_schedule+0x8c/0xd0
 [<ffffffffa01c26ae>] do_io_schedule+0xe/0x20 [dm_bufio]
 [<ffffffff814d3eef>] __wait_on_bit+0x5f/0x90
 [<ffffffffa01c26a0>] ? read_endio+0x40/0x40 [dm_bufio]
 [<ffffffff814d3f98>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81065990>] ? autoremove_wake_function+0x50/0x50
 [<ffffffffa01c30c1>] ? __write_dirty_buffers_async+0x71/0xe0 [dm_bufio]
 [<ffffffffa01c3266>] dm_bufio_write_dirty_buffers+0x86/0x1e0 [dm_bufio]
 [<ffffffffa030a218>] dm_bm_flush_and_unlock+0x48/0x70 [dm_persistent_data]
 [<ffffffffa030d422>] dm_tm_commit+0x32/0x40 [dm_persistent_data]
 [<ffffffffa04e523d>] __commit_transaction+0x22d/0x240 [dm_thin_pool]
 [<ffffffff8122292a>] ? generic_make_request+0xca/0x100
 [<ffffffffa04e1235>] ? issue+0x35/0xe0 [dm_thin_pool]
 [<ffffffffa04e52b0>] dm_pool_commit_metadata+0x60/0x80 [dm_thin_pool]
 [<ffffffffa04e0611>] commit_or_fallback+0x41/0x70 [dm_thin_pool]
 [<ffffffffa04e2eb4>] process_deferred_bios+0x174/0x230 [dm_thin_pool]
 [<ffffffffa04e2f70>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
 [<ffffffffa04e2fc1>] do_worker+0x51/0x60 [dm_thin_pool]
 [<ffffffff8105dac9>] process_one_work+0x179/0x4b0
 [<ffffffff8105f93e>] worker_thread+0x12e/0x330
 [<ffffffff8105f810>] ? manage_workers+0x190/0x190
 [<ffffffff8106529e>] kthread+0x9e/0xb0
 [<ffffffff814df984>] kernel_thread_helper+0x4/0x10
 [<ffffffff81065200>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814df980>] ? gs_change+0x13/0x13

But progress was still being made with periodic bursts of IO.  The dd completed, albeit slower:

dd: writing `/dev/mapper/vgt10-lvt10_1': No space left on device
12001+0 records in
12000+0 records out
12582912000 bytes (13 GB) copied, 583.837 s, 21.6 MB/s

You said "I am also noticing that I/O writes to root filesystem (outside thin) is terribly slowed down, it took like 5 seconds for dumping this dmesg to file."

This seems like a generic VM scalability limitation under low-memory conditions. Your system is particularly incapable of handling the load; consider that I can successfully write 12G to a thin volume with the same memory config (2G, no swap), yet you cannot even write 400M.
Comment 11 Kurk 2012-10-17 17:18:03 UTC
Hello Mike
To answer your question, no, I have not tried to reproduce the problem on hardware hosts. I don't really have hardware hosts available for testing right now.

What hypervisor do you use? And what CPU do you have on the host machine? 
Mine is a rather old E5420, which does not have very good virtualization performance.

Anyway:

I remembered that I have nonstandard virtual memory and scheduler settings in the VM.
Playing with a few of the virtual memory and scheduler settings, sometimes the bug did not reproduce on my VM.
But even with default settings, sometimes the bug did reproduce.
With my custom settings it seems to reproduce more often.
So see if you can reproduce it like this:

Set the VM memory at 2GB Ram, 2vCPU, no swap

# Reboot and login. It appears easier to reproduce the bug with a VM which has just rebooted.

# Virtual memory and scheduler settings
echo 0 > /proc/sys/vm/swappiness
echo 10000000 > /proc/sys/vm/dirty_background_bytes
echo 200000000 > /proc/sys/vm/dirty_bytes
echo 350 > /proc/sys/vm/dirty_expire_centisecs
echo 200 > /proc/sys/vm/dirty_writeback_centisecs
echo cfq > /sys/block/vda/queue/scheduler   #Your VM has vda, i.e. kvm virtio?
echo 4 > /sys/block/vda/queue/nr_requests

# Do not remove previous sparse file, create a new one

mkdir thin_bug_48851
cd thin_bug_48851
cd $(mktemp --tmpdir=. -d)

sleep 5

dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
losetup /dev/loop0 ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 900M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
sync
sleep 5
sync
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M
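
For reference, the VM tunables set with echo at the top of this sequence can equivalently be set through sysctl (same values, just a different spelling):

sysctl -w vm.swappiness=0
sysctl -w vm.dirty_background_bytes=10000000
sysctl -w vm.dirty_bytes=200000000
sysctl -w vm.dirty_expire_centisecs=350
sysctl -w vm.dirty_writeback_centisecs=200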


I have noticed that the earliest symptom of the bug is visible in iostat -x 1:

dm-10             0.00     0.00    0.00    0.00     0.00     0.00     0.00    11.00    0.00    0.00    0.00   0.00 100.00
dm-11             0.00     0.00    0.00    0.00     0.00     0.00     0.00  9432.00    0.00    0.00    0.00   0.00 100.00
dm-12             0.00     0.00    0.00    0.00     0.00     0.00     0.00  9432.00    0.00    0.00    0.00   0.00 100.00
dm-13             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-14             0.00     0.00    0.00    0.00     0.00     0.00     0.00 48814.00    0.00    0.00    0.00   0.00 100.00

When you see the numbers for _tdata, _tmeta, and -tpool stuck for at least 4-5 seconds (the 11.00 and the two 9432.00 in the example above), you have hit the bug and it will never resolve. If numbers appear in those lines for only a few seconds, it can still resolve.

By contrast, a number stuck in the thin LV line (dm-14 in the example above) is not significant.

for reference:
lrwxrwxrwx  1 root root       8 Oct 17 19:06 vgt10-lvt10_1 -> ../dm-14
lrwxrwxrwx  1 root root       8 Oct 17 19:06 vgt10-pool10 -> ../dm-13
lrwxrwxrwx  1 root root       8 Oct 17 19:06 vgt10-pool10_tdata -> ../dm-11
lrwxrwxrwx  1 root root       8 Oct 17 19:06 vgt10-pool10_tmeta -> ../dm-10
lrwxrwxrwx  1 root root       8 Oct 17 19:06 vgt10-pool10-tpool -> ../dm-12

Let me know if you can reproduce it with the sequence above.
Comment 12 Kurk 2012-10-18 12:52:52 UTC
The bug is reproducible also with 1 vCPU, with the sequence in my comment #11.
Comment 13 Kurk 2012-10-18 22:42:10 UTC
Mike,
it turns out that my hardware node had a partially defective disk which was delaying I/O and this allowed me to trigger the race more easily than normal.

I have been trying to find a better reproducer.
Here is a new version of the reproducer which works so reliably for me that I hope it can work even on good hardware.
(Tested with 1..4 vCPU and 1..2 GB RAM, KVM virtual machine)

You need 2 shells, at least if you want to look at "iostat -x 1" which is *very* interesting...


#--- Shell#1:

rm -f ttest10backend #deleting old attempts
dd if=/dev/zero of=ttest10backend bs=1M count=0 seek=1000
losetup /dev/loop0 ttest10backend
pvcreate /dev/loop0
vgcreate vgt10 /dev/loop0
lvcreate --size 900M --thinpool vgt10/pool10
lvcreate --name lvt10_1 --virtualsize 400M --thinpool vgt10/pool10
sync

#--- Shell#2
iostat -x 1


#--- Shell#1:
# Paste all this text *simultaneously*

TIME_DD_TO_RESUME=5
TIME_RESUME_TO_REBOOT=20

echo suspending
dmsetup suspend vgt10-pool10_tmeta
echo suspend finished
dd if=/dev/zero of=/dev/mapper/vgt10-lvt10_1 bs=1M &
echo sleeping $TIME_DD_TO_RESUME
sleep $TIME_DD_TO_RESUME
echo resuming
dmsetup resume vgt10-pool10_tmeta
echo resume finished
#
# Paste also the following part if you want to see the metadata corruption bug
#
echo sleeping $TIME_RESUME_TO_REBOOT
sleep $TIME_RESUME_TO_REBOOT
echo rebooting
reboot -nf


#-- Upon reboot --
losetup /dev/loop0 ttest10backend
vgchange -ay vgt10

#outputs:
#  Check of thin pool vgt10/pool10 failed (status:1). Manual repair required
# (thin_dump --repair /dev/mapper/vgt10-pool10_tmeta)!
#  0 logical volume(s) in volume group "vgt10" now active


Please note:
At least the variable TIME_DD_TO_RESUME will require manual tweaking.
If you set it very high, like 15 seconds, you are certain to hit the hangup bug, but you might not see the corruption bug (i.e. upon reboot the metadata won't be corrupt).
If you set it low, like 5 seconds, you *should* hit the corruption bug rather easily, but the hangup bug (supposing you don't do reboot -nf) might not be very apparent.
But if you set it too low, like 2 seconds, you might hit neither of the two bugs. The right value probably depends on your hardware.
It is more difficult to hit the corruption bug than the hangup bug. Please make sure you fix the corruption bug before the hangup bug, because it will be very hard to reproduce the metadata corruption bug once you have fixed the hangup bug.
I am not convinced that the two bugs are the same bug.
Thanks
Comment 14 Mike Snitzer 2012-10-30 16:24:35 UTC
You're queuing a bunch of in-flight buffered IO and then forcibly killing the machine.

It is likely that the "Check of thin pool vgt10/pool10 failed (status:1)." message is side-effect of reference count mismatches that are easily ignored.  But because lvm2 isn't currently able to know the error is safe to continue from.

Please provide the output from the following once you've reproduced the so-called "corruption":

/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
Comment 15 Kurk 2012-10-30 22:43:40 UTC
Hi Mike, 
have you tried whether my last reproducer works for you?

Anyway, to answer your question,
indeed it's reference count mismatches,
in one case it was just:

-------------
/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
Errors in metadata
  Errors in data block reference counts
    0: was 1, expected 0
-------------

in another case it was much longer:

-------------
/usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
Errors in metadata
  Errors in data block reference counts
    4016: was 1, expected 0
    4017: was 1, expected 0
    4018: was 1, expected 0
    .....
    ..... ALL numbers up to 4415 ....
    .....
    4415: was 1, expected 0
-------------

So do you think it can be safely ignored, and that everything is still semantically correct?

Please note there is still the hangup problem. Were you able to reproduce that?

Thank you
Comment 16 Kurk 2012-11-10 21:02:57 UTC
Hello Mike, is there any news on this bug?
We would badly need to use these cool thinp features...
Thank you
Comment 17 Mike Snitzer 2012-11-30 15:39:25 UTC
(In reply to comment #15)
> Hi Mike, 
> have you tried whether my last reproducer works for you?
> 
> Anyway, to answer your question,
> indeed it's reference count mismatches,
> in one case it was just:
> 
> -------------
> /usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
> Errors in metadata
>   Errors in data block reference counts
>     0: was 1, expected 0
> -------------
> 
> in another case it was much longer:
> 
> -------------
> /usr/sbin/thin_check /dev/mapper/vgt10-pool10_tmeta
> Errors in metadata
>   Errors in data block reference counts
>     4016: was 1, expected 0
>     4017: was 1, expected 0
>     4018: was 1, expected 0
>     .....
>     ..... ALL numbers up to 4415 ....
>     .....
>     4415: was 1, expected 0
> -------------
> 
> So do you think it can be safely ignored, and that everything is still semantically correct?

thin_check doesn't actually repair anything; it just reports.

These refcount mismatches are expected but you'll need to repair (to eliminate the mismatches) by doing a thin_dump and then thin_restore, e.g.:

thin_dump /dev/mapper/vgt10-pool10_tmeta > thin-pool-metadata
thin_restore -i thin-pool-metadata -o /dev/mapper/vgt10-pool10_tmeta
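
After the restore it should be possible to verify the metadata and retry the activation that previously failed; a sketch, assuming the pool is kept inactive while its _tmeta device is dumped and restored:

thin_check /dev/mapper/vgt10-pool10_tmeta    # should now report no errors
vgchange -ay vgt10                           # retry activating the thin pool and volumes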
Comment 18 Alasdair G Kergon 2012-12-03 19:09:19 UTC
So this isn't a bug.

The messages mean it was releasing blocks that were preallocated for the I/O that got interrupted and so never completed.

The userspace tools and/or documentation need to be improved to separate 'normal' metadata corrections from genuine errors.
Comment 19 Alasdair G Kergon 2012-12-03 19:11:09 UTC
And we need to offer a trivial way to perform this sort of 'repair' in userspace without causing undue worry.
Comment 20 Kurk 2012-12-12 15:47:06 UTC
Hello, sorry for the delay. I confirm that in my cases the pool was accessible again after exporting and re-importing the metadata. So all is good, and thanks; the main problem is that this kind of fix is not intuitive. Especially after seeing "Manual repair required", my interpretation was that I had to fix the XML by hand.
