Bug 10242

Summary: rm command hangs
Product: IO/Storage
Reporter: Jean-Luc Coulon (jean.luc.coulon)
Component: LVM2/DM
Assignee: Alasdair G Kergon (agk)
Status: CLOSED CODE_FIX
Severity: high
CC: bunk, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25-rc5
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 9832
Attachments: task blocked, syslog for w to sysrq-trigger

Description Jean-Luc Coulon 2008-03-14 05:47:07 UTC
Latest working kernel version: 2.6.24
Earliest failing kernel version: 2.6.25-rc5-git4
Distribution: Debian/Sid
Hardware Environment: ASUS A8V, Athlon 64 x2 4200+
Software Environment: raid1, cryptsetup (luks), lvm2, xfs
Problem Description: An rm command "sometimes" stalls. The files are deleted, but the xterm or console is frozen.
Running strace on the pid stalls as well, with no output after the "attached to pid" message.
After that, it is impossible to sync the filesystem (the command hangs) or to umount the device (busy).
I've never seen this problem with 2.6.24 (this doesn't mean it doesn't exist). It may not have been present in 2.6.25-rc2, but I didn't use that kernel much.
I hit it once or twice a day on 2.6.25-rc5.
The rm process is not killable. I need to reboot to get rid of it. After the journal is replayed, the filesystem doesn't appear to be corrupted (xfs_check doesn't report any errors).

Steps to reproduce: rm -rf /xxx/xxxx
(I mostly hit it when cleaning a tree via a script after building a Debian package on my machine.)

The filesystem is an xfs filesystem.
It is built on a raid1, encrypted with cryptsetup using luks, and it is an lvm2 logical volume over this raid1.
Comment 1 Eric Sandeen 2008-03-14 07:28:01 UTC
Does it only happen on luks volumes?

Try "echo w > /proc/sysrq-trigger" to see which tasks are in uninterruptible (blocked) state. If possible, attach the output here.

-Eric
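
As an unprivileged complement to the sysrq dump suggested above, the following is a minimal sketch that lists processes currently in state D (uninterruptible sleep) by reading /proc/<pid>/stat. The function names parse_stat and blocked_tasks are hypothetical helpers, not part of any tool mentioned in this report. Unlike sysrq-w, this does not show kernel stack traces, and it only looks at the state recorded in each process's own stat file.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes whose state field is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

A process that stays in this list across repeated runs, such as the stuck rm described here, is a likely candidate for the hang; short-lived entries are normal under I/O load.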
Comment 2 Jean-Luc Coulon 2008-03-14 08:32:58 UTC
Created attachment 15267
task blocked, syslog for w to sysrq-trigger

I've never had the problem on a non-luks volume, but my non-luks volumes see little write/delete activity (root filesystem, /usr).

I've also had the problem not while doing an rm but while running a C++ compilation.

J-L
Comment 3 Alasdair G Kergon 2008-03-14 11:38:58 UTC
assume this is the known dm-crypt regression - we're working on a patch
Comment 4 Alasdair G Kergon 2008-03-14 11:42:09 UTC
(a ref counting bug meaning that in certain circumstances the dm-crypt layer holds onto i/o forever and never reports it completed)
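
To illustrate the class of bug described above, here is a toy model, not the dm-crypt code: an I/O request reports completion only when its pending count drops to zero, so a single reference that is taken but never dropped leaves the request outstanding forever. The names ToyIo, process and leak_on_retry are hypothetical, and the "retry path" is an assumed trigger chosen only for the sketch.

```python
class ToyIo:
    """Toy I/O request: completion is reported when pending drops to zero."""

    def __init__(self):
        self.pending = 1          # reference held by the submitter
        self.completed = False

    def get(self):
        self.pending += 1

    def put(self):
        self.pending -= 1
        if self.pending == 0:
            # Stands in for reporting completion to the layer above.
            self.completed = True


def process(io, chunks, leak_on_retry=False):
    """Process chunks of one request; each chunk is True if it needs a retry."""
    for needs_retry in chunks:
        io.get()                  # reference for this chunk
        if needs_retry and leak_on_retry:
            # Hypothetical bug: the retry path takes an extra reference
            # that nothing ever drops.
            io.get()
        io.put()                  # chunk finished
    io.put()                      # submitter drops its own reference
    return io


if __name__ == "__main__":
    ok = process(ToyIo(), [False, True])
    stuck = process(ToyIo(), [False, True], leak_on_retry=True)
    print("balanced:", ok.completed, "leaky:", stuck.completed)
```

In the leaky case the count never reaches zero, so whoever is waiting on the request (here that would be the filesystem, and above it rm or sync) waits indefinitely, which matches the symptom of an unkillable process.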
Comment 5 Rafael J. Wysocki 2008-03-14 13:02:12 UTC
Is it a duplicate of Bug #10207?
Comment 6 Adrian Bunk 2008-03-14 13:08:26 UTC
Most likely we won't know for sure whether it's the same as bug #10207 until there's a fix for which Jean-Luc can verify whether or not it fixes the problem for him?
Comment 7 Milan Broz 2008-03-15 03:00:42 UTC
> Most likely we won't know for sure whether it's the same as bug #10207 until
> there's a fix for which Jean-Luc can verify whether or not it fixes the
> problem for him?

Please try the patch at http://lkml.org/lkml/2008/3/14/347
Comment 8 Jean-Luc Coulon 2008-03-16 07:18:50 UTC
I've tested the patch (on 2.6.25-rc5-git4).
I've stressed the system a bit and have not seen the problem again so far.

Jean-Luc
Comment 9 Milan Broz 2008-03-17 12:26:32 UTC
Latest patch for dm-crypt in http://lkml.org/lkml/2008/3/17/214
(the same patch mentioned in bug 10207)
Comment 10 Alasdair G Kergon 2008-03-24 05:48:40 UTC
Please test the patch in comment 9 - I think that one's ready to submit.
Comment 11 Rafael J. Wysocki 2008-03-27 15:37:36 UTC
Patch: http://lkml.org/lkml/2008/3/27/293
Comment 12 Rafael J. Wysocki 2008-03-27 15:38:34 UTC
*** Bug 10207 has been marked as a duplicate of this bug. ***
Comment 13 Adrian Bunk 2008-03-28 15:23:43 UTC
fixed by commit 3f1e9070f63b0eecadfa059959bf7c9dbe835962