Bug 211167
Summary: | fstrim: FITRIM ioctl failed: Device or resource busy on NTFS device after upgrading to Linux kernel 5.10 | ||
---|---|---|---|
Product: | File System | Reporter: | ciwei100000 (ciwei100000+bugzilla) |
Component: | Other | Assignee: | io_other |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | abc, agk, Andreas.Klauer, chriscjsus, david, gmazyland, jack, miklos, rjones |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.10.4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
strace of ntfs-3g when BLKDISCARD fails
[PATCH] block: Try to handle busy underlying device on discard |
Description
ciwei100000
2021-01-13 05:33:13 UTC
This bug showed up in Fedora recently: https://bugzilla.redhat.com/show_bug.cgi?id=1926954 I bisected it to: 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 is the first bad commit commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 Author: Jan Kara <jack@suse.cz> Date: Fri Sep 4 10:58:52 2020 +0200 block: Do not discard buffers under a mounted filesystem Discarding blocks and buffers under a mounted filesystem is hardly anything admin wants to do. Usually it will confuse the filesystem and sometimes the loss of buffer_head state (including b_private field) can even cause crashes like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1 Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015 RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2] ... Call Trace: __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2] kjournald2+0xbd/0x270 [jbd2] So if we don't have block device open with O_EXCL already, claim the block device while we truncate buffer cache. This makes sure any exclusive block device user (such as filesystem) cannot operate on the device while we are discarding buffer cache. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()] Signed-off-by: Jens Axboe <axboe@kernel.dk> block/ioctl.c | 16 ++++++++++------ fs/block_dev.c | 37 +++++++++++++++++++++++++++++++++---- include/linux/blkdev.h | 7 +++++++ 3 files changed, 50 insertions(+), 10 deletions(-) Reverting this commit on top of current head (e0756cfc7d7cd08) fixes the problem for me. However a correct fix may involve fixing ntfs-3g in some way, although I'm unclear how. Thanks for the report and for the bisection! But from a first look I'd say EBUSY is actually a desirable outcome. Let me ellaborate a bit. Fstrim returns EBUSY because someone else (likely FUSE filesystem ntfs3g - you can verify with lsof(8) utility) has the block device open with O_EXCL. Previously fstrim would go and discard data under live ntfs3g filesystem which is hardly desirable thing... So why do you think fstrim should succeed? That's what fstrim does, it trims mounted, live filesystems. So naturally the user's expectation is for it to succeed. It's the filesystem itself (FUSE ntfs3g user space process) issuing the blkdiscard ioctl and failing with -EBUSY. There are no unrelated processes involved. What I'm not sure about is what causes this failure. After all, blkdiscard (the userspace command) likewise opens the target device (both exclusive and non-exclusive mode with --force option) and succeeds. So it should already be possible for ntfs3g to do the same. And yet it fails. I did a few tests on this to find out what is going on since the ntfs-3g code is a bit hard to follow (even though - erm - I wrote it ...) $ guestfish -N fs:ntfs exit $ virt-rescue -a test1.img ><rescue> strace -o /tmp/log -f ntfs-3g -o debug /dev/sda1 /sysroot/ & ><rescue> fstrim /sysroot unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 56 unique: 4, error: 0 (Success), outsize: 120 unique: 6, opcode: OPENDIR (27), nodeid: 1, insize: 48 unique: 6, error: 0 (Success), outsize: 32 unique: 8, opcode: IOCTL (39), nodeid: 1, insize: 96 unique: 8, error: 0 (Success), outsize: 56 fstrim: /sysroot: FITRIM ioctl failed: Device or resource busy unique: 10, opcode: RELEASEDIR (29), nodeid: 1, insize: 64 unique: 10, error: 0 (Success), outsize: 16 ><rescue> umount /sysroot Looking at strace the system call which fails is: 218 ioctl(3, BLKDISCARD, [12288, 4096]) = -1 EBUSY (Device or resource busy) I believe that fd 3 is /dev/sda1, and it's the ordinary file descriptor that ntfs-3g uses for everything else it does to the filesystem. It doesn't seem to be opened with O_EXCL and it wouldn't matter even if it was. So yes, like the previous commenter I believe this is a real bug. Created attachment 295259 [details]
strace of ntfs-3g when BLKDISCARD fails
I agree that there's problem somewhere and fstrim should succeed. I'm just trying to narrow down where exactly the problem is and how to fix it. AFAIU fstrim(8) calls into the kernel with FITRIM ioctl. In case of FUSE filesystem, the ioctl is redirected to ntfs-3g in userspace which determines proper ranges to trim and calls BLKDISCARD ioctl on the appropriate block device ranges. The identified commit 384d87ef2c9 adds a check that the file used for calling BLKDISCARD ioctl is either opened O_EXCL or that we can temporarily upgrade the file descriptor to O_EXCL mode. And this fails. As strace shows ntfs-3g has the device open only O_RDWR so we'll be trying to upgrade fd to O_EXCL. And because it fails it means someone else has the device open exclusively. I've now tested this and also checked the kernel code and it indeed appears that the FUSE driver in the kernel claims the block device exclusively which blocks the userspace daemon to claim it exclusively as well. I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the device exclusively? Is it necessary? Because I'd prefer if we could keep BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it protects normal filesystems from being screwed and crashing. Hi Jan, in the last para did you mean to say "why is it NOT claiming the device exclusively"? I can try adding O_EXCL to ntfs-3g code, but I'm a bit busy today preparing for a conference talk so probably won't happen today. Sorry, ignore my last comment. I now understand what you're saying. LVM with issue_discards = 1 also fails when trying to shrink logical volumes. # lvresize --resizefs -L100M testdiscard/test Do you want to unmount "/dev/shm/loop" ? [Y|n] y fsck from util-linux 2.36.1 /dev/mapper/testdiscard-test: 11/65280 files (0.0% non-contiguous), 8843/261120 blocks resize2fs 1.46.1 (9-Feb-2021) Resizing the filesystem on /dev/mapper/testdiscard-test to 25600 (4k) blocks. The filesystem on /dev/mapper/testdiscard-test is now 25600 (4k) blocks long. /dev/loop0: BLKDISCARD ioctl at offset 105906176 size 964689920 failed: Device or resource busy. Size of logical volume testdiscard/test changed from 1020.00 MiB (255 extents) to 100.00 MiB (25 extents). Logical volume testdiscard/test successfully resized. I admit it's an unusual test case since I used a loop device for testing. It seemed to work fine on SSD. Not sure why there is a difference. > It seemed to work fine on SSD. Not sure why there is a difference.
OK, turns out my SSD is LUKS encrypted and didn't allow discards. With allow discards it turns out, LVM appears to be affected in general.
# lvresize --resizefs -L100M SSD/test
Rounding size to boundary between physical extents: 128.00 MiB.
Do you want to unmount "/mnt/test" ? [Y|n] y
fsck from util-linux 2.36.2
/dev/mapper/SSD-test: 11/65536 files (0.0% non-contiguous), 12955/262144 blocks
resize2fs 1.46.1 (9-Feb-2021)
Resizing the filesystem on /dev/mapper/SSD-test to 32768 (4k) blocks.
The filesystem on /dev/mapper/SSD-test is now 32768 (4k) blocks long.
/dev/mapper/luksSSD1: BLKDISCARD ioctl at offset 56506712064 size 939524096 failed: Device or resource busy.
Size of logical volume SSD/test changed from 1.00 GiB (16 extents) to 128.00 MiB (2 extents).
Logical volume SSD/test successfully resized.
At this point I'm not sure if the kernel should be doing this kind of gatekeeping at all. It's up to user space programs to warn about unsafe actions, blkdiscard already checks for filesystems.
Kernel code that OOPSes when encountering corrupt data has to be fixed anyway as there are many ways for corruption to occur.
I'm unable to reproduce comment 11 with the broken kernel using: LVM version: 2.03.10(2) (2020-08-09) Library version: 1.02.173 (2020-08-09) Driver version: 4.43.0 I don't understand why blkdiscard would fail since the filesystem has been unmounted in this scenario. Any tips to reproduce this? (In reply to Jan Kara from comment #7) > I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the > device exclusively? Is it necessary? Because I'd prefer if we could keep > BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it > protects normal filesystems from being screwed and crashing. It is using get_tree_bdev(), a generic helper that opens the device in exclusive mode. From open(2): O_EXCL [...] If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY. So mounted blockdevs need to be opened in exclusive mode. The userspace filesystem could do that, but we can't fix that in the kernel alone, it needs to be negotiated. So there's no simple fix for this AFAICS. (In reply to Andreas Klauer from comment #11) > At this point I'm not sure if the kernel should be doing this kind of > gatekeeping at all. It's up to user space programs to warn about unsafe > actions, blkdiscard already checks for filesystems. > > Kernel code that OOPSes when encountering corrupt data has to be fixed > anyway as there are many ways for corruption to occur. Just to make things clear: In this case, kernel crashes are not caused by corrupted on-disk data, rather it is about BLKDISCARD ioctl removing data structures from buffer cache in the discarded range. And it is not simple to invalidate buffer cache for the range (which is necessary to make buffer cache consistent with on-disk contents) without discarding the data structures... Essentially it is a conflict between buffer-cache owners - filesystem owns buffer cache it is working on but BLKDISCARD needs to own it as well to be able to make discard results visible. That's why currently BLKDISCARD gets EBUSY, previously BLKDISCARD silently stole buffer cache from under the filesystem... To reproduce the problem with lvm2 - issue_discard=1 and create 2 LVs on a single discardable PV. and just remove 1 LV while the other is still active and uses PV. Clearly lvm2 cannot get 'exclusive' open of such PV in this moment. Possibly ideas (except most trivial one - to revert IMHO not much useful patch - since user can anytime open device of mounted fs and /dev/zero it anyway) lvm2 can create some new LV for discardable extents and try to discard it this way. However this is quite some 'complexification'. Also current kernel patch breaks backward compatibility without any backward compatible 'knob' to switch. (In reply to Miklos Szeredi from comment #13) > (In reply to Jan Kara from comment #7) > > > I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the > > device exclusively? Is it necessary? Because I'd prefer if we could keep > > BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it > > protects normal filesystems from being screwed and crashing. > > It is using get_tree_bdev(), a generic helper that opens the device in > exclusive mode. > > From open(2): > > O_EXCL [...] If the block device is in use by the system > (e.g., mounted), open() fails with the error EBUSY. > > So mounted blockdevs need to be opened in exclusive mode. The userspace > filesystem could do that, but we can't fix that in the kernel alone, it > needs to be negotiated. So there's no simple fix for this AFAICS. Right, so I wonder whether it would be possible for the kernel to only have bdev open without O_EXCL and let userspace daemon open it with O_EXCL (which would even make more sense since it is the userspace daemon which does the IO). I'm not sure whether such change is viable in the FUSE ecosystem though. Could maybe the fuse driver (i.e., ntfs-3g in our case) somehow indicate it is capable of opening the device with O_EXCL and so kernel would release it's exclusive open and let ntfs-3g driver do it during its mount procedure? That way unaware fuse drivers will still get the protection from the kernel and capable drivers that need O_EXCL could still take it. Oh I see, now I'm able to reproduce it. First prepare the disk: $ qemu-img create test-lv.qcow2 -f qcow2 1G $ virt-rescue -a test-lv.qcow2 ><rescue> pvcreate /dev/sda Physical volume "/dev/sda" successfully created. ><rescue> vgcreate VG /dev/sda Volume group "VG" successfully created ><rescue> lvcreate -n LV -L 100M VG Logical volume "LV" created. ><rescue> mkfs.ext4 /dev/VG/LV ><rescue> sync ><rescue> exit Then to reproduce the bug: $ virt-rescue --ro -a test-lv.qcow2 -m /dev/VG/LV ><rescue> lvresize --resizefs -L 50M VG/LV --config devices/issue_discards=1 Rounding size to boundary between physical extents: 52.00 MiB. Do you want to unmount "/sysroot" ? [Y|n] y fsck from util-linux 2.36 /dev/mapper/VG-LV: 11/25688 files (0.0% non-contiguous), 8896/102400 blocks resize2fs 1.45.6 (20-Mar-2020) Resizing the filesystem on /dev/mapper/VG-LV to 53248 (1k) blocks. The filesystem on /dev/mapper/VG-LV is now 53248 (1k) blocks long. /dev/sda: BLKDISCARD ioctl at offset 55574528 size 50331648 failed: Device or resource busy. Size of logical volume VG/LV changed from 100.00 MiB (25 extents) to 52.00 MiB (13 extents). Logical volume VG/LV successfully resized. (Although the BLKDISCARD ioctl fails, lvresize seems to carry on regardless and does not exit with an error.) You can play with the $SUPERMIN_KERNEL and $SUPERMIN_MODULES environment variables to choose a different kernel. https://libguestfs.org/supermin.1.html#ENVIRONMENT-VARIABLES I was able to fix the bug by reverting 384d87ef2c9. (In reply to Zdenek Kabelac from comment #15) > (except most trivial one - to revert IMHO not much useful patch - since user > can anytime open device of mounted fs and /dev/zero it anyway) Yes, he can and that is perfectly fine. Please read my comment 14 why BLKDISCARD is different. > lvm2 can create some new LV for discardable extents and try to discard it > this way. However this is quite some 'complexification'. > > Also current kernel patch breaks backward compatibility without any backward > compatible 'knob' to switch. Well, a 'knob' would not really help. But at this point I agree that too much userspace just depends on the ability to do BLKDISCARD on a device it is not able to open O_EXCL. So I'll ask Jens to revert my patch and we'll have to come up with some other way to protect filesystem's buffer cache from BLKDISCARD. (In reply to Jan Kara from comment #16) > Could maybe the fuse driver (i.e., ntfs-3g in our case) somehow indicate it > is capable of opening the device with O_EXCL and so kernel would release > it's exclusive open and let ntfs-3g driver do it during its mount procedure? > That way unaware fuse drivers will still get the protection from the kernel > and capable drivers that need O_EXCL could still take it. Yes, this could be made to work, but it's not a trivial change. Let's see what the ideas about properly fixing all the userspace issues are after the revert. Revert submitted at: https://lore.kernel.org/linux-block/20210216133849.8244-1-jack@suse.cz I guess we can close this bug for now. Thanks to everybody involved! (In reply to Jan Kara from comment #18) > (In reply to Zdenek Kabelac from comment #15) > > (except most trivial one - to revert IMHO not much useful patch - since > user > > can anytime open device of mounted fs and /dev/zero it anyway) > > Yes, he can and that is perfectly fine. Please read my comment 14 why > BLKDISCARD is different. It's clear it's not a simple situation - but IMHO it should be solved within kernel and avoid exposing troubles to user-land. If there is difference between /dev/zero and discard - maybe that can be resolved and made more similar ? So after some discussion with other developers, we came up with another possible solution. Can you guys check whether the patch I'll attach fixes the problem for you? Created attachment 295345 [details]
[PATCH] block: Try to handle busy underlying device on discard
FWIW fstrim with ntfs-3g driver doesn't complain for me anymore... Sorry, I could only do some superficial testing with your patch. 1) fstrim for ntfs-3g seems to work 2) issue_discards for lvm seems to work 3a) blkdiscard a mounted filesystem is busy 3b) blkdiscard a mounted filesystem at an offset seems to work In case of 3b) what I actually wanted to test is blkdiscard of unoccupied region when fs-size is smaller than device size, which is "safe". But it's fine with discarding the filesystem region too, resulting in filesystem corruption. I'm okay with that (with dd you can corrupt things as well, so why would the kernel care) but how the decision busy vs. not is made remains a mystery to me... it affects only very specific sector ranges? Does this still fix whatever the original issue was? Thanks for testing! The result 3b) is possible - we'll now return EBUSY only if some block in the trimmed range is currently actively being used by the kernel (which is what was causing crashes in the past). So EBUSY results for ranges that may be used by the kernel are kind of random now but that's the cost of letting other cases pass. So the patch seems to work well. I'll send the patch upstream then. Sorry for the delay. I tested the patch in comment 24 on top of the current kernel in git. Both ntfs-3g FITRIM, and my LVM test worked fine. Just wanted to mention that QEMU is also affected by this bug. Discards to block devices from Virtual Machines don't work, although QEMU does not give any warnings. After applying the patch from comment 24 it works now. I used blktrace + blkparse to tell if discards are actually being sent to the device. sudo blktrace -a discard -d /dev/nvme0n1 -o - | blkparse -i - OK, patch is merged upstream. Closing the bug. |