Bug 211167

Summary: fstrim: FITRIM ioctl failed: Device or resource busy on NTFS device after upgrading to Linux kernel 5.10
Product: File System Reporter: ciwei100000 (ciwei100000+bugzilla)
Component: OtherAssignee: io_other
Status: RESOLVED CODE_FIX    
Severity: normal CC: abc, agk, Andreas.Klauer, chriscjsus, david, gmazyland, jack, miklos, rjones
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.10.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: strace of ntfs-3g when BLKDISCARD fails
[PATCH] block: Try to handle busy underlying device on discard

Description ciwei100000 2021-01-13 05:33:13 UTC
I encountered the same issue reported by this post (https://bbs.archlinux.org/viewtopic.php?id=262243).

fstrim seems to fail on NTFS device after upgrading to kernel 5.10. When I type the command "sudo fstrim -av", it works on EXT4 device but not on NTFS. The error message is listed below:

--------------------------------------------------------------
fstrim: /mnt/D: FITRIM ioctl failed: Device or resource busy
/: 3.5 GiB (3715059712 bytes) trimmed on /dev/sda4
--------------------------------------------------------------

I have ntfs3g installed and can confirm it worked on kernel 5.9.15. I am not sure whether it is a kernel change or a bug of fstrim. Please let me know if there is anything I could help.
Comment 1 Richard W.M. Jones 2021-02-09 21:32:30 UTC
This bug showed up in Fedora recently:
https://bugzilla.redhat.com/show_bug.cgi?id=1926954

I bisected it to:

384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 is the first bad commit
commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
Author: Jan Kara <jack@suse.cz>
Date:   Fri Sep 4 10:58:52 2020 +0200

    block: Do not discard buffers under a mounted filesystem
    
    Discarding blocks and buffers under a mounted filesystem is hardly
    anything admin wants to do. Usually it will confuse the filesystem and
    sometimes the loss of buffer_head state (including b_private field) can
    even cause crashes like:
    
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
    Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
    RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
    ...
    Call Trace:
     __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
     jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
     kjournald2+0xbd/0x270 [jbd2]
    
    So if we don't have block device open with O_EXCL already, claim the
    block device while we truncate buffer cache. This makes sure any
    exclusive block device user (such as filesystem) cannot operate on the
    device while we are discarding buffer cache.
    
    Reported-by: Ye Bin <yebin10@huawei.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/ioctl.c          | 16 ++++++++++------
 fs/block_dev.c         | 37 +++++++++++++++++++++++++++++++++----
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 50 insertions(+), 10 deletions(-)
Comment 2 Richard W.M. Jones 2021-02-09 21:45:05 UTC
Reverting this commit on top of current head (e0756cfc7d7cd08)
fixes the problem for me.  However a correct fix may involve
fixing ntfs-3g in some way, although I'm unclear how.
Comment 3 Jan Kara 2021-02-11 08:52:27 UTC
Thanks for the report and for the bisection! But from a first look I'd say EBUSY is actually a desirable outcome. Let me ellaborate a bit. Fstrim returns EBUSY because someone else (likely FUSE filesystem ntfs3g - you can verify with lsof(8) utility) has the block device open with O_EXCL. Previously fstrim would go and discard data under live ntfs3g filesystem which is hardly desirable thing... So why do you think fstrim should succeed?
Comment 4 Andreas Klauer 2021-02-13 14:58:01 UTC
That's what fstrim does, it trims mounted, live filesystems. So naturally the user's expectation is for it to succeed.

It's the filesystem itself (FUSE ntfs3g user space process) issuing the blkdiscard ioctl and failing with -EBUSY. There are no unrelated processes involved.

What I'm not sure about is what causes this failure. After all, blkdiscard (the userspace command) likewise opens the target device (both exclusive and non-exclusive mode with --force option) and succeeds.

So it should already be possible for ntfs3g to do the same. And yet it fails.
Comment 5 Richard W.M. Jones 2021-02-13 15:12:20 UTC
I did a few tests on this to find out what is going on since
the ntfs-3g code is a bit hard to follow (even though - erm - I
wrote it ...)

$ guestfish -N fs:ntfs exit
$ virt-rescue -a test1.img
><rescue> strace -o /tmp/log -f ntfs-3g -o debug /dev/sda1 /sysroot/ &
><rescue> fstrim /sysroot
unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 56
   unique: 4, error: 0 (Success), outsize: 120
unique: 6, opcode: OPENDIR (27), nodeid: 1, insize: 48
   unique: 6, error: 0 (Success), outsize: 32
unique: 8, opcode: IOCTL (39), nodeid: 1, insize: 96
   unique: 8, error: 0 (Success), outsize: 56
fstrim: /sysroot: FITRIM ioctl failed: Device or resource busy
unique: 10, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
   unique: 10, error: 0 (Success), outsize: 16
><rescue> umount /sysroot

Looking at strace the system call which fails is:

218   ioctl(3, BLKDISCARD, [12288, 4096]) = -1 EBUSY (Device or resource busy)

I believe that fd 3 is /dev/sda1, and it's the ordinary file descriptor
that ntfs-3g uses for everything else it does to the filesystem.  It
doesn't seem to be opened with O_EXCL and it wouldn't matter even if
it was.

So yes, like the previous commenter I believe this is a real bug.
Comment 6 Richard W.M. Jones 2021-02-13 15:12:59 UTC
Created attachment 295259 [details]
strace of ntfs-3g when BLKDISCARD fails
Comment 7 Jan Kara 2021-02-15 11:10:06 UTC
I agree that there's problem somewhere and fstrim should succeed. I'm just trying to narrow down where exactly the problem is and how to fix it.

AFAIU fstrim(8) calls into the kernel with FITRIM ioctl. In case of FUSE filesystem, the ioctl is redirected to ntfs-3g in userspace which determines proper ranges to trim and calls BLKDISCARD ioctl on the appropriate block device ranges.

The identified commit 384d87ef2c9 adds a check that the file used for calling  BLKDISCARD ioctl is either opened O_EXCL or that we can temporarily upgrade the file descriptor to O_EXCL mode. And this fails. As strace shows ntfs-3g has the device open only O_RDWR so we'll be trying to upgrade fd to O_EXCL. And because it fails it means someone else has the device open exclusively. I've now tested this and also checked the kernel code and it indeed appears that the FUSE driver in the kernel claims the block device exclusively which blocks the userspace daemon to claim it exclusively as well.

I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the device exclusively? Is it necessary? Because I'd prefer if we could keep BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it protects normal filesystems from being screwed and crashing.
Comment 8 Richard W.M. Jones 2021-02-15 11:15:16 UTC
Hi Jan, in the last para did you mean to say "why is it NOT claiming
the device exclusively"?  I can try adding O_EXCL to ntfs-3g code, but
I'm a bit busy today preparing for a conference talk so probably
won't happen today.
Comment 9 Richard W.M. Jones 2021-02-15 11:23:57 UTC
Sorry, ignore my last comment.  I now understand what you're saying.
Comment 10 Andreas Klauer 2021-02-16 10:15:39 UTC
LVM with issue_discards = 1 also fails when trying to shrink logical volumes.

# lvresize --resizefs -L100M testdiscard/test
Do you want to unmount "/dev/shm/loop" ? [Y|n] y
fsck from util-linux 2.36.1
/dev/mapper/testdiscard-test: 11/65280 files (0.0% non-contiguous), 8843/261120 blocks
resize2fs 1.46.1 (9-Feb-2021)
Resizing the filesystem on /dev/mapper/testdiscard-test to 25600 (4k) blocks.
The filesystem on /dev/mapper/testdiscard-test is now 25600 (4k) blocks long.

  /dev/loop0: BLKDISCARD ioctl at offset 105906176 size 964689920 failed: Device or resource busy.
  Size of logical volume testdiscard/test changed from 1020.00 MiB (255 extents) to 100.00 MiB (25 extents).
  Logical volume testdiscard/test successfully resized.

I admit it's an unusual test case since I used a loop device for testing. It seemed to work fine on SSD. Not sure why there is a difference.
Comment 11 Andreas Klauer 2021-02-16 10:34:58 UTC
> It seemed to work fine on SSD. Not sure why there is a difference.

OK, turns out my SSD is LUKS encrypted and didn't allow discards. With allow discards it turns out, LVM appears to be affected in general.

# lvresize --resizefs -L100M SSD/test
  Rounding size to boundary between physical extents: 128.00 MiB.
Do you want to unmount "/mnt/test" ? [Y|n] y
fsck from util-linux 2.36.2
/dev/mapper/SSD-test: 11/65536 files (0.0% non-contiguous), 12955/262144 blocks
resize2fs 1.46.1 (9-Feb-2021)
Resizing the filesystem on /dev/mapper/SSD-test to 32768 (4k) blocks.
The filesystem on /dev/mapper/SSD-test is now 32768 (4k) blocks long.

  /dev/mapper/luksSSD1: BLKDISCARD ioctl at offset 56506712064 size 939524096 failed: Device or resource busy.
  Size of logical volume SSD/test changed from 1.00 GiB (16 extents) to 128.00 MiB (2 extents).
  Logical volume SSD/test successfully resized.

At this point I'm not sure if the kernel should be doing this kind of gatekeeping at all. It's up to user space programs to warn about unsafe actions, blkdiscard already checks for filesystems.

Kernel code that OOPSes when encountering corrupt data has to be fixed anyway as there are many ways for corruption to occur.
Comment 12 Richard W.M. Jones 2021-02-16 10:52:17 UTC
I'm unable to reproduce comment 11 with the broken kernel using:

  LVM version:     2.03.10(2) (2020-08-09)
  Library version: 1.02.173 (2020-08-09)
  Driver version:  4.43.0

I don't understand why blkdiscard would fail since the filesystem has
been unmounted in this scenario.  Any tips to reproduce this?
Comment 13 Miklos Szeredi 2021-02-16 12:30:51 UTC
(In reply to Jan Kara from comment #7)

> I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the
> device exclusively? Is it necessary? Because I'd prefer if we could keep
> BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it
> protects normal filesystems from being screwed and crashing.

It is using get_tree_bdev(), a generic helper that opens the device in exclusive mode.

From open(2):

       O_EXCL [...]   If  the block device is in use by the system
              (e.g., mounted), open() fails with the error EBUSY.

So mounted blockdevs need to be opened in exclusive mode.  The userspace filesystem could do that, but we can't fix that in the kernel alone, it needs to be negotiated.  So there's no simple fix for this AFAICS.
Comment 14 Jan Kara 2021-02-16 12:35:25 UTC
(In reply to Andreas Klauer from comment #11)
> At this point I'm not sure if the kernel should be doing this kind of
> gatekeeping at all. It's up to user space programs to warn about unsafe
> actions, blkdiscard already checks for filesystems.
> 
> Kernel code that OOPSes when encountering corrupt data has to be fixed
> anyway as there are many ways for corruption to occur.

Just to make things clear: In this case, kernel crashes are not caused by corrupted on-disk data, rather it is about BLKDISCARD ioctl removing data structures from buffer cache in the discarded range. And it is not simple to invalidate buffer cache for the range (which is necessary to make buffer cache consistent with on-disk contents) without discarding the data structures... Essentially it is a conflict between buffer-cache owners - filesystem owns buffer cache it is working on but BLKDISCARD needs to own it as well to be able to make discard results visible. That's why currently BLKDISCARD gets EBUSY, previously BLKDISCARD silently stole buffer cache from under the filesystem...
Comment 15 Zdenek Kabelac 2021-02-16 12:39:09 UTC
To reproduce the problem with lvm2 -

issue_discard=1  and create 2 LVs on a single discardable PV.
and just remove 1 LV while the other is still active and uses PV.

Clearly lvm2 cannot get 'exclusive' open of such PV in this moment.

Possibly ideas (except most trivial one - to revert IMHO not much useful patch - since user can anytime open  device of mounted fs and /dev/zero it anyway)

lvm2 can create some new LV for discardable extents and try to discard it this way.
However this is quite some 'complexification'.

Also current kernel patch breaks backward compatibility without any backward compatible 'knob' to switch.
Comment 16 Jan Kara 2021-02-16 12:54:31 UTC
(In reply to Miklos Szeredi from comment #13)
> (In reply to Jan Kara from comment #7)
> 
> > I've added Miklos to CC. Miklos, why is kernel FUSE module claiming the
> > device exclusively? Is it necessary? Because I'd prefer if we could keep
> > BLKDISCARD ioctl requiring O_EXCL open (or ability to upgrade to it) as it
> > protects normal filesystems from being screwed and crashing.
> 
> It is using get_tree_bdev(), a generic helper that opens the device in
> exclusive mode.
> 
> From open(2):
> 
>        O_EXCL [...]   If  the block device is in use by the system
>               (e.g., mounted), open() fails with the error EBUSY.
> 
> So mounted blockdevs need to be opened in exclusive mode.  The userspace
> filesystem could do that, but we can't fix that in the kernel alone, it
> needs to be negotiated.  So there's no simple fix for this AFAICS.

Right, so I wonder whether it would be possible for the kernel to only have bdev open without O_EXCL and let userspace daemon open it with O_EXCL (which would even make more sense since it is the userspace daemon which does the IO). I'm not sure whether such change is viable in the FUSE ecosystem though.

Could maybe the fuse driver (i.e., ntfs-3g in our case) somehow indicate it is capable of opening the device with O_EXCL and so kernel would release it's exclusive open and let ntfs-3g driver do it during its mount procedure? That way unaware fuse drivers will still get the protection from the kernel and capable drivers that need O_EXCL could still take it.
Comment 17 Richard W.M. Jones 2021-02-16 13:02:08 UTC
Oh I see, now I'm able to reproduce it.  First prepare the disk:

$ qemu-img create test-lv.qcow2 -f qcow2 1G
$ virt-rescue -a test-lv.qcow2
><rescue> pvcreate /dev/sda 
  Physical volume "/dev/sda" successfully created.
><rescue> vgcreate VG /dev/sda
  Volume group "VG" successfully created
><rescue> lvcreate -n LV -L 100M VG
  Logical volume "LV" created.
><rescue> mkfs.ext4 /dev/VG/LV 
><rescue> sync
><rescue> exit

Then to reproduce the bug:

$ virt-rescue --ro -a test-lv.qcow2 -m /dev/VG/LV
><rescue> lvresize --resizefs -L 50M VG/LV --config devices/issue_discards=1
  Rounding size to boundary between physical extents: 52.00 MiB.
Do you want to unmount "/sysroot" ? [Y|n] y
fsck from util-linux 2.36
/dev/mapper/VG-LV: 11/25688 files (0.0% non-contiguous), 8896/102400 blocks
resize2fs 1.45.6 (20-Mar-2020)
Resizing the filesystem on /dev/mapper/VG-LV to 53248 (1k) blocks.
The filesystem on /dev/mapper/VG-LV is now 53248 (1k) blocks long.

  /dev/sda: BLKDISCARD ioctl at offset 55574528 size 50331648 failed: Device or resource busy.
  Size of logical volume VG/LV changed from 100.00 MiB (25 extents) to 52.00 MiB (13 extents).
  Logical volume VG/LV successfully resized.

(Although the BLKDISCARD ioctl fails, lvresize seems to carry on
regardless and does not exit with an error.)

You can play with the $SUPERMIN_KERNEL and $SUPERMIN_MODULES environment
variables to choose a different kernel.
https://libguestfs.org/supermin.1.html#ENVIRONMENT-VARIABLES

I was able to fix the bug by reverting 384d87ef2c9.
Comment 18 Jan Kara 2021-02-16 13:04:18 UTC
(In reply to Zdenek Kabelac from comment #15)
> (except most trivial one - to revert IMHO not much useful patch - since user
> can anytime open  device of mounted fs and /dev/zero it anyway)

Yes, he can and that is perfectly fine. Please read my comment 14 why BLKDISCARD is different.

> lvm2 can create some new LV for discardable extents and try to discard it
> this way. However this is quite some 'complexification'.
> 
> Also current kernel patch breaks backward compatibility without any backward
> compatible 'knob' to switch.

Well, a 'knob' would not really help. But at this point I agree that too much userspace just depends on the ability to do BLKDISCARD on a device it is not able to open O_EXCL. So I'll ask Jens to revert my patch and we'll have to come up with some other way to protect filesystem's buffer cache from BLKDISCARD.
Comment 19 Miklos Szeredi 2021-02-16 13:23:19 UTC
(In reply to Jan Kara from comment #16)

> Could maybe the fuse driver (i.e., ntfs-3g in our case) somehow indicate it
> is capable of opening the device with O_EXCL and so kernel would release
> it's exclusive open and let ntfs-3g driver do it during its mount procedure?
> That way unaware fuse drivers will still get the protection from the kernel
> and capable drivers that need O_EXCL could still take it.

Yes, this could be made to work, but it's not a trivial change.  Let's see what the ideas about properly fixing all the userspace issues are after the revert.
Comment 20 Jan Kara 2021-02-16 14:02:57 UTC
Revert submitted at:

https://lore.kernel.org/linux-block/20210216133849.8244-1-jack@suse.cz
Comment 21 Jan Kara 2021-02-16 14:04:32 UTC
I guess we can close this bug for now. Thanks to everybody involved!
Comment 22 Zdenek Kabelac 2021-02-16 14:53:19 UTC
(In reply to Jan Kara from comment #18)
> (In reply to Zdenek Kabelac from comment #15)
> > (except most trivial one - to revert IMHO not much useful patch - since
> user
> > can anytime open  device of mounted fs and /dev/zero it anyway)
> 
> Yes, he can and that is perfectly fine. Please read my comment 14 why
> BLKDISCARD is different.

It's clear it's not a simple situation - but IMHO it should be solved
within kernel and avoid exposing troubles to user-land.

If there is difference between /dev/zero and discard - maybe that can
be resolved and made more similar ?
Comment 23 Jan Kara 2021-02-18 14:39:37 UTC
So after some discussion with other developers, we came up with another possible solution. Can you guys check whether the patch I'll attach fixes the problem for you?
Comment 24 Jan Kara 2021-02-18 14:40:11 UTC
Created attachment 295345 [details]
[PATCH] block: Try to handle busy underlying device on discard
Comment 25 Jan Kara 2021-02-18 15:15:35 UTC
FWIW fstrim with ntfs-3g driver doesn't complain for me anymore...
Comment 26 Andreas Klauer 2021-02-21 13:40:12 UTC
Sorry, I could only do some superficial testing with your patch.

1) fstrim for ntfs-3g seems to work
2) issue_discards for lvm seems to work
3a) blkdiscard a mounted filesystem is busy
3b) blkdiscard a mounted filesystem at an offset seems to work

In case of 3b) what I actually wanted to test is blkdiscard of unoccupied region when fs-size is smaller than device size, which is "safe". But it's fine with discarding the filesystem region too, resulting in filesystem corruption.

I'm okay with that (with dd you can corrupt things as well, so why would the kernel care) but how the decision busy vs. not is made remains a mystery to me... it affects only very specific sector ranges? Does this still fix whatever the original issue was?
Comment 27 Jan Kara 2021-02-22 09:43:46 UTC
Thanks for testing! The result 3b) is possible - we'll now return EBUSY only if some block in the trimmed range is currently actively being used by the kernel (which is what was causing crashes in the past). So EBUSY results for ranges that may be used by the kernel are kind of random now but that's the cost of letting other cases pass.

So the patch seems to work well. I'll send the patch upstream then.
Comment 28 Richard W.M. Jones 2021-02-22 11:54:51 UTC
Sorry for the delay.  I tested the patch in comment 24 on top
of the current kernel in git.  Both ntfs-3g FITRIM, and my LVM test
worked fine.
Comment 29 chriscjsus 2021-02-23 08:15:20 UTC
Just wanted to mention that QEMU is also affected by this bug.  Discards to block devices from Virtual Machines don't work, although QEMU does not give any warnings.

After applying the patch from comment 24 it works now.
Comment 30 chriscjsus 2021-02-23 08:20:22 UTC
I used blktrace + blkparse to tell if discards are actually being sent to the device.

sudo blktrace -a discard -d /dev/nvme0n1 -o - | blkparse -i -
Comment 31 Jan Kara 2021-03-17 17:42:21 UTC
OK, patch is merged upstream. Closing the bug.