Bug 208827
| Summary: | [fio io_uring] io_uring write data crc32c verify failed | | |
|---|---|---|---|
| Product: | File System | Reporter: | Zorro Lang (zlang) |
| Component: | XFS | Assignee: | FileSystem/XFS Default Virtual Assignee (filesystem_xfs) |
| Status: | NEW | | |
| Severity: | normal | CC: | bugzilla |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | xfs-linux xfs-5.9-merge-7 + v5.8-rc4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Zorro Lang
2020-08-06 04:57:58 UTC
On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>
> Bug ID: 208827
> Summary: [fio io_uring] io_uring write data crc32c verify failed
> Product: File System
> Version: 2.5
> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> Hardware: All
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: XFS
> Assignee: filesystem_xfs@kernel-bugs.kernel.org
> Reporter: zlang@redhat.com
> Regression: No
>
> Description of problem:
> Our fio io_uring test failed as below:
>
> # fio io_uring.fio
> uring_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_sqt_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_sqt_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> fio-3.21-39-g87622
> Starting 4 threads
> uring_w: Laying out IO file (1 file / 256MiB)
> uring_sqt_w: Laying out IO file (1 file / 256MiB)
> uring_rw: Laying out IO file (1 file / 256MiB)
> uring_sqt_rw: Laying out IO file (1 file / 256MiB)
> crc32c: verify failed at file /mnt/fio/uring_rw.0.0 offset 265289728, length 65536 (requested block: offset=265289728, length=65536)
> Expected CRC: e8f1ef35
> Received CRC: 9dd0deae
> fio: pid=46530, err=84/file:io_u.c:2108, func=io_u_queued_complete,

This looks like it's either a short read or page cache corruption. I've
confirmed that the data on disk is correct when the validation fails, but
the data in the page cache is not. That is, when the fio verification
fails, the second 32kB of the 64kB data block returned does not match the
expected data.

Using the options:

verify_fatal=1
verify_dump=1

and getting rid of the "unlink=1" option from the config file confirms that
reading the data with xfs_io -c "pread -v <off> 64k" returns the bad data.
Unmounting the filesystem and mounting it again (or using direct IO to
bypass the page cache), then repeating the xfs_io read, returns 64kB of
data identical to the expected data dump except for 16 bytes in the block
header that have some minor differences. I'm not sure whether that is
expected or not, but we can ignore it to begin with, because it is clear
that there are exactly 8 pages of bad data in the page cache in this range.

So, add:

verify=pattern
verify_pattern=%o

to have the buffer stamped with file offset data rather than random data,
and it turns out that the bad half of the buffer has an incorrect file
offset, while the offsets in the entire range on disk are correct.

Ok, so now I have confirmed that the data is valid on disk but incorrect in
cache. That means the buffered write did contain correct data in the cache,
and that it was written to disk correctly. So some time between the
writeback completing and the data being read, we've ended up with stale
data in the page cache....

This corruption only appears to happen with io_uring based buffered IO -
syscall based buffered IO and buffered IO with AIO don't trigger it at all.
Hence I suspect there is a bug somewhere in the io_uring code, or in a code
path that only the io_uring code path tickles.
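The buffered-versus-direct-IO comparison described above can also be scripted rather than done by hand with xfs_io. Below is a minimal C sketch of that check, reusing the file path and offset from the fio failure message purely as an example; if the O_DIRECT read matches the expected data while the buffered read does not, the stale data is in the page cache rather than on disk:

```c
/* Sketch: read the same 64kB block through the page cache and with O_DIRECT,
 * then compare. Path and offset are taken from the fio error above and are
 * only illustrative.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ (64 * 1024)

int main(void)
{
	const char *path = "/mnt/fio/uring_rw.0.0";
	off_t off = 265289728;
	char *cached, *direct;

	/* O_DIRECT needs an aligned buffer; align both for simplicity. */
	if (posix_memalign((void **)&cached, 4096, BLKSZ) ||
	    posix_memalign((void **)&direct, 4096, BLKSZ))
		return 1;

	int bfd = open(path, O_RDONLY);
	int dfd = open(path, O_RDONLY | O_DIRECT);
	if (bfd < 0 || dfd < 0)
		return 1;

	/* Same range, once via the page cache and once straight from disk. */
	if (pread(bfd, cached, BLKSZ, off) != BLKSZ ||
	    pread(dfd, direct, BLKSZ, off) != BLKSZ)
		return 1;

	printf(memcmp(cached, direct, BLKSZ) ?
	       "page cache differs from on-disk data\n" :
	       "page cache matches on-disk data\n");

	close(bfd);
	close(dfd);
	free(cached);
	free(direct);
	return 0;
}
```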
I can't really make head or tail of the io_uring code, and there's no
obvious way to debug exactly what the user application is asking the
filesystem to do or what the kernel is returning to the filesystem (e.g.
strace doesn't work). Hence I suspect that this needs the io_uring folk to
look at it and isolate it down to the operation that is corrupting the page
cache.

I'd love to know how we can triage issues like this in the field, given
that the tools we normally use to triage and debug data corruption issues
seem to be largely useless...

Cheers,
Dave.

On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>
> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> >
> > Bug ID: 208827
> > Summary: [fio io_uring] io_uring write data crc32c verify failed
> > Product: File System
> > Version: 2.5
> > Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4

FWIW, I can reproduce this with a vanilla 5.8 release kernel, so this isn't
related to the contents of the XFS dev tree at all...

In fact, this bug isn't a recent regression. AFAICT, it was introduced
between 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did reproduce. More info
once I've finished bisecting it....

-Dave.

On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> >
> > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > >
> > > Bug ID: 208827
> > > Summary: [fio io_uring] io_uring write data crc32c verify failed
> > > Product: File System
> > > Version: 2.5
> > > Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>
> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> so this isn't related to contents of the XFS dev tree at all...
>
> In fact, this bug isn't a recent regression. AFAICT, it was
> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> reproduce. More info once I've finished bisecting it....

f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
commit f67676d160c6ee2ed82917fadfed6d29cab8237c
Author: Jens Axboe <axboe@kernel.dk>
Date: Mon Dec 2 11:03:47 2019 -0700

    io_uring: ensure async punted read/write requests copy iovec

    Currently we don't copy the iovecs when we punt to async context. This
    can be problematic for applications that store the iovec on the stack,
    as they often assume that it's safe to let the iovec go out of scope
    as soon as IO submission has been called. This isn't always safe, as we
    will re-copy the iovec once we're in async context. Make this 100% safe
    by copying the iovec just once. With this change, applications may
    safely store the iovec on the stack for all cases.

    Reported-by: 李通洲 <carter.li@eoitek.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 fs/io_uring.c | 243 +++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 181 insertions(+), 62 deletions(-)

Full log:

$ git bisect log
git bisect start
# bad: [d5226fa6dbae0569ee43ecfc08bdcd6770fc4755] Linux 5.5
git bisect bad d5226fa6dbae0569ee43ecfc08bdcd6770fc4755
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# good: [8c39f71ee2019e77ee14f88b1321b2348db51820] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 8c39f71ee2019e77ee14f88b1321b2348db51820
# good: [76bb8b05960c3d1668e6bee7624ed886cbd135ba] Merge tag 'kbuild-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
git bisect good 76bb8b05960c3d1668e6bee7624ed886cbd135ba
# bad: [018e0e3594f7dcd029d258e368c485e742fa9cdb] habanalabs: rate limit error msg on waiting for CS
git bisect bad 018e0e3594f7dcd029d258e368c485e742fa9cdb
# good: [ec939e4c94bd3ef2fd4f34c15f8aaf79bd0c5ee1] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good ec939e4c94bd3ef2fd4f34c15f8aaf79bd0c5ee1
# good: [7c3ddc6b038feaa9a05b09c3f20e64fed50f9a3f] Merge tag 'ti-k3-soc-for-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kristo/linux into arm/dt
git bisect good 7c3ddc6b038feaa9a05b09c3f20e64fed50f9a3f
# bad: [9feb1af97e7366b512ecb9e4dd61d3252074cda3] Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block
git bisect bad 9feb1af97e7366b512ecb9e4dd61d3252074cda3
# good: [b08baef02b26cf7c2123e4a24a2fa1fb7a593ffb] Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good b08baef02b26cf7c2123e4a24a2fa1fb7a593ffb
# good: [3f1266ec704d3efcfc8179c71bed9a75963b6344] Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
git bisect good 3f1266ec704d3efcfc8179c71bed9a75963b6344
# bad: [bca1c43cb2dbe4212aea0793bfd91aeb4c2d184d] null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
git bisect bad bca1c43cb2dbe4212aea0793bfd91aeb4c2d184d
# bad: [5c4bd1f40c23d08ffbdccd68a5fd63751c794d89] null_blk: fix zone size paramter check
git bisect bad 5c4bd1f40c23d08ffbdccd68a5fd63751c794d89
# bad: [03b1230ca12a12e045d83b0357792075bf94a1e0] io_uring: ensure async punted sendmsg/recvmsg requests copy data
git bisect bad 03b1230ca12a12e045d83b0357792075bf94a1e0
# good: [490547ca2df66b8413bce97cb651630f2c531487] block: don't send uevent for empty disk when not invalidating
git bisect good 490547ca2df66b8413bce97cb651630f2c531487
# bad: [f67676d160c6ee2ed82917fadfed6d29cab8237c] io_uring: ensure async punted read/write requests copy iovec
git bisect bad f67676d160c6ee2ed82917fadfed6d29cab8237c
# good: [1a6b74fc87024db59d41cd7346bd437f20fb3e2d] io_uring: add general async offload context
git bisect good 1a6b74fc87024db59d41cd7346bd437f20fb3e2d
# first bad commit: [f67676d160c6ee2ed82917fadfed6d29cab8237c] io_uring: ensure async punted read/write requests copy iovec
$

The bisect is good, but I've got no idea how this commit is triggering page
cache corruption on read IO....

-Dave.

[cc Jens]
[Jens, data corruption w/ io_uring and simple fio reproducer. see
the bz link below.]
On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > On Fri, Aug 07, 2020 at 03:12:03AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > > wrote:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > >
> > > > Bug ID: 208827
> > > > Summary: [fio io_uring] io_uring write data crc32c verify
> > > > failed
> > > > Product: File System
> > > > Version: 2.5
> > > > Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> >
> > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > so this isn't related to contents of the XFS dev tree at all...
> >
> > In fact, this bug isn't a recent regression. AFAICT, it was
> > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > reproduce. More info once I've finished bisecting it....
>
> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> Author: Jens Axboe <axboe@kernel.dk>
> Date: Mon Dec 2 11:03:47 2019 -0700
>
> io_uring: ensure async punted read/write requests copy iovec
....
Ok, I went back to vanilla 5.8 to continue debugging and adding
tracepoints, and it's proving strangely difficult to reproduce now.
However, I did just hit this:
[ 4980.136032] ------------[ cut here ]------------
[ 4980.137665] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ef911b51>] prepare_to_wait_exclusive+0x3d/0xd0
[ 4980.141403] WARNING: CPU: 13 PID: 6800 at kernel/sched/core.c:6888 __might_sleep+0x74/0x80
[ 4980.143940] CPU: 13 PID: 6800 Comm: fio Not tainted 5.8.0-dgc+ #2549
[ 4980.146147] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 4980.148774] RIP: 0010:__might_sleep+0x74/0x80
[ 4980.150455] Code: ff 41 5c 41 5d 41 5e 5d c3 48 8b 90 30 22 00 00 48 c7 c7 a8 b9 50 82 c6 05 38 e4 9a 01 01 48 8b 70 10 48 89 d1 e8 fa 5c fc ff <0f> 0b eb c5 0f 1f 84 00 00 00 002
[ 4980.156255] RSP: 0018:ffffc90005383c58 EFLAGS: 00010282
[ 4980.158299] RAX: 0000000000000000 RBX: 0000561a18122000 RCX: 0000000000000000
[ 4980.160817] RDX: ffff88883eca7de0 RSI: ffff88883ec97a80 RDI: ffff88883ec97a80
[ 4980.163162] RBP: ffffc90005383c70 R08: ffff88883ec97a80 R09: ffff8888070f3000
[ 4980.165635] R10: ffff8888070f3434 R11: ffff8888070f3434 R12: ffffffff8251f46e
[ 4980.168115] R13: 00000000000001ba R14: 0000000000000000 R15: ffff888235647740
[ 4980.170714] FS: 00007f80de7af700(0000) GS:ffff88883ec80000(0000) knlGS:0000000000000000
[ 4980.173442] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4980.175371] CR2: 00007f80d4005008 CR3: 00000005eb01c004 CR4: 0000000000060ee0
[ 4980.177607] Call Trace:
[ 4980.178583] change_protection+0x827/0x9e0
[ 4980.180063] ? kvm_clock_read+0x18/0x30
[ 4980.181654] ? kvm_sched_clock_read+0x9/0x20
[ 4980.183426] ? sysvec_apic_timer_interrupt+0x46/0x90
[ 4980.185160] change_prot_numa+0x19/0x30
[ 4980.186607] task_numa_work+0x1c7/0x2e0
[ 4980.188003] task_work_run+0x64/0xb0
[ 4980.189488] io_cqring_wait+0x118/0x290
[ 4980.191136] ? io_uring_poll+0x80/0x80
[ 4980.192430] __x64_sys_io_uring_enter+0x158/0x2d0
[ 4980.194033] do_syscall_64+0x46/0x90
[ 4980.195278] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4980.197092] RIP: 0033:0x7f8135c59a79
[ 4980.198412] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b8
[ 4980.203953] RSP: 002b:00007f80de7aacd8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[ 4980.206596] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f8135c59a79
[ 4980.208924] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
[ 4980.211334] RBP: 00007f80defb1000 R08: 0000000000000000 R09: 0000000000000000
[ 4980.213656] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
[ 4980.216045] R13: 0000000000000001 R14: 0000561a1820cd60 R15: 0000000000000000
[ 4980.218469] ---[ end trace 4a8ca123102be9c2 ]---
No idea if it is relevant to the data corruption, but there's
definitely something not right here...
Cheers,
Dave.
On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> [cc Jens]
>
> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> the bz link below.]
>
> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 07, 2020 at 03:12:03AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > > > wrote:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > >
> > > > > Bug ID: 208827
> > > > > Summary: [fio io_uring] io_uring write data crc32c verify
> > > > > failed
> > > > > Product: File System
> > > > > Version: 2.5
> > > > > Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > >
> > > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > > so this isn't related to contents of the XFS dev tree at all...
> > >
> > > In fact, this bug isn't a recent regression. AFAICT, it was
> > > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > > reproduce. More info once I've finished bisecting it....
> >
> > f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> > commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> > Author: Jens Axboe <axboe@kernel.dk>
> > Date: Mon Dec 2 11:03:47 2019 -0700
> >
> > io_uring: ensure async punted read/write requests copy iovec
>
> ....
>
> Ok, I went back to vanilla 5.8 to continue debugging and adding
> tracepoints, and it's proving strangely difficult to reproduce now.
Which turns out to be caused by a tracepoint I inserted to try to
narrow down if this was an invalidation race. I put this in
invalidate_complete_page:
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
struct address_space *mapping = page_mapping(page);
if (!mapping)
return 0;
- if (PageDirty(page) || PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page)) {
+ trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
+ mapping->host->i_ino, page, page->index * PAGE_SIZE);
return 0;
+ }
if (page_mapped(page))
return 0;
return invalidate_complete_page(mapping, page);
And that alone, without even enabling tracepoints, made the
corruption go completely away. So I suspect a page state race
condition and look at POSIX_FADV_DONTNEED, which fio is issuing
before running its verification reads. The first thing that does is:
if (!inode_write_congested(mapping->host))
__filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);
It starts async writeback of the dirty pages. There's 256MB of dirty
pages on these inodes, and iomap tracing indicates the entire 256MB
immediately runs through the trace_iomap_writepage() tracepoint.
i.e. every page goes Dirty -> Writeback and is submitted for async
IO.
Then the POSIX_FADV_DONTNEED code goes and runs
invalidate_mapping_pages(), which ends up try-locking each page and
then running invalidate_inode_page() on the page, which is where the
trace debug I put in on pages under writeback gets hit. So if
changing the invalidation code for pages under writeback makes the
problem go away, then stopping invalidate_mapping_pages() from
racing with page writeback should make the problem go away, too.
This does indeed make the corruption go away:
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
if (!inode_write_congested(mapping->host))
- __filemap_fdatawrite_range(mapping, offset, endbyte,
- WB_SYNC_NONE);
+ filemap_write_and_wait_range(mapping, offset, endbyte);
/*
* First and last FULL page! Partial pages are deliberately
by making the invalidation wait for the pages to go fully to the
clean state before starting.
This, however, only fixes the specific symptom being tripped over
here. To further test this, I removed this writeback from
POSIX_FADV_DONTNEED completely so I could trigger writeback via
controlled background writeback. And, as I expected, whenever
background writeback ran to write back these dirty files, the
verification failures triggered again. It is quite reliable.
So it looks like there is some kind of writeback completion vs page
invalidation race condition occurring, but more work is needed to
isolate it further. I don't know what part the async read plays in
the corruption yet, because I don't know how we are getting pages in
the cache where page->index != the file offset stamped in the data.
That smells of leaking PageUptodate flags...
-Dave.
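For reference, the userspace sequence at the centre of this analysis - buffered writes of offset-stamped data, POSIX_FADV_DONTNEED over the file, then buffered verification reads - looks roughly like the sketch below. It is a simplified, syscall-based illustration with hypothetical path and sizes; the actual fio job drives the IO through io_uring, and as noted earlier, syscall-based buffered IO does not trigger the problem.

```c
/* Sketch of the write / POSIX_FADV_DONTNEED / read-verify sequence fio
 * performs. The file is stamped with the block's file offset, loosely
 * mirroring verify_pattern=%o. Path and sizes are illustrative only.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   (64 * 1024)
#define NBLOCKS 64              /* 4MiB here for brevity; the fio job used 256MiB files */

int main(void)
{
	char *buf = malloc(BLKSZ);
	int fd = open("/mnt/fio/testfile", O_RDWR | O_CREAT, 0644);

	if (!buf || fd < 0)
		return 1;

	/* Stamp every 8 bytes of each 64kB block with that block's offset. */
	for (int i = 0; i < NBLOCKS; i++) {
		uint64_t off = (uint64_t)i * BLKSZ;
		for (size_t j = 0; j < BLKSZ; j += sizeof(off))
			memcpy(buf + j, &off, sizeof(off));
		if (pwrite(fd, buf, BLKSZ, off) != BLKSZ)
			return 1;
	}

	/* This is the call that kicks off async writeback of the dirty range
	 * and then invalidate_mapping_pages() on the same range. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	/* Verification pass: re-read each block and check the stamped offset. */
	for (int i = 0; i < NBLOCKS; i++) {
		uint64_t off = (uint64_t)i * BLKSZ, got;
		if (pread(fd, buf, BLKSZ, off) != BLKSZ)
			return 1;
		memcpy(&got, buf + BLKSZ - sizeof(got), sizeof(got));
		if (got != off)
			fprintf(stderr, "stale data at offset %llu\n",
				(unsigned long long)off);
	}

	close(fd);
	free(buf);
	return 0;
}
```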
On 8/10/20 1:08 AM, Dave Chinner wrote:
> [cc Jens]
>
> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> the bz link below.]
>
> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>
>>>>> Bug ID: 208827
>>>>> Summary: [fio io_uring] io_uring write data crc32c verify failed
>>>>> Product: File System
>>>>> Version: 2.5
>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>
>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>> so this isn't related to contents of the XFS dev tree at all...
>>>
>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>> reproduce. More info once I've finished bisecting it....
>>
>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date: Mon Dec 2 11:03:47 2019 -0700
>>
>> io_uring: ensure async punted read/write requests copy iovec

I don't think this commit is related to the issue at all, but I think
we're probably on the same page with that. It's probably just changing
things slightly enough to avoid the race.

> Ok, I went back to vanilla 5.8 to continue debugging and adding
> tracepoints, and it's proving strangely difficult to reproduce now.
>
> However, I did just hit this:
>
> [ 4980.136032] ------------[ cut here ]------------
> [ 4980.137665] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ef911b51>] prepare_to_wait_exclusive+0x3d/0xd0
> [ 4980.141403] WARNING: CPU: 13 PID: 6800 at kernel/sched/core.c:6888 __might_sleep+0x74/0x80
> [ 4980.143940] CPU: 13 PID: 6800 Comm: fio Not tainted 5.8.0-dgc+ #2549
> [ 4980.146147] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [ 4980.148774] RIP: 0010:__might_sleep+0x74/0x80
> [ 4980.150455] Code: ff 41 5c 41 5d 41 5e 5d c3 48 8b 90 30 22 00 00 48 c7 c7 a8 b9 50 82 c6 05 38 e4 9a 01 01 48 8b 70 10 48 89 d1 e8 fa 5c fc ff <0f> 0b eb c5 0f 1f 84 00 00 00 002
> [ 4980.156255] RSP: 0018:ffffc90005383c58 EFLAGS: 00010282
> [ 4980.158299] RAX: 0000000000000000 RBX: 0000561a18122000 RCX: 0000000000000000
> [ 4980.160817] RDX: ffff88883eca7de0 RSI: ffff88883ec97a80 RDI: ffff88883ec97a80
> [ 4980.163162] RBP: ffffc90005383c70 R08: ffff88883ec97a80 R09: ffff8888070f3000
> [ 4980.165635] R10: ffff8888070f3434 R11: ffff8888070f3434 R12: ffffffff8251f46e
> [ 4980.168115] R13: 00000000000001ba R14: 0000000000000000 R15: ffff888235647740
> [ 4980.170714] FS: 00007f80de7af700(0000) GS:ffff88883ec80000(0000) knlGS:0000000000000000
> [ 4980.173442] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4980.175371] CR2: 00007f80d4005008 CR3: 00000005eb01c004 CR4: 0000000000060ee0
> [ 4980.177607] Call Trace:
> [ 4980.178583] change_protection+0x827/0x9e0
> [ 4980.180063] ? kvm_clock_read+0x18/0x30
> [ 4980.181654] ? kvm_sched_clock_read+0x9/0x20
> [ 4980.183426] ? sysvec_apic_timer_interrupt+0x46/0x90
> [ 4980.185160] change_prot_numa+0x19/0x30
> [ 4980.186607] task_numa_work+0x1c7/0x2e0
> [ 4980.188003] task_work_run+0x64/0xb0
> [ 4980.189488] io_cqring_wait+0x118/0x290

I'll get this one:

commit 4c6e277c4cc4a6b3b2b9c66a7b014787ae757cc1
Author: Jens Axboe <axboe@kernel.dk>
Date: Wed Jul 1 11:29:10 2020 -0600

    io_uring: abstract out task work running

queued up for stable. Should not be related to this at all, current -git
shouldn't run into this.

On 8/10/20 3:08 AM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>> [cc Jens]
>>
>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>> the bz link below.]
>>
>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>> wrote:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>
>>>>>> Bug ID: 208827
>>>>>> Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>> failed
>>>>>> Product: File System
>>>>>> Version: 2.5
>>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>
>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>
>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>> reproduce. More info once I've finished bisecting it....
>>>
>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>> Author: Jens Axboe <axboe@kernel.dk>
>>> Date: Mon Dec 2 11:03:47 2019 -0700
>>>
>>> io_uring: ensure async punted read/write requests copy iovec
>>
>> ....
>>
>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>> tracepoints, and it's proving strangely difficult to reproduce now.
>
> Which turns out to be caused by a tracepoint I inserted to try to
> narrow down if this was an invalidation race. I put this in
> invalidate_complete_page:
>
>
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
> struct address_space *mapping = page_mapping(page);
> if (!mapping)
> return 0;
> - if (PageDirty(page) || PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page)) {
> + trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
> + mapping->host->i_ino, page, page->index * PAGE_SIZE);
> return 0;
> + }
> if (page_mapped(page))
> return 0;
> return invalidate_complete_page(mapping, page);
>
>
> And that alone, without even enabling tracepoints, made the
> corruption go completely away. So I suspect a page state race
> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
> before running it's verification reads. First thing that does:
>
> if (!inode_write_congested(mapping->host))
> __filemap_fdatawrite_range(mapping, offset, endbyte,
> WB_SYNC_NONE);
>
> It starts async writeback of the dirty pages. There's 256MB of dirty
> pages on these inodes, and iomap tracing indicates the entire 256MB
> immediately runs through the trace_iomap_writepage() tracepoint.
> i.e. every page goes Dirty -> Writeback and is submitted for async
> IO.
>
> Then the POSIX_FADV_DONTNEED code goes and runs
> invalidate_mapping_pages(), which ends up try-locking each page and
> then running invalidate_inode_page() on the page, which is where the
> trace debug I put in on pages under writeback gets hit. So if
> changing the invalidation code for pages under writeback makes the
> problem go away, then stopping invalidate_mapping_pages() from
> racing with page writeback should make the problem go away, too.
>
> This does indeed make the corruption go away:
>
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
> loff_t len, int advice)
> case POSIX_FADV_NOREUSE:
> break;
> case POSIX_FADV_DONTNEED:
> if (!inode_write_congested(mapping->host))
> - __filemap_fdatawrite_range(mapping, offset, endbyte,
> - WB_SYNC_NONE);
> + filemap_write_and_wait_range(mapping, offset,
> endbyte);
>
> /*
> * First and last FULL page! Partial pages are deliberately
>
> by making the invalidation wait for the pages to go fully to the
> clean state before starting.
>
> This, however, only fixes the specific symptom being tripped over
> here. To further test this, I removed this writeback from
> POSIX_FADV_DONTNEED completely so I could trigger writeback via
> controlled background writeback. And, as I expected, whenever
> background writeback ran to write back these dirty files, the
> verification failures triggered again. It is quite reliable.
>
> So it looks like there is some kind of writeback completion vs page
> invalidation race condition occurring, but more work is needed to
> isolate it further. I don't know what part the async read plays in
> the corruption yet, because I don't know how we are getting pages in
> the cache where page->index != the file offset stamped in the data.
> That smells of leaking PageUptodate flags...
The async read really isn't doing anything that you could not do with
separate threads. Unfortunately it's not that easy to have multiple
threads working on the same region with fio, or we could've reproduced
it with a job file written to use that instead.
I'll dig a bit here...
On 8/10/20 7:15 PM, Jens Axboe wrote:
> On 8/10/20 3:08 AM, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>> [cc Jens]
>>>
>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>> the bz link below.]
>>>
>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>> wrote:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>
>>>>>>> Bug ID: 208827
>>>>>>> Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>> failed
>>>>>>> Product: File System
>>>>>>> Version: 2.5
>>>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>
>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>
>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>> reproduce. More info once I've finished bisecting it....
>>>>
>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>> Date: Mon Dec 2 11:03:47 2019 -0700
>>>>
>>>> io_uring: ensure async punted read/write requests copy iovec
>>>
>>> ....
>>>
>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>
>> Which turns out to be caused by a tracepoint I inserted to try to
>> narrow down if this was an invalidation race. I put this in
>> invalidate_complete_page:
>>
>>
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>> struct address_space *mapping = page_mapping(page);
>> if (!mapping)
>> return 0;
>> - if (PageDirty(page) || PageWriteback(page))
>> + if (PageDirty(page) || PageWriteback(page)) {
>> + trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>> + mapping->host->i_ino, page, page->index *
>> PAGE_SIZE);
>> return 0;
>> + }
>> if (page_mapped(page))
>> return 0;
>> return invalidate_complete_page(mapping, page);
>>
>>
>> And that alone, without even enabling tracepoints, made the
>> corruption go completely away. So I suspect a page state race
>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>> before running it's verification reads. First thing that does:
>>
>> if (!inode_write_congested(mapping->host))
>> __filemap_fdatawrite_range(mapping, offset, endbyte,
>> WB_SYNC_NONE);
>>
>> It starts async writeback of the dirty pages. There's 256MB of dirty
>> pages on these inodes, and iomap tracing indicates the entire 256MB
>> immediately runs through the trace_iomap_writepage() tracepoint.
>> i.e. every page goes Dirty -> Writeback and is submitted for async
>> IO.
>>
>> Then the POSIX_FADV_DONTNEED code goes and runs
>> invalidate_mapping_pages(), which ends up try-locking each page and
>> then running invalidate_inode_page() on the page, which is where the
>> trace debug I put in on pages under writeback gets hit. So if
>> changing the invalidation code for pages under writeback makes the
>> problem go away, then stopping invalidate_mapping_pages() from
>> racing with page writeback should make the problem go away, too.
>>
>> This does indeed make the corruption go away:
>>
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
>> loff_t len, int advice)
>> case POSIX_FADV_NOREUSE:
>> break;
>> case POSIX_FADV_DONTNEED:
>> if (!inode_write_congested(mapping->host))
>> - __filemap_fdatawrite_range(mapping, offset, endbyte,
>> - WB_SYNC_NONE);
>> + filemap_write_and_wait_range(mapping, offset,
>> endbyte);
>>
>> /*
>> * First and last FULL page! Partial pages are deliberately
>>
>> by making the invalidation wait for the pages to go fully to the
>> clean state before starting.
>>
>> This, however, only fixes the specific symptom being tripped over
>> here. To further test this, I removed this writeback from
>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>> controlled background writeback. And, as I expected, whenever
>> background writeback ran to write back these dirty files, the
>> verification failures triggered again. It is quite reliable.
>>
>> So it looks like there is some kind of writeback completion vs page
>> invalidation race condition occurring, but more work is needed to
>> isolate it further. I don't know what part the async read plays in
>> the corruption yet, because I don't know how we are getting pages in
>> the cache where page->index != the file offset stamped in the data.
>> That smells of leaking PageUptodate flags...
>
> The async read really isn't doing anything that you could not do with
> separate threads. Unfortunately it's not that easy to have multiple
> threads working on the same region with fio, or we could've reproduced
> it with a job file written to use that instead.
>
> I'll dig a bit here...
Have we verified that the actual page cache is inconsistent, or is that
just an assumption? I'm asking since I poked a bit on the fio side, and
suspiciously the failed verification was a short IO. At least
originally, fio will retry those, but it could be a bug in the io_uring
engine for fio.
I'll poke some more.
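For context on what "retrying a short IO" means at the io_uring API level: a read completion with a positive result smaller than the requested length is not an error, and the submitter has to reissue the remainder itself. A minimal sketch using liburing of that retry loop follows; the helper name and error handling are illustrative, not the fio engine code:

```c
/* Sketch: a buffered read via io_uring that resubmits whenever a completion
 * returns fewer bytes than requested, instead of treating the partial
 * completion as a full buffer. Assumes liburing.
 */
#include <errno.h>
#include <sys/types.h>
#include <liburing.h>

/* Read exactly 'len' bytes at 'off'. Returns 0 on success, -errno on error. */
static int read_exact(struct io_uring *ring, int fd, char *buf,
		      size_t len, off_t off)
{
	while (len) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int ret;

		if (!sqe)
			return -EAGAIN;

		io_uring_prep_read(sqe, fd, buf, len, off);
		io_uring_submit(ring);

		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret)
			return ret;
		ret = cqe->res;
		io_uring_cqe_seen(ring, cqe);

		if (ret < 0)
			return ret;	/* real IO error */
		if (ret == 0)
			return -EIO;	/* unexpected EOF */

		/* Short read: advance and go around again for the rest. */
		buf += ret;
		off += ret;
		len -= ret;
	}
	return 0;
}
```

Jens's later comments indicate the actual fix lived in fio's io_uring engine completion path, but the principle is the same: a short completion must be accounted for and retried before the buffer is handed to verification.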
On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> > [cc Jens]
> >
> > [Jens, data corruption w/ io_uring and simple fio reproducer. see
> > the bz link below.]

Looks like an io_uring/fio bug at this point, Jens. All your go-fast bits
turn the buffered read into a short read, and neither fio nor the io_uring
async buffered read path handles short reads. Details below.

Also, the block_rq_* trace points appear to emit corrupt trace records that
crash trace-cmd. That probably needs to be looked into as well:

$ sudo trace-cmd record -e xfs_file\* -e iomap\* -e block_bio\* -e block_rq_complete -e printk fio tests/io_uring_corruption.fio
.....
$ trace-cmd report > s.t
ug! no event found for type 4228
ug! no event found for type 32
ug! negative record size -4
ug! no event found for type 16722
ug! no event found for type 0
ug! no event found for type 4228
ug! no event found for type 32
ug! negative record size -4
ug! no event found for type 16722
ug! no event found for type 4228
ug! no event found for type 4230
ug! no event found for type 32
ug! no event found for type 0
ug! no event found for type 4230
ug! no event found for type 32
ug! no event found for type 0
ug! no event found for type 32
ug! negative record size -4
Segmentation fault
$

> > On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> > > On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > > > On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > > >
> > > > > > Bug ID: 208827
> > > > > > Summary: [fio io_uring] io_uring write data crc32c verify failed
> > > > > > Product: File System
> > > > > > Version: 2.5
> > > > > > Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > > >
> > > > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > > > so this isn't related to contents of the XFS dev tree at all...
> > > >
> > > > In fact, this bug isn't a recent regression. AFAICT, it was
> > > > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > > > reproduce. More info once I've finished bisecting it....
> > >
> > > f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> > > commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> > > Author: Jens Axboe <axboe@kernel.dk>
> > > Date: Mon Dec 2 11:03:47 2019 -0700
> > >
> > > io_uring: ensure async punted read/write requests copy iovec
> >
> > ....
> >
> > Ok, I went back to vanilla 5.8 to continue debugging and adding
> > tracepoints, and it's proving strangely difficult to reproduce now.
>
> Which turns out to be caused by a tracepoint I inserted to try to
> narrow down if this was an invalidation race. I put this in
> invalidate_complete_page:
>
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
> 	struct address_space *mapping = page_mapping(page);
> 	if (!mapping)
> 		return 0;
> -	if (PageDirty(page) || PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page)) {
> +		trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
> +			mapping->host->i_ino, page, page->index * PAGE_SIZE);
> 		return 0;
> +	}
> 	if (page_mapped(page))
> 		return 0;
> 	return invalidate_complete_page(mapping, page);

So I updated to the current 5.9 which has all the new async buffered read
stuff in it to see if that fixed the problem, but it didn't. Instead, it
actually made this problem occur almost instantaneously.

I could put the above trace code back in and it didn't perturb the
situation, either. Neither did writeback, which I was now able to confirm
didn't actually impact on this problem at all.

The problem is, instead, reads racing with readahead completion. There is
no page cache corruption at all; fio is just reusing buffers that contain
stale data from a previous read IO, and hence any part of the buffer that
doesn't get updated by the read IO ends up being detected as "corrupt"
because it already contains stale data.

The problem is clear from this sequence trace:

io_uring-sq-4518 [012] 52.806976: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
io_uring-sq-4518 [012] 52.806987: iomap_readahead: dev 253:32 ino 0x86 nr_pages 16
io_uring-sq-4518 [012] 52.806987: iomap_apply: dev 253:32 ino 0x86 pos 116654080 length 65536 flags (0x0) ops xfs_read_iomap_ops caller iomap_readahead actor iomap_readahead_actor
io_uring-sq-4518 [012] 52.806988: iomap_apply_dstmap: dev 253:32 ino 0x86 bdev 253:32 addr 922058752 offset 116654080 length 65536 type MAPPED flags
io_uring-sq-4518 [012] 52.806990: block_bio_queue: 253,32 RA 1800896 + 128 [io_uring-sq]
....
io_uring-sq-4518 [012] 52.806992: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffff5
io_uring-sq-4518 [012] 52.806993: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
io_uring-sq-4518 [012] 52.806994: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffdef
io_uring-sq-4518 [012] 52.806995: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6bf0000 count 0x10000
....
fio-4515 [013] 52.807855: block_rq_complete: 253,32 RA () 1800896 + 128 [0]
io_uring-sq-4518 [012] 52.807863: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
io_uring-sq-4518 [012] 52.807866: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f41000 count 0x1000
....
fio-4515 [013] 52.807871: block_rq_complete: 253,32 RA () 1752640 + 128 [0]
fio-4515 [013] 52.807876: block_rq_complete: 253,32 RA () 1675200 + 128 [0]
fio-4515 [013] 52.807880: block_rq_complete: 253,32 RA () 1652672 + 128 [0]

What we see is -3- calls to do the buffered read. The first call triggers
readahead and we can see iomap map the file offset and issue the bio. This
is IOCB_NOWAIT context, so this is actually broken as we can block for a
long time in this path (e.g. on inode locks).

Once the IO is issued, we see a second xfs_file_buffered_read trace with
the return value in the count - that's -EAGAIN because IOCB_NOWAIT is set
and now we are going to block waiting for IO.

Immediately after this, the io_uring code issues the buffered read again,
only this time with IOCB_WAITQ|IOCB_NOWAIT set, and it hits the "page not
up to date" path. This then hits __wait_on_page_locked_async(), which sets
the PageWaiter bit and returns -EIOCBQUEUED to tell the caller that it has
been queued.

About a millisecond later, the readahead IO completes and the block layer
logs it. This ends up calling iomap_read_end_io(), which walks each page
(bvec) in the bio, marks them uptodate and unlocks them. The first page
unlock sees the PageWaiter bit, which then does a

wake_up_page_bit(page, PG_locked);

because the PageWaiter bit is set. We see the same io_uring worker thread
then immediately re-issue the buffered read, and it does it with
IOCB_WAITQ|IOCB_NOWAIT context. This gets the first page, then finds the
second page not up to date or has to block trying to lock it. It aborts the
read at that point, and because we've already read a page it returns 4096
bytes read.

I left in the post-buffered-read IO completions that were being run by the
block layer to show that the buffered read was issued and completed while
the block layer was still processing the readahead bio completion.

At this point (a positive byte count being returned) the io_uring layer
considers the IO done and returns the result to the userspace app. Fio then
completely fails to check for a short read despite being in "verify" mode,
and instead yells about corruption due to stale data that it had left in
the buffer it passed to the kernel....

IOWs, there does not appear to be a filesystem or page cache issue here at
all - it's just an unhandled short read.

Jens, if strace worked on io_uring, I wouldn't have wasted two and a half
days tracking this down. It would have taken 5 minutes with strace and it
would have been totally obvious. I'd suggest this needs addressing as a
matter of priority over everything else for io_uring....

Cheers,
Dave.

On 8/10/20 7:50 PM, Jens Axboe wrote:
> On 8/10/20 7:15 PM, Jens Axboe wrote:
>> On 8/10/20 3:08 AM, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>> [cc Jens]
>>>>
>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>> the bz link below.]
>>>>
>>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>>> wrote:
>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>>
>>>>>>>> Bug ID: 208827
>>>>>>>> Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>> failed
>>>>>>>> Product: File System
>>>>>>>> Version: 2.5
>>>>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>>
>>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>>
>>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>>> reproduce. More info once I've finished bisecting it....
>>>>>
>>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>>> Date: Mon Dec 2 11:03:47 2019 -0700
>>>>>
>>>>> io_uring: ensure async punted read/write requests copy iovec
>>>>
>>>> ....
>>>>
>>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>>
>>> Which turns out to be caused by a tracepoint I inserted to try to
>>> narrow down if this was an invalidation race. I put this in
>>> invalidate_complete_page:
>>>
>>>
>>> --- a/mm/truncate.c
>>> +++ b/mm/truncate.c
>>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>> struct address_space *mapping = page_mapping(page);
>>> if (!mapping)
>>> return 0;
>>> - if (PageDirty(page) || PageWriteback(page))
>>> + if (PageDirty(page) || PageWriteback(page)) {
>>> + trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>>> + mapping->host->i_ino, page, page->index *
>>> PAGE_SIZE);
>>> return 0;
>>> + }
>>> if (page_mapped(page))
>>> return 0;
>>> return invalidate_complete_page(mapping, page);
>>>
>>>
>>> And that alone, without even enabling tracepoints, made the
>>> corruption go completely away. So I suspect a page state race
>>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>>> before running it's verification reads. First thing that does:
>>>
>>> if (!inode_write_congested(mapping->host))
>>> __filemap_fdatawrite_range(mapping, offset, endbyte,
>>> WB_SYNC_NONE);
>>>
>>> It starts async writeback of the dirty pages. There's 256MB of dirty
>>> pages on these inodes, and iomap tracing indicates the entire 256MB
>>> immediately runs through the trace_iomap_writepage() tracepoint.
>>> i.e. every page goes Dirty -> Writeback and is submitted for async
>>> IO.
>>>
>>> Then the POSIX_FADV_DONTNEED code goes and runs
>>> invalidate_mapping_pages(), which ends up try-locking each page and
>>> then running invalidate_inode_page() on the page, which is where the
>>> trace debug I put in on pages under writeback gets hit. So if
>>> changing the invalidation code for pages under writeback makes the
>>> problem go away, then stopping invalidate_mapping_pages() from
>>> racing with page writeback should make the problem go away, too.
>>>
>>> This does indeed make the corruption go away:
>>>
>>> --- a/mm/fadvise.c
>>> +++ b/mm/fadvise.c
>>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
>>> loff_t len, int advice)
>>> case POSIX_FADV_NOREUSE:
>>> break;
>>> case POSIX_FADV_DONTNEED:
>>> if (!inode_write_congested(mapping->host))
>>> - __filemap_fdatawrite_range(mapping, offset,
>>> endbyte,
>>> - WB_SYNC_NONE);
>>> + filemap_write_and_wait_range(mapping, offset,
>>> endbyte);
>>>
>>> /*
>>> * First and last FULL page! Partial pages are deliberately
>>>
>>> by making the invalidation wait for the pages to go fully to the
>>> clean state before starting.
>>>
>>> This, however, only fixes the specific symptom being tripped over
>>> here. To further test this, I removed this writeback from
>>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>>> controlled background writeback. And, as I expected, whenever
>>> background writeback ran to write back these dirty files, the
>>> verification failures triggered again. It is quite reliable.
>>>
>>> So it looks like there is some kind of writeback completion vs page
>>> invalidation race condition occurring, but more work is needed to
>>> isolate it further. I don't know what part the async read plays in
>>> the corruption yet, because I don't know how we are getting pages in
>>> the cache where page->index != the file offset stamped in the data.
>>> That smells of leaking PageUptodate flags...
>>
>> The async read really isn't doing anything that you could not do with
>> separate threads. Unfortunately it's not that easy to have multiple
>> threads working on the same region with fio, or we could've reproduced
>> it with a job file written to use that instead.
>>
>> I'll dig a bit here...
>
> Have we verified that the actual page cache is inconsistent, or is that
> just an assumption? I'm asking since I poked a bit on the fio side, and
> suspiciously the failed verification was a short IO. At least
> originally, fio will retry those, but it could be a bug in the io_uring
> engine for fio.
>
> I'll poke some more.
The on-disk state seems sane. I added a hack that clears the rest of the
buffer to 0x5a when we get a short read. When the verify fails, the io_u
that we attempted to verify looks like this:
0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a
*
0010000
where fio dumps this as the expected data:
0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
Fio says the offset is 62652416 in the file, and reading
that 64k block from the file and dumping it:
0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
So it seems to me like the file state is consistent, at least after the
run, and that this seems more likely to be a fio issue with short
read handling.
Poking along...
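The 0x5a fill Jens mentions above is a useful debugging trick: poisoning the unfilled tail of a short read makes stale buffer contents stand out in a verify dump, exactly as the 0x5a5a run in the hexdump above does. A rough sketch of the idea (illustrative only, not fio's actual code):

```c
#include <string.h>
#include <sys/types.h>

/* After a read completes, poison whatever part of the buffer the kernel
 * did not fill, so that any leftover data from a previous IO shows up as
 * an obvious 0x5a5a... pattern in hex dumps instead of plausible-looking
 * stale contents.
 */
static void mark_short_read(char *buf, size_t requested, ssize_t returned)
{
	if (returned >= 0 && (size_t)returned < requested)
		memset(buf + returned, 0x5a, requested - returned);
}
```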
On 8/10/20 8:00 PM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>> [cc Jens]
>>>
>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>> the bz link below.]
>
> Looks like an io_uring/fio bug at this point, Jens. All your go-fast
> bits turn the buffered read into a short read, and neither fio nor the
> io_uring async buffered read path handles short reads. Details below.

It's a fio issue. The io_uring engine uses a different path for short IO
completions, and that's being ignored by the backend... Hence the IO just
gets completed and not retried for this case, and that'll then trigger
verification as if it did complete. I'm fixing it up.

> Also, the block_rq_* trace points appear to emit corrupt trace records
> that crash trace-cmd. That probably needs to be looked into as well:
>
> $ sudo trace-cmd record -e xfs_file\* -e iomap\* -e block_bio\* -e block_rq_complete -e printk fio tests/io_uring_corruption.fio
> .....
> $ trace-cmd report > s.t
> ug! no event found for type 4228
> ug! no event found for type 32
> ug! negative record size -4
> ug! no event found for type 16722
> ug! no event found for type 0
> ug! no event found for type 4228
> ug! no event found for type 32
> ug! negative record size -4
> ug! no event found for type 16722
> ug! no event found for type 4228
> ug! no event found for type 4230
> ug! no event found for type 32
> ug! no event found for type 0
> ug! no event found for type 4230
> ug! no event found for type 32
> ug! no event found for type 0
> ug! no event found for type 32
> ug! negative record size -4
> Segmentation fault
> $

No idea what that is, I regularly use "normal" blktrace and it works just
fine for me.

>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>
>>>>>>> Bug ID: 208827
>>>>>>> Summary: [fio io_uring] io_uring write data crc32c verify failed
>>>>>>> Product: File System
>>>>>>> Version: 2.5
>>>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>
>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>
>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>> reproduce. More info once I've finished bisecting it....
>>>>
>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>> Date: Mon Dec 2 11:03:47 2019 -0700
>>>>
>>>> io_uring: ensure async punted read/write requests copy iovec
>>>
>>> ....
>>>
>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>
>> Which turns out to be caused by a tracepoint I inserted to try to
>> narrow down if this was an invalidation race. I put this in
>> invalidate_complete_page:
>>
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>> 	struct address_space *mapping = page_mapping(page);
>> 	if (!mapping)
>> 		return 0;
>> -	if (PageDirty(page) || PageWriteback(page))
>> +	if (PageDirty(page) || PageWriteback(page)) {
>> +		trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>> +			mapping->host->i_ino, page, page->index * PAGE_SIZE);
>> 		return 0;
>> +	}
>> 	if (page_mapped(page))
>> 		return 0;
>> 	return invalidate_complete_page(mapping, page);
>
> So I updated to the current 5.9 which has all the new async buffered
> read stuff in it to see if that fixed the problem, but it didn't.
> Instead, it actually made this problem occur almost instantaneously.
>
> I could put the above trace code back in and it didn't perturb the
> situation, either. Neither did writeback, which I was now able to
> confirm didn't actually impact on this problem at all.
>
> The problem is, instead, reads racing with readahead completion. There
> is no page cache corruption at all; fio is just reusing buffers that
> contain stale data from a previous read IO, and hence any part of the
> buffer that doesn't get updated by the read IO ends up being detected
> as "corrupt" because it already contains stale data.
>
> The problem is clear from this sequence trace:
>
> io_uring-sq-4518 [012] 52.806976: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
> io_uring-sq-4518 [012] 52.806987: iomap_readahead: dev 253:32 ino 0x86 nr_pages 16
> io_uring-sq-4518 [012] 52.806987: iomap_apply: dev 253:32 ino 0x86 pos 116654080 length 65536 flags (0x0) ops xfs_read_iomap_ops caller iomap_readahead actor iomap_readahead_actor
> io_uring-sq-4518 [012] 52.806988: iomap_apply_dstmap: dev 253:32 ino 0x86 bdev 253:32 addr 922058752 offset 116654080 length 65536 type MAPPED flags
> io_uring-sq-4518 [012] 52.806990: block_bio_queue: 253,32 RA 1800896 + 128 [io_uring-sq]
> ....
> io_uring-sq-4518 [012] 52.806992: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffff5
> io_uring-sq-4518 [012] 52.806993: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
> io_uring-sq-4518 [012] 52.806994: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffdef
> io_uring-sq-4518 [012] 52.806995: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6bf0000 count 0x10000
> ....
> fio-4515 [013] 52.807855: block_rq_complete: 253,32 RA () 1800896 + 128 [0]
> io_uring-sq-4518 [012] 52.807863: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
> io_uring-sq-4518 [012] 52.807866: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f41000 count 0x1000
> ....
> fio-4515 [013] 52.807871: block_rq_complete: 253,32 RA () 1752640 + 128 [0]
> fio-4515 [013] 52.807876: block_rq_complete: 253,32 RA () 1675200 + 128 [0]
> fio-4515 [013] 52.807880: block_rq_complete: 253,32 RA () 1652672 + 128 [0]
>
> What we see is -3- calls to do the buffered read. The first call
> triggers readahead and we can see iomap map the file offset and issue
> the bio. This is IOCB_NOWAIT context, so this is actually broken as we
> can block for a long time in this path (e.g. on inode locks).
>
> Once the IO is issued, we see a second xfs_file_buffered_read trace
> with the return value in the count - that's -EAGAIN because
> IOCB_NOWAIT is set and now we are going to block waiting for IO.
>
> Immediately after this, the io_uring code issues the buffered read
> again, only this time with IOCB_WAITQ|IOCB_NOWAIT set, and it hits the
> "page not up to date" path. This then hits
> __wait_on_page_locked_async(), which sets the PageWaiter bit and
> returns -EIOCBQUEUED to tell the caller that it has been queued.
>
> About a millisecond later, the readahead IO completes and the block
> layer logs it. This ends up calling iomap_read_end_io(), which walks
> each page (bvec) in the bio, marks them uptodate and unlocks them. The
> first page unlock sees the PageWaiter bit, which then does a
>
> wake_up_page_bit(page, PG_locked);
>
> because the PageWaiter bit is set. We see the same io_uring worker
> thread then immediately re-issue the buffered read, and it does it
> with IOCB_WAITQ|IOCB_NOWAIT context. This gets the first page, then
> finds the second page not up to date or has to block trying to lock
> it. It aborts the read at that point, and because we've already read a
> page it returns 4096 bytes read.
>
> I left in the post-buffered-read IO completions that were being run by
> the block layer to show that the buffered read was issued and
> completed while the block layer was still processing the readahead bio
> completion.
>
> At this point (a positive byte count being returned) the io_uring
> layer considers the IO done and returns the result to the userspace
> app. Fio then completely fails to check for a short read despite being
> in "verify" mode, and instead yells about corruption due to stale data
> that it had left in the buffer it passed to the kernel....
>
> IOWs, there does not appear to be a filesystem or page cache issue
> here at all - it's just an unhandled short read.

Right, the page cache is consistent, nothing wrong on that side. This is
purely a fio issue with mishandling the short read. It'd be nice to do
better on the short reads, maybe wait for the entire range to be ready
instead of just triggering on the first page. Something to look into.

> Jens, if strace worked on io_uring, I wouldn't have wasted two and a
> half days tracking this down. It would have taken 5 minutes with
> strace and it would have been totally obvious. I'd suggest this needs
> addressing as a matter of priority over everything else for
> io_uring....

Sorry, you should have pinged me earlier. In lieu of strace, we could
expand the io_uring issue event to include a bigger dump of exactly what is
in the command being issued instead of just the opcode. There's already
info in there enabling someone to tie the complete and submit events
together, so it could have been deduced that we never retried a short IO on
the application side. But it should not be that hard to dig out, and I
agree we need to make it easier to debug these kinds of things. Definitely
on the list!

On 8/10/20 8:01 PM, Jens Axboe wrote:
> On 8/10/20 7:50 PM, Jens Axboe wrote:
>> On 8/10/20 7:15 PM, Jens Axboe wrote:
>>> On 8/10/20 3:08 AM, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>>> [cc Jens]
>>>>>
>>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>>> the bz link below.]
>>>>>
>>>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>>>> wrote:
>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>>>
>>>>>>>>> Bug ID: 208827
>>>>>>>>> Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>>> failed
>>>>>>>>> Product: File System
>>>>>>>>> Version: 2.5
>>>>>>>>> Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>>>
>>>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>>>
>>>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>>>> introduced between 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>>>> reproduce. More info once I've finished bisecting it....
>>>>>>
>>>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>>>> Date: Mon Dec 2 11:03:47 2019 -0700
>>>>>>
>>>>>> io_uring: ensure async punted read/write requests copy iovec
>>>>>
>>>>> ....
>>>>>
>>>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>>>
>>>> Which turns out to be caused by a tracepoint I inserted to try to
>>>> narrow down if this was an invalidation race. I put this in
>>>> invalidate_complete_page:
>>>>
>>>>
>>>> --- a/mm/truncate.c
>>>> +++ b/mm/truncate.c
>>>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>>>          struct address_space *mapping = page_mapping(page);
>>>>          if (!mapping)
>>>>                  return 0;
>>>> -        if (PageDirty(page) || PageWriteback(page))
>>>> +        if (PageDirty(page) || PageWriteback(page)) {
>>>> +                trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>>>> +                        mapping->host->i_ino, page, page->index * PAGE_SIZE);
>>>>                  return 0;
>>>> +        }
>>>>          if (page_mapped(page))
>>>>                  return 0;
>>>>          return invalidate_complete_page(mapping, page);
>>>>
>>>>
>>>> And that alone, without even enabling tracepoints, made the
>>>> corruption go completely away. So I suspect a page state race
>>>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>>>> before running its verification reads. First thing that does:
>>>>
>>>> if (!inode_write_congested(mapping->host))
>>>> __filemap_fdatawrite_range(mapping, offset, endbyte,
>>>> WB_SYNC_NONE);
>>>>
>>>> It starts async writeback of the dirty pages. There's 256MB of dirty
>>>> pages on these inodes, and iomap tracing indicates the entire 256MB
>>>> immediately runs through the trace_iomap_writepage() tracepoint.
>>>> i.e. every page goes Dirty -> Writeback and is submitted for async
>>>> IO.
>>>>
>>>> Then the POSIX_FADV_DONTNEED code goes and runs
>>>> invalidate_mapping_pages(), which ends up try-locking each page and
>>>> then running invalidate_inode_page() on the page, which is where the
>>>> trace debug I put in on pages under writeback gets hit. So if
>>>> changing the invalidation code for pages under writeback makes the
>>>> problem go away, then stopping invalidate_mapping_pages() from
>>>> racing with page writeback should make the problem go away, too.
>>>>
>>>> This does indeed make the corruption go away:
>>>>
>>>> --- a/mm/fadvise.c
>>>> +++ b/mm/fadvise.c
>>>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>>>>          case POSIX_FADV_NOREUSE:
>>>>                  break;
>>>>          case POSIX_FADV_DONTNEED:
>>>>                  if (!inode_write_congested(mapping->host))
>>>> -                        __filemap_fdatawrite_range(mapping, offset, endbyte,
>>>> -                                                   WB_SYNC_NONE);
>>>> +                        filemap_write_and_wait_range(mapping, offset, endbyte);
>>>>
>>>>                  /*
>>>>                   * First and last FULL page! Partial pages are deliberately
>>>>
>>>> by making the invalidation wait for the pages to go fully to the
>>>> clean state before starting.
>>>>
>>>> This, however, only fixes the specific symptom being tripped over
>>>> here. To further test this, I removed this writeback from
>>>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>>>> controlled background writeback. And, as I expected, whenever
>>>> background writeback ran to write back these dirty files, the
>>>> verification failures triggered again. It is quite reliable.
>>>>
>>>> So it looks like there is some kind of writeback completion vs page
>>>> invalidation race condition occurring, but more work is needed to
>>>> isolate it further. I don't know what part the async read plays in
>>>> the corruption yet, because I don't know how we are getting pages in
>>>> the cache where page->index != the file offset stamped in the data.
>>>> That smells of leaking PageUptodate flags...
>>>
>>> The async read really isn't doing anything that you could not do with
>>> separate threads. Unfortunately it's not that easy to have multiple
>>> threads working on the same region with fio, or we could've reproduced
>>> it with a job file written to use that instead.
>>>
>>> I'll dig a bit here...
>>
>> Have we verified that the actual page cache is inconsistent, or is that
>> just an assumption? I'm asking since I poked a bit on the fio side, and
>> suspiciously the failed verification was a short IO. At least
>> originally, fio will retry those, but it could be a bug in the io_uring
>> engine for fio.
>>
>> I'll poke some more.
>
> The on-disk state seems sane. I added a hack that clears the rest of the
> buffer to 0x5a when we get a short read. When the verify fails, the io_u
> that was attempted verified looks like this:
>
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a
> *
> 0010000
>
> where fio dumps this as the expected data:
>
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
> 0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
>
> Fio says the offset is 62652416 in the file, and reading
> that 64k block from the file and dumping it:
>
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
> 0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
>
> So it seems to me like the file state is consistent, at least after the
> run, and that this seems more likely to be a fio issue with short
> read handling.
>
> Poking along...
I've pushed some fixes for fio for this, it now works fine for me. JFYI.
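To make the failure mode above concrete: a buffered read completing with a positive byte count smaller than the requested length is a legitimate completion, and it is the submitter's job to come back for the rest. The following is a minimal sketch, not fio's actual fix, of how a liburing-based application might do that; read_fully() is a hypothetical helper, the one-request-at-a-time submission is an illustrative simplification, and the file path is simply the one from the reproducer above.

/*
 * Minimal sketch, not fio's actual fix: resubmit the remainder of a
 * buffered read whenever the completion comes back short. Error
 * handling is trimmed for brevity, and the 16-entry SQ is assumed not
 * to run dry since we submit one request at a time.
 */
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

static int read_fully(struct io_uring *ring, int fd, char *buf,
                      unsigned len, off_t off)
{
        unsigned done = 0;

        while (done < len) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                struct io_uring_cqe *cqe;
                int ret;

                /* Ask for whatever is still missing from the buffer. */
                io_uring_prep_read(sqe, fd, buf + done, len - done, off + done);
                io_uring_submit(ring);

                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret < 0)
                        return ret;
                ret = cqe->res;
                io_uring_cqe_seen(ring, cqe);

                if (ret < 0)
                        return ret;     /* real error */
                if (ret == 0)
                        break;          /* EOF */
                done += ret;            /* short read: loop and resubmit */
        }
        return done;
}

int main(void)
{
        struct io_uring ring;
        static char buf[64 * 1024];
        int fd = open("/mnt/fio/uring_rw.0.0", O_RDONLY);

        if (fd < 0 || io_uring_queue_init(16, &ring, 0) < 0)
                return 1;
        printf("read %d bytes\n", read_fully(&ring, fd, buf, sizeof(buf), 0));
        io_uring_queue_exit(&ring);
        return 0;
}

Built with something like cc -o read_fully read_fully.c -luring; the only point is that cqe->res smaller than the requested length must be treated as "keep going", not as a verification-ready buffer.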
On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> > IOWs, there does not appear to be a filesystem or page cache issue
> > here at all - it's just an unhandled short read.
>
> Right, the page cache is consistent, nothing wrong on that side. This is
> purely a fio issue with messing up the short read.
>
> It'd be nice to do better on the short reads, maybe wait for the entire
> range to be ready instead of just triggering on the first page.
> Something to look into.
>
> > Jens, if strace worked on io_uring, I wouldn't have wasted 2 and a
> > half days tracking this down. It would have taken 5 minutes with
> > strace and it would have been totally obvious. I'd suggest this
> > needs addressing as a matter of priority over everything else for
> > io_uring....
>
> Sorry, you should have pinged me earlier.

I pinged you as early as I would ping anyone else when it's apparent I'm probably not looking at a filesystem issue. The problem here is that io_uring is completely unfamiliar code to most of the filesystem developers that are going to have to diagnose problems like this in the field. That's why I spent the time on it - nobody else in my team or the upstream people I work with day to day has even looked at this code or has a clue of how it works. We need to have people who know how this precarious stack of cards actually works, otherwise we're completely screwed when a user has a similar sort of issue. And right now, all I see are big red flags.

This exercise has led me to realise that we really don't have *any* tools that can be used to diagnose io_uring problems at all. That's something that clearly needs to be fixed, because for some reason this crazy io_uring stuff has suddenly become the responsibility of distro filesystem people, who are being asked to maintain and debug it...

So saying "you should have pinged me earlier" is a hugely naive response. It completely ignores the fact that people are going to have to triage issues like this on a regular basis. And you don't scale to having every single user ping you to help debug io_uring problems they are seeing. IOWs, "you should have pinged me" is not a solution, nor does it absolve you of the responsibility for making your subsystem debuggable by the expected user and application developer base.

> In lieu of strace, we could
> expand the io_uring issue event to include a bigger dump of exactly what
> is in the command being issued instead of just the opcode.

I didn't find anything even remotely useful in the handful of trace points in the io_uring code.....

> There's
> already info in there enabling someone to tie the complete and submit
> events together, so it could have been deduced that we never retried a
> short IO on the application side. But it should not be that hard to dig
> out,

... and now you're assuming that the person trying to debug their application knows exactly how the whole io_uring thing works inside the kernel so they can make sense of the tracepoint information. There's no documentation for the interface, there's no design doc that describes how the async engine is supposed to work, what constraints it places on code paths that need to be able to run async, etc. If you aren't deep in the guts of io_uring all the time, then working out what the tracepoints are documenting is no easy task.

> I agree we need to make it easier to debug these kinds of things.
> Definitely on the list!

As I said, I think it should be at the top of your list.

Cheers,

Dave.

On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> On 8/10/20 8:00 PM, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> >> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> >>> [cc Jens]
> >>>
> >>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> >>> the bz link below.]
> >
> > Looks like an io_uring/fio bug at this point, Jens. All your go fast
> > bits turn the buffered read into a short read, and neither fio nor the
> > io_uring async buffered read path handles short reads. Details below.
>
> It's a fio issue. The io_uring engine uses a different path for short
> IO completions, and that's being ignored by the backend... Hence the
> IO just gets completed and not retried for this case, and that'll then
> trigger verification as if it did complete. I'm fixing it up.
I just updated fio to:
cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set io_u->verify_offset in fill_io_u()
The workload still reports corruption almost instantly. Only this
time, the trace is not reporting a short read.
File is patterned with:
verify_pattern=0x33333333%o-16
Offset of "bad" data is 0x1240000.
Expected:
00000000: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000010: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000020: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000030: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000040: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000050: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000060: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000070: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000080: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
.....
0000ffd0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
0000ffe0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
0000fff0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
Received:
00000000: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000010: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000020: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000030: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000040: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000050: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000060: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000070: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000080: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
.....
0000ffd0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
0000ffe0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
0000fff0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
Looks like the data in the expected buffer is wrong - the data
pattern in the received buffer is correct according to the defined
pattern.
Error is 100% reproducible from the same test case. Same bad byte in
the expected buffer dump every single time.
-Dave.
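For readers decoding the dumps above: with verify_pattern=0x33333333%o-16, each 16-byte line appears to carry the 4-byte magic, an 8-byte offset stamp (%o), and then -16. If that stamp is read as a little-endian 64-bit value (which is what the dumps suggest, though the layout is inferred here, not taken from fio's source), the "Received" lines decode to 0x1240000, the block that was actually read, while the "Expected" lines decode to 0x1241000, one page higher. That is consistent with Dave's conclusion that the expected buffer, not the file data, is the wrong one. A small sketch of that decoding:

/*
 * Sketch only: decode one 16-byte verify line from the dumps above,
 * assuming the layout they suggest: 4 bytes of 0x33 magic, an 8-byte
 * little-endian offset (the %o stamp), then -16 as a 32-bit value.
 * A little-endian host is assumed for the memcpy() decode.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t stamp_offset(const unsigned char *line16)
{
        uint64_t off;

        memcpy(&off, line16 + 4, sizeof(off));  /* bytes 4..11 of the line */
        return off;
}

int main(void)
{
        /* One "Expected" and one "Received" line copied from the dumps above. */
        const unsigned char expected[16] = {
                0x33, 0x33, 0x33, 0x33, 0x00, 0x10, 0x24, 0x01,
                0x00, 0x00, 0x00, 0x00, 0xf0, 0xff, 0xff, 0xff,
        };
        const unsigned char received[16] = {
                0x33, 0x33, 0x33, 0x33, 0x00, 0x00, 0x24, 0x01,
                0x00, 0x00, 0x00, 0x00, 0xf0, 0xff, 0xff, 0xff,
        };

        /* Prints 0x1241000 vs 0x1240000: the "expected" stamp is one page off. */
        printf("expected stamp 0x%llx, received stamp 0x%llx\n",
               (unsigned long long)stamp_offset(expected),
               (unsigned long long)stamp_offset(received));
        return 0;
}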
On 8/11/20 1:05 AM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
>> On 8/10/20 8:00 PM, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>>> [cc Jens]
>>>>>
>>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>>> the bz link below.]
>>>
>>> Looks like an io_uring/fio bug at this point, Jens. All your go fast
>>> bits turn the buffered read into a short read, and neither fio nor the
>>> io_uring async buffered read path handles short reads. Details below.
>>
>> It's a fio issue. The io_uring engine uses a different path for short
>> IO completions, and that's being ignored by the backend... Hence the
>> IO just gets completed and not retried for this case, and that'll then
>> trigger verification as if it did complete. I'm fixing it up.
>
> I just updated fio to:
>
> cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set io_u->verify_offset in fill_io_u()
>
> The workload still reports corruption almost instantly. Only this
> time, the trace is not reporting a short read.
>
> File is patterned with:
>
> verify_pattern=0x33333333%o-16
>
> Offset of "bad" data is 0x1240000.
>
> Expected:
>
> 00000000: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000010: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000020: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000030: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000040: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000050: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000060: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000070: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000080: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> .....
> 0000ffd0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 0000ffe0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 0000fff0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
>
>
> Received:
>
> 00000000: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000010: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000020: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000030: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000040: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000050: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000060: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000070: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000080: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> .....
> 0000ffd0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 0000ffe0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 0000fff0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
>
>
> Looks like the data in the expected buffer is wrong - the data
> pattern in the received buffer is correct according to the defined
> pattern.
>
> Error is 100% reproducible from the same test case. Same bad byte in
> the expected buffer dump every single time.
What job file are you running? It's not impossible that I broke
something else in fio, the io_u->verify_offset is a bit risky... I'll
get it fleshed out shortly.
Dunno if it's at all related, but if it is then Dave can blame me for his two wasted days because I ran into this 3 weeks ago and just got around to filing a bug today.

Bug 208875 - btrfs checksum errors in guest when virtio device io=io_uring
https://bugzilla.kernel.org/show_bug.cgi?id=208875

Jens Axboe <axboe@kernel.dk> writes:

> So it seems to me like the file state is consistent, at least after the
> run, and that this seems more likely to be a fio issue with short
> read handling.

Any idea why there was a short read, though?

-Jeff

On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
> On 8/11/20 1:05 AM, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> >> On 8/10/20 8:00 PM, Dave Chinner wrote:
> >>> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> >>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> >>>>> [cc Jens]
> >>>>>
> >>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> >>>>> the bz link below.]
> >>>
> >>> Looks like an io_uring/fio bug at this point, Jens. All your go fast
> >>> bits turn the buffered read into a short read, and neither fio nor the
> >>> io_uring async buffered read path handles short reads. Details below.
> >>
> >> It's a fio issue. The io_uring engine uses a different path for short
> >> IO completions, and that's being ignored by the backend... Hence the
> >> IO just gets completed and not retried for this case, and that'll then
> >> trigger verification as if it did complete. I'm fixing it up.
> >
> > I just updated fio to:
> >
> > cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set io_u->verify_offset in fill_io_u()
> >
> > The workload still reports corruption almost instantly. Only this
> > time, the trace is not reporting a short read.
> >
> > File is patterned with:
> >
> > verify_pattern=0x33333333%o-16
> >
> > Offset of "bad" data is 0x1240000.
> >
> > Expected:
> >
> > 00000000: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000010: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000020: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000030: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000040: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000050: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000060: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000070: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000080: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > .....
> > 0000ffd0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 0000ffe0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 0000fff0: 33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> >
> >
> > Received:
> >
> > 00000000: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000010: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000020: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000030: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000040: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000050: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000060: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000070: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000080: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > .....
> > 0000ffd0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 0000ffe0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 0000fff0: 33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> >
> >
> > Looks like the data in the expected buffer is wrong - the data
> > pattern in the received buffer is correct according to the defined
> > pattern.
> >
> > Error is 100% reproducible from the same test case. Same bad byte in
> > the expected buffer dump every single time.
>
> What job file are you running? It's not impossible that I broke
> something else in fio, the io_u->verify_offset is a bit risky... I'll
> get it fleshed out shortly.
Details are in the bugzilla I pointed you at. I modified the
original config specified to put per-file and offset identifiers
into the file data rather than using random data. This is
"determining the origin of stale data 101" stuff - the first thing
we _always_ do when trying to diagnose data corruption is identify
where the bad data came from.
Entire config file is below.
Cheers,
Dave.
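The "stale data 101" technique Dave describes, stamping each write buffer with identifiers so that any stale data read back reveals its origin, can be sketched roughly as follows. This is an illustration of the idea, not fio's implementation; the struct layout simply mirrors what the earlier hexdumps suggest (magic, 8-byte offset, -16 filler), and stamp_buffer()/stamped_origin() are hypothetical helpers.

/*
 * Sketch of the technique described above, not fio's code: stamp every
 * 16-byte line of a write buffer with a magic value and the file offset
 * the block is destined for, so any stale data that later turns up in a
 * read can be traced back to where it came from.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct stamp {
        uint32_t magic;         /* e.g. 0x33333333, as in the job file */
        uint64_t file_offset;   /* file offset of the block being written */
        int32_t  tail;          /* filler, the "-16" in verify_pattern=...%o-16 */
} __attribute__((packed));

static void stamp_buffer(void *buf, size_t len, uint64_t file_offset)
{
        struct stamp s = { .magic = 0x33333333, .file_offset = file_offset, .tail = -16 };
        size_t i;

        /* Every 16-byte line in the block carries the same block offset. */
        for (i = 0; i + sizeof(s) <= len; i += sizeof(s))
                memcpy((char *)buf + i, &s, sizeof(s));
}

/* Report which file offset the data at 'pos' claims to belong to. */
static uint64_t stamped_origin(const void *buf, size_t pos)
{
        struct stamp s;

        memcpy(&s, (const char *)buf + (pos & ~(size_t)15), sizeof(s));
        return s.file_offset;
}

int main(void)
{
        static char buf[64 * 1024];

        stamp_buffer(buf, sizeof(buf), 0x1240000ULL);
        /* If a stale copy of this block is ever read back at the wrong
         * place, every line of it still says "0x1240000". */
        return stamped_origin(buf, 0x1000) != 0x1240000ULL;
}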
On Tue, Aug 11, 2020 at 04:56:37PM -0400, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
> > So it seems to me like the file state is consistent, at least after the
> > run, and that this seems more likely to be a fio issue with short
> > read handling.
>
> Any idea why there was a short read, though?

Yes. See:

https://lore.kernel.org/linux-xfs/20200807024211.GG2114@dread.disaster.area/T/#maf3bd9325fb3ac0773089ca58609a2cea0386ddf

It's a race between the readahead io completion marking pages uptodate and unlocking them, and the io_uring worker function getting woken on the first page being unlocked and running the buffered read before the entire readahead IO completion has unlocked all the pages in the IO.

Basically, io_uring is re-running the IOCB_NOWAIT|IOCB_WAITQ IO when there are still pages locked under IO. This will happen much more frequently the larger the buffered read (these are only 64kB) and the wider the readahead windows are opened.

Essentially, the io_uring buffered read needs to wait until _all_ pages in the IO are marked up to date and unlocked, not just the first one. And not just the last one, either - readahead can be broken into multiple bios (because it spans extents) and there is no guarantee of order of completion of the readahead bios given by the readahead code....

Cheers,

Dave.

On Wed, Aug 12, 2020 at 07:59:13AM +1000, Dave Chinner wrote:
> On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
> > What job file are you running? It's not impossible that I broke
> > something else in fio, the io_u->verify_offset is a bit risky... I'll
> > get it fleshed out shortly.
>
> Details are in the bugzilla I pointed you at. I modified the
> original config specified to put per-file and offset identifiers
> into the file data rather than using random data. This is
> "determining the origin of stale data 101" stuff - the first thing
> we _always_ do when trying to diagnose data corruption is identify
> where the bad data came from.
>
> Entire config file is below.
Just as a data point: btrfs fails this test even faster than XFS.
Both with the old 3.21 fio binary and the new one.
Evidence points to this code being very poorly tested. Both
filesystems it is enabled on fail validation with the tool that is
supposed to exercise and validate io_uring IO path behaviour.
Regardless of whether this is a tool failure or a kernel code
failure, the fact is that nobody ran data validation tests on this
shiny new code. And for a storage API that is reading and writing
user data, that's a pretty major process failure....
Improvements required, Jens.
-Dave.
Anyone seeing massive latency spikes?

https://bugzilla.kernel.org/show_bug.cgi?id=208875#c5

On 8/11/20 4:09 PM, Dave Chinner wrote:
> On Tue, Aug 11, 2020 at 04:56:37PM -0400, Jeff Moyer wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>>
>>> So it seems to me like the file state is consistent, at least after the
>>> run, and that this seems more likely to be a fio issue with short
>>> read handling.
>>
>> Any idea why there was a short read, though?
>
> Yes. See:
>
> https://lore.kernel.org/linux-xfs/20200807024211.GG2114@dread.disaster.area/T/#maf3bd9325fb3ac0773089ca58609a2cea0386ddf
>
> It's a race between the readahead io completion marking pages
> uptodate and unlocking them, and the io_uring worker function
> getting woken on the first page being unlocked and running the
> buffered read before the entire readahead IO completion has unlocked
> all the pages in the IO.
>
> Basically, io_uring is re-running the IOCB_NOWAIT|IOCB_WAITQ IO when
> there are still pages locked under IO. This will happen much more
> frequently the larger the buffered read (these are only 64kB) and
> the wider the readahead windows are opened.
>
> Essentially, the io_uring buffered read needs to wait until _all_
> pages in the IO are marked up to date and unlocked, not just the
> first one. And not just the last one, either - readahead can be
> broken into multiple bios (because it spans extents) and there is no
> guarantee of order of completion of the readahead bios given by the
> readahead code....

Yes, it would ideally wait, or at least trigger on the last one. I'll see if I can improve that. For any of my testing, the amount of triggered short reads is minimal. For the verify case that we just ran, we're talking 8-12 ios out of 820 thousand, or 0.001% of them. So nothing that makes a performance difference in practical terms, though it would be nice to not hand back short reads if we can avoid it. Not for performance reasons, but for usage reasons.

On 8/11/20 5:00 PM, Dave Chinner wrote:
> On Wed, Aug 12, 2020 at 07:59:13AM +1000, Dave Chinner wrote:
>> On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
>>> What job file are you running? It's not impossible that I broke
>>> something else in fio, the io_u->verify_offset is a bit risky... I'll
>>> get it fleshed out shortly.
>>
>> Details are in the bugzilla I pointed you at. I modified the
>> original config specified to put per-file and offset identifiers
>> into the file data rather than using random data. This is
>> "determining the origin of stale data 101" stuff - the first thing
>> we _always_ do when trying to diagnose data corruption is identify
>> where the bad data came from.
>>
>> Entire config file is below.
>
> Just as a data point: btrfs fails this test even faster than XFS.
> Both with the old 3.21 fio binary and the new one.

I can't trigger any failures with the fixes I committed, so that's a bit puzzling. What storage are you running this on? I'll try a few other things.

> Evidence points to this code being very poorly tested. Both
> filesystems it is enabled on fail validation with the tool that is
> supposed to exercise and validate io_uring IO path behaviour.
>
> Regardless of whether this is a tool failure or a kernel code
> failure, the fact is that nobody ran data validation tests on this
> shiny new code. And for a storage API that is reading and writing
> user data, that's a pretty major process failure....
>
> Improvements required, Jens.

Let's turn down the condescension and assumptions a notch, please.
The fio io_uring engine was originally written to benchmark it, and I haven't really been using it for verification purposes (as is evident). In my experience, I get much better validation from running internal production. And plenty of that has been done.

That said, the fio side should of course work, since others can't necessarily run what I'm running, and it does provide ways to exercise it that aren't necessarily seen in prod. But blankly assuming that "nobody ran data validation tests on this shiny new code" is blatantly false.

Jens Axboe <axboe@kernel.dk> writes:

> Yes, it would ideally wait, or at least trigger on the last one. I'll
> see if I can improve that. For any of my testing, the amount of
> triggered short reads is minimal. For the verify case that we just ran,
> we're talking 8-12 ios out of 820 thousand, or 0.001% of them. So
> nothing that makes a performance difference in practical terms, though
> it would be nice to not hand back short reads if we can avoid it. Not
> for performance reasons, but for usage reasons.

I think you could make the case that handing back a short read is a bug (unless RWF_NOWAIT was specified, of course). At the very least, it violates the principle of least surprise, and the fact that it happens infrequently actually makes it a worse problem when it comes to debugging.

-Jeff

On 8/12/20 9:24 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> Yes, it would ideally wait, or at least trigger on the last one. I'll
>> see if I can improve that. For any of my testing, the amount of
>> triggered short reads is minimal. For the verify case that we just
>> ran, we're talking 8-12 ios out of 820 thousand, or 0.001% of them.
>> So nothing that makes a performance difference in practical terms,
>> though it would be nice to not hand back short reads if we can avoid
>> it. Not for performance reasons, but for usage reasons.
>
> I think you could make the case that handing back a short read is a
> bug (unless RWF_NOWAIT was specified, of course). At the very least,
> it violates the principle of least surprise, and the fact that it
> happens infrequently actually makes it a worse problem when it comes
> to debugging.
It's definitely on my list to ensure that we handle the somewhat
expected case by just retrying it internally, because I do agree that it
can be surprising. FWIW, this isn't a change with 5.9-rc, io_uring has
always potentially returned short buffered reads when the io-wq offload was
done.
While it may happen infrequently with this specific test case, others
can trigger it more often.