Bug 208827

Summary: [fio io_uring] io_uring write data crc32c verify failed
Product: File System
Component: XFS
Reporter: Zorro Lang (zlang)
Assignee: FileSystem/XFS Default Virtual Assignee (filesystem_xfs)
Status: NEW
Severity: normal
Priority: P1
CC: bugzilla
Hardware: All
OS: Linux
Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
Regression: No

Description Zorro Lang 2020-08-06 04:57:58 UTC
Description of problem:
Our fio io_uring test failed as below:

# fio io_uring.fio
uring_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
uring_sqt_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
uring_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
uring_sqt_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
fio-3.21-39-g87622
Starting 4 threads
uring_w: Laying out IO file (1 file / 256MiB)
uring_sqt_w: Laying out IO file (1 file / 256MiB)
uring_rw: Laying out IO file (1 file / 256MiB)
uring_sqt_rw: Laying out IO file (1 file / 256MiB)
crc32c: verify failed at file /mnt/fio/uring_rw.0.0 offset 265289728, length 65536 (requested block: offset=265289728, length=65536)
       Expected CRC: e8f1ef35
       Received CRC: 9dd0deae
fio: pid=46530, err=84/file:io_u.c:2108, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
crc32c: verify failed at file /mnt/fio/uring_sqt_rw.0.0 offset 268369920, length 65536 (requested block: offset=268369920, length=65536)
       Expected CRC: 7e2a183e
       Received CRC: 652d5dbe
fio: pid=46531, err=84/file:io_u.c:2108, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
crc32c: verify failed at file /mnt/fio/uring_sqt_w.0.0 offset 267911168, length 65536 (requested block: offset=267911168, length=65536)
       Expected CRC: 39d9d324
       Received CRC: a4a056be
fio: pid=46529, err=84/file:io_u.c:2108, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
crc32c: verify failed at file /mnt/fio/uring_w.0.0 offset 264699904, length 65536 (requested block: offset=264699904, length=65536)
       Expected CRC: 3b5b87de
       Received CRC: 4e521e28
fio: pid=46528, err=84/file:io_u.c:2108, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character

uring_w: (groupid=0, jobs=1): err=84 (file:io_u.c:2108, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=46528: Thu Aug  6 12:25:59 2020
  read: IOPS=455, BW=28.4MiB/s (29.8MB/s)(4448MiB/156391msec)
    slat (usec): min=6, max=112247, avg=52.95, stdev=893.43
    clat (nsec): min=148, max=13492M, avg=9750435.02, stdev=176173961.33
     lat (usec): min=35, max=13492k, avg=9803.70, stdev=176175.42
    clat percentiles (usec):
     |  1.00th=[    433],  5.00th=[    717], 10.00th=[    807],
     | 20.00th=[    824], 30.00th=[    832], 40.00th=[    840],
     | 50.00th=[    848], 60.00th=[    865], 70.00th=[    906],
     | 80.00th=[   1221], 90.00th=[   3851], 95.00th=[   7832],
     | 99.00th=[  96994], 99.50th=[ 256902], 99.90th=[2071987],
     | 99.95th=[3238003], 99.99th=[9730786]
  write: IOPS=5059, BW=316MiB/s (332MB/s)(4608MiB/14572msec); 0 zone resets
    slat (usec): min=26, max=767, avg=34.21, stdev= 9.47
    clat (usec): min=216, max=5434, avg=1708.28, stdev=450.47
     lat (usec): min=280, max=5552, avg=1742.70, stdev=451.53
    clat percentiles (usec):
     |  1.00th=[  758],  5.00th=[  996], 10.00th=[ 1045], 20.00th=[ 1139],
     | 30.00th=[ 1647], 40.00th=[ 1696], 50.00th=[ 1893], 60.00th=[ 1926],
     | 70.00th=[ 1926], 80.00th=[ 1942], 90.00th=[ 1991], 95.00th=[ 2180],
     | 99.00th=[ 3261], 99.50th=[ 3294], 99.90th=[ 3752], 99.95th=[ 3818],
     | 99.99th=[ 5014]   
   bw (  KiB/s): min= 8288, max=524288, per=93.06%, avg=258949.10, stdev=157175.28, samples=30
   iops        : min=  129, max= 8192, avg=4045.83, stdev=2455.87, samples=30
  lat (nsec)   : 250=0.02%, 500=0.12%, 750=0.04%, 1000=0.11%
  lat (usec)   : 2=0.11%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.03%
  lat (usec)   : 500=0.24%, 750=2.61%, 1000=36.81%
  lat (msec)   : 2=49.61%, 4=5.45%, 10=2.93%, 20=0.88%, 50=0.39%
  lat (msec)   : 100=0.16%, 250=0.23%, 500=0.11%, 750=0.04%, 1000=0.02%
  lat (msec)   : 2000=0.04%, >=2000=0.05%
  cpu          : usr=2.05%, sys=6.92%, ctx=117371, majf=0, minf=11114
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.6%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=71166,73728,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
...
...
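
For context on what the errors above mean: with verify=crc32c, fio
stores a CRC32C for each block it writes and recomputes it over the
data it reads back, so "Expected CRC" vs "Received CRC" is the
checksum of the written data vs the checksum of what came back. A
minimal, generic CRC32C (Castagnoli) sketch for reference - standard
bitwise form only, not fio's actual implementation, and the
seed/finalization conventions here are assumptions:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Generic bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Illustrative only - fio carries its own optimized implementation. */
static uint32_t crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
	static uint8_t block[65536];	/* stand-in for one 64KiB verify block */

	printf("crc32c: %08x\n", crc32c(block, sizeof(block)));
	return 0;
}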

Steps to Reproduce:
1. mkfs.xfs -f $general_partition_dev
2. mount $general_partition_dev /mnt/fio
3. run fio with the config file below
$ cat uring.fio 
[global]
directory=/mnt/fio
size=256M
iodepth=16
bs=64k
verify=crc32c
thread=1
loops=200
unlink=1

[uring_w]
rw=randwrite
ioengine=io_uring
hipri=0
fixedbufs=0
registerfiles=0
sqthread_poll=0

[uring_sqt_w]
rw=randwrite
ioengine=io_uring
hipri=0
fixedbufs=0
registerfiles=1
sqthread_poll=1

[uring_rw]
rw=randrw
ioengine=io_uring
hipri=0
fixedbufs=0
registerfiles=0
sqthread_poll=0

[uring_sqt_rw]
rw=randrw
ioengine=io_uring
hipri=0
fixedbufs=0
registerfiles=1
sqthread_poll=1

Additional info:
I thought it was an io_uring issue at first, but I can't reproduce it on ext4, so I'm reporting this bug against XFS to get xfs-devel to check it.
Comment 1 Dave Chinner 2020-08-07 03:12:03 UTC
On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208827
> 
>             Bug ID: 208827
>            Summary: [fio io_uring] io_uring write data crc32c verify
>                     failed
>            Product: File System
>            Version: 2.5
>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: XFS
>           Assignee: filesystem_xfs@kernel-bugs.kernel.org
>           Reporter: zlang@redhat.com
>         Regression: No
> 
> Description of problem:
> Our fio io_uring test failed as below:
> 
> # fio io_uring.fio
> uring_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
> (T)
> 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_sqt_w: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W)
> 64.0KiB-64.0KiB,
> (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T)
> 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> uring_sqt_rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
> (T) 64.0KiB-64.0KiB, ioengine=io_uring, iodepth=16
> fio-3.21-39-g87622
> Starting 4 threads
> uring_w: Laying out IO file (1 file / 256MiB)
> uring_sqt_w: Laying out IO file (1 file / 256MiB)
> uring_rw: Laying out IO file (1 file / 256MiB)
> uring_sqt_rw: Laying out IO file (1 file / 256MiB)
> crc32c: verify failed at file /mnt/fio/uring_rw.0.0 offset 265289728, length
> 65536 (requested block: offset=265289728, length=65536)
>        Expected CRC: e8f1ef35
>        Received CRC: 9dd0deae
> fio: pid=46530, err=84/file:io_u.c:2108, func=io_u_queued_complete,

This looks like it's either a short read or it's page cache
corruption. I've confirmed that the data on disk is correct when the
validation fails, but the data in the page cache is not correct.

That is, when the fio verification fails, the second 32kB of the
returned 64kB data block does not match the expected data. Using
the options:

verify_fatal=1
verify_dump=1

and getting rid of the "unlink=1" option from the config file
confirms that reading the data using xfs_io -c "pread -v <off> 64k"
returns the bad data.

Unmounting the filesystem and mounting it again (or using direct IO
to bypass the page cache), and repeating the xfs_io read returns
64kB of data identical to the expected data dump except for 16 bytes
in the block header that have some minor differences. I'm not sure
whether this is expected, but we can ignore it to begin with because
it is clear that there's exactly 8 pages of bad data in the page
cache in this range.

So, add:

verify=pattern
verify_pattern=%o

to have the buffer stamped with file offset data rather than random
data, and it turns out that the bad half of the buffer has an
incorrect file offset, but the offset in the entire range on disk is
correct.
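
To make that concrete: with an offset-stamped pattern, scanning the
returned buffer tells you exactly which file offset's data is sitting
in each part of it. A rough sketch of that kind of check - the 8-byte
little-endian stamp layout below is an assumption for illustration,
not fio's exact %o format:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scan a buffer assumed to be stamped with the file offset of every
 * 8-byte word (a simplified stand-in for fio's verify_pattern=%o) and
 * report the first word whose stamp doesn't match where the buffer was
 * read from, i.e. which part of the buffer holds stale data. */
static void check_offset_stamps(const unsigned char *buf, size_t len,
				uint64_t file_off)
{
	for (size_t i = 0; i + 8 <= len; i += 8) {
		uint64_t stamp;

		memcpy(&stamp, buf + i, 8);
		if (stamp != file_off + i) {
			printf("mismatch at buffer offset %zu: expected 0x%llx,"
			       " found 0x%llx\n", i,
			       (unsigned long long)(file_off + i),
			       (unsigned long long)stamp);
			return;
		}
	}
	printf("all offset stamps match\n");
}

int main(void)
{
	static unsigned char buf[65536];
	uint64_t file_off = 265289728;	/* first failing offset in the report */

	/* Stamp the buffer the way a correct read would look... */
	for (size_t i = 0; i < sizeof(buf); i += 8) {
		uint64_t v = file_off + i;
		memcpy(buf + i, &v, 8);
	}
	/* ...then simulate the second 32kB holding stale data from some
	 * other offset, as seen in the failing case. */
	for (size_t i = 32768; i < sizeof(buf); i += 8) {
		uint64_t v = 0x12340000ULL + i;
		memcpy(buf + i, &v, 8);
	}
	check_offset_stamps(buf, sizeof(buf), file_off);
	return 0;
}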

Ok, so now I have confirmed that the data is valid on disk, but
incorrect in cache. That means the buffered write did contain
correct data in the cache, and that it was written to disk
correctly. So some time between the writeback completing and the
data being read, we've ended up with stale data in the page
cache....

This corruption only appears to happen with io_uring based buffered
IO - syscall based buffered IO and buffered IO with AIO don't
trigger it at all. Hence I suspect there is a bug somewhere in the
io_uring code, or in a code path that only the io_uring code
tickles.

I can't really make head or tail of the io_uring code and there's no
obvious way to debug exactly what the user application is asking the
filesystem to do or what the kernel is returning to the filesystem
(e.g. strace doesn't work). Hence I suspect that this needs the
io_uring folk to look at it and isolate it down to the operation
that is corrupting the page cache. I'd love to know how we can
triage issues like this in the field, given that the tools we
normally use to triage and debug data corruption issues seem to be
largely useless...

Cheers,

Dave.
Comment 2 Dave Chinner 2020-08-10 00:09:43 UTC
On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208827
> 
> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > 
> >             Bug ID: 208827
> >            Summary: [fio io_uring] io_uring write data crc32c verify
> >                     failed
> >            Product: File System
> >            Version: 2.5
> >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4

FWIW, I can reproduce this with a vanilla 5.8 release kernel,
so this isn't related to contents of the XFS dev tree at all...

In fact, this bug isn't a recent regression. AFAICT, it was
introduced between 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
reproduce. More info once I've finished bisecting it....

-Dave.
Comment 3 Dave Chinner 2020-08-10 03:56:10 UTC
On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > 
> > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > 
> > >             Bug ID: 208827
> > >            Summary: [fio io_uring] io_uring write data crc32c verify
> > >                     failed
> > >            Product: File System
> > >            Version: 2.5
> > >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> 
> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> so this isn't related to contents of the XFS dev tree at all...
> 
> In fact, this bug isn't a recent regression. AFAICT, it was
> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> reproduce. More info once I've finished bisecting it....

f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
commit f67676d160c6ee2ed82917fadfed6d29cab8237c
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Dec 2 11:03:47 2019 -0700

    io_uring: ensure async punted read/write requests copy iovec
    
    Currently we don't copy the iovecs when we punt to async context. This
    can be problematic for applications that store the iovec on the stack,
    as they often assume that it's safe to let the iovec go out of scope
    as soon as IO submission has been called. This isn't always safe, as we
    will re-copy the iovec once we're in async context.
    
    Make this 100% safe by copying the iovec just once. With this change,
    applications may safely store the iovec on the stack for all cases.
    
    Reported-by: 李通洲 <carter.li@eoitek.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 fs/io_uring.c | 243 +++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 181 insertions(+), 62 deletions(-)

Full log:

$  git bisect log
git bisect start
# bad: [d5226fa6dbae0569ee43ecfc08bdcd6770fc4755] Linux 5.5
git bisect bad d5226fa6dbae0569ee43ecfc08bdcd6770fc4755
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# good: [8c39f71ee2019e77ee14f88b1321b2348db51820] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 8c39f71ee2019e77ee14f88b1321b2348db51820
# good: [76bb8b05960c3d1668e6bee7624ed886cbd135ba] Merge tag 'kbuild-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
git bisect good 76bb8b05960c3d1668e6bee7624ed886cbd135ba
# bad: [018e0e3594f7dcd029d258e368c485e742fa9cdb] habanalabs: rate limit error msg on waiting for CS
git bisect bad 018e0e3594f7dcd029d258e368c485e742fa9cdb
# good: [ec939e4c94bd3ef2fd4f34c15f8aaf79bd0c5ee1] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good ec939e4c94bd3ef2fd4f34c15f8aaf79bd0c5ee1
# good: [7c3ddc6b038feaa9a05b09c3f20e64fed50f9a3f] Merge tag 'ti-k3-soc-for-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kristo/linux into arm/dt
git bisect good 7c3ddc6b038feaa9a05b09c3f20e64fed50f9a3f
# bad: [9feb1af97e7366b512ecb9e4dd61d3252074cda3] Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block
git bisect bad 9feb1af97e7366b512ecb9e4dd61d3252074cda3
# good: [b08baef02b26cf7c2123e4a24a2fa1fb7a593ffb] Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good b08baef02b26cf7c2123e4a24a2fa1fb7a593ffb
# good: [3f1266ec704d3efcfc8179c71bed9a75963b6344] Merge tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
git bisect good 3f1266ec704d3efcfc8179c71bed9a75963b6344
# bad: [bca1c43cb2dbe4212aea0793bfd91aeb4c2d184d] null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
git bisect bad bca1c43cb2dbe4212aea0793bfd91aeb4c2d184d
# bad: [5c4bd1f40c23d08ffbdccd68a5fd63751c794d89] null_blk: fix zone size paramter check
git bisect bad 5c4bd1f40c23d08ffbdccd68a5fd63751c794d89
# bad: [03b1230ca12a12e045d83b0357792075bf94a1e0] io_uring: ensure async punted sendmsg/recvmsg requests copy data
git bisect bad 03b1230ca12a12e045d83b0357792075bf94a1e0
# good: [490547ca2df66b8413bce97cb651630f2c531487] block: don't send uevent for empty disk when not invalidating
git bisect good 490547ca2df66b8413bce97cb651630f2c531487
# bad: [f67676d160c6ee2ed82917fadfed6d29cab8237c] io_uring: ensure async punted read/write requests copy iovec
git bisect bad f67676d160c6ee2ed82917fadfed6d29cab8237c
# good: [1a6b74fc87024db59d41cd7346bd437f20fb3e2d] io_uring: add general async offload context
git bisect good 1a6b74fc87024db59d41cd7346bd437f20fb3e2d
# first bad commit: [f67676d160c6ee2ed82917fadfed6d29cab8237c] io_uring: ensure async punted read/write requests copy iove
$

The bisect is good, but I've got no idea how this commit is
triggering page cache corruption on read IO....

-Dave.
Comment 4 Dave Chinner 2020-08-10 07:08:15 UTC
[cc Jens]

[Jens, data corruption w/ io_uring and simple fio reproducer. see
the bz link below.]

On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > On Fri, Aug 07, 2020 at 03:12:03AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > > wrote:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > 
> > > >             Bug ID: 208827
> > > >            Summary: [fio io_uring] io_uring write data crc32c verify
> > > >                     failed
> > > >            Product: File System
> > > >            Version: 2.5
> > > >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > 
> > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > so this isn't related to contents of the XFS dev tree at all...
> > 
> > In fact, this bug isn't a recent regression. AFAICT, it was
> > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > reproduce. More info once I've finished bisecting it....
> 
> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Mon Dec 2 11:03:47 2019 -0700
> 
>     io_uring: ensure async punted read/write requests copy iovec

....

Ok, I went back to vanilla 5.8 to continue debugging and adding
tracepoints, and it's proving strangely difficult to reproduce now.

However, I did just hit this:

[ 4980.136032] ------------[ cut here ]------------
[ 4980.137665] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ef911b51>] prepare_to_wait_exclusive+0x3d/0xd0
[ 4980.141403] WARNING: CPU: 13 PID: 6800 at kernel/sched/core.c:6888 __might_sleep+0x74/0x80
[ 4980.143940] CPU: 13 PID: 6800 Comm: fio Not tainted 5.8.0-dgc+ #2549
[ 4980.146147] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 4980.148774] RIP: 0010:__might_sleep+0x74/0x80
[ 4980.150455] Code: ff 41 5c 41 5d 41 5e 5d c3 48 8b 90 30 22 00 00 48 c7 c7 a8 b9 50 82 c6 05 38 e4 9a 01 01 48 8b 70 10 48 89 d1 e8 fa 5c fc ff <0f> 0b eb c5 0f 1f 84 00 00 00 002
[ 4980.156255] RSP: 0018:ffffc90005383c58 EFLAGS: 00010282
[ 4980.158299] RAX: 0000000000000000 RBX: 0000561a18122000 RCX: 0000000000000000
[ 4980.160817] RDX: ffff88883eca7de0 RSI: ffff88883ec97a80 RDI: ffff88883ec97a80
[ 4980.163162] RBP: ffffc90005383c70 R08: ffff88883ec97a80 R09: ffff8888070f3000
[ 4980.165635] R10: ffff8888070f3434 R11: ffff8888070f3434 R12: ffffffff8251f46e
[ 4980.168115] R13: 00000000000001ba R14: 0000000000000000 R15: ffff888235647740
[ 4980.170714] FS:  00007f80de7af700(0000) GS:ffff88883ec80000(0000) knlGS:0000000000000000
[ 4980.173442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4980.175371] CR2: 00007f80d4005008 CR3: 00000005eb01c004 CR4: 0000000000060ee0
[ 4980.177607] Call Trace:
[ 4980.178583]  change_protection+0x827/0x9e0
[ 4980.180063]  ? kvm_clock_read+0x18/0x30
[ 4980.181654]  ? kvm_sched_clock_read+0x9/0x20
[ 4980.183426]  ? sysvec_apic_timer_interrupt+0x46/0x90
[ 4980.185160]  change_prot_numa+0x19/0x30
[ 4980.186607]  task_numa_work+0x1c7/0x2e0
[ 4980.188003]  task_work_run+0x64/0xb0
[ 4980.189488]  io_cqring_wait+0x118/0x290
[ 4980.191136]  ? io_uring_poll+0x80/0x80
[ 4980.192430]  __x64_sys_io_uring_enter+0x158/0x2d0
[ 4980.194033]  do_syscall_64+0x46/0x90
[ 4980.195278]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4980.197092] RIP: 0033:0x7f8135c59a79
[ 4980.198412] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b8
[ 4980.203953] RSP: 002b:00007f80de7aacd8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[ 4980.206596] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f8135c59a79
[ 4980.208924] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
[ 4980.211334] RBP: 00007f80defb1000 R08: 0000000000000000 R09: 0000000000000000
[ 4980.213656] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
[ 4980.216045] R13: 0000000000000001 R14: 0000561a1820cd60 R15: 0000000000000000
[ 4980.218469] ---[ end trace 4a8ca123102be9c2 ]---

No idea if it is relevant to the data corruption, but there's
definitely something not right here...

Cheers,

Dave.
Comment 5 Dave Chinner 2020-08-10 09:09:04 UTC
On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> [cc Jens]
> 
> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> the bz link below.]
> 
> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 07, 2020 at 03:12:03AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > > > wrote:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > > 
> > > > >             Bug ID: 208827
> > > > >            Summary: [fio io_uring] io_uring write data crc32c verify
> > > > >                     failed
> > > > >            Product: File System
> > > > >            Version: 2.5
> > > > >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > > 
> > > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > > so this isn't related to contents of the XFS dev tree at all...
> > > 
> > > In fact, this bug isn't a recent regression. AFAICT, it was
> > > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > > reproduce. More info once I've finished bisecting it....
> > 
> > f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> > commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> > Author: Jens Axboe <axboe@kernel.dk>
> > Date:   Mon Dec 2 11:03:47 2019 -0700
> > 
> >     io_uring: ensure async punted read/write requests copy iovec
> 
> ....
> 
> Ok, I went back to vanilla 5.8 to continue debugging and adding
> tracepoints, and it's proving strangely difficult to reproduce now.

Which turns out to be caused by a tracepoint I inserted to try to
narrow down if this was an invalidation race. I put this in
invalidate_inode_page():


--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
        struct address_space *mapping = page_mapping(page);
        if (!mapping)
                return 0;
-       if (PageDirty(page) || PageWriteback(page))
+       if (PageDirty(page) || PageWriteback(page)) {
+               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
+                       mapping->host->i_ino, page, page->index * PAGE_SIZE);
                return 0;
+       }
        if (page_mapped(page))
                return 0;
        return invalidate_complete_page(mapping, page);


And that alone, without even enabling tracepoints, made the
corruption go completely away. So I suspected a page state race
condition and looked at POSIX_FADV_DONTNEED, which fio issues
before running its verification reads. The first thing that does is:

	if (!inode_write_congested(mapping->host))
		__filemap_fdatawrite_range(mapping, offset, endbyte,
					   WB_SYNC_NONE);

It starts async writeback of the dirty pages. There's 256MB of dirty
pages on these inodes, and iomap tracing indicates the entire 256MB
immediately runs through the trace_iomap_writepage() tracepoint.
i.e. every page goes Dirty -> Writeback and is submitted for async
IO.

Then the POSIX_FADV_DONTNEED code goes and runs
invalidate_mapping_pages(), which ends up try-locking each page and
then running invalidate_inode_page() on the page, which is where the
trace debug I put in on pages under writeback gets hit. So if
changing the invalidation code for pages under writeback makes the
problem go away, then stopping invalidate_mapping_pages() from
racing with page writeback should make the problem go away, too.

This does indeed make the corruption go away:

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
        case POSIX_FADV_NOREUSE:
                break;
        case POSIX_FADV_DONTNEED:
                if (!inode_write_congested(mapping->host))
-                       __filemap_fdatawrite_range(mapping, offset, endbyte,
-                                                  WB_SYNC_NONE);
+                       filemap_write_and_wait_range(mapping, offset, endbyte);
 
                /*
                 * First and last FULL page! Partial pages are deliberately

by making the invalidation wait for the pages to go fully to the
clean state before starting.

This, however, only fixes the specific symptom being tripped over
here.  To further test this, I removed this writeback from
POSIX_FADV_DONTNEED completely so I could trigger writeback via
controlled background writeback. And, as I expected, whenever
background writeback ran to write back these dirty files, the
verification failures triggered again. It is quite reliable.

So it looks like there is some kind of writeback completion vs page
invalidation race condition occurring, but more work is needed to
isolate it further. I don't know what part the async read plays in
the corruption yet, because I don't know how we are getting pages in
the cache where page->index != the file offset stamped in the data.
That smells of leaking PageUptodate flags...

-Dave.
Comment 6 Jens Axboe 2020-08-11 01:07:11 UTC
On 8/10/20 1:08 AM, Dave Chinner wrote:
> [cc Jens]
> 
> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> the bz link below.]
> 
> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>> bugzilla-daemon@bugzilla.kernel.org
>>>> wrote:
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>
>>>>>             Bug ID: 208827
>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>                     failed
>>>>>            Product: File System
>>>>>            Version: 2.5
>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>
>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>> so this isn't related to contents of the XFS dev tree at all...
>>>
>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>> reproduce. More info once I've finished bisecting it....
>>
>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>
>>     io_uring: ensure async punted read/write requests copy iovec

I don't think this commit is related to the issue at all, but I think
we're probably on the same page with that. It's probably just changing
things slightly enough to avoid the race.

> Ok, I went back to vanilla 5.8 to continue debugging and adding
> tracepoints, and it's proving strangely difficult to reproduce now.
> 
> However, I did just hit this:
> 
> [ 4980.136032] ------------[ cut here ]------------
> [ 4980.137665] do not call blocking ops when !TASK_RUNNING; state=1 set at
> [<00000000ef911b51>] prepare_to_wait_exclusive+0x3d/0xd0
> [ 4980.141403] WARNING: CPU: 13 PID: 6800 at kernel/sched/core.c:6888
> __might_sleep+0x74/0x80
> [ 4980.143940] CPU: 13 PID: 6800 Comm: fio Not tainted 5.8.0-dgc+ #2549
> [ 4980.146147] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1 04/01/2014
> [ 4980.148774] RIP: 0010:__might_sleep+0x74/0x80
> [ 4980.150455] Code: ff 41 5c 41 5d 41 5e 5d c3 48 8b 90 30 22 00 00 48 c7 c7
> a8 b9 50 82 c6 05 38 e4 9a 01 01 48 8b 70 10 48 89 d1 e8 fa 5c fc ff <0f> 0b
> eb c5 0f 1f 84 00 00 00 002
> [ 4980.156255] RSP: 0018:ffffc90005383c58 EFLAGS: 00010282
> [ 4980.158299] RAX: 0000000000000000 RBX: 0000561a18122000 RCX:
> 0000000000000000
> [ 4980.160817] RDX: ffff88883eca7de0 RSI: ffff88883ec97a80 RDI:
> ffff88883ec97a80
> [ 4980.163162] RBP: ffffc90005383c70 R08: ffff88883ec97a80 R09:
> ffff8888070f3000
> [ 4980.165635] R10: ffff8888070f3434 R11: ffff8888070f3434 R12:
> ffffffff8251f46e
> [ 4980.168115] R13: 00000000000001ba R14: 0000000000000000 R15:
> ffff888235647740
> [ 4980.170714] FS:  00007f80de7af700(0000) GS:ffff88883ec80000(0000)
> knlGS:0000000000000000
> [ 4980.173442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4980.175371] CR2: 00007f80d4005008 CR3: 00000005eb01c004 CR4:
> 0000000000060ee0
> [ 4980.177607] Call Trace:
> [ 4980.178583]  change_protection+0x827/0x9e0
> [ 4980.180063]  ? kvm_clock_read+0x18/0x30
> [ 4980.181654]  ? kvm_sched_clock_read+0x9/0x20
> [ 4980.183426]  ? sysvec_apic_timer_interrupt+0x46/0x90
> [ 4980.185160]  change_prot_numa+0x19/0x30
> [ 4980.186607]  task_numa_work+0x1c7/0x2e0
> [ 4980.188003]  task_work_run+0x64/0xb0
> [ 4980.189488]  io_cqring_wait+0x118/0x290

I'll get this one:

commit 4c6e277c4cc4a6b3b2b9c66a7b014787ae757cc1
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Jul 1 11:29:10 2020 -0600

    io_uring: abstract out task work running

queued up for stable. Should not be related to this at all, current -git
shouldn't run into this.
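
For reference, the gist of that change is to put the task back into
TASK_RUNNING before running task work, since task work may block while
the task is still marked as sleeping from prepare_to_wait_exclusive().
A rough sketch from memory, not the actual patch:

/* Sketch only: task_work_run() can block, so it must not run while the
 * task is still in TASK_INTERRUPTIBLE from an earlier prepare_to_wait()
 * - that is what the __might_sleep() warning above is complaining
 * about. */
static inline bool io_run_task_work(void)
{
	if (current->task_works) {
		__set_current_state(TASK_RUNNING);
		task_work_run();
		return true;
	}
	return false;
}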
Comment 7 Jens Axboe 2020-08-11 01:15:34 UTC
On 8/10/20 3:08 AM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>> [cc Jens]
>>
>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>> the bz link below.]
>>
>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>> wrote:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>
>>>>>>             Bug ID: 208827
>>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>                     failed
>>>>>>            Product: File System
>>>>>>            Version: 2.5
>>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>
>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>
>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>> reproduce. More info once I've finished bisecting it....
>>>
>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>> Author: Jens Axboe <axboe@kernel.dk>
>>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>>
>>>     io_uring: ensure async punted read/write requests copy iovec
>>
>> ....
>>
>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>> tracepoints, and it's proving strangely difficult to reproduce now.
> 
> Which turns out to be caused by a tracepoint I inserted to try to
> narrow down if this was an invalidation race. I put this in
> invalidate_complete_page:
> 
> 
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>         struct address_space *mapping = page_mapping(page);
>         if (!mapping)
>                 return 0;
> -       if (PageDirty(page) || PageWriteback(page))
> +       if (PageDirty(page) || PageWriteback(page)) {
> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
> +                       mapping->host->i_ino, page, page->index * PAGE_SIZE);
>                 return 0;
> +       }
>         if (page_mapped(page))
>                 return 0;
>         return invalidate_complete_page(mapping, page);
> 
> 
> And that alone, without even enabling tracepoints, made the
> corruption go completely away. So I suspect a page state race
> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
> before running it's verification reads. First thing that does:
> 
>       if (!inode_write_congested(mapping->host))
>               __filemap_fdatawrite_range(mapping, offset, endbyte,
>                                          WB_SYNC_NONE);
> 
> It starts async writeback of the dirty pages. There's 256MB of dirty
> pages on these inodes, and iomap tracing indicates the entire 256MB
> immediately runs through the trace_iomap_writepage() tracepoint.
> i.e. every page goes Dirty -> Writeback and is submitted for async
> IO.
> 
> Then the POSIX_FADV_DONTNEED code goes and runs
> invalidate_mapping_pages(), which ends up try-locking each page and
> then running invalidate_inode_page() on the page, which is where the
> trace debug I put in on pages under writeback gets hit. So if
> changing the invalidation code for pages under writeback makes the
> problem go away, then stopping invalidate_mapping_pages() from
> racing with page writeback should make the problem go away, too.
> 
> This does indeed make the corruption go away:
> 
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
> loff_t len, int advice)
>         case POSIX_FADV_NOREUSE:
>                 break;
>         case POSIX_FADV_DONTNEED:
>                 if (!inode_write_congested(mapping->host))
> -                       __filemap_fdatawrite_range(mapping, offset, endbyte,
> -                                                  WB_SYNC_NONE);
> +                       filemap_write_and_wait_range(mapping, offset,
> endbyte);
>  
>                 /*
>                  * First and last FULL page! Partial pages are deliberately
> 
> by making the invalidation wait for the pages to go fully to the
> clean state before starting.
> 
> This, however, only fixes the specific symptom being tripped over
> here.  To further test this, I removed this writeback from
> POSIX_FADV_DONTNEED completely so I could trigger writeback via
> controlled background writeback. And, as I expected, whenever
> background writeback ran to write back these dirty files, the
> verification failures triggered again. It is quite reliable.
> 
> So it looks like there is some kind of writeback completion vs page
> invalidation race condition occurring, but more work is needed to
> isolate it further. I don't know what part the async read plays in
> the corruption yet, because I don't know how we are getting pages in
> the cache where page->index != the file offset stamped in the data.
> That smells of leaking PageUptodate flags...

The async read really isn't doing anything that you could not do with
separate threads. Unfortunately it's not that easy to have multiple
threads working on the same region with fio, or we could've reproduced
it with a job file written to use that instead.

I'll dig a bit here...
Comment 8 Jens Axboe 2020-08-11 01:50:19 UTC
On 8/10/20 7:15 PM, Jens Axboe wrote:
> On 8/10/20 3:08 AM, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>> [cc Jens]
>>>
>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>> the bz link below.]
>>>
>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>> wrote:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>
>>>>>>>             Bug ID: 208827
>>>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>                     failed
>>>>>>>            Product: File System
>>>>>>>            Version: 2.5
>>>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>
>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>
>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>> reproduce. More info once I've finished bisecting it....
>>>>
>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>>>
>>>>     io_uring: ensure async punted read/write requests copy iovec
>>>
>>> ....
>>>
>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>
>> Which turns out to be caused by a tracepoint I inserted to try to
>> narrow down if this was an invalidation race. I put this in
>> invalidate_complete_page:
>>
>>
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>         struct address_space *mapping = page_mapping(page);
>>         if (!mapping)
>>                 return 0;
>> -       if (PageDirty(page) || PageWriteback(page))
>> +       if (PageDirty(page) || PageWriteback(page)) {
>> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>> +                       mapping->host->i_ino, page, page->index *
>> PAGE_SIZE);
>>                 return 0;
>> +       }
>>         if (page_mapped(page))
>>                 return 0;
>>         return invalidate_complete_page(mapping, page);
>>
>>
>> And that alone, without even enabling tracepoints, made the
>> corruption go completely away. So I suspect a page state race
>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>> before running it's verification reads. First thing that does:
>>
>>      if (!inode_write_congested(mapping->host))
>>              __filemap_fdatawrite_range(mapping, offset, endbyte,
>>                                         WB_SYNC_NONE);
>>
>> It starts async writeback of the dirty pages. There's 256MB of dirty
>> pages on these inodes, and iomap tracing indicates the entire 256MB
>> immediately runs through the trace_iomap_writepage() tracepoint.
>> i.e. every page goes Dirty -> Writeback and is submitted for async
>> IO.
>>
>> Then the POSIX_FADV_DONTNEED code goes and runs
>> invalidate_mapping_pages(), which ends up try-locking each page and
>> then running invalidate_inode_page() on the page, which is where the
>> trace debug I put in on pages under writeback gets hit. So if
>> changing the invalidation code for pages under writeback makes the
>> problem go away, then stopping invalidate_mapping_pages() from
>> racing with page writeback should make the problem go away, too.
>>
>> This does indeed make the corruption go away:
>>
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
>> loff_t len, int advice)
>>         case POSIX_FADV_NOREUSE:
>>                 break;
>>         case POSIX_FADV_DONTNEED:
>>                 if (!inode_write_congested(mapping->host))
>> -                       __filemap_fdatawrite_range(mapping, offset, endbyte,
>> -                                                  WB_SYNC_NONE);
>> +                       filemap_write_and_wait_range(mapping, offset,
>> endbyte);
>>  
>>                 /*
>>                  * First and last FULL page! Partial pages are deliberately
>>
>> by making the invalidation wait for the pages to go fully to the
>> clean state before starting.
>>
>> This, however, only fixes the specific symptom being tripped over
>> here.  To further test this, I removed this writeback from
>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>> controlled background writeback. And, as I expected, whenever
>> background writeback ran to write back these dirty files, the
>> verification failures triggered again. It is quite reliable.
>>
>> So it looks like there is some kind of writeback completion vs page
>> invalidation race condition occurring, but more work is needed to
>> isolate it further. I don't know what part the async read plays in
>> the corruption yet, because I don't know how we are getting pages in
>> the cache where page->index != the file offset stamped in the data.
>> That smells of leaking PageUptodate flags...
> 
> The async read really isn't doing anything that you could not do with
> separate threads. Unfortunately it's not that easy to have multiple
> threads working on the same region with fio, or we could've reproduced
> it with a job file written to use that instead.
> 
> I'll dig a bit here...

Have we verified that the actual page cache is inconsistent, or is that
just an assumption? I'm asking since I poked a bit on the fio side, and
suspiciously the failed verification was a short IO. At least
originally, fio will retry those, but it could be a bug in the io_uring
engine for fio.

I'll poke some more.
Comment 9 Dave Chinner 2020-08-11 02:00:58 UTC
On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> > [cc Jens]
> > 
> > [Jens, data corruption w/ io_uring and simple fio reproducer. see
> > the bz link below.]

Looks like an io_uring/fio bug at this point, Jens. All your go-fast
bits turn the buffered read into a short read, and neither fio nor
the io_uring async buffered read path handles short reads. Details
below.

Also, the block_rq_* trace points appear to emit corrupt trace
records that crash trace-cmd. That probably needs to be looked into
as well:

$ sudo trace-cmd record -e xfs_file\* -e iomap\* -e block_bio\* -e block_rq_complete -e printk fio tests/io_uring_corruption.fio
.....
$ trace-cmd report > s.t
ug! no event found for type 4228
ug! no event found for type 32
ug! negative record size -4
ug! no event found for type 16722
ug! no event found for type 0
ug! no event found for type 4228
ug! no event found for type 32
ug! negative record size -4
ug! no event found for type 16722
ug! no event found for type 4228
ug! no event found for type 4230
ug! no event found for type 32
ug! no event found for type 0
ug! no event found for type 4230
ug! no event found for type 32
ug! no event found for type 0
ug! no event found for type 32
ug! negative record size -4
Segmentation fault
$



> > On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> > > On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 07, 2020 at 03:12:03AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > > > On Thu, Aug 06, 2020 at 04:57:58AM +0000,
> bugzilla-daemon@bugzilla.kernel.org
> > > > > wrote:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > > > 
> > > > > >             Bug ID: 208827
> > > > > >            Summary: [fio io_uring] io_uring write data crc32c
> verify
> > > > > >                     failed
> > > > > >            Product: File System
> > > > > >            Version: 2.5
> > > > > >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > > > 
> > > > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > > > so this isn't related to contents of the XFS dev tree at all...
> > > > 
> > > > In fact, this bug isn't a recent regression. AFAICT, it was
> > > > introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
> > > > reproduce. More info once I've finished bisecting it....
> > > 
> > > f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> > > commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> > > Author: Jens Axboe <axboe@kernel.dk>
> > > Date:   Mon Dec 2 11:03:47 2019 -0700
> > > 
> > >     io_uring: ensure async punted read/write requests copy iovec
> > 
> > ....
> > 
> > Ok, I went back to vanilla 5.8 to continue debugging and adding
> > tracepoints, and it's proving strangely difficult to reproduce now.
> 
> Which turns out to be caused by a tracepoint I inserted to try to
> narrow down if this was an invalidation race. I put this in
> invalidate_complete_page:
> 
> 
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>         struct address_space *mapping = page_mapping(page);
>         if (!mapping)
>                 return 0;
> -       if (PageDirty(page) || PageWriteback(page))
> +       if (PageDirty(page) || PageWriteback(page)) {
> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
> +                       mapping->host->i_ino, page, page->index * PAGE_SIZE);
>                 return 0;
> +       }
>         if (page_mapped(page))
>                 return 0;
>         return invalidate_complete_page(mapping, page);

So I updated to the current 5.9 which has all the new async buffered
read stuff in it to see if that fixed the problem, but it didn't.
Instead, it actually made this problem occur almost instantaneously.

I could put the above trace code back in and it didn't perturb the
situation, either. Neither did writeback, which I was now able to
confirm didn't actually impact this problem at all.

The problem is, instead, reads racing with readahead completion.
There is no page cache corruption at all; fio is just reusing
buffers that contain stale data from a previous read IO, and hence
any part of the buffer that doesn't get updated by the read IO ends
up being detected as "corrupt" because it already contains stale
data.

The problem is clear from this sequence trace:

     io_uring-sq-4518  [012]    52.806976: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
     io_uring-sq-4518  [012]    52.806987: iomap_readahead:      dev 253:32 ino 0x86 nr_pages 16
     io_uring-sq-4518  [012]    52.806987: iomap_apply:          dev 253:32 ino 0x86 pos 116654080 length 65536 flags  (0x0) ops xfs_read_iomap_ops caller iomap_readahead actor iomap_readahead_actor
     io_uring-sq-4518  [012]    52.806988: iomap_apply_dstmap:   dev 253:32 ino 0x86 bdev 253:32 addr 922058752 offset 116654080 length 65536 type MAPPED flags 
     io_uring-sq-4518  [012]    52.806990: block_bio_queue:      253,32 RA 1800896 + 128 [io_uring-sq]
....
     io_uring-sq-4518  [012]    52.806992: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffff5
     io_uring-sq-4518  [012]    52.806993: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
     io_uring-sq-4518  [012]    52.806994: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffdef
     io_uring-sq-4518  [012]    52.806995: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6bf0000 count 0x10000
....
             fio-4515  [013]    52.807855: block_rq_complete:    253,32 RA () 1800896 + 128 [0]
     io_uring-sq-4518  [012]    52.807863: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
     io_uring-sq-4518  [012]    52.807866: xfs_file_buffered_read: dev 253:32 ino 0x86 size 0x10000000 offset 0x6f41000 count 0x1000
....
             fio-4515  [013]    52.807871: block_rq_complete:    253,32 RA () 1752640 + 128 [0]
             fio-4515  [013]    52.807876: block_rq_complete:    253,32 RA () 1675200 + 128 [0]
             fio-4515  [013]    52.807880: block_rq_complete:    253,32 RA () 1652672 + 128 [0]

What we see is -3- calls to do the buffered read. The first call
triggers readahead and we can see iomap map the file offset and
issue the bio. This is IOCB_NOWAIT context, so this is actually
broken as we can block for a long time in this path (e.g. on
inode locks).

Once the IO is issued, we see a second xfs_file_buffered_read trace
with the return value in the count - that's -EAGAIN because
IOCB_NOWAIT is set and now we are going to block waiting for IO.

Immediately after this, the io_uring code issues the buffered read
again, only this time with IOCB_WAITQ|IOCB_NOWAIT set and it hits
the "page not up to date" path. This then hits
__wait_on_page_locked_async() which sets the PageWaiter bit and
returns -EIOCBQUEUED to tell the caller that it has been queued.

About a millisecond later, the readahead IO completes, and the block
layer logs it. This ends up calling iomap_read_end_io() which walks
each page(bvec) in the bio, marks them uptodate and unlocks them.
The first page unlock sees the PageWaiter bit, which then does a

	wake_up_page_bit(page, PG_locked);

because the PageWaiter bit is set. We see the same io_uring worker
thread then immediately re-issue the buffered read and it does it
with IOCB_WAITQ|IOCB_NOWAIT context. This gets the first page, then
finds the second page not up to date or has to block trying to lock
it. It aborts the read at that point, and because we've already read
a page it returns 4096 bytes read.

I left the post-buffered read IO completions that were being run by
the block layer to show that the buffered read was issued and
completed while the block layer was still processing the readahead
bio completion.

At this point (a positive byte count being returned) the io_uring
layer considers the IO done and returns the result to the userspace
app. Fio then completely fails to check for a short read despite
being in "verify" mode and instead yells about corruption due to
stale data that it had left in the buffer it passed to the
kernel....

IOWs, there does not appear to be a filesystem or page cache issue
here at all - it's just an unhandled short read.
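
In other words, whoever consumes the completion has to treat a
positive-but-short result the same way a short read(2) would be
treated and resubmit the remainder. A minimal liburing-style sketch of
that, just to illustrate the point - this is not fio's engine code,
and error handling is trimmed:

#include <liburing.h>
#include <sys/types.h>

/* Read 'len' bytes at 'off' from 'fd', resubmitting after short reads
 * instead of treating the buffer as fully populated. Sketch only: real
 * code would batch submissions and track multiple in-flight requests. */
static int read_fully(struct io_uring *ring, int fd, char *buf,
		      unsigned int len, off_t off)
{
	while (len) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int ret;

		if (!sqe)
			return -1;
		io_uring_prep_read(sqe, fd, buf, len, off);
		io_uring_submit(ring);
		if (io_uring_wait_cqe(ring, &cqe) < 0)
			return -1;
		ret = cqe->res;
		io_uring_cqe_seen(ring, cqe);
		if (ret < 0)
			return ret;	/* hard error */
		if (ret == 0)
			break;		/* EOF */
		/* Short read: advance and go again. */
		buf += ret;
		off += ret;
		len -= ret;
	}
	return len ? -1 : 0;
}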

Jens, if strace worked on io_uring, I wouldn't have wasted 2 and a
half days tracking this down. It would have taken 5 minutes with
strace and it would have been totally obvious. I'd suggest this
needs addressing as a matter of priority over everything else for
io_uring....

Cheers, 

Dave.
Comment 10 Jens Axboe 2020-08-11 02:01:31 UTC
On 8/10/20 7:50 PM, Jens Axboe wrote:
> On 8/10/20 7:15 PM, Jens Axboe wrote:
>> On 8/10/20 3:08 AM, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>> [cc Jens]
>>>>
>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>> the bz link below.]
>>>>
>>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>>> wrote:
>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>>
>>>>>>>>             Bug ID: 208827
>>>>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>>                     failed
>>>>>>>>            Product: File System
>>>>>>>>            Version: 2.5
>>>>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>>
>>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>>
>>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>>> introduced between in 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>>> reproduce. More info once I've finished bisecting it....
>>>>>
>>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>>>>
>>>>>     io_uring: ensure async punted read/write requests copy iovec
>>>>
>>>> ....
>>>>
>>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>>
>>> Which turns out to be caused by a tracepoint I inserted to try to
>>> narrow down if this was an invalidation race. I put this in
>>> invalidate_complete_page:
>>>
>>>
>>> --- a/mm/truncate.c
>>> +++ b/mm/truncate.c
>>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>>         struct address_space *mapping = page_mapping(page);
>>>         if (!mapping)
>>>                 return 0;
>>> -       if (PageDirty(page) || PageWriteback(page))
>>> +       if (PageDirty(page) || PageWriteback(page)) {
>>> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>>> +                       mapping->host->i_ino, page, page->index *
>>> PAGE_SIZE);
>>>                 return 0;
>>> +       }
>>>         if (page_mapped(page))
>>>                 return 0;
>>>         return invalidate_complete_page(mapping, page);
>>>
>>>
>>> And that alone, without even enabling tracepoints, made the
>>> corruption go completely away. So I suspect a page state race
>>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>>> before running it's verification reads. First thing that does:
>>>
>>>     if (!inode_write_congested(mapping->host))
>>>             __filemap_fdatawrite_range(mapping, offset, endbyte,
>>>                                        WB_SYNC_NONE);
>>>
>>> It starts async writeback of the dirty pages. There's 256MB of dirty
>>> pages on these inodes, and iomap tracing indicates the entire 256MB
>>> immediately runs through the trace_iomap_writepage() tracepoint.
>>> i.e. every page goes Dirty -> Writeback and is submitted for async
>>> IO.
>>>
>>> Then the POSIX_FADV_DONTNEED code goes and runs
>>> invalidate_mapping_pages(), which ends up try-locking each page and
>>> then running invalidate_inode_page() on the page, which is where the
>>> trace debug I put in on pages under writeback gets hit. So if
>>> changing the invalidation code for pages under writeback makes the
>>> problem go away, then stopping invalidate_mapping_pages() from
>>> racing with page writeback should make the problem go away, too.
>>>
>>> This does indeed make the corruption go away:
>>>
>>> --- a/mm/fadvise.c
>>> +++ b/mm/fadvise.c
>>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset,
>>> loff_t len, int advice)
>>>         case POSIX_FADV_NOREUSE:
>>>                 break;
>>>         case POSIX_FADV_DONTNEED:
>>>                 if (!inode_write_congested(mapping->host))
>>> -                       __filemap_fdatawrite_range(mapping, offset,
>>> endbyte,
>>> -                                                  WB_SYNC_NONE);
>>> +                       filemap_write_and_wait_range(mapping, offset,
>>> endbyte);
>>>  
>>>                 /*
>>>                  * First and last FULL page! Partial pages are deliberately
>>>
>>> by making the invalidation wait for the pages to go fully to the
>>> clean state before starting.
>>>
>>> This, however, only fixes the specific symptom being tripped over
>>> here.  To further test this, I removed this writeback from
>>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>>> controlled background writeback. And, as I expected, whenever
>>> background writeback ran to write back these dirty files, the
>>> verification failures triggered again. It is quite reliable.
>>>
>>> So it looks like there is some kind of writeback completion vs page
>>> invalidation race condition occurring, but more work is needed to
>>> isolate it further. I don't know what part the async read plays in
>>> the corruption yet, because I don't know how we are getting pages in
>>> the cache where page->index != the file offset stamped in the data.
>>> That smells of leaking PageUptodate flags...
>>
>> The async read really isn't doing anything that you could not do with
>> separate threads. Unfortunately it's not that easy to have multiple
>> threads working on the same region with fio, or we could've reproduced
>> it with a job file written to use that instead.
>>
>> I'll dig a bit here...
> 
> Have we verified that the actual page cache is inconsistent, or is that
> just an assumption? I'm asking since I poked a bit on the fio side, and
> suspiciously the failed verification was a short IO. At least
> originally, fio will retry those, but it could be a bug in the io_uring
> engine for fio.
> 
> I'll poke some more.

The on-disk state seems sane. I added a hack that clears the rest of the
buffer to 0x5a when we get a short read. When the verify fails, the io_u
that was being verified looks like this:

0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a
*
0010000

where fio dumps this as the expected data:

0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877

Fio says the offset is 62652416 in the file, and reading
that 64k block from the file and dumping it:

0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877

So it seems to me like the file state is consistent, at least after the
run, and that this seems more likely to be a fio issue with short
read handling.
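
For reference, a minimal sketch of the sentinel-fill idea described above
(an illustration only, not fio's actual code): after a short completion,
the unfilled tail of the verify buffer is painted with 0x5a, so a failed
verification shows exactly where the real data stopped, as in the first
hexdump.

#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: 'got' is what the read returned into a
 * 'buflen'-sized verify buffer; everything past it becomes 0x5a.
 */
static void mark_short_read(unsigned char *buf, size_t buflen, ssize_t got)
{
	if (got < 0)
		got = 0;
	if ((size_t)got < buflen)
		memset(buf + got, 0x5a, buflen - (size_t)got);
}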

Poking along...
Comment 11 Jens Axboe 2020-08-11 02:20:00 UTC
On 8/10/20 8:00 PM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>> [cc Jens]
>>>
>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>> the bz link below.]
> 
> Looks like an io_uring/fio bug at this point, Jens. All your go-fast
> bits turn the buffered read into a short read, and neither fio nor the
> io_uring async buffered read path handles short reads. Details below.

It's a fio issue. The io_uring engine uses a different path for short
IO completions, and that's being ignored by the backend... Hence the
IO just gets completed and not retried for this case, and that'll then
trigger verification as if it did complete. I'm fixing it up.
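
For illustration, here is a minimal liburing sketch of the retry that was
missing (assumptions: a plain IORING_OP_READ via io_uring_prep_read(), one
buffer, no registered files - this is not the actual fio engine code). On
a completion with 0 < res < remaining, the consumer has to resubmit the
remainder at the advanced offset instead of treating the IO as done:

#include <errno.h>
#include <sys/types.h>
#include <liburing.h>

/* Read 'len' bytes at 'off' from 'fd', resubmitting after short reads. */
static int read_fully(struct io_uring *ring, int fd, char *buf,
		      size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int ret;

		if (!sqe)
			return -EBUSY;
		io_uring_prep_read(sqe, fd, buf + done, len - done, off + done);
		io_uring_submit(ring);

		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		ret = cqe->res;
		io_uring_cqe_seen(ring, cqe);

		if (ret < 0)
			return ret;	/* hard error */
		if (ret == 0)
			break;		/* EOF */
		done += ret;		/* short read: go around again */
	}
	return done == len ? 0 : -EIO;
}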

> Also, the block_rq_* trace points appear to emit corrupt trace
> records that crash trace-cmd. That probably needs to be looked into
> as well:
> 
> $ sudo trace-cmd record -e xfs_file\* -e iomap\* -e block_bio\* -e
> block_rq_complete -e printk fio tests/io_uring_corruption.fio
> .....
> $ trace-cmd report > s.t
> ug! no event found for type 4228
> ug! no event found for type 32
> ug! negative record size -4
> ug! no event found for type 16722
> ug! no event found for type 0
> ug! no event found for type 4228
> ug! no event found for type 32
> ug! negative record size -4
> ug! no event found for type 16722
> ug! no event found for type 4228
> ug! no event found for type 4230
> ug! no event found for type 32
> ug! no event found for type 0
> ug! no event found for type 4230
> ug! no event found for type 32
> ug! no event found for type 0
> ug! no event found for type 32
> ug! negative record size -4
> Segmentation fault
> $

No idea what that is; I regularly use "normal" blktrace and it works just
fine for me.

>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>> wrote:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>
>>>>>>>             Bug ID: 208827
>>>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>                     failed
>>>>>>>            Product: File System
>>>>>>>            Version: 2.5
>>>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>
>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>
>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>> introduced between 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>> reproduce. More info once I've finished bisecting it....
>>>>
>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>>>
>>>>     io_uring: ensure async punted read/write requests copy iovec
>>>
>>> ....
>>>
>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>
>> Which turns out to be caused by a tracepoint I inserted to try to
>> narrow down if this was an invalidation race. I put this in
>> invalidate_complete_page:
>>
>>
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>         struct address_space *mapping = page_mapping(page);
>>         if (!mapping)
>>                 return 0;
>> -       if (PageDirty(page) || PageWriteback(page))
>> +       if (PageDirty(page) || PageWriteback(page)) {
>> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>> +                       mapping->host->i_ino, page, page->index * PAGE_SIZE);
>>                 return 0;
>> +       }
>>         if (page_mapped(page))
>>                 return 0;
>>         return invalidate_complete_page(mapping, page);
> 
> So I updated to the current 5.9, which has all the new async buffered
> read stuff in it, to see if that fixed the problem, but it didn't.
> Instead, it actually made this problem occur almost instantaneously.
> 
> I could put the above trace code back in and it didn't perturb the
> situation, either. Neither did writeback, which I was now able to
> confirm didn't actually impact this problem at all.
> 
> The problem is, instead, reads racing with readahead completion.
> There is no page cache corruption at all, just fio is reusing
> buffers that contain stale data from previous read IO and hence any
> part of the buffer that doesn't get updated by the read IO ends up
> being detected as "corrupt" because they already contain stale data.
> 
> The problem is clear from this sequence trace:
> 
>      io_uring-sq-4518  [012]    52.806976: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
>      io_uring-sq-4518  [012]    52.806987: iomap_readahead:      dev 253:32
>      ino 0x86 nr_pages 16
>      io_uring-sq-4518  [012]    52.806987: iomap_apply:          dev 253:32
>      ino 0x86 pos 116654080 length 65536 flags  (0x0) ops xfs_read_iomap_ops
>      caller iomap_readahead actor iomap_readahead_actor
>      io_uring-sq-4518  [012]    52.806988: iomap_apply_dstmap:   dev 253:32
>      ino 0x86 bdev 253:32 addr 922058752 offset 116654080 length 65536 type
>      MAPPED flags 
>      io_uring-sq-4518  [012]    52.806990: block_bio_queue:      253,32 RA
>      1800896 + 128 [io_uring-sq]
> ....
>      io_uring-sq-4518  [012]    52.806992: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffff5
>      io_uring-sq-4518  [012]    52.806993: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
>      io_uring-sq-4518  [012]    52.806994: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f40000 count 0xfffffffffffffdef
>      io_uring-sq-4518  [012]    52.806995: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6bf0000 count 0x10000
> ....
>              fio-4515  [013]    52.807855: block_rq_complete:    253,32 RA ()
>              1800896 + 128 [0]
>      io_uring-sq-4518  [012]    52.807863: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f40000 count 0x10000
>      io_uring-sq-4518  [012]    52.807866: xfs_file_buffered_read: dev 253:32
>      ino 0x86 size 0x10000000 offset 0x6f41000 count 0x1000
> ....
>              fio-4515  [013]    52.807871: block_rq_complete:    253,32 RA ()
>              1752640 + 128 [0]
>              fio-4515  [013]    52.807876: block_rq_complete:    253,32 RA ()
>              1675200 + 128 [0]
>              fio-4515  [013]    52.807880: block_rq_complete:    253,32 RA ()
>              1652672 + 128 [0]
> 
> What we see is -3- calls to do the buffered read. The first call
> triggers readahead and we can see iomap map the file offset and
> issue the bio. This is IOCB_NOWAIT context, so this is actually
> broken as we can block for a long time in this path (e.g. on
> inode locks).
> 
> Once the IO is issued, we see a second xfs_file_buffered_read trace
> with the return value in the count - that's -EAGAIN because
> IOCB_NOWAIT is set and now we are going to block waiting for IO.
> 
> Immediately after this, the io_uring code issues the buffered read
> again, only this time with IOCB_WAITQ|IOCB_NOWAIT set and it hits
> the "page not up to date" path. This then hits
> __wait_on_page_locked_async() which sets the PageWaiter bit and
> returns -EIOCBQUEUED to tell the caller that it has been queued.
> 
> About a millisecond later, the readahead IO completes, and the block
> layer logs it. This ends up calling iomap_read_end_io() which walks
> each page(bvec) in the bio, marks them uptodate and unlocks them.
> The first page unlock sees the PageWaiter bit, which then does a
> 
>       wake_up_page_bit(page, PG_locked);
> 
> because the PageWaiter bit is set. We see the same io_uring worker
> thread then immediately re-issue the buffered read and it does it
> with IOCB_WAITQ|IOCB_NOWAIT context. This gets the first page, then
> finds the second page not up to date or has to block trying to lock
> it. It aborts the read at that point, and because we've already read
> a page it returns 4096 bytes read.
> 
> I left the post-buffered read IO completions that were being run by
> the block layer to show that the buffered read was issued and
> completed while the block layer was still processing the readahead
> bio completion.
> 
> At this point (a positive byte count being returned) the io_uring
> layer considers the IO done and returns the result to the userspace
> app. Fio then completely fails to check for a short read despite
> being in "verify" mode and instead yells about corruption due to
> stale data that it had left in the buffer that it passed the
> kernel....
> 
> IOWs, there does not appear to be a filesystem or page cache issue
> here at all - it's just an unhandled short read.

Right, the page cache is consistent, nothing wrong on that side. This is
purely a fio issue with messing up the short read.

It'd be nice to do better on the short reads, maybe wait for the entire
range to be ready instead of just triggering on the first page.
Something to look into.

> Jens, if strace worked on io_uring, I wouldn't have wasted 2 and
> half days tracking this down. It would have taken 5 minutes with
> strace and it would have been totally obvious. I'd suggest this
> needs addressing as a matter of priority over everything else for
> io_uring....

Sorry, you should have pinged me earlier. In lieu of strace, we could
expand the io_uring issue event to include a bigger dump of exactly what
is in the command being issued instead of just the opcode. There's
already info in there enabling someone to tie the complete and submit
events together, so it could have been deduced that we never retried a
short IO on the application side. But it should not be that hard to dig
out; I agree we need to make it easier to debug these kinds of things.
Definitely on the list!
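
(For anyone else triaging this: assuming the 5.8-era io_uring tracepoints,
something along the lines of

  trace-cmd record -e io_uring:io_uring_submit_sqe \
        -e io_uring:io_uring_complete fio tests/io_uring_corruption.fio

and then matching user_data between the submit and complete events should
show a completion whose res is smaller than the requested length and that
is never resubmitted. The event names and fields here are from memory, so
treat them as a starting point rather than gospel.)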
Comment 12 Jens Axboe 2020-08-11 03:01:22 UTC
On 8/10/20 8:01 PM, Jens Axboe wrote:
> On 8/10/20 7:50 PM, Jens Axboe wrote:
>> On 8/10/20 7:15 PM, Jens Axboe wrote:
>>> On 8/10/20 3:08 AM, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>>> [cc Jens]
>>>>>
>>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>>> the bz link below.]
>>>>>
>>>>> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
>>>>>> On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
>>>>>>> On Fri, Aug 07, 2020 at 03:12:03AM +0000,
>>>>>>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>>>>>>> --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
>>>>>>>> On Thu, Aug 06, 2020 at 04:57:58AM +0000,
>>>>>>>> bugzilla-daemon@bugzilla.kernel.org
>>>>>>>> wrote:
>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=208827
>>>>>>>>>
>>>>>>>>>             Bug ID: 208827
>>>>>>>>>            Summary: [fio io_uring] io_uring write data crc32c verify
>>>>>>>>>                     failed
>>>>>>>>>            Product: File System
>>>>>>>>>            Version: 2.5
>>>>>>>>>     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
>>>>>>>
>>>>>>> FWIW, I can reproduce this with a vanilla 5.8 release kernel,
>>>>>>> so this isn't related to contents of the XFS dev tree at all...
>>>>>>>
>>>>>>> In fact, this bug isn't a recent regression. AFAICT, it was
>>>>>>> introduced between 5.4 and 5.5 - 5.4 did not reproduce, 5.5 did
>>>>>>> reproduce. More info once I've finished bisecting it....
>>>>>>
>>>>>> f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
>>>>>> commit f67676d160c6ee2ed82917fadfed6d29cab8237c
>>>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>>>> Date:   Mon Dec 2 11:03:47 2019 -0700
>>>>>>
>>>>>>     io_uring: ensure async punted read/write requests copy iovec
>>>>>
>>>>> ....
>>>>>
>>>>> Ok, I went back to vanilla 5.8 to continue debugging and adding
>>>>> tracepoints, and it's proving strangely difficult to reproduce now.
>>>>
>>>> Which turns out to be caused by a tracepoint I inserted to try to
>>>> narrow down if this was an invalidation race. I put this in
>>>> invalidate_complete_page:
>>>>
>>>>
>>>> --- a/mm/truncate.c
>>>> +++ b/mm/truncate.c
>>>> @@ -257,8 +257,11 @@ int invalidate_inode_page(struct page *page)
>>>>         struct address_space *mapping = page_mapping(page);
>>>>         if (!mapping)
>>>>                 return 0;
>>>> -       if (PageDirty(page) || PageWriteback(page))
>>>> +       if (PageDirty(page) || PageWriteback(page)) {
>>>> +               trace_printk("ino 0x%lx page %p, offset 0x%lx\n",
>>>> +                       mapping->host->i_ino, page, page->index * PAGE_SIZE);
>>>>                 return 0;
>>>> +       }
>>>>         if (page_mapped(page))
>>>>                 return 0;
>>>>         return invalidate_complete_page(mapping, page);
>>>>
>>>>
>>>> And that alone, without even enabling tracepoints, made the
>>>> corruption go completely away. So I suspect a page state race
>>>> condition and look at POSIX_FADV_DONTNEED, which fio is issuing
>>>> before running its verification reads. First thing that does:
>>>>
>>>>    if (!inode_write_congested(mapping->host))
>>>>            __filemap_fdatawrite_range(mapping, offset, endbyte,
>>>>                                       WB_SYNC_NONE);
>>>>
>>>> It starts async writeback of the dirty pages. There's 256MB of dirty
>>>> pages on these inodes, and iomap tracing indicates the entire 256MB
>>>> immediately runs through the trace_iomap_writepage() tracepoint.
>>>> i.e. every page goes Dirty -> Writeback and is submitted for async
>>>> IO.
>>>>
>>>> Then the POSIX_FADV_DONTNEED code goes and runs
>>>> invalidate_mapping_pages(), which ends up try-locking each page and
>>>> then running invalidate_inode_page() on the page, which is where the
>>>> trace debug I put in on pages under writeback gets hit. So if
>>>> changing the invalidation code for pages under writeback makes the
>>>> problem go away, then stopping invalidate_mapping_pages() from
>>>> racing with page writeback should make the problem go away, too.
>>>>
>>>> This does indeed make the corruption go away:
>>>>
>>>> --- a/mm/fadvise.c
>>>> +++ b/mm/fadvise.c
>>>> @@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>>>>         case POSIX_FADV_NOREUSE:
>>>>                 break;
>>>>         case POSIX_FADV_DONTNEED:
>>>>                 if (!inode_write_congested(mapping->host))
>>>> -                       __filemap_fdatawrite_range(mapping, offset, endbyte,
>>>> -                                                  WB_SYNC_NONE);
>>>> +                       filemap_write_and_wait_range(mapping, offset, endbyte);
>>>>  
>>>>                 /*
>>>>                  * First and last FULL page! Partial pages are deliberately
>>>>
>>>> by making the invalidation wait for the pages to go fully to the
>>>> clean state before starting.
>>>>
>>>> This, however, only fixes the specific symptom being tripped over
>>>> here.  To further test this, I removed this writeback from
>>>> POSIX_FADV_DONTNEED completely so I could trigger writeback via
>>>> controlled background writeback. And, as I expected, whenever
>>>> background writeback ran to write back these dirty files, the
>>>> verification failures triggered again. It is quite reliable.
>>>>
>>>> So it looks like there is some kind of writeback completion vs page
>>>> invalidation race condition occurring, but more work is needed to
>>>> isolate it further. I don't know what part the async read plays in
>>>> the corruption yet, because I don't know how we are getting pages in
>>>> the cache where page->index != the file offset stamped in the data.
>>>> That smells of leaking PageUptodate flags...
>>>
>>> The async read really isn't doing anything that you could not do with
>>> separate threads. Unfortunately it's not that easy to have multiple
>>> threads working on the same region with fio, or we could've reproduced
>>> it with a job file written to use that instead.
>>>
>>> I'll dig a bit here...
>>
>> Have we verified that the actual page cache is inconsistent, or is that
>> just an assumption? I'm asking since I poked a bit on the fio side, and
>> suspiciously the failed verification was a short IO. At least
>> originally, fio will retry those, but it could be a bug in the io_uring
>> engine for fio.
>>
>> I'll poke some more.
> 
> The on-disk state seems sane. I added a hack that fills the rest of the
> buffer with 0x5a when we get a short read. When the verify fails, the io_u
> that we attempted to verify looks like this:
> 
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a 5a5a
> *
> 0010000
> 
> where fio dumps this as the expected data:
> 
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
> 0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
> 
> Fio says the offset is 62652416 in the file, and reading
> that 64k block from the file and dumping it:
> 
> 0000fe0 116b b418 d14b 0477 822d 6dcd 201d 1316
> 0000ff0 3045 eca3 0d1c 1a4f e608 0571 6b52 015e
> 0001000 3cc1 4daa cab1 12ba c798 0b54 b281 0a05
> 0001010 98f3 bd9e 30a5 1728 531e 6b3a 2745 1877
> 
> So it seems to me like the file state is consistent, at least after the
> run, and that this seems more likely to be a fio issue with short
> read handling.
> 
> Poking along...

I've pushed some fixes to fio for this; it now works fine for me. JFYI.
Comment 13 Dave Chinner 2020-08-11 05:53:25 UTC
On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> > IOWs, there does not appear to be a filesystem or page cache issue
> > here at all - it's just an unhandled short read.
> 
> Right, the page cache is consistent, nothing wrong on that side. This is
> purely a fio issue with messing up the short read.
> 
> It'd be nice to do better on the short reads, maybe wait for the entire
> range to be ready instead of just triggering on the first page.
> Something to look into.
> 
> > Jens, if strace worked on io_uring, I wouldn't have wasted 2 and
> > half days tracking this down. It would have taken 5 minutes with
> > strace and it would have been totally obvious. I'd suggest this
> > needs addressing as a matter of priority over everything else for
> > io_uring....
> 
> Sorry, you should have pinged me earlier.

I pinged you as early as I would ping anyone else - when it became
apparent I was probably not looking at a filesystem issue.

The problem here is that io_uring is completely unfamiliar code to
most of the filesystem developers who are going to have to diagnose
problems like this in the field.  That's why I spent the time on it
- nobody else in my team or the upstream people I work with day to
day has even looked at this code or has a clue how it works.

We need to have people who know how this precarious stack of cards
actually works; otherwise we're completely screwed when a user has a
similar sort of issue.  And right now, all I see are big red flags.

This exercise has led me to realise that we really don't
have *any* tools that can be used to diagnose io_uring problems at
all. That's something that clearly needs to be fixed, because for
some reason this crazy io_uring stuff has suddenly become the
responsibility of the distro filesystem people who are being asked to
maintain and debug it...

So saying "you should have pinged me earlier" is a hugely naive
response. It completely ignores the fact that people are going to
have to triage issues like this on a regular basis. And you don't
scale to having every single user ping you to help debug io_uring
problems they are seeing. IOWs, "you should have pinged me" is not a
solution, nor does it absolve you of the responsibility for making
your subsystem debuggable by the expected user and application
developer base.

> In lieu of strace, we could
> expand the io_uring issue event to include a bigger dump of exactly what
> is in the command being issued instead of just the opcode.

I didn't find anything even remotely useful in the handful of trace
points in the io_uring code.....

> There's
> already info in there enabling someone to tie the complete and submit
> events together, so it could have been deduced that we never retried a
> short IO on the application side.  But it should not be that hard to dig
> out,

... and now you're assuming that the person trying to debug their
application knows exactly how the whole io_uring thing works inside
the kernel so they can make sense of the tracepoint information.
There's no documentation for the interface, there's no design doc
that describes how the async engine is supposed to work, what
constraints it places on code paths that need to be able to run
async, etc.  If you aren't deep in the guts of io_uring all the
time, then working out what the tracepoints are documenting is no
easy task.

> I agree we need to make it easier to debug these kinds of things.
> Definitely on the list!

As I said, I think it should be at the top of your list.

Cheers,

Dave.
Comment 14 Dave Chinner 2020-08-11 07:05:12 UTC
On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> On 8/10/20 8:00 PM, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> >> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> >>> [cc Jens]
> >>>
> >>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> >>> the bz link below.]
> > 
> > Looks like an io_uring/fio bug at this point, Jens. All your go-fast
> > bits turn the buffered read into a short read, and neither fio nor the
> > io_uring async buffered read path handles short reads. Details below.
> 
> It's a fio issue. The io_uring engine uses a different path for short
> IO completions, and that's being ignored by the backend... Hence the
> IO just gets completed and not retried for this case, and that'll then
> trigger verification as if it did complete. I'm fixing it up.

I just updated fio to:

cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set io_u->verify_offset in fill_io_u()

The workload still reports corruption almost instantly. Only this
time, the trace is not reporting a short read.

File is patterned with:

verify_pattern=0x33333333%o-16

Offset of "bad" data is 0x1240000.

Expected:

00000000:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000010:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000020:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000030:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000040:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000050:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000060:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000070:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
00000080:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
.....
0000ffd0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
0000ffe0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
0000fff0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............


Received:

00000000:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000010:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000020:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000030:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000040:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000050:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000060:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000070:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
00000080:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
.....
0000ffd0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
0000ffe0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
0000fff0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............


Looks like the data in the expected buffer is wrong - the data
pattern in the received buffer is correct according to the defined
pattern.
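
As a sanity check on that, the 16-byte stamp can be decoded directly. A
small sketch, assuming - as the dumps above suggest - that each 16-byte
pattern instance is 4 literal 0x33 bytes, the block's file offset as a
little-endian 64-bit value, then -16 as a 32-bit value, and that this
runs on a little-endian host:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pull the stamped file offset out of one 16-byte pattern instance. */
static uint64_t stamp_offset(const unsigned char *stamp16)
{
	uint64_t off;

	memcpy(&off, stamp16 + 4, sizeof(off));	/* bytes 4..11 */
	return off;
}

int main(void)
{
	/* First 16 bytes of the "expected" and "received" dumps above. */
	const unsigned char expected[16] = {
		0x33, 0x33, 0x33, 0x33, 0x00, 0x10, 0x24, 0x01,
		0x00, 0x00, 0x00, 0x00, 0xf0, 0xff, 0xff, 0xff };
	const unsigned char received[16] = {
		0x33, 0x33, 0x33, 0x33, 0x00, 0x00, 0x24, 0x01,
		0x00, 0x00, 0x00, 0x00, 0xf0, 0xff, 0xff, 0xff };

	printf("expected stamp 0x%llx, received stamp 0x%llx\n",
	       (unsigned long long)stamp_offset(expected),
	       (unsigned long long)stamp_offset(received));
	return 0;
}

That prints 0x1241000 for the expected buffer and 0x1240000 for the
received one, i.e. the file contents encode the offset reported above
(0x1240000), while the expected buffer fio generated encodes an offset
exactly 0x1000 further on - consistent with the expected buffer being
the wrong one.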

Error is 100% reproducible from the same test case. Same bad byte in
the expected buffer dump every single time.

-Dave.
Comment 15 Jens Axboe 2020-08-11 13:10:34 UTC
On 8/11/20 1:05 AM, Dave Chinner wrote:
> On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
>> On 8/10/20 8:00 PM, Dave Chinner wrote:
>>> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
>>>>> [cc Jens]
>>>>>
>>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
>>>>> the bz link below.]
>>>
>>> Looks like an io_uring/fio bug at this point, Jens. All your go-fast
>>> bits turn the buffered read into a short read, and neither fio nor the
>>> io_uring async buffered read path handles short reads. Details below.
>>
>> It's a fio issue. The io_uring engine uses a different path for short
>> IO completions, and that's being ignored by the backend... Hence the
>> IO just gets completed and not retried for this case, and that'll then
>> trigger verification as if it did complete. I'm fixing it up.
> 
> I just updated fio to:
> 
> cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set
> io_u->verify_offset in fill_io_u()
> 
> The workload still reports corruption almost instantly. Only this
> time, the trace is not reporting a short read.
> 
> File is patterned with:
> 
> verify_pattern=0x33333333%o-16
> 
> Offset of "bad" data is 0x1240000.
> 
> Expected:
> 
> 00000000:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000010:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000020:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000030:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000040:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000050:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000060:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000070:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000080:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> .....
> 0000ffd0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> 0000ffe0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> 0000fff0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> 
> 
> Received:
> 
> 00000000:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000010:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000020:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000030:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000040:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000050:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000060:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000070:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> 00000080:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> .....
> 0000ffd0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> 0000ffe0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> 0000fff0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> 
> 
> Looks like the data in the expected buffer is wrong - the data
> pattern in the received buffer is correct according to the defined
> pattern.
> 
> Error is 100% reproducible from the same test case. Same bad byte in
> the expected buffer dump every single time.

What job file are you running? It's not impossible that I broke
something else in fio; the io_u->verify_offset is a bit risky... I'll
get it fleshed out shortly.
Comment 16 Chris Murphy 2020-08-11 16:16:45 UTC
Dunno if it's at all related, but if it is then Dave can blame me for his two wasted days because I ran into this 3 weeks ago and just got around to filing a bug today.

Bug 208875 - btrfs checksum errors in guest when virtio device io=io_uring
https://bugzilla.kernel.org/show_bug.cgi?id=208875
Comment 17 Jeff Moyer 2020-08-11 20:56:46 UTC
Jens Axboe <axboe@kernel.dk> writes:

> So it seems to me like the file state is consistent, at least after the
> run, and that this seems more likely to be a fio issue with short
> read handling.

Any idea why there was a short read, though?

-Jeff
Comment 18 Dave Chinner 2020-08-11 21:59:18 UTC
On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
> On 8/11/20 1:05 AM, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 08:19:57PM -0600, Jens Axboe wrote:
> >> On 8/10/20 8:00 PM, Dave Chinner wrote:
> >>> On Mon, Aug 10, 2020 at 07:08:59PM +1000, Dave Chinner wrote:
> >>>> On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> >>>>> [cc Jens]
> >>>>>
> >>>>> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> >>>>> the bz link below.]
> >>>
> >>> Looks like an io_uring/fio bug at this point, Jens. All your go-fast
> >>> bits turn the buffered read into a short read, and neither fio nor the
> >>> io_uring async buffered read path handles short reads. Details below.
> >>
> >> It's a fio issue. The io_uring engine uses a different path for short
> >> IO completions, and that's being ignored by the backend... Hence the
> >> IO just gets completed and not retried for this case, and that'll then
> >> trigger verification as if it did complete. I'm fixing it up.
> > 
> > I just updated fio to:
> > 
> > cb7d7abb (HEAD -> master, origin/master, origin/HEAD) io_u: set
> > io_u->verify_offset in fill_io_u()
> > 
> > The workload still reports corruption almost instantly. Only this
> > time, the trace is not reporting a short read.
> > 
> > File is patterned with:
> > 
> > verify_pattern=0x33333333%o-16
> > 
> > Offset of "bad" data is 0x1240000.
> > 
> > Expected:
> > 
> > 00000000:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000010:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000020:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000030:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000040:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000050:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000060:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000070:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000080:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff 3333............
> > .....
> > 0000ffd0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 0000ffe0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 0000fff0:  33 33 33 33 00 10 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 
> > 
> > Received:
> > 
> > 00000000:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000010:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000020:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000030:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000040:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000050:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000060:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000070:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > 00000080:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff 3333............
> > .....
> > 0000ffd0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 0000ffe0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 0000fff0:  33 33 33 33 00 00 24 01 00 00 00 00 f0 ff ff ff  3333............
> > 
> > 
> > Looks like the data in the expected buffer is wrong - the data
> > pattern in the received buffer is correct according to the defined
> > pattern.
> > 
> > Error is 100% reproducible from the same test case. Same bad byte in
> > the expected buffer dump every single time.
> 
> What job file are you running? It's not impossible that I broke
> something else in fio; the io_u->verify_offset is a bit risky... I'll
> get it fleshed out shortly.

Details are in the bugzilla I pointed you at. I modified the
original config specified to put per-file and offset identifiers
into the file data rather than using random data. This is
"determining the origin of stale data 101" stuff - the first thing
we _always_ do when trying to diagnose data corruption is identify
where the bad data came from.

Entire config file is below.

Cheers,

Dave.
Comment 19 Dave Chinner 2020-08-11 22:09:34 UTC
On Tue, Aug 11, 2020 at 04:56:37PM -0400, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
> > So it seems to me like the file state is consistent, at least after the
> > run, and that this seems more likely to be a fio issue with short
> > read handling.
> 
> Any idea why there was a short read, though?

Yes. See:

https://lore.kernel.org/linux-xfs/20200807024211.GG2114@dread.disaster.area/T/#maf3bd9325fb3ac0773089ca58609a2cea0386ddf

It's a race between the readahead io completion marking pages
uptodate and unlocking them, and the io_uring worker function
getting woken on the first page being unlocked and running the
buffered read before the entire readahead IO completion has unlocked
all the pages in the IO.

Basically, io_uring is re-running the IOCB_NOWAIT|IOCB_WAITQ IO when
there are still pages locked under IO. This will happen much more
frequently the larger the buffered reads are (these are only 64kB) and
the wider the readahead windows are opened.

Essentially, the io_uring buffered read needs to wait until _all_
pages in the IO are marked up to date and unlocked, not just the
first one. And not just the last one, either - readahead can be
broken into multiple bios (because it spans extents) and there is no
guarantee of order of completion of the readahead bios given by the
readahead code....

Cheers,

Dave.
Comment 20 Dave Chinner 2020-08-11 23:00:57 UTC
On Wed, Aug 12, 2020 at 07:59:13AM +1000, Dave Chinner wrote:
> On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
> > What job file are you running? It's not impossible that I broke
> > something else in fio; the io_u->verify_offset is a bit risky... I'll
> > get it fleshed out shortly.
> 
> Details are in the bugzilla I pointed you at. I modified the
> original config specified to put per-file and offset identifiers
> into the file data rather than using random data. This is
> "determining the origin of stale data 101" stuff - the first thing
> we _always_ do when trying to diagnose data corruption is identify
> where the bad data came from.
> 
> Entire config file is below.

Just as a data point: btrfs fails this test even faster than XFS.
Both with the old 3.21 fio binary and the new one.

Evidence points to this code being very poorly tested. Both
filesystems it is enabled on fail validation with the tool that is
supposed to exercise and validate io_uring IO path behaviour.

Regardless of whether this is a tool failure or a kernel code
failure, the fact is that nobody ran data validation tests on this
shiny new code. And for a storage API that is reading and writing
user data, that's a pretty major process failure....

Improvements required, Jens.

-Dave.
Comment 21 Chris Murphy 2020-08-12 03:15:10 UTC
Anyone seeing massive latency spikes?
https://bugzilla.kernel.org/show_bug.cgi?id=208875#c5
Comment 22 Jens Axboe 2020-08-12 15:14:02 UTC
On 8/11/20 4:09 PM, Dave Chinner wrote:
> On Tue, Aug 11, 2020 at 04:56:37PM -0400, Jeff Moyer wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>>
>>> So it seems to me like the file state is consistent, at least after the
>>> run, and that this seems more likely to be a fio issue with short
>>> read handling.
>>
>> Any idea why there was a short read, though?
> 
> Yes. See:
> 
>
> https://lore.kernel.org/linux-xfs/20200807024211.GG2114@dread.disaster.area/T/#maf3bd9325fb3ac0773089ca58609a2cea0386ddf
> 
> It's a race between the readahead io completion marking pages
> uptodate and unlocking them, and the io_uring worker function
> getting woken on the first page being unlocked and running the
> buffered read before the entire readahead IO completion has unlocked
> all the pages in the IO.
> 
> Basically, io_uring is re-running the IOCB_NOWAIT|IOCB_WAITQ IO when
> there are still pages locked under IO. This will happen much more
> frequently the larger the buffered reads are (these are only 64kB) and
> the wider the readahead windows are opened.
> 
> Essentially, the io_uring buffered read needs to wait until _all_
> pages in the IO are marked up to date and unlocked, not just the
> first one. And not just the last one, either - readahead can be
> broken into multiple bios (because it spans extents) and there is no
> guarantee of order of completion of the readahead bios given by the
> readahead code....

Yes, it would ideally wait, or at least trigger on the last one. I'll
see if I can improve that. In all of my testing, the number of short
reads triggered is minimal. For the verify case that we just ran,
we're talking 8-12 ios out of 820 thousand, or 0.001% of them. So
nothing that makes a performance difference in practical terms, though
it would be nice to not hand back short reads if we can avoid it. Not
for performance reasons, but for usage reasons.
Comment 23 Jens Axboe 2020-08-12 15:19:54 UTC
On 8/11/20 5:00 PM, Dave Chinner wrote:
> On Wed, Aug 12, 2020 at 07:59:13AM +1000, Dave Chinner wrote:
>> On Tue, Aug 11, 2020 at 07:10:30AM -0600, Jens Axboe wrote:
>>> What job file are you running? It's not impossible that I broke
>>> something else in fio; the io_u->verify_offset is a bit risky... I'll
>>> get it fleshed out shortly.
>>
>> Details are in the bugzilla I pointed you at. I modified the
>> original config specified to put per-file and offset identifiers
>> into the file data rather than using random data. This is
>> "determining the origin of stale data 101" stuff - the first thing
>> we _always_ do when trying to diagnose data corruption is identify
>> where the bad data came from.
>>
>> Entire config file is below.
> 
> Just as a data point: btrfs fails this test even faster than XFS.
> Both with the old 3.21 fio binary and the new one.

I can't trigger any failures with the fixes I committed, so that's a bit
puzzling. What storage are you running this on? I'll try a few other
things.

> Evidence points to this code being very poorly tested. Both
> filesystems it is enabled on fail validation with the tool that is
> supposed to exercise and validate io_uring IO path behaviour.
> 
> Regardless of whether this is a tool failure or a kernel code
> failure, the fact is that nobody ran data validation tests on this
> shiny new code. And for a storage API that is reading and writing
> user data, that's a pretty major process failure....
> 
> Improvements required, Jens.

Let's turn down the condescension and assumptions a notch, please. The
fio io_uring engine was originally written to benchmark io_uring, and I
haven't really been using it for verification purposes (as is evident).
In my experience, I get much better validation from running internal
production workloads. And plenty of that has been done.

That said, the fio side should of course work, since others can't
necessarily run what I'm running, and it does provide ways to exercise
it that aren't necessarily seen in prod. But blankly assuming that
"nobody ran data validation tests on this shiny new code" is blatantly
false.
Comment 24 Jeff Moyer 2020-08-12 15:24:14 UTC
Jens Axboe <axboe@kernel.dk> writes:

> Yes, it would ideally wait, or at least trigger on the last one. I'll
> see if I can improve that. For any of my testing, the amount of
> triggered short reads is minimal. For the verify case that we just ran,
> we're talking 8-12 ios out of 820 thousand, or 0.001% of them. So
> nothing that makes a performance difference in practical terms, though
> it would be nice to not hand back short reads if we can avoid it. Not
> for performance reasons, but for usage reasons.

I think you could make the case that handing back a short read is a
bug (unless RWF_NOWAIT was specified, of course).  At the very least, it
violates the principle of least surprise, and the fact that it happens
infrequently actually makes it a worse problem when it comes to
debugging.

-Jeff
Comment 25 Jens Axboe 2020-08-12 15:26:55 UTC
On 8/12/20 9:24 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Yes, it would ideally wait, or at least trigger on the last one. I'll
>> see if I can improve that. For any of my testing, the amount of
>> triggered short reads is minimal. For the verify case that we just
>> ran, we're talking 8-12 ios out of 820 thousand, or 0.001% of them.
>> So nothing that makes a performance difference in practical terms,
>> though it would be nice to not hand back short reads if we can avoid
>> it. Not for performance reasons, but for usage reasons.
> 
> I think you could make the case that handing back a short read is a
> bug (unless RWF_NOWAIT was specified, of course).  At the very least,
> it violates the principle of least surprise, and the fact that it
> happens infrequently actually makes it a worse problem when it comes
> to debugging.

It's definitely on my list to ensure that we handle the somewhat
expected case by just retrying it internally, because I do agree that it
can be surprising. FWIW, this isn't a change with 5.9-rc; io_uring has
always potentially returned short buffered reads when the io-wq offload
was done.

While it may happen infrequently with this specific test case, others
can trigger it more often.