Bug 200681 - [inline_data] read() does not see what write() has just written through different FD in the same thread
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-30 11:46 UTC by Anatoly Trosinenko
Modified: 2023-11-22 20:08 UTC
CC List: 3 users

See Also:
Kernel Version: v4.18-rc7 and tytso/dev (commit f4a79b7b4)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel config (113.10 KB, text/x-mpsub)
2018-07-30 11:46 UTC, Anatoly Trosinenko
Details
Reproducer (1.28 KB, application/x-sh)
2018-07-30 11:47 UTC, Anatoly Trosinenko
Details
Full log (19.68 KB, text/plain)
2018-07-30 11:47 UTC, Anatoly Trosinenko
Details
Reproducer (735 bytes, text/plain)
2018-10-13 16:04 UTC, Anatoly Trosinenko
Details

Description Anatoly Trosinenko 2018-07-30 11:46:16 UTC
Created attachment 277603 [details]
Kernel config

# How to reproduce:
1) Compile v4.18-rc7 or tytso/dev (commit f4a79b7b4) with the attached config
2) Run mkinitrd.sh to create a self-contained initrd reproducer and ext4 image (see the sketch after this list)
3) Run
   qemu-system-x86_64 -kernel /path/to/linux.git/arch/x86/boot/bzImage -initrd ./initrd.cpio -m 128M -drive file=ext4.img,index=0,format=raw
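
The attached mkinitrd.sh is authoritative for how the ext4 image is built; roughly, it amounts to something like the following (image size and mke2fs options here are approximate):

[CUT]
dd if=/dev/zero of=ext4.img bs=1M count=8
mke2fs -Fq -t ext4 -O inline_data -I 1024 ext4.img
[/CUT]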

# What happens:

When the reproducer is run:

[CUT]
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char buf  [100000];
char rdbuf[100000];

int main() {
  /* Fill buf with 01 02 03 ... (mod 256). */
  for (int i = 0; i < (int) sizeof(buf); ++i)
    buf[i] = i + 1;

  int fd1 = creat("/mnt/xyz", 0660);
  write(fd1, buf, 48);                    /* small write: data still fits inline in the inode */
  int fd2 = open("/mnt/xyz", O_RDONLY, 0660);
  write(fd1, buf, 17968);                 /* large write: forces inline -> regular conversion */

  int res1 = read(fd2, rdbuf, 17968);
  memset(rdbuf, -1, sizeof(rdbuf));
  int res2 = read(fd2, rdbuf, 12336);     /* should return the remaining 48 bytes: 01 02 03... */
  fprintf(stderr, "res1 = %d res2 = %d\t\tfirst bytes of data are: %02x %02x %02x...\n",
          res1, res2, rdbuf[0], rdbuf[1], rdbuf[2]);
  return 0;
}
[/CUT]

The following output occurs:


[CUT]
Mounting...
[    4.773445] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
Running reproducer...
res1 = 17968 res2 = 48		first bytes of data are: 00 00 00...


BusyBox v1.27.2 (Ubuntu 1:1.27.2-2ubuntu3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
/ # [    6.400525] ext4lazyinit (1013) used greatest stack depth: 13968 bytes left
[/CUT]


The output of 00 00 00 instead of 01 02 03 looks quite suspicious. On the other hand, I'm not sure this is forbidden by POSIX.
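
For clarity, the expected 01 02 03 is just the reproducer's own arithmetic spelled out:

[CUT]
after write(fd1, buf, 48) and write(fd1, buf, 17968) the file layout is:
  offsets     0..47      hold buf[0..47]
  offsets    48..18015   hold buf[0..17967]

the first read consumes offsets 0..17967, so the second read starts at
offset 17968 and should return the remaining 48 bytes, i.e. buf[17920..17967];
since buf[i] = i + 1 (mod 256) and 17920 % 256 == 0, those bytes are
01 02 03 ... 30.
[/CUT]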
Comment 1 Anatoly Trosinenko 2018-07-30 11:47:08 UTC
Created attachment 277605 [details]
Reproducer
Comment 2 Anatoly Trosinenko 2018-07-30 11:47:27 UTC
Created attachment 277607 [details]
Full log
Comment 3 Theodore Tso 2018-08-01 15:21:59 UTC
So it's not reproducing for me using my standard test framework, kvm-xfstests:

https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

What I did:

Edited your reproducer program to use "./xyz" instead of "/mnt/xyz", compiled it, and put the binary in /tmp/kvm-xfstests-tytso/repro.  (/tmp/kvm-xfstests-$USER will get mounted using 9p by kvm-xfstests in the guest OS as /vtmp.)

% kvm-xfstests shell
...
root@kvm-xfstests:~# mount /vtmp
root@kvm-xfstests:~# mke2fs -Fq -t ext4 -O inline_data -I 1024 /dev/vdc 
/dev/vdc contains a ext4 file system
	last mounted on /vdc on Wed Aug  1 11:09:56 2018
root@kvm-xfstests:~# mount /vdc
[  761.839810] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# cd /vdc
root@kvm-xfstests:/vdc# /vtmp/repro
res1 = 17968 res2 = 48		first bytes of data are: 01 02 03...

If I have time I can try to repro it with your reproducer, but maybe you can try using kvm-xfstests yourself?  I think you will find it is a much more convenient system, and if ext4 users, developers, and bug reporters all use the same system it saves a lot of time --- there are many more of *you* than there is of *me*, and I don't scale that well unless I ask people to use common tools and procedures.    (See https://thunk.org/gce-xfstests for an introduction to kvm-xfstests and gce-xfstests.)
Comment 4 Anatoly Trosinenko 2018-08-01 15:40:18 UTC
Thank you very much for the pointers!

Please excuse my ignorance; I will try to rewrite/debug my reproducer so that it is compatible with the standard testing framework. I realize that standard tools are better than custom-built ones from every bug reporter (probably better for those reporters too).
Comment 5 Theodore Tso 2018-08-01 16:04:34 UTC
There are in fact a number of ways that inline_data is broken for files --- in particular when a file transitions from a small file that fits inside the inode to a large file.   However, when I looked at your reproducer, it did *not* appear to be one of the cases which I would have expected to be problematic.

So when I tried it in my KVM framework, I was not surprised when it worked for me.   That being said, there are definitely problems with inline_data --- there is a reason why it is not enabled by default!

(For example, generic/477 is broken for the inline case, which can be easily seen by running "kvm-xfstests -c inline generic/477".  For another, there are some cases where if you crash immediately after an inline->regular conversion, the file contents might not be correct.)
Comment 6 Anatoly Trosinenko 2018-10-13 16:03:56 UTC
It looks like the difference was that I had specified an 8M partition size. Here is how to reproduce it (I repeat parts of your documentation just to make sure I got it right):

1) Check out the master branch of torvalds/linux (tested on commit bab5c80b211)
2) Copy fstests/kernel-configs/x86_64-config-4.14 to .config and compile after `make olddefconfig`
3) Take the attached repro.c and execute the following commands:

$ gcc --static repro.c -o repro   # inside /tmp/kvm-xfstests-$USER
$ kvm-xfstests shell
root@kvm-xfstests:~# mount /vtmp
root@kvm-xfstests:~# mke2fs -Fq -t ext4 -O inline_data -I 1024 /dev/vdc 8M
/dev/vdc contains a ext4 file system
        last mounted on /vdc on Sat Oct 13 15:59:40 2018
root@kvm-xfstests:~# mount /vdc
[    9.812236] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# cd /vdc
root@kvm-xfstests:/vdc# /vtmp/repro
First write res = 48
Second write res = 17968
res1 = 17968 res2 = 48          first bytes of data are: 00 00 00...

When the full partition is used for the filesystem, the bug is not reproducible, as expected.

Maybe with an 8M partition there is just not enough room for the data, so maybe everything is right except for the return codes saying everything was written successfully instead of ENOSPC... (but the file size is 18016, as if all the data were written to it).
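
A minimal sketch of the kind of checking I mean (hypothetical additions on top of the attached repro.c, not part of it; needs <sys/stat.h>):

[CUT]
/* If the filesystem were really out of space, the second write should
 * return a short count or -1 with errno == ENOSPC instead of claiming
 * all 17968 bytes were written. */
ssize_t w = write(fd1, buf, 17968);
if (w < 0)
    perror("write");                     /* would report ENOSPC */
else
    fprintf(stderr, "write returned %zd of 17968\n", w);

struct stat st;
if (fstat(fd1, &st) == 0)                /* observed size is 18016, as if everything landed */
    fprintf(stderr, "file size = %lld\n", (long long) st.st_size);
[/CUT]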
Comment 7 Anatoly Trosinenko 2018-10-13 16:04:52 UTC
Created attachment 279013 [details]
Reproducer
Comment 8 Theodore Tso 2018-10-13 21:25:09 UTC
Thanks for the repro.   It looks like the critical factor is not the size of the filesystem, but the blocksize.  It just so happens that using a file system size of 8M causes mke2fs to default to a 1k block size.  This can be seen by testing with a file system created by:

mke2fs -Fq -t ext4 -O inline_data -I 1024 -b 1024 /dev/vdc 
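
(One way to double-check the block size the filesystem actually ended up with, for example:)

[CUT]
dumpe2fs -h /dev/vdc | grep "Block size"
[/CUT]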

I've done some initial investigation, and something really strange is going on.  If an fsync(2) is forced after the writes, the problem goes away.  By changing the size of the reads, I can see that it is the second read which returns all zeros --- until the first read is made larger than 16k, at which point it looks like it's the last 4k page which is all zeros.
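
(Roughly, the placement that makes the symptom go away looks like this; a sketch against the reproducer above, not the exact change I tested:)

[CUT]
write(fd1, buf, 48);
int fd2 = open("/mnt/xyz", O_RDONLY, 0660);
write(fd1, buf, 17968);
fsync(fd1);                 /* flushing here makes both reads return the expected data */
int res1 = read(fd2, rdbuf, 17968);
[/CUT]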

Unfortunately, it's not clear what is happening.   The best I can say at this point is that the problem looks related to inline_data: when a large write forces a conversion from an inline_data file to a normal file, and the blocksize is != the page size, something goes very wrong.

To date, I've only been testing inline_data with a blocksize of 4k, and that's probably why I haven't seen any problems like this.

I'm curious how you found this bug; were you deliberately using a 1k block size, or were you trying to use inline_data with very small file systems?
Comment 9 Anatoly Trosinenko 2018-10-14 07:29:22 UTC
> I'm curious how you found this bug; were you deliberately using a 1k block
> size, or were you trying to use inline_data with very small file systems?

Frankly speaking, I did not stumble upon this bug in real life --- I was just doing some FS fuzzing and used smaller FS images to improve performance. Maybe smaller images also have a higher probability of triggering some bugs in my setup (now I know why :) ). But inline_data (as well as other extended options) was enabled deliberately, of course, on the assumption that such features are less stable than the core ones.
