Created attachment 26701 [details]
Testcase showing the corruption
we use the direct async I/O API for file I/O in VirtualBox since 3.2 and got some reports about corrupted guest filesystems. It turned out that only users are affected which disk images are stored on a ext4 filesystem. I could create a testcase which reproduces the error. Further investigation showed that the data corruption happens if ftruncate is used, data is written to the so far unallocated offset and immediately read back. The buffer holding the read data contains only \0 afterwards. The data is there if the file is closed and another
program is used to view the file content (less for example). Everything works if ftruncate is not used.
The same testcase with ftruncate works on ext3 here. A simple testcase is attached.
Compile with: gcc -D_GNU_SOURCE -o aio_corrupt aio_corrupt.c -laio
Usage: aio_corrupt <path/to/the/file>
Expected output: Success!
Output on ext3: Success!
Output on ext4: Corrupted buffer!
Thanks for the report ... interestingly, testing this on RHEL6 (which is just what I had conveniently booted, 2.6.32 + backports) seems fine:
# ./aio_corrupt testfile
# mount | grep scratch
/dev/sdb3 on /mnt/testarea/scratch type ext4 (rw)
I can test 2.6.34 as well, but any other interesting pieces of the setup - block size less than page size perhaps?
I tested it on a 64bit Ubuntu 10.04 with 2.6.32 and 2.6.34.
# mount | grep Test
/dev/sdb1 on /media/Test type ext4 (rw,nosuid,nodev,uhelper=udisks)
The 2.6.34 kernel is from the mainline ppa http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-lucid/
These kernels don't contain any additional patches according to https://wiki.ubuntu.com/KernelTeam/MainlineBuilds
Seems the bug doesn't trigger always. The first invocation of the testcase on a fresh created ext4 succeeds but the second one and any other run fails.
This is what I've done (the mkfs.ext4 output is in german but should be uninteresting):
alexander@Hatak:~$ sudo mkfs.ext4 /dev/sda1
mke2fs 1.41.11 (14-Mar-2010)
Stride=0 blocks, Stripe width=0 blocks
1876800 Inodes, 7506363 Blöcke
375318 Blöcke (5.00%) reserviert für den Superuser
32768 Blöcke pro Gruppe, 32768 Fragmente pro Gruppe
8160 Inodes pro Gruppe
Superblock-Sicherungskopien gespeichert in den Blöcken:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Schreibe Inode-Tabellen: erledigt
Erstelle Journal (32768 Blöcke): erledigt
Schreibe Superblöcke und Dateisystem-Accountinginformationen: erledigt
Das Dateisystem wird automatisch nach jeweils 32 Einhäng-Vorgängen bzw.
alle 180 Tage überprüft, je nachdem, was zuerst eintritt. Veränderbar mit
tune2fs -c oder -t .
alexander@Hatak:~$ sudo mount /dev/sda1 /mnt
alexander@Hatak:~$ sudo ./aio_corrupt /mnt/test
alexander@Hatak:~$ sudo ./aio_corrupt /mnt/test2
This is with the standard Ubuntu 2.6.32 kernel.
# for I in `seq 1 50`; do ./aio_corrupt test$I; done
still works for me. I don't have any patches in RHEL6 that aren't upstream, but I'll go ahead & recheck on an upstream kernel, I guess.
Installed RHEL 6 in a VM now (the image is on ext3) and I can confirm that it isn't reproducible with the default 2.6.32 kernel. I downloaded and compiled a vanilla 2.6.34 kernel from http://www.kernel.org The problem is there again but it doesn't happen on every run of the testcase. 5 failures on 11 runs.
Could be timing sensitive.
The chance to reproduce this issue seems to be really dependent of the host.
Just tried it on a Notebook with Ubuntu 10.04 and an ext4 root.
# for I in `seq 1 50`; do ./aio_corrupt /mnt/test$I; done
The test failed only 4 times on 50 runs.
When compiling the test case, please make sure to add a third parameter to open(), for example 0777, otherwise it wouldn't compile on Ubuntu (O_CREAT requires 3 parameters for open).
Executed the testcase here on an up-to-date Ubuntu 10.04 installed on an ext4-formatted partition. The testcase does NOT fail if the test file already exists. But it fails about 130 out of 500 times if I remove the test file prior to each run. When using Linux 2.6.34 vanilla: 7080 out of 10000 runs reported a corrupted buffer. And, like expected, the test never fails on an ext3 partition.
So please make sure to run the testcase like this:
# for i in `seq 1 500`; do rm -f /mnt/tst; ./aio_corrupt /mnt/tst; done
where /mnt is an ext4 partition.
Please, could someone of the ext4 maintainers have a look? We believe that this bug is responsible for many data corruption reports we get for our latest VirtualBox release as we made the asynchronous file API the default for certain controller types.
This problem is probably related: http://www.spinics.net/lists/linux-ext4/msg19493.html
Can those who can reproduce this test this patch from mingming?
When unaligned DIO writes, skip zero out the block if the buffer is marked
unwritten. That means there is an asynconous direct IO (append or fill the hole)
Signed-off-by: Mingming Cao <firstname.lastname@example.org>
fs/direct-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux-git.orig/fs/direct-io.c 2010-05-07 15:42:22.855033403 -0700
+++ linux-git/fs/direct-io.c 2010-05-07 15:44:17.695007770 -0700
@@ -740,7 +740,8 @@
struct page *page;
dio->start_zero_done = 1;
- if (!dio->blkfactor || !buffer_new(&dio->map_bh))
+ if (!dio->blkfactor || !buffer_new(&dio->map_bh)
+ || buffer_unwritten(&dio->map_bh))
dio_blocks_per_fs_block = 1 << dio->blkfactor;
The test still fails here.
And like in the mail thread it fails for XFS too.
(In reply to comment #14)
> And like in the mail thread it fails for XFS too.
How about with these 2 patches from the xfs list?
[PATCH 1/2] direct-io: move aio_complete into ->end_io
[PATCH 2/2] xfs: move aio completion after unwritten extent conversion
(thread starts at http://oss.sgi.com/archives/xfs/2010-06/msg00269.html)
And for ext4 perhaps try this on top of the above:
[PATCH] EXT4: move aio completion after unwritten extent conversion
These patches fix the data corruption here. Thanks!
Are these patches going to be merged for 2.6.35?
Alexander, there is another corruption problem that we've been chasing related to unaligned AIO on ext4 & xfs; see:
The patches above won't take care of that problem.
Chuck, good question about when they'll be merged... I dunno.
Yes, the patches in comment #15 will be merged in 2.6.35. They are already in my tree (although they haven't been pushed out yet as of this writing; they'll be in the ext4 tree soon though). I've negotiated with Christoph to make sure there won't be any merge conflicts with the XFS tree (since the first patch is going to be carried in both the ext4 and xfs trees).
2.6.35 is out and those patches are not in there.
The patches are in my tree and I'll push them during this merge window, and then get them into 220.127.116.11. Sorry, things got busy and I didn't have a chance to push them to Linus before 2.6.35 shipped.
Yesterday I observed another data corruption that looks very close to this one, but slightly different (and much more serious).
The scenario is this: oracle10 database, creating db and importing data. Oracle uses ftruncate() to create file of a desired size, next it fills the file with block headers, so there's no unwritten data in the file. So far so good. There's also temporary tablespaces, which aren't filled during creation, but just ftruncate'd.
The problem here happens when I extend a temp file during heavy database writes. It were needed a few times because in order to create large indexes, large temporary space were needed.
And each time I try extend temp file (say, from 10 to 20Gb), oracle performs the ftruncate on it, and continues writing to it and to other data files (which were pre-filled after ftruncate, even before system restart). And during this time, there's a very likely chance to have _other_ data files to be corrupt while it is writing to the newly extended temp space.
We observed several chunks of zeros (of size 1Mb each) written over _other_ files during this time. Re-read of those other files returns the same zeros.
So, this delayed allocation corrupts _other_ data files which are already allocated and written.
Avoiding gaps eliminates the problem.
Note that oracle uses aio and direct-io and "gapful" files at least for the temporary ts.
The kernel in question is 18.104.22.168.
This is quite a major issue...
But I repeat: i'm not sure it's related to this bugreport. At least the roots are somewhere very close.
BTW, when it started in XFS, anyone know?
Michael, if you can come up with any testcase (not including oracle) that shows the problem, that'd be great.
However, I'd really suggest trying a current upstream kernel if possible just to be sure you're not hitting an old bug.
It sounds very odd that writing into sparse space on file B would corrupt writes to non-sparse file A...
There is an outstanding bug where non-fs-block aligned AIO to a sparse file can cause corruption (related to partial zeroing of the block which is outside the range of the IO, and this is not coordinated across multiple AIOs...) but this corrupts the sparse file being written to, not other files in the filesystem.
I wonder if it's possible that oracle is using the tempfile as a data source, somehow mis-reading 0s out of it, and writing those 0s to the main files?
Anyway, I think the current bug is well-understood and fixed, so if your problem persists upstream I'd suggest opening a new bug.
You asked about XFS, do you see the same problem there?
Well, it already was too difficult weekend (I had to migrate some large amount of data but hit the issue which means the job isn't done still, at the end of Monday)...
2.6.32 is current long-term-support kernel, and the patches mentioned in this bug weren't applied to the version I'm using now. So I'm not saying the bug is present in current git version.
Yes, it is quite possible that 'orrible is reading corrupt data from tmp filesystem - I didn't thought about that.
So I'll try to reproduce it later, when the thing will be done.
But the things seems to be quite clear now, this bug plus your explanation (reading zeros from tmp) -- the zero pieces are all 64-blocks long, which is a typical allocation unit in the data files.
Speaking of XFS. I tried a few different things (just a few, because the whole procedure takes large amount of time). I used ext4 on raid0 just to load data (to move the db off to another, final machine later) in a hope to speed things up, usually we use XFS. And finally I tried to switch to XFS and raid10 - configuration which is used since ages on other machines - tried that before finding this bugreport (I thought about the correlation between gaps and corruption on ext4 later). I'm not seeing problems with XFS so far (the load is still ongoing), but I also tried hard to avoid the problematic case with gapful files after reading this bugreport. So I don't know if it were problematic with XFS if I were not to avoid gaps. But remember, I need to complete the job... ;)
I asked about XFS because it is mentioned in this bugreport, with clear indication that it has the problem as well as ext4. So I wonder since when that problem were present, well, just.. curious.
And by the way, what's the final patch for ext4 case for this?
As far as I can see, no patches mentioned in this thread went to -stable kernels.
Linux 2.6.36 seems to work fine but older kernels (like the Ubuntu 2.6.35 kernel) still show this problem.
Fix seems to be included in the Ubuntu kernel 2.6.35-23.36 and later. I think this bug can be closed.
Alexander, this bug should be fixed upstream, do you mind retesting to confirm, and close if so?