Bug 16165

Summary: Wrong data returned on read after write if file size was changed with ftruncate before
Product: File System    Reporter: Alexander Eichner (alexander.eichner)
Component: ext4    Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED CODE_FIX
Severity: normal    CC: alan, andrew.james.barr, cebbert, Charles.Butterfield, frank.mehnert, gabriel, mjt, philantrop, sandeen, tytso, yugzhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: at least up to 2.6.34    Subsystem:
Regression: No    Bisected commit-id:
Attachments: Testcase showing the corruption

Description Alexander Eichner 2010-06-08 22:01:55 UTC
Created attachment 26701 [details]
Testcase showing the corruption

Hi,

we have been using the direct async I/O API for file I/O in VirtualBox since 3.2 and got some reports about corrupted guest filesystems. It turned out that only users whose disk images are stored on an ext4 filesystem are affected. I was able to create a testcase which reproduces the error. Further investigation showed that the data corruption happens if ftruncate is used and data is written to a so-far-unallocated offset and immediately read back: the buffer holding the read data contains only \0 afterwards. The data is there if the file is closed and another program is used to view the file content (less, for example). Everything works if ftruncate is not used.
The same testcase with ftruncate works on ext3 here. A simple testcase is attached.

Compile with: gcc -D_GNU_SOURCE -o aio_corrupt aio_corrupt.c -laio
Usage: aio_corrupt <path/to/the/file>

Expected output: Success!
Output on ext3: Success!
Output on ext4: Corrupted buffer!
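
[Editor's note: for reference, a minimal sketch of the failing pattern described above. This is not the attached testcase itself; the 4096-byte size, 512-byte alignment, and 0xaa fill pattern are assumptions.]

/*
 * Sketch of the reported pattern: grow a file with ftruncate(), submit an
 * async direct write into the still-unallocated range, then immediately
 * read it back.  Compile with: gcc -D_GNU_SOURCE -o aio_sketch aio_sketch.c -laio
 */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096 /* one fs block (assumed) */

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_CREAT | O_RDWR | O_DIRECT, 0777);
    if (fd < 0) { perror("open"); return 1; }

    /* Grow the file so the write lands in unallocated space. */
    if (ftruncate(fd, SZ) < 0) { perror("ftruncate"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) { perror("io_setup"); return 1; }

    void *wbuf, *rbuf; /* O_DIRECT requires aligned buffers */
    if (posix_memalign(&wbuf, 512, SZ) || posix_memalign(&rbuf, 512, SZ))
        return 1;
    memset(wbuf, 0xaa, SZ);
    memset(rbuf, 0x00, SZ);

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event ev;

    /* Async write into the freshly truncated (unwritten) range... */
    io_prep_pwrite(&cb, fd, wbuf, SZ, 0);
    if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }
    io_getevents(ctx, 1, 1, &ev, NULL);

    /* ...then immediately read it back. */
    io_prep_pread(&cb, fd, rbuf, SZ, 0);
    if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }
    io_getevents(ctx, 1, 1, &ev, NULL);

    puts(memcmp(wbuf, rbuf, SZ) ? "Corrupted buffer!" : "Success!");

    io_destroy(ctx);
    close(fd);
    return 0;
}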

Kind regards,
Alexander Eichner
Comment 1 Eric Sandeen 2010-06-08 22:25:01 UTC
Thanks for the report ... interestingly, testing this on RHEL6 (which is just what I had conveniently booted, 2.6.32 + backports) seems fine:

# ./aio_corrupt testfile
Success!
# pwd
/mnt/testarea/scratch
# mount | grep scratch
/dev/sdb3 on /mnt/testarea/scratch type ext4 (rw)

I can test 2.6.34 as well, but are there any other interesting pieces of the setup - block size less than page size, perhaps?

-Eric
Comment 2 Alexander Eichner 2010-06-08 22:27:08 UTC
I tested it on a 64bit Ubuntu 10.04 with 2.6.32 and 2.6.34.
Comment 3 Alexander Eichner 2010-06-08 22:29:02 UTC
# mount | grep Test
/dev/sdb1 on /media/Test type ext4 (rw,nosuid,nodev,uhelper=udisks)
Comment 4 Alexander Eichner 2010-06-08 22:38:10 UTC
The 2.6.34 kernel is from the mainline PPA: http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-lucid/
These kernels don't contain any additional patches, according to https://wiki.ubuntu.com/KernelTeam/MainlineBuilds
Comment 5 Alexander Eichner 2010-06-09 15:26:17 UTC
It seems the bug doesn't always trigger. The first invocation of the testcase on a freshly created ext4 succeeds, but the second and every subsequent run fails.
This is what I've done (the mkfs.ext4 output should be uninteresting):

alexander@Hatak:~$ sudo mkfs.ext4 /dev/sda1
mke2fs 1.41.11 (14-Mar-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1876800 inodes, 7506363 blocks
375318 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
230 block groups
32768 blocks per group, 32768 fragments per group
8160 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000

Writing inode tables: done                        
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

The filesystem will be automatically checked every 32 mounts or every
180 days, whichever comes first.  This can be changed with
tune2fs -c or -t .
alexander@Hatak:~$ sudo mount /dev/sda1 /mnt
alexander@Hatak:~$ sudo ./aio_corrupt /mnt/test
Success!
alexander@Hatak:~$ sudo ./aio_corrupt /mnt/test2
Corrupted buffer!

This is with the standard Ubuntu 2.6.32 kernel.
Comment 6 Eric Sandeen 2010-06-09 15:49:01 UTC
Hm.

# for I in `seq 1 50`; do ./aio_corrupt test$I; done

still works for me.  I don't have any patches in RHEL6 that aren't upstream, but I'll go ahead & recheck on an upstream kernel, I guess.

Thanks,
-Eric
Comment 7 Alexander Eichner 2010-06-09 22:57:38 UTC
I have now installed RHEL 6 in a VM (the image is on ext3) and can confirm that the problem isn't reproducible with the default 2.6.32 kernel. I then downloaded and compiled a vanilla 2.6.34 kernel from http://www.kernel.org. The problem is there again, but it doesn't happen on every run of the testcase: 5 failures in 11 runs.
Could be timing sensitive.
Comment 8 Alexander Eichner 2010-06-10 07:38:08 UTC
The chance of reproducing this issue seems to depend heavily on the host.
Just tried it on a Notebook with Ubuntu 10.04 and an ext4 root.

# for I in `seq 1 50`; do ./aio_corrupt /mnt/test$I; done

The test failed only 4 times in 50 runs.
Comment 9 Frank Mehnert 2010-06-11 14:26:51 UTC
When compiling the testcase, please make sure to add a third parameter to open(), for example 0777; otherwise it won't compile on Ubuntu (open() with O_CREAT requires three arguments).
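
For example (illustrative; the 0777 mode is arbitrary, and the other flags match how the testcase is described):

int fd = open(path, O_CREAT | O_RDWR | O_DIRECT, 0777);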

I executed the testcase here on an up-to-date Ubuntu 10.04 installed on an ext4-formatted partition. The testcase does NOT fail if the test file already exists, but it fails about 130 out of 500 times if I remove the test file prior to each run. With a vanilla Linux 2.6.34, 7080 out of 10000 runs reported a corrupted buffer. And, as expected, the test never fails on an ext3 partition.

So please make sure to run the testcase like this:

# for i in `seq 1 500`; do rm -f /mnt/tst; ./aio_corrupt /mnt/tst; done

where /mnt is an ext4 partition.
Comment 10 Frank Mehnert 2010-06-16 08:19:52 UTC
Please, could one of the ext4 maintainers have a look? We believe this bug is responsible for many of the data corruption reports we are getting for our latest VirtualBox release, since we made the asynchronous file API the default for certain controller types.
Comment 11 Frank Mehnert 2010-06-18 15:13:44 UTC
This problem is probably related: http://www.spinics.net/lists/linux-ext4/msg19493.html
Comment 12 Eric Sandeen 2010-06-18 15:30:13 UTC
Can those who can reproduce this test this patch from mingming?

For unaligned DIO writes, skip zeroing out the block if the buffer is marked
unwritten. That means there is an asynchronous direct IO (append or hole
fill) still pending.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/direct-io.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-git/fs/direct-io.c
===================================================================
--- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
+++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
@@ -740,7 +740,8 @@
 	struct page *page;
 
 	dio->start_zero_done = 1;
-	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
+	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
+	    || buffer_unwritten(&dio->map_bh))
 		return;
 
 	dio_blocks_per_fs_block = 1 << dio->blkfactor;
Comment 13 Alexander Eichner 2010-06-19 18:10:46 UTC
The test still fails here.
Comment 14 Alexander Eichner 2010-06-19 18:12:31 UTC
And like in the mail thread it fails for XFS too.
Comment 15 Eric Sandeen 2010-06-24 15:53:20 UTC
(In reply to comment #14)
> And like in the mail thread it fails for XFS too.

How about with these 2 patches from the xfs list?

[PATCH 1/2] direct-io: move aio_complete into ->end_io
[PATCH 2/2] xfs: move aio completion after unwritten extent conversion

(thread starts at http://oss.sgi.com/archives/xfs/2010-06/msg00269.html)
Comment 16 Eric Sandeen 2010-06-24 22:20:25 UTC
And for ext4 perhaps try this on top of the above:

[PATCH] EXT4: move aio completion after unwritten extent conversion
http://marc.info/?l=linux-ext4&m=127741775019003&w=2
Comment 17 Alexander Eichner 2010-06-27 09:05:28 UTC
These patches fix the data corruption here. Thanks!
Comment 18 Chuck Ebbert 2010-07-23 21:45:36 UTC
Are these patches going to be merged for 2.6.35?
Comment 19 Eric Sandeen 2010-07-23 21:51:37 UTC
Alexander, there is another corruption problem that we've been chasing related to unaligned AIO on ext4 & xfs; see:

http://archives.free.net.ph/message/20100722.190526.4de99e37.en.html

and

http://git.kernel.org/?p=fs/xfs/xfstests-dev.git;a=commit;h=b05ae2d65379ccf92eb3fe8360325dd2287bf766

The patches above won't take care of that problem.

Chuck, good question about when they'll be merged... I dunno.

-Eric
Comment 20 Theodore Tso 2010-07-23 22:09:38 UTC
Yes, the patches in comment #15 will be merged in 2.6.35.  They are already in my tree (although they haven't been pushed out yet as of this writing; they'll be in the ext4 tree soon though).  I've negotiated with Christoph to make sure there won't be any merge conflicts with the XFS tree (since the first patch is going to be carried in both the ext4 and xfs trees).
Comment 21 Chuck Ebbert 2010-08-04 02:06:42 UTC
2.6.35 is out and those patches are not in there.
Comment 22 Theodore Tso 2010-08-04 03:40:03 UTC
The patches are in my tree and I'll push them during this merge window, and then get them into 2.6.35.1.   Sorry, things got busy and I didn't have a chance to push them to Linus before 2.6.35 shipped.
Comment 23 Michael Tokarev 2010-08-16 13:24:32 UTC
Yesterday I observed another case of data corruption that looks very close to this one, but slightly different (and much more serious).

The scenario is this: an Oracle 10 database, creating the DB and importing data.  Oracle uses ftruncate() to create a file of the desired size, then fills the file with block headers, so there's no unwritten data in the file.  So far so good.  There are also temporary tablespaces, which aren't filled during creation, but just ftruncate'd.

The problem happens when I extend a temp file during heavy database writes.  That was needed a few times, because creating large indexes requires a lot of temporary space.

Each time I extend the temp file (say, from 10 to 20 GB), Oracle performs an ftruncate on it and continues writing to it and to other data files (which were pre-filled after their ftruncate, even before a system restart).  During this time, there's a very good chance that _other_ data files become corrupt while it is writing to the newly extended temp space.

We observed several chunks of zeros (1 MB each) written over _other_ files during this time.  Re-reading those files returns the same zeros.

So, this delayed allocation corrupts _other_ data files which are already allocated and written.

Avoiding gaps eliminates the problem.

Note that Oracle uses AIO, direct I/O, and "gapful" (sparse) files, at least for the temporary tablespaces.

The kernel in question is 2.6.32.15.

This is quite a major issue...

But I repeat: I'm not sure it's related to this bug report.  At least the roots are somewhere very close.

BTW, does anyone know when this started in XFS?
Comment 24 Eric Sandeen 2010-08-16 14:53:20 UTC
Michael, if you can come up with any testcase (not including oracle) that shows the problem, that'd be great.

However, I'd really suggest trying a current upstream kernel if possible just to be sure you're not hitting an old bug.

It sounds very odd that writing into sparse space on file B would corrupt writes to non-sparse file A...

There is an outstanding bug where non-fs-block-aligned AIO to a sparse file can cause corruption (related to partial zeroing of the part of the block outside the range of the IO, which is not coordinated across multiple AIOs...), but this corrupts the sparse file being written to, not other files in the filesystem.
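
[Editor's note: for illustration, the pattern in question is roughly the following. This is a sketch, not the xfstests reproducer linked below; the 4096-byte block size, 512-byte sector size, and all names are assumptions, and error checks are omitted for brevity.]

/* Two concurrent 512-byte direct AIO writes into the same unallocated
 * 4096-byte block of a sparse file.  Each sub-block DIO zeroes the rest
 * of the block, and the two zeroing passes are not coordinated, so one
 * write can clobber the other. */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_CREAT | O_RDWR | O_DIRECT, 0666);
    ftruncate(fd, 4096);                 /* sparse: the block is unallocated */

    io_context_t ctx = 0;
    io_setup(2, &ctx);

    void *b1, *b2;                       /* O_DIRECT needs aligned buffers */
    if (posix_memalign(&b1, 512, 512) || posix_memalign(&b2, 512, 512))
        return 1;
    memset(b1, 0x11, 512);
    memset(b2, 0x22, 512);

    struct iocb cb1, cb2;
    struct iocb *cbs[2] = { &cb1, &cb2 };
    io_prep_pwrite(&cb1, fd, b1, 512, 0);    /* first 512 B of the block */
    io_prep_pwrite(&cb2, fd, b2, 512, 512);  /* next 512 B, same fs block */
    io_submit(ctx, 2, cbs);                  /* both in flight at once */

    struct io_event ev[2];
    io_getevents(ctx, 2, 2, ev, NULL);
    /* On affected kernels, a read-back may show one 512 B range zeroed. */
    io_destroy(ctx);
    close(fd);
    return 0;
}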

I wonder if it's possible that oracle is using the tempfile as a data source, somehow mis-reading 0s out of it, and writing those 0s to the main files?

Anyway, I think the current bug is well-understood and fixed, so if your problem persists upstream I'd suggest opening a new bug.

You asked about XFS, do you see the same problem there?

Thanks,
-Eric
Comment 25 Michael Tokarev 2010-08-16 19:26:39 UTC
Well, it has already been a difficult enough weekend (I had to migrate a large amount of data but hit this issue, which means the job still isn't done at the end of Monday)...

2.6.32 is the current long-term-support kernel, and the patches mentioned in this bug weren't applied to the version I'm using now.  So I'm not saying the bug is present in the current git version.

Yes, it is quite possible that 'orrible is reading corrupt data from the temp file - I hadn't thought about that.

So I'll try to reproduce it later, once the job is done.

But things seem quite clear now: this bug plus your explanation (reading zeros from the temp file) -- the zeroed pieces are all 64 blocks long, which is the typical allocation unit in the data files.

Speaking of XFS: I tried a few different things (just a few, because the whole procedure takes a large amount of time).  I used ext4 on RAID0 just to load the data (planning to move the DB to another, final machine later) in the hope of speeding things up; usually we use XFS.  Finally I tried switching to XFS on RAID10 - the configuration that has been used for ages on other machines - before finding this bug report (I only thought about the correlation between gaps and corruption on ext4 later).  I'm not seeing problems with XFS so far (the load is still ongoing), but after reading this bug report I also tried hard to avoid the problematic case with gapful files.  So I don't know whether XFS would have been problematic had I not avoided gaps.  But remember, I need to complete the job... ;)

I asked about XFS because it is mentioned in this bug report, with a clear indication that it has the problem as well as ext4.  So I wonder how long that problem has been present - well, just curious.

And by the way, what's the final patch for the ext4 case?

Thanks!
Comment 26 Michael Tokarev 2010-08-16 19:59:34 UTC
As far as I can see, no patches mentioned in this thread went to -stable kernels.
Comment 27 Frank Mehnert 2010-11-04 11:00:31 UTC
Linux 2.6.36 seems to work fine, but older kernels (such as the Ubuntu 2.6.35 kernel) still show this problem.
Comment 28 Frank Mehnert 2011-01-10 12:40:17 UTC
The fix seems to be included in Ubuntu kernel 2.6.35-23.36 and later. I think this bug can be closed.
Comment 29 Eric Sandeen 2011-04-23 17:42:53 UTC
Alexander, this bug should be fixed upstream. Do you mind retesting to confirm, and closing it if so?