Bug 15579

Summary: ext4 -o discard produces incorrect blocks of zeroes in newly created files under heavy read+truncate+append-new-file load
Product: File System
Component: ext4
Reporter: Andreas Beckmann (kernel-bugs)
Assignee: Eric Sandeen (sandeen)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
CC: dmonakhov, Greg.Freemyer, linux-kernel-bugs, sandeen, yugzhang
Hardware: All
OS: Linux
Kernel Version: 2.6.33
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Proposed patch for this problem

Description Andreas Beckmann 2010-03-19 10:51:33 UTC
I'm testing ext4 -o discard on a Super Talent FTM56GX25H SSD. The speed increase by using the discard option seems promising.
But I'm experiencing problems under a certain stressful file system load:

(approximate description, the actual sizes/numbers are not exact MB/GB, but that shouldn't be a problem)
* you have a 252 GB ext4 -m 0 -T largefile filesystem
* you have 250 input files of size 1 GB each and an empty output file
* while the input has not been consumed
  - load 1 MB from the end of each input file
  - truncate the input files to reduce their size by 1 MB
  - do some computation ...
  - append 250 MB to the output file
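A scaled-down sketch of the workload described above, for illustration only. The function name run_workload, its parameters and the 0xcd fill byte are assumptions, not the reporter's actual program; the "computation" step is reduced to copying the data unchanged.

```bash
#!/bin/bash
# run_workload DIR NFILES FILE_BYTES CHUNK_BYTES   (hypothetical helper)
run_workload() {
  local dir=$1 nfiles=$2 file_bytes=$3 chunk=$4
  local out=$dir/outputfile i f size new

  : > "$out"
  # Fill the inputs with a non-zero byte (0xcd) so that unexpected 0x00
  # blocks in the output can be detected afterwards.
  for i in $(seq 1 "$nfiles"); do
    head -c "$file_bytes" /dev/zero | LC_ALL=C tr '\0' '\315' > "$dir/file$i"
  done

  size=$file_bytes
  while [ "$size" -gt 0 ]; do
    new=$(( size > chunk ? size - chunk : 0 ))
    for i in $(seq 1 "$nfiles"); do
      f=$dir/file$i
      # Load the last chunk of the input and append it to the output ...
      tail -c "$chunk" "$f" >> "$out"
      # ... then shrink the input file by the same amount.
      truncate -s "$new" "$f"
    done
    size=$new
  done
}

# Example (tiny sizes): run_workload /mnt/scratch 250 1048576 4096
```

On a filesystem mounted with -o discard, each truncate frees blocks that may be queued for DISCARD while the appends allocate new blocks, which is the mix of operations the report is about.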

Checking the output file after the operation has finished, I find blocks of 0x00 that should not be there. These blocks are usually 1 MB in size (the size that was truncated and 'discarded') and always multiples of 16 KB, which is the minimal discard/TRIM-able unit (and also the discard/TRIM alignment) of the SSD, as found by manual experiments with hdparm --trim-sector-ranges.
In several repetitions I've counted about 10-12MB of invalid 0x00 bytes in the output.

The problem does not occur if I use 250000 input files instead, read a subset of 250 files and delete them before writing the output. This is significantly slower.

A possible cause could be some race condition between
* freeing filesystem blocks by truncating a file and queuing them for DISCARD/TRIM
* allocating free filesystem blocks for a new append/write to a file
* submitting the DISCARD/TRIM request to the disk
* submitting the write request to the disk
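The suspected ordering problem can be illustrated with a toy model. This is not kernel code: the four-element array stands in for the disk, the function name simulate is made up, and it is assumed that a discarded block reads back as zeroes, as observed in this report.

```bash
#!/bin/bash
# Toy model: block 2 is freed by a truncate and immediately reallocated
# to the output file. Only the order of DISCARD and write differs.
simulate() {
  local order=$1
  local -a disk=(old old old old)
  case $order in
    discard-then-write)
      disk[2]=0x00    # DISCARD for the freed block reaches the device first
      disk[2]=new     # then the new data is written
      ;;
    write-then-discard)
      disk[2]=new     # new data is written to the reallocated block
      disk[2]=0x00    # the stale DISCARD arrives afterwards
      ;;
  esac
  echo "${disk[2]}"
}

simulate discard-then-write   # prints "new"
simulate write-then-discard   # prints "0x00"
```

Only the second ordering loses data, which would match zeroed regions whose size equals the truncated (and discarded) range.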

Is there a possibility to generate debug information from ext4 that would be helpful for tracking down this problem? The file system on the SSD is the only ext[2-4] file system in the machine.
Comment 1 Dmitry Monakhov 2010-03-19 12:40:57 UTC
Some time ago I posted compat discard support, which simulates
discard by generating simple zero-filled requests:
http://lkml.org/lkml/2010/2/11/74
Many changes were requested, so I'm still working on a new version (it will be
ready soon).
But it may be useful for debugging in conjunction with blktrace.
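A minimal sketch of how blktrace output could be narrowed down to the interesting requests. The helper name filter_discard_and_write and the device /dev/sdX are placeholders; the filter assumes blkparse's default output format, in which the seventh column is the RWBS field ('D' marks a discard, 'W' a write).

```bash
#!/bin/bash
# Keep only events whose RWBS field (7th column of blkparse's default
# output) contains 'D' (discard) or 'W' (write).
filter_discard_and_write() {
  awk '$7 ~ /^[A-Z]+$/ && $7 ~ /[DW]/'
}

# Example invocation (needs root; /dev/sdX is a placeholder):
#   blktrace -d /dev/sdX -o - | blkparse -i - | filter_discard_and_write
```

Comparing the sector ranges and timestamps of the remaining lines shows whether a discard was issued for a range after new data had been written to it.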
Comment 2 Theodore Tso 2010-03-19 18:13:46 UTC
Created attachment 25616 [details]
Proposed patch for this problem

Oh, sh*t.   If what I think is happening, is happening, this is definitely a brown paper bag bug.

Does this fix it for you?
Comment 3 Andreas Beckmann 2010-03-21 09:45:54 UTC
(In reply to comment #2)
> Does this fix it for you?

The patch didn't apply cleanly to 2.6.33, I had to "remove the old code block" manually.
So far after rebuilding the ext4 module I haven't experienced the problem any more. Please get this into 2.6.33.x. 

Thanks!
Comment 4 Eric Sandeen 2010-03-22 21:40:52 UTC
Just for what it's worth, I've had trouble reproducing this on another brand of SSD... something like this (don't let the xfs_io throw you; it's just a convenient way to generate the IO).  I did this on a 512M filesystem.

#!/bin/bash

SCRATCH_MNT=/mnt/scratch

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 240`; do
  xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 240`; do
  xfs_io -F -c "pread 1m 2m" $SCRATCH_MNT/file$I &>/dev/null
  xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 250m" $SCRATCH_MNT/outputfile &>/dev/null

In the end I don't get any corruption.  I was hoping to write a testcase for this (one that didn't take 250G) :)

Does the above reflect your use case?  Does the above corrupt the outputfile on your filesystem?  (note the "rm -f" above, careful with that).  You could substitute dd for xfs_io without much trouble if desired.
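A sketch of what the dd substitution could look like. The helper names pwrite_mb, pread_mb and truncate_mb are hypothetical, and offsets and lengths are given in MiB. xfs_io's pwrite fills the buffer with 0xcd by default, so the writer below converts /dev/zero to 0xcd as well; writing plain zeroes would make wrongly zeroed blocks undetectable.

```bash
#!/bin/bash
MIB=1048576

# Roughly: xfs_io -f -c "pwrite <off>m <len>m" FILE
pwrite_mb() {
  local file=$1 off=$2 len=$3
  head -c "$((len * MIB))" /dev/zero | LC_ALL=C tr '\0' '\315' |
    dd of="$file" bs="$MIB" seek="$off" conv=notrunc 2>/dev/null
}

# Roughly: xfs_io -c "pread <off>m <len>m" FILE
pread_mb() {
  local file=$1 off=$2 len=$3
  dd if="$file" of=/dev/null bs="$MIB" skip="$off" count="$len" 2>/dev/null
}

# Roughly: xfs_io -c "truncate <len>m" FILE
truncate_mb() {
  local file=$1 len=$2
  truncate -s "$((len * MIB))" "$file"
}

# Example: pwrite_mb /mnt/scratch/file1 0 2
#          pread_mb /mnt/scratch/file1 1 1
#          truncate_mb /mnt/scratch/file1 1
```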
Comment 5 Andreas Beckmann 2010-03-23 11:10:15 UTC
(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand of
> SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO).  I did this on a 512M filesystem.

Might be a probability issue. For the 250 GB case I did in total about 200000 truncations on about 250 files and found in the output file 8 and 13 corrupt blocks (I only kept detailed numbers for two cases). Reducing the block size might "help" by increasing the number of I/Os.

I can't test your script right now, the disks are all busy with some long running experiments. There should be another one just back from RMA on my desk, so I can try it tomorrow when I'm back there (was travelling for a week).

What do you do on the remaining space of the SSD? Try putting a file system there and fill it with something so that the SSD is 99% filled so it can't that easily remap the blocks you are writing to.
Comment 6 Eric Sandeen 2010-03-23 14:29:35 UTC
(In reply to comment #5)

> What do you do on the remaining space of the SSD? Try putting a file system
> there and fill it with something so that the SSD is 99% filled so it can't
> that easily remap the blocks you are writing to.

Hm, I suppose that could be, and it makes it a little harder to write a generic testcase....
Comment 7 Greg.Freemyer 2010-03-23 21:01:04 UTC
If the number of available unmapped blocks has an impact, that seems most likely to be an SSD firmware bug to me.

I.e., if the Linux kernel is sending control messages in the wrong order, then it should cause corruption regardless of the number of unmapped blocks.
Comment 8 Andreas Beckmann 2010-03-29 08:16:54 UTC
(In reply to comment #7)
> If the number of available unmapped blocks has an impact, that seems most
> likely to be an SSD firmware bug to me.
>
> I.e., if the Linux kernel is sending control messages in the wrong order, then
> it should cause corruption regardless of the number of unmapped blocks.

That's correct except that you may get a timing issue (e.g. writing to free unmapped blocks is/could be/should be a bit faster than clearing the blocks first) which could turn this into race condition debugging ...
Comment 9 Andreas Beckmann 2010-03-29 08:36:38 UTC
(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand of
> SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO).  I did this on a 512M filesystem.

With some small modifications I can reproduce this every time: I do two iterations of truncating + writing the output. Seems to happen in the second write only.
You can skip the reading; it is not necessary.
N=236 is the smallest N where the problem occurs, N=253 the maximum number of files fitting on the file system.

./find-zeroes is my tool to check for "0x00 holes"
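The find-zeroes tool itself is not attached to this report. A hypothetical stand-in could look like the sketch below; the function name, the 16 KB default block size and the one-dd-per-block approach (slow on large files) are all assumptions, and only the output format is modelled on the results shown in this comment.

```bash
#!/bin/bash
# find_zeroes FILE [BLOCK_BYTES]
# Reports runs of blocks that contain only 0x00 bytes.
find_zeroes() {
  local file=$1 bs=${2:-16384}
  local size nblocks i nonzero run_start=-1 zeroes=0
  size=$(( $(wc -c < "$file") ))
  nblocks=$(( (size + bs - 1) / bs ))
  for ((i = 0; i < nblocks; i++)); do
    # Count the non-NUL bytes in block i; 0 means the block is all zeroes.
    nonzero=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null |
              LC_ALL=C tr -d '\0' | wc -c)
    if [ "$nonzero" -eq 0 ]; then
      if [ "$run_start" -lt 0 ]; then run_start=$((i * bs)); fi
    elif [ "$run_start" -ge 0 ]; then
      echo "at $run_start length $((i * bs - run_start))"
      zeroes=$((zeroes + i * bs - run_start))
      run_start=-1
    fi
  done
  # A run of zeroes may extend to the end of the file.
  if [ "$run_start" -ge 0 ]; then
    echo "at $run_start length $((size - run_start))"
    zeroes=$((zeroes + size - run_start))
  fi
  echo "size $size zeroes $zeroes"
}

# Example: find_zeroes /mnt/scratch/outputfile
```

This only works as a corruption check if the data written never legitimately contains a whole block of zeroes, which holds for the 0xcd pattern that xfs_io pwrite uses by default.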

mkfs options: -m 0 -T largefile4

#!/bin/bash

SCRATCH_MNT=/mnt/scratch

N=253

#rm -f $SCRATCH_MNT/*
#touch $SCRATCH_MNT/outputfile
#xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#./find-zeroes $SCRATCH_MNT/outputfile

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 $N`; do
  xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 $N`; do
  xfs_io -F -c "pread 1m 1m" $SCRATCH_MNT/file$I &>/dev/null
  xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# read the remaining 1m of each, just for kicks, and truncate to 0
for I in `seq 1 $N`; do
  xfs_io -F -c "pread 0m 1m" $SCRATCH_MNT/file$I &>/dev/null
  xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

./find-zeroes $SCRATCH_MNT/outputfile



$ ./trash-ext4-discard
at 246800384 length 18489344
size 511950848 zeroes 18489344
$ ./trash-ext4-discard
at 246808576 length 18481152
size 511950848 zeroes 18481152
$ ./trash-ext4-discard
at 246857728 length 18432000
size 511848448 zeroes 18432000
$ ./trash-ext4-discard
at 246640640 length 18649088
size 512086016 zeroes 18649088
$ ./trash-ext4-discard
at 246800384 length 18489344
size 511959040 zeroes 18489344

Actually, this is enough:


# Create several large-ish files
for I in `seq 1 $N`; do
  xfs_io -F -f -c "pwrite 0 1m" $SCRATCH_MNT/file$I &>/dev/null
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# truncate all
for I in `seq 1 $N`; do
  xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null


$ ./trash-ext4-discard2
at 228061184 length 37228544
size 530579456 zeroes 37228544
Comment 10 Andreas Beckmann 2010-03-29 08:43:03 UTC
(In reply to comment #9)
> mkfs options: -m 0 -T largefile4
No, only -T largefile; otherwise I can't create enough files.
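For reference, a sketch of the filesystem setup implied by these options. The function name setup_scratch and the device and mount point arguments are placeholders; the commands are only printed unless RUN=1 is set, because mkfs destroys all data on the device.

```bash
#!/bin/bash
# setup_scratch DEVICE MOUNTPOINT   (hypothetical helper)
setup_scratch() {
  local dev=$1 mnt=$2 cmd
  for cmd in "mkfs.ext4 -m 0 -T largefile $dev" \
             "mount -t ext4 -o discard $dev $mnt"; do
    echo "$cmd"
    # Execute only on explicit request.
    if [ "${RUN:-0}" = 1 ]; then $cmd; fi
  done
}

# Example (dry run): setup_scratch /dev/sdX1 /mnt/scratch
```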
Comment 11 Eric Sandeen 2010-03-29 14:54:57 UTC
Thanks, I'll make sure I can reproduce this and turn it into a testcase.
Comment 12 Andreas Beckmann 2010-05-19 10:50:44 UTC
I just saw that this patch went into 2.6.34
commit b90f687018e6d6c77d981b09203780f7001407e5

Thanks!
Comment 13 Eric Sandeen 2010-05-19 15:58:25 UTC
Taking bug so I can close it :)