I'm testing ext4 -o discard on a Super Talent FTM56GX25H SSD. The speed increase from the discard option seems promising, but I'm running into problems under a certain stressful file system load (approximate description; the actual sizes/numbers are not exact MB/GB, but that shouldn't matter):

* you have a 252 GB ext4 -m 0 -T largefile filesystem
* you have 250 input files of 1 GB each and an empty output file
* while the input has not been consumed:
  - load 1 MB from the end of each input file
  - truncate the input files to reduce their size by 1 MB
  - do some computation ...
  - append 250 MB to the output file

Checking the output file after the operation has finished, I find blocks of 0x00 that should not be there. These blocks are usually 1 MB in size (the size that was truncated and 'discarded') and always multiples of 16 KB, the minimal discard/TRIM-able unit (and also the discard/TRIM alignment) of the SSD, found by manual experiments with hdparm --trim-sector-ranges. Over several repetitions I've counted about 10-12 MB of invalid 0x00 bytes in the output.

The problem does not occur if I use 250,000 input files instead, read a subset of 250 files and delete them before writing the output, but that is significantly slower.

A possible cause could be some race condition between
* freeing filesystem blocks by truncating a file and queuing them for DISCARD/TRIM
* allocating free filesystem blocks for a new append/write to a file
* submitting the DISCARD/TRIM request to the disk
* submitting the write request to the disk

Is there a way to generate debug information from ext4 that would help track down this problem? The file system on the SSD is the only ext[2-4] file system in the machine.
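A rough, scaled-down sketch of that loop, with placeholder paths and dd standing in for the real computation/append step (not my actual job):

#!/bin/bash
# Illustrative only -- mount point, file names and sizes are placeholders.
MNT=/mnt/ssd    # hypothetical mount point of the ext4 -o discard filesystem

while [ "$(stat -c %s "$MNT/input1")" -gt 0 ]; do
        for F in "$MNT"/input*; do
                SIZE=$(stat -c %s "$F")
                # read the last 1 MB of the input file
                dd if="$F" of=/dev/null bs=1M count=1 skip=$(( SIZE / 1048576 - 1 )) 2>/dev/null
                # shrink it by 1 MB; with -o discard the freed blocks get queued for TRIM
                truncate -s -1M "$F"
        done
        # ... some computation ...
        # append to the output file; these writes may land on just-freed blocks
        dd if=/dev/urandom of="$MNT/outputfile" bs=1M count=250 oflag=append conv=notrunc 2>/dev/null
done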
Some time ago I posted compat discard support, which simulates discard by generating a simple zero-filled request: http://lkml.org/lkml/2010/2/11/74. Many changes were requested, so I'm still working on a new version (it will be ready soon). But it may be useful for debugging, in conjunction with blktrace.
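In the meantime, plain blktrace on the underlying device already shows the relative order of discards and writes; something along these lines should do (the device name is just an example):

# trace the block device the filesystem lives on
blktrace -d /dev/sdb -o ssdtrace &

# ... run the reproducer here ...

kill -INT %1
# replay the trace; discard requests show up with 'D' in the RWBS field,
# so the ordering of discards vs. writes to the same sectors is visible
blkparse -i ssdtrace | less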
Created attachment 25616 [details]
Proposed patch for this problem

Oh, sh*t. If what I think is happening is happening, this is definitely a brown paper bag bug. Does this fix it for you?
(In reply to comment #2)
> Does this fix it for you?

The patch didn't apply cleanly to 2.6.33; I had to "remove the old code block" manually. So far, after rebuilding the ext4 module, I haven't experienced the problem any more. Please get this into 2.6.33.x. Thanks!
Just for what it's worth, I've had trouble reproducing this on another brand of SSD... something like this (don't let the xfs_io throw you; it's just a convenient way to generate the IO). I did this on a 512M filesystem.

#!/bin/bash

SCRATCH_MNT=/mnt/scratch

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 240`; do
        xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 240`; do
        xfs_io -F -c "pread 1m 2m" $SCRATCH_MNT/file$I &>/dev/null
        xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 250m" $SCRATCH_MNT/outputfile &>/dev/null

In the end I don't get any corruption. I was hoping to write a testcase for this (one that didn't take 250G) :)

Does the above reflect your use case? Does the above corrupt the outputfile on your filesystem? (Note the "rm -f" above; careful with that.) You could substitute dd for xfs_io without much trouble if desired.
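For reference, a dd-only rough equivalent of the above (untested sketch; same paths and sizes, /dev/urandom just to get non-zero data so 0x00 runs stand out):

SCRATCH_MNT=/mnt/scratch

# Create several large-ish files with non-zero contents
for I in `seq 1 240`; do
        dd if=/dev/urandom of=$SCRATCH_MNT/file$I bs=1M count=2 2>/dev/null
done

# reread the last 1m of each, then truncate off 1m
for I in `seq 1 240`; do
        dd if=$SCRATCH_MNT/file$I of=/dev/null bs=1M skip=1 2>/dev/null
        truncate -s 1M $SCRATCH_MNT/file$I
done

# write the outputfile
dd if=/dev/urandom of=$SCRATCH_MNT/outputfile bs=1M count=250 2>/dev/null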
(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand
> of SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO). I did this on a 512M filesystem.

Might be a probability issue. For the 250 GB case I did about 200,000 truncations in total on about 250 files and found 8 and 13 corrupt blocks in the output file (I only kept detailed numbers for two cases). Reducing the block size might "help" by increasing the number of I/Os.

I can't test your script right now; the disks are all busy with some long-running experiments. There should be another one just back from RMA on my desk, so I can try it tomorrow when I'm back there (I was travelling for a week).

What do you do with the remaining space of the SSD? Try putting a file system there and filling it with something so that the SSD is 99% full and can't remap the blocks you are writing to as easily.
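Filling up the rest of the SSD could be as simple as something like this (partition and mount point are just examples, adjust to the actual layout):

# put a filesystem on the unused part of the SSD and fill it with
# incompressible data so nearly all flash blocks stay mapped
mkfs.ext4 /dev/sdb2
mount /dev/sdb2 /mnt/filler
dd if=/dev/urandom of=/mnt/filler/filler bs=1M    # runs until the fs is full
sync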
(In reply to comment #5)
> What do you do with the remaining space of the SSD? Try putting a file system
> there and filling it with something so that the SSD is 99% full and can't
> remap the blocks you are writing to as easily.

Hm, I suppose that could be, and it makes it a little harder to write a generic testcase....
If the number of available unmapped blocks has an impact, that seems most likely to be an SSD firmware bug to me.

I.e., if the Linux kernel is sending control messages in the wrong order, then it should cause corruption regardless of the number of unmapped blocks.
(In reply to comment #7)
> If the number of available unmapped blocks has an impact, that seems most
> likely to be an SSD firmware bug to me.
>
> I.e., if the Linux kernel is sending control messages in the wrong order,
> then it should cause corruption regardless of the number of unmapped blocks.

That's correct, except that you may get a timing issue (e.g. writing to free unmapped blocks is/could be/should be a bit faster than clearing the blocks first), which could turn this into race-condition debugging ...
(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand
> of SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO). I did this on a 512M filesystem.

With some small modifications I can reproduce this every time: I do two iterations of truncating + writing the output. It seems to happen in the second write only. You can skip the reading, it's not necessary. N=236 is the smallest N where the problem occurs, N=253 the maximum number of files fitting on the file system.

./find-zeroes is my tool to check for "0x00 holes"

mkfs options: -m 0 -T largefile4

#!/bin/bash

SCRATCH_MNT=/mnt/scratch
N=253

#rm -f $SCRATCH_MNT/*
#touch $SCRATCH_MNT/outputfile
#xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#./find-zeroes $SCRATCH_MNT/outputfile

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 $N`; do
        xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 $N`; do
        xfs_io -F -c "pread 1m 1m" $SCRATCH_MNT/file$I &>/dev/null
        xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 $N`; do
        xfs_io -F -c "pread 0m 1m" $SCRATCH_MNT/file$I &>/dev/null
        xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

./find-zeroes $SCRATCH_MNT/outputfile

$ ./trash-ext4-discard
at 246800384 length 18489344 size 511950848 zeroes 18489344
$ ./trash-ext4-discard
at 246808576 length 18481152 size 511950848 zeroes 18481152
$ ./trash-ext4-discard
at 246857728 length 18432000 size 511848448 zeroes 18432000
$ ./trash-ext4-discard
at 246640640 length 18649088 size 512086016 zeroes 18649088
$ ./trash-ext4-discard
at 246800384 length 18489344 size 511959040 zeroes 18489344

Actually this is enough:

# Create several large-ish files
for I in `seq 1 $N`; do
        xfs_io -F -f -c "pwrite 0 1m" $SCRATCH_MNT/file$I &>/dev/null
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# truncate all
for I in `seq 1 $N`; do
        xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

$ ./trash-ext4-discard2
at 228061184 length 37228544 size 530579456 zeroes 37228544
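find-zeroes itself isn't attached; as a crude stand-in, something like this counts all-zero 16-byte lines in the output file (it relies on the assumption that xfs_io's pwrite fills with a non-zero pattern by default, so any 0x00 run indicates corruption):

# crude substitute for ./find-zeroes: count od lines that are entirely 0x00
ZERO_LINES=$(od -An -v -tx1 $SCRATCH_MNT/outputfile | grep -c '^\( 00\)\{16\}$')
echo "approx zero bytes: $((ZERO_LINES * 16))"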
(In reply to comment #9)
> mkfs options: -m 0 -T largefile4

No, only -T largefile; otherwise I can't create enough files.
Thanks, I'll make sure I can reproduce this and turn it into a testcase.
I just saw that this patch went into 2.6.34:

commit b90f687018e6d6c77d981b09203780f7001407e5

Thanks!
Taking bug so I can close it :)