Bug 15579
| Summary: | ext4 -o discard produces incorrect blocks of zeroes in newly created files under heavy read+truncate+append-new-file load | | |
|---|---|---|---|
| Product: | File System | Reporter: | Andreas Beckmann (kernel-bugs) |
| Component: | ext4 | Assignee: | Eric Sandeen (sandeen) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | dmonakhov, Greg.Freemyer, linux-kernel-bugs, sandeen, yugzhang |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.33 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Proposed patch for this problem | | |
Description
Andreas Beckmann
2010-03-19 10:51:33 UTC
Comment #1

Some time ago I posted compat discard support, which simulates discard by generating a simple zero-filled request: http://lkml.org/lkml/2010/2/11/74 Many changes were requested, so I'm still working on a new version (it will be ready soon). But it may be useful for debugging needs, in conjunction with blktrace.
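A rough sketch of watching discard requests reach the device with blktrace while one of the workloads below runs; the device path is a placeholder, and the field position of RWBS in blkparse's default output is an assumption:

```sh
# Trace /dev/sdb (placeholder for the SSD under test) and keep only records
# whose RWBS field contains 'D', i.e. discard requests.
blktrace -d /dev/sdb -o - | blkparse -i - | awk '$7 ~ /D/'
```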
Comment #2

Created attachment 25616 [details]
Proposed patch for this problem
Oh, sh*t. If what I think is happening, is happening, this is definitely a brown paper bag bug.
Does this fix it for you?
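Not the author's procedure, but one plausible way to try the attached patch on a 2.6.33 tree; the source path and patch filename are placeholders, and ext4 built as a module (CONFIG_EXT4_FS=m) in a tree matching the running kernel is assumed:

```sh
cd /usr/src/linux-2.6.33                  # assumed location of the source tree
patch -p1 < ~/attachment-25616.patch      # placeholder filename; may need manual fix-up (see comment #3)
make M=fs/ext4 modules                    # rebuild only the ext4 module
cp fs/ext4/ext4.ko /lib/modules/$(uname -r)/kernel/fs/ext4/ext4.ko
depmod -a
# With all ext4 filesystems unmounted, reload the module:
rmmod ext4 && modprobe ext4
```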
Comment #3

(In reply to comment #2)
> Does this fix it for you?

The patch didn't apply cleanly to 2.6.33; I had to "remove the old code block" manually. So far, after rebuilding the ext4 module, I haven't experienced the problem any more. Please get this into 2.6.33.x. Thanks!

Comment #4

Just for what it's worth, I've had trouble reproducing this on another brand of SSD... something like this (don't let the xfs_io throw you; it's just a convenient way to generate the IO). I did this on a 512M filesystem.

```sh
#!/bin/bash
SCRATCH_MNT=/mnt/scratch

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 240`; do
    xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 240`; do
    xfs_io -F -c "pread 1m 2m" $SCRATCH_MNT/file$I &>/dev/null
    xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 250m" $SCRATCH_MNT/outputfile &>/dev/null
```

In the end I don't get any corruption. I was hoping to write a testcase for this (one that didn't take 250G) :)

Does the above reflect your use case? Does the above corrupt the outputfile on your filesystem? (Note the "rm -f" above, careful with that.) You could substitute dd for xfs_io without much trouble if desired.

Comment #5

(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand of
> SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO). I did this on a 512M filesystem.

Might be a probability issue. For the 250 GB case I did in total about 200000 truncations on about 250 files and found 8 and 13 corrupt blocks in the output file (I only kept detailed numbers for two cases). Reducing the block size might "help" by increasing the number of I/Os. I can't test your script right now, the disks are all busy with some long running experiments. There should be another one just back from RMA on my desk, so I can try it tomorrow when I'm back there (was travelling for a week).

What do you do with the remaining space of the SSD? Try putting a file system there and filling it with something so that the SSD is 99% full and can't easily remap the blocks you are writing to.

Comment #6

(In reply to comment #5)
> What do you do with the remaining space of the SSD? Try putting a file system
> there and filling it with something so that the SSD is 99% full and can't
> easily remap the blocks you are writing to.

Hm, I suppose that could be, and it makes it a little harder to write a generic testcase....

Comment #7

If the number of available unmapped blocks has an impact, that seems most likely to be an SSD firmware bug to me.

I.e. if the Linux kernel is sending control messages in the wrong order, then it should cause corruption regardless of the number of unmapped blocks.

Comment #8

(In reply to comment #7)
> If the number of available unmapped blocks has an impact, that seems most
> likely to be an SSD firmware bug to me.
>
> I.e. if the Linux kernel is sending control messages in the wrong order, then
> it should cause corruption regardless of the number of unmapped blocks.

That's correct, except that you may get a timing issue (e.g. writing to free unmapped blocks is/could be/should be a bit faster than clearing the blocks first), which could turn this into race condition debugging ...
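As suggested at the end of comment #4, the xfs_io calls can be replaced with dd plus truncate. A hedged mapping follows; /dev/urandom is an assumption, chosen so that regions zeroed by the bug stand out from legitimately written data:

```sh
SCRATCH_MNT=/mnt/scratch
N=240

# xfs_io -F -f -c "pwrite 0 2m" file          -> write 2 MiB of non-zero data at offset 0
dd if=/dev/urandom of=$SCRATCH_MNT/file1 bs=1M count=2 conv=notrunc 2>/dev/null

# xfs_io -F -c "pread 1m 1m" file             -> read 1 MiB starting at offset 1 MiB
dd if=$SCRATCH_MNT/file1 of=/dev/null bs=1M skip=1 count=1 2>/dev/null

# xfs_io -F -c "truncate 1m" file             -> truncate to 1 MiB
truncate -s 1M $SCRATCH_MNT/file1

# xfs_io -F -c "pwrite ${N}M ${N}m" outputfile -> write N MiB starting at offset N MiB
dd if=/dev/urandom of=$SCRATCH_MNT/outputfile bs=1M seek=$N count=$N conv=notrunc 2>/dev/null
```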
Comment #9

(In reply to comment #4)
> Just for what it's worth, I've had trouble reproducing this on another brand of
> SSD... something like this (don't let the xfs_io throw you; it's just a
> convenient way to generate the IO). I did this on a 512M filesystem.

With some small modifications I can reproduce this every time: I do two iterations of truncating + writing the output. It seems to happen in the second write only. You can skip the reading, it's not necessary. N=236 is the smallest N where the problem occurs, N=253 the maximum number of files fitting on the file system. ./find-zeroes is my tool to check for "0x00 holes".

mkfs options: -m 0 -T largefile4

```sh
#!/bin/bash
SCRATCH_MNT=/mnt/scratch
N=253

#rm -f $SCRATCH_MNT/*
#touch $SCRATCH_MNT/outputfile
#xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
#./find-zeroes $SCRATCH_MNT/outputfile

rm -f $SCRATCH_MNT/*
touch $SCRATCH_MNT/outputfile

# Create several large-ish files
for I in `seq 1 $N`; do
    xfs_io -F -f -c "pwrite 0 2m" $SCRATCH_MNT/file$I &>/dev/null
done

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 $N`; do
    xfs_io -F -c "pread 1m 1m" $SCRATCH_MNT/file$I &>/dev/null
    xfs_io -F -c "truncate 1m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# reread the last bit of each, just for kicks, and truncate off 1m
for I in `seq 1 $N`; do
    xfs_io -F -c "pread 0m 1m" $SCRATCH_MNT/file$I &>/dev/null
    xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

./find-zeroes $SCRATCH_MNT/outputfile
```

```
$ ./trash-ext4-discard
at 246800384 length 18489344 size 511950848 zeroes 18489344
$ ./trash-ext4-discard
at 246808576 length 18481152 size 511950848 zeroes 18481152
$ ./trash-ext4-discard
at 246857728 length 18432000 size 511848448 zeroes 18432000
$ ./trash-ext4-discard
at 246640640 length 18649088 size 512086016 zeroes 18649088
$ ./trash-ext4-discard
at 246800384 length 18489344 size 511959040 zeroes 18489344
```

Actually this is enough:

```sh
# Create several large-ish files
for I in `seq 1 $N`; do
    xfs_io -F -f -c "pwrite 0 1m" $SCRATCH_MNT/file$I &>/dev/null
done

# Append the outputfile
xfs_io -F -c "pwrite 0 ${N}m" $SCRATCH_MNT/outputfile &>/dev/null

# truncate all
for I in `seq 1 $N`; do
    xfs_io -F -c "truncate 0m" $SCRATCH_MNT/file$I
done

# Append the outputfile
xfs_io -F -c "pwrite ${N}M ${N}m" $SCRATCH_MNT/outputfile &>/dev/null
```

```
$ ./trash-ext4-discard2
at 228061184 length 37228544 size 530579456 zeroes 37228544
```

Comment #10

(In reply to comment #9)
> mkfs options: -m 0 -T largefile4

No, only -T largefile; otherwise I can't create enough files.

Comment #11

Thanks, I'll make sure I can reproduce this and turn it into a testcase.

Comment #12

I just saw that this patch went into 2.6.34, commit b90f687018e6d6c77d981b09203780f7001407e5. Thanks!

Comment #13

Taking bug so I can close it :)
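The ./find-zeroes checker used in the reproduction scripts above is not attached to this bug; under that assumption, a slow but self-contained stand-in that reports runs of NUL bytes of at least one 4 KiB block might look like this:

```sh
#!/bin/bash
# Hypothetical replacement for ./find-zeroes: print the offset and length of
# every run of 0x00 bytes that is at least 4096 bytes long. Byte-at-a-time
# via od/awk, so it is slow, but adequate for a ~512 MB test file.
od -An -v -tx1 "$1" | awk '
{
    for (i = 1; i <= NF; i++) {
        if ($i == "00") {
            if (run == 0) start = off
            run++
        } else {
            if (run >= 4096) printf "at %d length %d\n", start, run
            run = 0
        }
        off++
    }
}
END { if (run >= 4096) printf "at %d length %d\n", start, run }'
```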
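To check whether a given kernel tree already carries the fix referenced in comment #12, one option is git describe, run inside a clone of the mainline tree; the commit id is taken from that comment:

```sh
# Prints the first tag that contains the fix (expected to be a 2.6.34 tag,
# per comment #12).
git describe --contains b90f687018e6d6c77d981b09203780f7001407e5
```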