Bug 10871
| Summary: | Apparent infinite loop on reading broken filesystem | | |
|---|---|---|---|
| Product: | File System | Reporter: | Sami Liedes (sami.liedes) |
| Component: | ext2 | Assignee: | Andrew Morton (akpm) |
| Status: | CLOSED OBSOLETE | | |
| Severity: | normal | CC: | alan, eugeneteo, jack, sandeen, sudha.anil |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.25.4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:

- A broken ext2 filesystem image (test case), bzip2 compressed
- ext2_find_entry() returns immediately upon finding an error in reading directory blocks
Description
Sami Liedes 2008-06-06 05:43:38 UTC
Created attachment 16412 [details]
A broken ext2 filesystem image (test case), bzip2 compressed
I am able to reproduce this on x86 while mounting as ext2. However, the machine does not crash; the above lines are printed whenever I try to access the file system. So what is the expected behaviour in the case of a corrupted file system? (Should the mount itself fail?)

OK, now I can see the problem; the infinite loop is observed when doing "ls /mnt/doc". Some observations (arch x86, kernel 2.6.31.13):

```
$ mount -oloop,errors=continue -t ext2 hdb.1548 /mnt
$ time ls /mnt/doc
ls: cannot access temp/doc: No such file or directory

real    0m36.560s
user    0m0.004s
sys     0m31.918s
```

In dmesg, the following lines are printed repeatedly for the above duration (36.560 seconds):

```
attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
```

It is not quite an infinite loop. The root directory in the corrupted filesystem claims to have a huge number of data blocks (i_blocks = 2123063714; dir_pages() = 846572). So when looking up a file named "doc", the code in ext2_find_entry() attempts to read all of these data blocks to find the entry for "doc". Specifically, ext2_find_entry() returns only after attempting to read every page of the directory from disk. Each such attempt prints the following lines:

```
attempt to access beyond end of device
loop0: rw=0, want=?????????, limit=20480
```

A possible fix could be to fail the lookup right after finding that the directory is corrupted. However, I am not sure whether that defeats the purpose of the "errors=continue" flag. Something like what we have in ext3, from commit 40b851348fe9bf49c26025b34261d25142269b60, might help:

```diff
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..5304eaf 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -342,6 +342,9 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 				return 0;
 			}
 		}
+		/* corrupt size? Maybe no more blocks to read */
+		if (filp->f_pos > inode->i_blocks << 9)
+			break;
 		filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
 	}
 	ext2_put_page(page);
```

-Eric
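For scale, a small standalone sketch (user-space, not kernel code; 4 KiB pages are an assumption) of what the bound in that hunk computes. i_blocks counts 512-byte sectors, so `inode->i_blocks << 9` is the directory size in bytes implied by the block count; the shift binds tighter than the comparison, so the expression reads as `f_pos > (i_blocks << 9)`:

```c
/*
 * Standalone sketch, not kernel code: i_blocks is in 512-byte
 * sectors, so (i_blocks << 9) is the directory size in bytes that
 * the (corrupt) block count implies. The i_blocks value is the one
 * reported above; the 4 KiB page size is an assumption.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t i_blocks = 2123063714ULL;        /* from the corrupted root dir */
    uint64_t bytes    = i_blocks << 9;        /* ~1 TiB implied on a 10 MiB image */
    uint64_t pages    = (bytes + 4095) >> 12; /* 4 KiB pages */

    printf("implied size  : %llu bytes\n", (unsigned long long)bytes);
    printf("implied pages : %llu\n", (unsigned long long)pages);
    /* The reported dir_pages() = 846572 appears to come from i_size
     * rather than i_blocks, but either corrupt field gives an absurd
     * bound on the lookup loop. */
    return 0;
}
```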
Created attachment 26916 [details]
ext2_find_entry() returns immediately upon finding an error in reading directory blocks

Please review the attached patch.
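The patch itself is not inlined in this thread. As a rough, hypothetical illustration of the behaviour the attachment title describes (bailing out of the lookup on the first failed directory-page read, rather than trying every remaining page), a user-space sketch might look like this; the names and types are stand-ins, not the actual kernel interfaces:

```c
/*
 * Hypothetical user-space sketch of the idea in the attachment title:
 * fail the lookup on the first failed directory-page read instead of
 * plowing through every remaining page. Stand-in names, not the real
 * kernel interfaces.
 */
#include <stdio.h>

struct page_stub { char data[4096]; };

/* Stand-in for ext2_get_page(); here every read fails, which models
 * block numbers that point past the end of the device. */
static struct page_stub *get_dir_page(unsigned long n)
{
    (void)n;
    return NULL; /* simulated I/O error */
}

static int find_entry(unsigned long npages, const char *name)
{
    (void)name; /* unused in this stub */
    for (unsigned long n = 0; n < npages; n++) {
        struct page_stub *page = get_dir_page(n);
        if (!page)
            return -1; /* proposed: bail immediately on a read error */
        /* ... otherwise scan this page's entries for `name` ... */
    }
    return -1; /* not found */
}

int main(void)
{
    /* With 846572 bogus pages, bailing early skips ~846571 doomed reads. */
    printf("lookup result: %d\n", find_entry(846572, "doc"));
    return 0;
}
```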
When we fixed something similar in ext3, the hope was that even if there was an I/O error on one block of a directory, we could still hope to read later parts of the directory. Bailing out in this way wouldn't allow that... It's an option, but it's different behavior compared to ext3. -Eric

Also, if you do it this way, dir_has_error has no purpose; it will never be tested after it is set.

I agree. However, in that case there is no need for a fix here. The code is already doing the right thing - reading (or attempting to read) all the data blocks of the directory to the end before announcing a failure of the inode lookup. It is just that in this particular case (example above) the corrupted directory has a large number of data blocks, and hence we see so many prints in dmesg.

OK, the patch in comment #6 won't help, because in this case both i_blocks and i_size are fubar:

```
Inode: 2   Type: directory   Mode: 0466   Flags: 0xce84283
Generation: 2542945240
User: 59974   Group: 60919   Size: -827409154
File ACL: 1540230428   Directory ACL: -182146329
Links: 3453   Blockcount: 2123063714
```

ISTR considering a test for a sparse directory inode when we fixed ext3, but for some reason we didn't go that way; in this case it seems like a reasonable test. If size and block count are both corrupt, it is unlikely that they would "match" to create a non-sparse file. Maybe on the first error encountered we should compare size to blocks and permanently bail out if they don't match, since that would indicate metadata corruption rather than just an I/O error. There are so many things wrong with the root dir, I think you could choose just about any sanity test to disqualify it ;)

From a user/tester POV, it seems you could determine metadata corruption when the directory extends past the end of the filesystem (or, in this case, I guess the code could figure out that once reading a block has failed because it is past the end of the device, no reads further past the end of the device are going to succeed). In any case, I don't think thousands of error messages saying essentially the same thing are going to be helpful to the admin. But I agree that this seems a rather minor problem. The reason I use errors=continue is that I thought it is more likely to cause problems ;) and that would presumably help find more bugs or potential bugs in the fs code.

Checking for metadata corruption sounds like a good plan. I did not follow how the above test would correctly handle a sparse file, though. In the case of a sparse file too, i_size and i_blocks do not match, right? Alternatively, we could test for corruption of the directory's inode the way e2fsck does - checking different attributes of the inode. Sami: would you mind sharing the code you used to corrupt the filesystem image? Thanks, Anil

Eric: Is it the case that a directory is never a sparse file? In that case an i_size/i_blocks mismatch would probably suffice. Thanks, Anil
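For a sense of scale, here is a hypothetical standalone check fed the corrupt inode's values from the debugfs dump above. Reading i_size as an unsigned 32-bit value is an interpretation, but it is consistent with the dir_pages() = 846572 figure quoted earlier (ceil(3467558142 / 4096) = 846572):

```c
/*
 * Hypothetical standalone mismatch check using the corrupt inode's
 * values from the debugfs dump above. i_size is read as an unsigned
 * 32-bit value (3467558142), matching the reported dir_pages() count.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t i_size_raw = (uint32_t)-827409154; /* = 3467558142 */
    uint64_t i_blocks   = 2123063714ULL;        /* 512-byte sectors */

    uint64_t sectors_from_size = (uint64_t)i_size_raw >> 9; /* 6772574 */

    printf("i_size >> 9 = %llu\n", (unsigned long long)sectors_from_size);
    printf("i_blocks    = %llu\n", (unsigned long long)i_blocks);
    printf("mismatch (=> bail out): %s\n",
           i_blocks != sectors_from_size ? "yes" : "no");
    return 0;
}
```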
Here's a script I use inside qemu. hda is the root fs, hdc is a 10 MiB filesystem (here ext2), and hdb is a 10 MiB block device where the corrupted filesystem will be written and which is subsequently mounted. zzuf is an input fuzzer (e.g. Debian has it, package zzuf). The parameter -s $s tells zzuf to use seed $s (for repeatability), and -r 0:0.03 tells it to pick a random probability p between 0 and 0.03 and to flip bits in its input with that probability (so in some runs around 3% of the bits get flipped, while in other runs it is close to none).

hda contains a minimal Debian system, and I just run it in qemu with something like:

```
qemu -kernel bzImage -append 'root=/dev/hda console=ttyS0,115200n8' \
     -hda hda -hdb hdb -hdc hdc -nographic -serial pty
```

```sh
#!/bin/sh

if [ "`hostname`" != "fstest" ]; then
	echo "This is a dangerous script."
	echo "Set your hostname to \`fstest\' if you want to use it."
	exit 1
fi

umount /dev/hdb
umount /dev/hdc
/etc/init.d/sysklogd stop
/etc/init.d/klogd stop
/etc/init.d/cron stop
mount /dev/hda / -t ext3 -o remount,ro || exit 1
#ulimit -t 20
for ((s=$1; s<1000000000; s++)); do
	umount /mnt
	echo '***** zzuffing *****' seed $s
	zzuf -r 0:0.03 -s $s </dev/hdc >/dev/hdb || exit
	mount /dev/hdb /mnt -t ext2 -o errors=continue || continue
	cd /mnt || continue
	timeout 30 cp -r doc doc2 >&/dev/null
	timeout 30 find -xdev >&/dev/null
	timeout 30 find -xdev -print0 2>/dev/null | xargs -0 touch -- 2>/dev/null
	timeout 30 mkdir tmp >&/dev/null
	timeout 30 echo whoah >tmp/filu 2>/dev/null
	timeout 30 rm -rf /mnt/* >&/dev/null
	cd /
done
```

Yes, I think that a sparse directory should be invalid. This is probably something we want to test for in the error paths, though, to decide whether we want to bail out. Doing these tests on every readdir might be expensive, but we could use them to decide whether we need to stop trying after an error... Unfortunately, for a very large filesystem we won't hit the error we see here until we have tried traversing many, many blocks... In this case the fs is small enough that the errors come early. -Eric

Eric: Do you think the following test is good enough to check the mismatch? i_blocks != (i_size >> 9) (or should I be adding one to the right-hand side?) Thanks, Anil

This seems to be doing the job right:

```diff
--- fs/ext2/dir.c	2010-06-24 11:18:20.549140505 -0700
+++ fs/ext2/dir.c.orig	2010-06-23 12:39:08.897558337 -0700
@@ -399,11 +399,8 @@
 			de = ext2_next_entry(de);
 		}
 		ext2_put_page(page);
-	} else {
+	} else
 		dir_has_error = 1;
-		if (unlikely((dir->i_size>>9) != dir->i_blocks))
-			goto out;
-	}
 
 	if (++n >= npages)
 		n = 0;
```

(Please ignore comment #19 - the diff above is reversed; it diffs the patched file against the original.) This seems to be doing the job:

```diff
--- fs/ext2/dir.c.orig	2010-06-23 12:39:08.897558337 -0700
+++ fs/ext2/dir.c	2010-06-24 11:18:20.549140505 -0700
@@ -399,8 +399,11 @@
 			de = ext2_next_entry(de);
 		}
 		ext2_put_page(page);
-	} else
+	} else {
 		dir_has_error = 1;
+		if (unlikely((dir->i_size>>9) != dir->i_blocks))
+			goto out;
+	}
 
 	if (++n >= npages)
 		n = 0;
```

This actually won't work, because indirect blocks are also accounted in i_blocks. Thus the check will fire for any directory that has an indirect block...
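To make that closing objection concrete: a healthy directory that needs even one indirect block fails the i_blocks != (i_size >> 9) test, because i_blocks also counts metadata blocks. A small standalone sketch (assuming 4 KiB blocks and ext2's 12 direct block pointers; the numbers are illustrative, not from this image):

```c
/*
 * Worked example of the objection above: i_blocks counts metadata
 * (indirect) blocks too, so i_blocks != (i_size >> 9) fires on a
 * perfectly valid directory. Assumes 4 KiB blocks and 12 direct
 * block pointers; illustrative numbers, not from this image.
 */
#include <stdio.h>

int main(void)
{
    unsigned long block_size  = 4096; /* bytes per block */
    unsigned long data_blocks = 13;   /* 12 direct + 1 => one indirect block needed */
    unsigned long indirect    = 1;    /* the single-indirect metadata block */

    unsigned long long i_size   = (unsigned long long)data_blocks * block_size;
    /* i_blocks is in 512-byte sectors and includes the indirect block. */
    unsigned long long i_blocks = (unsigned long long)(data_blocks + indirect)
                                  * (block_size / 512);

    printf("i_size >> 9 = %llu\n", i_size >> 9); /* 104 */
    printf("i_blocks    = %llu\n", i_blocks);    /* 112 */
    printf("false positive on a valid dir: %s\n",
           i_blocks != (i_size >> 9) ? "yes" : "no");
    return 0;
}
```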