Bug 10871

Summary: Apparent infinite loop on reading broken filesystem
Product: File System        Reporter: Sami Liedes (sami.liedes)
Component: ext2             Assignee: Andrew Morton (akpm)
Status: CLOSED OBSOLETE
Severity: normal            CC: alan, eugeneteo, jack, sandeen, sudha.anil
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25.4    Subsystem:
Regression: No              Bisected commit-id:
Attachments: A broken ext2 filesystem image (test case), bzip2 compressed
             ext2_find_entry() returns immediately upon finding an error in reading directory blocks

Description Sami Liedes 2008-06-06 05:43:38 UTC
Distribution: Debian sid
Hardware Environment: x86 qemu (where I try to break the kernel; I assume it's broken on real x86 too, but haven't tested)
Software Environment: debootstrapped minimal Debian sid
Problem Description:

Attempting to read a directory on the attached (intentionally broken) ext2 filesystem enters an apparently infinite loop of out-of-bounds device reads. This might be a problem since, for example, in some environments non-privileged users are allowed to insert USB sticks which are automatically mounted. And it's probably always better to be as robust as possible wrt filesystem corruption.

From my tests it seems ext2 is very robust, while I can get the ext3 journaling code to crash quite easily (I will report that separately after a while; I just thought it might be better to get the non-ext3-specific bugs squished first). I have yet to test other file systems.

I can also give you the (qemu) testing environment images and scripts if you wish. The breaking is done by randomly flipping each bit in the image with a fixed probability (currently 0.0001), mounting the result, cp -r:ing a directory, doing a find -xdev, touching all files, doing mkdir tmp, creating and writing a file, and finally doing rm -rf on the entire fs.
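(For illustration only, a minimal user-space sketch of the bit-flipping step. This is not the tool actually used; the reporter's zzuf-based script appears in comment #16 below, and the name flipbits and its interface here are made up.)

------------------------------------------------------------
/*
 * flipbits.c -- hypothetical sketch of the corruption step described
 * above: flip each bit of stdin with a fixed probability and write
 * the result to stdout.
 *
 * Usage: ./flipbits 0.0001 42 < clean.img > broken.img
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	double p = argc > 1 ? atof(argv[1]) : 0.0001;  /* flip probability */
	unsigned seed = argc > 2 ? atoi(argv[2]) : 1;  /* fixed seed => repeatable */
	int c, bit;

	srand(seed);
	while ((c = getchar()) != EOF) {
		for (bit = 0; bit < 8; bit++)
			if ((double)rand() / RAND_MAX < p)
				c ^= 1 << bit;
		putchar(c);
	}
	return 0;
}
------------------------------------------------------------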

Steps to reproduce (inside qemu; adjust appropriately if loop-mounting the image file):

1. mount /dev/hdb /mnt -o 'errors=continue' -t ext2
2. cd /mnt
3. ls doc

After this I get the following in dmesg:

---------
attempt to access beyond end of device
hdb: rw=0, want=1493366918, limit=20480
Buffer I/O error on device hdb, logical block 746683458
attempt to access beyond end of device
hdb: rw=0, want=3700621328, limit=20480
Buffer I/O error on device hdb, logical block 1850310663
attempt to access beyond end of device
hdb: rw=0, want=2397154474, limit=20480
Buffer I/O error on device hdb, logical block 1198577236
attempt to access beyond end of device
hdb: rw=0, want=4265175662, limit=20480
Buffer I/O error on device hdb, logical block 2132587830
attempt to access beyond end of device
hdb: rw=0, want=1493366918, limit=20480
Buffer I/O error on device hdb, logical block 746683458
attempt to access beyond end of device
hdb: rw=0, want=3700621328, limit=20480
Buffer I/O error on device hdb, logical block 1850310663
attempt to access beyond end of device
hdb: rw=0, want=2397154474, limit=20480
Buffer I/O error on device hdb, logical block 1198577236
attempt to access beyond end of device
hdb: rw=0, want=4265175662, limit=20480
Buffer I/O error on device hdb, logical block 2132587830
attempt to access beyond end of device
hdb: rw=0, want=3378575734, limit=20480
Buffer I/O error on device hdb, logical block 1689287866
attempt to access beyond end of device
attempt to access beyond end of device
hdb: rw=0, want=121897816, limit=20480
Buffer I/O error on device hdb, logical block 60948907
attempt to access beyond end of device
hdb: rw=0, want=2614902342, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=1444111036, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=3378575734, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=121897816, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=2614902342, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=1444111036, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=780306490, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=4012112738, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=3981364912, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=1959574150, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=780306490, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=4012112738, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=3981364912, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=1959574150, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=378760226, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=378760226, limit=20480
attempt to access beyond end of device
hdb: rw=0, want=378760226, limit=20480
attempt to access beyond end of device
---------

And after that the last 2 lines repeat ad infinitum.
Comment 1 Sami Liedes 2008-06-06 05:46:00 UTC
Created attachment 16412 [details]
A broken ext2 filesystem image (test case), bzip2 compressed
Comment 2 Sudha Anil Gathala 2010-06-22 19:26:15 UTC
I am able to reproduce this on x86 while mounting as ext2. However, the machine does not crash; the above lines are printed whenever I try to access the file system.

So, what's the expected behaviour in the case of a corrupted file system?
(Should the mount itself fail?)
Comment 3 Sudha Anil Gathala 2010-06-23 02:33:20 UTC
Ok, now I can see the problem; the infinite loop is observed when doing "ls /mnt/doc".
Comment 4 Sudha Anil Gathala 2010-06-23 17:30:21 UTC
Some observations (arch x86, kernel 2.6.31.13):

$mount -oloop,errors=continue -t ext2 hdb.1548 /mnt
$time ls /mnt/doc

ls: cannot access temp/doc: No such file or directory

real	0m36.560s
user	0m0.004s
sys	0m31.918s

In the dmesg, the following lines are printed repeatedly for the above duration (36.560 seconds):

attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
attempt to access beyond end of device
loop0: rw=0, want=4298207984, limit=20480
Comment 5 Sudha Anil Gathala 2010-06-23 19:10:27 UTC
It is not quite an infinite loop.

The root directory in the corrupted filesystem thinks it has a large number of data blocks (i_blocks = 2123063714; dir_pages() = 846572).

So, when looking up a file named "doc", the code (ext2_find_entry()) attempts to read all of these data blocks to find the entry for "doc". Specifically, ext2_find_entry() returns only after attempting to read every page of the directory from disk. Each such attempt prints the following lines:

attempt to access beyond end of device
loop0: rw=0, want=?????????, limit=20480

A possible fix could be to fail the lookup right after finding that the dir is corrupted. However, I'm not sure whether that defeats the purpose of the "errors=continue" flag.
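(For reference, the loop in question looks roughly like this; abridged from fs/ext2/dir.c of the 2.6.x era, so treat it as pseudocode rather than the exact source. Note that npages comes from dir_pages(), which is derived from i_size, not i_blocks: (i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT. With the corrupt i_size later shown in comment #11, -827409154, i.e. 3467558142 unsigned, that works out to the 846572 pages quoted above.)

------------------------------------------------------------
/*
 * Rough sketch of ext2_find_entry() circa 2.6.x (abridged: the
 * entry-matching and last-page length handling are trimmed).
 */
struct ext2_dir_entry_2 *ext2_find_entry(struct inode *dir,
		struct dentry *dentry, struct page **res_page)
{
	unsigned long npages = dir_pages(dir);	/* derived from i_size */
	unsigned long start = EXT2_I(dir)->i_dir_start_lookup;
	unsigned long n;

	if (npages == 0)
		return NULL;
	if (start >= npages)
		start = 0;
	n = start;
	do {
		struct page *page = ext2_get_page(dir, n);
		if (!IS_ERR(page)) {
			/* ... scan the entries in this page; on a name
			 * match, set *res_page and return the entry ... */
			ext2_put_page(page);
		}
		/* A failed read is simply skipped: there is no bailout,
		 * so a corrupt i_size means npages failing reads, each
		 * logging "attempt to access beyond end of device". */
		if (++n >= npages)
			n = 0;
	} while (n != start);

	return NULL;	/* not found */
}
------------------------------------------------------------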
Comment 6 Eric Sandeen 2010-06-23 19:16:16 UTC
Something like we have in ext3, from commit 40b851348fe9bf49c26025b34261d25142269b60, might help:

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..5304eaf 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -342,6 +342,9 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 					return 0;
 				}
 			}
+			/* corrupt size?  Maybe no more blocks to read */
+			if (filp->f_pos > inode->i_blocks << 9)
+				break;
 			filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
 		}
 		ext2_put_page(page);


-Eric
Comment 7 Sudha Anil Gathala 2010-06-23 19:45:03 UTC
Created attachment 26916 [details]
ext2_find_entry() returns immediately upon finding an error in reading directory blocks

Please review the attached patch.
Comment 8 Eric Sandeen 2010-06-23 19:52:56 UTC
When we fixed something similar on ext3, the hope was that even if there was an IO error on one block of a dir, we could still hope to read later parts of the directory.  Bailing out in this way wouldn't allow that ...  It's an option, but it's different behavior compared to ext3.

-Eric
Comment 9 Eric Sandeen 2010-06-23 19:54:15 UTC
Also, if you do it this way, dir_has_error has no purpose; it'll never be tested after it's set.
Comment 10 Sudha Anil Gathala 2010-06-23 19:59:03 UTC
I agree. 

However, in that case there is no need for a fix here. The code is already doing the right thing: reading (or attempting to read) all the data blocks of the directory through to the end before declaring the inode lookup a failure.

It's just that in this particular case (the example above) the corrupted directory claims a huge number of data blocks, hence the flood of messages in dmesg.
Comment 11 Eric Sandeen 2010-06-23 20:27:32 UTC
Ok, the patch in comment #6 won't help because in this case both i_blocks and i_size are fubar:

Inode: 2   Type: directory    Mode:  0466   Flags: 0xce84283   Generation: 2542945240
User: 59974   Group: 60919   Size: -827409154
File ACL: 1540230428    Directory ACL: -182146329
Links: 3453   Blockcount: 2123063714

ISTR considering testing for a sparse directory inode when we fixed ext3, but for some reason we didn't go that way. In this case it seems like a reasonable test: if size & block count are both corrupt, it's unlikely that they'd "match" so as to look like a non-sparse file.

Maybe on the first error encountered, we should compare size to blocks and permanently bail if they don't match, since that would indicate metadata corruption rather than just an IO error.

There are so many things wrong with the root dir, I think you could choose just about any sanity test to disqualify it ;)
Comment 12 Sami Liedes 2010-06-23 22:14:23 UTC
From a user/tester POV, it seems you could detect metadata corruption when the directory extends past the end of the filesystem (or in this case, I guess the code could figure out that once reading a block has failed because it's past the end of the device, no reads even further past the EOD are going to succeed either). In any case I don't think thousands of error messages saying essentially the same thing are going to be helpful to the admin.

But I agree that this seems a rather minor problem. The reason I use errors=continue is that I thought it's more likely to cause problems ;) and that would presumably help find more bugs or potential bugs in the fs code.
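(A rough, untested sketch of that suggestion: before reading a mapped directory block, refuse block numbers past the end of the filesystem. The helper name ext2_dir_block_sane and its placement are hypothetical; this only illustrates where such a check could live.)

------------------------------------------------------------
/* Hypothetical guard, not a tested patch: reject directory block
 * numbers beyond the filesystem, so one corrupt mapping produces a
 * single ext2_error() instead of thousands of buffer-layer messages. */
static int ext2_dir_block_sane(struct inode *dir, unsigned long blk)
{
	struct ext2_sb_info *sbi = EXT2_SB(dir->i_sb);

	if (blk >= le32_to_cpu(sbi->s_es->s_blocks_count)) {
		ext2_error(dir->i_sb, __func__,
			   "dir inode %lu references block %lu beyond end of fs",
			   dir->i_ino, blk);
		return 0;
	}
	return 1;
}
------------------------------------------------------------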
Comment 13 Sudha Anil Gathala 2010-06-23 23:08:59 UTC
Checking for metadata corruption sounds like a good plan.

I did not follow how the above test would correctly handle a sparse file.
In the case of a sparse file too, i_size and i_blocks do not match, right?

Alternatively, we could test for corruption of the dir's inode the way e2fsck does, by checking the various attributes of the inode...
Comment 14 Sudha Anil Gathala 2010-06-23 23:11:22 UTC
Sami:

would you mind sharing the code you used to corrupt the filesystem image ?

Thanks,
Anil
Comment 15 Sudha Anil Gathala 2010-06-23 23:19:12 UTC
Eric: 

Is it the case that a dir is never a sparse file?
If so, an i_size/i_blocks mismatch check would probably suffice.

Thanks,
Anil
Comment 16 Sami Liedes 2010-06-24 15:56:14 UTC
Here's a script I use inside qemu. hda is the root fs, hdc is a 10 MiB filesystem (here ext2), and hdb is a 10 MiB block device where the corrupted filesystem will be written and which is subsequently mounted.

zzuf is an input fuzzer (e.g. Debian has it, package zzuf). The parameter -s $s tells zzuf to use seed $s (for repeatability), and -r 0:0.03 tells it to pick a random probability p between 0 and 0.03 and to flip bits in the input with that probability (so in some runs around 3% of the bits get flipped, while in others almost none are).

hda contains a minimal Debian system, and I just run it in qemu with something like

qemu -kernel bzImage -append 'root=/dev/hda console=ttyS0,115200n8' -hda hda -hdb hdb -hdc hdc -nographic -serial pty

------------------------------------------------------------
#!/bin/bash
# bash, not plain sh: the loop below uses bash-only constructs
# ("for ((...))" arithmetic loops and ">&" redirection)

if [ "`hostname`" != "fstest" ]; then
   echo "This is a dangerous script."
   echo "Set your hostname to \`fstest\' if you want to use it."
   exit 1
fi

umount /dev/hdb
umount /dev/hdc
/etc/init.d/sysklogd stop
/etc/init.d/klogd stop
/etc/init.d/cron stop
mount /dev/hda / -t ext3 -o remount,ro || exit 1

#ulimit -t 20

for ((s=$1; s<1000000000; s++)); do
  umount /mnt
  echo '***** zzuffing *****' seed $s
  zzuf -r 0:0.03 -s $s </dev/hdc >/dev/hdb || exit
  mount /dev/hdb /mnt -t ext2 -o errors=continue || continue
  cd /mnt || continue
  timeout 30 cp -r doc doc2 >&/dev/null
  timeout 30 find -xdev >&/dev/null
  timeout 30 find -xdev -print0 2>/dev/null |xargs -0 touch -- 2>/dev/null
  timeout 30 mkdir tmp >&/dev/null
  timeout 30 echo whoah >tmp/filu 2>/dev/null
  timeout 30 rm -rf /mnt/* >&/dev/null
  cd /
done
------------------------------------------------------------
Comment 17 Eric Sandeen 2010-06-24 16:09:00 UTC
Yes, I think that a sparse dir should be invalid. This is probably something we want to test for in the error paths, though, to decide whether we want to bail out. Doing these tests on every readdir might be expensive, but we could use them to decide whether we need to stop trying after an error...

Unfortunately for a very large filesystem we won't hit the error we see here until we have tried traversing many, many blocks ... in this case the fs is small enough that the errors come early.

-Eric
Comment 18 Sudha Anil Gathala 2010-06-24 18:11:48 UTC
Eric:

Do you think the following test is good enough to check for the mismatch?

i_blocks != (i_size >> 9)

(or should I be adding one to the right-hand side?)

Thanks,
Anil
Comment 19 Sudha Anil Gathala 2010-06-24 18:24:12 UTC
This seems to be doing the job right:

--- fs/ext2/dir.c	2010-06-24 11:18:20.549140505 -0700
+++ fs/ext2/dir.c.orig	2010-06-23 12:39:08.897558337 -0700
@@ -399,11 +399,8 @@
 				de = ext2_next_entry(de);
 			}
 			ext2_put_page(page);
-		} else {
+		} else
 			dir_has_error = 1;
-			if (unlikely((dir->i_size>>9) != dir->i_blocks))
-				goto out;
-		}
 
 		if (++n >= npages)
 			n = 0;
Comment 20 Sudha Anil Gathala 2010-06-24 18:25:19 UTC
(pls ignore comment #19)
This seems to be doing the job:

--- fs/ext2/dir.c.orig	2010-06-23 12:39:08.897558337 -0700
+++ fs/ext2/dir.c	2010-06-24 11:18:20.549140505 -0700
@@ -399,8 +399,11 @@
 				de = ext2_next_entry(de);
 			}
 			ext2_put_page(page);
-		} else
+		} else {
 			dir_has_error = 1;
+			if (unlikely((dir->i_size>>9) != dir->i_blocks))
+				goto out;
+		}
 
 		if (++n >= npages)
 			n = 0;
Comment 21 Jan Kara 2010-08-03 15:52:31 UTC
This actually won't work, because indirect blocks are also accounted in i_blocks. Thus this check will fail for any directory that has an indirect block...
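(A back-of-the-envelope illustration of that point, assuming a 1 KiB block size and the classic twelve direct block pointers:)

------------------------------------------------------------
/* Why (i_size >> 9) != i_blocks misfires on a healthy directory:
 * i_blocks counts 512-byte sectors for data AND indirect blocks.
 * Take a valid directory occupying 13 blocks of 1 KiB each: */
unsigned long i_size   = 13 * 1024;    /* 13312 bytes                 */
unsigned long i_blocks = (13 + 1) * 2; /* 13 data blocks + 1 indirect
                                          block = 28 512-byte sectors */
/* i_size >> 9 == 26, but i_blocks == 28, so the proposed check
 * flags a perfectly valid directory as corrupt. */
------------------------------------------------------------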