Bug 20992 - Data corruption triggers ext4 oops
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2010-10-23 14:39 UTC by Bart Van Assche
Modified: 2013-12-10 22:22 UTC

See Also:
Kernel Version: 2.6.35.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel oops (4.35 KB, text/plain)
2010-10-23 14:58 UTC, Bart Van Assche

Description Bart Van Assche 2010-10-23 14:39:00 UTC
While running I/O performance tests I accidentally overwrote an ext4 filesystem. The next access to that filesystem triggered a kernel oops. I don't think that should happen.
Comment 1 Theodore Tso 2010-10-23 14:56:03 UTC
Can you send the oops message, complete with the stack trace?

Thanks!!
Comment 2 Bart Van Assche 2010-10-23 14:58:48 UTC
Created attachment 34542
Kernel oops

Attached backtrace.
Comment 3 Theodore Tso 2010-10-23 15:20:43 UTC
Yep, looks like a bug alright. 

From what I can tell, you were in the middle of async I/O when the disk was corrupted. The problem seems to have occurred after the I/O completed, when ext4_convert_unwritten_extents() was trying to set the initialized bit in the extent tree. At that point the extent tree must have been corrupted on disk, which seriously confused the extent conversion code: it ended up passing 0 to ext4_ext_put_in_cache() as the length of the extent, and that tripped the BUG_ON in ext4_ext_put_in_cache().

How did you corrupt the file system while it was mounted?   Was it via some dd to the disk device directly?

We do have code that checks to make sure the extent tree is sane, but we skip it if the data was already in the buffer cache, to save CPU costs. If you wrote to the disk device directly, the write would have gone through the buffer cache. Since the extent tree was already cached, we would have skipped the validation step, and that could explain how the bug got triggered.
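
To make the hypothesis above concrete, here is a minimal user-space sketch in Python. It is not kernel code: the names Device, raw_write, is_sane and read_extent are invented for illustration, an extent is reduced to a (start, length) pair, and the only sanity rule assumed is "length must be non-zero". It shows how a check that runs only on a cache miss can let a block through once it has been overwritten while cached.

```python
class Device:
    """Toy block device whose raw writes go through the same buffer cache."""

    def __init__(self, blocks):
        self.disk = dict(blocks)   # block number -> (start, length)
        self.cache = {}            # blocks already read and validated

    def raw_write(self, blockno, data):
        # Models writing to the device node: the on-disk copy and any
        # cached copy are both replaced, with no filesystem-level check.
        self.disk[blockno] = data
        if blockno in self.cache:
            self.cache[blockno] = data


def is_sane(extent):
    # Assumed sanity rule for this sketch: an extent must cover >= 1 block.
    _start, length = extent
    return length > 0


def read_extent(dev, blockno, always_validate=False):
    """Return the extent in blockno, validating only on a cache miss
    unless always_validate is set."""
    if blockno in dev.cache:
        extent = dev.cache[blockno]
        if always_validate and not is_sane(extent):
            raise ValueError("corrupt extent in cached block %d" % blockno)
        return extent
    extent = dev.disk[blockno]
    if not is_sane(extent):
        raise ValueError("corrupt extent in block %d" % blockno)
    dev.cache[blockno] = extent
    return extent
```

In this model, reading a block, overwriting it with raw_write(), and reading it again returns the zero-length extent unchecked. Passing always_validate=True catches it, at the price of running the check on every access, which is the cost the comment is worried about.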

If so, I'm loath to turn on the validation check unconditionally, since that would kill performance. I can probably change the BUG_ON in ext4_ext_put_in_cache() to set the cache state to "invalid" instead, which would at least prevent the BUG_ON. The filesystem was probably well and truly trashed, though, so sooner or later the ext4 fs code would have hit something to make it very unhappy. Hopefully it would be an ext4_error() call to mark the file system as corrupted, as opposed to another BUG_ON.
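
The idea of invalidating the cache rather than asserting can be sketched in user space as follows. This is a Python model, not the actual kernel change: ExtentCache, CacheState, put and lookup are hypothetical names, and the real extent cache has more fields and locking than shown here.

```python
from dataclasses import dataclass
from enum import Enum


class CacheState(Enum):
    INVALID = 0   # nothing usable is cached
    EXTENT = 1    # a mapped extent is cached


@dataclass
class ExtentCache:
    state: CacheState = CacheState.INVALID
    block: int = 0    # first logical block covered
    length: int = 0   # number of blocks covered
    start: int = 0    # first physical block

    def put(self, block: int, length: int, start: int) -> bool:
        """Cache an extent; return False if the extent was rejected."""
        if length == 0:
            # A zero-length extent should only come from a corrupted tree.
            # Rather than asserting (the BUG_ON behaviour), drop the cache
            # and let the caller carry on or report the corruption.
            self.state = CacheState.INVALID
            return False
        self.state = CacheState.EXTENT
        self.block, self.length, self.start = block, length, start
        return True

    def lookup(self, block: int):
        """Return the physical block for a logical block, or None on a miss."""
        if self.state is not CacheState.EXTENT:
            return None
        if self.block <= block < self.block + self.length:
            return self.start + (block - self.block)
        return None
```

With this shape, a zero-length put() turns later lookups into cache misses instead of stopping the machine. As the comment notes, that only avoids this particular assertion; reporting the corruption would still have to happen elsewhere.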
Comment 4 Bart Van Assche 2010-10-23 15:27:06 UTC
(In reply to comment #3)
> How did you corrupt the file system while it was mounted? Was it via some
> dd to the disk device directly?

Indeed - the filesystem was corrupted while mounted by overwriting the entire contents with dd (dd if=/dev/zero of=/dev/sd... oflag=direct).
