Bug 14602
| Summary: | JBD2 journal abort / checkpoint creation racy? | | |
|---|---|---|---|
| Product: | File System | Reporter: | Andi Kleen (andi-bz) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | alan, tytso |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.32-rc6 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Andi Kleen
2009-11-14 12:05:13 UTC
If the journal is aborted, then any attempt to destroy the journal, checkpoint the journal, create a new transaction, etc., will result in -EIO being returned. In the code that you quoted, the EIO is being returned because is_journal_aborted(journal) returns true.

The normal reason why the journal gets aborted is that ext4_error() has called ext4_handle_error(), which then calls jbd2_journal_abort(), and the errors behavior is remount-read-only. The basic idea is that if some kind of file system error or corruption has been detected, we want to stop the file system from making any further modifications, which might cause more damage and/or user data loss.

We don't have a good errno value to use, so we just use EIO, and the error handling after an aborted journal is admittedly not great. It triggers a lot of error messages that are scary and, to someone who isn't an ext3/ext4 veteran, misleading, and unfortunately it can cause the key initial failure to scroll off of a VT console.

So it's not a race condition; the system is functioning as designed, although at some point we may put in better error handling and some earlier tests for an aborted journal in some of the upper layers of various ext4 functions.

Hmm, it still seems strange, because sometimes the error happens and more often it does not, even when triggering the same underlying inode IO error. I think something is at least inconsistent.

Well, an I/O error won't cause an ext4_error() unless the garbage returned is corrupted enough that it causes the ext4 file system code to decide to throw an ext4_error. So the fact that you sometimes get an aborted journal (caused by an ext4_error) isn't entirely surprising. The ext4_error is going to depend on whether or not the ext4 fs code thinks the file system is corrupted, which is going to be data dependent, and that might be variable after an I/O error. The real problem may be that we need to be doing a better job of doing error checking....
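The behavior described in this thread can be illustrated with a small user-space model. This is only a sketch, not kernel code: the `toy_*` names and the `struct toy_journal` type are hypothetical stand-ins for the real `journal_t`, `is_journal_aborted()`, and `jbd2_journal_abort()`, and the `metadata_looks_corrupt` flag is an assumed simplification of the data-dependent check that decides whether ext4_error() fires after an I/O error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model of a journal; NOT the kernel's journal_t. */
struct toy_journal {
    bool aborted; /* sticky flag: once set, it stays set */
};

/* Stand-in for is_journal_aborted(). */
static bool toy_is_journal_aborted(const struct toy_journal *j)
{
    return j->aborted;
}

/* Stand-in for jbd2_journal_abort(): mark the journal as unusable. */
static void toy_journal_abort(struct toy_journal *j)
{
    j->aborted = true;
}

/* Every journal operation checks the flag first and fails with -EIO. */
static int toy_start_transaction(struct toy_journal *j)
{
    if (toy_is_journal_aborted(j))
        return -EIO;
    return 0; /* a real implementation would set up a handle here */
}

static int toy_checkpoint(struct toy_journal *j)
{
    if (toy_is_journal_aborted(j))
        return -EIO;
    return 0; /* a real implementation would write back buffers here */
}

/*
 * Assumed simplification of the error path: an I/O error by itself does
 * not abort the journal. Only when the data that came back makes the
 * file system code conclude that it is corrupted does the abort happen.
 */
static void toy_handle_read(struct toy_journal *j, bool metadata_looks_corrupt)
{
    if (metadata_looks_corrupt)
        toy_journal_abort(j);
}
```

In this model, the same underlying read failure leads to -EIO from later journal operations only when `metadata_looks_corrupt` happens to be true, which matches the "sometimes it happens, more often not" observation without requiring a race.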