Bug 207729
| Summary: | Mounting EXT4 with data_err=abort does not abort journal on data block write failure | | |
|---|---|---|---|
| Product: | File System | Reporter: | Anthony Rebello (rebello.anthony) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | ASSIGNED | | |
| Severity: | normal | CC: | jack |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.3.2 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
- TGZ containing shell script and c code to reproduce the bug
- TGZ containing code to reproduce bug for APPEND workload
Description

Anthony Rebello 2020-05-13 08:15:42 UTC

Reply from Jan Kara:

Thanks for the report! It is actually the description of the data_err=abort mount option that is somewhat misleading. Let me explain a bit more: ext4 in data=ordered mode guarantees that the contents of newly allocated data blocks are written out during transaction commit, after which the changes that make these blocks visible become durable. In practice, whenever new blocks are allocated for a file, we write out the range of the file that covers all the newly allocated blocks, but that is just an implementation detail. data_err=abort controls the behavior when this particular writeout of data blocks fails. In your test there are no newly allocated blocks in the transaction, so data_err=abort does not apply.

To explain some of the rationale: such data writeback errors are indeed more serious, because if we just committed the transaction despite them, the newly allocated blocks could expose stale, potentially security-sensitive data from other files. That is why the option was introduced. But I agree that the documentation is misleading and that the semantics of the option are somewhat peculiar. I'll talk to the other ext4 developers about how we could improve the situation.
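To make Jan's distinction concrete, here is a minimal C sketch of an in-place overwrite of the kind the original test performed. It is an illustration under stated assumptions (a hypothetical mount at /mnt/ext4 with -o data=ordered,data_err=abort and a pre-existing file of at least one 4 KiB block), not the attached reproducer:

```c
/*
 * Minimal in-place overwrite sketch (illustrative only, not the
 * attached reproducer). Assumes an ext4 filesystem mounted at
 * /mnt/ext4 with -o data=ordered,data_err=abort and an existing
 * file of at least 4096 bytes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'A', sizeof(buf));

    int fd = open("/mnt/ext4/testfile", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Rewrite the first block in place. The block is already mapped,
     * so the committing transaction contains no newly allocated data
     * blocks and data_err=abort does not apply to a failure here. */
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
        perror("pwrite");
        return 1;
    }

    /* If the device fails the data write, the EIO surfaces here. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}
```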
Reply from Anthony Rebello:

Created attachment 290633 [details]
TGZ containing code to reproduce bug for APPEND workload

The previous attachment modified a file in place, which did not allocate any new blocks. This attachment contains a workload that allocates new blocks by appending to the file; a minimal sketch of that kind of workload appears below.
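A minimal C sketch of such an append workload, under the same hypothetical mount and file as the sketch above; the fill pattern and sizes are illustrative, not taken from the attachment:

```c
/*
 * Minimal append sketch (illustrative only; the attachment's code may
 * differ). Assumes the same hypothetical mount and file as above, with
 * the file currently ending on a block boundary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'B', sizeof(buf));

    int fd = open("/mnt/ext4/testfile", O_WRONLY | O_APPEND);
    if (fd < 0) { perror("open"); return 1; }

    /* Appending a full block past EOF forces allocation of a new data
     * block, so this write is covered by the data=ordered guarantee. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }

    /* The commit must write the new block's contents first; if that
     * write fails, data_err=abort should abort the journal. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}
```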
Hi Jan,

Thank you for your reply. I've changed the workload to perform appends rather than in-place updates. Appending sufficient data causes a new block to be allocated. On fsync, this newly allocated data block is written out, and in ordered mode the journal transaction contains the block bitmap (with the bit set for the newly allocated block) and the inode table that maps the file offset to this block. In this case, if data_err=abort is enabled, the journal should abort when the write of the newly allocated data block fails. However, that does not seem to happen. I've attached an updated TGZ containing the append workload.

After the failure, the bitmap has the block set and the inode table still points to the block, but the block has not been overwritten with the intended data: it still contains its old contents.

Reply from Jan Kara:

Thanks for the reproducer! Good spotting! This is indeed broken. The problem is that the write to the second file block happens and the data is written to the page cache. Then fsync(2) happens: it starts writeback of the second file block (allocating the block, extending the file size, and submitting the write of the second file block) and waits for this write to complete. Because the write fails with EIO, waiting for the write to complete returns EIO, which then bubbles up to userspace. But this also "consumes" the IO error, so the journalling layer, which commits the transaction later, does not know there was an IO error before and happily commits the transaction. As I've verified, this scenario indeed leads to the stale data exposure that the data_err=abort mount option is meant to prevent. I have to think about how to fix this properly...
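For illustration, a hedged read-back check matching Anthony's observation: after the failed append and fsync (and, say, a remount), the newly allocated block is reachable through the inode but was never overwritten, so it still holds its old on-disk contents. The path, offset, and fill pattern are assumptions carried over from the sketches above:

```c
/*
 * Read-back check sketch (illustrative; path, offset, and the 'B'
 * pattern are assumptions carried over from the sketches above).
 * Run after the failed append + fsync, e.g. after remounting.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    int fd = open("/mnt/ext4/testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* The appended data should occupy the second 4 KiB block. */
    if (pread(fd, buf, sizeof(buf), 4096) != (ssize_t)sizeof(buf)) {
        perror("pread");
        return 1;
    }

    /* If the transaction committed despite the failed data write, the
     * block is reachable through the inode but holds stale contents. */
    int stale = 0;
    for (size_t i = 0; i < sizeof(buf); i++) {
        if (buf[i] != 'B') {
            stale = 1;
            break;
        }
    }
    printf("%s\n", stale ? "stale data exposed" : "expected data present");

    close(fd);
    return 0;
}
```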