Bug 200753

Summary: write I/O error for inode structure leads to operation failure without any warning or error
Product: File System
Component: ext4
Hardware: All
OS: Linux
Status: NEW
Severity: normal
Priority: P1
Kernel Version: 4.17.0-rc3+
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Shehbaz (shehbazjaffer007)
Assignee: fs_ext4 (fs_ext4)
CC: Martin, sandeen, tytso
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=200739
Attachments: printAllBlocks.sh script that generates block trace of all blocks accessed
injectError.sh script used to selectively inject an I/O error for a specific requested block.
cmd2000 file, the workload file that is called by printAllBlocks.sh script
device mapper code
Makefile to build device mapper code and utimes c file
utimes code to update access and modification times of file
strace output
source programs ftruncate, fchmod, fchown creat

Description Shehbaz 2018-08-07 00:47:15 UTC
Created attachment 277725 [details]
printAllBlocks.sh script that generates block trace of all blocks accessed

Hello!

Background
==========

I am performing block-level write I/O error injection on the ext4 file system using device mapper. I try the following commands: truncate, chmod, chown, utimes, symlink, creat. Each of these commands updates the inode structure by writing to the inode data structure on one of the inode blocks.

Observation
===========
If the underlying device returns an EIO error, the operation does not give any warning or error, either in user space or in the logs; however, the operation fails to complete.

Steps to Reproduce
==================

Create 2 identical QEMU disks, each of 1.5GB. In my case these are /dev/sdc (backup disk) and /dev/sdb (primary disk). I create the image on the backup disk and record the block trace there. I then copy the image from backup to primary (using the pv command) and replay the command while injecting a block error on a specific inode block.

1. Please create 2 QEMU images: /dev/sdb (primary image) and /dev/sdc (backup image).
2. Please install the pv command and compile dm-io.c with the attached Makefile. Please download all attached files into 1 folder, as some commands are helper commands.
3. Please run ./printAllBlocks.sh <cmd>, where cmd is truncate, chmod, chown, utimes, symlink, or creat. This prints a list of blocks with READ/WRITE operations. Choose the inode block WRITE operation that is either the 3rd or 4th last block written to disk. The block type can be verified using the dumpe2fs or debugfs command. In my disk image
4. Now, with the target inode block number, run ./injectError.sh <cmd> <errBlkNo>, where cmd is one of the 6 commands mentioned above, and errBlkNo is the inode block obtained in step 3.

5. analyse results:

truncate - fails to truncate the file size to 40960 bytes. The original size of the file remains unchanged; no error is reported in syslog, fsck, or user space.

chmod - fails to change the mode of the file to 777. The original mode of the file remains (644 in my case); no error is reported in syslog, fsck, or user space.

chown - fails to change the ownership of the file to shehbaz:shehbaz. The original ownership of the file remains (root:root in my case); no error is reported in syslog, fsck, or user space.

utimes - fails to update the last access and modification times to 1979.

symlink - while corrupting the target inode, no error is reported (this is okay for the source path, but here corrupting the target inode leads to an unreported error). A subsequent ls on the created target says the file system structure needs cleaning.

creat - fails when the parent inode block update encounters a write I/O error. No error is reported in user space or kernel space.

Please find all programs attached for your reference.
Please let me know if any other details regarding the setup are required to reproduce the issue.

Thanks!
Comment 1 Shehbaz 2018-08-07 00:47:43 UTC
Created attachment 277727 [details]
injectError.sh script used to selectively inject an I/O error for a specific requested block.
Comment 2 Shehbaz 2018-08-07 00:48:05 UTC
Created attachment 277729 [details]
cmd2000 file, the workload file that is called by printAllBlocks.sh script
Comment 3 Shehbaz 2018-08-07 00:48:38 UTC
Created attachment 277731 [details]
device mapper code
Comment 4 Shehbaz 2018-08-07 00:49:06 UTC
Created attachment 277733 [details]
Makefile to build device mapper code and utimes c file
Comment 5 Shehbaz 2018-08-07 00:49:41 UTC
Created attachment 277735 [details]
utimes code to update access and modification times of file
Comment 6 Eric Sandeen 2018-08-07 00:52:29 UTC
A couple of things up front: an IO error should not in general lead to an fsck error, and it doesn't require a kernel message.  So "no error reported in fsck or kernelspace" isn't a bug.

However, failed syscalls should return errors to the user.  Are you saying that they do not do so?

As this appears to be a complex test to set up, can you perhaps include the strace of one or more of these failing operations on your prepared device?
Comment 7 Theodore Tso 2018-08-07 01:48:10 UTC
Shehbaz,

Before you file more bugs, perhaps you should describe what your model is for how a file system should behave.  You seem to have some very confused ideas about how a file system should behave in the face of I/O errors (e.g., thinking that a symlink to an inode which is in a broken inode table block MUST be a file system error).

Why don't you write up a document that describes how you think the world works, because it seems like a waste of your time to create these very careful bug reports with reproductions when you seem to be operating on a fundamentally flawed model of how a file system behaves, and how it should behave in the face of errors.
Comment 8 Shehbaz 2018-08-07 02:44:15 UTC
Hello Eric and Theodore,

I apologize for filing bug reports before describing what I am trying to do. The purpose of my study is to determine how block-level read and write I/O errors are handled by different file systems. In order to simulate this, I created a device mapper target that intercepts all blocks that are read and written by a file system. I use a combination of dumpe2fs, code documentation and debugfs to determine different data structures on the file system.

For each run of a command, multiple blocks are accessed from the device. I systematically return an I/O error from the device mapper instead of the actual block, one block at a time, and check how the file system either detects or recovers from this error. Specifically, for inode read operations (e.g. access, chdir), an I/O error during the read of an inode leads to an error message in user space and an error message in kernel space. Also, the read operation fails. For an error during a write of an inode (which mostly happens when I unmount the device, which is when all blocks are flushed to disk), I observe that if I return an EIO error, this is not acknowledged by the file system in any way. I do not see a log message that the EIO was detected by the file system while flushing blocks to disk. e2fsck also does not report that a certain inode write operation did not successfully get flushed to disk. The updated values in the inode are not reflected once I remount the file system, and I see the older file system image.

I would like to ask whether such failed operations, due to the failure of the underlying device, should be detected and reported at the file system level. I think that since the file system is responsible for persisting writes to disk, if the disk returns an I/O error during a write operation, this should be reported to the user or at least logged as a warning or error message notifying the user that the operation did not succeed, since the disk never recorded the updated write.

Once again, I apologize for submitting the bug reports without validating them. I would like to understand your view of this problem and if this form of error should be handled at the file system layer.

Thank you,
Comment 9 Theodore Tso 2018-08-07 03:33:48 UTC
For writes, userspace must call fsync(2) and check the error returns from fsync(2), write(2), and close(2), if it wants to be sure of catching the error report.   (For some remote file systems, such as AFS, the error reporting for a quota overflow happens on close, since it's only on the close that the file is sent to the server.)

If you don't call fsync(2), the data blocks (or changes to the inode) may not even be attempted to be written to disk before the userspace program exits, so there is no guarantee that there would be any opportunity for the system to even *notice* that there is a problem.
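(As an aside, a minimal, hypothetical sketch of the pattern being described -- checking the return values of write(2), fsync(2), and close(2); the path and data below are made up and are not from the attached programs:)

/* Minimal sketch (not from the bug report): a buffered write whose failure
 * only reliably surfaces to userspace if the write(2), fsync(2) and close(2)
 * return values are checked.  The path and contents are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "hello";
	int fd = open("/mnt/test/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) { perror("open"); return 1; }
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");   /* may succeed: data is only buffered here */
	if (fsync(fd) < 0)
		perror("fsync");   /* a device EIO should surface at this point */
	if (close(fd) < 0)
		perror("close");
	return 0;
}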

Also, as far as metadata blocks (such as inode table blocks) are concerned, what's generally important is whether they are successfully written to the journal.   That's because in real life there are two cases where we have errors the *vast* majority of the time.  (a) The device has disappeared on us, because it's been unplugged from the computer or the last fibre channel connection between the computer and the disk has been lost, etc.   (b)  There is a media error.

For (a) so long as the writes have made it to the journal, that's what is important.  If the disk has disappeared, then when it comes back, we will replay the journal, and the inode table updates will be written.

For (b), in general with modern storage devices, there is a bad block replacement pool, and writes will use a newly allocated block from the bad block sparing pool if there is a recording error, and this is transparent to the host software.

How you are modelling errors by using a device-mapper target to force-fail certain blocks permanently might reflect how disks behaved in the early 1980's on PC's (e.g., pre-IDE and pre-SATA), but doesn't reflect how storage devices behave today.

One could argue that ext4 should do the right thing even when using hardware which is 35+ years old.  The problem is, for example, if we forced the disk to actually try to persist writes after each inode update in fsck, we would destroy performance.   You can try to simulate this by hacking e2fsck to force the use of O_DIRECT reads and writes (which eliminate buffering, so each read and write call results in a synchronous I/O request to the device).   You will find that the results are not pretty.  Hence, trading off a performance disaster to satisfy some academic who is writing a paper about whether or not file systems handle artificial I/O error injections that do not comport with reality is really not something I'm particularly interested in.....
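(For illustration only -- this is not the e2fsck change being alluded to, just a rough userspace sketch of what forcing O_DIRECT looks like: every read/write becomes a synchronous, aligned request to the device, which is why it is so slow. The device path is a placeholder.)

/* Hypothetical sketch: O_DIRECT bypasses the page cache, so each pread/pwrite
 * below is a synchronous request to the device and must use an aligned buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	void *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	/* O_DIRECT buffers must be suitably aligned; 4096 is a safe choice. */
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}

	if (pread(fd, buf, 4096, 0) < 0)      /* synchronous device read */
		perror("pread");
	if (pwrite(fd, buf, 4096, 0) < 0)     /* synchronous device write */
		perror("pwrite");

	free(buf);
	close(fd);
	return 0;
}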
Comment 10 Eric Sandeen 2018-08-07 03:36:25 UTC
(In reply to Shehbaz from comment #8)

<ted beat me to the reply but I'll send this anyway>

> For error during a write of an inode (which mostly happens
> when I unmount the device, which is when all blocks are flushed to disk), I
> observe that if I return an EIO error, this is not acknowledged by the file
> system in any way. I do not see a log message that the EIO was detected by
> the file system while flushing blocks to disk. e2fsck also does not report
> that a certain inode write operation did not successfully get flushed to
> disk. The updated values in the inode do not get reflected once I remount
> the file system and I see the older file system image.

So, a few things.  It might be worth reading https://lwn.net/Articles/457667/ for data persistence guarantees and best practices.  For some operations, failures will simply not be seen by userspace unless data persistence calls are made.  For example, a buffered write failure won't be seen by the application unless it, e.g., calls fsync after the write.  I think some of your "failures to report to userspace" may fall into this class.

And again, e2fsck is not expected to see any error.  The whole point of a journaling filesystem like ext4 is to ensure that the metadata stays consistent even in the face of a failure.  e2fsck not seeing any corruption after EIO is a /good/ thing.

Now, I probably would expect to see some errors in dmesg if for example inode flushing fails at unmount time, though.  It's not strictly a bug to not log an error at the filesystem level, but it's probably desirable.  Ted can probably speak to this better than I can.

> I would like to ask if such failed operations, due to the failure of the
> underlying device should be detected and reported at the file system level?
> I think since the file system would be responsible for persisting writes to
> disk, if the disk returns back an I/O error during a write operation, this
> should be reported to either the user

The filesystem is responsible for maintaining metadata consistency (because it is a journaling filesystem), and for honoring the syscalls it receives.  But to be pedantic about it, the filesystem is only responsible for persisting writes if you explicitly tell it to do so via syscalls such as fsync.  For example, from write(2):

> A successful return from write() does not make any guarantee that data has
> been committed to disk.

It looks like none of your operations make any data persistence calls, so you're not /asking/ for a return of success or failure of persistence.

-Eric
Comment 11 Theodore Tso 2018-08-07 13:44:58 UTC
On Tue, Aug 07, 2018 at 03:36:25AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Now, I probably would expect to see some errors in dmesg if for example inode
> flushing fails at unmount time, though.  It's not strictly a bug to not log
> an
> error at the filesystem level, but it's probably desirable.  Ted can probably
> speak to this better than I can.

In practice we do, but it's coming from lower levels of the storage
stack.  We'll of course log errors writing into the journal, but once
the metadata updates are logged to the journal, they get written back
by the buffer cache writeback functions.  Immediately before the
unmount they will be flushed out by the jbd2 layer in
fs/jbd2/checkpoint.c, using fs/buffer.c's write_dirty_buffer() with a
REQ_SYNC flag, and I/O errors get logged via buffer_io_error().

In practice, the media write errors are logged by the device drivers,
so the users have some idea that Bad Stuff is happening --- assuming
they even look at the dmesg layer at all, of course.

One could imagine an enhancement to the file system to teach it to not
use the generic buffer cache writeback functions, and instead submit
I/O requests with custom callback functions for metadata I/O so in
case of an error, there would be an ext4-level error that would
explain that writeback to inode table block XXX, affecting inodes
YYY-ZZZ failed.  And if someone submitted such a patch, I'd consider
it for inclusion, assuming the patch was clean, correct, and didn't
introduce long-term maintenance burdens.
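(For concreteness, a rough and entirely hypothetical sketch of the kind of custom metadata write-completion callback being described -- this is not existing ext4 code, just an illustration of where such an ext4-level message could be emitted:)

/* Hypothetical sketch, not existing ext4 code: a buffer_head completion
 * callback that logs an ext4-level message when writeback of a metadata
 * block (e.g. an inode table block) fails, then falls back to the stock
 * synchronous-write completion handler. */
#include <linux/buffer_head.h>
#include <linux/printk.h>

static void ext4_meta_write_end_io(struct buffer_head *bh, int uptodate)
{
	if (!uptodate)
		pr_err("EXT4-fs: writeback of metadata block %llu failed\n",
		       (unsigned long long)bh->b_blocknr);

	/* Delegate to the generic completion handling. */
	end_buffer_write_sync(bh, uptodate);
}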

However, at $WORK, we have a custom set of changes so that all file
system errors as well as media errors from the SCSI/SATA layer get
sent to a userspace daemon via netlink, and as necessary, the disk
will be automatically sent to a repair workflow.  The repair workflow
would then tell the cluster file system to stop using that disk, and
then confirm that the bad block redirection pool was able to kick in
correctly, or flag the drive to be sent to a hardware operations
team to replace the disk drive, etc.

(The main reason why we haven't sent it upstream is that the patch as
it stands today is a bit of an ugly kludge, and would have to be
rewritten to be a better structured kernel->userspace error reporting
mechanism --- either for the storage stack in general, or for the
whole kernel.  Alas, no one has had the time or energy to deal with
the almost certain bike-shedding party that would ensue after
proposing such a new feature.  :-)

So I don't have much motivation to fix up something to log explanatory
error messages from the file system level, when the device driver
errors in dmesg are in practice quite sufficient for most users.  Even
without a custom netlink patch, you can just scrape dmesg.  An example
of such a userspace approach can be found here:

	https://github.com/kubernetes/node-problem-detector

In practice, most practical systems don't need to know exactly what
file system metadata I/O block had problems.  We just need to know
when the disk drive has started developing errors, and when it has,
whether we can restore the disk drive to be a 100% functional storage
device, so it can be safely and sanely used by the file system layer
--- or whether it's time to replace it with a working drive.

I do agree with Eric that it would be a nice-to-have if ext4 were to
log messages if an inode table block or an allocation bitmap block ran
into errors when being flushed out, at unmount or by the kernel's
writeback threads.  But it's really only a nice-to-have.

Patches gratefully accepted....
Comment 12 Martin Steigerwald 2018-08-07 14:52:42 UTC
Interesting. That may explain why doing a "blkdiscard" on the partition of an Ext4 filesystem mounted as / on an SSD does not seem to show anything in "dmesg -w" started before the "blkdiscard" command. I was always quite a bit puzzled about this. This happened with a running Plasma desktop, which attempts to write things quite regularly. When accessing a command not yet in the cache I do receive "error in format of binary file" (roughly translated from German). But still nothing in the kernel log ring buffer. This is on some Fujitsu workstation with Intel CPU + chipset. I assume the kernel uses libata AHCI for accessing it. No access to such a workstation at the moment.

I made a test creating a 1 exbibyte XFS filesystem on a large sparse file on a much smaller XFS filesystem, and writing to the 1 exbibyte filesystem beyond the free space left in the underlying XFS gave error messages like "lost buffer write" in the kernel log. I never tried the "blkdiscard" thing with XFS so far, though.
Comment 13 Eric Sandeen 2018-08-07 15:02:10 UTC
To some of Ted's point, I agree and was going to make a similar observation - a real disk error would have led to /some/ error message from somewhere lower in the stack, whether from the scsi layer, lvm, or whatever.  If the custom dm target here silently returns an error, that's different behavior than you'd typically see upstream.

It leaves the open question of "should ext4 printk something as well", but in the real world it'd be much more surprising to see a completely silent EIO from disk.
Comment 14 Shehbaz 2018-08-07 18:15:30 UTC
Created attachment 277743 [details]
strace output
Comment 15 Shehbaz 2018-08-07 18:16:19 UTC
Thanks Theodore and Eric,

> If you don't call fsync(2), the data blocks (or changes to the inode) may not
> be attempted to be written to disk before the userspace program exits so
> there is no guarantee that would be any opportunity for the system even
> *notice* that there is a problem.

I have changed my programs to call ftruncate(2), chmod(2), chown(2), creat(2), and open(2), fsync(2), close(2). For each of the programs, I still notice that fsync(2) and close(2) return success (strace output attached), even after returning an I/O error to the file system.
Since symlink(2) and utimes(2) work on paths rather than fds, I did not experiment with these.
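(For clarity, the reworked truncate test now has roughly this shape -- a sketch rather than the exact attached source; the path is hypothetical and 40960 is the size from the reproduction steps above:)

/* Sketch of the reworked truncate test: every persistence-related call is
 * checked.  The file path is hypothetical; 40960 is the size used in the
 * reproduction steps above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/foo", O_WRONLY);

	if (fd < 0) { perror("open"); return 1; }
	if (ftruncate(fd, 40960) < 0)
		perror("ftruncate");
	if (fsync(fd) < 0)
		perror("fsync");   /* reported to return success despite the injected EIO */
	if (close(fd) < 0)
		perror("close");
	return 0;
}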

> Also, as far as metadata blocks (such as inode table blocks), what's
> generally important is whether they are successfully written to the journal.

I agree. I observe the following: once the writes are done to the journal, there is a follow-up write to the in-place location of the inode. When this inode write fails, no error is reported. I assume that the journal also does not bother, since the write has already been made in-place on the file system. This is why, even though the write was done to the journal, the updated inode image is not seen after remount.

Thank you for explaining other scenarios and real world use-cases, I would take these into consideration for extending this work.

> For example a buffered write failure won't be seen by the application unless
> it i.e. calls fsync on the write.

I thought unmounting the file system would automatically ensure that any leftover blocks in the buffer cache are flushed to disk. However, after your feedback I make sure fsync(2), close(2) and open(2) are done at the right locations in the userspace program.

> Now, I probably would expect to see some errors in dmesg if for example inode
> flushing fails at unmount time, though.

I agree.

> It looks like none of your operations make any data persistence calls, so
> you're not /asking/ for a return of success or failure of persistence.

In my current truncate.c, creat.c, chmod.c and chown.c, I ask for data persistence. I still do not see any observable user-space or kernel-space warnings or errors.

> We'll of course log errors writing into the journal

I have some different observations regarding this: I observe that an EIO during journal write I/O also does not log any errors in kernel space. This does not cause the operation to fail (unlike failure of inode blocks), because I observe that after the write has been done to the journal, the writes are also done to the in-place location of the inode. So, if I fail a journal block write for the truncate operation and remount the file system, I see the truncated file, as the in-place write happened successfully. I am sorry I am unable to determine which data structure - journal superblock, journal descriptor, journal commit or journal data - lies on the failed block, but I can get back to you regarding this in a separate report.

Also, thank you for providing background on real disk behaviour. My only concern is that if disk vendors only reply with EIO and depend on upper layers to make sense of it, and file systems rely on disk vendors to log errors, we might end up with silent corruptions. Two log messages would be OK; no log messages would be bad.

Please bear with me while I file my findings and I will follow up with patch recommendations.

Thank you,
Comment 16 Shehbaz 2018-08-07 18:17:14 UTC
Created attachment 277745 [details]
source programs ftruncate, fchmod, fchown creat
Comment 17 Theodore Tso 2018-08-07 20:42:18 UTC
I see your problem.   Your dm-io.c is broken:

			printk(KERN_INFO "WRITE block = %lu ",bio->bi_iter.bi_sector / 8);
			bno = bio->bi_iter.bi_sector / 8;	
			if( mdt->error && bno == mdt->errorBlockNumber && mdt->mode != 'R') {
				printk("WRITE ERROR!!\n");
				printk(KERN_INFO "%s():ERROR ON block = %lu\n",__func__,(bio->bi_iter.bi_sector / 8) - 1);
				bio_endio(bio);
				// bio_io_error(bio);
				return DM_MAPIO_SUBMITTED;

It wasn't actually returning an error to the file system!
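(Presumably the fix is to take the path that is commented out above -- a sketch of what the corrected branch looks like:)

/* Sketch of the corrected error-injection branch: complete the bio with an
 * I/O error instead of completing it successfully (bio_io_error() marks the
 * bio as failed and then ends it). */
if (mdt->error && bno == mdt->errorBlockNumber && mdt->mode != 'R') {
	printk(KERN_INFO "%s(): ERROR ON block = %lu\n", __func__, bno);
	bio_io_error(bio);	/* was: bio_endio(bio) */
	return DM_MAPIO_SUBMITTED;
}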

Once I fixed your dm-io, I confirmed that errors to the journal do in fact cause a jbd2 error to be logged in dmesg:

[18995.919032] WRITE ERROR!!
[18995.919039] io_map():ERROR ON block = 196609
[18995.919251] Aborting journal on device dm-18-8.
[18995.919278] WRITE block = 196608 
[18998.747065] READ block = 204 
[18998.747258] WRITE block = 0 
[18998.748982] EXT4-fs error (device dm-18): ext4_journal_check_start:61: Detected aborted journal
[18998.749199] EXT4-fs (dm-18): Remounting filesystem read-only

Given your assertion that write errors to the journal don't cause the kernel to complain, and the fact that I have just demonstrated that errors to the journal *do* cause the kernel to complain, the reliability of your other observations is seriously called into question.   Combined with your fundamental misunderstandings of how file systems work (including how buffered writes work, and how symlinks work), I'm going to gently suggest that you spend some time gathering a much deeper understanding of how kernel storage works before you waste more kernel developers' time....