Bug 200739 - I/O error on read-ahead inode blocks does not get detected or reported
Summary: I/O error on read-ahead inode blocks does not get detected or reported
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-05 22:31 UTC by Shehbaz
Modified: 2018-08-07 03:27 UTC
CC List: 1 user

See Also:
Kernel Version: 4.17.0-rc3+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
device mapper code (5.86 KB, text/x-csrc)
2018-08-05 22:31 UTC, Shehbaz
Details
Makefile to build device mapper code (331 bytes, text/plain)
2018-08-05 22:31 UTC, Shehbaz
Details
printAllBlocks.sh script that generates block trace of all blocks accessed (1.45 KB, application/x-shellscript)
2018-08-05 22:32 UTC, Shehbaz
Details
cmd2000 file, the workload file that is called by printAllBlocks.sh script (416.62 KB, text/plain)
2018-08-05 22:32 UTC, Shehbaz
Details
injectError.sh script used to selectively inject an I/O error for a specific requested block. (742 bytes, application/x-shellscript)
2018-08-05 22:33 UTC, Shehbaz
Details
sample output for runError.sh that invokes stat command for one of the files and injects error on a block (3.82 KB, text/plain)
2018-08-05 22:34 UTC, Shehbaz
Details

Description Shehbaz 2018-08-05 22:31:02 UTC
Created attachment 277693 [details]
device mapper code

Hello!

Background
==========

I am checking how ext4 behaves in response to block-level I/O error injection. I observe that ext4 reads multiple other inode blocks during an inode block access. These are read-ahead inode blocks, and their count can be tuned using the inode_readahead_blks parameter. For my experiments, I set this parameter to 2, so 2 additional inode blocks were read for each primary inode block access.
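For reference, the read-ahead window used in these experiments can be set either at mount time or at runtime through sysfs (a sketch; the device name /dev/vdb and mount point are placeholders for my test setup):

```shell
# Set at mount time (value is in blocks; 0 disables inode table read-ahead)
mount -t ext4 -o inode_readahead_blks=2 /dev/vdb /mnt

# Or adjust at runtime for an already-mounted filesystem; the sysfs
# directory is named after the backing block device
echo 2 > /sys/fs/ext4/vdb/inode_readahead_blks
```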

Issue
=====
When I inject an I/O error on the first inode block being accessed, I see an I/O error message in both user space - for example:

stat: cannot stat '/mnt/DZAORDJFHRPGQGRPG7VBLAWNXR61S/BFIE2JB9Q353PQBK/OT8YSP69Q9NI93P8J6UK1WK0DFWSNQ599FHC5DIA0BWW7YIDC6W/U/FULLFILE': Input/output error

and the kernel space, for example:

EXT4-fs error (device dm-1): __ext4_get_inode_loc:4626: inode #568: block 156: comm stat: unable to read itable block

However, if the I/O error takes place on any of the subsequent read-ahead inode blocks, I observe no such error.

Steps to Reproduce
==================

1. Create a secondary ext2 file system with about 1.5GB of space, preferably on QEMU.
2. Compile and run dm-io.c using the Makefile attached. This is a device mapper target that prints, and optionally injects an I/O error for, a particular block.
3. Run printAllBlocks.sh, which generates 2000 nested files and directories. It then mounts ext4 as a dm-io target and runs a "stat" command on one of the files. This also prints all blocks that were accessed in dmesg. We look for 3-4 consecutive blocks being read.

For example, on my setup, I get the following blocks:

140 READ
142 READ
141 READ
....
144 READ
146 READ
145 READ
....
157 READ
158 READ
156 READ

Here, blocks 141, 144, and 156 are primary inode blocks, and the adjacent succeeding blocks are read-ahead inode blocks.

4. Run injectError.sh [errBlkNo], which reruns the stat command, but this time injects an I/O error when block number errBlkNo is returned to the file system. I set errBlkNo to each of the inode blocks read in turn. For each primary inode block, I get an error; for the read-ahead blocks, I get no error in either user space or kernel space, even though one should be reported. Hence, I believe this is a bug.
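For comparison, a similar single-block read failure can be sketched with the stock dm-error target instead of the custom dm-io module, by splicing an error segment into a linear mapping (a sketch; the device name, block number, and block size below are assumptions matching my setup, not part of the attached scripts):

```shell
# Hypothetical backing device and geometry; adjust to your setup.
DEV=/dev/vdb
BLK=157            # filesystem block to fail (4 KiB blocks)
SEC_PER_BLK=8      # 4096-byte block / 512-byte sectors
TOTAL=$(blockdev --getsz "$DEV")   # device size in 512-byte sectors
START=$((BLK * SEC_PER_BLK))

# Map the device linearly, except the sectors of the chosen block,
# which are mapped to the error target and will fail every I/O.
dmsetup create errdev <<EOF
0 $START linear $DEV 0
$START $SEC_PER_BLK error
$((START + SEC_PER_BLK)) $((TOTAL - START - SEC_PER_BLK)) linear $DEV $((START + SEC_PER_BLK))
EOF
```

Unlike the custom target, dm-error fails writes to that range as well, so it is only a rough stand-in for read-specific injection.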

Please find all related scripts attached for reference.

Thanks!
Comment 1 Shehbaz 2018-08-05 22:31:20 UTC
Created attachment 277695 [details]
Makefile to build device mapper code
Comment 2 Shehbaz 2018-08-05 22:32:16 UTC
Created attachment 277697 [details]
printAllBlocks.sh script that generates block trace of all blocks accessed
Comment 3 Shehbaz 2018-08-05 22:32:59 UTC
Created attachment 277699 [details]
cmd2000 file, the workload file that is called by printAllBlocks.sh script
Comment 4 Shehbaz 2018-08-05 22:33:57 UTC
Created attachment 277701 [details]
injectError.sh script used to selectively inject an I/O error for a specific requested block.
Comment 5 Shehbaz 2018-08-05 22:34:33 UTC
Created attachment 277703 [details]
sample output for runError.sh that invokes stat command for one of the files and injects error on a block
Comment 6 Theodore Tso 2018-08-07 01:42:44 UTC
Does this actually cause a user-visible problem?   If we do readahead for an inode table block that never gets used by the user (perhaps because no inodes have been written using that inode table block), why should we mark the file system as corrupted?

Especially given that with modern block devices, when we *do* write to the inode table block, it will probably redirect the failed sector to a spare block replacement pool automatically, at which point subsequent reads to that inode table block will be *fine*.

So prematurely deciding that a speculative readahead access to a sector returning a media error is grounds to declare the file system corrupted (which could force a reboot if errors=panic is set) seems to be a massive overreaction.

Why do you think we should signal an error in this case?
Comment 7 Shehbaz 2018-08-07 03:27:06 UTC
Hello Theodore,

Thank you for your reply to this bug and other bugs.

> Does this actually cause a user-visible problem?   If we do readahead for an
> inode table block that never gets used by the user (perhaps because no inodes
> have been written using that inode table block), why should we mark the file
> system as corrupted?
I agree this does not cause a user-visible problem. I think we should at least warn the user about disk corruption, because we did not receive the block that we requested from the disk. The purpose of read-ahead is to read the blocks ahead of the requested block up to a certain limit (2 blocks in my experiment). If the blocks do not exist, then returning 1 or 0 blocks is correct. If the blocks exist but could not be read because of a media error, I believe this should be reported to the user.
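One way to separate the two cases is to read the suspected read-ahead block directly from the device, bypassing the page cache: if dd reports an I/O error while ext4 stays silent, the failure is being swallowed in the read-ahead path rather than being absent from the medium (a sketch; the device name and block number are placeholders from my setup):

```shell
# Read one 4 KiB filesystem block (block 157 here) straight from the
# block device with O_DIRECT, so a media error surfaces as a dd failure.
dd if=/dev/dm-1 of=/dev/null bs=4096 skip=157 count=1 iflag=direct
```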

> Especially given that with modern block devices, when we *do* write to the
> inode table block, it will probably redirect the failed sector to a spare
> block replacement pool automatically, at which point subsequent reads to that
> inode table block will be *fine*.
I agree that in the case of writes, the newly written block would get redirected to a healthy sector. However, if it is a read-only workload, proactive detection of a read I/O error should be acted on promptly. For btrfs on HDDs, I see the btrfs scrub mechanism being invoked as soon as any form of corruption or I/O error is detected during a read operation; it replaces the bad metadata block with a duplicate copy. For ext4, I do not see any warning.

> So prematurely deciding that a speculative readahead access to a sector
> returning a media error is grounds to declare the file system corrupted
> (which could force a reboot if errors=panic is set) seems to be a massive
> overreaction.
> Why do you think we should signal an error in this case?
I am unsure if we should reboot due to a read-ahead failure, since the current operation was not affected by the failed read-ahead block. However, either a warning message or an e2fsck run recommendation should be provided (e.g. "structure needs cleaning"), so that the user knows the media is not working correctly, as the file system could not read the data it intended to read (2 read-ahead blocks in this case).
