Bug 200871 - F2FS experiences data loss (entry is completely lost) when an I/O failure occurs.
Status: ASSIGNED
Alias: None
Product: File System
Classification: Unclassified
Component: f2fs
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Default virtual assignee for f2fs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-20 21:49 UTC by Stathis Maneas
Modified: 2018-10-16 01:48 UTC
CC List: 2 users

See Also:
Kernel Version: 4.18.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Device Mapper Module (Injector) (5.35 KB, text/x-csrc)
2018-09-05 21:22 UTC, Stathis Maneas
Device Mapper Module (Injector) - Header File (917 bytes, text/plain)
2018-09-05 21:23 UTC, Stathis Maneas
Device Mapper Module - Makefile (351 bytes, text/plain)
2018-09-05 21:23 UTC, Stathis Maneas
Simple C file that updates the permissions of a file. (877 bytes, text/x-csrc)
2018-09-05 21:42 UTC, Stathis Maneas

Description Stathis Maneas 2018-08-20 21:49:18 UTC
Dear all,

I would like to describe a particular workflow that eventually leads to data loss. Specifically, by starting with a valid F2FS image, we end up losing an entry when an I/O failure occurs.

To begin with, the F2FS file system is deployed on the following device:
$ fdisk -l
Disk /dev/vdb: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

The C source file that contains the executed workload is the following:
$ cat run_chmod.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

#define FILE_MODE_ALL    (S_IRWXU | S_IRWXG | S_IRWXO)

static int __exec_fsync(char *entry_name) {
	int fd, ret = 0;

	fd = open(entry_name, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return fd;
	}

	ret = fsync(fd);
	if (ret < 0)
		perror("fsync");

	if (close(fd))
		perror("close");

	return ret;
}

int main(int argc, char *argv[]) {
	int ret, errno_value = 0;

	/* Check that the number of arguments is sufficient. */
	if(argc != 2) {
		fprintf(stderr, "Usage: %s <path-to-entry>\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	/* Update the mode of the given entry; on success, persist the change with fsync. */
	if ((ret = chmod(argv[1], FILE_MODE_ALL)) < 0) {
		errno_value = errno;
		perror("chmod");
	} else
		__exec_fsync(argv[1]);

	printf("The operation \"chmod\" returned: %d (errno: %d)\n", ret, errno_value);
	exit(EXIT_SUCCESS);
}
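
(For reference, the program is built in the usual way, e.g. gcc -Wall -o run_chmod run_chmod.c, and takes the target entry as its only argument, as shown in the workload below.)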

The steps to initialize the file system are the following:
$ dd if=/dev/zero of=/dev/vdb bs=4096
$ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ dd if=/dev/urandom of=inline_file bs=3400 count=1
$ cd; umount /mnt/f2fs
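
(Comment: the 3400-byte size is presumably chosen so that the file's data fits entirely inline within the inode block, which is why the test file is named inline_file.)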

The actual workload is the following:
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ./run_chmod inline_file
(Comment: During the execution of this command, we fail the write operation for the newly updated inode (BlockID: 5633). To fail the write, we call bio_io_error() on the corresponding bio, which completes it with an error and returns EIO to the upper layers.)

The operation "chmod" returned: 0 (errno: 0)
$ cd; umount /mnt/f2fs

As we observe, the command did not return an error and dmesg is clean.

At this point, if we re-mount the file system, the file cannot be accessed since its inode was not persisted on disk:
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
ls: cannot access inline_file: Invalid argument
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
-????????? ? ?    ?       ?            ? inline_file
$ dmesg
...
[76801.387616] F2FS-fs (vdb): inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]
[76801.387620] F2FS-fs (vdb): bad inode I/O: 4
...
$ cd; umount /mnt/f2fs

After invoking fsck.f2fs and answering 'yes' to all questions, we end up with a consistent file system image, but now the entry has disappeared:
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
$ cd; umount /mnt/f2fs

Furthermore, fsck.f2fs does not create an entry inside the lost_found directory; thus, the entry is completely lost.

Last but not least, the issue exists even when the fsync_mode=strict flag is provided during the mount operation.

Kind regards,
Stathis Maneas
Comment 1 Chao Yu 2018-08-26 09:09:48 UTC
I think there are two problems:

1. You injected an I/O error and returned it to the upper layer, but fsync() didn't fail.

To reproduce this:
a) I added the code below to f2fs_write_end_io(), and used fault injection to simulate an error at the block layer:

	if (time_to_inject(F2FS_P_SB(bio_first_page_all(bio)), FAULT_IO)) {
		f2fs_show_injection_info(FAULT_IO);
		bio->bi_status = BLK_STS_IOERR;
	}

b) xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
fsync: Input/output error

This is because, in f2fs_write_end_io(), if the error is injected, -EIO is set on the node inode's page mapping, as below:
		if (unlikely(bio->bi_status)) {
			mapping_set_error(page->mapping, -EIO);
			if (type == F2FS_WB_CP_DATA)
				f2fs_stop_checkpoint(sbi, true);
		}
And later, filemap_check_errors() in f2fs_sync_file() captures that error and propagates it to the user.
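
(For reference, a simplified sketch of what mapping_set_error() and filemap_check_errors() do with the mapping flags; this is an illustration of the mechanism, not the verbatim mm/filemap.c code, and "error"/"ret" stand for the local variables of the respective callers:)

	/* writeback completion path: remember the failure on the mapping */
	if (error == -ENOSPC)
		set_bit(AS_ENOSPC, &mapping->flags);
	else if (error)
		set_bit(AS_EIO, &mapping->flags);

	/* fsync path: test-and-clear the sticky error bits */
	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;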

So how do you inject the error into the bio? Via fail_make_request?

2. The image became inconsistent: we can still index the node block of inline_file, but we cannot read its detailed info because the node block itself is corrupted.

"inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]"

I think this is related to the first problem.
Comment 2 Stathis Maneas 2018-08-26 23:43:23 UTC
1. I am using a device mapper module similar to the existing dm-flakey module:
https://elixir.bootlin.com/linux/v4.18/source/drivers/md/dm-flakey.c

Specifically, an error is introduced to an I/O operation in the same way as implemented by the dm-flakey module (https://elixir.bootlin.com/linux/v4.18/source/drivers/md/dm-flakey.c#L348):
...
else if (test_bit(ERROR_WRITES, &fc->flags)) {
        bio_io_error(bio);
        return DM_MAPIO_SUBMITTED;
}
...

In this case, fsync is not able to capture the error and returns as if no error has taken place.
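
For completeness, a minimal sketch of such an injector target's map() callback, under the assumption that it mirrors dm-flakey; struct injector_ctx, its fields, and the single-sector match are hypothetical names for this sketch, and the real attached module may differ (the sketch also assumes the target maps the whole device from offset 0, so no sector remapping is shown):

#include <linux/device-mapper.h>
#include <linux/bio.h>

/* Hypothetical per-target context. */
struct injector_ctx {
	struct dm_dev *dev;		/* underlying device */
	sector_t target_sector;		/* first 512-byte sector of the block to fail */
};

static int injector_map(struct dm_target *ti, struct bio *bio)
{
	struct injector_ctx *ic = ti->private;

	/* Fail the selected write: complete the bio with BLK_STS_IOERR (-EIO). */
	if (op_is_write(bio_op(bio)) &&
	    bio->bi_iter.bi_sector == ic->target_sector) {
		bio_io_error(bio);
		return DM_MAPIO_SUBMITTED;
	}

	/* Otherwise pass the I/O through to the underlying device. */
	bio_set_dev(bio, ic->dev->bdev);
	return DM_MAPIO_REMAPPED;
}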

Here, I would like to mention that using this mechanism to introduce errors when ext4 is the underlying file system results in fsync returning an error.

Indeed, the problem here is that the corresponding NAT entry points to an inode that has never been persisted on disk. However, since fsync does not report an error, one would expect the entry to be there.

Moreover, the problem here is worse, because after invoking fsck, the entry completely disappears from the file system and cannot be restored in any way.

Had fsync returned an error, the user would have known that the operation failed and thus would not have expected any modifications to have taken place.
Comment 3 Chao Yu 2018-08-31 07:19:24 UTC
Alright, can you please provide the complete test case, including the error injection script? I'd like to reproduce this in my environment.
Comment 4 Stathis Maneas 2018-09-05 21:22:38 UTC
Created attachment 278331 [details]
Device Mapper Module (Injector)
Comment 5 Stathis Maneas 2018-09-05 21:23:04 UTC
Created attachment 278333 [details]
Device Mapper Module (Injector) - Header File
Comment 6 Stathis Maneas 2018-09-05 21:23:21 UTC
Created attachment 278335 [details]
Device Mapper Module - Makefile
Comment 7 Stathis Maneas 2018-09-05 21:40:50 UTC
Please find attached the required files to reproduce the error. Start with the Makefile to compile and install the module into your kernel.

Assuming the underlying device is /dev/vdb, please execute the following commands to initialize the file system:
$ dd if=/dev/zero of=/dev/vdb bs=4096
$ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ dd if=/dev/urandom of=inline_file bs=3400 count=1
$ cd; umount /mnt/f2fs

In order to reproduce the error, we need two pieces of information:
- The device's number of sectors (invoking fdisk -l should be more than sufficient to extract this information). In my case, this number is equal to 262,144.
- The block number (on disk) that will accommodate the newly updated inode (after its mode has changed). In my case, this block number is equal to 4609.

Then, please execute the following commands to reproduce the error:
$ sudo dmsetup create f2fs_dev --table '0 262144 injector /dev/vdb 0 W4609' --readahead none
$ dmesg -C; dmsetup message f2fs_dev 0 start
$ sudo mount -t f2fs /dev/mapper/f2fs_dev /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs/
$ /tmp/run_chmod inline_file
The operation "chmod" returned: 0 (errno: 0)

(Comment: During the execution of this command, the write operation that is related to the newly updated inode (BlockID: 4609) is failed.)

$ cd; umount /mnt/f2fs

If everything is executed as expected, then dmesg should contain an entry similar to the following:
Injecting (W) error for sec: 36872 and block: 4609
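
(For reference, sector 36872 corresponds exactly to block 4609: each 4096-byte block spans eight 512-byte sectors, and 4609 * 8 = 36872.)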

At this point, if we re-mount the file system, the file cannot be accessed since its inode was never persisted on disk:
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
ls: cannot access inline_file: Invalid argument
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
-????????? ? ?    ?       ?            ? inline_file
$ dmesg
...
[3247.026533] F2FS-fs (vdb): inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]
...
$ cd; umount /mnt/f2fs

After invoking fsck.f2fs and answering 'yes' to all questions, we end up with a consistent file system image, but now the entry has disappeared:
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
$ cd; umount /mnt/f2fs

Please let me know if you run into any issues while trying to reproduce the error.
Comment 8 Stathis Maneas 2018-09-05 21:42:50 UTC
Created attachment 278337 [details]
Simple C file that updates the permissions of a file.
Comment 9 Chao Yu 2018-09-21 02:18:33 UTC
Sorry for putting this off; I have been struggling to fix quota issues these days. Let me try to reproduce this issue.
Comment 10 Chao Yu 2018-09-21 08:50:21 UTC
(In reply to Stathis Maneas from comment #7)
> Please find attached the required files to reproduce the error. Start with
> the Makefile to compile and install the module into your kernel.
> 
> Assuming the underlying device is /dev/vdb, please execute the following
> commands to initialize the file system:
> $ dd if=/dev/zero of=/dev/vdb bs=4096
> $ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
> $ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
> $ cd /mnt/f2fs
> $ dd if=/dev/urandom of=inline_file bs=3400 count=1
> $ cd; umount /mnt/f2fs
> 
> In order to reproduce the error, we need two pieces of information:
> - The device's number of sectors (invoking fdisk -l should be more than
> sufficient to extract this information). In my case, this number is equal to
> 262,144.
> - The block number (on disk) that will accommodate the newly updated inode
> (after its mode has changed). In my case, this block number is equal to 4609.
> 
> Then, please execute the following commands to reproduce the error:
> $ sudo dmsetup create f2fs_dev --table '0 262144 injector /dev/vdb 0 W4609'
> --readahead none
> $ dmesg -C; dmsetup message f2fs_dev 0 start
> $ sudo mount -t f2fs /dev/mapper/f2fs_dev /mnt/f2fs -o background_gc=off
> $ cd /mnt/f2fs/
> $ /tmp/run_chmod inline_file
> The operation "chmod" returned: 0 (errno: 0)
> 
> (Comment: During the execution of this command, the write operation that is
> related to the newly updated inode (BlockID: 4609) is failed.)

I tried this testcase, but I can't reproduce this bug.

Looking into this, the problem is that f2fs always performs out-of-place updates for metadata, including inode blocks, so for any update to the inode of inline_file, f2fs should not write to BlockID 4609 directly. So how does inline_file become corrupted?
Comment 11 Stathis Maneas 2018-09-23 04:16:17 UTC
Considering that the operation modifies the entry's inode and that fsync is invoked at the end of the program so that all modifications persist, a new inode block will eventually be written at a new location on disk. The attached files aim to capture this write operation and drop it, while also returning an error to the upper layers.

Could you please let me know how I should help you reproduce this issue? I would definitely be willing to provide more files/information if necessary.
Comment 12 Chao Yu 2018-09-26 08:48:45 UTC
(In reply to Stathis Maneas from comment #7)
> - The block number (on disk) that will accommodate the newly updated inode
> (after its mode has changed). In my case, this block number is equal to 4609.

How did you find this "4609" address? Using "dump.f2fs -i ino_of_inline_file /dev/vdb"?
Comment 13 Stathis Maneas 2018-09-26 09:13:29 UTC
Indeed, I can tell what is the block address in which the updated inode is going to be stored using both dump.f2fs and the output of the SSA blocks.

Then, the file system gets initialized from scratch before the operation is executed once again. Based on the file system's deterministic behaviour, when the operation is repeated, then the updated inode will eventually be stored into the same block.

Please let me know if my understanding is accurate.
Comment 14 Chao Yu 2018-09-26 09:45:45 UTC
(In reply to Stathis Maneas from comment #13)
> Indeed, I can tell what is the block address in which the updated inode is
> going to be stored using both dump.f2fs and the output of the SSA blocks.

OK, let me capture the address the inode block will be written to, and try your case again.

> 
> Then, the file system gets initialized from scratch before the operation is
> executed once again. Based on the file system's deterministic behaviour,
> when the operation is repeated, then the updated inode will eventually be
> stored into the same block.

In your case, once the inode is updated, the LBA of the inode block changes to LBA+1.

We can record the final LBA at which the inode block lands in the first round; then, in the next round, we can inject the error directly at that fixed LBA.

> 
> Please let me know if my understanding is accurate.
Comment 15 Chao Yu 2018-09-29 01:58:03 UTC
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 71f5953..1b5d391 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -114,6 +114,11 @@ static void f2fs_write_end_io(struct bio *bio)
 			mapping_set_error(page->mapping, -EIO);
 			if (type == F2FS_WB_CP_DATA)
 				f2fs_stop_checkpoint(sbi, true);
+			printk("f2fs_write_end_io, ino:%lu, idx:%lu, eio:%d, type:%d\n",
+				page->mapping->host->i_ino,
+				page->index,
+				test_bit(AS_EIO, &page->mapping->flags),
+				type);
 		}
 
 		f2fs_bug_on(sbi, page->mapping == NODE_MAPPING(sbi) &&
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 5907775..76ac293 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1912,6 +1912,8 @@ int f2fs_wait_on_node_pages_writeback(struct f2fs_sb_info *sbi,
 	}
 
 	ret2 = filemap_check_errors(NODE_MAPPING(sbi));
+	printk("f2fs_wait_on_node_pages_writeback, eio:%d\n",
+			test_bit(AS_EIO, &NODE_MAPPING(sbi)->flags));
 	if (!ret)
 		ret = ret2;


Could you help add the above code and try to reproduce this issue in your environment?
Comment 16 Stathis Maneas 2018-10-07 01:50:40 UTC
I am sorry for the delay! I can definitely help by adding the aforementioned statements!

After adding them and observing the output of dmesg, I can observe that only the second statement gets actually invoked:
[ 1026.438986] f2fs_wait_on_node_pages_writeback, eio:0

Therefore, this indicates that the following condition is not true under the specified workload:
if (unlikely(bio->bi_status)) { ... }

I tried moving the first print statement at the beginning of the following for loop,
bio_for_each_segment_all(bvec, bio, i) { ... }

but still, I am not observing any errors for the segment (block) of interest:
[   76.603891] f2fs_write_end_io, ino:1, idx:4, eio:0, type:7
[   76.603893] f2fs_write_end_io, is_dummy:0 bio_status:0

Is there something else we can try?
Comment 17 Chao Yu 2018-10-16 01:48:46 UTC
(In reply to Stathis Maneas from comment #16)
> I am sorry for the delay! I can definitely help by adding the aforementioned
> statements!
> 
> After adding them and observing the output of dmesg, I can observe that only
> the second statement gets actually invoked:
> [ 1026.438986] f2fs_wait_on_node_pages_writeback, eio:0
> 
> Therefore, this indicates that the following condition is not true under the
> specified workload:
> if (unlikely(bio->bi_status)) { ... }
> 
> I tried moving the first print statement at the beginning of the following
> for loop,
> bio_for_each_segment_all(bvec, bio, i) { ... }
> 
> but still, I am not observing any errors for the segment (block) of interest:
> [   76.603891] f2fs_write_end_io, ino:1, idx:4, eio:0, type:7
> [   76.603893] f2fs_write_end_io, is_dummy:0 bio_status:0

Sorry for the delay.

Oh, so I guess that you triggered the error injection but did not set bio->bi_status to a nonzero value, with the result that the file system could not be aware of it.

I think we had better double-check all your test steps and injection code.

> 
> Is there something else we can try?
