Dear all,

I would like to describe a particular workflow that eventually leads to data loss. Specifically, starting with a valid F2FS image, we end up losing an entry when an I/O failure occurs.

To begin with, the F2FS file system is deployed on the following device:

$ fdisk -l
Disk /dev/vdb: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

The C source file that contains the executed workload is the following:

$ cat run_chmod.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

#define FILE_MODE_ALL (S_IRWXU | S_IRWXG | S_IRWXO)

static int __exec_fsync(char *entry_name)
{
	int fd, ret = 0;

	fd = open(entry_name, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return fd;
	}

	ret = fsync(fd);
	if (ret < 0)
		perror("fsync");

	if (close(fd))
		perror("close");

	return ret;
}

int main(int argc, char *argv[])
{
	int ret, errno_value = 0;

	/* Check that the number of arguments is sufficient.
	 */
	if (argc != 2) {
		fprintf(stderr, "Usage: %s <path-to-entry>\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	if ((ret = chmod(argv[1], FILE_MODE_ALL)) < 0) {
		errno_value = errno;
		perror("chmod");
	} else {
		__exec_fsync(argv[1]);
	}

	printf("The operation \"chmod\" returned: %d (errno: %d)\n", ret, errno_value);
	exit(EXIT_SUCCESS);
}

The steps to initialize the file system are the following:

$ dd if=/dev/zero of=/dev/vdb bs=4096
$ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ dd if=/dev/urandom of=inline_file bs=3400 count=1
$ cd; umount /mnt/f2fs

The actual workload is the following:

$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ./run_chmod inline_file

(Comment: During the execution of this command, we fail the write operation that is related to the newly updated inode (BlockID: 5633). In order to fail the write operation, we make use of the bio_io_error() function to error the corresponding bio and also return EIO to the upper layers.)

The operation "chmod" returned: 0 (errno: 0)

$ cd; umount /mnt/f2fs

As we can observe, the command did not return an error and dmesg is clean. At this point, if we re-mount the file system, the file cannot be accessed, since its inode was never persisted on disk:

$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
ls: cannot access inline_file: Invalid argument
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
-????????? ? ?    ?       ?            ? inline_file
$ dmesg
...
[76801.387616] F2FS-fs (vdb): inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]
[76801.387620] F2FS-fs (vdb): bad inode I/O: 4
...
$ cd; umount /mnt/f2fs

After invoking f2fs.fsck and answering 'yes' to all questions, we end up with a consistent file system image, but now, the entry has disappeared:

$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
$ cd; umount /mnt/f2fs

Furthermore, f2fs.fsck does not create an entry inside the lost_found directory, and thus the entry is completely lost. Last but not least, the issue persists even when the fsync_mode=strict option is provided during the mount operation.

Kind regards,
Stathis Maneas
I think there are two problems:

1. You injected an I/O error and returned it to the upper layers, but fsync() didn't fail.

To reproduce this:
a) I added the code below in f2fs_write_end_io(), and used fault injection to simulate an error under the block layer:

	if (time_to_inject(F2FS_P_SB(bio_first_page_all(bio)), FAULT_IO)) {
		f2fs_show_injection_info(FAULT_IO);
		bio->bi_status = BLK_STS_IOERR;
	}

b) xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
fsync: Input/output error

This is because, in f2fs_write_end_io(), if an error is injected, -EIO will be set in the node inode's page mapping, as below:

	if (unlikely(bio->bi_status)) {
		mapping_set_error(page->mapping, -EIO);
		if (type == F2FS_WB_CP_DATA)
			f2fs_stop_checkpoint(sbi, true);
	}

Later, filemap_check_errors() in f2fs_sync_file() will capture this error and propagate it to the user. So how do you inject the error into the bio? Via fail_make_request?

2. The image became inconsistent, so we can still index the node block of inline_file, but cannot read its detailed info, because the node block is corrupted:

"inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]"

I think this is related to the first problem.
1. I am using a device mapper module similar to the existing dm-flakey module:
https://elixir.bootlin.com/linux/v4.18/source/drivers/md/dm-flakey.c

Specifically, an error is introduced into an I/O operation in the same way as implemented by the dm-flakey module (https://elixir.bootlin.com/linux/v4.18/source/drivers/md/dm-flakey.c#L348):

	...
	else if (test_bit(ERROR_WRITES, &fc->flags)) {
		bio_io_error(bio);
		return DM_MAPIO_SUBMITTED;
	}
	...

In this case, fsync is not able to capture the error and returns as if no error had taken place. Here, I would like to mention that using this mechanism to introduce errors when ext4 is the underlying file system results in fsync returning an error.

2. Indeed, the problem here is that the corresponding NAT entry points to an inode that has never been persisted on disk. However, since fsync does not report an error, one would expect the entry to be there. Moreover, the problem here is worse, because after invoking fsck, the entry completely disappears from the file system and cannot be restored in any way. Had fsync returned an error, the user would have known that the operation failed and thus would not have expected any modifications to have taken place.
Alright, can you please provide the complete test case, including the error injection script, for me? I'd like to reproduce it in my environment.
Created attachment 278331 [details] Device Mapper Module (Injector)
Created attachment 278333 [details] Device Mapper Module (Injector) - Header File
Created attachment 278335 [details] Device Mapper Module - Makefile
Please find attached the required files to reproduce the error. Start with the Makefile to compile and install the module into your kernel.

Assuming the underlying device is /dev/vdb, please execute the following commands to initialize the file system:

$ dd if=/dev/zero of=/dev/vdb bs=4096
$ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ dd if=/dev/urandom of=inline_file bs=3400 count=1
$ cd; umount /mnt/f2fs

In order to reproduce the error, we need two pieces of information:
- The device's number of sectors (invoking fdisk -l should be more than sufficient to extract this information). In my case, this number is equal to 262,144.
- The block number (on disk) that will accommodate the newly updated inode (after its mode has changed). In my case, this block number is equal to 4609.

Then, please execute the following commands to reproduce the error:

$ sudo dmsetup create f2fs_dev --table '0 262144 injector /dev/vdb 0 W4609' --readahead none
$ dmesg -C; dmsetup message f2fs_dev 0 start
$ sudo mount -t f2fs /dev/mapper/f2fs_dev /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs/
$ /tmp/run_chmod inline_file
The operation "chmod" returned: 0 (errno: 0)

(Comment: During the execution of this command, the write operation that is related to the newly updated inode (BlockID: 4609) is failed.)

$ cd; umount /mnt/f2fs

If everything is executed as expected, then dmesg should contain an entry similar to the following:

Injecting (W) error for sec: 36872 and block: 4609

At this point, if we re-mount the file system, the file cannot be accessed, since its inode was never persisted on disk:

$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
ls: cannot access inline_file: Invalid argument
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
-????????? ? ?    ?       ?            ? inline_file
$ dmesg
...
[3247.026533] F2FS-fs (vdb): inconsistent node block, nid:4, node_footer[nid:0,ino:0,ofs:0,cpver:0,blkaddr:0]
...
$ cd; umount /mnt/f2fs

After invoking f2fs.fsck and answering 'yes' to all questions, we end up with a consistent file system image, but now, the entry has disappeared:

$ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
$ cd /mnt/f2fs
$ ls -alR
.:
total 8
drwxr-xr-x 2 root root 4096 Aug 20 17:04 .
drwxr-xr-x 3 root root 4096 Jun  5 14:50 ..
$ cd; umount /mnt/f2fs

Please let me know if you run into any issues while trying to reproduce the error.
Created attachment 278337 [details] Simple C file that updates the permissions of a file.
Sorry for putting this off; I have been struggling to fix quota issues these days. Let me try to reproduce this issue.
(In reply to Stathis Maneas from comment #7)
> Please find attached the required files to reproduce the error. Start with
> the Makefile to compile and install the module into your kernel.
> 
> Assuming the underlying device is /dev/vdb, please execute the following
> commands to initialize the file system:
> $ dd if=/dev/zero of=/dev/vdb bs=4096
> $ mkfs.f2fs -l F2FS /dev/vdb -d 10 -a 0 -O extra_attr -O inode_checksum
> $ mount -t f2fs /dev/vdb /mnt/f2fs -o background_gc=off
> $ cd /mnt/f2fs
> $ dd if=/dev/urandom of=inline_file bs=3400 count=1
> $ cd; umount /mnt/f2fs
> 
> In order to reproduce the error, we need two pieces of information:
> - The device's number of sectors (invoking fdisk -l should be more than
> sufficient to extract this information). In my case, this number is equal to
> 262,144.
> - The block number (on disk) that will accommodate the newly updated inode
> (after its mode has changed). In my case, this block number is equal to 4609.
> 
> Then, please execute the following commands to reproduce the error:
> $ sudo dmsetup create f2fs_dev --table '0 262144 injector /dev/vdb 0 W4609'
> --readahead none
> $ dmesg -C; dmsetup message f2fs_dev 0 start
> $ sudo mount -t f2fs /dev/mapper/f2fs_dev /mnt/f2fs -o background_gc=off
> $ cd /mnt/f2fs/
> $ /tmp/run_chmod inline_file
> The operation "chmod" returned: 0 (errno: 0)
> 
> (Comment: During the execution of this command, the write operation that is
> related to the newly updated inode (BlockID: 4609) is failed.)

I tried this test case, but I can't reproduce this bug.

Looking into this, the problem here is that f2fs always performs out-of-place updates for metadata, including inode blocks. Hence, for any update to the inode of inline_file, f2fs should not write to (BlockID: 4609) directly, so how does inline_file become corrupted?
Considering that the operation modifies the entry's inode, and that fsync is invoked at the end of the program so that all modifications persist, a new inode will eventually be written to a new location on disk. The attached files aim to capture this write operation and drop it, also returning an error to the upper layers.

Could you please let me know how I can help you reproduce this issue? I would definitely be willing to provide more files/information if necessary.
(In reply to Stathis Maneas from comment #7)
> - The block number (on disk) that will accommodate the newly updated inode
> (after its mode has changed). In my case, this block number is equal to 4609.

How did you find this "4609" address? Using "dump.f2fs -i ino_of_inline_file /dev/vdb"?
Indeed, I can tell the block address in which the updated inode is going to be stored using both dump.f2fs and the output of the SSA blocks.

Then, the file system is initialized from scratch before the operation is executed once again. Based on the file system's deterministic behaviour, when the operation is repeated, the updated inode will eventually be stored in the same block.

Please let me know if my understanding is accurate.
(In reply to Stathis Maneas from comment #13)
> Indeed, I can tell what is the block address in which the updated inode is
> going to be stored using both dump.f2fs and the output of the SSA blocks.

OK, let me catch the address that the inode block will be updated to, and try your case again.

> 
> Then, the file system gets initialized from scratch before the operation is
> executed once again. Based on the file system's deterministic behaviour,
> when the operation is repeated, then the updated inode will eventually be
> stored into the same block.

In your case, once the inode updates, the LBA of the inode block will change to LBA+1. We can record the final LBA where the inode block is located in the first round; then, in the next round, we can inject the error directly at that fixed LBA.

> 
> Please let me know if my understanding is accurate.
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 71f5953..1b5d391 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -114,6 +114,11 @@ static void f2fs_write_end_io(struct bio *bio)
 			mapping_set_error(page->mapping, -EIO);
 			if (type == F2FS_WB_CP_DATA)
 				f2fs_stop_checkpoint(sbi, true);
+			printk("f2fs_write_end_io, ino:%lu, idx:%lu, eio:%d, type:%d\n",
+				page->mapping->host->i_ino,
+				page->index,
+				test_bit(AS_EIO, &page->mapping->flags),
+				type);
 		}
 
 		f2fs_bug_on(sbi, page->mapping == NODE_MAPPING(sbi) &&
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 5907775..76ac293 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1912,6 +1912,8 @@ int f2fs_wait_on_node_pages_writeback(struct f2fs_sb_info *sbi,
 	}
 
 	ret2 = filemap_check_errors(NODE_MAPPING(sbi));
+	printk("f2fs_wait_on_node_pages_writeback, eio:%d\n",
+		test_bit(AS_EIO, &NODE_MAPPING(sbi)->flags));
 	if (!ret)
 		ret = ret2;

Could you help add the above code to reproduce this issue in your environment?
I am sorry for the delay! I can definitely help by adding the aforementioned statements!

After adding them and observing the output of dmesg, I can see that only the second statement is actually invoked:

[ 1026.438986] f2fs_wait_on_node_pages_writeback, eio:0

Therefore, this indicates that the following condition is not true under the specified workload:

	if (unlikely(bio->bi_status)) { ... }

I tried moving the first print statement to the beginning of the following for loop,

	bio_for_each_segment_all(bvec, bio, i) { ... }

but still, I am not observing any errors for the segment (block) of interest:

[   76.603891] f2fs_write_end_io, ino:1, idx:4, eio:0, type:7
[   76.603893] f2fs_write_end_io, is_dummy:0 bio_status:0

Is there something else we can try?
(In reply to Stathis Maneas from comment #16)
> I am sorry for the delay! I can definitely help by adding the aforementioned
> statements!
> 
> After adding them and observing the output of dmesg, I can observe that only
> the second statement gets actually invoked:
> [ 1026.438986] f2fs_wait_on_node_pages_writeback, eio:0
> 
> Therefore, this indicates that the following condition is not true under the
> specified workload:
> if (unlikely(bio->bi_status)) { ... }
> 
> I tried moving the first print statement at the beginning of the following
> for loop,
> bio_for_each_segment_all(bvec, bio, i) { ... }
> 
> but still, I am not observing any errors for the segment (block) of interest:
> [ 76.603891] f2fs_write_end_io, ino:1, idx:4, eio:0, type:7
> [ 76.603893] f2fs_write_end_io, is_dummy:0 bio_status:0

Sorry for the delay.

Oh, so I guess that you triggered the error injection but didn't set bio->bi_status to a nonzero value, resulting in the file system not being aware of it. I think we'd better double-check all of your test steps and injection code.

> 
> Is there something else we can try?