Bug 202485

Summary: chmod'ed permission not persisted upon fsync
Product: File System Reporter: Seulbae Kim (seulbae)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: NEW ---    
Severity: normal CC: tytso, vijayc
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18~Latest Tree: Mainline
Regression: No
Attachments: Proof of Concept

Description Seulbae Kim 2019-02-01 20:10:18 UTC
Created attachment 280919 [details]
Proof of Concept

[Kernel version]
This bug can be reproduced on
kernel 4.18 ~ 4.20.0+(kernel 645ff1e8e704c4f33ab1fcd3c87f95cb9b6d7144)

[Reproduce] * Use a VM, since our PoC simulates a crash by triggering a SysRq!
1. Download base image
$ wget https://gts3.org/~seulbae/fsimg/ext4-00.image

2. Mount image
$ mkdir /tmp/ext4
$ sudo mount -o loop ext4-00.image /tmp/ext4

3. Compile and run PoC
$ gcc poc.c -o poc
$ sudo ./poc /tmp/ext4
(System reboots)

[Check]
1. Re-mount the crashed image
$ mkdir /tmp/ext4
$ sudo mount -o loop ext4-00.image /tmp/ext4

2. Check inconsistency
$ stat /tmp/ext4/foo/bar/fifo
-> Access: (0644/prw-r--r--)

[Description]
In the base image, 2 directories and 7 files exist.

0: 0755 (mount_point)
+--257: 0755 foo
   +--258: 0755 bar
      +--259: 0644 baz (12 bytes, offset: {})
      +--259: 0644 hln (12 bytes, offset: {})
      +--260: 0644 xattr (0 bytes, offset: {})
      +--261: 0644 acl (0 bytes, offset: {})
      +--262: 0644 æøå (4 bytes, offset: {})
      +--263: 0644 fifo
      +--264: 0777 sln -> mnt/foo/bar/baz

foo/bar/fifo is a FIFO file.
The PoC basically
1. changes its permission,
(line 26) syscall(SYS_chmod, "foo/bar/fifo", 0400);
2. opens it,
(line 27) syscall(SYS_chmod, "foo/bar/fifo", 0400);
3. flushes its metadata, and then
(line 28) syscall(SYS_fsync, fd);
4. simulates a crash by rebooting right away without unmounting.
(line 30) system("echo b > /proc/sysrq-trigger");

We expect that the metadata regarding
the new permission is successfully flushed to disk,
and when we remount the crashed image,
we will see that foo/bar/fifo's mode is changed to 0400.

However, the file still has its old mode, 0644.

Reported by Seulbae Kim (seulbae@gatech.edu) from SSLab, Gatech
Comment 1 Theodore Tso 2019-02-02 03:38:30 UTC
This is going to be true of all special files (block and character devices, unix domain sockets, FIFO's etc.), and it's going to be true for all file systems in Linux.

That's because the VFS provides the file_operations structures for these special files, and the underlying file system never gets the notified about the fsync().   All of these special files don't have a fsync function defined, so fsync() on these devices will be a no-op.

Whether or not this is a bug is an interesting philosophical question.   The Single Unix Specification is extremely non-directive on this score, saying that it's all implementation defined.   The Linux man page does have this statement:

        As well as flushing the file data, fsync() also flushes the metadata
        information associated with the file (see inode(7)).

So what has been implemented in the Linux kernel for decades is at odds with this statement, at least as it relates to special Unix files.

Do you have a use case or a program where this is important?  Or is this something merely of academic interest?
Comment 2 Vijay Chidambaram 2019-02-02 04:01:14 UTC
Hi Seulbae,

Adding on to what Ted said, the POSIX standards are not very specific about crash-consistency guarantees. I'd recommend you go through the ext4/btrfs/xfs mailing lists searching for "Vijay Chidambaram", Jayashree Mohan", or "Ashlie Martinez" to find prior discussions and other potential dead-ends. The previous discussions also document the guarantees provided by the widely-used Linux file systems, so that you know what is a bug and what is not. 

For example, a symlink also does not have crash-consistency guarantees: https://www.spinics.net/lists/linux-btrfs/msg76816.html

Our OSDI paper has more details: http://www.cs.utexas.edu/~vijay/papers/osdi18-crashmonkey.pdf

Thanks,
Vijay Chidambaram, UT Austin
Comment 3 Theodore Tso 2019-02-02 14:29:50 UTC
I do believe we should either fix the man page to document existing practice, or change the kernel.   Changing kernel behaviour here going to be a fairly involved, since today the filesystem layer (ext4, btrfs, xfs) is not even informed when a special device file is open, read, written, fsync'ed, etc.   Which makes sense; when you write to /dev/tty or /dev/ttyS1, or /dev/sda3, etc., it has nothing to do with the file system where the where the directory entry to said special device file is found.   And indeed, if one were to open /dev/sda3, write to it, and then fsync it, the vast majority of user space programs care about the *data* written to /dev/sda3, and not to any associated metadata, such as the permission bits or the modtime of /dev/sda3 being persisted.

The history behind fsync(2) was that it was intended for regular files, not special device files, and for database programs in particular.  The fdatasync(2) call was added later as an optimization to avoid unnecessary updates to the inode when it was not necessary to be able to access the data blocks that had been written.   I am not sure how many legacy Unix implementations implement the semantics that you are hoping for, and I very much doubt there are any user space programs who depend on such a behavior.

As such, I suspect it's going to be hard to get most developers to care about implementing this if it's only to be about API semantic purity.   If there was a cash-rich, say, Japanese corporation that was going to throw $$$ at a company asking them to fix it, it might happen, if it didn't complicate the implementation; most developers would be far more interested in keeping the VFS layer bug-free and long-term maintainable compared to API semantic purity.  Or if *you* really care, it might be an interesting starter project.