Bug 93441

Summary: Add AT_EMPTY_PATH ability to unlinkat() and renameat() for direct use of fds
Product: File System
Reporter: Niall Douglas (s_bugzilla)
Component: Other
Assignee: fs_other
Status: NEW
Severity: enhancement
CC: accounts+kernel, arequipeno, marcandre.lureau, pali, redneb, szg00000, thiago
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all
Subsystem:
Regression: No
Bisected commit-id:

Description Niall Douglas 2015-02-18 01:26:17 UTC
Recently I was trying to write code which can unlink a file completely race free with respect to any third party changes to the filing system. We have an open file descriptor to the file, and the best I could come up with was this:

1. Get one of the current paths of the open file descriptor using readlink() on /proc/self/fd/x

2. Open its containing directory using O_PATH.

3. Do a fstatat() on the containing directory for the leafname of the open file descriptor, checking if the device ids and inodes match the ones for our file descriptor.

4. If they match, do an unlinkat() to remove the leafname. NOTE THIS IS RACY as another program could swap our leafname for another between the fstatat and the unlinkat.
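
For illustration, the four steps above might be sketched roughly as follows. This is only a sketch under assumptions: the helper name racy_unlink_by_fd is hypothetical, /proc is assumed to be mounted, and the race window described in step 4 is still present.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: best-effort unlink of the file behind fd.
 * Returns 0 on success, -1 on error or if the identity check fails.
 * NOT race free, see the comment before unlinkat(). */
int racy_unlink_by_fd(int fd)
{
    char procpath[64], path[PATH_MAX];
    struct stat fdst, entst;
    const char *dir, *leaf;
    char *slash;
    ssize_t len;
    int dirfd, ret = -1;

    /* 1. Ask the kernel for a current path of the descriptor. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    len = readlink(procpath, path, sizeof(path) - 1);
    if (len < 0)
        return -1;
    path[len] = '\0';               /* readlink() does not null-terminate */

    /* Split the path into containing directory and leafname. */
    slash = strrchr(path, '/');
    if (slash == NULL)
        return -1;                  /* not a filesystem path, e.g. a pipe */
    *slash = '\0';
    leaf = slash + 1;
    dir = (slash == path) ? "/" : path;

    /* 2. Open the containing directory using O_PATH. */
    dirfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0)
        return -1;

    /* 3. Compare device id and inode of the leafname with those of fd. */
    if (fstat(fd, &fdst) == 0 &&
        fstatat(dirfd, leaf, &entst, AT_SYMLINK_NOFOLLOW) == 0 &&
        fdst.st_dev == entst.st_dev && fdst.st_ino == entst.st_ino) {
        /* 4. RACE WINDOW: another program can swap the leafname between
         * the fstatat() above and the unlinkat() below. */
        ret = unlinkat(dirfd, leaf, 0);
    }

    close(dirfd);
    return ret;
}
```

A caller would pass any open descriptor, e.g. racy_unlink_by_fd(fd), and treat -1 as "could not confirm the leafname still refers to this file".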


My suggested solution is this: One can already create new hard links from a file descriptor using linkat(fd, "", dirh, "name", AT_EMPTY_PATH), so why not allow direct unlinking of a file descriptor by:

unlinkat(fd, "", AT_EMPTY_PATH)

This allows completely race free unlinking. One question is "which hard link should it delete?" to which I think the answer must be whichever the symlink at /proc/self/fd/x points to.

A similar situation exists regarding race-free renaming. I would suggest the same treatment for renameat, so:

renameat2(fd, "", dirh, "new name", AT_EMPTY_PATH)

... would rename the file referenced by the file descriptor to appear in the destination directory and new leaf name. Does AT_EMPTY_PATH collide with the RENAME_xxx macros? Currently no: the AT_xxx macros start from 0x100 upwards, whilst the RENAME_xxx macros start from 0x1.
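
The non-collision claim can be checked mechanically. The sketch below assumes the flag values from the Linux UAPI headers; the fallback #defines are only there for C libraries that do not expose the RENAME_xxx macros, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>   /* AT_EMPTY_PATH (requires _GNU_SOURCE) */
#include <stdio.h>   /* RENAME_* on newer glibc */

/* Fallbacks for older C libraries; values as in the Linux UAPI headers. */
#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)
#endif
#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)
#endif
#ifndef RENAME_WHITEOUT
#define RENAME_WHITEOUT (1 << 2)
#endif

#define ALL_RENAME_FLAGS (RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)

/* Returns 1 if AT_EMPTY_PATH shares no bit with any RENAME_* flag. */
int at_empty_path_is_distinct(void)
{
    return (AT_EMPTY_PATH & ALL_RENAME_FLAGS) == 0;
}
```

This only shows that the bits are disjoint today; it says nothing about whether future RENAME_xxx flags would stay below the AT_xxx range.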

As an aside, I assume AT_EMPTY_PATH is there to detect when older kernels don't support this feature. Nevertheless, the flag is superfluous when an empty path is all you actually need.

Niall
Comment 1 Ian Pilcher 2017-04-14 20:21:44 UTC
I don't think that this can ever work in the general case.  Once a file descriptor is created, it refers to an inode, not a directory entry.

Consider the extreme case where:

1. A file is opened, creating a file descriptor.
2. 2 new hard links are created to the file.
3. The path that was originally used to open the file is unlinked.

What should unlinkat(fd, "", AT_EMPTY_PATH) do in this case?

I suggest that what is really needed is a way to atomically unlink/rename a file, while verifying that the directory entry being modified points to the same inode as the file descriptor.

This would require either new system calls that take 2 (unlink) or 3 (rename) file descriptors or a new AT_ flag that stretches the meaning of unlinkat/renameat almost beyond recognition (since pathname would never be interpreted relative to dirfd).

So a hypothetical funlinkat syscall might do something like this (but with added atomic goodness):

int funlinkat(int fd, int dirfd, const char *pathname, int flags)
{
     struct stat st1, st2;

     fstat(fd, &st1);
     lstat(pathname, &st2);

     if (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
         return -E??????;  /* Not the same file */

     return unlinkat(dirfd, pathname, flags);
}
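
For comparison, a userspace approximation of the same idea that compiles today might look like the sketch below. It is an assumption-laden illustration, not the proposed syscall: the name funlinkat_emulated is hypothetical, ESTALE is an arbitrary placeholder for the unspecified error code, the lookup is done relative to dirfd via fstatat(), and the check and the unlink are NOT atomic.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical userspace emulation: unlink pathname (relative to dirfd)
 * only if it currently refers to the same inode as fd.
 * Returns 0 on success, -1 with errno set on failure. */
int funlinkat_emulated(int fd, int dirfd, const char *pathname, int flags)
{
    struct stat st1, st2;

    if (fstat(fd, &st1) != 0)
        return -1;

    /* Look the name up relative to dirfd, without following a final symlink. */
    if (fstatat(dirfd, pathname, &st2, AT_SYMLINK_NOFOLLOW) != 0)
        return -1;

    if (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino) {
        errno = ESTALE;   /* placeholder choice: not the same file */
        return -1;
    }

    /* Race window: the entry can still be replaced before this call,
     * which is exactly what an in-kernel implementation would close. */
    return unlinkat(dirfd, pathname, flags);
}
```

Passing AT_FDCWD as dirfd makes pathname behave as it would for plain unlink().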
Comment 2 Niall Douglas 2017-04-15 00:05:17 UTC
(In reply to Ian Pilcher from comment #1)
> I don't think that this can ever work in the general case.  Once a file
> descriptor is created, it refers to an inode, not a directory entry.

This is incorrect for Linux. A file descriptor refers to one specific path to an inode, namely the one it was opened with. Else /proc/self/fd/<num> would not be stable, the link returned by it would be random.

(Which is the case on FreeBSD BTW, when you ask for a fd's path it gives you whatever path turns up first for that inode during its cache search)

Unlinking that entry causes /proc/self/fd/<num> to be linked to the string "(empty)". No other hard links to that inode are involved nor affected.

Hence getting unlinkat() to delete whatever a fd points at race free should be unproblematic on Linux, probably just a few lines of code.

Niall
Comment 3 Thiago Macieira 2017-09-15 23:22:34 UTC
Another option: add a flag so that one can specify both the file name and the file descriptor. unlinkat() could then atomically remove the file if and only if the file path matches the open file descriptor.

Rationale: the problem with the race condition is that removing a given path does not guarantee that the file has not been replaced. That is, another process could have replaced the file with another copy, so unlink() and even unlinkat() today could cause this replacement to be removed. This solution would ensure that the file is only removed if it is still where we think it is.

Niall's suggestion is that the file can be removed by fd alone. That would allow the file to be removed anywhere where it may still exist (provided the containing directory is writable by the calling process). I am not sure that is a good idea. If it's been renamed, it may have been for a good reason, so the calling process may want to know that it happened.

Another solution would be to acquire a write lock on the containing directory's file descriptor. This solution may be useful even in further cases where atomicity is required when multiple files are involved. Obviously, this wouldn't work for sticky world-writeable dirs, like /tmp.
Comment 4 Niall Douglas 2017-09-16 23:12:40 UTC
(In reply to Thiago Macieira from comment #3)
> Niall's suggestion is that the file can be removed by fd alone. That would
> allow the file to be removed anywhere where it may still exist (provided the
> containing directory is writable by the calling process). I am not sure that
> is a good idea. If it's been renamed, it may have been for a good reason, so
> the calling process may want to know that it happened.

On Windows, you can mark a file for deletion purely by open handle, without regard to its current path. Its entry does not disappear until some time after the last open handle to it in the system is closed, and therefore it continues to appear in the filesystem.

Therefore I could see some merit in Thiago's comment. I personally can't think of where it would actually be useful mind you, but that could be a lack of deep thought on the problem.
Comment 5 Pali Rohár 2017-12-19 12:12:51 UTC
(In reply to Niall Douglas from comment #2)
> (In reply to Ian Pilcher from comment #1)
> > I don't think that this can ever work in the general case.  Once a file
> > descriptor is created, it refers to an inode, not a directory entry.
> 
> This is incorrect for Linux. A file descriptor refers to one specific path
> to an inode, namely the one it was opened with. Else /proc/self/fd/<num>
> would not be stable, the link returned by it would be random.

This is not true for Linux either. If you open an fd for "/old/path" and call rename("/old/path", "/new/path"), then in /proc/<pid>/fd/ you would see "/new/path" even though you opened "/old/path".

Moreover, /proc/<pid>/fd/ is not stable on Linux either. If you have an open fd for "/old/path" and then do link("/old/path", "/new/path") + unlink("/old/path"), you would get /proc/<pid>/fd/<fd> pointing to "/old/path (deleted)".

> (Which is the case on FreeBSD BTW, when you ask for a fd's path it gives you
> whatever path turns up first for that inode during its cache search)
> 
> Unlinking that entry causes /proc/self/fd/<num> to be linked to the string
> "(empty)". No other hard links to that inode are involved nor affected.

If you unlink("/path"), then /proc/<pid>/fd/<fd> would point to "/path (deleted)".

But if you create a file "/new/path (deleted)" and open it, then /proc/<pid>/fd/<fd> would point (as expected) to "/new/path (deleted)".

So you cannot use the " (deleted)" suffix to tell whether the file behind /proc/<pid>/fd/<fd> was deleted or not.

> Hence getting unlinkat() to delete whatever a fd points at race free should
> be unproblematic on Linux, probably just a few lines of code.

The above examples show that it would not work.
Comment 6 Pali Rohár 2017-12-19 12:16:11 UTC
(In reply to Ian Pilcher from comment #1)
> So a hypothetical funlinkat syscall might do something like this (but with
> added atomic goodness):
> 
> int funlinkat(int fd, int dirfd, const char *pathname, int flags)
> {
>      struct stat st1, st2;
> 
>      fstat(fd, &st1);
>      lstat(pathname, &st2);
> 
>      if (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
>          return -E??????;  /* Not the same file */
> 
>      return unlinkat(dirfd, pathname, flags);
> }

Exactly. This is how the API for a race-free unlink syscall should look. A file descriptor points to one inode, and an inode can be referenced from multiple directories (hard links). Therefore, unlinking a directory entry always requires a file name.
Comment 7 Niall Douglas 2018-01-07 17:19:12 UTC
I repeat once again that file descriptors on Linux do not refer solely to an inode. They specifically track the hard link with which they were opened over time.

As this concept seems to be escaping people replying here, here follows a short program empirically proving that Linux file descriptors track changes to the original path they were opened with, not the inode. This makes Linux behave exactly like Windows, and not like (currently) OS X nor BSD which return a random valid hard link to the inode.

--- test-proc-fd-stability.c ---
#include <fcntl.h>
#include <unistd.h>
#include <fnmatch.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <limits.h>
#include <string.h>

#define LINKS 10
#define TESTFILE "test-proc-fd-stability-testfile"

int fds[LINKS];

void check()
{
  int n;
  ssize_t len;
  char filename[PATH_MAX], path[PATH_MAX];
  for(n=0; n<LINKS; n++) {
    sprintf(filename, "/proc/self/fd/%d", fds[n]);
    // readlink() does not null terminate, so leave room and terminate by hand
    len = readlink(filename, path, sizeof(path) - 1);
    if(-1 == len) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      exit(1);
    }
    path[len] = 0;
    sprintf(filename, "*/" TESTFILE "-%d*", n);
    if(FNM_NOMATCH == fnmatch(filename, path, 0)) {
      fprintf(stderr, "FAILED: fd index %d reads as %s\n", n, path);
    }
  }
}

int main(void)
{
  char filename[PATH_MAX], filename2[PATH_MAX];
  int mfd, n;
  // Create the file
  mfd = open(TESTFILE, O_CREAT, 0660);
  if(-1 == mfd) {
    fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
    return 1;
  }
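  // Create the hardlinks to same inode, opening a fd through each one
  for(n = 0; n<LINKS; n++) {
    sprintf(filename, TESTFILE "-%d", n);
    unlink(filename);
    if(-1 == link(TESTFILE, filename)) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      return 1;
    }
    // Without this, fds[] stays zeroed and check() would only inspect fd 0
    fds[n] = open(filename, O_RDONLY);
    if(-1 == fds[n]) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      return 1;
    }
  }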
  // Check that each fd reflects the specific hardlink it was opened with
  check();
  // Permute hardlinks to prove that each fd tracks renames to its specific
  // path and does NOT choose some random path identifying the inode
  for(n = 0; n<LINKS; n++) {
    sprintf(filename, TESTFILE "-%d", n);
    sprintf(filename2, TESTFILE "-%d-%d", n, n);
    rename(filename, filename2);
  }
  // Check that each fd tracked the specific hardlink it was opened with
  check();
  printf("\nIf no failures appeared above, then this Linux kernel indeed "
         "has its file descriptors track changes to the specific hardlink "
         "they were opened with, and it is NOT the case that they track "
         "the inode and just return any old path to it like on other "
         "operating systems.\n");
  return 0;
}
--- test-proc-fd-stability.c ---

Regarding the preceding comment about the Linux kernel's choice to report fds referring to unlinked paths with "(empty)", which is of course a perfectly valid filename: I entirely agree that that is an unfortunate choice. An empty string, or having readlink("/proc/self/fd/x") simply fail with an Exxx, would have been a far better choice.

All the more reason for unlinkat() to be extended to be able to unlink the hardlink specifically opened by a file descriptor, race free of any changes made to that hard link's path by anyone else since then.
Comment 8 Pali Rohár 2018-10-28 11:29:57 UTC
(In reply to Niall Douglas from comment #7)
> I repeat once again ... track the hard link with which they were opened over
> time.

It seems that you have not caught what I wrote: it is not true, and the kernel does not always track the file path of a file descriptor correctly...

> ... a short program empirically proving that ...

To prove that something is always true, you need to show it for all possible inputs, not just for one. To disprove it, on the other hand, you need just one counterexample.

So here is a counterexample showing that the kernel does not track paths to file descriptors in a consistent state:

===

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>

int main() {

	int fd1, fd2;
	char procpath[PATH_MAX];
	char filepath[PATH_MAX];

	/* Create and open two temp files */
	fd1 = creat("/tmp/test1", 0644);
	fd2 = creat("/tmp/test2", 0644);

	/* Rename first file via rename syscall */
	if (rename("/tmp/test1", "/tmp/test1_renamed") != 0) {
		perror("rename /tmp/test1 --> /tmp/test1_renamed");
		return 1;
	}

	/* Rename second file via link+unlink syscalls */
	if (link("/tmp/test2", "/tmp/test2_renamed") != 0) {
		perror("link /tmp/test2 --> /tmp/test2_renamed");
		return 1;
	}
	if (unlink("/tmp/test2") != 0) {
		perror("unlink /tmp/test2");
		return 1;
	}

	/* Read kernel's path to first file descriptor */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd1);
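	/* readlink() does not null-terminate: zero the buffer and keep room for the terminator */
	memset(filepath, 0, sizeof(filepath));
	if (readlink(procpath, filepath, sizeof(filepath) - 1) < 0) {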
		perror("readlink /proc/self/fd/...");
		return 1;
	}
	if (strcmp("/tmp/test1_renamed", filepath) != 0)
		printf("ERROR: Kernel does not track path to file correctly (expected: `%s', got: `%s'), file was not deleted, only renamed\n", "/tmp/test1_renamed", filepath);

	/* Read kernel's path to second file descriptor */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd2);
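	/* readlink() does not null-terminate: zero the buffer and keep room for the terminator */
	memset(filepath, 0, sizeof(filepath));
	if (readlink(procpath, filepath, sizeof(filepath) - 1) < 0) {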
		perror("readlink /proc/self/fd/...");
		return 1;
	}
	if (strcmp("/tmp/test2_renamed", filepath) != 0)
		printf("ERROR: Kernel does not track path to file correctly (expected: `%s', got: `%s'), file was not deleted, only renamed\n", "/tmp/test2_renamed", filepath);

	/* Cleanup */
	close(fd1);
	close(fd2);
	unlink("/tmp/test1_renamed");
	unlink("/tmp/test2_renamed");

	return 0;

}

===

Note that the renaming can also be done by an external program, over which the example above has no control.

So to have a stable, predictable and secure way of unlinking files, any unlink* call must always be given the path to the file which is going to be unlinked.

Otherwise the caller of the unlink* syscall would not know which file is going to be unlinked, and that opens a possible security hole.
Comment 9 Niall Douglas 2018-10-28 23:34:30 UTC
> So here is counterexample to show that kernel does not track paths to file
> descriptors in consistent state:

Firstly thank you for the counterexample. It is always far more productive to base discussion on proofs rather than on claims.

Your counterexample behaves exactly as I would expect it to. Atomic rename has the fd track that rename as there is no ambiguity which new path the fd ought to have.

Non-atomic rename via link + unlink has the fd report the orphaned path as deleted. This is absolutely consistent with what one would expect. There is no way for any piece of code to know what to unambiguously do here (e.g. if I made two hard links, then unlinked the original, which of the two hard links should the system choose? It cannot. Therefore it's best to not go there).

Moreover, Linux behaves here exactly as Microsoft Windows behaves here. If I hard-link-via-HANDLE some inode to a new path, and then delete-by-HANDLE, I delete the original path, and querying that HANDLE for its current path returns that the file has been deleted. Similarly, if I atomic-rename-by-HANDLE, that HANDLE returns the new path. Exactly as Linux does.

I don't think anybody can reasonably do anything else.

> So to have stable, predictive and secure way for unlinking files, it is
> always needed to supply for any unlink* call path to file which is going to
> be unlinked.

Really, **no**. Unlinking by absolute path is always FAR more racy and prone to unintended data loss than unlinking by open fd.

> Otherwise caller of unlink* syscall would know which is is going be unlinked
> and therefore it opens a possible security hole.

Actually, the path may change between the time of inspection and the time of unlink.

But it doesn't matter. If you have write access to a fd, you have better ways of destroying data than unlinking the file entry.


FYI to Linux kernel devs, the paper proposing C++ standardisation of facilities which would make use of the enhancement can be found at http://wg21.link/P1031 (Low level file i/o library). We currently use a highly inefficient emulation on Linux instead; it currently adds a 6% penalty to unlinking files. Microsoft Windows has no penalty, as it implements this facility already.
Comment 10 Pali Rohár 2018-10-29 08:35:35 UTC
(In reply to Niall Douglas from comment #9)
> > So here is counterexample to show that kernel does not track paths to file
> > descriptors in consistent state:
> 
> Firstly thank you for the counterexample. It is always far more productive
> to base discussion on proofs rather than on claims.
> 
> Your counterexample behaves exactly as I would expect it to. Atomic rename
> has the fd track that rename as there is no ambiguity which new path the fd
> ought to have.
> 
> Non-atomic rename via link + unlink has the fd report the orphaned path as
> deleted. This is absolutely consistent with what one would expect. There is
> no way for any piece of code to know what to unambiguously do here (e.g. if
> I made two hard links, then unlinked the original, which of the two hard
> links should the system choose? It cannot. Therefore it's best to not go
> there).

The problem is that the rename can be done by an external application over which your application (the one that is going to do the unlink) has no control. That external application can use whatever it wants (possibly non-atomic calls, etc.), but you want your application to be deterministic and stable.

> Moreover, Linux behaves here exactly as Microsoft Windows behaves here. If I
> hard-link-via-HANDLE some inode to a new path, and then delete-by-HANDLE, I
> delete the original path, and querying that HANDLE for its current path
> returns that the file has been deleted. Similarly, if I
> atomic-rename-by-HANDLE, that HANDLE returns the new path. Exactly as Linux
> does.
> 
> I don't think anybody can reasonably do anything else.
> 
> > So to have stable, predictive and secure way for unlinking files, it is
> > always needed to supply for any unlink* call path to file which is going to
> > be unlinked.
> 
> Really, **no**. Unlinking by absolute path is always FAR more racy and prone
> to unintended data loss than unlinking by open fd.

That is why I'm suggesting to use Ian's proposal from comment #1: a syscall which takes both an (absolute) path and a file descriptor, and atomically checks that the file descriptor refers to the inode at that path.

> > Otherwise caller of unlink* syscall would know which is is going be
> > unlinked and therefore it opens a possible security hole.
> 
> Actually, the path may change between the time of inspection and the time of
> unlink.

If this inspection is done by the kernel atomically, and userspace provides both the path and the file descriptor, then there is no race and it cannot change.

> But it doesn't matter. If you have write access to a fd, you have better
> ways of destroying data than unlinking the file entry.

Yes. Whoever has write access can do anything. But imagine that you have "correctly" written applications which do not want to destroy data, and which want stable and predictable behavior.

> FYI to Linux kernel devs, the paper proposing C++ standardisation of
> facilities which would make use of the enhancement can be found at
> http://wg21.link/P1031 (Low level file i/o library). We currently use a
> highly inefficient emulation on Linux instead, it currently adds a 6%
> penalty to unlinking files. Microsoft Windows has no penalty, as it
> implements this facility already.