Bug 93441

Summary:	Add AT_EMPTY_PATH ability to unlinkat() and renameat() for direct use of fds
Product:	File System	Reporter:	Niall Douglas (s_bugzilla)
Component:	Other	Assignee:	fs_other
Status:	NEW ---
Severity:	enhancement	CC:	accounts+kernel, alx, arequipeno, marcandre.lureau, pali, redneb, szg00000, thiago
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	all	Subsystem:
Regression:	No	Bisected commit-id:

Description Niall Douglas 2015-02-18 01:26:17 UTC

Recently I was trying to write code which can unlink a file completely race free with respect to any third party changes to the filing system. We have an open file descriptor to the file, and the best I could come up with was this:

1. Get one of the current paths of the open file descriptor using readlink() on /proc/self/fd/x

2. Open its containing directory using O_PATH.

3. Do a fstatat() on the containing directory for the leafname of the open file descriptor, checking if the device ids and inodes match the ones for our file descriptor.

4. If they match, do an unlinkat() to remove the leafname. NOTE THIS IS RACY as another program could swap our leafname for another between the fstatat and the unlinkat.
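
The four steps above can be sketched roughly as follows. This is only an illustration of the workaround, not a fix: unlink_by_fd_racy() is a hypothetical helper name, ESTALE is an arbitrary choice of error for "not the same file", and the race described in step 4 is still present.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Best-effort, NOT race-free emulation of "unlink the file behind fd".
 * Hypothetical helper. Returns 0 on success, -1 with errno set on failure.
 */
int unlink_by_fd_racy(int fd)
{
	char procpath[64], path[PATH_MAX];
	struct stat st_fd, st_name;
	ssize_t len;
	char *leaf;
	int dirfd, ret, saved;

	/* Step 1: ask the kernel for one current path of the descriptor. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	len = readlink(procpath, path, sizeof(path) - 1);
	if (len < 0)
		return -1;
	path[len] = '\0';	/* readlink() does not NUL-terminate */

	leaf = strrchr(path, '/');
	if (path[0] != '/' || leaf == NULL || leaf[1] == '\0') {
		errno = ENOENT;	/* not a usable path (pipe, socket, "/") */
		return -1;
	}
	*leaf++ = '\0';		/* split into directory and leafname */

	/* Step 2: open the containing directory using O_PATH. */
	dirfd = open(path[0] ? path : "/", O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (dirfd < 0)
		return -1;

	/* Step 3: compare device id and inode of the leafname with our fd. */
	if (fstat(fd, &st_fd) < 0 ||
	    fstatat(dirfd, leaf, &st_name, AT_SYMLINK_NOFOLLOW) < 0) {
		ret = -1;
	} else if (st_fd.st_dev != st_name.st_dev ||
		   st_fd.st_ino != st_name.st_ino) {
		errno = ESTALE;	/* arbitrary: "not the same file" */
		ret = -1;
	} else {
		/* Step 4: RACY. The leafname can be swapped right here. */
		ret = unlinkat(dirfd, leaf, 0);
	}

	saved = errno;
	close(dirfd);
	errno = saved;
	return ret;
}
```

Holding the directory open via O_PATH narrows the window, since the directory itself cannot be swapped between steps 3 and 4, but the leafname still can.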

My suggested solution is this: One can already create new hard links from a file descriptor using linkat(fd, "", dirh, "name", AT_EMPTY_PATH), so why not allow direct unlinking of a file descriptor by:

unlinkat(fd, "", AT_EMPTY_PATH)

This allows completely race free unlinking. One question is "which hard link should it delete?" to which I think the answer must be whichever the symlink at /proc/self/fd/x points to.
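
For comparison, the existing link-by-descriptor facility mentioned above can be exercised roughly like this. link_fd_at() is a hypothetical helper name. The sketch assumes that the direct AT_EMPTY_PATH form may be refused for unprivileged callers on some kernels (the man page ties it to CAP_DAC_READ_SEARCH), so it falls back to following the /proc/self/fd magic link with AT_SYMLINK_FOLLOW.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Give the file behind fd an additional name "name" inside directory dirfd.
 * Hypothetical helper. Returns 0 on success, -1 with errno set on failure.
 */
int link_fd_at(int fd, int dirfd, const char *name)
{
	char procpath[64];

	/* Direct form. May fail for callers without CAP_DAC_READ_SEARCH. */
	if (linkat(fd, "", dirfd, name, AT_EMPTY_PATH) == 0)
		return 0;

	/* Fallback: let the kernel follow the /proc magic link for the fd. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	return linkat(AT_FDCWD, procpath, dirfd, name, AT_SYMLINK_FOLLOW);
}
```

Neither form has an unlink counterpart today, which is the gap this report is about.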

A similar situation exists for race-free renaming. I would suggest the same treatment for renameat, so:

renameat2(fd, "", dirh, "new name", AT_EMPTY_PATH)

... would rename the file referenced by the file descriptor to appear in the destination directory and new leaf name. Does AT_EMPTY_PATH collide with the RENAME_xxx macros? Currently no: the AT_xxx macros start from 0x100 upwards, whilst the RENAME_xxx macros start from 0x1.
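
The non-collision claim can be checked mechanically. The sketch below assumes the flag values from the Linux UAPI headers (AT_EMPTY_PATH is 0x1000, the RENAME_xxx flags occupy the low three bits) and supplies them as fallbacks in case the C library headers do not define them; rename_flags_collide() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>	/* AT_EMPTY_PATH */
#include <stdio.h>	/* RENAME_* on newer C libraries */

/* Fallbacks with the Linux UAPI values, for headers that lack them. */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)
#endif
#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)
#endif
#ifndef RENAME_WHITEOUT
#define RENAME_WHITEOUT (1 << 2)
#endif

#define RENAME_ALL_FLAGS (RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)

/* Refuse to compile if the two flag namespaces ever start to overlap. */
_Static_assert((AT_EMPTY_PATH & RENAME_ALL_FLAGS) == 0,
	       "AT_EMPTY_PATH overlaps a RENAME_* flag");

/* Runtime view of the same check: returns nonzero if the bits collide. */
int rename_flags_collide(void)
{
	return (AT_EMPTY_PATH & RENAME_ALL_FLAGS) != 0;
}
```

This only shows that the bits are free today; whether renameat2() should accept an AT_xxx flag in its flags argument is a separate design question.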

As an aside, I assume AT_EMPTY_PATH is there to detect when older kernels don't support this feature, nevertheless the flag is superfluous when an empty path is all you actually need.

Niall

Comment 1 Ian Pilcher 2017-04-14 20:21:44 UTC

I don't think that this can ever work in the general case.  Once a file descriptor is created, it refers to an inode, not a directory entry.

Consider the extreme case where:

1. A file is opened, creating a file descriptor.
2. 2 new hard links are created to the file.
3. The path that was originally used to open the file is unlinked.

What should unlinkat(fd, "", AT_EMPTY_PATH) do in this case?

I suggest that what is really needed is a way to atomically unlink/rename a file, while verifying that the directory entry being modified points to the same inode as the file descriptor.

This would require either new system calls that take 2 (unlink) or 3 (rename) file descriptors or a new AT_ flag that stretches the meaning of unlinkat/renameat almost beyond recognition (since pathname would never be interpreted relative to dirfd).

So a hypothetical funlinkat syscall might do something like this (but with added atomic goodness):

int funlinkat(int fd, int dirfd, const char *pathname, int flags)
{
     struct stat st1, st2;

     fstat(fd, &st1);
     fstatat(dirfd, pathname, &st2, AT_SYMLINK_NOFOLLOW);  /* same lookup base as unlinkat() below */

     if (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
         return -E??????;  /* Not the same file */

     return unlinkat(dirfd, pathname, flags);
}

Comment 2 Niall Douglas 2017-04-15 00:05:17 UTC

(In reply to Ian Pilcher from comment #1)
> I don't think that this can ever work in the general case.  Once a file
> descriptor is created, it refers to an inode, not a directory entry.

This is incorrect for Linux. A file descriptor refers to one specific path to an inode, namely the one it was opened with. Else /proc/self/fd/<num> would not be stable, the link returned by it would be random.

(Which is the case on FreeBSD BTW, when you ask for a fd's path it gives you whatever path turns up first for that inode during its cache search)

Unlinking that entry causes /proc/self/fd/<num> to be linked to the string "(empty)". No other hard links to that inode are involved nor affected.

Hence getting unlinkat() to delete whatever a fd points at race free should be unproblematic on Linux, probably just a few lines of code.

Niall

Comment 3 Thiago Macieira 2017-09-15 23:22:34 UTC

Another option: add a flag so that one can specify both the file name and the file descriptor. unlinkat() could then atomically remove the file if and only if the file path matches the open file descriptor.

Rationale: the problem with the race condition is that removing a given path does not guarantee that the file has not been replaced. That is, another process could have replaced the file with another copy, so unlink() and even unlinkat() today could cause this replacement to be removed. This solution would ensure that the file is only removed if it is still where we think it is.

Niall's suggestion is that the file can be removed by fd alone. That would allow the file to be removed anywhere where it may still exist (provided the containing directory is writable by the calling process). I am not sure that is a good idea. If it's been renamed, it may have been for a good reason, so the calling process may want to know that it happened.

Another solution would be to acquire a write lock on the containing directory's file descriptor. This solution may be useful even in further cases where atomicity is required when multiple files are involved. Obviously, this wouldn't work for sticky world-writeable dirs, like /tmp.

Comment 4 Niall Douglas 2017-09-16 23:12:40 UTC

(In reply to Thiago Macieira from comment #3)
> Niall's suggestion is that the file can be removed by fd alone. That would
> allow the file to be removed anywhere where it may still exist (provided the
> containing directory is writable by the calling process). I am not sure that
> is a good idea. If it's been renamed, it may have been for a good reason, so
> the calling process may want to know that it happened.

On Windows, you can mark a file for deletion purely by open handle, without regard to its current path. Its entry does not disappear until some time after the last open handle to it in the system is closed, and therefore it continues to appear in the filesystem.

Therefore I could see some merit in Thiago's comment. I personally can't think of where it would actually be useful mind you, but that could be a lack of deep thought on the problem.

Comment 5 Pali Rohár 2017-12-19 12:12:51 UTC

(In reply to Niall Douglas from comment #2)
> (In reply to Ian Pilcher from comment #1)
> > I don't think that this can ever work in the general case.  Once a file
> > descriptor is created, it refers to an inode, not a directory entry.
> 
> This is incorrect for Linux. A file descriptor refers to one specific path
> to an inode, namely the one it was opened with. Else /proc/self/fd/<num>
> would not be stable, the link returned by it would be random.

This is not true on Linux either. If you open an fd for "/old/path" and call rename("/old/path", "/new/path"), then /proc/<pid>/fd/ shows "/new/path" even though you opened "/old/path".

Moreover, /proc/<pid>/fd/ is not stable on Linux either. If you have an open fd for "/old/path" and then do link("/old/path", "/new/path") + unlink("/old/path"), /proc/<pid>/fd/<fd> ends up pointing to "/old/path (deleted)".

> (Which is the case on FreeBSD BTW, when you ask for a fd's path it gives you
> whatever path turns up first for that inode during its cache search)
> 
> Unlinking that entry causes /proc/self/fd/<num> to be linked to the string
> "(empty)". No other hard links to that inode are involved nor affected.

If you unlink("/path"), then /proc/<pid>/fd/<fd> would point to "/path (deleted)".

But if you create file "/new/path (deleted)" and open it, then /proc/<pid>/fd/<fd> would point (as expected) to "/new/path (deleted)".

So you cannot use the " (deleted)" suffix to tell whether the file behind /proc/<pid>/fd/<fd> was deleted or not.

> Hence getting unlinkat() to delete whatever a fd points at race free should
> be unproblematic on Linux, probably just a few lines of code.

The above examples show that it would not work.

Comment 6 Pali Rohár 2017-12-19 12:16:11 UTC

(In reply to Ian Pilcher from comment #1)
> So a hypothetical funlinkat syscall might do something like this (but with
> added atomic goodness):
> 
> int funlinkat(int fd, int dirfd, const char *pathname, int flags)
> {
>      struct stat st1, st2;
> 
>      fstat(fd, &st1);
>      fstatat(dirfd, pathname, &st2, AT_SYMLINK_NOFOLLOW);
> 
>      if (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
>          return -E??????;  /* Not the same file */
> 
>      return unlinkat(dirfd, pathname, flags);
> }

Exactly. This is what the API for a race-free unlink syscall should look like. A file descriptor points to one inode, and an inode can be referenced from several directories (hard links). Therefore a file name is always needed to unlink a directory entry.

Comment 7 Niall Douglas 2018-01-07 17:19:12 UTC

I repeat once again that file descriptors on Linux do not refer solely to an inode. They specifically track the hard link with which they were opened over time.

As this concept seems to be escaping people replying here, here follows a short program empirically proving that Linux file descriptors track changes to the original path they were opened with, not the inode. This makes Linux behave exactly like Windows, and not like (currently) OS X nor BSD which return a random valid hard link to the inode.

--- test-proc-fd-stability.c ---
#include <fcntl.h>
#include <unistd.h>
#include <fnmatch.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // for strerror()
#include <errno.h>
#include <limits.h>

#define LINKS 10
#define TESTFILE "test-proc-fd-stability-testfile"

int fds[LINKS];

void check()
{
  int n;
  ssize_t len;
  char filename[PATH_MAX], path[PATH_MAX];
  for(n=0; n<LINKS; n++) {
    sprintf(filename, "/proc/self/fd/%d", fds[n]);
    // readlink() does not NUL-terminate, so reserve a byte and do it here
    len = readlink(filename, path, sizeof(path) - 1);
    if(-1 == len) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      exit(1);
    }
    path[len] = 0;
    sprintf(filename, "*/" TESTFILE "-%d*", n);
    if(FNM_NOMATCH == fnmatch(filename, path, 0)) {
      fprintf(stderr, "FAILED: fd index %d reads as %s\n", n, path);
    }
  }
}

int main(void)
{
  char filename[PATH_MAX], filename2[PATH_MAX];
  int mfd, n;
  // Create the file
  mfd = open(TESTFILE, O_CREAT, 0660);
  if(-1 == mfd) {
    fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
    return 1;
  }
  // Create the hardlinks to same inode
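  for(n = 0; n<LINKS; n++) {
    sprintf(filename, TESTFILE "-%d", n);
    unlink(filename);
    if(-1 == link(TESTFILE, filename)) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      return 1;
    }
    // Open each hardlink by its own name so that fds[n] tracks link n
    fds[n] = open(filename, O_RDONLY);
    if(-1 == fds[n]) {
      fprintf(stderr, "FATAL line %d: %s\n", __LINE__, strerror(errno));
      return 1;
    }
  }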
  // Check that each fd reflects the specific hardlink it was opened with
  check();
  // Permute hardlinks to prove that each fd tracks renames to its specific
  // path and does NOT choose some random path identifying the inode
  for(n = 0; n<LINKS; n++) {
    sprintf(filename, TESTFILE "-%d", n);
    sprintf(filename2, TESTFILE "-%d-%d", n, n);
    rename(filename, filename2);
  }
  // Check that each fd tracked the specific hardlink it was opened with
  check();
  printf("\nIf no failures appeared above, then this Linux kernel indeed "
         "has its file descriptors track changes to the specific hardlink "
         "they were opened with, and it is NOT the case that they track "
         "the inode and just return any old path to it like on other "
         "operating systems.\n");
  return 0;
}
--- test-proc-fd-stability.c ---

Regarding the preceding comment about the Linux kernel's choice to mark fds referring to unlinked paths with " (deleted)", which is of course a perfectly valid filename suffix, I entirely agree that that is an unfortunate choice. An empty string, or having readlink("/proc/self/fd/x") simply fail with an Exxx, would have been a far better choice.

All the more reason for unlinkat() to be extended to be able to unlink the hardlink specifically opened by a file descriptor, race free of any changes made to that hard link's path by anyone else since then.

Comment 8 Pali Rohár 2018-10-28 11:29:57 UTC

(In reply to Niall Douglas from comment #7)
> I repeat once again ... track the hard link with which they were opened over
> time.

It seems that you did not catch what I wrote: it is not true, and the kernel does not always track the file path of a file descriptor correctly...

> ... a short program empirically proving that ...

To prove that something is always true, you need to show it for all possible inputs, not just for one. To disprove it, on the other hand, you need just one counterexample.

So here is a counterexample showing that the kernel does not keep the paths of file descriptors in a consistent state:

===

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>

int main() {

	int fd1, fd2;
	char procpath[PATH_MAX];
	char filepath[PATH_MAX];

	/* Create and open two temp files */
	fd1 = creat("/tmp/test1", 0644);
	fd2 = creat("/tmp/test2", 0644);

	/* Rename first file via rename syscall */
	if (rename("/tmp/test1", "/tmp/test1_renamed") != 0) {
		perror("rename /tmp/test1 --> /tmp/test1_renamed");
		return 1;
	}

	/* Rename second file via link+unlink syscalls */
	if (link("/tmp/test2", "/tmp/test2_renamed") != 0) {
		perror("link /tmp/test2 --> /tmp/test2_renamed");
		return 1;
	}
	if (unlink("/tmp/test2") != 0) {
		perror("unlink /tmp/test2");
		return 1;
	}

	/* Read kernel's path to first file descriptor */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd1);
	ssize_t len = readlink(procpath, filepath, sizeof(filepath) - 1);
	if (len < 0) {
		perror("readlink /proc/self/fd/...");
		return 1;
	}
	filepath[len] = '\0'; /* readlink() does not NUL-terminate */
	if (strcmp("/tmp/test1_renamed", filepath) != 0)
		printf("ERROR: Kernel does not track path to file correctly (expected: `%s', got: `%s'), file was not deleted, only renamed\n", "/tmp/test1_renamed", filepath);

	/* Read kernel's path to second file descriptor */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd2);
	len = readlink(procpath, filepath, sizeof(filepath) - 1);
	if (len < 0) {
		perror("readlink /proc/self/fd/...");
		return 1;
	}
	filepath[len] = '\0'; /* readlink() does not NUL-terminate */
	if (strcmp("/tmp/test2_renamed", filepath) != 0)
		printf("ERROR: Kernel does not track path to file correctly (expected: `%s', got: `%s'), file was not deleted, only renamed\n", "/tmp/test2_renamed", filepath);

	/* Cleanup */
	close(fd1);
	close(fd2);
	unlink("/tmp/test1_renamed");
	unlink("/tmp/test2_renamed");

	return 0;

}

===

Note that the renaming can also be done by an external program, over which the example above has no control.

So, to have a stable, predictable and secure way of unlinking files, a path to the file which is going to be unlinked must always be supplied to any unlink* call.

Otherwise the caller of the unlink* syscall would not know which file is going to be unlinked, and therefore it opens a possible security hole.

Comment 9 Niall Douglas 2018-10-28 23:34:30 UTC

> So here is counterexample to show that kernel does not track paths to file
> descriptors in consistent state:

Firstly thank you for the counterexample. It is always far more productive to base discussion on proofs rather than on claims.

Your counterexample behaves exactly as I would expect it to. Atomic rename has the fd track that rename as there is no ambiguity which new path the fd ought to have.

Non-atomic rename via link + unlink has the fd report the orphaned path as deleted. This is absolutely consistent with what one would expect. There is no way for any piece of code to know what to unambiguously do here (e.g. if I made two hard links, then unlinked the original, which of the two hard links should the system choose? It cannot. Therefore it's best to not go there).

Moreover, Linux behaves here exactly as Microsoft Windows behaves here. If I hard-link-via-HANDLE some inode to a new path, and then delete-by-HANDLE, I delete the original path, and querying that HANDLE for its current path returns that the file has been deleted. Similarly, if I atomic-rename-by-HANDLE, that HANDLE returns the new path. Exactly as Linux does.

I don't think anybody can reasonably do anything else.

> So to have stable, predictive and secure way for unlinking files, it is
> always needed to supply for any unlink* call path to file which is going to
> be unlinked.

Really, **no**. Unlinking by absolute path is always FAR more racy and prone to unintended data loss than unlinking by open fd.

> Otherwise caller of unlink* syscall would not know which file is going to be
> unlinked and therefore it opens a possible security hole.

Actually, the path may change between the time of inspection and the time of unlink.

But it doesn't matter. If you have write access to a fd, you have better ways of destroying data than unlinking the file entry.


FYI to Linux kernel devs, the paper proposing C++ standardisation of facilities which would make use of the enhancement can be found at http://wg21.link/P1031 (Low level file i/o library). We currently use a highly inefficient emulation on Linux instead, it currently adds a 6% penalty to unlinking files. Microsoft Windows has no penalty, as it implements this facility already.

Comment 10 Pali Rohár 2018-10-29 08:35:35 UTC

(In reply to Niall Douglas from comment #9)
> > So here is counterexample to show that kernel does not track paths to file
> > descriptors in consistent state:
> 
> Firstly thank you for the counterexample. It is always far more productive
> to base discussion on proofs rather than on claims.
> 
> Your counterexample behaves exactly as I would expect it to. Atomic rename
> has the fd track that rename as there is no ambiguity which new path the fd
> ought to have.
> 
> Non-atomic rename via link + unlink has the fd report the orphaned path as
> deleted. This is absolutely consistent with what one would expect. There is
> no way for any piece of code to know what to unambiguously do here (e.g. if
> I made two hard links, then unlinked the original, which of the two hard
> links should the system choose? It cannot. Therefore it's best to not go
> there).

The problem is that the rename can be done by an external application over which your application (the one that is going to do the unlink) has no control. That external application can use whatever it wants (possibly non-atomic calls, etc.), yet you want your own application to be deterministic and stable.

> Moreover, Linux behaves here exactly as Microsoft Windows behaves here. If I
> hard-link-via-HANDLE some inode to a new path, and then delete-by-HANDLE, I
> delete the original path, and querying that HANDLE for its current path
> returns that the file has been deleted. Similarly, if I
> atomic-rename-by-HANDLE, that HANDLE returns the new path. Exactly as Linux
> does.
> 
> I don't think anybody can reasonably do anything else.
> 
> > So to have stable, predictive and secure way for unlinking files, it is
> > always needed to supply for any unlink* call path to file which is going to
> > be unlinked.
> 
> Really, **no**. Unlinking by absolute path is always FAR more racy and prone
> to unintended data loss than unlinking by open fd.

That is why I'm suggesting Ian's proposal from comment #1: a syscall which takes both an (absolute) path and a file descriptor, and atomically checks that the file descriptor refers to the inode at that path.

> > Otherwise caller of unlink* syscall would not know which file is going to
> > be unlinked and therefore it opens a possible security hole.
> 
> Actually, the path may change between the time of inspection and the time of
> unlink.

If this inspection is done by kernel and atomically, plus userspace provides both path and file descriptor, then there is no race and it cannot change.

> But it doesn't matter. If you have write access to a fd, you have better
> ways of destroying data than unlinking the file entry.

Yes, whoever has write access can do anything. But imagine correctly written applications which do not want to destroy data and which want stable and predictable behavior.

> FYI to Linux kernel devs, the paper proposing C++ standardisation of
> facilities which would make use of the enhancement can be found at
> http://wg21.link/P1031 (Low level file i/o library). We currently use a
> highly inefficient emulation on Linux instead, it currently adds a 6%
> penalty to unlinking files. Microsoft Windows has no penalty, as it
> implements this facility already.

Comment 11 Alejandro Colomar 2024-07-10 11:18:51 UTC

Maybe it could be implemented by allowing unlink(procpath).  How about allowing the following program to do the right thing?


$ cat unlink.c 
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>

int
main(void)
{
	int   fd;
	char  procpath[PATH_MAX];

	fd = creat("/tmp/test1", 0644);
	if (fd == -1)
		err(EXIT_FAILURE, "creat(\"%s\")", "/tmp/test1");

	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);

	if (unlink(procpath) == -1)
		err(EXIT_FAILURE, "unlink(\"%s\")", procpath);

	if (close(fd) == -1)
		err(EXIT_FAILURE, "close(\"%d\")", fd);

	return 0;

}

$ cc -Wall -Wextra unlink.c 
$ ./a.out 
a.out: unlink("/proc/self/fd/3"): Operation not permitted

Comment 12 Pali Rohár 2024-07-21 22:10:41 UTC

This still has a consistency problem and can cause a race condition, for example when another process renames the file between the open/create and unlink calls in your example.

I think the only way to prevent the race condition and get consistent behavior is an unlink syscall which takes both a file descriptor and a path (or, instead of a full path, a file descriptor of the directory plus an entry name).

Comment 13 Alejandro Colomar 2024-07-22 06:00:21 UTC

(In reply to Pali Rohár from comment #12)
> This has still a problem with consistency and can cause race condition. For
> example when other process rename the file between open/create and unlink
> calls in your example.

The /proc/pid/fd/N entries affect the open file, regardless of renames or unlinks.

I don't see how a race would affect it.

Here's an example program:

```
$ cat unlink.c 
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>

int
main(void)
{
	int    fd;
	char   old[] = "/tmp/test1";
	char   new[] = "/tmp/test2";
	char   procpath[PATH_MAX];
	char   real[PATH_MAX];
	pid_t  pid;

	fd = creat(old, 0644);
	if (fd == -1)
		err(EXIT_FAILURE, "creat(\"%s\")", old);

	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);

	if (realpath(procpath, real) == NULL)
		err(EXIT_FAILURE, "realpath(\"%s\")", procpath);
	printf("realpath: %s\n", real);

	pid = fork();
	switch (pid) {
	case -1:
		err(EXIT_FAILURE, "fork");
	case 0:
		if (rename(old, new) == -1)
			err(EXIT_FAILURE, "rename");
		exit(EXIT_SUCCESS);
	default:
		sleep(1);
		break;
	}		

	if (realpath(procpath, real) == NULL)
		err(EXIT_FAILURE, "realpath(\"%s\")", procpath);
	printf("realpath: %s\n", real);

	if (unlink(procpath) == -1)
		err(EXIT_FAILURE, "unlink(\"%s\")", procpath);

	if (close(fd) == -1)
		err(EXIT_FAILURE, "close(%d)", fd);

	return 0;

}
```

```
alx@debian:~/tmp/linux$ cc -Wall -Wextra unlink.c 
alx@debian:~/tmp/linux$ ./a.out 
realpath: /tmp/test1
realpath: /tmp/test2
a.out: unlink("/proc/self/fd/3"): Operation not permitted
```

The kernel is able to track that the file has been moved, and so it could be able to unlink "/tmp/test2" when the program calls unlink("/proc/self/fd/3").

Comment 14 Alejandro Colomar 2024-07-22 06:13:03 UTC

Or maybe, to avoid having to snprintf(3) a string, allow also unlinkat(fd, "", AT_EMPTY_PATH);



```
alx@debian:~/tmp/linux$ cat unlink.c 
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	int    fd;
	char   old[] = "/tmp/test1";
	char   new[] = "/tmp/test2";
	char   procpath[PATH_MAX];
	char   real[PATH_MAX];
	pid_t  pid;

	fd = creat(old, 0644);
	if (fd == -1)
		err(EXIT_FAILURE, "creat(\"%s\")", old);

	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);

	if (realpath(procpath, real) == NULL)
		err(EXIT_FAILURE, "realpath(\"%s\")", procpath);
	printf("realpath: %s\n", real);

	pid = fork();
	switch (pid) {
	case -1:
		err(EXIT_FAILURE, "fork");
	case 0:
		if (rename(old, new) == -1)
			err(EXIT_FAILURE, "rename");
		exit(EXIT_SUCCESS);
	default:
		sleep(1);
		break;
	}		

	if (realpath(procpath, real) == NULL)
		err(EXIT_FAILURE, "realpath(\"%s\")", procpath);
	printf("realpath: %s\n", real);

	if (unlink(procpath) == -1)
		warn("unlink(\"%s\")", procpath);

	if (unlinkat(fd, "", AT_EMPTY_PATH) == -1)
		warn("unlinkat(%d, \"\", AT_EMPTY_PATH)", fd);

	if (close(fd) == -1)
		err(EXIT_FAILURE, "close(%d)", fd);

	return 0;

}
alx@debian:~/tmp/linux$ cc -Wall -Wextra unlink.c 
alx@debian:~/tmp/linux$ ./a.out 
realpath: /tmp/test1
realpath: /tmp/test2
a.out: unlink("/proc/self/fd/3"): Operation not permitted
a.out: unlinkat(3, "", AT_EMPTY_PATH): Invalid argument
```

Comment 15 Pali Rohár 2024-07-25 13:09:10 UTC

(In reply to Alejandro Colomar from comment #13)
> The /proc/pid/fd/N entries affect the open file, regardless of renames or
> unlinks.
> 
> I don't see how a race would affect it.

I already showed an example with that race.


I will try to show another example:
Process A wants to work with file /tmp/file1 and at the end it wants to remove/unlink /tmp/file1. Process B is evil and tries to move every file matching the pattern /tmp/file* to the directory /tmp/subdir/.

If process A would use your approach of:
fd = open("/tmp/file1");
unlink("/proc/self/fd/<fd>");

then it can result in unlinking the file /tmp/subdir/something because of the race condition with process B.


Another example: /proc/self/fd/ does not track the path which is being renamed by the link+unlink calls. Calling

fd = open("/tmp/file");
link("/tmp/file", "/tmp/file2");
unlink("/tmp/file");

will result in /proc/self/fd/<fd> pointing to a broken path, so your proposed unlink("/proc/self/fd/<fd>") would not work in this case.


Using funlinkat() proposal from comment #1 which will take _both_ path and file descriptor will solve both problems.

Comment 16 Alejandro Colomar 2024-07-25 13:38:52 UTC

(In reply to Pali Rohár from comment #15)
> (In reply to Alejandro Colomar from comment #13)
> > The /proc/pid/fd/N entries affect the open file, regardless of renames or
> > unlinks.
> > 
> > I don't see how a race would affect it.
> 
> I already showed an example with that race.
> 
> 
> I will try to show another example:
> Process A wants to work with file /tmp/file1 and at the end it wants to
> remove/unlink /tmp/file1. Process B is evil and it tries to move every file
> with pattern /tmp/file* to dir /tmp/subdir/.
> 
> If process A would use your approach of:
> fd = open("/tmp/file1");
> unlink("/proc/self/fd/<fd>");
> 
> then it can result in unlinking file /tmp/subdir/something because of race
> condition with process B.

This is expected behavior.  I wouldn't call it a race.

While we're unlinking a different path, it's still the same underlying hardlink.

If I unlink via a file descriptor, I don't want to unlink the original path, but whatever path the fd points to at the time of removal.  That's the contract of *at() functions with AT_EMPTY_PATH.

> Another example: /proc/self/fd/ does not track the path which is being
> renamed by the link+unlink calls. Calling
> 
> fd = open("/tmp/file");
> link("/tmp/file", "/tmp/file2");
> unlink("/tmp/file");
> 
> will result in /proc/self/fd/<fd> pointing to broken path, so your
> proposed unlink("/proc/self/fd/<fd>") would not work in this case.

That's because the program created a new hardlink, and already unlinked the one it was holding via fd.  Since the hardlink pointed to by fd does not exist anymore, it cannot be unlinked.  This is in full control of the process itself, so it should not be a problem.

Consider that those other hardlinks could exist before the process even started, so it's not a problem that the program should care about.

> Using funlinkat() proposal from comment #1 which will take _both_ path and
> file descriptor will solve both problems.

My proposal with unlinkat() has the following semantics:

-  Unlink the path to which this fd currently points.

Your proposal with a hypothetical funlinkat() has the following semantics:

-  Unlink the specified path, but only if it corresponds to the file referred to by this fd.

Why would you want to unlink a file via a file descriptor only if its path has not been changed?  Why keep it if the path has changed?

Comment 17 Alejandro Colomar 2024-07-25 13:50:14 UTC

Or, if you want to make sure the path is the original one (or whatever you specify), you could justify adding both APIs.

But your API wouldn't allow unlinking by fd regardless of renames, which is bad.

Consider the following example:

A server creates a Unix socket at startup.  It unlinks it when it ends.  A sysadmin might have moved the socket, to run another copy of the server.  The server wants to unlink its own socket, wherever it has been moved; it must not unlink the original path.

Comment 18 Pali Rohár 2024-07-25 14:53:41 UTC

> Why would you want to unlink a file via a file descriptor only if its path
> has not been changed?  Why keep it if the path has changed?

I want to unlink a path only if its file content or file metadata matches some conditions. This needs open, read/fstat, and then an unlink which ties the path given to open to the fd used for read/fstat.

> If I unlink via a file descriptor, I don't want to unlink the original path,
> but whatever path the fd points to at the time of removal.  That's the
> contract of *at() functions with AT_EMPTY_PATH.

This would work only if other processes did atomic renames. If other processes did a non-atomic rename via link+unlink, it would not work. And other processes are free to use either atomic or non-atomic renames.

But ok, now I understand your use case and I see that this functionality makes sense. We are basically talking about two different use cases which need some kind of fd-unlink syscall.

> Or, if you want to make sure the path is the original one (or whatever you
> specify), you could justify adding both APIs.

I think this could be the best choice: let the caller specify whether it wants to unlink the path resolved from "/proc/self/fd/<fd>" or some other custom path (which will be atomically validated to be the same inode as the passed fd).

> Consider the following example:
> A server creates a Unix socket at startup.  It unlinks it when it ends.  A
> sysadmin might have moved the socket, to run another copy of the server.  The
> server wants to unlink its own socket, wherever it has been moved; it must
> not unlink the original path.

This is a good example. The only problem is that the move done by the external process has to be atomic, not done via link+unlink.