Consider the following C program, which copies /proc/self/mountinfo to a memfd using sendfile:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <assert.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>

    int main()
    {
        int mntinfo_fd = open("/proc/self/mountinfo", O_RDONLY | O_CLOEXEC);
        assert(!(mntinfo_fd < 0));

        int memfd = memfd_create("mountinfo", MFD_CLOEXEC);
        assert(!(memfd < 0));

        ssize_t copied;
    again:
        copied = sendfile(memfd, mntinfo_fd, NULL, 0x7ffff000);
        if (copied < 0) {
            if (errno == EINTR)
                goto again;

            fprintf(stderr, "Failed to copy \"/proc/self/mountinfo\"\n");
            return -1;
        }

        return 0;
    }

In Linux 5.9 this succeeds (prints nothing); in Linux 5.10-rc1, however, it fails (prints 'Failed to copy "/proc/self/mountinfo"').

This program is a reduced test case derived from some LXC code, as can be seen here: https://github.com/lxc/lxc/blob/8ddf34f7a037325565b8cf8ff995cbf573f9932e/src/lxc/conf.c#L2987 .

On my system (current Arch Linux, LXC 4.0.5), this code runs when trying to start a container, and the container fails to start (on 5.10-rc1 but not on 5.9) due to this problem:

    $ lxc-create -t download -n talpine -- -d alpine -r 3.12 -a amd64
    [...]
    $ lxc-start -n talpine -F
    lxc-start: talpine: conf.c: turn_into_dependent_mounts: 3012 Invalid argument - Failed to copy "/proc/self/mountinfo"
    [...]
    lxc-start: talpine: tools/lxc_start.c: main: 308 The container failed to start

I bisected this problem back to commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7 "fs: don't allow splice read/write without explicit ops".

As far as I can tell, sendfile relies on splice operations. Since this commit removes the splice fallback for files that don't implement splice operations themselves, such as /proc/self/mountinfo and some other procfs files, this failure is an expected consequence of that commit.

Could some kind of fallback or proper implementation for procfs files be re-implemented in the kernel?
Or should user space applications not be expected to rely on sendfile in this case and use a read/write loop instead?
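For reference, the read/write alternative could look roughly like the sketch below. It is only an illustration, not the code LXC uses; the names write_all, copy_fd_rw and snapshot_mountinfo and the 4096-byte buffer are assumptions made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes, retrying on EINTR
 * and on short writes. Returns 0 on success, -1 with errno set. */
int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Hypothetical helper: copy from in_fd to out_fd until EOF using
 * plain read()/write(), so no splice support is needed on in_fd.
 * Returns the number of bytes copied, or -1 with errno set. */
ssize_t copy_fd_rw(int out_fd, int in_fd)
{
	char buf[4096]; /* arbitrary buffer size chosen for the sketch */
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total; /* EOF */
		if (write_all(out_fd, buf, (size_t)n) < 0)
			return -1;
		total += n;
	}
}

/* Hypothetical helper: snapshot /proc/self/mountinfo into a memfd.
 * Returns the memfd rewound to offset 0, or -1 on failure. */
int snapshot_mountinfo(void)
{
	int in_fd = open("/proc/self/mountinfo", O_RDONLY | O_CLOEXEC);
	if (in_fd < 0)
		return -1;

	int memfd = memfd_create("mountinfo", MFD_CLOEXEC);
	if (memfd < 0) {
		close(in_fd);
		return -1;
	}

	ssize_t copied = copy_fd_rw(memfd, in_fd);
	close(in_fd);
	if (copied < 0 || lseek(memfd, 0, SEEK_SET) < 0) {
		close(memfd);
		return -1;
	}
	return memfd;
}
```

A caller would use the descriptor returned by snapshot_mountinfo() in place of the memfd filled by sendfile in the test case above, and close it when done.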
Actually, if my understanding is correct, doesn't the sendfile(2) man page imply that this usage by LXC was not supposed to work?

    "The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket)."
It looks like this was found independently: splice support for some /proc files was reimplemented in commit 6b2c4d52fd38e676fc9ab5d9241a056de565eb1a, but not for /proc/___/mountinfo.
LXC issue: https://github.com/lxc/lxc/issues/3580

They changed the code not to use sendfile: https://github.com/lxc/lxc/commit/a39fc34bd6842ad1adc6144391071d8b1078667e

I'll keep this bug open for now. Other projects may be affected as well, so I'd like to see whether this is worth solving at the kernel level.
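For other projects that want to keep sendfile where it works, one possible user-space workaround is to fall back to read/write when the kernel rejects the call. The sketch below assumes the failure shows up as EINVAL (the "Invalid argument" in the lxc-start output above) and also treats ENOSYS as a reason to fall back; the function name copy_with_fallback is hypothetical and this is not taken from the LXC fix.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy from in_fd to out_fd until EOF.
 * Tries sendfile() first and switches to read()/write() for the
 * rest of the copy if sendfile() fails with EINVAL or ENOSYS.
 * Because the offset argument is NULL, sendfile() advances the
 * file offset of in_fd, so the fallback resumes where it stopped.
 * Returns the number of bytes copied, or -1 with errno set. */
ssize_t copy_with_fallback(int out_fd, int in_fd)
{
	char buf[4096]; /* arbitrary buffer size chosen for the sketch */
	ssize_t total = 0;
	int use_sendfile = 1;

	for (;;) {
		ssize_t n;

		if (use_sendfile) {
			n = sendfile(out_fd, in_fd, NULL, 0x7ffff000);
			if (n < 0 && (errno == EINVAL || errno == ENOSYS)) {
				use_sendfile = 0; /* in_fd lacks splice support */
				continue;
			}
		} else {
			n = read(in_fd, buf, sizeof(buf));
			/* Flush everything that was read, handling short writes. */
			for (ssize_t off = 0; n > 0 && off < n; ) {
				ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
				if (w < 0) {
					if (errno == EINTR)
						continue;
					return -1;
				}
				off += w;
			}
		}

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total; /* EOF */
		total += n;
	}
}
```

With this approach the fast path is kept on kernels where sendfile works for the given file, and the copy still completes on kernels where it does not.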
5.10.3 is still affected.
Looks like it got fixed in linux-mainline: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=14e3e989f6a5d9646b6cf60690499cc8bdc11f7d

I'm building a kernel now to test, but it should work since I tested a patch that looked like this some time ago.
Fixed in Linux 5.11-rc1.
Fixed in Linux 5.10.4.