Bug 199945

Summary: getxattr silently skips user namespace fixup if size % 8 != 4
Product: File System Reporter: Colin Watson (cjwatson)
Component: VFSAssignee: fs_vfs
Status: NEW ---    
Severity: normal CC: christian
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.15.17 Subsystem:
Regression: No Bisected commit-id:
Attachments: getxattr: use correct xattr length

Description Colin Watson 2018-06-06 17:02:57 UTC
(Apologies if this isn't considered VFS; the bug is in fs/ but not in any particular filesystem implementation, so this seemed the best component to pick.)

When running in a container with a user namespace, if you call getxattr with name = "system.posix_acl_access" and size % 8 != 4, then getxattr silently skips the user namespace fixup that it normally does for POSIX ACLs and returns un-fixed-up data.  This means that if you try to copy POSIX ACLs between files using getxattr and setxattr and don't pass a size that meets this restriction, the setxattr will typically fail with EINVAL.  This restriction is undocumented and extremely strange: getxattr should either return correct data or fail with an error code, and in this case I don't think it makes sense for it to fail.

This happens to me in practice because Python >= 3.3 has a getxattr wrapper that calls the getxattr syscall with size==128 the first time round its loop.  As a result, when these Python versions try to copy the extended attributes of a file owned by a userns-mapped uid (which I found was happening inside pip), they get EINVAL from setxattr.  I considered trying to get this fixed in Python, but (1) to fix it there it would have to make assumptions about this undocumented behaviour which might change again in future, (2) I'd have to backport it to several old Python versions to be useful to me, and (3) this really feels like a mistake in the kernel and not something that should be worked around in userspace.

Here's a fairly minimal test case, assuming that you already have a container with user namespaces set up such that root inside the container is mapped to some non-0 uid outside the container (in my case 100000).  I didn't bother with decoding for the purpose of this test case: it's easy enough to decode by eye with reference to include/uapi/linux/posix_acl_xattr.h.

# touch foo
# cat t.c
#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>
#include <attr/xattr.h>

int main(int argc, void **argv) {
        char *buffer;
        int size;

        if (argc < 1) { fputs("need size\n", stderr); exit(2); }
        size = atoi(argv[1]);
        buffer = malloc(size);
        getxattr("foo", "system.posix_acl_access", buffer, size);
        exit(0);
}
# gcc t.c -o t
# strace -etrace=getxattr ./t 128
getxattr("foo", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x06\x00\xff\xff\xff\xff\x02\x00\x07\x00\xa0\x86\x01\x00\x04\x00\x05\x00\xff\xff\xff\xff\x10\x00\x06\x00\xff\xff\xff\xff \x00\x04\x00\xff\xff\xff\xff", 128) = 44
+++ exited with 0 +++
# strace -etrace=getxattr ./t 132
getxattr("foo", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x06\x00\xff\xff\xff\xff\x02\x00\x07\x00\x00\x00\x00\x00\x04\x00\x05\x00\xff\xff\xff\xff\x10\x00\x06\x00\xff\xff\xff\xff \x00\x04\x00\xff\xff\xff\xff", 132) = 44
+++ exited with 0 +++

Note the e_id field of the second posix_acl_xattr_entry struct, i.e. bytes 16-19.  With size==128, it's (LE) 0x000186a0 == 100000.  With size==132, it's 0.  The latter is correct.

This happens because getxattr calls posix_acl_fix_attr_to_user(kvalue, size), which calls posix_acl_fix_xattr_userns, which checks that posix_acl_xattr_count(size) returns > 0.  But that's checking the maximum buffer size that the caller passed to getxattr, not the actual length of the attribute value that's about to be returned.

I don't have a kernel development setup right now since it isn't my usual field, but I think simply changing "posix_acl_fix_xattr_to_user(kvalue, size)" in getxattr to "posix_acl_fix_xattr_to_user(kvalue, error)" (noting that "error" at this point is in fact the resulting attribute value length) would fix this.

I've seen this behaviour with Ubuntu's 4.15.0-22.24 kernel on amd64, which is based on upstream 4.15.17, but I've also traced through the code on mainline and I'm pretty certain that nothing relevant is different there.  Let me know if you need me to explicitly try mainline, though.
Comment 1 Christian Brauner 2018-06-06 22:47:43 UTC
I think the analysis is correct and I'm testing a patch.
Comment 2 Christian Brauner 2018-06-07 10:38:52 UTC
A full reproducer:

touch foo

setfacl -m user:0:rwx foo


Then compile and run:

```
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <attr/xattr.h>

/* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
int main(int argc, void **argv)
{
	ssize_t ret1, ret2;
	char buf1[128], buf2[132];
	int fret = EXIT_SUCCESS;
	char *file;

	if (argc < 2) {
		fprintf(stderr, "Please specify a file with \"system.posix_acl_access\" permissions set\n");
		_exit(EXIT_FAILURE);
	}
	file = argv[1];

	ret1 = getxattr(file, "system.posix_acl_access", buf1, sizeof(buf1));
	if (ret1 < 0) {
		fprintf(stderr, "%s - Failed to retrieve \"system.posix_acl_access\" from \"%s\"\n", strerror(errno), file);
		_exit(EXIT_FAILURE);
	}

	ret2 = getxattr(file, "system.posix_acl_access", buf2, sizeof(buf2));
	if (ret2 < 0) {
		fprintf(stderr, "%s - Failed to retrieve \"system.posix_acl_access\" from \"%s\"\n", strerror(errno), file);
		_exit(EXIT_FAILURE);
	}

	if (ret1 != ret2) {
		fprintf(stderr, "The value of \"system.posix_acl_access\" for file \"%s\" changed between two successive calls\n", file);
		_exit(EXIT_FAILURE);
	}

	for (ssize_t i = 0; i < ret2; i++) {
		if (buf1[i] == buf2[i])
			continue;

		fprintf(stderr, "Unexpected different in byte %zd: %02x != %02x\n", i, buf1[i], buf2[i]);
		fret = EXIT_FAILURE;
	}

	if (fret == EXIT_SUCCESS)
		fprintf(stderr, "Test passed\n");
	else
		fprintf(stderr, "Test failed\n");

	_exit(fret);
}
```
Comment 3 Christian Brauner 2018-06-07 10:40:07 UTC
On a kernel without my patch applied:

root@c1:/# ./t
Unexpected different in byte 16: ffffffa0 != 00
Unexpected different in byte 17: ffffff86 != 00
Unexpected different in byte 18: 01 != 00


On a kernel with my patch applied:

root@c1:~# ./t
Test passed
Comment 4 Christian Brauner 2018-06-07 10:43:12 UTC
Created attachment 276369 [details]
getxattr: use correct xattr length

When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Comment 5 Colin Watson 2018-06-07 10:46:57 UTC
Thanks for picking this up!  Could the patch reasonably go to stable too?
Comment 6 Christian Brauner 2018-06-07 11:11:46 UTC
Thanks for reporting this and tracking it down, Colin. I'll send the patch upstream now and we'll see if they agree with the analysis. Once that is settled I'll ask for this to be backported to stable.