Bug 120671 - missing info about userns restrictions
Status: RESOLVED CODE_FIX
Alias: None
Product: Documentation
Classification: Unclassified
Component: man-pages
Hardware: All Linux
Importance: P1 normal
Assignee: documentation_man-pages@kernel-bugs.osdl.org
Reported: 2016-06-20 13:22 UTC by Michał Zegan
Modified: 2016-07-07 12:46 UTC

Regression: No

Description Michał Zegan 2016-06-20 13:22:41 UTC
I have noticed that some information about user namespaces that could be useful is missing from the user_namespaces man page.
For example: which capabilities have no effect in a user namespace, which filesystems can or cannot be mounted in a mount namespace associated with the user namespace, other restrictions that apply, and perhaps also current security issues with user namespaces (optional).
Comment 1 Michael Kerrisk 2016-06-20 14:26:56 UTC
(In reply to Michał Zegan from comment #0)
> I have noticed that some information about user namespaces that could be
> useful is missing from the user_namespaces man page.
> For example: which capabilities have no effect in a user namespace, which
> filesystems can or cannot be mounted in a mount namespace associated with
> the user namespace, other restrictions that apply, and perhaps also current
> security issues with user namespaces (optional).

Hello Michał,

Unless you can give me something more specific, it's very hard to do much with this bug report. For example, which capabilities do you find (or think) have no effect in a user NS? Which filesystems have you had problems mounting?

Thanks,

Michael
Comment 2 Michał Zegan 2016-06-20 14:36:32 UTC
Well... For example, CAP_SYS_MODULE does not work in a user namespace, does it? CAP_MKNOD did not work in a userns the last time I checked, but I may be wrong. As for mounting filesystems, there is probably a whitelist. If I recall correctly, you cannot mount any block-based filesystem such as ext4 inside a userns; you have no permission to mount most filesystems, apart from tmpfs, proc, and the like. There may be other restrictions I am not aware of, but those are the ones I know of, unless I am wrong. It would help to clarify some things that are simply not covered in that man page.
Comment 3 Michał Zegan 2016-06-20 14:39:45 UTC
To clarify: those are security restrictions, not possible bugs.
Comment 4 Michael Kerrisk 2016-06-20 20:18:59 UTC
(In reply to Michał Zegan from comment #2)
> Well... For example, CAP_SYS_MODULE does not work in a user namespace, does
> it? CAP_MKNOD did not work in a userns the last time I checked, but I may be
> wrong.

Yes, but that is conveyed in a sentence in user_namespaces(7):

       Having a capability inside a user namespace permits  a  process
       to   perform   operations  (that  require  privilege)  only  on
       resources governed by that namespace.

[1] Loading a kernel module and creating a device node are not governed by any of the 7 current namespace types. What I mean is that this is not a question of whether particular capabilities work in a namespace, but rather of which operations and abilities are associated with the various namespaces.

> As for mounting filesystems, there is probably a whitelist. If I recall
> correctly, you cannot mount any block-based filesystem such as ext4 inside
> a userns; you have no permission to mount most filesystems, apart from
> tmpfs, proc, and the like.

[2] This isn't correct as far as I know, but if you can show me an interesting counterexample...

> There may be other restrictions I am not aware of, but those are the ones
> I know of, unless I am wrong. It would help to clarify some things that are
> simply not covered in that man page.

So far, I don't see a real problem in the man page(s). (But maybe the explanation of the first point could be more detailed.)
Comment 5 Michał Zegan 2016-06-20 20:32:34 UTC
Yes, what I mean is just to make some things more detailed in case someone wonders.
About filesystems: you can try mounting an ext4 filesystem after unsharing both the user and mount namespaces; I am almost sure it will fail. I mean mounting the filesystem from inside the namespace. I may test that too when I have time, to be sure, but I am almost certain that is the case, especially since mounting an arbitrary filesystem could be a security risk because UIDs are not shifted.
Comment 6 Michael Kerrisk 2016-06-21 08:48:18 UTC
(In reply to Michał Zegan from comment #5)
> Yes, what I mean is just to make some things more detailed in case someone
> wonders.

Fair enough. See the new text below, which I've added to the man page.

> About filesystems: you can try mounting an ext4 filesystem after unsharing
> both the user and mount namespaces; I am almost sure it will fail. I mean
> mounting the filesystem from inside the namespace. I may test that too when
> I have time, to be sure, but I am almost certain that is the case,
> especially since mounting an arbitrary filesystem could be a security risk
> because UIDs are not shifted.

When you've tested and confirmed that there's an issue, please reopen this bug if needed. For now, I consider the problem to be addressed by the new text below, so I'll close.

Cheers,

Michael


       Having  a  capability inside a user namespace permits a process
       to  perform  operations  (that  require  privilege)   only   on
       resources governed by that namespace.  In other words, having a
       capability in a user namespace permits  a  process  to  perform
       privileged   operations  on  resources  that  are  governed  by
       (nonuser) namespaces associated with the  user  namespace  (see
       the next subsection).  On the other hand, there are many privi‐
       leged operations that affect resources that are not  associated
       with  any namespace type, for example, changing the system time
       (governed by CAP_SYS_TIME), loading a kernel  module  (governed
       by   CAP_SYS_MODULE),   and  creating  a  device  (governed  by
       CAP_MKNOD).  Only a process with privileges in the initial user
       namespace can perform such operations.
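
As an illustration of the point above, here is a minimal Python sketch (not part of the man page) that decodes the CapEff mask from /proc/PID/status. The helper names are hypothetical; the bit numbers are those defined in <linux/capability.h>. A shell started with "unshare -rU" typically reports all of these bits as set, yet the operations named in the text still fail there, because the capabilities apply only to resources governed by that namespace.

```python
# Bit numbers from <linux/capability.h>, limited to the capabilities
# discussed in this thread. The helper names below are illustrative only.
CAP_BITS = {
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_TIME": 25,
    "CAP_MKNOD": 27,
}


def effective_mask(status_text):
    """Extract the CapEff hex mask from the text of /proc/PID/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def held_caps(mask):
    """Return the names (sorted) of the known capabilities set in mask."""
    return sorted(name for name, bit in CAP_BITS.items() if mask & (1 << bit))
```

To inspect the current process, pass the contents of /proc/self/status to effective_mask() and the result to held_caps(). Seeing CAP_SYS_TIME in the output inside a noninitial user namespace says nothing about whether the system time can actually be changed from there.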
Comment 7 Michał Zegan 2016-06-21 09:12:08 UTC
I will test filesystem mounting just to be sure.
I will also test CAP_MKNOD to confirm that it really cannot be used; my information may be outdated, unless you have verified it yourself. The clarification is good enough, I believe.
Comment 8 Michał Zegan 2016-06-21 09:25:01 UTC
Reopening because I have confirmed that filesystems cannot be mounted, at least ext2. It would be useful to have a list of the filesystems that are mountable, but I do not know the kernel well enough to read the sources, so I cannot write it myself. I only know of proc, devpts (?), tmpfs, and cgroupv2, the last at least if cgroup namespaces are enabled. Everything I say needs to be verified to make sure I am not wrong. Someone should also look for any other restrictions that user namespaces impose, if they exist, because I do not know of any.
To show that I tested filesystem mounting properly, I am pasting my terminal session below, after switching to the English locale. :)
Logged in as root on my server and creating user, mount, and PID namespaces:
[root@webczatnet ~]# unshare -rUpmf
[root@webczatnet ~]# fallocate -l 1M test
[root@webczatnet ~]# losetup /dev/loop0 test
[root@webczatnet ~]# mke2fs /dev/loop0
mke2fs 1.42.13 (17-May-2015)
Discarding device blocks: done                            
Creating filesystem with 1024 1k blocks and 128 inodes
Allocating group tables: done                            
Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done
[root@webczatnet ~]# mkdir x
[root@webczatnet ~]# mount /dev/loop0 x
mount: permission denied
[root@webczatnet ~]# exit
logout
[root@webczatnet ~]# mount /dev/loop0 x
[root@webczatnet ~]# umount x
[root@webczatnet ~]# rmdir x
[root@webczatnet ~]# losetup -d /dev/loop0
[root@webczatnet ~]# rm test

One comment: I am not sure why I can run losetup from within the userns. Is it because I have read-write access to loop0, since root is mapped to root in the new userns, or does it check CAP_SYS_ADMIN in the new userns, or both?
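
For anyone repeating the test above with other filesystem types, the following Python sketch lists which types known to the running kernel are block-based. It relies on the fact that /proc/filesystems prefixes types that need no block device with "nodev"; the function name is hypothetical. Note the limits of this sketch: "nodev" alone does not mean a type can be mounted from inside a userns, so this only identifies the block-based candidates that would be expected to fail like ext2 did.

```python
def block_based_filesystems(proc_filesystems_text):
    """Return filesystem types that require a block device.

    Each line of /proc/filesystems is either "nodev<TAB>type" or
    "<TAB>type"; lines without the "nodev" marker are block-based.
    """
    block_based = []
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        if len(fields) == 1:  # no "nodev" marker on this line
            block_based.append(fields[0])
    return block_based
```

On a live system the input would be the contents of /proc/filesystems, read with open("/proc/filesystems").read().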
Comment 9 Michael Kerrisk 2016-06-21 11:54:47 UTC
(In reply to Michał Zegan from comment #8)
> Reopening because I have confirmed that filesystems cannot be mounted, at
> least ext2. It would be useful to have a list of the filesystems that are
> mountable, but I do not know the kernel well enough to read the sources, so
> I cannot write it myself. I only know of proc, devpts (?), tmpfs, and
> cgroupv2, the last at least if cgroup namespaces are enabled. Everything I
> say needs to be verified to make sure I am not wrong. Someone should also
> look for any other restrictions that user namespaces impose, if they exist,
> because I do not know of any.

Ahhh -- now I'm with you. I was a bit confused in my thinking before. Searching for FS_USERNS_MOUNT tells us which filesystems can be mounted with CAP_SYS_ADMIN in a (noninitial) userns. I added the following text to the page:

       Holding  CAP_SYS_ADMIN  within  a  (noninitial)  user namespace
       allows the creation of bind mounts, and mounting of the follow‐
       ing types of filesystems:

           * /proc (since Linux 3.8)
           * /sys (since Linux 3.8)
           * devpts (since Linux 3.9)
           * tmpfs (since Linux 3.9)
           * ramfs (since Linux 3.9)
           * mqueue (since Linux 3.9)
           * bpf (since Linux 4.4)

       Note however, that mounting block-based filesystems can be done
       only by a process that holds CAP_SYS_ADMIN in the initial  user
       namespace.
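
The list above can be turned into a small lookup, sketched here in Python. The table copies the kernel versions from the text; "proc" and "sysfs" are the filesystem type names behind /proc and /sys. The function names are hypothetical, and the sketch deliberately covers only the types listed here, so anything else (including every block-based filesystem) is reported as not mountable.

```python
import re

# Minimum kernel version at which each type can be mounted with
# CAP_SYS_ADMIN in a noninitial user namespace, per the list above.
USERNS_MOUNTABLE_SINCE = {
    "proc": (3, 8),
    "sysfs": (3, 8),
    "devpts": (3, 9),
    "tmpfs": (3, 9),
    "ramfs": (3, 9),
    "mqueue": (3, 9),
    "bpf": (4, 4),
}


def parse_release(release):
    """Turn a release string such as '4.7-rc2' or '4.6.3' into (4, 7) or (4, 6)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def mountable_in_userns(fstype, release):
    """True if the list above says fstype is mountable on this kernel."""
    since = USERNS_MOUNTABLE_SINCE.get(fstype)
    return since is not None and parse_release(release) >= since
```

The release string would normally come from os.uname().release. The answer reflects only the documented list, not what a particular kernel build actually allows.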

> One comment: I am not sure why I can run losetup from within the userns. Is
> it because I have read-write access to loop0, since root is mapped to root
> in the new userns, or does it check CAP_SYS_ADMIN in the new userns, or
> both?

Not sure. But if you work out all the details, let me know.

Thanks,

Michael
Comment 10 Michał Zegan 2016-06-21 14:15:49 UTC
I believe you can mount cgroup2 if cgroup namespaces are enabled and in use. Could you also verify whether that applies to the cgroup v1 filesystem?
About the text you proposed: can you actually have CAP_SYS_ADMIN in both user namespaces at the same time? I mean in the initial one and a noninitial one?
Comment 11 Michael Kerrisk 2016-06-21 19:56:44 UTC
(In reply to Michał Zegan from comment #10)
> I believe you can mount cgroup2 if cgroup namespaces are enabled and in
> use. Could you also verify whether that applies to the cgroup v1 filesystem?

Could you please test and let me know? (My quick test suggests not.)

> About the text you proposed: can you actually have CAP_SYS_ADMIN in both
> user namespaces at the same time? I mean in the initial one and a
> noninitial one?

Read the user_namespaces(7) man page. I believe it answers this question.
Comment 12 Michał Zegan 2016-06-21 20:02:29 UTC
I have no suitable kernel for testing. It may become possible in a while, though.
Comment 13 Michael Kerrisk 2016-07-05 09:23:56 UTC
(In reply to Michael Kerrisk from comment #11)
> (In reply to Michał Zegan from comment #10)
> > I believe you can mount cgroup2 if cgroup namespaces are enabled and in
> > use. Could you also verify whether that applies to the cgroup v1 filesystem?
> 
> Could you please test and let me know? (My quick test suggests not.)

So, I see that I missed something during my testing, and I believe you are correct. I've made some changes to the page to note that cgroup filesystems can be mounted in a userns.
Comment 14 Michał Zegan 2016-07-05 13:29:42 UTC
Could you confirm what I once tested? cgroup version 1 cannot be mounted in a userns. cgroup version 2 cannot be mounted in a user namespace unless a cgroup namespace is also created, in which case mounting cgroup v2 becomes allowed.
Comment 15 Michael Kerrisk 2016-07-05 14:01:32 UTC
(In reply to Michał Zegan from comment #14)
> Could you confirm what I once tested? cgroup version 1 cannot be mounted in
> a userns. cgroup version 2 cannot be mounted in a user namespace unless a
> cgroup namespace is also created, in which case mounting cgroup v2 becomes
> allowed.

How long ago did you test this? (I'm thinking about what kernel version you may have been running.)
Comment 16 Michał Zegan 2016-07-05 16:02:04 UTC
Mine is 4.6.x, so it already has cgroup namespaces. I can of course retest, though.
Comment 17 Michael Kerrisk 2016-07-07 12:33:29 UTC
Ahhh -- I see now that I missed a detail when reading the kernel source code (in kernel/cgroup.c::cgroup_mount()):

        /*
         * We know this subsystem has not yet been bound.  Users in a non-init
         * user namespace may only mount hierarchies with no bound subsystems,
         * i.e. 'none,name=user1'
         */
        if (!opts.none && !capable(CAP_SYS_ADMIN)) {
                ret = -EPERM;
                goto out_unlock;
        }


I've updated this piece of the user_namespaces(7) page to read:

       Holding  CAP_SYS_ADMIN within the user namespace associated with a
       process's cgroup namespace allows (since Linux 4.6)  that  process
       to  mount  the  cgroup  version  2 filesystem and cgroup version 1
       named hierarchies  (i.e.,  cgroup  filesystems  mounted  with  the
       "none,name=" option).

I've tested both cgroup v2 mounts and cgroup v1 'name=' mounts successfully on kernel 4.7-rc2.
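
The kernel check quoted above can be restated as a small Python decision function. This is only a sketch of that one condition, not of cgroup_mount() as a whole: the function name and the simplified option parsing are assumptions, and the real code performs many other checks before and after this point. It relies on capable() testing the capability against the initial user namespace, which is why a process that holds CAP_SYS_ADMIN only in a noninitial userns is refused unless "none" was given.

```python
import errno


def quoted_check(mount_options, capable_in_init_userns):
    """Restate only the quoted condition; returns 0 or -EPERM.

    mount_options is the comma-separated -o string, for example
    "none,name=user1". capable_in_init_userns stands in for the
    kernel's capable(CAP_SYS_ADMIN) result.
    """
    opts_none = "none" in mount_options.split(",")
    if not opts_none and not capable_in_init_userns:
        return -errno.EPERM
    return 0
```

Under this sketch, a named hierarchy such as "none,name=user1" passes the check without privileges in the initial user namespace, whereas a hierarchy with bound controllers such as "cpu,cpuacct" does not.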
Comment 18 Michał Zegan 2016-07-07 12:46:42 UTC
I missed that; I probably knew about it before. Thank you.
