Bug 209689 - [kernel config setting] setcap -v in lxc unprivileged container
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: libcap
Hardware: All Linux
Importance: P1 high
Assignee: Tools/Libcap default virtual assignee
Reported: 2020-10-14 22:47 UTC by Hervé Guillemet
Modified: 2020-12-05 19:37 UTC
Kernel Version: 5.4.66
Tree: Mainline
Regression: No


Attachments
Kernel .config (86.24 KB, text/plain)
2020-10-16 08:36 UTC, Hervé Guillemet

Description Hervé Guillemet 2020-10-14 22:47:25 UTC
setcap in an unprivileged lxc container works, but verification with `-v` does not:

```
# cp /bin/ping /tmp
# su - x -c "/tmp/ping google.com"
/tmp/ping: socket: Address family not supported by protocol
# setcap cap_net_raw+ep /tmp/ping
# su - x -c "/tmp/ping google.com"
PING google.com (142.250.74.206) 56(84) bytes of data.
# setcap -v cap_net_raw+ep /tmp/ping
nsowner[got=1000000, want=0],/tmp/ping differs in []
```

Adding `-n 1000000` makes the verification pass, but I would expect this to be transparent.
Comment 1 Andrew G. Morgan 2020-10-15 02:50:54 UTC
What version of libcap are you using?

When I build libcap-2.44 sources as follows

$ make DYNAMIC=no clean all

under a Linux container on my Chromebook, I find:

$ cd progs
$ cat /proc/self/uid_map 
         0    1000000       1000
      1000       1000          1
      1001       1001          1
      1002    1001002     654358
    655360     655360          1
    655361    1655361       9996
    665357     665357          1
    665358    1665358  999334642
$ sudo ./setcap cap_setfcap=i ./setcap
$ ./getcap ./setcap
./setcap cap_setfcap=i
$ ./setcap -v cap_setfcap=i ./setcap
./setcap: OK
Comment 2 Hervé Guillemet 2020-10-15 07:58:39 UTC
I tried 2.44 and 2.43, with the same result.

Here are your exact same commands with 2.44:
```
$ make DYNAMIC=no clean all
$ cd progs
$ cat /proc/self/uid_map 
         0    1000000      65536
$ sudo ./setcap cap_setfcap=i ./setcap
$ ./getcap ./setcap
./setcap cap_setfcap=i
$ ./setcap -v cap_setfcap=i ./setcap
nsowner[got=1000000, want=0],./setcap differs in []
```

Are there differences in our kernel configs that may explain this?
Or in the container config? `lxc config show` says this:
```
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
```
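
For reference, the `got=1000000` in the `setcap -v` output matches the host ID that the container's root maps to according to `/proc/self/uid_map`. Below is a minimal sketch of that lookup. It assumes the three-column `inside outside count` format shown above and a container whose parent is the initial user namespace. `ns_to_outer_id` is a hypothetical helper name, not part of libcap or the kernel.

```c
#include <stdio.h>

/*
 * Translate an ID as seen inside a user namespace to the ID on the other
 * side of the mapping, given uid_map text with "inside outside count" lines.
 * Returns -1 if no line covers the ID.
 * Hypothetical helper for illustration only.
 */
long long ns_to_outer_id(const char *uid_map, unsigned long inside_id)
{
	unsigned long inside, outside, count;
	int used;

	/* %n records how many bytes were consumed so we can step to the next line. */
	while (sscanf(uid_map, "%lu %lu %lu%n", &inside, &outside, &count, &used) == 3) {
		if (inside_id >= inside && inside_id - inside < count)
			return (long long)(outside + (inside_id - inside));
		uid_map += used;
	}
	return -1;
}
```

With the single-line map `0 1000000 65536`, namespace uid 0 translates to 1000000, which is the `rootid` stored in the file's v3 capability attribute.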
Comment 3 Andrew G. Morgan 2020-10-16 03:08:03 UTC
It could well be some magic on a Chromebook, not sure where to start looking. From the progs directory, I get these mount options for my container:

$ mount|fgrep " $(df .|fgrep /|awk '{print $6}') "
/dev/vdb on / type btrfs (rw,relatime,discard,space_cache,user_subvol_rm_allowed,subvolid=266,subvol=/lxd/storage-pools/default/containers/penguin/rootfs)
Comment 4 Andrew G. Morgan 2020-10-16 04:51:12 UTC
Just to confirm, is there anything interesting about your kernel? You say you are running 5.4.66. Is there anything else interesting to say about it?
Comment 5 Serge Hallyn 2020-10-16 05:46:05 UTC
The kernel should be returning the converted capabilities to libcap, whereas this looks like libcap is getting the capabilities as they look from the initial user namespace.

The contents of /proc/self/status and /proc/self/uid_map would be interesting, as well as your kernel .config.

Also if you could compile the following program and run it against that /tmp/ping both from a host process and a container process, that would be helpful.  On my system, from the host it tells me it found a v3 capability targeted at uid 100000, while from the container it finds a v2 capability.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <errno.h>

/* Use the standard fixed-width type rather than glibc's internal __uint32_t. */
typedef uint32_t u32;

#define VFS_CAP_U32_3           2
#define VFS_CAP_U32             VFS_CAP_U32_3
struct vfs_ns_cap_data {
	u32 magic_etc;
	struct {
		u32 permitted;    /* Little endian */
		u32 inheritable;  /* Little endian */
	} data[VFS_CAP_U32];
	u32 rootid;
};

#define VFS_CAP_FLAGS_EFFECTIVE 0x000001

#define VFS_CAP_REVISION_2  0x02000000
#define VFS_CAP_REVISION_3  0x03000000

int sans_flags(int x) {
	return x & ~VFS_CAP_FLAGS_EFFECTIVE;
}

int main(int argc, char **argv) {
	ssize_t sz;
	char v[200];
	struct vfs_ns_cap_data *nscap;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		exit(1);
	}

	/* Read the raw file capability xattr as seen by this process. */
	sz = getxattr(argv[1], "security.capability", v, sizeof(v));
	if (sz < 0) {
		printf("Failed reading xattr: %d\n", errno);
		exit(1);
	}
	if (sz > 200) {
		printf("xattr too long\n");
		exit(1);
	}

	nscap = (struct vfs_ns_cap_data *)v;
	switch (sans_flags(nscap->magic_etc)) {
	case VFS_CAP_REVISION_3:
		/* rootid is unsigned, so print it with %u. */
		printf("v3, rootid is %u\n", nscap->rootid);
		break;
	case VFS_CAP_REVISION_2:
		printf("v2\n");
		break;
	default:
		printf("unknown version %x\n", nscap->magic_etc);
		break;
	}
	/* This compares against the v3 struct size whatever version was found. */
	if ((size_t)sz != sizeof(struct vfs_ns_cap_data)) {
		printf("sz was %zd should be %zu\n", sz, sizeof(struct vfs_ns_cap_data));
		exit(1);
	}
	return 0;
}
Comment 6 Hervé Guillemet 2020-10-16 08:29:27 UTC
(In reply to Andrew G. Morgan from comment #3)
> 
> $ mount|fgrep " $(df .|fgrep /|awk '{print $6}') "
> /dev/vdb on / type btrfs
> (rw,relatime,discard,space_cache,user_subvol_rm_allowed,subvolid=266,subvol=/
> lxd/storage-pools/default/containers/penguin/rootfs)

```
# mount|fgrep " $(df .|fgrep /|awk '{print $6}') "
pool/containers/esel on / type zfs (rw,xattr,posixacl)
```

(In reply to Andrew G. Morgan from comment #4)
> Just to confirm, is there anything interesting about your kernel? You say
> you are running 5.4.66. Is there anything else interesting to say about it?

It's a custom-configured kernel. One thing that is maybe not too common is that I built zfs in, without module support.

(In reply to Serge Hallyn from comment #5)
> The contents of /proc/self/status and /proc/self/uid_map would be
> interesting, as well as your kernel .config.

I attached my `.config` to this report.
`/proc/self/uid_map` is the one reported above, just 1 line: 0 1000000 65536.

```
# cat /proc/self/status
Name:	cat
Umask:	0022
State:	R (running)
Tgid:	19117
Ngid:	0
Pid:	19117
PPid:	19107
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	256
Groups:	0 1 2 3 4 6 10 11 26 27 
NStgid:	19117
NSpid:	19117
NSpgid:	19117
NSsid:	19106
VmPeak:	    6864 kB
VmSize:	    6864 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	     444 kB
VmRSS:	     444 kB
RssAnon:	      64 kB
RssFile:	     380 kB
RssShmem:	       0 kB
VmData:	     312 kB
VmStk:	     132 kB
VmExe:	      24 kB
VmLib:	    1416 kB
VmPTE:	      48 kB
VmSwap:	       0 kB
CoreDumping:	0
THP_enabled:	1
Threads:	1
SigQ:	0/63688
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	0000000000000000
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	2
Speculation_Store_Bypass:	thread force mitigated
Cpus_allowed:	f
Cpus_allowed_list:	0-3
Mems_allowed:	1
Mems_allowed_list:	0
voluntary_ctxt_switches:	0
nonvoluntary_ctxt_switches:	0
```
> Also if you could compile the following program and run it against that /tmp/ping both from a host process and a container process, that would be helpful.  On my system, from the host it tells me it found a v3 capability targeted at uid 100000, while from the container it finds a v2 capability.

The output is the same for me from both container and host: v3, rootid is 1000000.
Comment 7 Hervé Guillemet 2020-10-16 08:35:39 UTC
(In reply to Hervé Guillemet from comment #6)
> 
> (In reply to Andrew G. Morgan from comment #4)
> > Just to confirm, is there anything interesting about your kernel? You say
> > you are running 5.4.66. Is there anything else interesting to say about it?
> 
> It's a custom-configured kernel. One thing that is maybe not too common is
> that I built zfs in, without module support.

It also has the gentoo patches.
Comment 8 Hervé Guillemet 2020-10-16 08:36:23 UTC
Created attachment 293009 [details]
Kernel .config
Comment 9 Serge Hallyn 2020-10-16 15:59:15 UTC
Hi - do you have a publicly accessible repo with the branch that has all the patches you used?

I've seen some botched attempts at gluing patchsets together, at least for various phone-specific kernels, in the past.
Comment 10 Hervé Guillemet 2020-10-16 18:23:21 UTC
You mean the Gentoo kernel patches?
You can find them here:
https://dev.gentoo.org/~mpagano/genpatches
Comment 11 Hervé Guillemet 2020-10-16 21:05:51 UTC
Don't waste time reviewing the Gentoo patches: I just rebooted into a vanilla
5.4.66, without the Gentoo patches, and the bug is still there.
Comment 12 Serge Hallyn 2020-10-18 04:06:33 UTC
Tried to reproduce this on 5.8.0-21-generic #22-Ubuntu using a zfs filesystem, but still can't:

ubuntu@brauner:/newpool$ sudo setcap -v cap_net_raw+ep ./capsh
./capsh: OK
Comment 13 Hervé Guillemet 2020-10-26 11:32:55 UTC
I am now running 5.8.16 and still get:
```
# setcap cap_net_raw+ep ./ping
# setcap -v cap_net_raw+ep ./ping
nsowner[got=1000000, want=0],./ping differs in []
```
Comment 14 Serge Hallyn 2020-10-27 02:39:52 UTC
I spun up a gentoo linode:

Linux localhost 5.4.48-gentoo-x86_64 #1 SMP Mon Aug 10 21:25:02 UTC 2020 x86_64 AMD EPYC 7542 32-Core Processor AuthenticAMD GNU/Linux

Was not able to reproduce, still.

To keep things simpler, can you please 'emerge lxc', then as
unpriv user run lxc-usernsexec.

As root on the host,

 cp /sbin/capsh /tmp/
 setcap -n 100000 cap_sys_admin+pe /tmp/capsh
 caplook /tmp/capsh

where 'caplook' is the program from comment #5.

Then, as the unprivileged user under lxc-usernsexec also do

 caplook /tmp/capsh

Do you get the same results?  You should have a v2 cap as the
unpriv user, and a v3 cap as root.
Comment 15 Hervé Guillemet 2020-10-27 07:23:48 UTC
I get `v3, rootid is 100000` both as host root and under lxc-usernsexec.
Comment 16 Hervé Guillemet 2020-10-27 10:41:42 UTC
`CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.

It's far from obvious from the Kconfig help text that this setting is necessary, or even that it does anything by itself. Maybe a dependency is missing.

`caplook` now also says `sz was 20 should be 24`. Does this mean something else is wrong?
Comment 17 Serge Hallyn 2020-10-27 13:09:31 UTC
> --- Comment #16 from Hervé Guillemet (herve@guillemet.org) ---
> `CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.

Thanks!

> That's far from obvious from the Kconfig help text than this setting is
> necessary, and even that is does something by itself. Maybe some dependencies
> is lacking.

Ok - I think this may be a side effect of how the LSM stacking stuff is
working.  (Cc:d Casey).  Casey, 'git blame' tells me that the #ifdef
CONFIG_SECURITY at bottom of security/commoncap.c came from

commit b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
Author: Casey Schaufler <casey@schaufler-ca.com>
Date:   Sat May 2 15:11:42 2015 -0700

    LSM: Switch to lists of hooks

But commoncap should not require CONFIG_SECURITY, historically.

> `caplook` now also says `sz was 20 should be 24`. Does this mean something
> else
> is wrong ?

I saw that too, I think it's a bug in my little test program :)
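
The `sz was 20 should be 24` message is consistent with the test program comparing the length against the v3 structure even when a v2 attribute is returned. A hedged sketch of a per-revision size check follows. The constants mirror the values in `linux/capability.h`; `expected_cap_xattr_size` is a hypothetical helper name, and `magic_etc` is assumed to be already converted to host byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define VFS_CAP_REVISION_MASK 0xFF000000u
#define VFS_CAP_REVISION_1    0x01000000u
#define VFS_CAP_REVISION_2    0x02000000u
#define VFS_CAP_REVISION_3    0x03000000u

/*
 * Expected length in bytes of a security.capability xattr for the revision
 * encoded in magic_etc, or 0 if the revision is not recognised.
 * Hypothetical helper for illustration only.
 */
size_t expected_cap_xattr_size(uint32_t magic_etc)
{
	switch (magic_etc & VFS_CAP_REVISION_MASK) {
	case VFS_CAP_REVISION_1:
		/* magic_etc + one (permitted, inheritable) pair */
		return sizeof(uint32_t) * (1 + 2 * 1);
	case VFS_CAP_REVISION_2:
		/* magic_etc + two (permitted, inheritable) pairs */
		return sizeof(uint32_t) * (1 + 2 * 2);
	case VFS_CAP_REVISION_3:
		/* v2 layout followed by the 32-bit rootid */
		return sizeof(uint32_t) * (1 + 2 * 2 + 1);
	default:
		return 0;
	}
}
```

Under this check a 20-byte v2 attribute seen from inside the container would be accepted, and only a length that disagrees with its own revision would be reported.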
Comment 18 Casey Schaufler 2020-10-27 15:52:59 UTC
On 10/27/2020 6:09 AM, Serge E. Hallyn wrote:
>> --- Comment #16 from Hervé Guillemet (herve@guillemet.org) ---
>> `CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.
> Thanks!
>
>> That's far from obvious from the Kconfig help text than this setting is
>> necessary, and even that is does something by itself. Maybe some
>> dependencies
>> is lacking.
> Ok - I think this may be a side effect of how the LSM stacking stuff is
> working.  (Cc:d Casey).  Casey, 'git blame' tells me that the #ifdef
> CONFIG_SECURITY at bottom of security/commoncap.c came from
>
> commit b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
> Author: Casey Schaufler <casey@schaufler-ca.com>
> Date:   Sat May 2 15:11:42 2015 -0700
>
>     LSM: Switch to lists of hooks
>
> But commoncap should not require CONFIG_SECURITY, historically.

There's always been a difference between calling security_capable()
and cap_capable() when CONFIG_SECURITY is unset. The #ifdef is required
because the data structures being used to register security modules
aren't defined when LSMs aren't configured.

>
>> `caplook` now also says `sz was 20 should be 24`. Does this mean something
>> else
>> is wrong ?
> I saw that too, I think it's a bug in my little test program :)
Comment 19 Serge Hallyn 2020-10-27 16:36:18 UTC
On Tue, Oct 27, 2020 at 08:52:55AM -0700, Casey Schaufler wrote:
> On 10/27/2020 6:09 AM, Serge E. Hallyn wrote:
> >> --- Comment #16 from Hervé Guillemet (herve@guillemet.org) ---
> >> `CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.
> > Thanks!
> >
> >> That's far from obvious from the Kconfig help text than this setting is
> >> necessary, and even that is does something by itself. Maybe some
> >> dependencies
> >> is lacking.
> > Ok - I think this may be a side effect of how the LSM stacking stuff is
> > working.  (Cc:d Casey).  Casey, 'git blame' tells me that the #ifdef
> > CONFIG_SECURITY at bottom of security/commoncap.c came from
> >
> > commit b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
> > Author: Casey Schaufler <casey@schaufler-ca.com>
> > Date:   Sat May 2 15:11:42 2015 -0700
> >
> >     LSM: Switch to lists of hooks
> >
> > But commoncap should not require CONFIG_SECURITY, historically.
> 
> There's always been a difference between calling security_capable()
> and cap_capable() when CONFIG_SECURITY is unset. The #ifdef is required
> because the data structures being used to register security modules
> aren't defined when LSMs aren't configured.

That's understandable, but it's still a regression (even if it came in
years ago), and not matched by change in documentation.

As Hervé says, the security/Kconfig says about CONFIG_SECURITY:

          If this option is not selected, the default Linux security
          model will be used.

So the simplest fix is probably to change that text.
Comment 20 Andrew G. Morgan 2020-10-27 16:54:08 UTC
So the root cause here is a kernel configuration issue? Should we change the component for this bug to have it tracked correctly?
Comment 21 Casey Schaufler 2020-10-27 18:31:51 UTC
On 10/27/2020 9:36 AM, Serge E. Hallyn wrote:
> On Tue, Oct 27, 2020 at 08:52:55AM -0700, Casey Schaufler wrote:
>> On 10/27/2020 6:09 AM, Serge E. Hallyn wrote:
>>>> --- Comment #16 from Hervé Guillemet (herve@guillemet.org) ---
>>>> `CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.
>>> Thanks!
>>>
>>>> That's far from obvious from the Kconfig help text than this setting is
>>>> necessary, and even that is does something by itself. Maybe some
>>>> dependencies
>>>> is lacking.
>>> Ok - I think this may be a side effect of how the LSM stacking stuff is
>>> working.  (Cc:d Casey).  Casey, 'git blame' tells me that the #ifdef
>>> CONFIG_SECURITY at bottom of security/commoncap.c came from
>>>
>>> commit b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
>>> Author: Casey Schaufler <casey@schaufler-ca.com>
>>> Date:   Sat May 2 15:11:42 2015 -0700
>>>
>>>     LSM: Switch to lists of hooks
>>>
>>> But commoncap should not require CONFIG_SECURITY, historically.
>> There's always been a difference between calling security_capable()
>> and cap_capable() when CONFIG_SECURITY is unset. The #ifdef is required
>> because the data structures being used to register security modules
>> aren't defined when LSMs aren't configured.
> That's understandable, but it's still a regression (even if it came in
> years ago), and not matched by change in documentation.
>
> As Hervé says, the security/Kconfig says about CONFIG_SECURITY:
>
>           If this option is not selected, the default Linux security
>           model will be used.
>
> So the simplest fix is probably to change that text.

Ah, but to what? I can't say that I've completely understood
the discussion.
Comment 22 Hervé Guillemet 2020-10-27 18:47:55 UTC
As an end user, the issue for me is that this feature, documented in capabilities(7):

“Correspondingly, when a version 3 security.capability attribute is retrieved (getxattr(2)) by a process that resides inside a user namespace that was created by the root user ID (or a descendant of that user namespace), the returned attribute is (automatically) simplified to appear as a version 2 attribute (i.e., the returned value is the size of a version 2 attribute and does not include the root user ID). These automatic translations mean that no changes are required to user-space tools (e.g., setcap(1) and getcap(1)) in order for those tools to be used to create and retrieve version 3 security.capability attributes.”

does not work if CONFIG_SECURITY is not set.

If this cannot be fixed, my feeling is that when namespaced file capabilities are available in the kernel config, CONFIG_SECURITY should be selected automatically.
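
To make the quoted behaviour concrete, here is a small user-space model of the read-side simplification. It is only a sketch of what the man page describes, not the kernel's code. It assumes the attribute has already been converted to host-order 32-bit words, and it handles only the case where the stored root ID equals the host ID mapped to root in the reader's namespace. `simplify_v3_to_v2` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VFS_CAP_REVISION_MASK 0xFF000000u
#define VFS_CAP_REVISION_2    0x02000000u
#define VFS_CAP_REVISION_3    0x03000000u

#define V2_WORDS 5 /* magic_etc + 2 x (permitted, inheritable) */
#define V3_WORDS 6 /* v2 layout + rootid */

/*
 * Model of the v3 -> v2 simplification (illustration only).
 * in:             v3 attribute as host-order words
 * ns_root_hostid: host uid that maps to uid 0 in the reader's namespace
 * out:            receives the v2 attribute
 * Returns the number of words written (5), or 0 if the input is not v3 or
 * its rootid does not match ns_root_hostid.
 */
size_t simplify_v3_to_v2(const uint32_t in[V3_WORDS], uint32_t ns_root_hostid,
			 uint32_t out[V2_WORDS])
{
	if ((in[0] & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_3)
		return 0;
	if (in[V3_WORDS - 1] != ns_root_hostid)
		return 0;

	/* Copy everything except the trailing rootid word. */
	memcpy(out, in, V2_WORDS * sizeof(uint32_t));
	/* Keep the flag bits, swap the revision from 3 to 2. */
	out[0] = (in[0] & ~VFS_CAP_REVISION_MASK) | VFS_CAP_REVISION_2;
	return V2_WORDS;
}
```

In this model, a v3 attribute with rootid 1000000 read from a namespace whose root maps to host uid 1000000 comes back as a 20-byte v2 attribute, which is what `setcap -v` expects to see without `-n`.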
Comment 23 Serge Hallyn 2020-10-30 12:43:35 UTC
I'm fine with that (enabling CONFIG_SECURITY).  Some may object, but I'll float a patch soon.
Comment 24 Serge Hallyn 2020-11-17 15:04:04 UTC
On Tue, Oct 27, 2020 at 11:31:49AM -0700, Casey Schaufler wrote:
> On 10/27/2020 9:36 AM, Serge E. Hallyn wrote:
> > On Tue, Oct 27, 2020 at 08:52:55AM -0700, Casey Schaufler wrote:
> >> On 10/27/2020 6:09 AM, Serge E. Hallyn wrote:
> >>>> --- Comment #16 from Hervé Guillemet (herve@guillemet.org) ---
> >>>> `CONFIG_SECURITY` was not set. When set, I see v2 cap in containers.
> >>> Thanks!
> >>>
> >>>> That's far from obvious from the Kconfig help text than this setting is
> >>>> necessary, and even that is does something by itself. Maybe some
> >>>> dependencies
> >>>> is lacking.
> >>> Ok - I think this may be a side effect of how the LSM stacking stuff is
> >>> working.  (Cc:d Casey).  Casey, 'git blame' tells me that the #ifdef
> >>> CONFIG_SECURITY at bottom of security/commoncap.c came from
> >>>
> >>> commit b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
> >>> Author: Casey Schaufler <casey@schaufler-ca.com>
> >>> Date:   Sat May 2 15:11:42 2015 -0700
> >>>
> >>>     LSM: Switch to lists of hooks
> >>>
> >>> But commoncap should not require CONFIG_SECURITY, historically.
> >> There's always been a difference between calling security_capable()
> >> and cap_capable() when CONFIG_SECURITY is unset. The #ifdef is required
> >> because the data structures being used to register security modules
> >> aren't defined when LSMs aren't configured.
> > That's understandable, but it's still a regression (even if it came in
> > years ago), and not matched by change in documentation.
> >
> > As Hervé says, the security/Kconfig says about CONFIG_SECURITY:
> >
> >           If this option is not selected, the default Linux security
> >           model will be used.
> >
> > So the simplest fix is probably to change that text.
> 
> Ah, but to what? I can't say that I've completely understood
> the discussion.

Nah there's a better fix - apologies, Casey, this was not a regression,
it was broken from the start.  Patch incoming.
