Bug 215596 - Commit 59ec715 breaks systemd LimitNPROC with PrivateUsers
Summary: Commit 59ec715 breaks systemd LimitNPROC with PrivateUsers
Status: NEW
Alias: None
Product: Other
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-12 15:01 UTC by Etienne Dechamps
Modified: 2022-03-04 19:23 UTC (History)
4 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Etienne Dechamps 2022-02-12 15:01:32 UTC
Commit 59ec715 "ucounts: Fix rlimit max values check", first included in Linux 5.15.12, breaks systemd "LimitNPROC" (RLIMIT_NPROC) when combined with "PrivateUsers" (user namespacing).

This can be reproduced with a trivial systemd service file:

[Service]
User=nobody
PrivateUsers=yes
LimitNPROC=4
Type=oneshot
ExecStart=/bin/true

Which, on 59ec715, fails with:

Failed to execute /bin/true: Resource temporarily unavailable
Failed at step EXEC spawning /bin/true: Resource temporarily unavailable
Main process exited, code=exited, status=203/EXEC

(Even though user `nobody` has no running processes besides this one)

A strace on PID 1 reveals the following sequence of calls (excerpt):

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40e60150) = 129
[pid   129] prlimit64(0, RLIMIT_NPROC, {rlim_cur=4, rlim_max=4}, NULL) = 0
[pid   129] unshare(CLONE_NEWUSER)      = 0
[pid   129] setresuid(65534, 65534, 65534) = 0
[pid   129] execve("/bin/true", ["/bin/true"], 0x552ad950a0 /* 7 vars */) = -1 EAGAIN (Resource temporarily unavailable)

On the parent commit of 59ec715 the service starts successfully.

This is still reproducible on current master (83e3966).

Relevant patch discussion: https://lore.kernel.org/lkml/87lf0g9xq7.fsf@email.froward.int.ebiederm.org/T/#m0a39edf27bc5aabca58b2c2a3d81704818d2c6fe

This more recent thread also seems highly relevant: https://lore.kernel.org/lkml/20220207121800.5079-1-mkoutny@suse.com/
Comment 1 Etienne Dechamps 2022-02-12 15:10:15 UTC
Cross-linking to the systemd issue I filed for visibility: https://github.com/systemd/systemd/issues/22494
Comment 2 Eric W. Biederman 2022-02-15 19:30:51 UTC
bugzilla-daemon@kernel.org writes:

> https://bugzilla.kernel.org/show_bug.cgi?id=215596
>
>             Bug ID: 215596
>            Summary: Commit 59ec715 breaks systemd LimitNPROC with
>                     PrivateUsers
>            Product: Other
>            Version: 2.5
>           Hardware: All
>                 OS: Linux
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: other_other@kernel-bugs.osdl.org
>           Reporter: etienne@edechamps.fr
>                 CC: ebiederm@xmission.com, mkoutny@suse.com,
>                     solar@openwall.com
>         Regression: Yes
>
> Commit 59ec715 "ucounts: Fix rlimit max values check", first included in
> Linux
> 5.15.12, breaks systemd "LimitNPROC" (RLIMIT_NPROC) when combined with
> "PrivateUsers" (user namespacing).
>
> This can be reproduced with a trivial systemd service file:
>
> [Service]
> User=nobody
> PrivateUsers=yes
> LimitNPROC=4
> Type=oneshot
> ExecStart=/bin/true
>
> Which, on 59ec715, fails with:
>
> Failed to execute /bin/true: Resource temporarily unavailable
> Failed at step EXEC spawning /bin/true: Resource temporarily unavailable
> Main process exited, code=exited, status=203/EXEC
>
> (Even though user `nobody` has no running processes besides this one)
>
> A strace on PID 1 reveals the following sequence of calls (excerpt):
>
> clone(child_stack=NULL,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x40e60150) = 129
> [pid   129] prlimit64(0, RLIMIT_NPROC, {rlim_cur=4, rlim_max=4}, NULL) = 0
> [pid   129] unshare(CLONE_NEWUSER)      = 0
> [pid   129] setresuid(65534, 65534, 65534) = 0
> [pid   129] execve("/bin/true", ["/bin/true"], 0x552ad950a0 /* 7 vars */) =
> -1
> EAGAIN (Resource temporarily unavailable)

Do you happen to know which user the code was running as when prlimit64
was called?  Really it only matters before the unshare(CLONE_NEWUSER).

> On the parent commit of 59ec715 the service starts successfully.
>
> This is still reproducible on current master (83e3966).
>
> Relevant patch discussion:
>
> https://lore.kernel.org/lkml/87lf0g9xq7.fsf@email.froward.int.ebiederm.org/T/#m0a39edf27bc5aabca58b2c2a3d81704818d2c6fe
>
> This more recent thread also seems highly relevant:
> https://lore.kernel.org/lkml/20220207121800.5079-1-mkoutny@suse.com/

What this looks like is the user that called unshare had more that 4
processes running.

If that user is root is root there is an easy argument for fixing this.
Looking at the behavior from your trace and reading the code I don't
think the code was running as user root.

If the user that called unshare was not root, the question becomes
what are you trying to achieve.

You say it breaks LimitNPROC with PrivateUsers but I don't see how
this could have worked reliably in the past even without the change.

What limit were you expecting to be enforced?

Right now this looks like:
  Set RLIMIT_NPROC to 4.
  Have more than 4 processes.
  The kernel enforces the limit.

There is a lot of weird and goofy history with RLIMIT_NPROC so I am open
to learning something that would let this be a sensible case.  Right now
I unfortunately am not seeing it.

Eric
Comment 3 Etienne Dechamps 2022-02-16 17:22:23 UTC
> Do you happen to know which user the code was running as when prlimit64
was called?

Root. As one would expect for systemd running as PID 1.

> What this looks like is the user that called unshare had more that 4
processes running.

According to the strace, systemd calls unshare(CLONE_NEWUSER) before setresuid(). In other words it calls unshare(CLONE_NEWUSER) as root. As one would expect on a typical running system, root indeed had more than 4 processes running.

> Looking at the behavior from your trace and reading the code I don't
think the code was running as user root.

Uh? I'm pretty sure it was running as root (until the setresuid() call, of course).

> You say it breaks LimitNPROC with PrivateUsers but I don't see how
this could have worked reliably in the past even without the change.

And yet it did. Before 59ec715, the systemd service would consistently and reliably start. On 59ec715 (and master), the same service consistently and reliably fails to start. This is easily reproducible in a fresh VM.

Note that I'm not necessarily saying the sequence of calls systemd is making makes sense. I honestly don't know for sure as I'm not intimately familiar with the details of how user namespaces work. I'm just the messenger here pointing out the surprising change in behaviour introduced 59ec715, with potential breakage for systemd users. If you think that systemd is doing something wrong here, then perhaps it would make sense to come over to https://github.com/systemd/systemd/issues/22494 and sort this out.

If you want me to gather additional information on the failure, feel free to ask. It would probably be quicker for you to reproduce it directly though, as the issue can trivially be reproduced in any VM running systemd.
Comment 4 Michal Koutný 2022-02-18 10:48:02 UTC
AFAICT the commit 59ec71575ab4 ("ucounts: Fix rlimit max values check") is correct.

In my understanding the issue is caused by 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") which takes as the value for comparison the number of tasks of the old UID (root) (and not the new one (nobody) in set*uid()) and compares it with the reduced RLIMIT_NPROC(==4).

I estimate it could have work earlier if there were no more than four such services spawned (assuming no other nobody's tasks).
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-04 06:24:41 UTC
What's the status here? I'm tracking this regression and it seems this got stuck -- or did the discussion continue somewhere else?. Can anyone please provide a quick status update?
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-04 14:00:20 UTC
(In reply to Michal Koutný from comment #6)
> Eric's patches landed in torvalds/master now. 

Sorry, missed that, many thx!
Comment 8 Michal Koutný 2022-03-04 14:13:31 UTC
BTW, if you have verification to share, it'd be great.

(I tested Eric's patches with a tailored reproducer, not this systemd one but they should be equal AFAICT.)
Comment 9 Etienne Dechamps 2022-03-04 19:23:34 UTC
FWIW, I just gave it a try and can confirm that 0ac983f fixes the issue, and that the issue is still fixed as of latest master (38f80f4). Thanks for the fix!

Note You need to log in before you can comment on or make changes to this bug.