Commit 59ec715 "ucounts: Fix rlimit max values check", first included in Linux 5.15.12, breaks systemd "LimitNPROC" (RLIMIT_NPROC) when combined with "PrivateUsers" (user namespacing). This can be reproduced with a trivial systemd service file: [Service] User=nobody PrivateUsers=yes LimitNPROC=4 Type=oneshot ExecStart=/bin/true Which, on 59ec715, fails with: Failed to execute /bin/true: Resource temporarily unavailable Failed at step EXEC spawning /bin/true: Resource temporarily unavailable Main process exited, code=exited, status=203/EXEC (Even though user `nobody` has no running processes besides this one) A strace on PID 1 reveals the following sequence of calls (excerpt): clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40e60150) = 129 [pid 129] prlimit64(0, RLIMIT_NPROC, {rlim_cur=4, rlim_max=4}, NULL) = 0 [pid 129] unshare(CLONE_NEWUSER) = 0 [pid 129] setresuid(65534, 65534, 65534) = 0 [pid 129] execve("/bin/true", ["/bin/true"], 0x552ad950a0 /* 7 vars */) = -1 EAGAIN (Resource temporarily unavailable) On the parent commit of 59ec715 the service starts successfully. This is still reproducible on current master (83e3966). Relevant patch discussion: https://lore.kernel.org/lkml/87lf0g9xq7.fsf@email.froward.int.ebiederm.org/T/#m0a39edf27bc5aabca58b2c2a3d81704818d2c6fe This more recent thread also seems highly relevant: https://lore.kernel.org/lkml/20220207121800.5079-1-mkoutny@suse.com/
Cross-linking to the systemd issue I filed for visibility: https://github.com/systemd/systemd/issues/22494
bugzilla-daemon@kernel.org writes: > https://bugzilla.kernel.org/show_bug.cgi?id=215596 > > Bug ID: 215596 > Summary: Commit 59ec715 breaks systemd LimitNPROC with > PrivateUsers > Product: Other > Version: 2.5 > Hardware: All > OS: Linux > Status: NEW > Severity: normal > Priority: P1 > Component: Other > Assignee: other_other@kernel-bugs.osdl.org > Reporter: etienne@edechamps.fr > CC: ebiederm@xmission.com, mkoutny@suse.com, > solar@openwall.com > Regression: Yes > > Commit 59ec715 "ucounts: Fix rlimit max values check", first included in > Linux > 5.15.12, breaks systemd "LimitNPROC" (RLIMIT_NPROC) when combined with > "PrivateUsers" (user namespacing). > > This can be reproduced with a trivial systemd service file: > > [Service] > User=nobody > PrivateUsers=yes > LimitNPROC=4 > Type=oneshot > ExecStart=/bin/true > > Which, on 59ec715, fails with: > > Failed to execute /bin/true: Resource temporarily unavailable > Failed at step EXEC spawning /bin/true: Resource temporarily unavailable > Main process exited, code=exited, status=203/EXEC > > (Even though user `nobody` has no running processes besides this one) > > A strace on PID 1 reveals the following sequence of calls (excerpt): > > clone(child_stack=NULL, > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, > child_tidptr=0x40e60150) = 129 > [pid 129] prlimit64(0, RLIMIT_NPROC, {rlim_cur=4, rlim_max=4}, NULL) = 0 > [pid 129] unshare(CLONE_NEWUSER) = 0 > [pid 129] setresuid(65534, 65534, 65534) = 0 > [pid 129] execve("/bin/true", ["/bin/true"], 0x552ad950a0 /* 7 vars */) = > -1 > EAGAIN (Resource temporarily unavailable) Do you happen to know which user the code was running as when prlimit64 was called? Really it only matters before the unshare(CLONE_NEWUSER). > On the parent commit of 59ec715 the service starts successfully. > > This is still reproducible on current master (83e3966). > > Relevant patch discussion: > > https://lore.kernel.org/lkml/87lf0g9xq7.fsf@email.froward.int.ebiederm.org/T/#m0a39edf27bc5aabca58b2c2a3d81704818d2c6fe > > This more recent thread also seems highly relevant: > https://lore.kernel.org/lkml/20220207121800.5079-1-mkoutny@suse.com/ What this looks like is the user that called unshare had more that 4 processes running. If that user is root is root there is an easy argument for fixing this. Looking at the behavior from your trace and reading the code I don't think the code was running as user root. If the user that called unshare was not root, the question becomes what are you trying to achieve. You say it breaks LimitNPROC with PrivateUsers but I don't see how this could have worked reliably in the past even without the change. What limit were you expecting to be enforced? Right now this looks like: Set RLIMIT_NPROC to 4. Have more than 4 processes. The kernel enforces the limit. There is a lot of weird and goofy history with RLIMIT_NPROC so I am open to learning something that would let this be a sensible case. Right now I unfortunately am not seeing it. Eric
> Do you happen to know which user the code was running as when prlimit64 was called? Root. As one would expect for systemd running as PID 1. > What this looks like is the user that called unshare had more that 4 processes running. According to the strace, systemd calls unshare(CLONE_NEWUSER) before setresuid(). In other words it calls unshare(CLONE_NEWUSER) as root. As one would expect on a typical running system, root indeed had more than 4 processes running. > Looking at the behavior from your trace and reading the code I don't think the code was running as user root. Uh? I'm pretty sure it was running as root (until the setresuid() call, of course). > You say it breaks LimitNPROC with PrivateUsers but I don't see how this could have worked reliably in the past even without the change. And yet it did. Before 59ec715, the systemd service would consistently and reliably start. On 59ec715 (and master), the same service consistently and reliably fails to start. This is easily reproducible in a fresh VM. Note that I'm not necessarily saying the sequence of calls systemd is making makes sense. I honestly don't know for sure as I'm not intimately familiar with the details of how user namespaces work. I'm just the messenger here pointing out the surprising change in behaviour introduced 59ec715, with potential breakage for systemd users. If you think that systemd is doing something wrong here, then perhaps it would make sense to come over to https://github.com/systemd/systemd/issues/22494 and sort this out. If you want me to gather additional information on the failure, feel free to ask. It would probably be quicker for you to reproduce it directly though, as the issue can trivially be reproduced in any VM running systemd.
AFAICT the commit 59ec71575ab4 ("ucounts: Fix rlimit max values check") is correct. In my understanding the issue is caused by 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") which takes as the value for comparison the number of tasks of the old UID (root) (and not the new one (nobody) in set*uid()) and compares it with the reduced RLIMIT_NPROC(==4). I estimate it could have work earlier if there were no more than four such services spawned (assuming no other nobody's tasks).
What's the status here? I'm tracking this regression and it seems this got stuck -- or did the discussion continue somewhere else?. Can anyone please provide a quick status update?
Eric's patches landed in torvalds/master now. This particular bug should be fixed with: - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sys.c?id=c923a8e7edb010da67424077cbf1a6f1396ebd2e - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/user_namespace.c?id=0ac983f512033cb7b5e210c9589768ad25b1e36b
(In reply to Michal Koutný from comment #6) > Eric's patches landed in torvalds/master now. Sorry, missed that, many thx!
BTW, if you have verification to share, it'd be great. (I tested Eric's patches with a tailored reproducer, not this systemd one but they should be equal AFAICT.)
FWIW, I just gave it a try and can confirm that 0ac983f fixes the issue, and that the issue is still fixed as of latest master (38f80f4). Thanks for the fix!