release_task decrements __task_cred(p)->user->processes, but it is called very late, after exit_notify (which sends SIGCLD, I think), or mm_release (which wakes up the TID futex). Some application (such as make, or threading test cases) assume that at this point, it is safe to spawn a new thread or process. On the Fedora build daemons (generally running current 4.6 kernels), we seem to have accidentally created a configuration which widens the race sufficiently. As a result, fork and pthread_create fail sporadically, even though at a pure userspace level, RLIMIT_NPROC is never exceeded because the process which spawns other processes or threads consistently waits for previous processes/threads to exit before spawning new processes. Particularly affected test cases are nptl/tst-eintr1, malloc/tst-malloc-thread-exit in glibc, and test_thread.rb from the Ruby testsuite.
(In reply to Florian Weimer from comment #0) > release_task decrements __task_cred(p)->user->processes, but it is called > very late, after exit_notify (which sends SIGCLD, I think), even later. After its parent does wait() and reaps a zombie. Ignoring the sub-threads. > Some application (such as make, or > threading test cases) assume that at this point, it is safe to spawn a new > thread or process. Well, I'm afraid the user-space should be fixed. > As a result, fork and pthread_create fail sporadically, even > though at a pure userspace level, RLIMIT_NPROC is never exceeded because the > process which spawns other processes or threads consistently waits for > previous processes/threads to exit before spawning new processes. See above. A zombie thread should be accounted until the parent should call wait() and frees the resources (task_struct, pid, etc).
(In reply to Oleg Nesterov from comment #1) > (In reply to Florian Weimer from comment #0) > > release_task decrements __task_cred(p)->user->processes, but it is called > > very late, after exit_notify (which sends SIGCLD, I think), > > even later. After its parent does wait() and reaps a zombie. Ignoring > the sub-threads. > > > Some application (such as make, or > > threading test cases) assume that at this point, it is safe to spawn a new > > thread or process. > > Well, I'm afraid the user-space should be fixed. How many times should user space retry a fork or clone, to make sure that it has hit the actual limit, and not just observed the effect of these race conditions? > > As a result, fork and pthread_create fail sporadically, even > > though at a pure userspace level, RLIMIT_NPROC is never exceeded because > the > > process which spawns other processes or threads consistently waits for > > previous processes/threads to exit before spawning new processes. > > See above. A zombie thread should be accounted until the parent should > call wait() and frees the resources (task_struct, pid, etc). User thread exit is not signaled through the wait* system calls, but with a futex operation. To be clear, we observed EAGAIN *after* the wait* functions have been called (for processes), or the futex has been woken up (for threads). From a user space perspective, the process/thread creation is properly serialized.
On 09/27, bugzilla-daemon@bugzilla.kernel.org wrote: > > > See above. A zombie thread should be accounted until the parent should > > call wait() and frees the resources (task_struct, pid, etc). > > User thread exit is not signaled through the wait* system calls, but with a > futex operation. Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said "ignoring sub-threads". And as for sub-threads I do not see a simple solution. futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't decrement user->processes here... Even if !thread_group_leader(), this thread can be traced by another task and in this case it won't autoreap. And to be honest I never understood rlimits and RLIMIT_NPROC in particular. I mean, we count the number of threads per-user, but rlimit is per-process, this doesn't make real sense to me. So I do not really know what we can/should do, and do we care or not...
(In reply to Oleg Nesterov from comment #3) > On 09/27, bugzilla-daemon@bugzilla.kernel.org wrote: > > > > > See above. A zombie thread should be accounted until the parent should > > > call wait() and frees the resources (task_struct, pid, etc). > > > > User thread exit is not signaled through the wait* system calls, but with a > > futex operation. > > Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said > "ignoring sub-threads". > > And as for sub-threads I do not see a simple solution. > futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't > decrement user->processes here... Even if !thread_group_leader(), this > thread can be traced by another task and in this case it won't autoreap. Can we keep track of the futex somewhere, and signal it only after we have freed the task (after verifying that the futex still exists at that point)? We do not need any task information to do that, I think, so all can be safely deallocate before that. I'm not sure if that would be a complete fix, though, because we also saw the issue with full processes, not just with threads, I think. But it is difficult to tell with the thread bug also present. > And to be honest I never understood rlimits and RLIMIT_NPROC in particular. > I mean, we count the number of threads per-user, but rlimit is per-process, > this doesn't make real sense to me. It has always been this way. It's not just RLIMIT_NPROC, there is a desire to have other quota limits.
On 10/04, bugzilla-daemon@bugzilla.kernel.org wrote: > > > Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said > > "ignoring sub-threads". > > > > And as for sub-threads I do not see a simple solution. > > futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't > > decrement user->processes here... Even if !thread_group_leader(), this > > thread can be traced by another task and in this case it won't autoreap. > > Can we keep track of the futex somewhere, and signal it only after we have > freed the task (after verifying that the futex still exists at that point)? at this point the exited thread/process no longer has ->mm we could write too, and the caller is not necessarily the same task. So this won't be simple and I don't think we can add this user-visible/incompatible change. > > And to be honest I never understood rlimits and RLIMIT_NPROC in particular. > > I mean, we count the number of threads per-user, but rlimit is per-process, > > this doesn't make real sense to me. > > It has always been this way. It's not just RLIMIT_NPROC, there is a desire > to > have other quota limits. RLIMIT_NPROC (and RLIMIT_SIGPENDING) differs in that we compare the per-process counter with per-user counter. So you can't even use RLIMIT_NPROC to protect against the trivial fork-bomb; you can't limit the number of tasks a user can run via RLIMIT_NPROC. To me, rlimits is mostly debugging tool, to catch the simple mistakes. Oleg.
(In reply to Oleg Nesterov from comment #5) > On 10/04, bugzilla-daemon@bugzilla.kernel.org wrote: > > > > > Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said > > > "ignoring sub-threads". > > > > > > And as for sub-threads I do not see a simple solution. > > > futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't > > > decrement user->processes here... Even if !thread_group_leader(), this > > > thread can be traced by another task and in this case it won't autoreap. > > > > Can we keep track of the futex somewhere, and signal it only after we have > > freed the task (after verifying that the futex still exists at that point)? > > at this point the exited thread/process no longer has ->mm we could write > too, > and the caller is not necessarily the same task. So this won't be simple and > I don't think we can add this user-visible/incompatible change. Is there a way to prevent the mm object from going away until the exit signaling is performed? If there is a way to receive thread exit notification in a more synchronous fashion?
Maybe we're thinking about this all wrong. What if we kept thread exit futex notifications as is but changed the accounting so that it stopped being debited from the NPROC balance?
On 10/05, bugzilla-daemon@bugzilla.kernel.org wrote: > > Is there a way to prevent the mm object from going away until the exit > signaling is performed? Well, we certainly do not want to do this, no matter what. > If there is a way to receive thread exit notification in a more synchronous > fashion? We can probably decrement user->processes if !leader && !traced, but somehow we need to close the races with ptrace, so this would mean some uglifications. and it is not clear what can we do with the group leader, we simply can't know if this thread will reap itself or not until exit_notify(). Perhaps you should report/discuss this on lkml. Because I am biased and personally I do not think this worth fixing, to me rlimits are useless. But this is only my opinion and of course I can be wrong.
On 10/05, bugzilla-daemon@bugzilla.kernel.org wrote: > > https://bugzilla.kernel.org/show_bug.cgi?id=154011 > > --- Comment #7 from Andy Lutomirski <luto@kernel.org> --- > Maybe we're thinking about this all wrong. What if we kept thread exit futex > notifications as is but changed the accounting so that it stopped being > debited > from the NPROC balance? But how? Except decrement user->processes before sys_futex(clear_child_tid). I personally would not mind if we simply do --- x/kernel/exit.c +++ x/kernel/exit.c @@ -170,12 +170,6 @@ void release_task(struct task_struct *p) struct task_struct *leader; int zap_leader; repeat: - /* don't need to get the RCU readlock here - the process is dead and - * can't be modifying its own credentials. But shut RCU-lockdep up */ - rcu_read_lock(); - atomic_dec(&__task_cred(p)->user->processes); - rcu_read_unlock(); - proc_flush_task(p); write_lock_irq(&tasklist_lock); @@ -464,6 +458,7 @@ static void exit_mm(struct task_struct * struct mm_struct *mm = tsk->mm; struct core_state *core_state; + atomic_dec(¤t_user()->processes); mm_release(tsk, mm); if (!mm) return; so that user->processes won't count zombies at all, but obviously we can't do this...