Bug 154011

Summary: Task exit is signaled before task resource deallocation, leading to bogus EAGAIN errors
Product: Process Management Reporter: Florian Weimer (fweimer)
Component: OtherAssignee: process_other
Status: NEW ---    
Severity: normal CC: dan, luto, oleg, pbrobinson
Priority: P1    
Hardware: All   
OS: Linux   
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=24537
Kernel Version: 4.6.3 Subsystem:
Regression: No Bisected commit-id:

Description Florian Weimer 2016-08-26 21:14:34 UTC
release_task decrements __task_cred(p)->user->processes, but it is called very late, after exit_notify (which sends SIGCLD, I think), or mm_release (which wakes up the TID futex).  Some application (such as make, or threading test cases) assume that at this point, it is safe to spawn a new thread or process.

On the Fedora build daemons (generally running current 4.6 kernels), we seem to have accidentally created a configuration which widens the race sufficiently.  As a result, fork and pthread_create fail sporadically, even though at a pure userspace level, RLIMIT_NPROC is never exceeded because the process which spawns other processes or threads consistently waits for previous processes/threads to exit before spawning new processes.

Particularly affected test cases are nptl/tst-eintr1, malloc/tst-malloc-thread-exit in glibc, and test_thread.rb from the Ruby testsuite.
Comment 1 Oleg Nesterov 2016-09-27 14:24:23 UTC
(In reply to Florian Weimer from comment #0)
> release_task decrements __task_cred(p)->user->processes, but it is called
> very late, after exit_notify (which sends SIGCLD, I think),

even later. After its parent does wait() and reaps a zombie. Ignoring
the sub-threads.

> Some application (such as make, or
> threading test cases) assume that at this point, it is safe to spawn a new
> thread or process.

Well, I'm afraid the user-space should be fixed.

> As a result, fork and pthread_create fail sporadically, even
> though at a pure userspace level, RLIMIT_NPROC is never exceeded because the
> process which spawns other processes or threads consistently waits for
> previous processes/threads to exit before spawning new processes.

See above. A zombie thread should be accounted until the parent should
call wait() and frees the resources (task_struct, pid, etc).
Comment 2 Florian Weimer 2016-09-27 14:28:20 UTC
(In reply to Oleg Nesterov from comment #1)
> (In reply to Florian Weimer from comment #0)
> > release_task decrements __task_cred(p)->user->processes, but it is called
> > very late, after exit_notify (which sends SIGCLD, I think),
> 
> even later. After its parent does wait() and reaps a zombie. Ignoring
> the sub-threads.
> 
> > Some application (such as make, or
> > threading test cases) assume that at this point, it is safe to spawn a new
> > thread or process.
> 
> Well, I'm afraid the user-space should be fixed.

How many times should user space retry a fork or clone, to make sure that it has hit the actual limit, and not just observed the effect of these race conditions?

> > As a result, fork and pthread_create fail sporadically, even
> > though at a pure userspace level, RLIMIT_NPROC is never exceeded because
> the
> > process which spawns other processes or threads consistently waits for
> > previous processes/threads to exit before spawning new processes.
> 
> See above. A zombie thread should be accounted until the parent should
> call wait() and frees the resources (task_struct, pid, etc).

User thread exit is not signaled through the wait* system calls, but with a futex operation.

To be clear, we observed EAGAIN *after* the wait* functions have been called (for processes), or the futex has been woken up (for threads).  From a user space perspective, the process/thread creation is properly serialized.
Comment 3 Oleg Nesterov 2016-09-27 16:01:56 UTC
On 09/27, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > See above. A zombie thread should be accounted until the parent should
> > call wait() and frees the resources (task_struct, pid, etc).
>
> User thread exit is not signaled through the wait* system calls, but with a
> futex operation.

Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said
"ignoring sub-threads".

And as for sub-threads I do not see a simple solution.
futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't
decrement user->processes here... Even if !thread_group_leader(), this
thread can be traced by another task and in this case it won't autoreap.

And to be honest I never understood rlimits and RLIMIT_NPROC in particular.
I mean, we count the number of threads per-user, but rlimit is per-process,
this doesn't make real sense to me.

So I do not really know what we can/should do, and do we care or not...
Comment 4 Florian Weimer 2016-10-04 10:11:13 UTC
(In reply to Oleg Nesterov from comment #3)
> On 09/27, bugzilla-daemon@bugzilla.kernel.org wrote:
> >
> > > See above. A zombie thread should be accounted until the parent should
> > > call wait() and frees the resources (task_struct, pid, etc).
> >
> > User thread exit is not signaled through the wait* system calls, but with a
> > futex operation.
> 
> Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said
> "ignoring sub-threads".
> 
> And as for sub-threads I do not see a simple solution.
> futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't
> decrement user->processes here... Even if !thread_group_leader(), this
> thread can be traced by another task and in this case it won't autoreap.

Can we keep track of the futex somewhere, and signal it only after we have freed the task (after verifying that the futex still exists at that point)?  We do not need any task information to do that, I think, so all can be safely deallocate before that.

I'm not sure if that would be a complete fix, though, because we also saw the issue with full processes, not just with threads, I think.  But it is difficult to tell with the thread bug also present.

> And to be honest I never understood rlimits and RLIMIT_NPROC in particular.
> I mean, we count the number of threads per-user, but rlimit is per-process,
> this doesn't make real sense to me.

It has always been this way.  It's not just RLIMIT_NPROC, there is a desire to have other quota limits.
Comment 5 Oleg Nesterov 2016-10-04 16:45:46 UTC
On 10/04, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said
> > "ignoring sub-threads".
> >
> > And as for sub-threads I do not see a simple solution.
> > futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't
> > decrement user->processes here... Even if !thread_group_leader(), this
> > thread can be traced by another task and in this case it won't autoreap.
>
> Can we keep track of the futex somewhere, and signal it only after we have
> freed the task (after verifying that the futex still exists at that point)?

at this point the exited thread/process no longer has ->mm we could write too,
and the caller is not necessarily the same task. So this won't be simple and
I don't think we can add this user-visible/incompatible change.

> > And to be honest I never understood rlimits and RLIMIT_NPROC in particular.
> > I mean, we count the number of threads per-user, but rlimit is per-process,
> > this doesn't make real sense to me.
>
> It has always been this way.  It's not just RLIMIT_NPROC, there is a desire
> to
> have other quota limits.

RLIMIT_NPROC (and RLIMIT_SIGPENDING) differs in that we compare the per-process
counter with per-user counter. So you can't even use RLIMIT_NPROC to protect
against the trivial fork-bomb; you can't limit the number of tasks a user can
run via RLIMIT_NPROC.

To me, rlimits is mostly debugging tool, to catch the simple mistakes.

Oleg.
Comment 6 Florian Weimer 2016-10-05 09:30:03 UTC
(In reply to Oleg Nesterov from comment #5)
> On 10/04, bugzilla-daemon@bugzilla.kernel.org wrote:
> >
> > > Ah, I was confused because you mentioned SIGCHLD. Yes, that is why I said
> > > "ignoring sub-threads".
> > >
> > > And as for sub-threads I do not see a simple solution.
> > > futex(clear_child_tid, FUTEX_WAKE) is called by mm_release(), we can't
> > > decrement user->processes here... Even if !thread_group_leader(), this
> > > thread can be traced by another task and in this case it won't autoreap.
> >
> > Can we keep track of the futex somewhere, and signal it only after we have
> > freed the task (after verifying that the futex still exists at that point)?
> 
> at this point the exited thread/process no longer has ->mm we could write
> too,
> and the caller is not necessarily the same task. So this won't be simple and
> I don't think we can add this user-visible/incompatible change.

Is there a way to prevent the mm object from going away until the exit signaling is performed?

If there is a way to receive thread exit notification in a more synchronous fashion?
Comment 7 Andy Lutomirski 2016-10-05 15:16:28 UTC
Maybe we're thinking about this all wrong.  What if we kept thread exit futex notifications as is but changed the accounting so that it stopped being debited from the NPROC balance?
Comment 8 Oleg Nesterov 2016-10-07 16:07:42 UTC
On 10/05, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> Is there a way to prevent the mm object from going away until the exit
> signaling is performed?

Well, we certainly do not want to do this, no matter what.

> If there is a way to receive thread exit notification in a more synchronous
> fashion?

We can probably decrement user->processes if !leader && !traced,
but somehow we need to close the races with ptrace, so this would
mean some uglifications.

and it is not clear what can we do with the group leader, we simply
can't know if this thread will reap itself or not until exit_notify().

Perhaps you should report/discuss this on lkml. Because I am biased
and personally I do not think this worth fixing, to me rlimits are
useless. But this is only my opinion and of course I can be wrong.
Comment 9 Oleg Nesterov 2016-10-07 16:28:28 UTC
On 10/05, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=154011
>
> --- Comment #7 from Andy Lutomirski <luto@kernel.org> ---
> Maybe we're thinking about this all wrong.  What if we kept thread exit futex
> notifications as is but changed the accounting so that it stopped being
> debited
> from the NPROC balance?

But how? Except decrement user->processes before sys_futex(clear_child_tid).

I personally would not mind if we simply do

	--- x/kernel/exit.c
	+++ x/kernel/exit.c
	@@ -170,12 +170,6 @@ void release_task(struct task_struct *p)
		struct task_struct *leader;
		int zap_leader;
	 repeat:
	-	/* don't need to get the RCU readlock here - the process is dead and
	-	 * can't be modifying its own credentials. But shut RCU-lockdep up */
	-	rcu_read_lock();
	-	atomic_dec(&__task_cred(p)->user->processes);
	-	rcu_read_unlock();
	-
		proc_flush_task(p);
	 
		write_lock_irq(&tasklist_lock);
	@@ -464,6 +458,7 @@ static void exit_mm(struct task_struct *
		struct mm_struct *mm = tsk->mm;
		struct core_state *core_state;
	 
	+	atomic_dec(&current_user()->processes);
		mm_release(tsk, mm);
		if (!mm)
			return;

so that user->processes won't count zombies at all, but obviously
we can't do this...