Bug 200447

Summary: infinite loop in fork syscall
Product: Process Management
Reporter: Wen Yang (wenyang)
Component: Other
Assignee: process_other
Status: RESOLVED CODE_FIX
Severity: normal
CC: akpm, ebiederm, ma.jiang, oleg, oleg, torvalds, wenyang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Trial patch to see if you can see any improvement
             Trial patch for the "real" fix

Description Wen Yang 2018-07-08 14:28:41 UTC
When a process that uses a large amount of memory periodically receives signals, it may loop indefinitely in the fork syscall.
This issue was found with the following test code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

const unsigned long int sz = 1024UL * 1024 * 1024 * 8;
static unsigned long long int signal_cnt = 0;

void incr_count(int sig)
{
	++signal_cnt;
	printf("now cnt = %llu\n", signal_cnt);
}


int main()
{
	int pid = -1;
	unsigned long long fork_time;
	char * x = malloc(sz);
	memset(x, 1, sz);

	struct itimerval period;
	period.it_interval.tv_sec = 0;
	period.it_interval.tv_usec = 100000;

	period.it_value.tv_sec = 3;
	period.it_value.tv_usec = 0;

	struct timeval t1;
	struct timeval t2;

	signal(SIGALRM, incr_count);
	setitimer(ITIMER_REAL, &period, NULL);

	while(1){
		gettimeofday(&t1, NULL);
		pid = fork();
		if(pid == 0)
			exit(0);

		gettimeofday(&t2, NULL);
		wait(NULL);
		fork_time = (t2.tv_sec * 1000ULL * 1000 + t2.tv_usec)
			- (t1.tv_sec * 1000ULL*1000 + t1.tv_usec);
		printf("fork used %llu\n", fork_time);
	}

	return 0;
}

This is caused by commit 4a2c7a7837da ("make fork() atomic wrt pgrp/session
signals"), which returns -ERESTARTNOINTR when a pending signal is detected;
fork() is then restarted transparently after the signal is handled.
When a process uses a large amount of memory, mmput() can take a long time
before -ERESTARTNOINTR is returned, so if signals keep arriving faster than
fork() can complete, the syscall never makes progress. This may amount to a
DoS if the process periodically receives signals.
We could move the pending-signal check earlier (before copy_mm) to avoid the problem.
Comment 1 Linus Torvalds 2018-07-08 21:09:03 UTC
That's an odd bugreport for a commit that is 12 years old.

Not really an infinite loop, just possibly a (unlikely) livelock by somebody who already has the rights to kill the process in question.

How did you notice this, and is there actually some real case that cares? I find your test program cute, but not necessarily hugely convincing that this is a serious problem.

I've forwarded the report to the original people involved with the report, but I think this is pretty low-priority unless you can point to a more realistic real issue..

Did you also test it on modern kernels? The locking and fork() code has changed in a lot of other ways, although the -ERESTARTNOINTR does remain.
Comment 2 Wen Yang 2018-07-09 06:18:16 UTC
(In reply to Linus Torvalds from comment #1)
> That's an odd bugreport for a commit that is 12 years old.
> 
> Not really an infinite loop, just possibly a (unlikely) livelock by somebody
> who already has the rights to kill the process in question.
> 
> How did you notice this, and is there actually some real case that cares? I
> find your test program cute, but not necessarily hugely convincing that this
> is a serious problem.
> 
> I've forwarded the report to the original people involved with the report,
> but I think this is pretty low-priority unless you can point to a more
> realistic real issue..
> 
> Did you also test it on modern kernels? The locking and fork() code has
> changed in a lot of other ways, although the -ERESTARTNOINTR does remain.

Thank you very much.

We may use setitimer + a SIGALRM handler to implement a simple heartbeat, e.g.:

https://github.com/sipwise/heartbeat/blob/master/lib/clplumbing/timers.c

int
setmsalarm(long	ms)
{
	long			secs = ms / 1000L;
	long			usecs = (ms % 1000L)*1000L;
	struct itimerval	nexttime =
	{	{0L, 0L}	/* Repeat Interval */
	,	{secs, usecs}	/* Timer Value */
	};

	return setitimer(ITIMER_REAL, &nexttime, NULL);
}

and

https://github.com/sipwise/heartbeat/blob/master/heartbeat/heartbeat.c
...
	/* Make sure we check for death of parent every so often... */
	for (;;) {

		setmsalarm(1000L);
		msg = msgfromstream(fifo);
		setmsalarm(0L);
		hb_check_mcp_alive();
		hb_signal_process_pending();
...

We'll use RHEL 7.4 (kernel 3.10.0-693.5.2.el7.x86_64) and kernel 4.15 to test it later.
Comment 3 majiang 2018-07-09 08:09:04 UTC
(In reply to Linus Torvalds from comment #1)
> That's an odd bugreport for a commit that is 12 years old.
> 
> Not really an infinite loop, just possibly a (unlikely) livelock by somebody
> who already has the rights to kill the process in question.
> 
> How did you notice this, and is there actually some real case that cares? I
> find your test program cute, but not necessarily hugely convincing that this
> is a serious problem.
> 
Hi, we found this problem when using the symbolic virtual machine KLEE (https://github.com/klee/klee).
KLEE can easily consume a lot of memory (more than 2G, usually), because it must keep the current program states in memory.
When KLEE wants to discover new states, it creates a new process with fork (so that the new process can read the current program states directly) and calls the SMT solver.
At the same time, KLEE uses alarm(1) to do a simple timeout check.
So, if fork takes too long to finish (close to 1 second), this problem gets triggered.

As far as I know, KLEE has no better alternative to fork (as SMT solvers often get stuck, or eat up too much memory)...
Comment 4 Eric W. Biederman 2018-07-09 21:59:18 UTC
I see two different pieces of userspace that are being affected.

KLEE and the sipwise heartbeat.

For sipwise I suspect the parent death signal will be a better option for you.

For KLEE how close in the code is calling alarm(1) to the call to fork()?
It sounds like they are right next to each other and a fork taking 1 second concerns me.

As this signal is only delivered to a single process I believe we can update the kernel code without too much trouble so it doesn't do this.   However for KLEE I suspect what will then happen is the signal will be delivered after the fork and the new child will be killed.  So I don't know if improving the kernel will be enough to get the KLEE code to work in this situation.
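
(For reference, a minimal sketch of the parent-death-signal approach suggested above; illustrative only, not taken from the heartbeat code. The child asks the kernel to send it SIGTERM when its parent dies, instead of polling with a periodic alarm.)

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();
	if (pid == 0) {
		/* Child: ask for SIGTERM when the parent exits. */
		if (prctl(PR_SET_PDEATHSIG, SIGTERM) == -1) {
			perror("prctl");
			_exit(1);
		}
		/* Guard against the parent having already exited. */
		if (getppid() == 1)
			_exit(0);
		pause();	/* do work; SIGTERM arrives on parent death */
		_exit(0);
	}
	sleep(1);		/* parent exits; the child is then signalled */
	return 0;
}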
Comment 5 Linus Torvalds 2018-07-09 22:14:05 UTC
Created attachment 277303 [details]
Trial patch to see if you can see any improvement

This is not a fix, but some low-hanging fruit to at least avoid some of the costs when the fork() retry does happen.

The real fix would involve actually tracking whether a signal was sent that could affect the full set of processes. I'll post that too for testing.
Comment 6 Linus Torvalds 2018-07-09 22:22:01 UTC
Created attachment 277305 [details]
Trial patch for the "real" fix

This is the actual "fix".

It's a bit hacky (need to serialize the signal sequence reading), but it should work for testing.

It basically separates out the sending of multi-process signals (the ones that target a whole process group, or "all processes"), and adds a sequence count that gets incremented for those.

And only when *that* kind of signal hits the fork() window do we restart the fork.

I'm not hugely proud of that patch, and it will definitely need some further work, but I think it's in testable shape.

It's against current -git, I don't know how well it backports to older kernels, so you may need to test this with a modern kernel.
Comment 7 Oleg Nesterov 2018-07-10 10:30:44 UTC
(In reply to Linus Torvalds from comment #6)
>
> and it will definitely need some further work

Yes... let me comment on this patch.

--- a/include/linux/sched/signal.h	
+++ a/include/linux/sched/signal.h	
@@ -17,6 +17,7 @@ struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
 	spinlock_t		siglock;
+	unsigned		sequence;

this doesn't really matter, but I'd suggest moving it into signal_struct

+	/*
+	 * FIXME! This should use sighand locking, or possibly
+	 * just memory ordering.
+	 */
+	sigseq = current->sighand->sequence;
+	if (signal_pending(current))
+		return ERR_PTR(-ERESTARTNOINTR);
+

not only we need ->siglock, we need another recalc_sigpending(). Suppose that a
PGDID signal was already sent to the forking process, but signal_pending() == 0
because it was targeted to another thread.

+		/* Return if fatal signal, or if it was a multi-process signal */
+		if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
+			retval = -ERESTARTNOINTR;
+			goto bad_fork_cancel_cgroup;
+		}

This is not enough. We need more checks, or perhaps we can add

                else
                        set_tsk_thread_flag(p, TIF_SIGPENDING); 

For example, a new thread should not start without TIF_SIGPENDING if
JOBCTL_STOP_PENDING is set (SIGSTOP/etc). Same for freezing(), probably
something else...


+static int group_send_sig_info_many(int sig, struct siginfo *info, struct task_struct *p)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = check_kill_permission(sig, info, p);
+	rcu_read_unlock();
+
+	if (!ret && sig) {
+		unsigned long flags;
+
+		if (lock_task_sighand(p, &flags)) {
+			p->sighand->sequence++;

Agreed, but probably we can cleanup this later (and kill this new helper).

We can turn the "bool group" argument of __send_signal() and its callers
into the thread/process/pgid enum and change __send_signal() to increment
the counter if group == pgid. This way it would be simpler to update, say,
tty_signal_session_leader().
Comment 8 Oleg Nesterov 2018-07-10 10:41:53 UTC
(In reply to majiang from comment #3)
>
> When KLEE want to discover new states, it create a new process with fork (So
> that the new process can read current program states directly) and call the
> SMT solver.
> At the same time , KLEE use alarm(1) to do a simple timeout check.
> So, If fork took too long to finish(close enough to 1 second), this problem
> get triggered.

OK, it seems we are going to fix the kernel.

But can't you simply block/unblock SIGALARM around the fork() call?
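
(For reference, a minimal sketch of that workaround; illustrative only. SIGALRM is blocked across fork(), so a timer expiry cannot restart the syscall; the signal stays pending and is delivered once the old mask is restored.)

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t fork_without_alarm(void)
{
	sigset_t block, old;
	pid_t pid;

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);
	sigprocmask(SIG_BLOCK, &block, &old);

	pid = fork();

	/* Restore the old mask in both the parent and the child. */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return pid;
}

int main(void)
{
	pid_t pid = fork_without_alarm();
	if (pid == 0)
		_exit(0);
	wait(NULL);
	return 0;
}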
Comment 9 Wen Yang 2018-07-10 11:21:01 UTC
(In reply to Linus Torvalds from comment #6)
> Created attachment 277305 [details]
> Trial patch for the "real" fix
> 
> This is the actual "fix".
> 
> It's a bit hacky (need to serialize the signal sequence reading), but it
> should work for testing.
> 
> It basically separates out the sending of multi-process signals (the ones
> that target a whole process group, or "all processes"), and adds a sequence
> count that gets incremented for those.
> 
> And only when *that* kind of signal hits the fork() window do we restart the
> fork.
> 
> I'm not hugely proud of that patch, and it will definitely need some further
> work, but I think it's in testable shape.
> 
> It's against current -git, I don't know how well it backports to older
> kernels, so you may need to test this with a modern kernel.

Thank you very much.
We've found those small nits:
1, sighand->sequence is not initialized.
2, sighand->sequence is not protected; we've changed it from unsigned int to atomic_t.
3, fatal_signal_pending already contains signal_pending, so we may remove redundant code 

 	if (signal_pending(current)) {
-		retval = -ERESTARTNOINTR;
-		goto bad_fork_cancel_cgroup;
+		/* Return if fatal signal, or if it was a multi-process signal */
+		if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
+			retval = -ERESTARTNOINTR;
+			goto bad_fork_cancel_cgroup;
+		}


We will continue to test it.
Thanks again.
Comment 10 Oleg Nesterov 2018-07-10 11:54:41 UTC
On 07/10, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> We've found those small nits:
> 1, sighand->sequence is not initialized.

this doesn't matter. We only need to know if it changes after we read its
value the 1st time at the start of copy_process().

> 2, sighand->sequence not protected, we've changed sighand->sequence from
> unsigned int to atomic_t.

No, it is protected by ->siglock, it is incremented under lock_task_sighand().

> 3, fatal_signal_pending already contains signal_pending, so we may remove
> redundant code

Yes, we can use __fatal_signal_pending() instead.
Comment 11 majiang 2018-07-10 14:38:54 UTC
(In reply to Eric W. Biederman from comment #4)
> I see two different pieces of userspace that are being affected.
> 
> KLEE and the sipwise heartbeat.
> 
> For sipwise I suspect the parent death signal will be a better option for
> you.
> 
> For KLEE how close in the code is calling alarm(1) to the call to fork()?
> It sounds like they are right next to each other and a fork taking 1 second
> concerns me.
> 
> As this signal is only delivered to a single process I believe we can update
> the kernel code without too much trouble so it doesn't do this.   However
> for KLEE I suspect what will then happen is the signal will be delivered
> after the fork and the new child will be killed.  So I don't know if
> improving the kernel will be enough to get the KLEE code to work in this
> situation.

Thanks for the reply.
I have simplified the situation in KLEE (sorry, that was misleading). In fact, there are two timers.
In the parent process, there is a periodic timer (using SIGALRM) to do simple heartbeat counting.
After fork, KLEE uses alarm(1) to do a timeout check in the child process (the signal handler will exit the child).

So, after the kernel gets fixed, KLEE will run happily.
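
(For reference, a minimal sketch of the two-timer pattern described above; illustrative only, not KLEE's actual code. The parent keeps a periodic SIGALRM heartbeat, and after fork() the child re-arms a one-shot alarm(1) whose handler terminates it.)

#include <signal.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void heartbeat(int sig)     { (void)sig; /* parent: count a tick */ }
static void child_timeout(int sig) { (void)sig; _exit(1); /* "solver" timed out */ }

int main(void)
{
	/* 100 ms periodic heartbeat in the parent. */
	struct itimerval period = { { 0, 100000 }, { 0, 100000 } };

	signal(SIGALRM, heartbeat);
	setitimer(ITIMER_REAL, &period, NULL);

	pid_t pid = fork();	/* a slow fork here can keep being restarted by SIGALRM */
	if (pid == 0) {
		/* Interval timers are not inherited, so re-arm in the child. */
		signal(SIGALRM, child_timeout);
		alarm(1);
		/* ... run the solver here ... */
		_exit(0);
	}
	wait(NULL);
	return 0;
}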
Comment 12 majiang 2018-07-10 15:18:07 UTC
(In reply to Oleg Nesterov from comment #8)
> (In reply to majiang from comment #3)
> >
> > When KLEE want to discover new states, it create a new process with fork
> (So
> > that the new process can read current program states directly) and call the
> > SMT solver.
> > At the same time , KLEE use alarm(1) to do a simple timeout check.
> > So, If fork took too long to finish(close enough to 1 second), this problem
> > get triggered.
> 
> OK, it seems we are going to fix the kernel.
> 
> But can't you simply block/unblock SIGALARM around the fork() call?

Thanks for the advice.
We had applied this method before we submitted this bug, and yes, KLEE continued to run after that.
A fork that takes several hours to finish is obviously ill-formed, so we submitted this bug to see whether the kernel could make some improvements.
Comment 13 Linus Torvalds 2018-07-10 17:20:57 UTC
On Tue, Jul 10, 2018 at 3:30 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> +       /*
> +        * FIXME! This should use sighand locking, or possibly
> +        * just memory ordering.
> +        */
> +       sigseq = current->sighand->sequence;
> +       if (signal_pending(current))
> +               return ERR_PTR(-ERESTARTNOINTR);
> +
>
> not only we need ->siglock, we need another recalc_sigpending().

I'm not convinced we do.

> Suppose that a
> PGDID signal was already sent to the forking process, but signal_pending() == 0
> because it was targeted to another thread.

The thing is, we don't actually care all that much if the parent
signal gets delivered to the parent or not. The parent signal handling
is fine - whether it's that particular thread that gets it or not.

All we care about is

 (a) did a signal happen after the parent entered the kernel (and has
to handle them manually)

 (b) can we ignore the signal as far as the *child* is concerned.

Now, (b) is somewhat subtle. For example, if a signal is ignored or
blocked in the parent, we can just say "it happened before the fork
started". It doesn't matter if that is strictly true or not - it
allows us to ignore the signal as far as the child is concerned.

And the same is true if the signal was targeted to another thread in
the parent. We don't care. There's no serialization wrt the fork(), so
we might as well say "we consider the signal happened before the fork
even started, and the other thread took it".

See? The *child* is the one that matters, and if the signal happened
before the fork entirely - or we can act as if it did - then the child
can ignore it, because then we have already handled it.

So we don't need to do recalc_sigpending(), because if it's not
pending in the parent (if we were to just complete the fork
immediately and return), then we can consider it not an issue for the
child.

I don't think we need that *existing* recalc_sigpending() either,
really, for the same reason. If the signal is pending in the parent
that the parent should actually react to, TIF_SIGPENDING is already
set.

> +               /* Return if fatal signal, or if it was a multi-process signal */
> +               if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
> +                       retval = -ERESTARTNOINTR;
> +                       goto bad_fork_cancel_cgroup;
> +               }
>
> This is not enough. We need more checks, or perhaps we can add
>
>                 else
>                         set_tsk_thread_flag(p, TIF_SIGPENDING);

Why? Again, if the parent hasn't gotten a new signal, then why should
the child care.



> Agreed, but probably we can cleanup this later (and kill this new helper).
>
> We can turn the "bool group" argument of __send_signal() and its callers
> into the thread/process/pgid enum and change __send_signal() to increment
> the counter if group == pgid. This way it would be simpler to update, say,
> tty_signal_session_leader().

Actually, I looked at that, and started doing it, and I absolutely
hated what it did to the code. That "group" argument is already
confusing.

I much prefer to have explicit functions that do what they describe
they do than magical arguments that are really confusing and change
how the function acts in subtle ways.

             Linus
Comment 14 Oleg Nesterov 2018-07-11 11:16:39 UTC
On 07/10, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > Suppose that a
> > PGDID signal was already sent to the forking process, but signal_pending() == 0
> > because it was targeted to another thread.
>
> The thing is, we don't actually care all that much if the parent
> signal gets delivered to the parent or not. The parent signal handling
> is fine - whether it's that particular thread that gets it or not.
>
> All we care about is
>
>  (a) did a signal happen after the parent entered the kernel (and has
> to handle them manually)
>
>  (b) can we ignore the signal as far as the *child* is concerned.
>
> Now, (b) is somewhat subtle. For example, if a signal is ignored or
> blocked in the parent, we can just say "it happened before the fork
> started". It doesn't matter if that is strictly true or not - it
> allows us to ignore the signal as far as the child is concerned.
>
> And the same is true if the signal was targeted to another thread in
> the parent. We don't care. There's no serialization wrt the fork(), so
> we might as well say "we consider the signal happened before the fork
> even started, and the other thread took it".

I don't understand...

OK, suppose we have a single process P in pg == 100, it has 2 threads, M and T.
It has a handler for SIGTERM which does some cleanups and calls exit_group().

M calls fork() and this races with SIGTERM.

This signal should terminate the process P and the child. IOW, either P should
handle the signal and exit before M creates a child, or the new child should be
visible to kill_pgrp(100, SIGTERM) and thus it will receive SIGTERM too.

Now:

	- kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
	  targets thread T, so signal_pending(M) is still 0.

	- M calls fork(), enters copy_process(), reads the already incremented
	  counter, checks signal_pending() without recalc_sigpending(), it is 0.

	  fork() succeeds. fatal_signal_pending() == 0, the counter is the same.

	- T dequeues SIGTERM and terminates the process P.

	- The new child runs.

Is it correct?

> I don't think we need that *existing* recalc_sigpending() either,

See above.

But I forgot to mention yesterday that it can be removed with your patch (as
long as we add another one at the start).

> > +               if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
> > +                       retval = -ERESTARTNOINTR;
> > +                       goto bad_fork_cancel_cgroup;
> > +               }
> >
> > This is not enough. We need more checks, or perhaps we can add
> >
> >                 else
> >                         set_tsk_thread_flag(p, TIF_SIGPENDING);
>
> Why? Again, if the parent hasn't gotten a new signal, then why should
> the child care.

Again, suppose that pthread_create() races with do_signal_stop(). Either
copy_process() should fail, or the new thread should have both TIF_SIGPENDING
and JOBCTL_STOP_PENDING|JOBCTL_STOP_CONSUME flags set.

The race with try_to_freeze_tasks() is simpler, in this case we only need
TIF_SIGPENDING.

And I think there are more problems. That is why I suggested another
approach which "blocks" only the "real" signals sent to thread/process, so
we can rely on the simple signal_pending() check at the end.

Oleg.
Comment 15 Linus Torvalds 2018-07-11 15:57:20 UTC
(In reply to Oleg Nesterov from comment #14)
> 
>       - kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
>         targets thread T, so signal_pending(M) is still 0.
> 
>       - M calls fork(), enters copy_process(), reads the already incremented
>         counter, checks signal_pending() without recalc_sigpending(), it is 0.
> 
>         fork() suceeds. fatal_signal_pending() == 0, the counter is the same.
> 
>       - T dequeues SIGTERM and terminates the process P.
> 
>       - The new child runs.
> 
> Is it correct?

I think that's ok, yes.

Why? Because we just say "SIGTERM was done before fork(), it just ran concurrently in another thread"

The fact that you have a signal pending in another thread is simply not serialized with fork(). 

The key word is that

> It has a handler for SIGTERM which does some cleanups and calls exit_group().

See? Every process in the process group did get the SIGTERM. But then one thread decided to do a fork() as another thread did that signal handler.

So I think that we're fine.

Side note: there's also a weasely lawyerese reason for why we don't need to care about the threaded case: technically it's unspecified. fork() is simply not thread-safe. If you have threads and do fork(), the newly copied address space will be some *random* copy that other threads can be in the middle of things of, including holding core locks etc. 

So the newly created address space may have entirely inconsistent locking in core internal libc routines (think "oh, we forked while another thread held the lock for stdin").

So the kernel *allows* fork() in threaded environments, and we try to make it work, but you get what you get. In practice, fork + immediate execve() *should* be safe, but because of the "you copy locks" issue, it might actually be better to do vfork()+execve instead.
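
(For reference, a minimal sketch of the vfork()+execve() pattern mentioned above; illustrative only, and it assumes /bin/true exists on the system.)

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *const argv[] = { "/bin/true", NULL };
	char *const envp[] = { NULL };

	/* vfork() borrows the parent's address space until the child calls
	 * execve() or _exit(), so no locks are ever copied into the child. */
	pid_t pid = vfork();
	if (pid == 0) {
		/* Only execve() or _exit() are safe to call here. */
		execve("/bin/true", argv, envp);
		_exit(127);
	}
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return 0;
}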

Anyway, I think that from a correctness standpoint, we should worry about:

 (a) SIGKILL to everybody, so that we know that at least the sysadmin can always kill everything and you can't escape that

 (b) the "fork() from a single thread" case, because that's the one where standards language will matter.

and then outside of those cases we should strive to have a "good quality" implementation that we can at least _explain_ what we do, even if it might not be perfect. Best-effort, in other words, simply because "perfect" is fundamentally not an option with fork+threads due to the above issue.
Comment 16 Oleg Nesterov 2018-07-11 17:00:20 UTC
On 07/11, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200447
>
> --- Comment #15 from Linus Torvalds (torvalds@linux-foundation.org) ---
> (In reply to Oleg Nesterov from comment #14)
> >
> >       - kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
> >         targets thread T, so signal_pending(M) is still 0.
> >
> >       - M calls fork(), enters copy_process(), reads the already incremented
> >         counter, checks signal_pending() without recalc_sigpending(), it is 0.
> >
> >         fork() suceeds. fatal_signal_pending() == 0, the counter is the same.
> >
> >       - T dequeues SIGTERM and terminates the process P.
> >
> >       - The new child runs.
> >
> > Is it correct?
>
> I think that's ok, yes.

I disagree... At least I thought that the whole purpose of the current
signal_pending() check is the scenarios like above. I mean, ignoring SIGKILL
and other sources of TIF_SIGPENDING. That is exactly why recalc_sigpending()
was added.

> Why? Because we just say "SIGTERM was done before fork(), it just ran
> concurrently in another thread"

And I think this is wrong, this is how I understand the comment

	* Process group and session signals need to be delivered to just the
	* parent before the fork or both the parent and the child after the
	* fork.

above the signal_pending() check.

> The fact that you have a signal pending in another thread is simply not
> serialized with fork().

But it is not pending in another thread. It is pending in this process, in
signal->shared_pending.

Just complete_signal() tries not to disturb the whole thread group; it doesn't
wake up all threads, it picks a "random" one.

> The key word is that
>
> > It has a handler for SIGTERM

Yes,

> which does some cleanups and calls exit_group().

I think it doesn't matter what the handler actually does. It could be a
SIGWINCH handler which does ioctl(TIOCGWINSZ).

> Anyway, I think that from a correctness standpoint, we should worry about:
>
>  (a) SIGKILL to everybody, so that we know that at least the sysadmin can
> always kill everything and you can't escape that

This is clear, and this is simple.

>  (b) the "fork() from a single thread" case, because that's the one where
> standards language will matter.

I can't agree with "only a single thread case matters".

Oleg.
Comment 17 Linus Torvalds 2018-07-11 17:05:51 UTC
On Wed, Jul 11, 2018 at 10:00 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> But it is not pending in another thread. It is pending in this process, in
> signal->shared_pending.

So?

What part of the whole "SIGTERM is handled concurrently in another
thread" is it that you're really disagreeing with?

And if it's handled concurrently by another thread, then the fork is
independent.

And if the fork is independent, then you get the *exact* semantics
you'd have gotten if the SIGTERM was sent just _before_ the fork().

So explain why you think the above three statements aren't true.

                Linus
Comment 18 Oleg Nesterov 2018-07-12 10:06:40 UTC
> --- Comment #17 from Linus Torvalds (torvalds@linux-foundation.org) ---
> On Wed, Jul 11, 2018 at 10:00 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> >
> > But it is not pending in another thread. It is pending in this process, in
> > signal->shared_pending.
>
> So?
>
> What part of the whole "SIGTERM is handled concurrently in another
> thread" is it that you're really disagreeing with?

I don't understand why you think we should make things worse than necessary.

Yes, if a multithreaded process has a terminating handler for SIGTERM and it
calls fork(), then SIGTERM sent to its pgroup can miss the new child anyway,
whatever we do.

But recalc_sigpending_tsk() is cheap (we don't need recalc_sigpending), and it can
help at least in the case when SIGTERM was pending in the process before one of its
threads calls fork(). I never said this is a great improvement, but I see no
reason to remove it.

OK, please forget.


What about the "fatal_signal_pending(current) || sigseq != current->sighand->sequence"
check? Do you agree it is not enough?

Oleg.
Comment 19 Wen Yang 2018-07-12 13:24:08 UTC
> [RFC][PATCH 10/11] signal: Push pid type from signal senders down into
> __send_signal
>       
> ...
> -     pending = group ? &t->signal->shared_pending : &t->pending;
> +     pending = type != PIDTYPE_PID ? &t->signal->shared_pending :
> &t->pending; 

we may add brackets to make it more readable:
+	pending = (type != (=PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
Comment 20 Wen Yang 2018-07-12 13:57:33 UTC
I'm sorry, it should be:
+	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
Comment 21 Eric W. Biederman 2018-07-12 14:44:19 UTC
(In reply to Wen Yang from comment #20)
> I'm sorry, it should be:
> +     pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending :
> &t->pending;

Agreed.  Silly me trying too hard to stay within the 80-column limit.
Comment 22 Eric W. Biederman 2018-07-12 15:43:30 UTC
(In reply to Linus Torvalds from comment #15)
> (In reply to Oleg Nesterov from comment #14)
> > 
> >       - kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
> >         targets thread T, so signal_pending(M) is still 0.
> > 
> >       - M calls fork(), enters copy_process(), reads the already incremented
> >         counter, checks signal_pending() without recalc_sigpending(), it is 0.
> > 
> >         fork() suceeds. fatal_signal_pending() == 0, the counter is the same.
> > 
> >       - T dequeues SIGTERM and terminates the process P.
> > 
> >       - The new child runs.
> > 
> > Is it correct?
> 
> I think that's ok, yes.


> Anyway, I think that from a correctness standpoint, we should worry about:
> 
>  (a) SIGKILL to everybody, so that we know that at least the sysadmin can
> always kill everything and you can't escape that
> 
>  (b) the "fork() from a single thread" case, because that's the one where
> standards language will matter.
> 
> and then outside of those cases we should strive to have a "good quality"
> implementation that we can at least _explain_ what we do, even if it might
> not be perfect. Best-effort, in other words, simply because "perfect" is
> fundamentally not an option with fork+threads due to the above issue.

I just reread the posix documentation.  It is quite clear that a signal should be delivered to userspace either before or after the fork.  If we leave the signal as merely pending in another thread it will be possible to construct scenarios where it can be proven that the signal was delivered after the fork, and the child was not signaled.

Further, there exists pthread_atfork, which makes it possible to cleanly handle most of the locking issues.  So fork and pthreads are not an unusable combination.
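
(For reference, a minimal sketch of the pthread_atfork() idiom; illustrative only, with hypothetical names, compiled with -pthread. The prepare handler takes the lock just before fork(), and the other two handlers release it in the parent and the child, so the child never starts with a copied, already-held lock.)

#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* prepare runs in the parent just before fork(); the other two run after it. */
static void prepare(void) { pthread_mutex_lock(&lock); }
static void parent(void)  { pthread_mutex_unlock(&lock); }
static void child(void)   { pthread_mutex_unlock(&lock); }

int main(void)
{
	pthread_atfork(prepare, parent, child);

	pid_t pid = fork();
	if (pid == 0) {
		/* The lock is in a consistent state here; safe to use. */
		pthread_mutex_lock(&lock);
		pthread_mutex_unlock(&lock);
		_exit(0);
	}
	wait(NULL);
	return 0;
}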

So I agree with Oleg the code should call recalc_sigpending() at the start of fork, before we rely on the sequence counter.  Taking the siglock is not expensive nor is the recalculation, nor even the syscall restart at that point.

Most importantly for me adding a recalc_sigpending at the start of fork at very little cost removes one more corner case we have to reason about, test and think about.  Making the code simpler and more reliable.
Comment 23 Oleg Nesterov 2018-07-12 16:04:52 UTC
(In reply to Eric W. Biederman from comment #22)
>
> So I agree with Oleg the code should call recalc_sigpending() at the start
> of fork, before we rely on the sequence counter.  Taking the siglock is not
> expensive nor is the recalculation

Just in case, we can use recalc_sigpending_tsk(current) which doesn't need
siglock because tsk==current and it never clears TIF_SIGPENDING. We only need
to check ->shared_pending.

recalc_sigpending() needs siglock, but it can clear the "spurious" SIGPENDING.
This can help to avoid the unnecessary restart, but this is unlikely and as
you have mentioned restart is not expensive at that point.
Comment 24 Wen Yang 2018-07-17 11:17:36 UTC
From	"Eric W. Biederman" <>
Date	Tue, 10 Jul 2018 21:44:53 -0500
Subject	[RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct

task->thread_pid equals task->signal->pids[PIDTYPE_SID], so we may be storing redundant values.


static inline struct pid *task_pid(struct task_struct *task)
 {
...
-	return task->group_leader->pids[PIDTYPE_SID].pid;
+	return task->thread_pid;
 }
 
  void transfer_pid(struct task_struct *old, struct task_struct *new,
                enum pid_type type)
 {
-    new->pids[type].pid = old->pids[type].pid;
-    hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
+    if (type == PIDTYPE_PID)
+        new->thread_pid = old->thread_pid;
+    hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]);
 }
 
  init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 {
-     task->pids[type].pid = pid;
+    if (type == PIDTYPE_PID)
+        task->thread_pid = pid;
+    else
+        task->signal->pids[type] = pid;
 }
 
 ---> when type == PIDTYPE_PID, task->thread_pid is right, but task->signal->pids[PIDTYPE_PID] is NULL.
Comment 25 Wen Yang 2018-08-24 07:06:58 UTC
This bug has been fixed by the patch:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/commit/?h=siginfo-testing&id=c3ad2c3b02e953ead2b8d52a0c9e70312930c3d0

We've tested it for several days; it works fine.
Thanks.

Another patch also needs to be applied; it can improve the performance of the fork syscall:
https://bugzilla.kernel.org/attachment.cgi?id=277303&action=edit
Comment 26 Linus Torvalds 2018-08-24 19:39:42 UTC
On Fri, Aug 24, 2018 at 12:07 AM Wen Yang (yellowriver2010@hotmail.com) wrote:
>
> We've test it several days, it works fine.
> Thanks.
>
> Another patch also needs to be applied, can improve the performance of fork
> syscall:
> https://bugzilla.kernel.org/attachment.cgi?id=277303&action=edit

I'm perfectly happy applying that patch, since I wrote it, and it
makes sense, but it would be really good if you actually have numbers
for the improvement on a real load.

I can see it improving things on a trivial benchmark, but it would
matter much more to me if you actually have numbers for this improving
things in real life on your load.

Thanks,

          Linus
Comment 27 Wen Yang 2018-08-27 11:44:53 UTC
(In reply to Linus Torvalds from comment #26)
> 
> I'm perfectly happy applying that patch, since I wrote it, and it
> makes sense, but it would be really good if you actually have numbers
> for the improvement on a real load.
> 
> I can see it improving things on a trivial benchmark, but it would
> matter much more to me if you actually have numbers for this improving
> things in real life on your load.


Hi, Linus, 
It's my pleasure to test your patch.

My test code is:

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

static unsigned long long int signal_cnt = 0;

void incr_count(int sig)
{
        ++signal_cnt;
}

int main (int argc, char *argv[] )
{
        long long mem_size;
        unsigned long fork_times;

        if(argc != 3) {
                printf("usage: memsize forktimes !\n");
                return -EINVAL;
        }

        mem_size = atoll(argv[1]);
        fork_times = atoi(argv[2]);
        char * x = malloc(mem_size);
        memset(x, 1, mem_size);

        struct itimerval period;
        period.it_interval.tv_sec = 0;
        period.it_interval.tv_usec = 100000;

        period.it_value.tv_sec = 3;
        period.it_value.tv_usec = 0;

        struct timeval t1;
        struct timeval t2;

        signal(SIGALRM, incr_count);
        setitimer(ITIMER_REAL, &period, NULL);

        int i = 0, pid;
        gettimeofday(&t1, NULL);
        while (i < fork_times) {
                pid = fork();
                if(pid == 0)
                        exit(0);

                wait(NULL);
                i++;
        }

        unsigned long long time_used;
        gettimeofday(&t2, NULL);
        time_used = (t2.tv_sec * 1000ULL * 1000 + t2.tv_usec)
                - (t1.tv_sec * 1000ULL*1000 + t1.tv_usec);
        printf("mem_size=%llu && fork_times=%lu, used %llu us\n",
                        mem_size, fork_times, time_used);

        return 0;
}


My test environment is a kvm virtual machine:
  <memory unit='KiB'>20486144</memory>
  <currentMemory unit='KiB'>20485760</currentMemory>
  <vcpu placement='static'>10</vcpu>


My test cases are:

TEST 1: mem_size=8G && fork_times=4096

  a. Without this patch:
  # uname  -a
  Linux localhost.localdomain 4.18.0-next-20180816 #1 SMP Fri Aug 17 09:33:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux

  # for i in {0..15};  do  echo 3 > /proc/sys/vm/drop_caches;  ./a.out  8589934592  4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
  mem_size=8589934592 && fork_times=4096, used 54079373 us
  mem_size=8589934592 && fork_times=4096, used 50768614 us
  mem_size=8589934592 && fork_times=4096, used 58499333 us
  mem_size=8589934592 && fork_times=4096, used 50798437 us
  mem_size=8589934592 && fork_times=4096, used 53412134 us
  mem_size=8589934592 && fork_times=4096, used 54974291 us
  mem_size=8589934592 && fork_times=4096, used 57183614 us
  mem_size=8589934592 && fork_times=4096, used 55201697 us
  mem_size=8589934592 && fork_times=4096, used 51308532 us
  mem_size=8589934592 && fork_times=4096, used 54910070 us
  mem_size=8589934592 && fork_times=4096, used 54145605 us
  mem_size=8589934592 && fork_times=4096, used 51561948 us
  mem_size=8589934592 && fork_times=4096, used 50645891 us
  mem_size=8589934592 && fork_times=4096, used 52478923 us
  mem_size=8589934592 && fork_times=4096, used 52967142 us
  mem_size=8589934592 && fork_times=4096, used 54354484 us
  
average = (54079373+50768614+58499333+50798437+53412134+54974291+57183614+55201697+51308532+54910070+54145605+51561948+50645891+52478923+52967142+54354484)/16
        = 53580630.5
		
  b. With this patch:
  # uname  -a
  Linux localhost.localdomain 4.18.0fix-next-20180816+ #4 SMP Mon Aug 27 09:47:11 CST 2018 x86_64 x86_64 x86_64 GNU/Linux

  #  for i in {0..15};  do  echo 3 > /proc/sys/vm/drop_caches;  ./a.out  8589934592  4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
  mem_size=8589934592 && fork_times=4096, used 50220661 us
  mem_size=8589934592 && fork_times=4096, used 40172114 us
  mem_size=8589934592 && fork_times=4096, used 46645951 us
  mem_size=8589934592 && fork_times=4096, used 46042630 us
  mem_size=8589934592 && fork_times=4096, used 46228679 us
  mem_size=8589934592 && fork_times=4096, used 45128850 us
  mem_size=8589934592 && fork_times=4096, used 43085861 us
  mem_size=8589934592 && fork_times=4096, used 43290896 us
  mem_size=8589934592 && fork_times=4096, used 46307499 us
  mem_size=8589934592 && fork_times=4096, used 45169518 us
  mem_size=8589934592 && fork_times=4096, used 42832446 us
  mem_size=8589934592 && fork_times=4096, used 44796410 us
  mem_size=8589934592 && fork_times=4096, used 45700605 us
  mem_size=8589934592 && fork_times=4096, used 43767852 us
  mem_size=8589934592 && fork_times=4096, used 44515256 us
  mem_size=8589934592 && fork_times=4096, used 43434547 us

average = (50220661+40172114+46645951+46042630+46228679+45128850+43085861+43290896+46307499+45169518+42832446+44796410+45700605+43767852+44515256+43434547)/16
        = 44833735.9375
  
delta = (53580630.5 - 44833735.9375)/53580630.5
      = 16.3%
	  
=================================================================
  
TEST 2: mem_size=4G && fork_times=4096

  a. Without this patch:
  # uname -a
  Linux localhost.localdomain 4.18.0-next-20180816 #1 SMP Fri Aug 17 09:33:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux

  # for i in {0..15};  do  echo 3 > /proc/sys/vm/drop_caches;  ./a.out  4294967296  4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
  mem_size=4294967296 && fork_times=4096, used 18256498 us
  mem_size=4294967296 && fork_times=4096, used 18167115 us
  mem_size=4294967296 && fork_times=4096, used 18577742 us
  mem_size=4294967296 && fork_times=4096, used 18529440 us
  mem_size=4294967296 && fork_times=4096, used 15664382 us
  mem_size=4294967296 && fork_times=4096, used 19681474 us
  mem_size=4294967296 && fork_times=4096, used 20842160 us
  mem_size=4294967296 && fork_times=4096, used 17733025 us
  mem_size=4294967296 && fork_times=4096, used 18448357 us
  mem_size=4294967296 && fork_times=4096, used 19143284 us
  mem_size=4294967296 && fork_times=4096, used 18283942 us
  mem_size=4294967296 && fork_times=4096, used 19133080 us
  mem_size=4294967296 && fork_times=4096, used 17829583 us
  mem_size=4294967296 && fork_times=4096, used 19044675 us
  mem_size=4294967296 && fork_times=4096, used 18316882 us
  mem_size=4294967296 && fork_times=4096, used 18687442 us

average = (18256498+18167115+18577742+18529440 +15664382+19681474+20842160+17733025+18448357+19143284+18283942+19133080+17829583+19044675+18316882+18687442)/16
        = 18521192.5625
		
  b. With this patch:
  
  # uname  -a
  Linux localhost.localdomain 4.18.0fix-next-20180816+ #4 SMP Mon Aug 27 09:47:11 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
  
  #  for i in {0..15};  do  echo 3 > /proc/sys/vm/drop_caches;  ./a.out  4294967296  4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
  mem_size=4294967296 && fork_times=4096, used 15173732 us
  mem_size=4294967296 && fork_times=4096, used 15231416 us
  mem_size=4294967296 && fork_times=4096, used 15853240 us
  mem_size=4294967296 && fork_times=4096, used 16448929 us
  mem_size=4294967296 && fork_times=4096, used 15101745 us
  mem_size=4294967296 && fork_times=4096, used 14808725 us
  mem_size=4294967296 && fork_times=4096, used 13834210 us
  mem_size=4294967296 && fork_times=4096, used 15297005 us
  mem_size=4294967296 && fork_times=4096, used 14256755 us
  mem_size=4294967296 && fork_times=4096, used 14701890 us
  mem_size=4294967296 && fork_times=4096, used 16032061 us
  mem_size=4294967296 && fork_times=4096, used 14864268 us
  mem_size=4294967296 && fork_times=4096, used 14040698 us
  mem_size=4294967296 && fork_times=4096, used 13486647 us
  mem_size=4294967296 && fork_times=4096, used 15362027 us
  mem_size=4294967296 && fork_times=4096, used 14086739 us

average = (15173732+15231416+15853240+16448929+15101745+14808725+13834210+15297005+14256755+14701890+16032061+14864268+14040698+13486647+15362027+14086739)/16
       = 14911255.4375
  
delta = (18521192.5625 - 14911255.4375)/18521192.5625
      = 19.5%


Thanks,
Comment 28 Linus Torvalds 2018-08-27 15:15:04 UTC
The COW copy optimization ended up being merged as commit 1b2de5d039c8 ("mm/cow: don't bother write protecting already write-protected pages") because I decided to just add it before the merge window closed. So it will be in 4.19.

But thanks for the numbers. They are still just for a test program rather than any real load, but it's not like that change should possibly hurt anything.