When a process that uses a large amount of memory periodically receives signals, it may loop forever in the fork syscall. This issue was found with the following test code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

const unsigned long int sz = 1024UL * 1024 * 1024 * 8;
static unsigned long long int signal_cnt = 0;

void incr_count(int sig)
{
	++signal_cnt;
	printf("now cnt = %llu\n", signal_cnt);
}

int main()
{
	int pid = -1;
	unsigned long long fork_time;
	char *x = malloc(sz);
	memset(x, 1, sz);

	struct itimerval period;
	period.it_interval.tv_sec = 0;
	period.it_interval.tv_usec = 100000;
	period.it_value.tv_sec = 3;
	period.it_value.tv_usec = 0;

	struct timeval t1;
	struct timeval t2;

	signal(SIGALRM, incr_count);
	setitimer(ITIMER_REAL, &period, NULL);

	while (1) {
		gettimeofday(&t1, NULL);
		pid = fork();
		if (pid == 0)
			exit(0);
		gettimeofday(&t2, NULL);
		wait(NULL);
		fork_time = (t2.tv_sec * 1000ULL * 1000 + t2.tv_usec) -
			    (t1.tv_sec * 1000ULL * 1000 + t1.tv_usec);
		printf("fork used %llu\n", fork_time);
	}
	return 0;
}

This is caused by commit 4a2c7a7837da ("make fork() atomic wrt pgrp/session signals"), which returns -ERESTARTNOINTR when it detects a pending signal; fork() is then restarted transparently after the signals are handled. For a process using a large amount of memory, mmput() can take a long time before fork() returns -ERESTARTNOINTR. This may lead to a DoS if the process periodically receives signals. We could move the pending-signal check earlier (before copy_mm) to avoid the problem.
That's an odd bug report for a commit that is 12 years old.

Not really an infinite loop, just possibly an (unlikely) livelock by somebody who already has the rights to kill the process in question.

How did you notice this, and is there actually some real case that cares? I find your test program cute, but not necessarily hugely convincing that this is a serious problem.

I've forwarded the report to the original people involved with the report, but I think this is pretty low-priority unless you can point to a more realistic real issue.

Did you also test it on modern kernels? The locking and fork() code has changed in a lot of other ways, although the -ERESTARTNOINTR does remain.
(In reply to Linus Torvalds from comment #1)
> That's an odd bug report for a commit that is 12 years old.
>
> Not really an infinite loop, just possibly an (unlikely) livelock by somebody
> who already has the rights to kill the process in question.
>
> How did you notice this, and is there actually some real case that cares? I
> find your test program cute, but not necessarily hugely convincing that this
> is a serious problem.
>
> I've forwarded the report to the original people involved with the report,
> but I think this is pretty low-priority unless you can point to a more
> realistic real issue.
>
> Did you also test it on modern kernels? The locking and fork() code has
> changed in a lot of other ways, although the -ERESTARTNOINTR does remain.

Thank you very much.

We may use setitimer + a SIGALRM handler to implement a simple heartbeat, e.g.:
https://github.com/sipwise/heartbeat/blob/master/lib/clplumbing/timers.c

int setmsalarm(long ms)
{
	long secs = ms / 1000L;
	long usecs = (ms % 1000L) * 1000L;
	struct itimerval nexttime = {
		{0L, 0L}	/* Repeat Interval */
	,	{secs, usecs}	/* Timer Value */
	};
	return setitimer(ITIMER_REAL, &nexttime, NULL);
}

and https://github.com/sipwise/heartbeat/blob/master/heartbeat/heartbeat.c

...
	/* Make sure we check for death of parent every so often... */
	for (;;) {
		setmsalarm(1000L);
		msg = msgfromstream(fifo);
		setmsalarm(0L);
		hb_check_mcp_alive();
		hb_signal_process_pending();
...

We'll test it on RHEL 7.4 (kernel 3.10.0-693.5.2.el7.x86_64) and kernel 4.15 later.
(In reply to Linus Torvalds from comment #1)
> That's an odd bug report for a commit that is 12 years old.
>
> Not really an infinite loop, just possibly an (unlikely) livelock by somebody
> who already has the rights to kill the process in question.
>
> How did you notice this, and is there actually some real case that cares? I
> find your test program cute, but not necessarily hugely convincing that this
> is a serious problem.
>

Hi,

We found this problem when using the symbolic virtual machine KLEE (https://github.com/klee/klee). KLEE can easily consume a lot of memory (more than 2 GB, usually), because it must keep the current program states in memory.

When KLEE wants to discover new states, it creates a new process with fork (so that the new process can read the current program states directly) and calls the SMT solver. At the same time, KLEE uses alarm(1) to do a simple timeout check. So, if fork takes too long to finish (close enough to 1 second), this problem gets triggered.

As far as I know, KLEE has no better alternative to fork (as SMT solvers often get stuck, or eat up too much memory)...
I see two different pieces of userspace that are being affected: KLEE and the sipwise heartbeat.

For sipwise I suspect the parent death signal will be a better option for you.

For KLEE, how close in the code is the call to alarm(1) to the call to fork()? It sounds like they are right next to each other, and a fork taking 1 second concerns me.

As this signal is only delivered to a single process, I believe we can update the kernel code without too much trouble so it doesn't do this. However, for KLEE I suspect what will then happen is that the signal will be delivered after the fork and the new child will be killed. So I don't know if improving the kernel will be enough to get the KLEE code to work in this situation.
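For reference, a minimal sketch of the parent-death-signal idea (the names and the choice of SIGTERM are only illustrative, not a recommendation for the heartbeat code):

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: ask the kernel to deliver SIGTERM when the parent dies. */
		if (prctl(PR_SET_PDEATHSIG, SIGTERM) == -1) {
			perror("prctl");
			_exit(1);
		}
		/* The parent may already have died before prctl() ran. */
		if (getppid() == 1)
			_exit(0);
		pause();		/* the child's real work goes here */
	}
	return 0;
}

With this in place the child no longer needs a periodic SIGALRM just to notice that its parent went away.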
Created attachment 277303 [details]
Trial patch to see if you can see any improvement

This is not a fix, but some low-hanging fruit to at least avoid some of the costs when the fork() retry does happen.

The real fix would involve actually tracking whether a signal could affect the full list of processes. I'll post that too for testing.
Created attachment 277305 [details]
Trial patch for the "real" fix

This is the actual "fix".

It's a bit hacky (need to serialize the signal sequence reading), but it should work for testing.

It basically separates out the sending of multi-process signals (the ones that target a whole process group, or "all processes"), and adds a sequence count that gets incremented for those.

And only when *that* kind of signal hits the fork() window do we restart the fork.

I'm not hugely proud of that patch, and it will definitely need some further work, but I think it's in testable shape.

It's against current -git. I don't know how well it backports to older kernels, so you may need to test this with a modern kernel.
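(To make the idea concrete, here is a plain-C model of what the patch does - invented names, not the actual kernel code: snapshot a counter before the expensive copy, bump it only for group/session-wide signals, and restart the fork only if it changed.)

#include <stdbool.h>

/* All names here are made up for illustration; the real code lives in
 * copy_process() and the signal-sending paths. */
struct sighand_model {
	unsigned int sequence;	/* bumped only for multi-process signals */
};

/* What a kill_pgrp()-style sender would do (under siglock in the kernel). */
static void send_multiprocess_signal(struct sighand_model *sh)
{
	sh->sequence++;
}

/* Shape of the fork-side check: snapshot, do the slow copy, re-check. */
static int fork_model(struct sighand_model *sh, bool fatal_pending,
		      void (*slow_copy)(void))
{
	unsigned int seq = sh->sequence;	/* before copy_mm() and friends */

	slow_copy();				/* may take a long time */

	if (fatal_pending || seq != sh->sequence)
		return -1;			/* -ERESTARTNOINTR in the kernel */
	return 0;
}

A thread-directed SIGALRM no longer bumps the counter, so it can no longer force the expensive copy to be thrown away and redone.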
(In reply to Linus Torvalds from comment #6)
>
> and it will definitely need some further work

Yes... let me comment on this patch.

--- a/include/linux/sched/signal.h
+++ a/include/linux/sched/signal.h
@@ -17,6 +17,7 @@ struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
 	spinlock_t		siglock;
+	unsigned		sequence;

This doesn't really matter, but I'd suggest to move it into signal_struct.

+	/*
+	 * FIXME! This should use sighand locking, or possibly
+	 * just memory ordering.
+	 */
+	sigseq = current->sighand->sequence;
+	if (signal_pending(current))
+		return ERR_PTR(-ERESTARTNOINTR);
+

Not only do we need ->siglock, we need another recalc_sigpending(). Suppose that a PGID signal was already sent to the forking process, but signal_pending() == 0 because it was targeted to another thread.

+	/* Return if fatal signal, or if it was a multi-process signal */
+	if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
+		retval = -ERESTARTNOINTR;
+		goto bad_fork_cancel_cgroup;
+	}

This is not enough. We need more checks, or perhaps we can add

	else
		set_tsk_thread_flag(p, TIF_SIGPENDING);

For example, a new thread should not start without TIF_SIGPENDING if JOBCTL_STOP_PENDING is set (SIGSTOP/etc). Same for freezing(), probably something else...

+static int group_send_sig_info_many(int sig, struct siginfo *info, struct task_struct *p)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = check_kill_permission(sig, info, p);
+	rcu_read_unlock();
+
+	if (!ret && sig) {
+		unsigned long flags;
+
+		if (lock_task_sighand(p, &flags)) {
+			p->sighand->sequence++;

Agreed, but probably we can clean this up later (and kill this new helper). We can turn the "bool group" argument of __send_signal() and its callers into a thread/process/pgid enum and change __send_signal() to increment the counter if group == pgid. This way it would be simpler to update, say, tty_signal_session_leader().
(In reply to majiang from comment #3)
>
> When KLEE wants to discover new states, it creates a new process with fork (so
> that the new process can read the current program states directly) and calls
> the SMT solver.
> At the same time, KLEE uses alarm(1) to do a simple timeout check.
> So, if fork takes too long to finish (close enough to 1 second), this problem
> gets triggered.

OK, it seems we are going to fix the kernel.

But can't you simply block/unblock SIGALRM around the fork() call?
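In userspace that would look roughly like this (a minimal sketch; the wrapper name is made up):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Block SIGALRM across fork() so a timer tick cannot keep restarting a slow
 * fork; any signal raised meanwhile is delivered when the mask is restored. */
pid_t fork_with_alarm_blocked(void)
{
	sigset_t block, old;
	pid_t pid;

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);
	sigprocmask(SIG_BLOCK, &block, &old);

	pid = fork();

	/* Restore the previous mask in both the parent and the child. */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return pid;
}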
(In reply to Linus Torvalds from comment #6)
> Created attachment 277305 [details]
> Trial patch for the "real" fix
>
> This is the actual "fix".
>
> It's a bit hacky (need to serialize the signal sequence reading), but it
> should work for testing.
>
> It basically separates out the sending of multi-process signals (the ones
> that target a whole process group, or "all processes"), and adds a sequence
> count that gets incremented for those.
>
> And only when *that* kind of signal hits the fork() window do we restart the
> fork.
>
> I'm not hugely proud of that patch, and it will definitely need some further
> work, but I think it's in testable shape.
>
> It's against current -git. I don't know how well it backports to older
> kernels, so you may need to test this with a modern kernel.

Thank you very much.

We've found these small nits:

1. sighand->sequence is not initialized.
2. sighand->sequence is not protected; we've changed sighand->sequence from unsigned int to atomic_t.
3. fatal_signal_pending already contains signal_pending, so we may remove the redundant code:

 	if (signal_pending(current)) {
-		retval = -ERESTARTNOINTR;
-		goto bad_fork_cancel_cgroup;
+	/* Return if fatal signal, or if it was a multi-process signal */
+	if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
+		retval = -ERESTARTNOINTR;
+		goto bad_fork_cancel_cgroup;
+	}

We will continue to test it. Thanks again.
On 07/10, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> We've found these small nits:
> 1. sighand->sequence is not initialized.

This doesn't matter. We only need to know if it changes after we read its value the first time at the start of copy_process().

> 2. sighand->sequence is not protected; we've changed sighand->sequence from
> unsigned int to atomic_t.

No, it is protected by ->siglock; it is incremented under lock_task_sighand().

> 3. fatal_signal_pending already contains signal_pending, so we may remove the
> redundant code

Yes, we can use __fatal_signal_pending() instead.
(In reply to Eric W. Biederman from comment #4)
> I see two different pieces of userspace that are being affected:
> KLEE and the sipwise heartbeat.
>
> For sipwise I suspect the parent death signal will be a better option for
> you.
>
> For KLEE, how close in the code is the call to alarm(1) to the call to fork()?
> It sounds like they are right next to each other, and a fork taking 1 second
> concerns me.
>
> As this signal is only delivered to a single process, I believe we can update
> the kernel code without too much trouble so it doesn't do this. However,
> for KLEE I suspect what will then happen is that the signal will be delivered
> after the fork and the new child will be killed. So I don't know if
> improving the kernel will be enough to get the KLEE code to work in this
> situation.

Thanks for the reply.

I over-simplified the situation in KLEE (sorry, that was misleading). In fact, there are two timers. In the parent process there is a periodic timer (using SIGALRM) to do simple heartbeat counting. After fork, KLEE uses alarm(1) to do a timeout check in the child process (the signal handler exits the child).

So, after the kernel gets fixed, KLEE will run happily.
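Roughly, the pattern is the following (a simplified sketch, not KLEE's actual code; the handler names are made up):

#include <signal.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t beats;

static void parent_heartbeat(int sig) { beats++; }	/* periodic counting in the parent */
static void child_timeout(int sig) { _exit(1); }	/* give up on a stuck solver run */

int main(void)
{
	struct itimerval period = { { 0, 100000 }, { 0, 100000 } };

	signal(SIGALRM, parent_heartbeat);
	setitimer(ITIMER_REAL, &period, NULL);	/* heartbeat timer, parent only */

	pid_t pid = fork();			/* the window where the livelock can happen */
	if (pid == 0) {
		signal(SIGALRM, child_timeout);	/* itimers are not inherited across fork */
		alarm(1);			/* per-run timeout in the child */
		/* ... call the SMT solver on the inherited state ... */
		_exit(0);
	}
	wait(NULL);
	return 0;
}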
(In reply to Oleg Nesterov from comment #8)
> OK, it seems we are going to fix the kernel.
>
> But can't you simply block/unblock SIGALRM around the fork() call?

Thanks for the advice.

We had already applied this method before we submitted this bug, and yes, KLEE continued to run after that. But a fork that takes several hours to finish is obviously ill-formed, so we submitted this bug to see whether the kernel could do some improvements.
On Tue, Jul 10, 2018 at 3:30 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> +	/*
> +	 * FIXME! This should use sighand locking, or possibly
> +	 * just memory ordering.
> +	 */
> +	sigseq = current->sighand->sequence;
> +	if (signal_pending(current))
> +		return ERR_PTR(-ERESTARTNOINTR);
> +
>
> Not only do we need ->siglock, we need another recalc_sigpending().

I'm not convinced we do.

> Suppose that a PGID signal was already sent to the forking process, but
> signal_pending() == 0 because it was targeted to another thread.

The thing is, we don't actually care all that much if the parent signal gets delivered to the parent or not. The parent signal handling is fine - whether it's that particular thread that gets it or not.

All we care about is

 (a) did a signal happen after the parent entered the kernel (and has to handle them manually)

 (b) can we ignore the signal as far as the *child* is concerned.

Now, (b) is somewhat subtle. For example, if a signal is ignored or blocked in the parent, we can just say "it happened before the fork started". It doesn't matter if that is strictly true or not - it allows us to ignore the signal as far as the child is concerned.

And the same is true if the signal was targeted to another thread in the parent. We don't care. There's no serialization wrt the fork(), so we might as well say "we consider the signal happened before the fork even started, and the other thread took it".

See? The *child* is the one that matters, and if the signal happened before the fork entirely - or we can act as if it did - then the child can ignore it, because then we have already handled it.

So we don't need to do recalc_sigpending(), because if it's not pending in the parent (if we were to just complete the fork immediately and return), then we can consider it not an issue for the child.

I don't think we need that *existing* recalc_sigpending() either, really, for the same reason. If the signal is pending in the parent that the parent should actually react to, TIF_SIGPENDING is already set.

> +	/* Return if fatal signal, or if it was a multi-process signal */
> +	if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
> +		retval = -ERESTARTNOINTR;
> +		goto bad_fork_cancel_cgroup;
> +	}
>
> This is not enough. We need more checks, or perhaps we can add
>
> 	else
> 		set_tsk_thread_flag(p, TIF_SIGPENDING);

Why? Again, if the parent hasn't gotten a new signal, then why should the child care.

> Agreed, but probably we can clean this up later (and kill this new helper).
>
> We can turn the "bool group" argument of __send_signal() and its callers
> into a thread/process/pgid enum and change __send_signal() to increment
> the counter if group == pgid. This way it would be simpler to update, say,
> tty_signal_session_leader().

Actually, I looked at that, and started doing it, and I absolutely hated what it did to the code.

That "group" argument is already confusing. I much prefer to have explicit functions that do what they describe they do than magical arguments that are really confusing and change how the function acts in subtle ways.

                 Linus
On 07/10, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > Suppose that a PGID signal was already sent to the forking process, but
> > signal_pending() == 0 because it was targeted to another thread.
>
> The thing is, we don't actually care all that much if the parent
> signal gets delivered to the parent or not. The parent signal handling
> is fine - whether it's that particular thread that gets it or not.
>
> All we care about is
>
>  (a) did a signal happen after the parent entered the kernel (and has
>  to handle them manually)
>
>  (b) can we ignore the signal as far as the *child* is concerned.
>
> Now, (b) is somewhat subtle. For example, if a signal is ignored or
> blocked in the parent, we can just say "it happened before the fork
> started". It doesn't matter if that is strictly true or not - it
> allows us to ignore the signal as far as the child is concerned.
>
> And the same is true if the signal was targeted to another thread in
> the parent. We don't care. There's no serialization wrt the fork(), so
> we might as well say "we consider the signal happened before the fork
> even started, and the other thread took it".

I don't understand...

OK, suppose we have a single process P in pg == 100; it has 2 threads, M and T. It has a handler for SIGTERM which does some cleanups and calls exit_group(). M calls fork() and this races with SIGTERM.

This signal should terminate the process P and the child. IOW, either P should handle the signal and exit before M creates a child, or the new child should be visible to kill_pgrp(100, SIGTERM) and thus it will receive SIGTERM too.

Now:

	- kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
	  targets thread T, so signal_pending(M) is still 0.

	- M calls fork(), enters copy_process(), reads the already incremented
	  counter, checks signal_pending() without recalc_sigpending(), it is 0.

	  fork() succeeds. fatal_signal_pending() == 0, the counter is the same.

	- T dequeues SIGTERM and terminates the process P.

	- The new child runs.

Is it correct?

> I don't think we need that *existing* recalc_sigpending() either,

See above. But I forgot to mention yesterday that it can be removed with your patch (as long as we add another one at the start).

> > +	if (fatal_signal_pending(current) || sigseq != current->sighand->sequence) {
> > +		retval = -ERESTARTNOINTR;
> > +		goto bad_fork_cancel_cgroup;
> > +	}
> >
> > This is not enough. We need more checks, or perhaps we can add
> >
> > 	else
> > 		set_tsk_thread_flag(p, TIF_SIGPENDING);
>
> Why? Again, if the parent hasn't gotten a new signal, then why should
> the child care.

Again, suppose that pthread_create() races with do_signal_stop(). Either copy_process() should fail, or the new thread should have both TIF_SIGPENDING and JOBCTL_STOP_PENDING|JOBCTL_STOP_CONSUME flags set.

The race with try_to_freeze_tasks() is simpler; in this case we only need TIF_SIGPENDING.

And I think there are more problems. That is why I suggested another approach which "blocks" only the "real" signals sent to thread/process, so we can rely on the simple signal_pending() check at the end.

Oleg.
(In reply to Oleg Nesterov from comment #14)
>
>	- kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
>	  targets thread T, so signal_pending(M) is still 0.
>
>	- M calls fork(), enters copy_process(), reads the already incremented
>	  counter, checks signal_pending() without recalc_sigpending(), it is 0.
>
>	  fork() succeeds. fatal_signal_pending() == 0, the counter is the same.
>
>	- T dequeues SIGTERM and terminates the process P.
>
>	- The new child runs.
>
> Is it correct?

I think that's ok, yes.

Why? Because we just say "SIGTERM was done before fork(), it just ran concurrently in another thread".

The fact that you have a signal pending in another thread is simply not serialized with fork().

The key word is that

> It has a handler for SIGTERM
> which does some cleanups and calls exit_group().

See? Every process in the process group did get the SIGTERM. But then one thread decided to do a fork() as another thread did that signal handler.

So I think that we're fine.

Side note: there's also a weasely lawyerese reason for why we don't need to care about the threaded case: technically it's unspecified. fork() is simply not thread-safe.

If you have threads and do fork(), the newly copied address space will be some *random* copy that other threads can be in the middle of things of, including holding core locks etc. So the newly created address space may have entirely inconsistent locking in core internal libc routines (think "oh, we forked while another thread held the lock for stdin").

So the kernel *allows* fork() in threaded environments, and we try to make it work, but you get what you get. In practice, fork + immediate execve() *should* be safe, but because of the "you copy locks" issue, it might actually be better to do vfork()+execve instead.

Anyway, I think that from a correctness standpoint, we should worry about:

 (a) SIGKILL to everybody, so that we know that at least the sysadmin can always kill everything and you can't escape that

 (b) the "fork() from a single thread" case, because that's the one where standards language will matter.

and then outside of those cases we should strive to have a "good quality" implementation that we can at least _explain_ what we do, even if it might not be perfect. Best-effort, in other words, simply because "perfect" is fundamentally not an option with fork+threads due to the above issue.
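(For illustration, the vfork()+execve() pattern mentioned above looks something like this - just a sketch, with /bin/true standing in for the real program:)

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *const argv[] = { "/bin/true", NULL };
	char *const envp[] = { NULL };
	pid_t pid = vfork();

	if (pid == 0) {
		/* The vfork() child borrows the parent's address space, so it
		 * may only call execve() or _exit() here. */
		execve("/bin/true", argv, envp);
		_exit(127);		/* only reached if execve() failed */
	}
	waitpid(pid, NULL, 0);
	return 0;
}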
On 07/11, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200447
>
> --- Comment #15 from Linus Torvalds (torvalds@linux-foundation.org) ---
> (In reply to Oleg Nesterov from comment #14)
> >
> >	- kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
> >	  targets thread T, so signal_pending(M) is still 0.
> >
> >	- M calls fork(), enters copy_process(), reads the already incremented
> >	  counter, checks signal_pending() without recalc_sigpending(), it is 0.
> >
> >	  fork() succeeds. fatal_signal_pending() == 0, the counter is the same.
> >
> >	- T dequeues SIGTERM and terminates the process P.
> >
> >	- The new child runs.
> >
> > Is it correct?
>
> I think that's ok, yes.

I disagree... At least I thought that the whole purpose of the current signal_pending() check is scenarios like the above. I mean, ignoring SIGKILL and other sources of TIF_SIGPENDING. That is exactly why recalc_sigpending() was added.

> Why? Because we just say "SIGTERM was done before fork(), it just ran
> concurrently in another thread"

And I think this is wrong; this is how I understand the comment

	* Process group and session signals need to be delivered to just the
	* parent before the fork or both the parent and the child after the
	* fork.

above the signal_pending() check.

> The fact that you have a signal pending in another thread is simply not
> serialized with fork().

But it is not pending in another thread. It is pending in this process, in signal->shared_pending. Just complete_signal() tries to not disturb the whole thread group: it doesn't wake up all threads, it picks a "random" one.

> The key word is that
>
> > It has a handler for SIGTERM

Yes,

> which does some cleanups and calls exit_group().

I think it doesn't matter what the handler actually does. It could be a SIGWINCH handler which does ioctl(TIOCGWINSZ).

> Anyway, I think that from a correctness standpoint, we should worry about:
>
>  (a) SIGKILL to everybody, so that we know that at least the sysadmin can
>  always kill everything and you can't escape that

This is clear, and this is simple.

>  (b) the "fork() from a single thread" case, because that's the one where
>  standards language will matter.

I can't agree with "only the single-thread case matters".

Oleg.
On Wed, Jul 11, 2018 at 10:00 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> But it is not pending in another thread. It is pending in this process, in
> signal->shared_pending.

So?

What part of the whole "SIGTERM is handled concurrently in another thread" is it that you're really disagreeing with?

And if it's handled concurrently by another thread, then the fork is independent.

And if the fork is independent, then you get the *exact* semantics you'd have gotten if the SIGTERM was sent just _before_ the fork().

So explain why you think the above three statements aren't true.

                 Linus
> --- Comment #17 from Linus Torvalds (torvalds@linux-foundation.org) ---
> On Wed, Jul 11, 2018 at 10:00 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> >
> > But it is not pending in another thread. It is pending in this process, in
> > signal->shared_pending.
>
> So?
>
> What part of the whole "SIGTERM is handled concurrently in another
> thread" is it that you're really disagreeing with?

I don't understand why you think we should make things worse than necessary.

Yes, if a multithreaded process has a terminating handler for SIGTERM and it calls fork(), then SIGTERM sent to its pgroup can miss the new child anyway, whatever we do.

But recalc_sigpending_tsk() is cheap (we don't need recalc_sigpending); it can help at least in the case when SIGTERM was pending in the process before one of its threads calls fork(). I never said this is a great improvement, but I see no reason to remove it.

OK, please forget. What about the "fatal_signal_pending(current) || sigseq != current->sighand->sequence" check? Do you agree it is not enough?

Oleg.
> [RFC][PATCH 10/11] signal: Push pid type from signal senders down into
> __send_signal
>
> ...
> -	pending = group ? &t->signal->shared_pending : &t->pending;
> +	pending = type != PIDTYPE_PID ? &t->signal->shared_pending : &t->pending;

We may add brackets to make it more readable:

+	pending = (type != (=PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
I'm sorry, it should be:

+	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
(In reply to Wen Yang from comment #20)
> I'm sorry, it should be:
> +	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

Agreed. Silly me trying too hard to stay within 80-column limits.
(In reply to Linus Torvalds from comment #15)
> (In reply to Oleg Nesterov from comment #14)
> >
> >	- kill_pgrp(100) finds a single process P, delivers the SIGNAL, but
> >	  targets thread T, so signal_pending(M) is still 0.
> >
> >	- M calls fork(), enters copy_process(), reads the already incremented
> >	  counter, checks signal_pending() without recalc_sigpending(), it is 0.
> >
> >	  fork() succeeds. fatal_signal_pending() == 0, the counter is the same.
> >
> >	- T dequeues SIGTERM and terminates the process P.
> >
> >	- The new child runs.
> >
> > Is it correct?
>
> I think that's ok, yes.
>
> Anyway, I think that from a correctness standpoint, we should worry about:
>
>  (a) SIGKILL to everybody, so that we know that at least the sysadmin can
>  always kill everything and you can't escape that
>
>  (b) the "fork() from a single thread" case, because that's the one where
>  standards language will matter.
>
> and then outside of those cases we should strive to have a "good quality"
> implementation that we can at least _explain_ what we do, even if it might
> not be perfect. Best-effort, in other words, simply because "perfect" is
> fundamentally not an option with fork+threads due to the above issue.

I just reread the POSIX documentation. It is quite clear that a signal should be delivered to userspace either before or after the fork. If we leave the signal as merely pending in another thread, it will be possible to construct scenarios where it can be proven that the signal was delivered after the fork and the child was not signaled.

Further, there exists pthread_atfork, which makes it possible to cleanly handle most of the locking issues. So fork and pthreads are not an unusable combination (see the sketch below).

So I agree with Oleg: the code should call recalc_sigpending() at the start of fork, before we rely on the sequence counter. Taking the siglock is not expensive, nor is the recalculation, nor even the syscall restart at that point.

Most importantly for me, adding a recalc_sigpending at the start of fork, at very little cost, removes one more corner case we have to reason about, test and think about, making the code simpler and more reliable.
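For reference, pthread_atfork() usage looks like this (a minimal sketch protecting a single, made-up application lock):

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork() so the child never inherits it held,
 * and release it again in both parent and child afterwards. */
static void atfork_prepare(void) { pthread_mutex_lock(&app_lock); }
static void atfork_release(void) { pthread_mutex_unlock(&app_lock); }

int main(void)
{
	pthread_atfork(atfork_prepare, atfork_release, atfork_release);

	pid_t pid = fork();
	if (pid == 0)
		_exit(0);	/* child starts with app_lock in a known state */
	return 0;
}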
(In reply to Eric W. Biederman from comment #22)
>
> So I agree with Oleg: the code should call recalc_sigpending() at the start
> of fork, before we rely on the sequence counter. Taking the siglock is not
> expensive, nor is the recalculation

Just in case: we can use recalc_sigpending_tsk(current), which doesn't need siglock because tsk == current and it never clears TIF_SIGPENDING. We only need to check ->shared_pending.

recalc_sigpending() needs siglock, but it can clear a "spurious" SIGPENDING. This can help to avoid an unnecessary restart, but this is unlikely, and as you have mentioned the restart is not expensive at that point.
From "Eric W. Biederman" <> Date Tue, 10 Jul 2018 21:44:53 -0500 Subject [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct task->thread_pid equals to task->signal->pids[PIDTYPE_SID], so we may have stored redundant values. static inline struct pid *task_pid(struct task_struct *task) { ... - return task->group_leader->pids[PIDTYPE_SID].pid; + return task->thread_pid; } void transfer_pid(struct task_struct *old, struct task_struct *new, enum pid_type type) { - new->pids[type].pid = old->pids[type].pid; - hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node); + if (type == PIDTYPE_PID) + new->thread_pid = old->thread_pid; + hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]); } init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid) { - task->pids[type].pid = pid; + if (type == PIDTYPE_PID) + task->thread_pid = pid; + else + task->signal->pids[type] = pid; } ---> when type == PIDTYPE_PID, task->thread_pid is right, but task->signal->pids[PIDTYPE_PID] is NULL.
This bug has been fixed by the patch:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/commit/?h=siginfo-testing&id=c3ad2c3b02e953ead2b8d52a0c9e70312930c3d0

We've tested it for several days; it works fine.
Thanks.

Another patch also needs to be applied; it can improve the performance of the fork syscall:
https://bugzilla.kernel.org/attachment.cgi?id=277303&action=edit
On Fri, Aug 24, 2018 at 12:07 AM Wen Yang (yellowriver2010@hotmail.com) wrote:
>
> We've tested it for several days; it works fine.
> Thanks.
>
> Another patch also needs to be applied; it can improve the performance of the
> fork syscall:
> https://bugzilla.kernel.org/attachment.cgi?id=277303&action=edit

I'm perfectly happy applying that patch, since I wrote it, and it makes sense, but it would be really good if you actually have numbers for the improvement on a real load.

I can see it improving things on a trivial benchmark, but it would matter much more to me if you actually have numbers for this improving things in real life on your load.

Thanks,

                 Linus
(In reply to Linus Torvalds from comment #26)
>
> I'm perfectly happy applying that patch, since I wrote it, and it
> makes sense, but it would be really good if you actually have numbers
> for the improvement on a real load.
>
> I can see it improving things on a trivial benchmark, but it would
> matter much more to me if you actually have numbers for this improving
> things in real life on your load.

Hi, Linus,

It's my pleasure to test your patch.

My test code is:

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

static unsigned long long int signal_cnt = 0;

void incr_count(int sig)
{
	++signal_cnt;
}

int main(int argc, char *argv[])
{
	long long mem_size;
	unsigned long fork_times;

	if (argc != 3) {
		printf("usage: memsize forktimes !\n");
		return -EINVAL;
	}

	mem_size = atoll(argv[1]);
	fork_times = atoi(argv[2]);

	char *x = malloc(mem_size);
	memset(x, 1, mem_size);

	struct itimerval period;
	period.it_interval.tv_sec = 0;
	period.it_interval.tv_usec = 100000;
	period.it_value.tv_sec = 3;
	period.it_value.tv_usec = 0;

	struct timeval t1;
	struct timeval t2;

	signal(SIGALRM, incr_count);
	setitimer(ITIMER_REAL, &period, NULL);

	int i = 0, pid;
	gettimeofday(&t1, NULL);
	while (i < fork_times) {
		pid = fork();
		if (pid == 0)
			exit(0);
		wait(NULL);
		i++;
	}

	unsigned long long time_used;
	gettimeofday(&t2, NULL);
	time_used = (t2.tv_sec * 1000ULL * 1000 + t2.tv_usec) -
		    (t1.tv_sec * 1000ULL * 1000 + t1.tv_usec);
	printf("mem_size=%llu && fork_times=%lu, used %llu us\n",
	       mem_size, fork_times, time_used);
	return 0;
}

My test environment is a kvm virtual machine:
  <memory unit='KiB'>20486144</memory>
  <currentMemory unit='KiB'>20485760</currentMemory>
  <vcpu placement='static'>10</vcpu>

My test cases are:

TEST 1: mem_size=8G && fork_times=4096

a. Without this patch:
# uname -a
Linux localhost.localdomain 4.18.0-next-20180816 #1 SMP Fri Aug 17 09:33:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
# for i in {0..15}; do echo 3 > /proc/sys/vm/drop_caches; ./a.out 8589934592 4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
mem_size=8589934592 && fork_times=4096, used 54079373 us
mem_size=8589934592 && fork_times=4096, used 50768614 us
mem_size=8589934592 && fork_times=4096, used 58499333 us
mem_size=8589934592 && fork_times=4096, used 50798437 us
mem_size=8589934592 && fork_times=4096, used 53412134 us
mem_size=8589934592 && fork_times=4096, used 54974291 us
mem_size=8589934592 && fork_times=4096, used 57183614 us
mem_size=8589934592 && fork_times=4096, used 55201697 us
mem_size=8589934592 && fork_times=4096, used 51308532 us
mem_size=8589934592 && fork_times=4096, used 54910070 us
mem_size=8589934592 && fork_times=4096, used 54145605 us
mem_size=8589934592 && fork_times=4096, used 51561948 us
mem_size=8589934592 && fork_times=4096, used 50645891 us
mem_size=8589934592 && fork_times=4096, used 52478923 us
mem_size=8589934592 && fork_times=4096, used 52967142 us
mem_size=8589934592 && fork_times=4096, used 54354484 us

average = (54079373+50768614+58499333+50798437+53412134+54974291+57183614+55201697+51308532+54910070+54145605+51561948+50645891+52478923+52967142+54354484)/16 = 53580630.5

b.
With this patch:
# uname -a
Linux localhost.localdomain 4.18.0fix-next-20180816+ #4 SMP Mon Aug 27 09:47:11 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
# for i in {0..15}; do echo 3 > /proc/sys/vm/drop_caches; ./a.out 8589934592 4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
mem_size=8589934592 && fork_times=4096, used 50220661 us
mem_size=8589934592 && fork_times=4096, used 40172114 us
mem_size=8589934592 && fork_times=4096, used 46645951 us
mem_size=8589934592 && fork_times=4096, used 46042630 us
mem_size=8589934592 && fork_times=4096, used 46228679 us
mem_size=8589934592 && fork_times=4096, used 45128850 us
mem_size=8589934592 && fork_times=4096, used 43085861 us
mem_size=8589934592 && fork_times=4096, used 43290896 us
mem_size=8589934592 && fork_times=4096, used 46307499 us
mem_size=8589934592 && fork_times=4096, used 45169518 us
mem_size=8589934592 && fork_times=4096, used 42832446 us
mem_size=8589934592 && fork_times=4096, used 44796410 us
mem_size=8589934592 && fork_times=4096, used 45700605 us
mem_size=8589934592 && fork_times=4096, used 43767852 us
mem_size=8589934592 && fork_times=4096, used 44515256 us
mem_size=8589934592 && fork_times=4096, used 43434547 us

average = (50220661+40172114+46645951+46042630+46228679+45128850+43085861+43290896+46307499+45169518+42832446+44796410+45700605+43767852+44515256+43434547)/16 = 44833735.9375

delta = (53580630.5 - 44833735.9375)/53580630.5 = 16.3%

=================================================================

TEST 2: mem_size=4G && fork_times=4096

a. Without this patch:
# uname -a
Linux localhost.localdomain 4.18.0-next-20180816 #1 SMP Fri Aug 17 09:33:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
# for i in {0..15}; do echo 3 > /proc/sys/vm/drop_caches; ./a.out 4294967296 4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
mem_size=4294967296 && fork_times=4096, used 18256498 us
mem_size=4294967296 && fork_times=4096, used 18167115 us
mem_size=4294967296 && fork_times=4096, used 18577742 us
mem_size=4294967296 && fork_times=4096, used 18529440 us
mem_size=4294967296 && fork_times=4096, used 15664382 us
mem_size=4294967296 && fork_times=4096, used 19681474 us
mem_size=4294967296 && fork_times=4096, used 20842160 us
mem_size=4294967296 && fork_times=4096, used 17733025 us
mem_size=4294967296 && fork_times=4096, used 18448357 us
mem_size=4294967296 && fork_times=4096, used 19143284 us
mem_size=4294967296 && fork_times=4096, used 18283942 us
mem_size=4294967296 && fork_times=4096, used 19133080 us
mem_size=4294967296 && fork_times=4096, used 17829583 us
mem_size=4294967296 && fork_times=4096, used 19044675 us
mem_size=4294967296 && fork_times=4096, used 18316882 us
mem_size=4294967296 && fork_times=4096, used 18687442 us

average = (18256498+18167115+18577742+18529440+15664382+19681474+20842160+17733025+18448357+19143284+18283942+19133080+17829583+19044675+18316882+18687442)/16 = 18521192.5625

b.
With this patch:
# uname -a
Linux localhost.localdomain 4.18.0fix-next-20180816+ #4 SMP Mon Aug 27 09:47:11 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
# for i in {0..15}; do echo 3 > /proc/sys/vm/drop_caches; ./a.out 4294967296 4096 ; echo 3 > /proc/sys/vm/drop_caches; sleep 5; done;
mem_size=4294967296 && fork_times=4096, used 15173732 us
mem_size=4294967296 && fork_times=4096, used 15231416 us
mem_size=4294967296 && fork_times=4096, used 15853240 us
mem_size=4294967296 && fork_times=4096, used 16448929 us
mem_size=4294967296 && fork_times=4096, used 15101745 us
mem_size=4294967296 && fork_times=4096, used 14808725 us
mem_size=4294967296 && fork_times=4096, used 13834210 us
mem_size=4294967296 && fork_times=4096, used 15297005 us
mem_size=4294967296 && fork_times=4096, used 14256755 us
mem_size=4294967296 && fork_times=4096, used 14701890 us
mem_size=4294967296 && fork_times=4096, used 16032061 us
mem_size=4294967296 && fork_times=4096, used 14864268 us
mem_size=4294967296 && fork_times=4096, used 14040698 us
mem_size=4294967296 && fork_times=4096, used 13486647 us
mem_size=4294967296 && fork_times=4096, used 15362027 us
mem_size=4294967296 && fork_times=4096, used 14086739 us

average = (15173732+15231416+15853240+16448929+15101745+14808725+13834210+15297005+14256755+14701890+16032061+14864268+14040698+13486647+15362027+14086739)/16 = 14911255.4375

delta = (18521192.5625 - 14911255.4375)/18521192.5625 = 19.5%

Thanks,
The COW copy optimization ended up being merged as commit 1b2de5d039c8 ("mm/cow: don't bother write protecting already write-protected pages") because I decided to just add it before the merge window closed. So it will be in 4.19. But thanks for the numbers. They are still just for a test program rather than any real load, but it's not like that change should possibly hurt anything.