Bug 10460

Summary: race involving POSIX timers in exec
Product: Timers Reporter: Austin Clements (amdragon+kernelbugzilla)
Component: OtherAssignee: john stultz (john.stultz)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, mingo, oleg, roland, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Tickle race condition
discard the pending signal when the timer is destroyed

Description Austin Clements 2008-04-16 15:36:17 UTC
Distribution: Debian sid
Hardware Environment: ThinkPad T42
Problem Description:

According to POSIX, per-process timers (those created by timer_create)
should be "disarmed and deleted" by an exec.  While these timers are
indeed deleted by the time the new process image runs, it's possible
for a signal to arrive _while_ loading the new process image, before
the timers have been deleted.  The timer signal is, in turn, delivered
to the new process, usually killing it.

The exact sequence of events is:

1. The initial process creates a realtime per-process timer, set up to
   deliver a signal when the timer expires.

2. The initial process calls exec().

3. The per-process timer expires, causing the timer signal to be
   queued in the pending signals for the process.
   (kernel/posix-timers.c:posix_timer_event)

4. All of the timers associated with the current process are deleted.
   (fs/exec.c:flush_old_exec -> fs/exec.c:de_thread ->
   kernel/posix-timers.c:exit_itimers)

5. Signal handlers are flushed, setting the action for the timer
   signal to the default, presumably exit.  (fs/exec.c:flush_old_exec
   -> kernel/signal.c:flush_signal_handlers)

6. The kernel finishes loading the new process image.

7. The pending timer signal queued in step 3 gets delivered to the
   unsuspecting exec'd process.

One possible solution would be to remove pending timer-generated
signals after deleting the timers.  These signals are marked with an
si_code of SI_TIMER, so they should be easy to identify.

Unfortunately, simply disabling timers earlier probably isn't
sufficient to completely eliminate the race condition, unless it can
be guaranteed that timer signals will be disabled between the entry to
the system call and the point where the timers are deleted.
Comment 1 Austin Clements 2008-04-16 15:42:20 UTC
Created attachment 15777 [details]
Tickle race condition

This program tickles the described race condition.  It must be compiled with -lrt because it uses timer_create.

Because it's a race condition, it will try up to 10,000 times or until the child process is terminated with SIGVTALRM.  I set the timer to 0.1 ms, which is always sufficient to produce the race on the first try on my 1.5 GHz Pentium M.  Placing the system under some load exacerbates the problem.

I found that running with
  strace -f -e 'trace=process,timer_create' -e 'signal=!vtalrm'
can be informative.
Comment 2 Andrew Morton 2008-04-16 15:49:22 UTC
cc's added.

I'm rather surprised that this hasn't been seen before.
Comment 3 Oleg Nesterov 2008-04-16 17:11:56 UTC
Austin, I think your analysis is excellent and correct.

But to me this all looks like "expected behaviour". I may be wrong.

This test case creates the source (timer) for the fatal signal (SIGVTALRM)
which has a handler, and does exec(). I think it is very "obvious" that this
signal can kill the task as soon as it "loses" the signal handler, and this
of course happens when the task execs.

Suppose we have a O_CLOEXEC+FASYNC file, and we do fcntl(F_SETSIG, SIGVTALRM).
Now the task can be killed during exec() in the same manner, "as expected".

As you correctly pointed out, we can't fix this problem by disabling the
source of SIGVTALRM earlier, we can enter exec() when the signal is already
pending. And exec() must preserve the pending signals.

Perhaps SI_TIMER signals are "special" ? I don't know.

Anybody?

Oleg.
Comment 4 Oleg Nesterov 2008-04-16 17:29:15 UTC
On 04/17, Oleg Nesterov wrote:
>
> Perhaps SI_TIMER signals are "special" ? I don't know.

Forgot to mention...

flush_signal_handlers() can remove all unblocked signals which have
a handler, this should fix the problem. But again, I am not sure this
would be right.

Oleg.
Comment 5 Austin Clements 2008-04-16 22:41:12 UTC
> But to me this all looks like "expected behaviour". I may be wrong.

I grappled with this for a while, too, before deciding that this
really was not expected behavior.  Consider that the _intent_ of a
per-process timer is to stay within the process boundaries imposed by
a fork or exec.  Delivering the timer signal across such a boundary
violates this intent.

> > Perhaps SI_TIMER signals are "special" ? I don't know.
> 
> Forgot to mention...
> 
> flush_signal_handlers() can remove all unblocked signals which have
> a handler, this should fix the problem. But again, I am not sure
> this would be right.

Ooh, I hadn't thought of that.  I think a combination of these is
really what's needed.  Simply removing all unblocked signals which
have a handler isn't POSIX, since pending signals must cross into the
new process image.  Such a signal could have been produced by, say, a
kill from another process, in which case it would be incorrect to drop
such a signal on the floor.  The timer case differs because the
delivery or non-delivery of the offending signal is merely an
implementation artifact of how long it takes exec to get around to
calling exit_itimers.

However, the set of pending signals that are not blocked and have
(si_code == SI_TIMER) should be _precisely_ the timer signals that
were delivered _after_ entry to exec.  Deleting exactly these signals
would be observationally equivalent with deleting the timers at the
instant exec is entered.
Comment 6 Oleg Nesterov 2008-04-17 04:38:08 UTC
On 04/16, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10460
> 
> ------- Comment #5 from amdragon+kernelbugzilla@mit.edu  2008-04-16 22:41
> -------
> > But to me this all looks like "expected behaviour". I may be wrong.
> 
> I grappled with this for a while, too, before deciding that this
> really was not expected behavior.  Consider that the _intent_ of a
> per-process timer is to stay within the process boundaries imposed by
> a fork or exec.  Delivering the timer signal across such a boundary
> violates this intent.

Yes, I see your point... but still I am not sure.

Don't get we wrong, I don't pretend I _know_ what is the right behaviour.
Hopefully Roland knows.

Do you have some authoritative docs?

> > flush_signal_handlers() can remove all unblocked signals which have
> > a handler, this should fix the problem. But again, I am not sure
> > this would be right.
> 
> Ooh, I hadn't thought of that.  I think a combination of these is
> really what's needed.  Simply removing all unblocked signals which
> have a handler isn't POSIX, since pending signals must cross into the
> new process image.

Yes sure. But please note "which have a handler above", we can pretend
that the signal was already dequeued and processed before exec(), it has
a handler.

But yes, I agree, this doesn't look very right to me.

> The timer case differs because the
> delivery or non-delivery of the offending signal is merely an
> implementation artifact of how long it takes exec to get around to
> calling exit_itimers.
>
> However, the set of pending signals that are not blocked and have
> (si_code == SI_TIMER) should be _precisely_ the timer signals that
> were delivered _after_ entry to exec.

No. As I said, we can enter exec() when SI_TIMER (or other) signal is
already pending. Not that it matters though.

And we have the same issues with O_CLOEXEC+FASYNC files. Should we
delete these signals too?

And, to clarify, we can't actually find SI_TIMER signal. Because
we don't have them after exec() does exit_itimers().
release_posix_timer()->sigqueue_free() removes "struct sigqueue"
from list. But SIGVTALRM is still pending.

So. We can

	1. change sigqueue_free() to sigdelset(q->info.si_signo) if
	   the are no other (rt-) signals with the same .si_signo

	   (looks reasonable, but I don't think this problem is timer-
	    specific)

	2. change exec() to flush all unblocked pending signals which
	   have a handler

	3. decide the current behaviour is correct (my personal choice ;)

I can make a patch for 1 or 2, but what do you all think?

Oleg.
Comment 7 Roland McGrath 2008-04-17 13:07:46 UTC
In general, pending signals are supposed to affect the new program.  It's wrong to flush them.  If not blocked, either they should take effect before the exec, or they should be delivered with SIG_DFL action immediately after the exec.

Signals from timer_create timers are a slightly special case.  But that's only because timers are deleted at exec time.  For the pending signal once the timer has fired, POSIX says it's unspecified (for the case of deleting or disarming the timer, and the wording for exec says it's implicitly deleted, which implies the same "unspecified").  So an application is not conforming if it relies on not getting any such signal carried across exec.

The principle of least astonishment suggests #1.  It's a surprise initially even to me that sigqueue_free leaves the signal in the pending set when it removes the queue entry.  Since collect_signal clears the signal's pending bit after dequeuing the last entry with matching si_signo, it would be consistent for sigqueue_free to do the same.
Comment 8 Oleg Nesterov 2008-04-17 13:20:32 UTC
On 04/17, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10460
> 
> ------- Comment #7 from roland@redhat.com  2008-04-17 13:07 -------
> 
> Signals from timer_create timers are a slightly special case.  But that's
> only
> because timers are deleted at exec time.

What about O_CLOEXEC+FASYNC files ?

> The principle of least astonishment suggests #1.  It's a surprise initially
> even to me that sigqueue_free leaves the signal in the pending set when it
> removes the queue entry.

OK. Will do the patch on Sunday.

There is a little complication, looking at "struct sigqueue" we can't figure
out where it is pending, in ->pending or ->shared_pending. Fortunately,
release_posix_timer() can check SIGEV_THREAD_ID.

Oleg.
Comment 9 Roland McGrath 2008-04-17 13:40:01 UTC
SIGIO is not any kind of special case.  Once it's pending, it's pending.
Comment 10 Oleg Nesterov 2008-04-22 09:46:11 UTC
Created attachment 15842 [details]
discard the pending signal when the timer is destroyed

Austin, could you try these

        [PATCH 1/2] posix timers: (BUG 10460) discard the pending signal when the timer is destroyed
        http://marc.info/?l=linux-kernel&m=120888210417700

        [RFC,PATCH 2/2] posix timers: don't discard the signal if the timer was explicitly destroyed
        http://marc.info/?l=linux-kernel&m=120888210417698

patches? (the first one should fix the problem).

Somehow your test program doesn't work for me, it doesn't trigger the race.

Oleg.
Comment 11 Oleg Nesterov 2008-04-27 05:55:21 UTC
> Somehow your test program doesn't work for me, it doesn't trigger the race.

OK, I slightly modified this program and now it does
trigger the race reliably.

I also verified that the first patch fixes the problem.

Austin, do you agree we can close this BUG?
Comment 12 Austin Clements 2008-04-27 10:32:51 UTC
My apologies for not getting back to you about the patches.  I'm running 2.6.24 and the patches were against (I assume) 2.6.25, which I wasn't able to get to boot on my system.  I tried backporting them just as an experiment, but the signals code had changed quite a bit.

> OK, I slightly modified this program and now it does
> trigger the race reliably.

Out of curiosity, what did you have to change?  I wonder if the reason it worked for me and not for you is that I'm running 2.6.24?

> Austin, do you agree we can close this BUG?

Sounds good to me.  Thank you for your efforts.
Comment 13 Oleg Nesterov 2008-04-28 02:43:29 UTC
On 04/27, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10460
> 
> ------- Comment #12 from amdragon+kernelbugzilla@mit.edu  2008-04-27 10:32
> -------
> I'm running 2.6.24
> and the patches were against (I assume) 2.6.25,

Ah! sorry. Even worse, the patch is against -mm.

> > OK, I slightly modified this program and now it does
> > trigger the race reliably.
> 
> Out of curiosity, what did you have to change?

Please see the trivial patch at the end.

> I wonder if the reason it
> worked for me and not for you is that I'm running 2.6.24?

I have tried it on 2.6.19 too, doesn't work. Hmm... Now I spawned
2 instances of the unpatched (except a fix for execl) program on
2.6.19, this works, but not always. Strange...

> > Austin, do you agree we can close this BUG?
> 
> Sounds good to me.

Great, thanks!

Oleg.

--- TST.c~	2008-04-22 15:25:46.000000000 +0400
+++ TST.c	2008-04-27 16:31:33.000000000 +0400
@@ -39,7 +39,7 @@ void setuptimer(void)
   // With 1 ms, I can usually get through 10's of exec's before it
   // fails.  With 0.1 ms, it reliably fails on the first exec for
   // me.  I've seen the race happen with timers as high as 20 ms.
-  it.it_value.tv_nsec = 1000000 / 10;
+  it.it_value.tv_nsec = 1;
   it.it_interval = it.it_value;
   if (timer_settime(timer, 0, &it, NULL) != 0)
     perror("timer_settime");
@@ -49,6 +49,14 @@ int main(int argc, char **argv)
 {
   int i,pid;
 
+for (i = 0; i < 4; ++i)
+	if (!fork())
+		goto dotest;
+
+while (wait(NULL) > 0)
+	;
+
+dotest:
   // Repeat the test until the race occurs.  With a really short
   // timer, this should happen pretty quickly.
   for (i = 0; i < 10000; ++i) {
@@ -60,21 +68,23 @@ int main(int argc, char **argv)
         continue;
       else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGVTALRM) {
         printf("Child received erroneous SIGVTALRM\n");
-        exit(0);
+        kill(0, SIGKILL);
       } else {
         printf("Child exited for unknown reason\n");
-        exit(0);
+        kill(0, SIGKILL);
       }
     } else {
       // Child
-      char buf[10];
-      sprintf(buf, "%d", i);
+//      char buf[10];
+//      sprintf(buf, "%d", i);
       // Create a POSIX per-process timer
       setuptimer();
       // Exec another binary.  If the race occurs, a SIGVTALRM will be
       // queued for this child before the per-process timer is deleted
       // by exec.
-      execl("/bin/echo", "echo", buf);
+      execl("/bin/true", "true", NULL);
+      perror("exec");
+      kill(0, SIGKILL);
     }
   }
   return 0;