Distribution: Debian sid Hardware Environment: ThinkPad T42 Problem Description: According to POSIX, per-process timers (those created by timer_create) should be "disarmed and deleted" by an exec. While these timers are indeed deleted by the time the new process image runs, it's possible for a signal to arrive _while_ loading the new process image, before the timers have been deleted. The timer signal is, in turn, delivered to the new process, usually killing it. The exact sequence of events is: 1. The initial process creates a realtime per-process timer, set up to deliver a signal when the timer expires. 2. The initial process calls exec(). 3. The per-process timer expires, causing the timer signal to be queued in the pending signals for the process. (kernel/posix-timers.c:posix_timer_event) 4. All of the timers associated with the current process are deleted. (fs/exec.c:flush_old_exec -> fs/exec.c:de_thread -> kernel/posix-timers.c:exit_itimers) 5. Signal handlers are flushed, setting the action for the timer signal to the default, presumably exit. (fs/exec.c:flush_old_exec -> kernel/signal.c:flush_signal_handlers) 6. The kernel finishes loading the new process image. 7. The pending timer signal queued in step 3 gets delivered to the unsuspecting exec'd process. One possible solution would be to remove pending timer-generated signals after deleting the timers. These signals are marked with an si_code of SI_TIMER, so they should be easy to identify. Unfortunately, simply disabling timers earlier probably isn't sufficient to completely eliminate the race condition, unless it can be guaranteed that timer signals will be disabled between the entry to the system call and the point where the timers are deleted.
Created attachment 15777 [details] Tickle race condition This program tickles the described race condition. It must be compiled with -lrt because it uses timer_create. Because it's a race condition, it will try up to 10,000 times or until the child process is terminated with SIGVTALRM. I set the timer to 0.1 ms, which is always sufficient to produce the race on the first try on my 1.5 GHz Pentium M. Placing the system under some load exacerbates the problem. I found that running with strace -f -e 'trace=process,timer_create' -e 'signal=!vtalrm' can be informative.
cc's added. I'm rather surprised that this hasn't been seen before.
Austin, I think your analysis is excellent and correct. But to me this all looks like "expected behaviour". I may be wrong. This test case creates the source (timer) for the fatal signal (SIGVTALRM) which has a handler, and does exec(). I think it is very "obvious" that this signal can kill the task as soon as it "loses" the signal handler, and this of course happens when the task execs. Suppose we have a O_CLOEXEC+FASYNC file, and we do fcntl(F_SETSIG, SIGVTALRM). Now the task can be killed during exec() in the same manner, "as expected". As you correctly pointed out, we can't fix this problem by disabling the source of SIGVTALRM earlier, we can enter exec() when the signal is already pending. And exec() must preserve the pending signals. Perhaps SI_TIMER signals are "special" ? I don't know. Anybody? Oleg.
On 04/17, Oleg Nesterov wrote: > > Perhaps SI_TIMER signals are "special" ? I don't know. Forgot to mention... flush_signal_handlers() can remove all unblocked signals which have a handler, this should fix the problem. But again, I am not sure this would be right. Oleg.
> But to me this all looks like "expected behaviour". I may be wrong. I grappled with this for a while, too, before deciding that this really was not expected behavior. Consider that the _intent_ of a per-process timer is to stay within the process boundaries imposed by a fork or exec. Delivering the timer signal across such a boundary violates this intent. > > Perhaps SI_TIMER signals are "special" ? I don't know. > > Forgot to mention... > > flush_signal_handlers() can remove all unblocked signals which have > a handler, this should fix the problem. But again, I am not sure > this would be right. Ooh, I hadn't thought of that. I think a combination of these is really what's needed. Simply removing all unblocked signals which have a handler isn't POSIX, since pending signals must cross into the new process image. Such a signal could have been produced by, say, a kill from another process, in which case it would be incorrect to drop such a signal on the floor. The timer case differs because the delivery or non-delivery of the offending signal is merely an implementation artifact of how long it takes exec to get around to calling exit_itimers. However, the set of pending signals that are not blocked and have (si_code == SI_TIMER) should be _precisely_ the timer signals that were delivered _after_ entry to exec. Deleting exactly these signals would be observationally equivalent with deleting the timers at the instant exec is entered.
On 04/16, bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=10460 > > ------- Comment #5 from amdragon+kernelbugzilla@mit.edu 2008-04-16 22:41 > ------- > > But to me this all looks like "expected behaviour". I may be wrong. > > I grappled with this for a while, too, before deciding that this > really was not expected behavior. Consider that the _intent_ of a > per-process timer is to stay within the process boundaries imposed by > a fork or exec. Delivering the timer signal across such a boundary > violates this intent. Yes, I see your point... but still I am not sure. Don't get we wrong, I don't pretend I _know_ what is the right behaviour. Hopefully Roland knows. Do you have some authoritative docs? > > flush_signal_handlers() can remove all unblocked signals which have > > a handler, this should fix the problem. But again, I am not sure > > this would be right. > > Ooh, I hadn't thought of that. I think a combination of these is > really what's needed. Simply removing all unblocked signals which > have a handler isn't POSIX, since pending signals must cross into the > new process image. Yes sure. But please note "which have a handler above", we can pretend that the signal was already dequeued and processed before exec(), it has a handler. But yes, I agree, this doesn't look very right to me. > The timer case differs because the > delivery or non-delivery of the offending signal is merely an > implementation artifact of how long it takes exec to get around to > calling exit_itimers. > > However, the set of pending signals that are not blocked and have > (si_code == SI_TIMER) should be _precisely_ the timer signals that > were delivered _after_ entry to exec. No. As I said, we can enter exec() when SI_TIMER (or other) signal is already pending. Not that it matters though. And we have the same issues with O_CLOEXEC+FASYNC files. Should we delete these signals too? And, to clarify, we can't actually find SI_TIMER signal. Because we don't have them after exec() does exit_itimers(). release_posix_timer()->sigqueue_free() removes "struct sigqueue" from list. But SIGVTALRM is still pending. So. We can 1. change sigqueue_free() to sigdelset(q->info.si_signo) if the are no other (rt-) signals with the same .si_signo (looks reasonable, but I don't think this problem is timer- specific) 2. change exec() to flush all unblocked pending signals which have a handler 3. decide the current behaviour is correct (my personal choice ;) I can make a patch for 1 or 2, but what do you all think? Oleg.
In general, pending signals are supposed to affect the new program. It's wrong to flush them. If not blocked, either they should take effect before the exec, or they should be delivered with SIG_DFL action immediately after the exec. Signals from timer_create timers are a slightly special case. But that's only because timers are deleted at exec time. For the pending signal once the timer has fired, POSIX says it's unspecified (for the case of deleting or disarming the timer, and the wording for exec says it's implicitly deleted, which implies the same "unspecified"). So an application is not conforming if it relies on not getting any such signal carried across exec. The principle of least astonishment suggests #1. It's a surprise initially even to me that sigqueue_free leaves the signal in the pending set when it removes the queue entry. Since collect_signal clears the signal's pending bit after dequeuing the last entry with matching si_signo, it would be consistent for sigqueue_free to do the same.
On 04/17, bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=10460 > > ------- Comment #7 from roland@redhat.com 2008-04-17 13:07 ------- > > Signals from timer_create timers are a slightly special case. But that's > only > because timers are deleted at exec time. What about O_CLOEXEC+FASYNC files ? > The principle of least astonishment suggests #1. It's a surprise initially > even to me that sigqueue_free leaves the signal in the pending set when it > removes the queue entry. OK. Will do the patch on Sunday. There is a little complication, looking at "struct sigqueue" we can't figure out where it is pending, in ->pending or ->shared_pending. Fortunately, release_posix_timer() can check SIGEV_THREAD_ID. Oleg.
SIGIO is not any kind of special case. Once it's pending, it's pending.
Created attachment 15842 [details] discard the pending signal when the timer is destroyed Austin, could you try these [PATCH 1/2] posix timers: (BUG 10460) discard the pending signal when the timer is destroyed http://marc.info/?l=linux-kernel&m=120888210417700 [RFC,PATCH 2/2] posix timers: don't discard the signal if the timer was explicitly destroyed http://marc.info/?l=linux-kernel&m=120888210417698 patches? (the first one should fix the problem). Somehow your test program doesn't work for me, it doesn't trigger the race. Oleg.
> Somehow your test program doesn't work for me, it doesn't trigger the race. OK, I slightly modified this program and now it does trigger the race reliably. I also verified that the first patch fixes the problem. Austin, do you agree we can close this BUG?
My apologies for not getting back to you about the patches. I'm running 2.6.24 and the patches were against (I assume) 2.6.25, which I wasn't able to get to boot on my system. I tried backporting them just as an experiment, but the signals code had changed quite a bit. > OK, I slightly modified this program and now it does > trigger the race reliably. Out of curiosity, what did you have to change? I wonder if the reason it worked for me and not for you is that I'm running 2.6.24? > Austin, do you agree we can close this BUG? Sounds good to me. Thank you for your efforts.
On 04/27, bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=10460 > > ------- Comment #12 from amdragon+kernelbugzilla@mit.edu 2008-04-27 10:32 > ------- > I'm running 2.6.24 > and the patches were against (I assume) 2.6.25, Ah! sorry. Even worse, the patch is against -mm. > > OK, I slightly modified this program and now it does > > trigger the race reliably. > > Out of curiosity, what did you have to change? Please see the trivial patch at the end. > I wonder if the reason it > worked for me and not for you is that I'm running 2.6.24? I have tried it on 2.6.19 too, doesn't work. Hmm... Now I spawned 2 instances of the unpatched (except a fix for execl) program on 2.6.19, this works, but not always. Strange... > > Austin, do you agree we can close this BUG? > > Sounds good to me. Great, thanks! Oleg. --- TST.c~ 2008-04-22 15:25:46.000000000 +0400 +++ TST.c 2008-04-27 16:31:33.000000000 +0400 @@ -39,7 +39,7 @@ void setuptimer(void) // With 1 ms, I can usually get through 10's of exec's before it // fails. With 0.1 ms, it reliably fails on the first exec for // me. I've seen the race happen with timers as high as 20 ms. - it.it_value.tv_nsec = 1000000 / 10; + it.it_value.tv_nsec = 1; it.it_interval = it.it_value; if (timer_settime(timer, 0, &it, NULL) != 0) perror("timer_settime"); @@ -49,6 +49,14 @@ int main(int argc, char **argv) { int i,pid; +for (i = 0; i < 4; ++i) + if (!fork()) + goto dotest; + +while (wait(NULL) > 0) + ; + +dotest: // Repeat the test until the race occurs. With a really short // timer, this should happen pretty quickly. for (i = 0; i < 10000; ++i) { @@ -60,21 +68,23 @@ int main(int argc, char **argv) continue; else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGVTALRM) { printf("Child received erroneous SIGVTALRM\n"); - exit(0); + kill(0, SIGKILL); } else { printf("Child exited for unknown reason\n"); - exit(0); + kill(0, SIGKILL); } } else { // Child - char buf[10]; - sprintf(buf, "%d", i); +// char buf[10]; +// sprintf(buf, "%d", i); // Create a POSIX per-process timer setuptimer(); // Exec another binary. If the race occurs, a SIGVTALRM will be // queued for this child before the per-process timer is deleted // by exec. - execl("/bin/echo", "echo", buf); + execl("/bin/true", "true", NULL); + perror("exec"); + kill(0, SIGKILL); } } return 0;