Bug 15395 - Synchronous SIGSEGV signal gets the wrong processor context
Summary: Synchronous SIGSEGV signal gets the wrong processor context
Status: RESOLVED OBSOLETE
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: process_other
URL: http://bugs.winehq.org/show_bug.cgi?i...
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-25 11:02 UTC by Alexandre Julliard
Modified: 2013-12-02 11:01 UTC (History)
10 users (show)

See Also:
Kernel Version: 2.6.32
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Test case to demonstrate the problem (4.05 KB, text/plain)
2010-02-25 11:02 UTC, Alexandre Julliard
Details
example application to illustrate desired signal handling (5.16 KB, text/plain)
2010-02-26 02:13 UTC, Michael Builov
Details
2.6.33 seems to fix the issue. (703.31 KB, text/plain)
2010-02-26 14:45 UTC, Kai Wasserbäch
Details
Wine still triggers the bug (Wine 1.1.39, Kernel 2.6.33 and WINDEBUG=+seh) (497.65 KB, text/plain)
2010-02-26 16:38 UTC, Kai Wasserbäch
Details

Description Alexandre Julliard 2010-02-25 11:02:27 UTC
Created attachment 25209 [details]
Test case to demonstrate the problem

A problem has been found with Wine where we sometimes receive a SIGSEGV with an incorrect context. What seems to happen is that if we receive an asynchronous SIGUSR1 at the same time as a SIGSEGV fault is raised, the signal context that is passed to the SIGSEGV handler is not the context at the point of the fault, but the context that has been already modified to call the SIGUSR1 handler. In particular the instruction pointer points to the first instruction of the SIGUSR1 handler instead of to the faulting instruction.

Blocking SIGSEGV in the SIGUSR1 sigaction mask works around the problem, but of course it prevents catching a SIGSEGV that would happen inside the SIGUSR1 handler.

A test case demonstrating the bug has been written by Michael Builov and is attached. More information can be found in the related Wine bug report at http://bugs.winehq.org/show_bug.cgi?id=20380#c78
Comment 1 Michael Builov 2010-02-25 20:57:48 UTC
Here the sequence of calls at kernel side which leads to incorrect process context in sigsegv-handler on user-side:

(linux-2.6.32.6)
arch/x86/mm/fault.c:946         do_page_fault()
arch/x86/mm/fault.c:773         bad_area_access_error(...)
arch/x86/mm/fault.c:733         force_sig_info_fault(SIGSEGV, ...)
...
arch/x86/kernel/signal.c:691    handle_signal(SIGUSR1)
...
arch/x86/kernel/signal.c:691    handle_signal(SIGSEGV)


looks like the problem because at the moment of segfault thread had pending SIGUSR1. New signal SIGSEGV was queued, and thread was switched to process SIGUSR1.
But, at the very first step of sigusr1 handler thread was switched again to process SIGSEGV - like SIGSEGV was generated in sigusr1 handler, which isn't true.

Possible solution - postpone handling of signals which were queued _before_ currently processing signal until exit signal handler, i.e. don't try to handle SIGSEGV until finish with sigusr1 handler.
Of course, if new SIGSEGV generated in sigusr1 handler - handler should be interrupted to process SIGSEGV - signal was generated _after_ SIGUSR1.
Comment 2 Michael Builov 2010-02-26 02:13:22 UTC
Created attachment 25233 [details]
example application to illustrate desired signal handling

the program emulates signal handling in a way there no attempt to process some signal that was not generated in current signal handler:

 xsigusr1_handler<0>
  new signal: xSIGSEGV<0>
  new signal: xSIGUSR2<0>
  xsigsegv_handler<0>, pending: xSIGUSR2<0>
  xsigusr2_handler<0>

program output shows that in xsigusr1_handler() were obtained two signals: xSIGSEGV and xSIGUSR2. Then xsigusr1_handler() was interrupted to process xSIGSEGV, but handling of xSIGUSR2 was delayed until exit from xsigsegv_handler().

This is possible by using slightly modified fifo:

struct sig_fifo_entry {
    struct sig_fifo_entry *next;
};

struct sig_fifo {
    struct sig_fifo_entry *top;
    struct sig_fifo_entry **p_bottom;
};

static void sig_fifo_init(struct sig_fifo *sf)
{
    sf->top = NULL;
    sf->p_bottom = &sf->top;
}

static void sig_fifo_push_back(struct sig_fifo *sf, struct sig_fifo_entry *e)
{
    e->next = *sf->p_bottom;
    *sf->p_bottom = e;
    sf->p_bottom = &e->next;
}

static struct sig_fifo_entry *sig_fifo_pop_front(struct sig_fifo *sf)
{
    struct sig_fifo_entry *e = sf->top;
    if (e == *sf->p_bottom)
        /* we are at the bottom: no more signals available
           for the current signal handler */
        return NULL;
    sf->top = e->next;
    if (sf->p_bottom == &e->next)
        /* removing last entry for current signal handler,
           new entry will be pushed at top */
        sf->p_bottom = &sf->top;
    return e;
}

static struct sig_fifo_entry **sig_fifo_enter(struct sig_fifo *sf)
{
    struct sig_fifo_entry **p_old_bottom = sf->p_bottom;
    sf->p_bottom = &sf->top;
    return p_old_bottom;
}

static void sig_fifo_leave(struct sig_fifo *sf, struct sig_fifo_entry **p_old_bottom)
{
    sf->p_bottom = p_old_bottom;
}

static void do_handle_signal(struct sig_fifo_entry *e)
{
    struct sig_fifo_entry **p_old_bottom = sig_fifo_enter(&current->sigfifo);
    e->handler();
    sig_fifo_leave(&current->sigfifo, p_old_bottom);
}

The main idea is to remember (save of stack) position in fifo of signals queued _before_ the signal we want handle and then make those signals invisible until exit signal handler.
New signals, generated _after_ we entered signal handler will be inserted at that stored position in fifo - they should be processed before exiting signal handler.
And finally, after handler, we unhide pending signals and process them.
Comment 3 Alexandre Julliard 2010-02-26 13:32:49 UTC
Blocking signals while running the handler can be done through the sigaction mask, it's not the problem here. We don't care if SIGUSR1 interrupts the SIGSEGV handler, but it shouldn't be allowed to interrupt between the faulting instruction and the corresponding signal frame setup.
Comment 4 Kai Wasserbäch 2010-02-26 14:45:12 UTC
Created attachment 25245 [details]
2.6.33 seems to fix the issue.

While I was able to reproduce this bug with a 2.6.32.x (up till .9) kernel, see <http://bugs.winehq.org/show_bug.cgi?id=20380#c79>, I'm no longer able to trigger it with 2.6.33 kernel (see attached output). I'll check it with Wine ASAP.
Comment 5 Kai Wasserbäch 2010-02-26 16:38:20 UTC
Created attachment 25246 [details]
Wine still triggers the bug (Wine 1.1.39, Kernel 2.6.33 and WINDEBUG=+seh)

No, the bug isn't fixed with 2.6.33, though the test case in attachment 25209 [details] doesn't hit the "BUG" line anymore, I can still trigger the issue with Wine as described in WineHQ bug #20380.
Comment 6 Gustaw Smolarczyk 2010-02-27 12:06:22 UTC
#4: Your first attachment still has some BUG lines, but just a few (grep is a good tool :) I was testing this earlier with 2.6.33-rcX kernels and was able to reprocude it easily.
Comment 7 Kai Wasserbäch 2010-02-27 12:18:07 UTC
(In reply to comment #6)
I used grep, but apparently on the wrong file or something similar silly. I confirm, that the test case still triggers on 2.6.33; sorry for the noise.
Comment 8 Michael Builov 2010-02-28 12:36:40 UTC
#3: Yes, you are right - the "BUG" appears when SIGUSR1 is sent between bad_area_access_error() and send_signal(SIGSEGV).

(linux-2.6.32.6)
arch/x86/mm/fault.c:946         do_page_fault()
arch/x86/mm/fault.c:773         bad_area_access_error(...)
arch/x86/mm/fault.c:733         force_sig_info_fault(SIGSEGV, ...)
kernel/signal.c:914             send_signal(SIGUSR1)   <------------- "BUG"
kernel/signal.c:914             send_signal(SIGSEGV)
kernel/signal.c:914             send_signal(SIGUSR1)
arch/x86/kernel/signal.c:691    handle_signal(SIGUSR1)
kernel/signal.c:914             send_signal(SIGUSR1)
kernel/signal.c:914             send_signal(SIGUSR1)
arch/x86/kernel/signal.c:691    handle_signal(SIGSEGV)
kernel/signal.c:914             send_signal(SIGUSR1)
kernel/signal.c:914             send_signal(SIGUSR1)
kernel/signal.c:914             send_signal(SIGUSR1)
...
Comment 9 Michael Builov 2010-02-28 14:01:47 UTC
#8: But I'm also right - SIGSEGV sent _before_ handle_signal(SIGUSR1), so handle_signal(SIGSEGV) should be delayed until exit from sigusr1_handler on user-side.
Comment 10 Oleg Nesterov 2010-03-02 12:38:39 UTC
(In reply to comment #0)
> 
> A problem has been found with Wine where we sometimes receive a SIGSEGV with
> an
> incorrect context. What seems to happen is that if we receive an asynchronous
> SIGUSR1 at the same time as a SIGSEGV fault is raised, the signal context
> that
> is passed to the SIGSEGV handler is not the context at the point of the
> fault,
> but the context that has been already modified to call the SIGUSR1 handler.

Yes, this is correct.

Note that SIGUSR1 < SIGSEGV, SIGUSR1 will be dequeued first. Then we
dequeue SIGSEGV but its context will point to the handler for SIGUSR1.

This was already discussed, please see

    http://marc.info/?t=123704955800002
    http://marc.info/?t=123726362700004

I still think it probably makes sense to add _sigfault._insn, though.
Comment 11 Oleg Nesterov 2010-03-02 12:45:05 UTC
(In reply to comment #9)
>
> #8: But I'm also right - SIGSEGV sent _before_ handle_signal(SIGUSR1), so
> handle_signal(SIGSEGV) should be delayed until exit from sigusr1_handler on
> user-side.

No, the task never returns to user-mode with the pending signal.

So. I didn't actually read the test-case, please let me know if
I missed something. But so far I think this is WONTFIX.
Comment 12 Linus Torvalds 2010-03-02 16:37:57 UTC
On Tue, 2 Mar 2010, wylda@volny.cz wrote:
> 
> i'm trying to get some sunshine for this hidden thing... Sorry if i put
> wrong people on CC.
> 
> This thing is already reported and also there is a test case program
> for that.
> 
> If you are interested and receive another "Thank you!", come and see:
> http://bugzilla.kernel.org/show_bug.cgi?id=15395

Hmm. That's clearly a QoI problem, even if _technically_ it doesn't 
necessarily count as a bug (it's not all that well-defined behavior). I do 
agree that synchronous signals should take precedence over asynchronous 
ones, but we don't currently have any logic to do that (except the logic 
to say that SIGKILL overrides any other signal which is different).

Oleg, I think this one falls squarely under your expertise. I think what 
we want to make sure is that thread-local signals sent by the kernel 
(ie force_sig_info()) need to be de-queued before any other signals.

I'm not seeing any really good way to do that with the current structure, 
though. Sure, we can hack it up with a special case for certain signal 
numbers in __dequeue_signal(), but it sure as hell wouldn't be _pretty_.

Here is an _untested_ patch that may or may not fix it. It's ugly as hell, 
but it's coded so that at least on x86 it doesn't really add all that many 
instructions. The whole

	if (x & Y)
		x &= Y;

thing can be compiled into (and gcc does do it) just three instructions:

	movq    %rdx, %rax
        andl    $Y, %eax
        cmovne  %rax, %rdx

so it's not horrible from a code generation standpoint.

It _is_ very hacky, though. And as mentioned, it is entirely and utterly 
untested. Caveat emptor.

Oleg, do you see any cleaner way to do this?

		Linus

---
 kernel/signal.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..40c04f7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -159,6 +159,11 @@ void recalc_sigpending(void)
 
 /* Given the mask, find the first available signal that should be serviced. */
 
+#define PRIMARY_MASK (sigmask(SIGKILL))
+#define SYNCHRONOUS_MASK \
+	(sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
+	 sigmask(SIGTRAP) | sigmask(SIGFPE))
+
 int next_signal(struct sigpending *pending, sigset_t *mask)
 {
 	unsigned long i, *s, *m, x;
@@ -166,26 +171,42 @@ int next_signal(struct sigpending *pending, sigset_t *mask)
 
 	s = pending->signal.sig;
 	m = mask->sig;
+
+	/*
+	 * Handle the first word specially: it contains
+	 * the synchronous signals and SIGKILL, that need
+	 * to be dequeued first.
+	 */
+	x = *s &~ *m;
+	if (x) {
+		if (x & PRIMARY_MASK)
+			x &= PRIMARY_MASK;
+		if (x & SYNCHRONOUS_MASK)
+			x &= SYNCHRONOUS_MASK;
+		sig = ffz(~x) + 1;
+		return sig;
+	}
+
 	switch (_NSIG_WORDS) {
 	default:
-		for (i = 0; i < _NSIG_WORDS; ++i, ++s, ++m)
-			if ((x = *s &~ *m) != 0) {
-				sig = ffz(~x) + i*_NSIG_BPW + 1;
-				break;
-			}
+		for (i = 1; i < _NSIG_WORDS; ++i) {
+			x = *++s &~ *++m;
+			if (!x)
+				continue;
+			sig = ffz(~x) + i*_NSIG_BPW + 1;
+			break;
+		}
 		break;
 
-	case 2: if ((x = s[0] &~ m[0]) != 0)
-			sig = 1;
-		else if ((x = s[1] &~ m[1]) != 0)
-			sig = _NSIG_BPW + 1;
-		else
+	case 2:
+		x = s[1] &~ m[1];
+		if (!x)
 			break;
-		sig += ffz(~x);
+		sig = ffz(~x) + _NSIG_BPW + 1;
 		break;
 
-	case 1: if ((x = *s &~ *m) != 0)
-			sig = ffz(~x) + 1;
+	case 1:
+		/* Nothing to do */
 		break;
 	}
Comment 13 Linus Torvalds 2010-03-02 18:20:32 UTC
On Tue, 2 Mar 2010, Oleg Nesterov wrote:
>
> No. But I have a very crazy idea, which I'd like to discuss anyway ;)
> 
> 1. Perhaps we can add the new member, siginfo_t->_sigfault.ip ?
>    In this case user-space can always find the faulting address
>    in info->si_ip. This shouldn't enlarge the size of siginfo_t.

I don't think that really help.

The problem is that programs that deeply care about the IP generally do so 
for a _reason_: they are going to want to do something about the SIGSEGV. 
And that "do something" usually involves fixing something up.

That, in turn, will almost always eventually lead to looking at other 
thread state too, and then returning to the faulting instruction (where 
"returning" may well be about changing IP _and_ other register state 
first: it migth be a siglongjump, but it might also be a case of "emulate 
the instruction in the signal handler, and then return to the _next_ 
instruction").

So the instruction pointer is just the first thing that needs to be 
correct for a synchronous part. Some users might care about _only_ the IP, 
but I think that is actually just a very uninteresting subset of all real 
cases.

So we really do want to take synchronous faults with the real signal 
context pointing to the instruction that caused the fault (or at least the 
state - some synchronous faults might have executed the instruction and 
then raised the exception afterwards). That's what happens for some debug 
exceptions.

And the thing is, especially the "raise the fault afterwards" behavior is 
something that will (again) only then cover a subset of the cases. Your 
suggestion of just re-doing the instruction (with extra complexity in the 
signal handler) would work for the common faulting cases, but not for the 
general case.

So I think my patch is really hacky, because it depends on the signal 
number in ways that are kind of fundamentally broken. I'm not claiming 
it's pretty. It would be better to have the "this is a synchronous signal" 
state follow the signal information down all the way. But we don't have 
anything like that, although adding a 'siginfo' for forced signals would 
be possible.

At the same time, I also think that if some process then does 
_asynchronous_ SIGSEGV's or SIGBUS signals (by just using "tkill()" 
directly or whatever) can just blame itself. I don't think we really need 
to care. So I think my patch is hacky, but I also think it's probably 
sufficient and workable.

(And untested. Once more: I have not tested it. I'm lazy, and I'm 
convinced all my code is perfect on the first try, so I'm not even going 
to. Somebody else is quite welcome to test it ;)

		Linus
Comment 14 Oleg Nesterov 2010-03-02 19:25:11 UTC
On 03/02, Linus Torvalds wrote:
>
> On Tue, 2 Mar 2010, Oleg Nesterov wrote:
> >
> > No. But I have a very crazy idea, which I'd like to discuss anyway ;)
> >
> > 1. Perhaps we can add the new member, siginfo_t->_sigfault.ip ?
> >    In this case user-space can always find the faulting address
> >    in info->si_ip. This shouldn't enlarge the size of siginfo_t.
>
> I don't think that really help.
>
> The problem is that programs that deeply care about the IP generally do so
> for a _reason_: they are going to want to do something about the SIGSEGV.
> And that "do something" usually involves fixing something up.

Yes, yes, sure. That is why I suggested 2.

> And the thing is, especially the "raise the fault afterwards" behavior is
> something that will (again) only then cover a subset of the cases. Your
> suggestion of just re-doing the instruction (with extra complexity in the
> signal handler) would work for the common faulting cases, but not for the
> general case.

OK.

> So I think my patch is really hacky, because it depends on the signal
> number in ways that are kind of fundamentally broken. I'm not claiming
> it's pretty. It would be better to have the "this is a synchronous signal"
> state follow the signal information down all the way.

OK, another idea, not pretty too. Again, again, just to discuss the
alternatives.

What if we change force_sig_info() to use TS_RESTORE_SIGMASK logic?
Pseudo code:

	--- kernel/signal.c
	+++ kernel/signal.c
	@@ -1057,6 +1057,14 @@ force_sig_info(int sig, struct siginfo *
		}
		if (action->sa.sa_handler == SIG_DFL)
			t->signal->flags &= ~SIGNAL_UNKILLABLE;
	+
	+	if (t == current) {
	+		WARN_ON(thread_info->status & TS_RESTORE_SIGMASK);
	+		set_restore_sigmask();
	+		t->saved_sigmask = t->blocked;
	+		t->blocked |= BLOCK_MORE_SIGNALS;
	+	}
	+
		ret = specific_send_sig_info(sig, info, t);
		spin_unlock_irqrestore(&t->sighand->siglock, flags);
	 

The obvious problem is that this is user-visible change, the signal
handler will run with BLOCK_MORE_SIGNALS blocked. So I don't really
this this is acceptable.

> At the same time, I also think that if some process then does 
> _asynchronous_ SIGSEGV's or SIGBUS signals (by just using "tkill()" 
> directly or whatever) can just blame itself. I don't think we really need 
> to care. So I think my patch is hacky, but I also think it's probably 
> sufficient and workable.

Yes, agreed. Will try to double check your change tomorrow.

Oleg.
Comment 15 Oleg Nesterov 2010-03-02 19:27:09 UTC
On 03/02, Linus Torvalds wrote:
>
> Hmm. That's clearly a QoI problem, even if _technically_ it doesn't
> necessarily count as a bug (it's not all that well-defined behavior).

Yes. We already discussed this problem, and it was decided we should
live with the current behaviour: http://marc.info/?t=123726362700004

But personally I agree, we should change the kernel to help the user
space.

> I do
> agree that synchronous signals should take precedence over asynchronous
> ones, but we don't currently have any logic to do that

Probably this is the best and simplest option (Roland suggested this
too).

> (except the logic
> to say that SIGKILL overrides any other signal which is different).

I don't think we should worry about SIGKILL (PRIMARY_MASK in your patch),
the task can't return to user-space if it has a pending SIGKILL.

I'll try to read this patch carefully tomorrow, although it looks
"obviously correct". But yes, checking the signal number is not what
we actually want and, say, tkill(SIGBUS) still can confuse SIGSEGV.

> Oleg, do you see any cleaner way to do this?

No. But I have a very crazy idea, which I'd like to discuss anyway ;)

1. Perhaps we can add the new member, siginfo_t->_sigfault.ip ?
   In this case user-space can always find the faulting address
   in info->si_ip. This shouldn't enlarge the size of siginfo_t.

However, this can't fix ucontext.

2. Add siginfo_t->_sigfault.ip, but do not expose it to user-space.
   IOW, do not change copy_siginfo_to_user(__SI_FAULT) although this
   doesn't really matter.

   Instead, we change get_signal_to_deliver:

	signr = dequeue_signal(info);

	...

	// probably we need a more careful check ...
	if ((info->si_code & __SI_MASK) == __SI_FAULT) &&
					signr != SIGTRAP) {

		// OK, it was sent by the kernel, it is private,
		// ->si_ip points to the instruction which triggered
		// the fault.

		if (info->si_ip != regs->ip) {
			// This can only happen if we "raced" with
			// another signal SIGXXX.

			// Pretend that SIGXXX was recieved before
			// this task executed the faulted instruction
			// and return.

			// Just return. User space will repeat the
			// same instruction and get another SEGV/ILL/etc
			// after it handles SIGXXX.

			return 0;
		}
	}

What do you all think?

Let me repeat, if you think this is just stupid (even _if_ correct) -
I agree in advance. I am just trying to think about any possible
alternative.

Oleg.
Comment 16 Linus Torvalds 2010-03-02 19:48:05 UTC
On Tue, 2 Mar 2010, Oleg Nesterov wrote:
> 
> OK, another idea, not pretty too. Again, again, just to discuss the
> alternatives.
> 
> What if we change force_sig_info() to use TS_RESTORE_SIGMASK logic?

Ok, I kind of like this.

> The obvious problem is that this is user-visible change, the signal
> handler will run with BLOCK_MORE_SIGNALS blocked. So I don't really
> this this is acceptable.

We've had other user-visible changes wrt forced signals. In particular, we 
did that whole "punch through even blocked and ignored signals" change 
back in commit 9e008c3c in the historical linux tree (ie BK times).

So forced signals have always (well, for a _loong_ time) had somewhat 
special semantics, and changing them in subtle ways is not necessarily 
wrong.

So I think your patch is conceptually nice, but that WARN_ON() makes me 
nervous. I don't think it should ever happen, but..

		Linus
Comment 17 Oleg Nesterov 2010-03-02 20:18:27 UTC
On 03/02, Linus Torvalds wrote:
>
> We've had other user-visible changes wrt forced signals. In particular, we
> did that whole "punch through even blocked and ignored signals" change
> back in commit 9e008c3c in the historical linux tree (ie BK times).
>
> So forced signals have always (well, for a _loong_ time) had somewhat
> special semantics, and changing them in subtle ways is not necessarily
> wrong.
>
> So I think your patch is conceptually nice,

OK, then I'll try to make the patch for further review tomorrow.

I am still not sure it is better than your simple change.

> but that WARN_ON() makes me
> nervous. I don't think it should ever happen, but..

Yes, this WARN_ON() is in fact the "FIXME" marker. Afaics, the only
case when force_sig() can hit TS_RESTORE_SIGMASK is handle_signal()->
setup_rt_frame() path, I'll try to see what we can do.

Oleg.
Comment 18 Linus Torvalds 2010-03-02 20:28:45 UTC
On Tue, 2 Mar 2010, Oleg Nesterov wrote:

> On 03/02, Linus Torvalds wrote:
> >
> > We've had other user-visible changes wrt forced signals. In particular, we
> > did that whole "punch through even blocked and ignored signals" change
> > back in commit 9e008c3c in the historical linux tree (ie BK times).
> >
> > So forced signals have always (well, for a _loong_ time) had somewhat
> > special semantics, and changing them in subtle ways is not necessarily
> > wrong.
> >
> > So I think your patch is conceptually nice,
> 
> OK, then I'll try to make the patch for further review tomorrow.

Actually, I take that back.

It won't work, will it? If you get a SIGSEGV while running the SIGBUS 
handler, the TS_RESTORE_SIGMASK approach will mean that the signal will be 
blocked _during_ the SIGBUS handler, which in turn means that now the 
SIGSEGV will be reset to a SIG_DFL. So no nested faults can ever happen.

I was thinking that it was just blocking other signals until the signal 
handler started running, but that's not how it works, is it?

		Linus
Comment 19 Oleg Nesterov 2010-03-02 21:10:13 UTC
On 03/02, Linus Torvalds wrote:
>
> On Tue, 2 Mar 2010, Oleg Nesterov wrote:
>
> > On 03/02, Linus Torvalds wrote:
> > >
> > > We've had other user-visible changes wrt forced signals. In particular,
> we
> > > did that whole "punch through even blocked and ignored signals" change
> > > back in commit 9e008c3c in the historical linux tree (ie BK times).
> > >
> > > So forced signals have always (well, for a _loong_ time) had somewhat
> > > special semantics, and changing them in subtle ways is not necessarily
> > > wrong.
> > >
> > > So I think your patch is conceptually nice,
> >
> > OK, then I'll try to make the patch for further review tomorrow.
>
> Actually, I take that back.
>
> It won't work, will it? If you get a SIGSEGV while running the SIGBUS
> handler, the TS_RESTORE_SIGMASK approach will mean that the signal will be
> blocked _during_ the SIGBUS handler, which in turn means that now the
> SIGSEGV will be reset to a SIG_DFL. So no nested faults can ever happen.

Yes. I didn't mean we should block other BUS/ILL/etc signals, only those
which don't belong to SYNCHRONOUS_MASK.

But somehow I didn't realize that then tkill(BUS) can race with SIGSEGV
again. This means that this approach has almost zero advantages compared
to your much simpler patch.

So, lets forget about it.

> I was thinking that it was just blocking other signals until the signal
> handler started running, but that's not how it works, is it?

Yes, that is why this patch would add the user-visible change, another
reason to prefer your patch.

Oleg.
Comment 20 Linus Torvalds 2010-03-03 01:53:54 UTC
On Tue, 2 Mar 2010, Roland McGrath wrote:
> 
> Nope.  After set_restore_sigmask(), we'll restore saved_sigmask before
> getting to user mode if there is no handler to run.  But that's not what it
> does when there is a handler.  Instead, it records the saved_sigmask in the
> handler frame to be restored by sigreturn instead of recording the current
> blocked set before setting up the handler--but the handler setup itself
> just adds sa_mask to the existing blocked set, not the saved one.

Yeah.

> TBH, it is not at all clear to me why it works this way.

If I recall correctly, the original reason for the whole 
set_restore_sigmask() was just 'sigsuspend()'. The pselect/ppoll use came 
later.

And the sigsuspend() man-page clearly says that the signal mask is restore 
_after the signal handler returns. 

So I do think our set_restore_mask() semantics are correct, it's just not 
what we would have wanted for this case, and it was just me being confused 
and not remembering the exact semantics that made me go "that's a good 
idea".

> I think the natural expectations are that a forced signal always gets
> delivered, gets delivered with its proper siginfo_t data, and gets
> delivered first (before other non-forced signals).  That seems pretty
> reasonable from a user perspective.
> 
> So how about queuing forced signals specially to meet those expectations?
> 
> For force_sig_info(), use list_add() instead of list_add_tail() so it goes
> straight on the head of the queue.  Mark the entry in sigqueue.flags or
> somewhere in siqueue.info (as Linus mentioned) to indicate it's a forced
> signal.

Since signal delivery starts out with trying to figure out the signal 
number, we'd have to change that whole thing, and that probably means that 
we should just use a different queue entirely for the synchronous signals. 

Then, instead of the whole "next_signal()+collect_signal()" dance, we'd 
just have a a simple test at the very top of signal delivery that checks 
the queue of synchronous signals, and pops things off that queue first.

> We could make this path skip the legacy_queue() check.  Then we don't lose
> siginfo_t in my scenario above.  I don't know if there are any ways to get
> two authentic forced signals with the same signal number on one trip into
> the kernel, but if there are, it would make those stop losing information.

I'd almost prefer to go the other way: just ever allow one single 
outstanding synchronous signal, and make the "synchronous signal queue" 
really just a very trivial flag indeed.

If an application gets a SIGBUS while handling a SIGSEGV (or any other 
such combination), it is going to have to rely on the alternate stack 
handling _anyway_. So I don't think there is ever a situation where we can 
really usefully have multiple pending synchronous signals at the same 
time.

That said, I'd hate to have to write the code for any of the above. 

		Linus
Comment 21 Roland McGrath 2010-03-03 03:02:07 UTC
> I was thinking that it was just blocking other signals until the signal 
> handler started running, but that's not how it works, is it?

Nope.  After set_restore_sigmask(), we'll restore saved_sigmask before
getting to user mode if there is no handler to run.  But that's not what it
does when there is a handler.  Instead, it records the saved_sigmask in the
handler frame to be restored by sigreturn instead of recording the current
blocked set before setting up the handler--but the handler setup itself
just adds sa_mask to the existing blocked set, not the saved one.

TBH, it is not at all clear to me why it works this way.  I would have to
remind myself about the reasons behind the uses of set_restore_sigmask()
and think through all those details.  For our purposes here, it would seem
to make a lot more sense if saved_sigmask were restored before the handler
setup.  That way, it's still the same mask stored to be restored by
sigreturn.  But for the running of the handler itself, we wouldn't keep the
"artifical" blocked set, just the saved_sigmask + sa_mask.  That is what
Linus had in mind.  Like I said, I'm not at all sure why it shouldn't be
that way--but we would need to think through it all quite carefully to be
sure about that.

Not that we shouldn't figure out and elucidate those corners of the
restore_sigmask stuff.  But, this is all a rather roundabout way to
get at one specific problem and that problem is not really directly
about the blocked signal mask at all (it's about the order of signal
delivery).  That makes me think perhaps we should consider a more
direct approach.

Another problem scenario in the same vein comes to mind.  I think this
could be entirely within the realm of "get what you deserve", but I'll
describe it anyway.  Consider that SIGSEGV (or whichever) is already both
blocked and pending, presumably from a tgkill or something like that.
This sounds pathological and contrived at first, but imagine a handler:

	void fatal(int sig)
	{
		cleanup();
		signal(sig, SIG_DFL);
		raise(sig);
	}

This is a traditional thing to do.  You handle the real fault so you can
clean up, or print out something useful in your log, or whatever.  Then you
reset to SIG_DFL and re-raise the signal so that you get a core dump that
has the right signal number in it.  Since "sig" is blocked while its
handler runs, raise() (i.e. tgkill) just makes the re-sent signal pending,
with the vanilla siginfo_t for a tgkill.  (When the handler returns, the
blocked signal set will be restored by sigreturn and so the re-sent signal
should then get delivered immediately.)

Now, say we are at the epilogue of that handler when it ran for some proper
SIGSEGV.  It's after the raise (tgkill), so the new SIGSEGV is pending.
It's before the sigreturn, so SIGSEGV is still blocked.  Now, say we get a
new SIGSEGV via force_sig_info().  For example, cleanup() clobbered the
stack frame so the PC or SP fatal() tries to return to is bogus.

Since SIGSEGV is blocked, force_sig_info() will unblock it and reset it to
SIG_DFL.  But, since SIGSEGV is already pending, the legacy_queue() logic
will punt the new siginfo_t from the actual fault and leave the queued one
from the tgkill.  So when it comes to delivery, we are delivering the
forced signal (i.e. that's why it's not blocked any more) but with the
siginfo_t of a signal that by rights could not have been delivered (because
it was blocked).

Now, perhaps this doesn't matter.  Since it's SIG_DFL (i.e. core dump), you
might be hard-pressed to notice the siginfo_t anyway.  But, a ptracer can
fetch it with PTRACE_GETSIGINFO.  I think we now have tracepoints that
could be tracing the si_code and such in some fashion.  So we do lose
information that somebody could be expecting to see.

Now, that's a particular pathological case and it's a little hard to care
much about it.  There may well be other scenarios such as races when the
signal is not blocked, but I can't even imagine one off hand to contrive.
I can't say any such cases seem very important.  But I mention them to make
the point that there is more than just delivery order that might stray from
the "natural expectations" about what you get for a forced signal.  For
delivery order alone, mask fiddling is an obvious tactic.  But IMHO the
most satisfying solutions will be ones that cover these other corners too.

Incidentally, what I'm calling a "forced signal" is sometimes called a
"synchronous signal", and in particular I mean calls to force_sig_info() or
force_sig().  Signals sent with SEND_SIG_FORCED in place of siginfo_t are
not this kind of "forced signals", and perhaps we should change the name of
that constant not to be so similar to something with different semantics.
It's really SEND_SIG_NOQUEUE--not to be confused with SEND_SIG_NOINFO,
which is really SEND_SIG_DEFAULT.

I think the natural expectations are that a forced signal always gets
delivered, gets delivered with its proper siginfo_t data, and gets
delivered first (before other non-forced signals).  That seems pretty
reasonable from a user perspective.

So how about queuing forced signals specially to meet those expectations?

For force_sig_info(), use list_add() instead of list_add_tail() so it goes
straight on the head of the queue.  Mark the entry in sigqueue.flags or
somewhere in siqueue.info (as Linus mentioned) to indicate it's a forced
signal.  This could be a variant of the current path with e.g. another flag
argument to __send_signal(), or refactor __send_signal() into subroutines
be used there and by force_sig_info(), or else maybe just drive it all off
checking for the magic siginfo_t bits.  Then, in dequeue_signal() instead
of the normal next_signal() method, first check if the head of the queue is
a forced a signal.  If it is, just dequeue that one first no matter what.

We could make this path skip the legacy_queue() check.  Then we don't lose
siginfo_t in my scenario above.  I don't know if there are any ways to get
two authentic forced signals with the same signal number on one trip into
the kernel, but if there are, it would make those stop losing information.


Thanks,
Roland
Comment 22 Roland McGrath 2010-03-03 11:24:25 UTC
> So I do think our set_restore_mask() semantics are correct, it's just not 
> what we would have wanted for this case, and it was just me being confused 
> and not remembering the exact semantics that made me go "that's a good 
> idea".

Yes, you're right about sigsuspend.

> Since signal delivery starts out with trying to figure out the signal 
> number, we'd have to change that whole thing, and that probably means that 
> we should just use a different queue entirely for the synchronous signals. 
> 
> Then, instead of the whole "next_signal()+collect_signal()" dance, we'd 
> just have a a simple test at the very top of signal delivery that checks 
> the queue of synchronous signals, and pops things off that queue first.

Sure.  The only way that differs from what I said is that I suggested using
the front of the existing task_struct.pending queue instead of adding a
separate one.  

Aside from saving another list_head or suchlike somewhere (ultimately
inflating task_struct), the only motivation I had in mind for using the
existing queue is to glom all the existing logic about pending checks and
so forth.  But I think there may really be nothing more to supporting a
separate queue than one simple check in recalc_sigpending_tsk().  

I like the idea of a cleaner separation.  That makes it natural to remove
it entirely from the world of blocked and pending signal sets.  It's not
even a real signal at all, at least not when it's "generated".  We don't
actually need to fiddle with the sigaction and sigmask in force_sig_info.
We can just handle the forced signal specially when we "dequeue" it.

> > We could make this path skip the legacy_queue() check.  Then we don't lose
> > siginfo_t in my scenario above.  I don't know if there are any ways to get
> > two authentic forced signals with the same signal number on one trip into
> > the kernel, but if there are, it would make those stop losing information.
> 
> I'd almost prefer to go the other way: just ever allow one single 
> outstanding synchronous signal, and make the "synchronous signal queue" 
> really just a very trivial flag indeed.

For the case I cited, the one "real" (tgkill) queued signal and one special
synchronous signal would suffice.  The only implementation difference
between just one record and a queue is something like a siginfo_t pointer
vs a list_head pointing to struct sigqueue.  It can be quite simple either
way.

> If an application gets a SIGBUS while handling a SIGSEGV (or any other 
> such combination), it is going to have to rely on the alternate stack 
> handling _anyway_. So I don't think there is ever a situation where we can 
> really usefully have multiple pending synchronous signals at the same 
> time.

I can't quite tell what you have in mind here.  Are you talking about
sigaltstack?  It's true that you need to use sigaltstack if you are going
to run a signal handler for a fault related to the stack pointer in any
circumstances.  But I don't see what that really has to do with anything
we're talking about.  

I think you are talking about a handled synchronous signal and then having
something else happen in user mode after handler setup.  By the time you
get to user mode at the handler's entry point, that signal is not pending
any more--you've just finished delivering it.  If there is another
user-mode fault while running the handler (even its first instruction, or
the handler entry PC itself being a bogus address), then that's a fresh
synchronous signal with nothing else pending.

The case I'm talking about is where force_sig_info() is called twice before
the first time get_signal_to_deliver() is called--by definition there is no
"handling" or user mode in between.  I had not thought before of a way that
could actually happen, just postulated the possibility.  So far the only
actual way I've thought of since is with multiple watchpoints touched by a
single syscall (so two or more forced SIGTRAP inside one syscall).  That
one example is not especially compelling because those siginfo_t's don't
carry the important information anyway (there is dr6 instead).

I suspect we'll find others.  My inclination is to have a queue instead of
just one, and it's easy enough to implement.

> That said, I'd hate to have to write the code for any of the above. 

Oh, I'm confident Oleg can do it. ;-)
Having thought about it more now, I actually think that making it more
cleanly separate from the normal signals logic will make it simpler to
implement than it first might have seemed.

Anyway, I think we should be discussing this on the mailing list.
I really don't think the current behavior qualifies as a bug.
I've always known it worked that way, if you knew how to ask. ;-)

What it's traditionally done is well-enough specified and not
mysterious or unpredictable (not especially so, as these things go).
The things we're talking about now are spiffy new feature ideas for
a new and more gratifying set of guarantees about signals induced
by machine traps.  I think they may well be some lovely new things
in new kernel versions.  But I don't think I'd ever recommend them
for -stable releases, or call them "bug fixes" in any literal sense.


Thanks,
Roland
Comment 23 Oleg Nesterov 2010-03-03 17:04:50 UTC
On 03/02, Roland McGrath wrote:
>
> So how about queuing forced signals specially to meet those expectations?
>
> For force_sig_info(), use list_add() instead of list_add_tail() so it goes
> straight on the head of the queue.  Mark the entry in sigqueue.flags or
> somewhere in siqueue.info (as Linus mentioned) to indicate it's a forced
> signal.

I thought about this too ;) Well, almost, see below.

I think, we don't need other arguments or sigqueue.flags. send_sig() pathes
already have "int group" argument, currently it is in fact bool. We can rename
it to be "int private" first and change the code accordingly, it is still
used as a boolean.

Then we add "#define XXX_SYNCHRONOUS 2", and change force_sig_info() to
call send_signal(private => XXX_SYNCHRONOUS).

Then we add the last change which actually alters the code, __send_signal()
does

	if (private == XXX_SYNCHRONOUS)
		list_add();
	else
		list_tail();

as you suggested.

> We could make this path skip the legacy_queue() check.

I missed this case when I was thinking about this. Probably we can do
this too.

> Then, in dequeue_signal() instead
> of the normal next_signal() method, first check if the head of the queue is
> a forced a signal.  If it is, just dequeue that one first no matter what.

Do we really need to verify that the head of the queue is a forced a signal?
IOW, can't we just do

	static int collect_sigqueue(struct sigpending *pending, struct sigqueue *first,
					siginfo_t *info)
	{
		struct sigqueue *next = first;
		int sig = first->info.si_signo;

		list_for_each_entry_continue(next, &pending->list, list)
			if (next->info.si_signo == sig)
				goto still_pending;

		sigdelset(&pending->signal, sig);
	still_pending:
		copy_siginfo(info, &first->info);
		list_del_init(&first->list);
		__sigqueue_free(first);

		return sig;
	}

	static int __dequeue_signal(struct sigpending *pending, sigset_t *blocked,
					siginfo_t *info)
	{
		struct sigqueue *first;
		int sig;

		if (!PENDING(pending, blocked))
			return 0;

		list_for_each_entry(first, &pending->list, list)
			if (!sigismember(blocked, first->info.si_signo))
				return collect_sigqueue(pending, first, info);

		/* Ok, it wasn't in the queue.  This must be
		   a fast-pathed signal or we must have been
		   out of queue space.  So zero out the info.
		 */
		sig = next_signal(pending, blocked);
		sigdelset(&pending->signal, sig);
		info->si_signo = sig;
		info->si_errno = 0;
		info->si_code = SI_USER;
		info->si_pid = 0;
		info->si_uid = 0;

		return sig;
	}

?

Of course, this is a user-visible change again. Currently dequeue() sorts
the signals by number. Now we always dequeue the first queued signal in
the likely case.

Can we do this? If not, then __send_signal() can mark sigqueue as forced
and dequeue() can check the head, as you suggested.


Note that I ignored the ->notifier() code. It is absolutely wrong anyway
and needs a separate discussion.

From the performance pov, we use sigismember() instead of "==" to find
siginfo_t.


In any case Linus's patch is much, much simpler and workable. But I like
your idea even if it needs more changes.

Roland, Linus, what do you think?

Oleg.
Comment 24 Oleg Nesterov 2010-03-03 17:25:56 UTC
On 03/03, Roland McGrath wrote:
>
> I like the idea of a cleaner separation.  That makes it natural to remove
> it entirely from the world of blocked and pending signal sets.  It's not
> even a real signal at all, at least not when it's "generated".  We don't
> actually need to fiddle with the sigaction and sigmask in force_sig_info.
> We can just handle the forced signal specially when we "dequeue" it.

This will certainly cleanup the notion of synchronous signal. We have to
change recalc_sigpending_tsk(), but probably that is all.

So. I agree with any approach. Just tell me what you prefer ;)

Oleg.
Comment 25 Linus Torvalds 2010-03-03 17:57:35 UTC
On Wed, 3 Mar 2010, Roland McGrath wrote:
> > 
> > I'd almost prefer to go the other way: just ever allow one single 
> > outstanding synchronous signal, and make the "synchronous signal queue" 
> > really just a very trivial flag indeed.
> 
> For the case I cited, the one "real" (tgkill) queued signal and one special
> synchronous signal would suffice.  The only implementation difference
> between just one record and a queue is something like a siginfo_t pointer
> vs a list_head pointing to struct sigqueue.  It can be quite simple either
> way.

If the synchronous signals are handled on a separate queue, then the whole 
"tgkill signal vs synchronous signal" case turns trivial: you can have 
both pending at the same time independently. And the synchronous one will 
automatically always happen before the pending "tgkill" one.

> > If an application gets a SIGBUS while handling a SIGSEGV (or any other 
> > such combination), it is going to have to rely on the alternate stack 
> > handling _anyway_. So I don't think there is ever a situation where we can 
> > really usefully have multiple pending synchronous signals at the same 
> > time.
> 
> I can't quite tell what you have in mind here.  Are you talking about
> sigaltstack?  It's true that you need to use sigaltstack if you are going
> to run a signal handler for a fault related to the stack pointer in any
> circumstances.  But I don't see what that really has to do with anything
> we're talking about.  

So I'm just thinking about the "synchronous fault happens while setting up 
the stack for a previous synchronous fault", ie a nested synchronous 
fault. That would be a reason to allow the synchronous queue to be more 
than one deep.

But we don't actually turn the nested fault into a signal anyway, we turn 
it into a kernel put_user() exception and failure, and switch to 
altstacks. 

My point was that the synchronous fault shouldn't need a deeper queue than 
just one - so we can implement the whole "synchronous signal queue" as a 
single integer field in the per-thread signal struct.

> > That said, I'd hate to have to write the code for any of the above. 
> 
> Oh, I'm confident Oleg can do it. ;-)
> Having thought about it more now, I actually think that making it more
> cleanly separate from the normal signals logic will make it simpler to
> implement than it first might have seemed.

It should be pretty simple, but I'd really like to make the thing _really_ 
simple.  If done right, we could remove all the logic for punching through 
blocked signals etc, because the synchronous signal handling code itself 
would just ignore it.

IOW, something like the appended (TOTALLY UNTESTED)

		Linus
---
 include/linux/sched.h |    1 +
 kernel/signal.c       |   55 ++++++++++++++++++++++++-------------------------
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4b1753f..3a69d62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1383,6 +1383,7 @@ struct task_struct {
 /* signal handlers */
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
+	struct siginfo *sync_signal;
 
 	sigset_t blocked, real_blocked;
 	sigset_t saved_sigmask;	/* restored if set_restore_sigmask() was used */
diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..be08c02 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -125,6 +125,7 @@ static inline int has_pending_signals(sigset_t *signal, sigset_t *blocked)
 static int recalc_sigpending_tsk(struct task_struct *t)
 {
 	if (t->signal->group_stop_count > 0 ||
+	    t->sync_signal ||
 	    PENDING(&t->pending, &t->blocked) ||
 	    PENDING(&t->signal->shared_pending, &t->blocked)) {
 		set_tsk_thread_flag(t, TIF_SIGPENDING);
@@ -460,6 +461,17 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 {
 	int signr;
 
+	if (tsk->sync_signal) {
+		struct siginfo *n = tsk->sync_signal;
+
+		tsk->sync_signal = NULL;
+		copy_siginfo(info, n);
+		kmem_cache_free(sigqueue_cachep, n);
+
+		recalc_sigpending();
+		return info->si_signo;
+	}
+
 	/* We only dequeue private signals from ourselves, we don't let
 	 * signalfd steal them
 	 */
@@ -1027,40 +1039,27 @@ int do_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
 }
 
 /*
- * Force a signal that the process can't ignore: if necessary
- * we unblock the signal and change any SIG_IGN to SIG_DFL.
- *
- * Note: If we unblock the signal, we always reset it to SIG_DFL,
- * since we do not want to have a signal handler that was blocked
- * be invoked when user space had explicitly blocked it.
- *
- * We don't want to have recursive SIGSEGV's etc, for example,
- * that is why we also clear SIGNAL_UNKILLABLE.
+ * Force a signal that the process can't ignore.
  */
 int
 force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 {
-	unsigned long int flags;
-	int ret, blocked, ignored;
-	struct k_sigaction *action;
+	struct siginfo *n;
 
-	spin_lock_irqsave(&t->sighand->siglock, flags);
-	action = &t->sighand->action[sig-1];
-	ignored = action->sa.sa_handler == SIG_IGN;
-	blocked = sigismember(&t->blocked, sig);
-	if (blocked || ignored) {
-		action->sa.sa_handler = SIG_DFL;
-		if (blocked) {
-			sigdelset(&t->blocked, sig);
-			recalc_sigpending_and_wake(t);
-		}
-	}
-	if (action->sa.sa_handler == SIG_DFL)
-		t->signal->flags &= ~SIGNAL_UNKILLABLE;
-	ret = specific_send_sig_info(sig, info, t);
-	spin_unlock_irqrestore(&t->sighand->siglock, flags);
+	if (WARN_ONCE(t->sync_signal, "Nested synchronous signals?"))
+		return -EBUSY;
 
-	return ret;
+	n = kmem_cache_alloc(sigqueue_cachep, GFP_KERNEL);
+	if (!n)
+		return -ENOMEM;
+
+	memcpy(n, info, sizeof(*n));
+	n->si_signo = sig;
+
+	t->sync_signal = n;
+	recalc_sigpending_and_wake(t);
+
+	return 0;
 }
 
 /*
Comment 26 Oleg Nesterov 2010-03-03 18:04:48 UTC
On 03/02, Linus Torvalds wrote:
>
> +#define PRIMARY_MASK (sigmask(SIGKILL))
> +#define SYNCHRONOUS_MASK \
> +     (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
> +      sigmask(SIGTRAP) | sigmask(SIGFPE))
> +
>  int next_signal(struct sigpending *pending, sigset_t *mask)
>  {
>       unsigned long i, *s, *m, x;
> @@ -166,26 +171,42 @@ int next_signal(struct sigpending *pending, sigset_t
> *mask)
>
>       s = pending->signal.sig;
>       m = mask->sig;
> +
> +     /*
> +      * Handle the first word specially: it contains
> +      * the synchronous signals and SIGKILL, that need
> +      * to be dequeued first.
> +      */
> +     x = *s &~ *m;
> +     if (x) {
> +             if (x & PRIMARY_MASK)
> +                     x &= PRIMARY_MASK;
> +             if (x & SYNCHRONOUS_MASK)
> +                     x &= SYNCHRONOUS_MASK;
> +             sig = ffz(~x) + 1;
> +             return sig;
> +     }

Just in case...

I believe the code is correct. I don't think we really need PRIMARY_MASK,
we are not going to return to the user-mode if SIGKILL is pending, but it
doesn't hurt.

I am wondering why (before or after the patch) the code does ffz(~x)
instead of __ffs(x), but this is cosmetic anyway.

Oleg.
Comment 27 Linus Torvalds 2010-03-03 18:06:24 UTC
On Wed, 3 Mar 2010, Linus Torvalds wrote:
> 
> IOW, something like the appended (TOTALLY UNTESTED)

Btw, just to clarify: that really was a total throw-away patch. It may 
_look_ like a real diff, and a serious patch, but it was really not meant 
to be anything but a discussion piece. It's already gone from my working 
tree, and I'm not saving it etc.

In many ways, I like my original patch better. I hate how the latter one 
adds new slab allocation sites for that siginfo thing etc. I was literally 
considering adding a whole "struct siginfo" (not a pointer) into the 
"struct task_struct" just to avoid that whole thing.

I dunno.

				Linus
Comment 28 Linus Torvalds 2010-03-03 18:19:49 UTC
On Wed, 3 Mar 2010, Oleg Nesterov wrote:
> 
> Just in case...
> 
> I believe the code is correct. I don't think we really need PRIMARY_MASK,
> we are not going to return to the user-mode if SIGKILL is pending, but it
> doesn't hurt.
> 
> I am wondering why (before or after the patch) the code does ffz(~x)
> instead of __ffs(x), but this is cosmetic anyway.

Yeah, hysterical raisins. I think we had ffz() but not ffs(). 

I'm inclined to perhaps do this patch (with the SIGKILL case removed) for 
2.6.34. I'm not against doing the whole synchronous queue thing, but I'd 
rather not hurry some signal state cleanup for something like this.

Regardless, I'd still like to hear from the original bug reporters whether 
the patches actually fix the problem they see.

		Linus
Comment 29 Oleg Nesterov 2010-03-03 18:45:54 UTC
On 03/03, Linus Torvalds wrote:
>
> @@ -460,6 +461,17 @@ int dequeue_signal(struct task_struct *tsk, sigset_t
> *mask, siginfo_t *info)
>  {
>       int signr;
>
> +     if (tsk->sync_signal) {
> +             struct siginfo *n = tsk->sync_signal;
> +
> +             tsk->sync_signal = NULL;
> +             copy_siginfo(info, n);
> +             kmem_cache_free(sigqueue_cachep, n);
> +
> +             recalc_sigpending();
> +             return info->si_signo;
> +     }
> ...
>  force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
>  {
> ...
> +     if (WARN_ONCE(t->sync_signal, "Nested synchronous signals?"))
> +             return -EBUSY;
>
> +     n = kmem_cache_alloc(sigqueue_cachep, GFP_KERNEL);
> +     if (!n)
> +             return -ENOMEM;
> +
> +     memcpy(n, info, sizeof(*n));
> +     n->si_signo = sig;
> +
> +     t->sync_signal = n;
> +     recalc_sigpending_and_wake(t);
> +
> +     return 0;
>  }

kmem_cache_alloc(sigqueue_cachep) returns sigqueue, not siginfo, but this
doesn't matter.

We still need to clear SIGNAL_UNKILLABLE, or the signal will be ignored.

force_sig_info() assumes that this task will call dequeue_signal() at
least once before exiting, otherwise nobody will free ->sync_signal.
This is true when t == current, and I guess it is always current?

Well, __do_SAK(), zap_pid_ns_processes(), probably something else, do
force_sig(SIGKILL). They should be changed, SIGKILL doesn't need "force"
and with this change we can hit the WARN_ONCE() above or leak ->sync_signal.

Also. If t != current, I am not sure recalc_sigpending_and_wake() can
be used without ->siglock. If it is current, then we only need to set
TIF_SIGPENDING.

Oleg.
Comment 30 Linus Torvalds 2010-03-03 18:57:02 UTC
On Wed, 3 Mar 2010, Oleg Nesterov wrote:
> 
> force_sig_info() assumes that this task will call dequeue_signal() at
> least once before exiting, otherwise nobody will free ->sync_signal.
> This is true when t == current, and I guess it is always current?

Yeah, it leaks. Good call. Again, not doing the dynamic allocation at all 
(and just having one signal per thread statically allocated) would just 
fix that and simplify the thing further.

And I do think that t always should be current, because it smells very 
very wrong to me that anybody else could send a synchronous signal. You do 
point to

> Well, __do_SAK(), zap_pid_ns_processes(), probably something else, do
> force_sig(SIGKILL). They should be changed, SIGKILL doesn't need "force"
> and with this change we can hit the WARN_ONCE() above or leak ->sync_signal.

Correct. The whole point of "force" is pointless with SIGKILL, that has 
it's own private force ruoles.

> Also. If t != current, I am not sure recalc_sigpending_and_wake() can
> be used without ->siglock. If it is current, then we only need to set
> TIF_SIGPENDING.

Yeah. All in all, I think the approach is workable, but I'm not at all 
happy with the patch. As mentioned, it was meant as a discussion piece, 
not a serious patch.

			Linus
Comment 31 Oleg Nesterov 2010-03-03 19:31:44 UTC
On 03/03, Linus Torvalds wrote:
>
> And I do think that t always should be current, because it smells very
> very wrong to me that anybody else could send a synchronous signal.

Agreed. I think it makes sense to audit/change all force_sig()
callers in any case.

zap_pid_ns_processes() does have a reason to use force(SIGKILL),
we need to clear SIGNAL_UNKILLABLE, but this is solvable.

> Yeah. All in all, I think the approach is workable,

Agreed. I'd like to know what Roland thinks, then I'd be happy
to do anything you and Roland suggest.


Looks like, whatever we do, it makes sense to apply your simple
workaround for now, then revert it when it is no longer needed.

Oleg.
Comment 32 Wylda 2010-03-03 21:15:56 UTC
(In reply to comment #28)
> 
> Regardless, I'd still like to hear from the original bug reporters whether 
> the patches actually fix the problem they see.
> 
>         Linus

Well, i was reading all this, but apparantly just an English dictionary will not help me to understand all the mentioned things. So i'm volunteer for testing on a real app, where this causes troubles.

But treat with me like with a troll ;) i.e. say here is a patch at comment #xy pick it and come in two hours with result. Then i will come back and bring you what you asked for, My Lord :-))

(For now, i don't see anything i could feed my git tree...)
Comment 33 Oleg Nesterov 2010-03-03 21:34:35 UTC
On 03/03, wylda@volny.cz wrote:
>
> Well, i was reading all this, but apparantly just an English
> dictionary will not help me to understand all the mentioned things.
> So i'm volunteer for testing on a real app, where this causes troubles.

I think Linus meant the patch in #12 ;)

I'll send you this patch in plain-text privately in a minute...

Oleg.
Comment 34 Linus Torvalds 2010-03-03 21:41:44 UTC
On Wed, 3 Mar 2010, wylda@volny.cz wrote:
> 
> But treat with me like with a troll ;) i.e. say here is a patch
> at comment #xy pick it and come in two hours with result. Then
> i will come back and bring you what you asked for, My Lord :-))

This is what I'd suggest you try for now. We'll discuss the bigger issues 
more later if anybody has the energy left and cares enough ;)

This is the slightly simplified version that does away with making SIGKILL 
special, since we do that elsewhere anyway for other (although slightly 
similar! The original issue was always that SIGKILL needed to be handled 
before any other signal) reasons.

		Linus

---
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 2 Mar 2010 08:36:46 -0800
Subject: [PATCH] Prioritize synchronous signals over 'normal' signals

This makes sure that we pick the synchronous signals caused by a
processor fault over any pending regular asynchronous signals sent to
use by [t]kill().

This is not strictly required semantics, but it makes it _much_ easier
for programs like Wine that expect to find the fault information in the
signal stack.

Without this, if a non-synchronous signal gets picked first, the delayed
asynchronous signal will have its signal context pointing to the new
signal invocation, rather than the instruction that caused the SIGSEGV
or SIGBUS in the first place.

This is not all that pretty, and we're discussing making the synchronous
signals more explicit rather than have these kinds of implicit
preferences of SIGSEGV and friends.  See for example

	http://bugzilla.kernel.org/show_bug.cgi?id=15395

for some of the discussion.  But in the meantime this is a simple and
fairly straightforward work-around, and the whole

	if (x & Y)
		x &= Y;

thing can be compiled into (and gcc does do it) just three instructions:

	movq    %rdx, %rax
	andl    $Y, %eax
	cmovne  %rax, %rdx

so it is at least a simple solution to a subtle issue.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/signal.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..5bb9baf 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -159,6 +159,10 @@ void recalc_sigpending(void)
 
 /* Given the mask, find the first available signal that should be serviced. */
 
+#define SYNCHRONOUS_MASK \
+	(sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
+	 sigmask(SIGTRAP) | sigmask(SIGFPE))
+
 int next_signal(struct sigpending *pending, sigset_t *mask)
 {
 	unsigned long i, *s, *m, x;
@@ -166,26 +170,39 @@ int next_signal(struct sigpending *pending, sigset_t *mask)
 
 	s = pending->signal.sig;
 	m = mask->sig;
+
+	/*
+	 * Handle the first word specially: it contains the
+	 * synchronous signals that need to be dequeued first.
+	 */
+	x = *s &~ *m;
+	if (x) {
+		if (x & SYNCHRONOUS_MASK)
+			x &= SYNCHRONOUS_MASK;
+		sig = ffz(~x) + 1;
+		return sig;
+	}
+
 	switch (_NSIG_WORDS) {
 	default:
-		for (i = 0; i < _NSIG_WORDS; ++i, ++s, ++m)
-			if ((x = *s &~ *m) != 0) {
-				sig = ffz(~x) + i*_NSIG_BPW + 1;
-				break;
-			}
+		for (i = 1; i < _NSIG_WORDS; ++i) {
+			x = *++s &~ *++m;
+			if (!x)
+				continue;
+			sig = ffz(~x) + i*_NSIG_BPW + 1;
+			break;
+		}
 		break;
 
-	case 2: if ((x = s[0] &~ m[0]) != 0)
-			sig = 1;
-		else if ((x = s[1] &~ m[1]) != 0)
-			sig = _NSIG_BPW + 1;
-		else
+	case 2:
+		x = s[1] &~ m[1];
+		if (!x)
 			break;
-		sig += ffz(~x);
+		sig = ffz(~x) + _NSIG_BPW + 1;
 		break;
 
-	case 1: if ((x = *s &~ *m) != 0)
-			sig = ffz(~x) + 1;
+	case 1:
+		/* Nothing to do */
 		break;
 	}
Comment 35 Oleg Nesterov 2010-03-03 22:08:36 UTC
Forgot to mention, sorry...

On 03/03, Oleg Nesterov wrote:
>
> > Yeah. All in all, I think the approach is workable,
>
> Agreed. I'd like to know what Roland thinks, then I'd be happy
> to do anything you and Roland suggest.

Note that we still need more changes to ensure that k->sync_signal
really can't be ignored, but this is solvable. With this approach
we don't need to unblock the signal, but we have to s/IGN/DFL/ or
just send SIGKILL if it is ignored.

Oleg.
Comment 36 Wylda 2010-03-04 03:16:09 UTC
(In reply to comment #34)
> On Wed, 3 Mar 2010, wylda@volny.cz wrote:
> > 
> > But treat with me like with a troll ;) i.e. say here is a patch
> > at comment #xy pick it and come in two hours with result. Then
> > i will come back and bring you what you asked for, My Lord :-))
> 
> This is what I'd suggest you try for now. We'll discuss the bigger issues 
> more later if anybody has the energy left and cares enough ;)
> 
> This is the slightly simplified version that does away with making SIGKILL 
> special, since we do that elsewhere anyway for other (although slightly 
> similar! The original issue was always that SIGKILL needed to be handled 
> before any other signal) reasons.
> 
>         Linus
> 

Man, this is total success... :) To be sure it's not solved by something
else, i tried 2.6.33 with and without this patch. No more freezing and
crashing of the affected application :-D _after_ applying your's patch.

As i initially promised, i owe you... Thank you!


Pavel

PS: I'll give it more hard try during this day, but it looks good. Anyway
Alexandre / Michael could you also check yours test case, to be sure?
Comment 37 Wylda 2010-03-04 14:41:38 UTC
>
> Man, this is total success... :) To be sure it's not solved by
> something else, i tried 2.6.33 with and without this patch. No
> more freezing and crashing of the affected application :-D _after_
> applying your's patch.
>
> PS: I'll give it more hard try during this day, but it looks
> good...
>


Ok, another day of testing and i can happily declare Linus's patch rock
stable :)

Oleg, i noticed your patch: http://lkml.org/lkml/2010/3/3/387 Do i need
it together with Linus's?

In what term these patches slip into mainline? Could be 2.6.34 or longer?
Comment 38 Oleg Nesterov 2010-03-04 14:52:50 UTC
On 03/04, wylda@volny.cz wrote:
>
> Ok, another day of testing and i can happily declare Linus's patch rock
> stable :)

Great, thanks.

> Oleg, i noticed your patch: http://lkml.org/lkml/2010/3/3/387  Do i need
> it together with Linus's?

No, no. This is the preparation for the further changes.

Oleg.
Comment 39 Linus Torvalds 2010-03-04 16:43:27 UTC
On Thu, 4 Mar 2010, wylda@volny.cz wrote:
> 
> In what term these patches slip into mainline? Could be 2.6.34 or longer?

I committed the "prioritize synchronous signals" one, and just pushed it 
out. So yes, it will be in 2.6.34.

I didn't mark it as 'stable' material, but if this fixes real problems 
with Wine (or other programs), then I think it's certainly possible to 
push it to stable too.

			Linus
Comment 40 Roland McGrath 2010-03-04 20:50:11 UTC
> If the synchronous signals are handled on a separate queue, then the whole 
> "tgkill signal vs synchronous signal" case turns trivial: you can have 
> both pending at the same time independently. And the synchronous one will 
> automatically always happen before the pending "tgkill" one.

Correct.

> So I'm just thinking about the "synchronous fault happens while setting up 
> the stack for a previous synchronous fault", ie a nested synchronous 
> fault. That would be a reason to allow the synchronous queue to be more 
> than one deep.

Well, "nested" is an abstract notion that applies to the situation.  But as
far as the signal code is concerned, there is no multiple queuing going on
in that case.  The original synchronous signal is dequeued and no longer
pending while we set up the handler.  If there is a second fault in handler
setup, it's really no different than a normal user fault while running the
handler code.

> But we don't actually turn the nested fault into a signal anyway, we turn 
> it into a kernel put_user() exception and failure, and switch to 
> altstacks. 

I don't really know what you are referring to here.  If we get a fault
accessing the stack to set up a signal handler, we wind up doing:
	force_sig(SIGSEGV, current);

In x86, this is in arch/x86/kernel/signal.c:signal_fault.
In powerpc, it calls force_sigsegv() which does that too.

> My point was that the synchronous fault shouldn't need a deeper queue than 
> just one - so we can implement the whole "synchronous signal queue" as a 
> single integer field in the per-thread signal struct.

A single pointer to siginfo_t, anyway.  I think I said that before.

> It should be pretty simple, but I'd really like to make the thing _really_ 
> simple.  If done right, we could remove all the logic for punching through 
> blocked signals etc, because the synchronous signal handling code itself 
> would just ignore it.

Yes, that's exactly what I proposed.

> IOW, something like the appended (TOTALLY UNTESTED)

That is approximately what I had in mind.  It's not quite that simple
because we have to do the logic that reverts to SIG_DFL when blocked.
That can't really be in dequeue_signal() as it is now.  But it's not
hard to do it right.

> Btw, just to clarify: that really was a total throw-away patch. It may 
> _look_ like a real diff, and a serious patch, but it was really not meant 
> to be anything but a discussion piece. It's already gone from my working 
> tree, and I'm not saving it etc.

Understood.  I also think we should bat the ideas around for a while longer.
And again, this is something for the mailing list rather than bugzilla.

> In many ways, I like my original patch better. I hate how the latter one 
> adds new slab allocation sites for that siginfo thing etc. I was literally 
> considering adding a whole "struct siginfo" (not a pointer) into the 
> "struct task_struct" just to avoid that whole thing.

I certainly thought of that too.  But looking at the size of siginfo_t
inflating task_struct made me balk.  (And I still like having a queue.
But I'm not really at the point of making a case for that.)


Thanks,
Roland
Comment 41 Roland McGrath 2010-03-04 20:52:36 UTC
> On 03/03, Linus Torvalds wrote:
> >
> > And I do think that t always should be current, because it smells very
> > very wrong to me that anybody else could send a synchronous signal.
> 
> Agreed. I think it makes sense to audit/change all force_sig()
> callers in any case.

I agree.  I think we should replace force_sig() and force_sig_info() with
something (by whatever name, even that one) that doesn't take a task argument.
If there are any cases not just passing current, we need to figure out what
is actually right for those.

> zap_pid_ns_processes() does have a reason to use force(SIGKILL),
> we need to clear SIGNAL_UNKILLABLE, but this is solvable.

I think we might want to make a special-case function just for force_sigkill.


Thanks,
Roland
Comment 42 Roland McGrath 2010-03-04 20:57:05 UTC
I think this is a fine interim change.  I would not have bothered with a
separately-selected mask, I'd have just used SIG_KERNEL_COREDUMP_MASK.
But that doesn't much matter.


Thanks,
Roland
Comment 43 Roland McGrath 2010-03-04 21:01:29 UTC
> I didn't mark it as 'stable' material, but if this fixes real problems 
> with Wine (or other programs), then I think it's certainly possible to 
> push it to stable too.

I wouldn't do that.  This is behavior that has always been this way,
and everybody survived this long.  There has never been any guarantee
about the order of delivery of <SIGRTMIN signals.  But the reliable
fact in Linux has always been that they come in signal number order.
So it's entirely possible that somebody else's application happens to work
correctly today only because of that ordering.

Personally, I would not really bother with the interim change at all.  It's
still not a guarantee you can rely on in all cases.  I'd just leave the
status quo of the last 15+ years until we hash out the thorough solution
with ironclad behavior guarantees for machine fault signals.  I think we
can work that out soon enough.  But I certainly see the motivation for the
quick workaround to the observed problem.


Thanks,
Roland

Note You need to log in before you can comment on or make changes to this bug.