Bug 16061

Summary: single stepping in a signal handler can cause the single step flag to get "stuck"
Product: Other Reporter: Evan Teran (eteran)
Component: OtherAssignee: Oleg Nesterov (oleg)
Status: RESOLVED OBSOLETE    
Severity: normal CC: akpm, alan, b7.10110111, roland
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 2.6.32 (gentoo patchset r7) Subsystem:
Regression: No Bisected commit-id:
Attachments: [patch] the quick hack for illustration

Description Evan Teran 2010-05-27 19:26:14 UTC
I think I have found a bug. It is fairly harmless but non-the-less, not the expected behavior. I have only done tests on x86_64.

First, there is one assumption I am making, so if this is incorrect, then this bug is invalid. From my experience, doing a ptrace with PTRACE_SINGLESTEP sets a flag in rflags which causes the processor to execute a single instruction, **the flag is then cleared**, finally a trap event is delivered. So far it has been my experience, that the flag is supposed to be cleared, so I am running on that assumption.

Given a program which looks like this (any signal handling code will do, but it's easier with an example).

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

void handler(int n) {
    puts("alarm");
    alarm(5);
}

int main() {
    signal(SIGALRM, handler);
    alarm(5);
    while(1) {}
    return 0;
}

If we run this with ptrace, and do a PTRACE_CONT at the start, we will get a stop event inside the while(1) {}. If we then do a PTRACE_SINGLESTEP passing the  signal to the debugee, we end up stopped at the first instruction of the signal handler.

At this point, all is working as expected.

If we then do a PTRACE_SINGLESTEP, the debugee single steps as expected, **but the single step flag remains set in the eflags!**. if we follow this up with a PTRACE_CONT, we still get single step events.

This only happens if you single step the very first instruction of the signal handler. If we set a breakpoint in the middle of the signal handler and step from there, the flag is unset as expected.

I originally encountered this in my personal debugger, but here is a log of a GDB session demonstrating the issue as well:


$ gdb ./test

warning: Can not parse XML syscalls information; XML support was disabled at compile time.
GNU gdb (Gentoo 7.0.1 p1) 7.0.1
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.gentoo.org/>...
Reading symbols from /####/test...(no debugging symbols found)...done.
(gdb) handle SIGALRM stop
Signal        Stop      Print   Pass to program Description
SIGALRM       Yes       Yes     Yes             Alarm clock
(gdb) run
Starting program: /####/test 

Program received signal SIGALRM, Alarm clock.
0x00007ffff7b216e0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
Current language:  auto
The current source language is "auto; currently asm".
(gdb) stepi
0x0000000000400614 in handler ()
(gdb) disassemble 
Dump of assembler code for function handler:
0x0000000000400614 <handler+0>: push   %rbp
0x0000000000400615 <handler+1>: mov    %rsp,%rbp
0x0000000000400618 <handler+4>: sub    $0x10,%rsp
0x000000000040061c <handler+8>: mov    %edi,-0x4(%rbp)
0x000000000040061f <handler+11>:        mov    $0x40074c,%edi
0x0000000000400624 <handler+16>:        callq  0x4004e0 <puts@plt>
0x0000000000400629 <handler+21>:        mov    $0x5,%edi
0x000000000040062e <handler+26>:        callq  0x400520 <alarm@plt>
0x0000000000400633 <handler+31>:        leaveq 
0x0000000000400634 <handler+32>:        retq   
End of assembler dump.
(gdb) info registers 
rax            0x0      0
rbx            0x7fffffffda60   140737488345696
rcx            0xffffffffffffffff       -1
rdx            0x7fffffffd4c0   140737488344256
rsi            0x7fffffffd5f0   140737488344560
rdi            0xe      14
rbp            0xffffffff       0xffffffff
rsp            0x7fffffffd4b8   0x7fffffffd4b8
r8             0x7fffffffd940   140737488345408
r9             0x7ffff7decb20   140737351961376
r10            0x8      8
r11            0x246    582
r12            0x7fffffffd9e0   140737488345568
r13            0x7fffffffdc00   140737488346112
r14            0x0      0
r15            0x0      0
rip            0x400614 0x400614 <handler>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x37f    895
fstat          0x0      0
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1f80   [ IM DM ZM OM UM PM ]
(gdb) # as you can see TF is not set here
(gdb) stepi
0x0000000000400615 in handler ()
(gdb) info registers 
rax            0x0      0
rbx            0x7fffffffda60   140737488345696
rcx            0xffffffffffffffff       -1
rdx            0x7fffffffd4c0   140737488344256
rsi            0x7fffffffd5f0   140737488344560
rdi            0xe      14
rbp            0xffffffff       0xffffffff
rsp            0x7fffffffd4b0   0x7fffffffd4b0
r8             0x7fffffffd940   140737488345408
r9             0x7ffff7decb20   140737351961376
r10            0x8      8
r11            0x246    582
r12            0x7fffffffd9e0   140737488345568
r13            0x7fffffffdc00   140737488346112
r14            0x0      0
r15            0x0      0
rip            0x400615 0x400615 <handler+1>
eflags         0x346    [ PF ZF TF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x37f    895
fstat          0x0      0
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1f80   [ IM DM ZM OM UM PM ]
(gdb) # suddenly TF is set, breaking continue
(gdb) cont
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400618 in handler ()
(gdb) # it should run until the next SIGARLM at the same place as the first, but we got a single step event instead
(gdb) quit
Comment 1 Oleg Nesterov 2010-05-28 12:40:11 UTC
Agreed, this looks like a bug to me. Perhaps TIF_FORCED_TF logic.
I'll try to investigate on Monday.

Thanks Evan!
Comment 2 Oleg Nesterov 2010-05-28 13:26:23 UTC
Agreed, this looks like a bug to me. Perhaps TIF_FORCED_TF logic.
I'll try to investigate on Monday.

Thanks Evan!
Comment 3 Oleg Nesterov 2010-05-30 19:54:54 UTC
Created attachment 26587 [details]
[patch] the quick hack for illustration

This patch fixes the problem, but I doubt it is really right.
I'll try to think more.

So:

- the tracee reports a signal SIG

- the tracer does ptrace(SINGLESTEP, SIG), this approves the signal
  and also sets TF

- the tracee dequeues the signal, changes its IP to sig_handler().

  then it notices TIF_SINGLESTEP and notifies the tracer without
  return to user-space _and_ without clearing TF or TIF_SINGLESTEP

- the tracer does ptrace(SINGLESTEP) again, but now enable_single_step()
  looses TIF_FORCED_TF.

- the tracer does ptrace(CONT), but user_disable_single_step() doesn't
  clear TF since TIF_FORCED_TF is not set

- the tracee returns to user-space with X86_EFLAGS_TF in eflags

I am not sure what would be the "right" fix. I read enable_single_step()
billion times, and I still can't say I understand the comments/logic.

But perhaps this patch makes sense anyway?
Comment 4 Oleg Nesterov 2010-05-30 21:23:40 UTC
Damn, I was in a hurry, the description is not accurate...

(In reply to comment #3)
>
>   then it notices TIF_SINGLESTEP and notifies the tracer without
>   return to user-space _and_ without clearing TF or TIF_SINGLESTEP

... _and_ without clearing TIF_SINGLESTEP, but TF is cleared
(right before setup_rt_frame())
 
> - the tracer does ptrace(SINGLESTEP) again, but now enable_single_step()
>   looses TIF_FORCED_TF.

and that is why enable_single_step() doesn't set FORCED.
 
> But perhaps this patch makes sense anyway?

As for x86, we can just change handle_signal to use
test_and_clear(TIF_SINGLESTEP) as an argument to the tracehook,
but then we probably need fixes for other machines.

Roland, what do you think?
Comment 5 Roland McGrath 2010-06-01 20:17:25 UTC
Oleg, please get an appropriate test case for this into the ptrace-tests suite.

That change might be the right one, but we should discuss it more in email, and look at the situation on other machines.
Comment 6 Ruslan 2015-09-18 11:11:06 UTC
Why is this marked OBSOLETE? Has this been fixed in any version? I still reproduce it in 3.16.0-43-generic from Kubuntu 14.04.
Comment 7 Oleg Nesterov 2015-09-18 11:58:13 UTC
> --- Comment #6 from Ruslan <b7.10110111@gmail.com> ---
> Why is this marked OBSOLETE? Has this been fixed in any version? I still
> reproduce it in 3.16.0-43-generic from Kubuntu 14.04.

This was hopefully fixed by
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fd0f86b66425bd8c6af8985881e82b28c30fd450

Oleg.