Bug 215813

Summary: syscall(SYS_vfork) causes execve() to return 0.
Product: Process Management Reporter: Alejandro Colomar (alx)
Component: OtherAssignee: process_other
Status: NEW ---    
Severity: normal CC: kernelorg, ninjalj, socketpair
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.10 Subsystem:
Regression: No Bisected commit-id:

Description Alejandro Colomar 2022-04-06 19:18:56 UTC
This was found while debugging another different bug:
<https://bugzilla.kernel.org/show_bug.cgi?id=215769>

As happens with fork(2), it is expected that vfork(2) (or equivalently syscall(SYS_vfork)) returns twice: one for the child process, and one for the parent process.  In the case of the child, it should return 0, and in the case of the parent, it should return the PID of the child.

The expected behavior happens if I call vfork() (the glibc wrapper).  However, that's not the case if I call the syscall as `syscall(SYS_vfork)`.  In that case, it returns only once, for the child, with a return value of 0, and then the parent continues execution right after execve(2), which returns 0 for the parent.

I was very careful to write a test program that reproduces this behavior without invoking Undefined Behavior (AFAIK), and I reproduced it in two different systems: Debian stable, and Devuan stable (however, both have a 5.10 kernel).


$ cat print_pid.c 
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	errx(EXIT_SUCCESS, "%jd exiting.", (long) getpid());
}



$ cat vfork.c 
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static char *const ch_argv[] = {
	"print_pid",
	NULL
};

static char *const ch_envp[] = {
	NULL
};

int
main(void)
{
	pid_t pid;

	printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());

	if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
		err(EXIT_FAILURE, "signal(2)");

	/*
	 * vfork() returns control to the parent in the same vfork() call,
	 * as happens with fork(), or syscall(SYS_fork).
	 * But syscall(SYS_vfork) behaves differently, returning control to
	 * the parent right after execve(), as if it returned successfully.
	 */
	pid = syscall(SYS_vfork);
	//pid = vfork();
	//pid = syscall(SYS_fork);
	//pid = fork();

	switch (pid) {
	case 0:
		/*
		 * This is the child.  If execve(2) fails, it can only call
		 * _exit(2).  If execve(2) succeeds, it's a noreturn, so
		 * control flow should never fall through.
		 *
		 * syscall(SYS_execve, ...) and execve(...) behave the same.
		 */
		if (execve("/home/alx/tmp/print_pid", ch_argv, ch_envp) == -1)
			_exit(EXIT_FAILURE);

		/*
		 * This is again the parent, since the child either succeded,
		 * or it failed and exited.  execve(2) cannot return anything
		 * other than -1, so this is impossible... yet it happens when
		 * we call syscall(SYS_vfork).
		 */
		sleep(1);
		err(EXIT_SUCCESS, "%jd exiting after execve(2)", (long) getpid());
	case -1:
		err(EXIT_FAILURE, "vfork(2)");
	default:
		/*
		 * This is what I'd expect to happen.  vfork() brings us here,
		 * but syscall(SYS_vfork) doesn't!
		 */
		sleep(1);
		errx(EXIT_SUCCESS, "%ld exiting after vfork(2).  Child is %ld\n",
		     (long) getpid(), (long) pid);
	}
}


$ cc -Wall -Wextra -Werror -o print_pid print_pid.c
$ cc -Wall -Wextra -Werror -o vfork     vfork.c
$ ./vfork 
vfork: PID: 5306
print_pid: 5307 exiting.
vfork: 5306 exiting after execve(2): Success



Since the glibc wrappers are written in assembly in <sysdeps/unix/sysv/linux/x86_64/vfork.S> and <sysdeps/unix/sysv/linux/x86_64/syscall.S>, I can't read them enough to understand why there's a difference.  Some guesses are that syscall(2) may be invoking Undefined Behavior in some way that the vfork(2) wrapper doesn't invoke, or maybe the kernel simply has a bug.

What is clear to me is that execve(2) returning 0 is _wrong_ per its own manual page.
Comment 1 Коренберг Марк 2022-04-06 19:31:38 UTC
getpid() is cache in glibc. So "This is again the parent" is WRONG. it's not. Please use syscall.

Also, using stack and/or heap in vfork child is heavily UB. Please refactor. Seems PARENT process after your UB memory changes went to the children code branch.
Comment 2 Alejandro Colomar 2022-04-06 19:32:03 UTC
Hi Florian,

On 4/6/22 21:26, Florian Weimer wrote:
> It's not clear if this is valid.  The syscall function in glibc does not
> protect the on-stack return address against overwriting, so it can't be
> used to call SYS_vfork on x86.
> 
> Can you reproduce this with a true inline syscall, or the glibc vfork
> function (which protects the return address)?

If you tell me how I can call a syscall without the libc wrapper or 
syscall(2), sure, I can try :)

If syscall(2) can't be used for certain syscalls, maybe we should 
document that.

Thanks,

Alex
Comment 3 Коренберг Марк 2022-04-06 19:35:03 UTC
Also, please compile -O3 in order not to use stack in child. child and parent share the stack!
Comment 4 Alejandro Colomar 2022-04-06 19:37:27 UTC
Hi Коренберг Марк,

On 4/6/22 21:31, bugzilla-daemon@kernel.org wrote:
> --- Comment #1 from Коренберг Марк (socketpair@gmail.com) ---
> getpid() is cache in glibc. So "This is again the parent" is WRONG. it's not.
> Please use syscall.
The was fixed recently.  See getpid(2):

    C library/kernel differences
        From glibc version 2.3.4  up  to  and  including  version
        2.24,  the  glibc  wrapper  function  for getpid() cached
        PIDs, with the goal of avoiding additional  system  calls
        when  a process calls getpid() repeatedly.  Normally this
        caching was invisible, but its correct  operation  relied
        on   support   in  the  wrapper  functions  for  fork(2),
        vfork(2), and clone(2): if an  application  bypassed  the
        glibc   wrappers   for   these   system  calls  by  using
        syscall(2), then a call to getpid() in  the  child  would
        return  the  wrong  value (to be precise: it would return
        the PID of the parent process).  In addition, there  were
        cases  where  getpid()  could return the wrong value even
        when invoking clone(2) via the  glibc  wrapper  function.
        (For   a  discussion  of  one  such  case,  see  BUGS  in
        clone(2).)  Furthermore, the complexity  of  the  caching
        code  had been the source of a few bugs within glibc over
        the years.

        Because of the aforementioned problems, since glibc  ver-
        sion  2.25,  the  PID cache is removed: calls to getpid()
        always invoke the actual system call, rather than return-
        ing a cached value.

I'm using glibc 2.31, so I'm safe there.

Cheers,

Alex
Comment 5 Alejandro Colomar 2022-04-06 19:44:48 UTC
On 4/6/22 21:31, bugzilla-daemon@kernel.org wrote:
 > --- Comment #1 from Коренберг Марк (socketpair@gmail.com) ---
 > Also, using stack and/or heap in vfork child is heavily UB. Please 
refactor.
 > Seems PARENT process after your UB memory changes went to the 
children code
 > branch.
 >

I don't see where I'm invoking UB, if I am.  I can imagine that 
syscall(SYS_vfork) may be UB by itself (would like to see that 
documented if so), but other than that, I don't see UB in my own code.

On 4/6/22 21:35, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215813
> 
> --- Comment #3 from Коренберг Марк (socketpair@gmail.com) ---
> Also, please compile -O3 in order not to use stack in child. child and parent
> share the stack!
> 

Wow, this is getting weird!  If I can't compile a program using vfork() 
without -O3, I really want that to be documented in BUGS (or NOTES).  Or 
gcc(1) should produce the right code for my program, even without 
specifying optimization flags, as long as I don't invoke UB myself.


Anyway, I can still reproduce it with -O3:

$ cc -Wall -Wextra -Werror -O3 -o print_pid print_pid.c
$ cc -Wall -Wextra -Werror -O3 -o vfork     vfork.c
$ ./vfork
vfork: PID: 6220
print_pid: PID 6221 exiting.
vfork: 6220 exiting after execve(2): Success


Cheers,

Alex
Comment 6 Коренберг Марк 2022-04-06 19:50:31 UTC
Dump of assembler code for function syscall:
=> 0x00007ffff7ed9dc0 <+0>:	endbr64 
   0x00007ffff7ed9dc4 <+4>:	mov    %rdi,%rax
   0x00007ffff7ed9dc7 <+7>:	mov    %rsi,%rdi
   0x00007ffff7ed9dca <+10>:	mov    %rdx,%rsi
   0x00007ffff7ed9dcd <+13>:	mov    %rcx,%rdx
   0x00007ffff7ed9dd0 <+16>:	mov    %r8,%r10
   0x00007ffff7ed9dd3 <+19>:	mov    %r9,%r8
   0x00007ffff7ed9dd6 <+22>:	mov    0x8(%rsp),%r9
   0x00007ffff7ed9ddb <+27>:	syscall 


Actually. you CALL libc function. using assembler mnemonic call! it meanspushing return value to the stack. stack is shared. TADAAA
Comment 7 Коренберг Марк 2022-04-06 20:20:46 UTC
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

#define INTERNAL_SYSCALL1(name, a1)                             \
  ({                                                            \
    unsigned long resultvar;                                    \
    long int __arg1 = (long) (a1);                              \
    register long int _a1 __asm ("rdi") = __arg1;               \
    __asm volatile (                                            \
    "syscall\n\t"                                               \
    : "=a" (resultvar)                                          \
    : "0" (name) , "r" (_a1) : "memory", "cc", "r11", "cx");    \
    (long) resultvar; })

#define INTERNAL_SYSCALL3(name, a1, a2, a3)                     \
  ({                                                            \
    unsigned long resultvar;                                    \
    long int __arg1 = (long) (a1);                              \
    long int __arg2 = (long) (a2);                              \
    long int __arg3 = (long) (a3);                              \
    register long int _a1 __asm ("rdi") = __arg1;               \
    register long int _a2 __asm ("rsi") = __arg2;               \
    register long int _a3 __asm ("rdx") = __arg3;               \
    __asm volatile (                                            \
    "syscall\n\t"                                               \
    : "=a" (resultvar)                                          \
    : "0" (name) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory", "cc", "r11", "cx");      \
    (long) resultvar; })

static char *const ch_argv[] = {
    "sh", "-c", "echo \"child: my pid is $$\"",
    NULL
};

static char *const ch_envp[] = {
    NULL
};


int main(void)
{
    pid_t pid;

    printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());

    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        err(EXIT_FAILURE, "signal(2)");

    pid = vfork(); //syscall(SYS_vfork);
    switch (pid) {
    case 0:
        INTERNAL_SYSCALL3(SYS_write, 1, "calling execve\n", strlen("calling execve\n"));
        if (INTERNAL_SYSCALL3 (SYS_execve, (uint64_t)"/bin/sh", (uint64_t)ch_argv, (uint64_t)ch_envp) == -1)
            _exit(EXIT_FAILURE);
        INTERNAL_SYSCALL3(SYS_write, 1, "exitting after execve\n", strlen("exitting after execve\n"));
        INTERNAL_SYSCALL1(SYS_exit_group, 0);
        _exit(42); // tell compiler that this is noreturn branch
    case -1:
        err(EXIT_FAILURE, "vfork(2)");
    default:
        sleep(1);
        errx(EXIT_SUCCESS, "%ld exiting after vfork(2).  Child is %ld\n",
             (long) getpid(), (long) pid);
    }
}



please try this. No return actually and no bugs.
Comment 8 Коренберг Марк 2022-04-06 20:39:35 UTC
https://bugs.python.org/issue47245  I scared!
Comment 9 Коренберг Марк 2022-04-06 20:41:43 UTC
https://github.com/sunzc/PLUM_OS/blob/4f19d4c9fd788984356855ecab501f47b9ad8f7e/include/syscall.h

I have got INTERNAL_SYSCALL definitions here. And compiled -O3 --save-temps, looking to generated assembly. It looks like the whole child part must be coded in assembly in order to work with any compiler optimization options.
Comment 10 Alejandro Colomar 2022-04-06 20:42:35 UTC
On 4/6/22 22:20, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215813
> 
> --- Comment #7 from Коренберг Марк (socketpair@gmail.com) ---
[...]
> please try this. No return actually and no bugs.
> 

I tried a similar one that also has no bugs (see below), so the UB seems 
to be calling syscall(SYS_vfork) and only there.  We should document 
that in vfork(2) in NOTES.


On 4/6/22 22:39, bugzilla-daemon@kernel.org wrote:
 > --- Comment #8 from Коренберг Марк (socketpair@gmail.com) ---
 > https://bugs.python.org/issue47245  I scared!
 >

Sorry :)


Thanks,

Alex

---

$ cat vfork_internal_syscall.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

#define INTERNAL_SYSCALL0(name)                                 \
({                                                              \
	unsigned long resultvar;                                \
                                                                 \
	__asm volatile (                                        \
		"syscall\n\t"                                   \
		: "=a" (resultvar)                              \
		: "0" (name) : "memory", "cc", "r11", "cx");    \
                                                                 \
	(long) resultvar;                                       \
})


static char *const ch_argv[] = {
     "sh", "-c", "echo \"child: my pid is $$\"",
     NULL
};

static char *const ch_envp[] = {
     NULL
};


int
main(void)
{
	pid_t pid;

	printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());

	if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
		err(EXIT_FAILURE, "signal(2)");

	pid = INTERNAL_SYSCALL0(SYS_vfork); //syscall(SYS_vfork);

	switch (pid) {
	case 0:

		if (execve("/home/alx/tmp/print_pid", ch_argv, ch_envp) == -1)
			_exit(EXIT_FAILURE);

		sleep(1);
		err(EXIT_SUCCESS, "%jd exiting after execve(2)", (long) getpid());
	case -1:
		err(EXIT_FAILURE, "vfork(2)");
	default:
		sleep(1);
		errx(EXIT_SUCCESS, "%ld exiting after vfork(2).  Child is %ld\n",
		     (long) getpid(), (long) pid);
	}
}


$ cc -Wall -Wextra -Werror vfork_internal_syscall.c
$ ./a.out
a.out: PID: 7087
sh: PID 7088 exiting.
a.out: 7087 exiting after vfork(2).  Child is 7088
Comment 11 Коренберг Марк 2022-04-06 20:57:23 UTC
No, your test is wrong.

if (execve(.....

Actually it touches the stack. It's not allowed. Yes, for some unknown reason it does not reproduced. Possibly some races. For example, calling vfork() as a glibc function means pushing the return address to the stack. You have changed vfork() call in a way that does not touch the stack in parent process, seems this is a cause why the bug is not reproduced.

Please stop. Look to assembly. I would close this thread and this bug in bugzilla as "can not reproduce".
Comment 12 Коренберг Марк 2022-04-06 21:04:02 UTC
https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/spawni.c


See what they do in order to overcome the problem.
Comment 13 Alejandro Colomar 2022-04-06 21:29:21 UTC
On 4/6/22 22:57, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215813
> 
> --- Comment #11 from Коренберг Марк (socketpair@gmail.com) ---
> No, your test is wrong.
> 
> if (execve(.....
> 
> Actually it touches the stack. It's not allowed. Yes, for some unknown reason
> it does not reproduced. Possibly some races. For example, calling vfork() as
> a
> glibc function means pushing the return address to the stack. You have
> changed
> vfork() call in a way that does not touch the stack in parent process, seems
> this is a cause why the bug is not reproduced.
> 
> Please stop. Look to assembly. I would close this thread and this bug in
> bugzilla as "can not reproduce".

can not reproduce?  I can actually reproduce UB in syscall(SYS_vfork) in 
every system I have.

vfork(2) says:

        vfork()  differs  from fork(2) in that the calling thread
        is suspended until the child terminates (either normally,
        by  calling  _exit(2), or abnormally, after delivery of a
        fatal signal), or it makes a call  to  execve(2).   Until
        that  point, the child shares all memory with its parent,
        including the stack.  The child must not return from  the
        current  function  or  call exit(3) (which would have the
        effect of calling exit handlers established by the parent
        process  and flushing the parent's stdio(3) buffers), but
        may call _exit(2).


Sorry, but I don't understand assembly very much.  Glibc code is also 
difficult for me to read.  But yes, I can see that I'm issuing a call to 
execve(2), and that may produce bugs if the vfork(2) and execve(2) 
wrappers are so wrong that they cannot be actually used.  However, I 
expect that someone would have realized before if that was actually the 
case, and would have documented it in the manual page.

Anyway, the manual page is missing information (at least it should note 
that syscall(SYS_vfork) is wrong), so I'm not closing this thread yet, 
so that Florian and Christian (they are CCd in my email) who know their 
code much better than I do, may propose some change to the manual page 
that makes clear how these functions can and cannot be used.

Thanks,

Alex
Comment 14 Luis Javier Merino Morán 2022-04-07 11:06:30 UTC
There are at least two reasons to call vfork() through the libc vfork() wrapper, and not via syscall(SYS_vfork) or even INTERNAL_SYSCALL:


 - The libc wrapper takes care of properly storing and restoring the return address, which could otherwise be clobbered by e.g. a later call to execve() in the child, as shown graphically in  https://www.mail-archive.com/uclinux-dev@uclinux.org/msg01290.html
 - The compiler may have special handling for functions calling vfork() (and setjmp(), and functions marked with attribute returns_twice, ...). E.g. last I checked (https://stackoverflow.com/questions/9347928/return-value-in-vfork-system-call#comment11829359_9347975 which, admittedly, was a decade ago), GCC disallowed tail call optimization,  global common subexpresion elimination and jump bypassing for functions calling vfork() or setjmp().


Re: comment 11:

> > if (execve(.....
> Actually it touches the stack. It's not allowed

Note the phrasing:

> or  calls  any  other  function  before  successfully  calling
>     _exit(2) or one of the exec(3) family of functions.

vfork() is implemented so as to work correctly with a later call to the execve() function.
Comment 15 Коренберг Марк 2022-04-08 10:27:09 UTC
Alejandro Colomar,

Finally I found.

Yes, It's SAFE to call syscall(SYS_VFORK). If you vforked() using this way, it's not allowed to modify the stack.

But if you vfork() using glibc function vfork(), the library do nasty hacks in order to allow you to touch stack - I mean push something. After returning from syscall in parent process, glibc restores stack differently than usual.

So, no bugs neither in kernel, nor in glibc. The bug in documentation. It should be explained:

1. man vfork: Using syscall(SYS_vfork) [or calling this syscall in assembly] is very dangerous. Yes, it's allowed if child do not touch stack at all. For example, in child branch it's not allowed in general to use local variables and also, calling other function is also disallowed. Especially calling glibc's  _exit() of _execve(). Even calling syscall(SYS_exit) is not allowed, since syscall() is a usual glibc function. But you can call syscalls in assembly if you have some reason to do so. In short - please use glibc wrapper instead.

2. man syscall: this could be implemented as a usual function, i.e. not a macro. In some picky contexts (for example, in the child process after direct calling syscall(SYS_vfork) rather than vfork()) it's not allowed to call any function at all, so be warned.