This was found while debugging another different bug: <https://bugzilla.kernel.org/show_bug.cgi?id=215769> As happens with fork(2), it is expected that vfork(2) (or equivalently syscall(SYS_vfork)) returns twice: one for the child process, and one for the parent process. In the case of the child, it should return 0, and in the case of the parent, it should return the PID of the child. The expected behavior happens if I call vfork() (the glibc wrapper). However, that's not the case if I call the syscall as `syscall(SYS_vfork)`. In that case, it returns only once, for the child, with a return value of 0, and then the parent continues execution right after execve(2), which returns 0 for the parent. I was very careful to write a test program that reproduces this behavior without invoking Undefined Behavior (AFAIK), and I reproduced it in two different systems: Debian stable, and Devuan stable (however, both have a 5.10 kernel). $ cat print_pid.c #include <err.h> #include <stdlib.h> #include <unistd.h> int main(void) { errx(EXIT_SUCCESS, "%jd exiting.", (long) getpid()); } $ cat vfork.c #define _GNU_SOURCE #include <err.h> #include <errno.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/syscall.h> #include <unistd.h> static char *const ch_argv[] = { "print_pid", NULL }; static char *const ch_envp[] = { NULL }; int main(void) { pid_t pid; printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid()); if (signal(SIGCHLD, SIG_IGN) == SIG_ERR) err(EXIT_FAILURE, "signal(2)"); /* * vfork() returns control to the parent in the same vfork() call, * as happens with fork(), or syscall(SYS_fork). * But syscall(SYS_vfork) behaves differently, returning control to * the parent right after execve(), as if it returned successfully. */ pid = syscall(SYS_vfork); //pid = vfork(); //pid = syscall(SYS_fork); //pid = fork(); switch (pid) { case 0: /* * This is the child. If execve(2) fails, it can only call * _exit(2). If execve(2) succeeds, it's a noreturn, so * control flow should never fall through. * * syscall(SYS_execve, ...) and execve(...) behave the same. */ if (execve("/home/alx/tmp/print_pid", ch_argv, ch_envp) == -1) _exit(EXIT_FAILURE); /* * This is again the parent, since the child either succeded, * or it failed and exited. execve(2) cannot return anything * other than -1, so this is impossible... yet it happens when * we call syscall(SYS_vfork). */ sleep(1); err(EXIT_SUCCESS, "%jd exiting after execve(2)", (long) getpid()); case -1: err(EXIT_FAILURE, "vfork(2)"); default: /* * This is what I'd expect to happen. vfork() brings us here, * but syscall(SYS_vfork) doesn't! */ sleep(1); errx(EXIT_SUCCESS, "%ld exiting after vfork(2). Child is %ld\n", (long) getpid(), (long) pid); } } $ cc -Wall -Wextra -Werror -o print_pid print_pid.c $ cc -Wall -Wextra -Werror -o vfork vfork.c $ ./vfork vfork: PID: 5306 print_pid: 5307 exiting. vfork: 5306 exiting after execve(2): Success Since the glibc wrappers are written in assembly in <sysdeps/unix/sysv/linux/x86_64/vfork.S> and <sysdeps/unix/sysv/linux/x86_64/syscall.S>, I can't read them enough to understand why there's a difference. Some guesses are that syscall(2) may be invoking Undefined Behavior in some way that the vfork(2) wrapper doesn't invoke, or maybe the kernel simply has a bug. What is clear to me is that execve(2) returning 0 is _wrong_ per its own manual page.
getpid() is cache in glibc. So "This is again the parent" is WRONG. it's not. Please use syscall. Also, using stack and/or heap in vfork child is heavily UB. Please refactor. Seems PARENT process after your UB memory changes went to the children code branch.
Hi Florian, On 4/6/22 21:26, Florian Weimer wrote: > It's not clear if this is valid. The syscall function in glibc does not > protect the on-stack return address against overwriting, so it can't be > used to call SYS_vfork on x86. > > Can you reproduce this with a true inline syscall, or the glibc vfork > function (which protects the return address)? If you tell me how I can call a syscall without the libc wrapper or syscall(2), sure, I can try :) If syscall(2) can't be used for certain syscalls, maybe we should document that. Thanks, Alex
Also, please compile -O3 in order not to use stack in child. child and parent share the stack!
Hi Коренберг Марк, On 4/6/22 21:31, bugzilla-daemon@kernel.org wrote: > --- Comment #1 from Коренберг Марк (socketpair@gmail.com) --- > getpid() is cache in glibc. So "This is again the parent" is WRONG. it's not. > Please use syscall. The was fixed recently. See getpid(2): C library/kernel differences From glibc version 2.3.4 up to and including version 2.24, the glibc wrapper function for getpid() cached PIDs, with the goal of avoiding additional system calls when a process calls getpid() repeatedly. Normally this caching was invisible, but its correct operation relied on support in the wrapper functions for fork(2), vfork(2), and clone(2): if an application bypassed the glibc wrappers for these system calls by using syscall(2), then a call to getpid() in the child would return the wrong value (to be precise: it would return the PID of the parent process). In addition, there were cases where getpid() could return the wrong value even when invoking clone(2) via the glibc wrapper function. (For a discussion of one such case, see BUGS in clone(2).) Furthermore, the complexity of the caching code had been the source of a few bugs within glibc over the years. Because of the aforementioned problems, since glibc ver- sion 2.25, the PID cache is removed: calls to getpid() always invoke the actual system call, rather than return- ing a cached value. I'm using glibc 2.31, so I'm safe there. Cheers, Alex
On 4/6/22 21:31, bugzilla-daemon@kernel.org wrote: > --- Comment #1 from Коренберг Марк (socketpair@gmail.com) --- > Also, using stack and/or heap in vfork child is heavily UB. Please refactor. > Seems PARENT process after your UB memory changes went to the children code > branch. > I don't see where I'm invoking UB, if I am. I can imagine that syscall(SYS_vfork) may be UB by itself (would like to see that documented if so), but other than that, I don't see UB in my own code. On 4/6/22 21:35, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215813 > > --- Comment #3 from Коренберг Марк (socketpair@gmail.com) --- > Also, please compile -O3 in order not to use stack in child. child and parent > share the stack! > Wow, this is getting weird! If I can't compile a program using vfork() without -O3, I really want that to be documented in BUGS (or NOTES). Or gcc(1) should produce the right code for my program, even without specifying optimization flags, as long as I don't invoke UB myself. Anyway, I can still reproduce it with -O3: $ cc -Wall -Wextra -Werror -O3 -o print_pid print_pid.c $ cc -Wall -Wextra -Werror -O3 -o vfork vfork.c $ ./vfork vfork: PID: 6220 print_pid: PID 6221 exiting. vfork: 6220 exiting after execve(2): Success Cheers, Alex
Dump of assembler code for function syscall: => 0x00007ffff7ed9dc0 <+0>: endbr64 0x00007ffff7ed9dc4 <+4>: mov %rdi,%rax 0x00007ffff7ed9dc7 <+7>: mov %rsi,%rdi 0x00007ffff7ed9dca <+10>: mov %rdx,%rsi 0x00007ffff7ed9dcd <+13>: mov %rcx,%rdx 0x00007ffff7ed9dd0 <+16>: mov %r8,%r10 0x00007ffff7ed9dd3 <+19>: mov %r9,%r8 0x00007ffff7ed9dd6 <+22>: mov 0x8(%rsp),%r9 0x00007ffff7ed9ddb <+27>: syscall Actually. you CALL libc function. using assembler mnemonic call! it meanspushing return value to the stack. stack is shared. TADAAA
#define _GNU_SOURCE #include <err.h> #include <errno.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/syscall.h> #include <unistd.h> #include <stdint.h> #include <string.h> #define INTERNAL_SYSCALL1(name, a1) \ ({ \ unsigned long resultvar; \ long int __arg1 = (long) (a1); \ register long int _a1 __asm ("rdi") = __arg1; \ __asm volatile ( \ "syscall\n\t" \ : "=a" (resultvar) \ : "0" (name) , "r" (_a1) : "memory", "cc", "r11", "cx"); \ (long) resultvar; }) #define INTERNAL_SYSCALL3(name, a1, a2, a3) \ ({ \ unsigned long resultvar; \ long int __arg1 = (long) (a1); \ long int __arg2 = (long) (a2); \ long int __arg3 = (long) (a3); \ register long int _a1 __asm ("rdi") = __arg1; \ register long int _a2 __asm ("rsi") = __arg2; \ register long int _a3 __asm ("rdx") = __arg3; \ __asm volatile ( \ "syscall\n\t" \ : "=a" (resultvar) \ : "0" (name) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory", "cc", "r11", "cx"); \ (long) resultvar; }) static char *const ch_argv[] = { "sh", "-c", "echo \"child: my pid is $$\"", NULL }; static char *const ch_envp[] = { NULL }; int main(void) { pid_t pid; printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid()); if (signal(SIGCHLD, SIG_IGN) == SIG_ERR) err(EXIT_FAILURE, "signal(2)"); pid = vfork(); //syscall(SYS_vfork); switch (pid) { case 0: INTERNAL_SYSCALL3(SYS_write, 1, "calling execve\n", strlen("calling execve\n")); if (INTERNAL_SYSCALL3 (SYS_execve, (uint64_t)"/bin/sh", (uint64_t)ch_argv, (uint64_t)ch_envp) == -1) _exit(EXIT_FAILURE); INTERNAL_SYSCALL3(SYS_write, 1, "exitting after execve\n", strlen("exitting after execve\n")); INTERNAL_SYSCALL1(SYS_exit_group, 0); _exit(42); // tell compiler that this is noreturn branch case -1: err(EXIT_FAILURE, "vfork(2)"); default: sleep(1); errx(EXIT_SUCCESS, "%ld exiting after vfork(2). Child is %ld\n", (long) getpid(), (long) pid); } } please try this. No return actually and no bugs.
https://bugs.python.org/issue47245 I scared!
https://github.com/sunzc/PLUM_OS/blob/4f19d4c9fd788984356855ecab501f47b9ad8f7e/include/syscall.h I have got INTERNAL_SYSCALL definitions here. And compiled -O3 --save-temps, looking to generated assembly. It looks like the whole child part must be coded in assembly in order to work with any compiler optimization options.
On 4/6/22 22:20, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215813 > > --- Comment #7 from Коренберг Марк (socketpair@gmail.com) --- [...] > please try this. No return actually and no bugs. > I tried a similar one that also has no bugs (see below), so the UB seems to be calling syscall(SYS_vfork) and only there. We should document that in vfork(2) in NOTES. On 4/6/22 22:39, bugzilla-daemon@kernel.org wrote: > --- Comment #8 from Коренберг Марк (socketpair@gmail.com) --- > https://bugs.python.org/issue47245 I scared! > Sorry :) Thanks, Alex --- $ cat vfork_internal_syscall.c #define _GNU_SOURCE #include <err.h> #include <errno.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/syscall.h> #include <unistd.h> #include <stdint.h> #include <string.h> #define INTERNAL_SYSCALL0(name) \ ({ \ unsigned long resultvar; \ \ __asm volatile ( \ "syscall\n\t" \ : "=a" (resultvar) \ : "0" (name) : "memory", "cc", "r11", "cx"); \ \ (long) resultvar; \ }) static char *const ch_argv[] = { "sh", "-c", "echo \"child: my pid is $$\"", NULL }; static char *const ch_envp[] = { NULL }; int main(void) { pid_t pid; printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid()); if (signal(SIGCHLD, SIG_IGN) == SIG_ERR) err(EXIT_FAILURE, "signal(2)"); pid = INTERNAL_SYSCALL0(SYS_vfork); //syscall(SYS_vfork); switch (pid) { case 0: if (execve("/home/alx/tmp/print_pid", ch_argv, ch_envp) == -1) _exit(EXIT_FAILURE); sleep(1); err(EXIT_SUCCESS, "%jd exiting after execve(2)", (long) getpid()); case -1: err(EXIT_FAILURE, "vfork(2)"); default: sleep(1); errx(EXIT_SUCCESS, "%ld exiting after vfork(2). Child is %ld\n", (long) getpid(), (long) pid); } } $ cc -Wall -Wextra -Werror vfork_internal_syscall.c $ ./a.out a.out: PID: 7087 sh: PID 7088 exiting. a.out: 7087 exiting after vfork(2). Child is 7088
No, your test is wrong. if (execve(..... Actually it touches the stack. It's not allowed. Yes, for some unknown reason it does not reproduced. Possibly some races. For example, calling vfork() as a glibc function means pushing the return address to the stack. You have changed vfork() call in a way that does not touch the stack in parent process, seems this is a cause why the bug is not reproduced. Please stop. Look to assembly. I would close this thread and this bug in bugzilla as "can not reproduce".
https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/spawni.c See what they do in order to overcome the problem.
On 4/6/22 22:57, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215813 > > --- Comment #11 from Коренберг Марк (socketpair@gmail.com) --- > No, your test is wrong. > > if (execve(..... > > Actually it touches the stack. It's not allowed. Yes, for some unknown reason > it does not reproduced. Possibly some races. For example, calling vfork() as > a > glibc function means pushing the return address to the stack. You have > changed > vfork() call in a way that does not touch the stack in parent process, seems > this is a cause why the bug is not reproduced. > > Please stop. Look to assembly. I would close this thread and this bug in > bugzilla as "can not reproduce". can not reproduce? I can actually reproduce UB in syscall(SYS_vfork) in every system I have. vfork(2) says: vfork() differs from fork(2) in that the calling thread is suspended until the child terminates (either normally, by calling _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2). Until that point, the child shares all memory with its parent, including the stack. The child must not return from the current function or call exit(3) (which would have the effect of calling exit handlers established by the parent process and flushing the parent's stdio(3) buffers), but may call _exit(2). Sorry, but I don't understand assembly very much. Glibc code is also difficult for me to read. But yes, I can see that I'm issuing a call to execve(2), and that may produce bugs if the vfork(2) and execve(2) wrappers are so wrong that they cannot be actually used. However, I expect that someone would have realized before if that was actually the case, and would have documented it in the manual page. Anyway, the manual page is missing information (at least it should note that syscall(SYS_vfork) is wrong), so I'm not closing this thread yet, so that Florian and Christian (they are CCd in my email) who know their code much better than I do, may propose some change to the manual page that makes clear how these functions can and cannot be used. Thanks, Alex
There are at least two reasons to call vfork() through the libc vfork() wrapper, and not via syscall(SYS_vfork) or even INTERNAL_SYSCALL: - The libc wrapper takes care of properly storing and restoring the return address, which could otherwise be clobbered by e.g. a later call to execve() in the child, as shown graphically in https://www.mail-archive.com/uclinux-dev@uclinux.org/msg01290.html - The compiler may have special handling for functions calling vfork() (and setjmp(), and functions marked with attribute returns_twice, ...). E.g. last I checked (https://stackoverflow.com/questions/9347928/return-value-in-vfork-system-call#comment11829359_9347975 which, admittedly, was a decade ago), GCC disallowed tail call optimization, global common subexpresion elimination and jump bypassing for functions calling vfork() or setjmp(). Re: comment 11: > > if (execve(..... > Actually it touches the stack. It's not allowed Note the phrasing: > or calls any other function before successfully calling > _exit(2) or one of the exec(3) family of functions. vfork() is implemented so as to work correctly with a later call to the execve() function.
Alejandro Colomar, Finally I found. Yes, It's SAFE to call syscall(SYS_VFORK). If you vforked() using this way, it's not allowed to modify the stack. But if you vfork() using glibc function vfork(), the library do nasty hacks in order to allow you to touch stack - I mean push something. After returning from syscall in parent process, glibc restores stack differently than usual. So, no bugs neither in kernel, nor in glibc. The bug in documentation. It should be explained: 1. man vfork: Using syscall(SYS_vfork) [or calling this syscall in assembly] is very dangerous. Yes, it's allowed if child do not touch stack at all. For example, in child branch it's not allowed in general to use local variables and also, calling other function is also disallowed. Especially calling glibc's _exit() of _execve(). Even calling syscall(SYS_exit) is not allowed, since syscall() is a usual glibc function. But you can call syscalls in assembly if you have some reason to do so. In short - please use glibc wrapper instead. 2. man syscall: this could be implemented as a usual function, i.e. not a macro. In some picky contexts (for example, in the child process after direct calling syscall(SYS_vfork) rather than vfork()) it's not allowed to call any function at all, so be warned.