Bug 215769

| Summary: | vfork() returns EINVAL after unshare(CLONE_NEWTIME) | | |
|---|---|---|---|
| Product: | Other | Reporter: | Коренберг Марк (socketpair) |
| Component: | Other | Assignee: | documentation_man-pages (documentation_man-pages) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | alx, brauner, fweimer |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Description
Коренберг Марк
2022-03-29 11:02:44 UTC
Hello Коренберг Марк,

On 3/29/22 13:02, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215769
>
> Bug ID: 215769
> Summary: man 2 vfork() does not document corner case when PID == 1
> Product: Documentation
> Version: unspecified
> Hardware: All
> OS: Linux
> Status: NEW
> Severity: normal
> Priority: P1
> Component: man-pages
> Assignee: documentation_man-pages@kernel-bugs.osdl.org
> Reporter: socketpair@gmail.com
> Regression: No
>
> If a process has PID=1 (for example in pid namespace), calling vfork() always
> returns EINVAL. (https://bugs.python.org/issue47151).
>
> Please add this information in "RETURN VALUE" section or just in somewhere else
> in the manpage.
>
> Actually, it may be a bug in Linux kernel, I don't know. Possibly because the
> init process must not be suspended ?

Sorry, but I couldn't reproduce it. Could you please run the following test program on the same system where you're experiencing the bug?

I ran it on Debian Sid with kernel 5.16 and glibc 2.33:

$ uname -a
Linux ADY-debian-11 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15) x86_64 GNU/Linux
$ /lib/x86_64-linux-gnu/libc.so.6 | head -n1
GNU C Library (Debian GLIBC 2.33-7) release release version 2.33.

$ cat vfork.c
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    pid_t pid;

    if (unshare(CLONE_NEWPID | CLONE_NEWNS) == -1)
        err(EXIT_FAILURE, "unshare(2)");
    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        err(EXIT_FAILURE, "sigaction(2)");

    pid = fork();
    switch (pid) {
    case 0:
        break;
    case -1:
        err(EXIT_FAILURE, "fork(2)");
    default:
        errx(EXIT_SUCCESS, "Parent exiting normally.");
    }

    if (getpid() != 1)
        errx(EXIT_FAILURE, "Child is not PID 1.");

    /* I'm not sure if I need to ignore it again, but just in case. */
    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        err(EXIT_FAILURE, "sigaction(2)");

    pid = vfork();
    switch (pid) {
    case 0:
        errx(EXIT_SUCCESS, "Grandchild exiting normally.");
    case -1:
        /* If we got here, the report is confirmed. */
        err(EXIT_FAILURE, "vfork(2)");
    default:
        errx(EXIT_SUCCESS, "Child exiting normally.");
    }
}

$ cc -Wall -Wextra vfork.c
$ sudo ./a.out
a.out: Parent exiting normally.
a.out: Grandchild exiting normally.
a.out: Child exiting normally.
$

If you can confirm the bug with this program, please send your system details (most importantly, kernel and libc versions).

Thanks,
Alex

On 3/30/22 02:48, Alejandro Colomar (man-pages) wrote:
> $ uname -a
> Linux ADY-debian-11 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1
> (2022-03-15) x86_64 GNU/Linux
> $ /lib/x86_64-linux-gnu/libc.so.6 | head -n1
> GNU C Library (Debian GLIBC 2.33-7) release release version 2.33.

On (almost?) any system, you should be able to run the following command to get the libc version:

$ which ld | xargs ldd | sed -n '/libc\b/s/.* \(\/[^ ]* \).*/\1/p' | xargs sh -c | head -n1
GNU C Library (Debian GLIBC 2.33-7) release release version 2.33.

[...]
> If you can confirm the bug with this program, please send your system
> details (most importantly, kernel and libc versions).
>

Hi,
I appreciate depth of information validation. Actually, you are right. vfork() DOES work with pid=1 processes. I figured out the cause in my case. In order to reproduce -- add unshare(CLONE_NEWTIME) just before vfork(). Now, I don't know if it's a bug in vfork() or in fork(). Yes, both are clone() actually.

In any case, they should either both give EINVAL or both don't fail. But it's definitely bug in the kernel around CLONE_NEWTIME.

#define _GNU_SOURCE 1
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

int main (void)
{
    if (unshare (CLONE_NEWTIME)) err (EXIT_FAILURE, "UNSHARE_NEWTIME");

    pid_t pid;
    switch (pid=vfork ())
    {
    case 0:
        _exit(0);
    case -1:
        err(EXIT_FAILURE, "vfork BUG");
    default:
        waitpid(pid, NULL, 0);
    }
    return 0;
}

[Added some kernel CCs that may know what's going on]

Hi,

On 3/31/22 09:53, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215769
>
> --- Comment #3 from Коренберг Марк (socketpair@gmail.com) ---
> Hi,
> I appreciate depth of information validation. Actually, you are right. vfork()
> DOES work with pid=1 processes. I figured out the cause in my case. In order to
> reproduce -- add unshare(CLONE_NEWTIME) just before vfork(). Now, I don't know
> if it's a bug in vfork() or in fork(). Yes, both are clone() actually.
>
> In any case, they should either both give EINVAL or both don't fail. But it's
> definitely bug in the kernel around CLONE_NEWTIME.
>

On 3/31/22 10:12, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215769
>
> --- Comment #4 from Коренберг Марк (socketpair@gmail.com) ---
> #define _GNU_SOURCE 1
> #include <stdio.h>
> #include <sched.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <err.h>
>
> #ifndef CLONE_NEWTIME
> #define CLONE_NEWTIME 0x00000080
> #endif
>
> int main (void)
> {
>     if (unshare (CLONE_NEWTIME)) err (EXIT_FAILURE, "UNSHARE_NEWTIME");
>
>     pid_t pid;
>     switch (pid=vfork ())
>     {
>     case 0:
>         _exit(0);
>     case -1:
>         err(EXIT_FAILURE, "vfork BUG");
>     default:
>         waitpid(pid, NULL, 0);
>     }
>     return 0;
> }
>

I could reproduce it with the following code. I tried syscall(SYS_vfork) to make sure it's not a problem in the libc wrapper, and to make sure I do call vfork(2). If I replace vfork(2) with fork(2), I don't get the error.

$ cat vfork.c
#define _GNU_SOURCE
#include <err.h>
#include <linux/sched.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    pid_t pid;

    if (unshare(CLONE_NEWTIME) == -1)
        err(EXIT_FAILURE, "unshare(2)");
    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        err(EXIT_FAILURE, "sigaction(2)");
    pid = syscall(SYS_vfork);
    switch (pid) {
    case 0:
        errx(EXIT_SUCCESS, "Grandchild exiting normally.");
    case -1:
        /* If we got here, the report is confirmed. */
        err(EXIT_FAILURE, "vfork(2)");
    default:
        errx(EXIT_SUCCESS, "Child exiting normally.");
    }
}

$ cc -Wall -Wextra -Werror vfork.c
$ sudo ./a.out
a.out: vfork(2): Invalid argument

$ grep_syscall_def vfork
kernel/fork.c:2711:
SYSCALL_DEFINE0(vfork)
{
    struct kernel_clone_args args = {
        .flags = CLONE_VFORK | CLONE_VM,
        .exit_signal = SIGCHLD,
    };

    return kernel_clone(&args);
}

Maybe someone in the kernel can send some patch for the clone(2) and/or vfork(2) manual pages that explains the reason (if it's intended).

Thanks,
Alex

On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> [...]
> Maybe someone in the kernel can send some patch for the clone(2) and/or
> vfork(2) manual pages that explains the reason (if it's intended).

Hey Alejandro,

I won't be able to send a patch very soon but I can at least explain why
you see EINVAL. :)

This is intended.

vfork() suspends the parent process and the child process will share the
same vm as the parent process. If the child process is in a new time
namespace different from its parent process it is not allowed to be in
the same threadgroup or share virtual memory with the parent process.
That's why you see EINVAL.

Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
process to be moved into a different time namespace. Only the newly
created child process will be after a subsequent
fork()/vfork()/clone()/clone3()...

The semantics are equivalent to that of CLONE_NEWPID in this regard. You
can see this via /proc/<pid>/ns/ where you see two entries for pid
namespaces and also two entries for time namespaces:

* CLONE_NEWTIME
  * /proc/<pid>/ns/time // current time namespace
  * /proc/<pid>/ns/time_for_children // time namespace for the new child process

If during fork:

    parent_process->time != parent_process->time_for_children

and either CLONE_VM or CLONE_THREAD is set you see EINVAL.

You can thus replicate the same error via:

    unshare(CLONE_NEWTIME)

and a

    clone() or clone3() call with CLONE_VM or CLONE_THREAD.

Christian

Christian,

If you are right, why don't the other CLONE_* flags have this problem? Only CLONE_NEWTIME does, and I don't think it is special. Possibly I'm not deep enough into the subject.

Yes, I looked at the kernel sources, and I found the exact place where EINVAL is returned. For threads, yes, it's intended. But for fork()/execve() I think it should work.

In my experiments, vfork() works OK with all present CLONE_* namespaces except CLONE_NEWTIME. Strange, isn't it?

Hey, Christian!

On 4/4/22 10:05, Christian Brauner wrote:
> On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > [Added some kernel CCs that may know what's going on]
[...]
> > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > vfork(2) manual pages that explains the reason (if it's intended).
>
> Hey Alejandro,
>
> I won't be able to send a patch very soon but I can at least explain why
> you see EINVAL. :)

Don't hurry, we're not planning to release anytime soon :)

>
> This is intended.
>
> vfork() suspends the parent process and the child process will share the
> same vm as the parent process. If the child process is in a new time
> namespace different from its parent process it is not allowed to be in
> the same threadgroup or share virtual memory with the parent process.
> That's why you see EINVAL.

That makes a lot of sense to me.

>
> Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> process to be moved into a different time namespace. Only the newly
> created child process will be after a subsequent
> fork()/vfork()/clone()/clone3()...
>
> The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> can see this via /proc/<pid>/ns/ where you see two entries for pid
> namespaces and also two entries for time namespaces:
>
> * CLONE_NEWTIME
>   * /proc/<pid>/ns/time // current time namespace
>   * /proc/<pid>/ns/time_for_children // time namespace for the new child process

Also makes sense. Michael taught me that a few weeks ago :)

This also triggers some doubt: will the same problem happen with CLONE_NEWPID, since it also moves the child into a new ns (in this case a PID one)? See test program below.

>
> If during fork:
>
> parent_process->time != parent_process->time_for_children
>
> and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
>
> You can thus replicate the same error via:
>
> unshare(CLONE_NEWTIME)
>
> and a
>
> clone() or clone3() call with CLONE_VM or CLONE_THREAD.

So, to test my doubts, I wrote this similar program (and also similar programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME, and one with CLONE_NEWNS):

$ cat vfork_newpid.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <linux/sched.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static char *const child_argv[] = {
    "print_pid",
    NULL
};

static char *const child_envp[] = {
    NULL
};

int
main(void)
{
    pid_t pid;

    printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());

    if (unshare(CLONE_NEWPID) == -1)
        err(EXIT_FAILURE, "unshare(2)");
    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
        err(EXIT_FAILURE, "signal(2)");

    pid = syscall(SYS_vfork);
    //pid = vfork(); // This behaves differently.
    switch (pid) {
    case 0:
        execve("/home/alx/tmp/print_pid", child_argv, child_envp);
        /* The argument is cast to long, so the format must be %ld. */
        err(EXIT_SUCCESS, "PID %ld exiting after execve(2)",
            (long) getpid());
    case -1:
        err(EXIT_FAILURE, "vfork(2)");
    default:
        errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
    }
}

$ cat print_pid.c
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    /* The argument is cast to long, so the format must be %ld. */
    errx(EXIT_SUCCESS, "PID %ld exiting.", (long) getpid());
}

$ cc -Wall -Wextra -Werror -o print_pid print_pid.c
$ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
$
$
$ sudo ./vfork_newpid
vfork_newpid: PID: 8479
vfork_newpid: PID 8479 exiting after execve(2): Success
print_pid: PID 1 exiting.
$
$
$ sudo ./vfork_newtime
vfork_newtime: PID: 8484
vfork_newtime: vfork(2): Invalid argument
$
$
$ sudo ./vfork_newns
vfork_newns: PID: 8486
vfork_newns: PID 8486 exiting after execve(2): Success
print_pid: PID 8487 exiting.

The first thing I noted is that usage of vfork(2) differs considerably from fork(2), and that's something that's not clear from reading the manual page. It says that the parent process is suspended until the child calls execve(2), but I expected it to mean that vfork(2) doesn't return to the parent until that happened, but was otherwise transparent. I was wrong and my tests showed me that.

I was going to propose an example program for the manual page, when I decided to try a slightly different thing: call vfork() instead of syscall(SYS_vfork); that changed the behavior to the same one as with fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the child).

Is that also intended? I couldn't find the glibc wrapper source code, so I don't know what glibc is doing here, but I straced the processes, and they're all calling vfork(), so the behavior should be consistent; it's quite weird. I'm very confused at this point.

I'm also wondering why it's okay to have processes in different PID ns share the same vm, but I guess that's an implementation detail that I don't need to care about that much.

Thanks for the details!

Cheers,
Alex

GLIBC: sysdeps/unix/sysv/linux/spawni.c

It uses clone(CLONE_VM | CLONE_VFORK | SIGCHLD) instead of a direct vfork() syscall.

Also, the vfork() function implementation in glibc depends on the architecture. On x86 it should be 1:1, i.e. it should call the vfork() syscall directly. On some other architectures, it calls clone() with flags.

On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
>
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> >
> > Hey Alejandro,
> >
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
>
> Don't hurry, we're not planning to release anytime soon :)
>
> >
> > This is intended.
> >
> > vfork() suspends the parent process and the child process will share the
> > same vm as the parent process. If the child process is in a new time
> > namespace different from its parent process it is not allowed to be in
> > the same threadgroup or share virtual memory with the parent process.
> > That's why you see EINVAL.
>
> That makes a lot of sense to me.
>
> >
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> >
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> >
> > * CLONE_NEWTIME
> >   * /proc/<pid>/ns/time // current time namespace
> >   * /proc/<pid>/ns/time_for_children // time namespace for the new child process
>
> Also makes sense. Michael taught me that a few weeks ago :)
>
> This also triggers some doubt: will the same problem happen with
> CLONE_NEWPID since it also moves the child into a new ns (in this case a PID
> one)? See test program below.

No, it won't. A pid namespace places no relevant constraints on vm usage whereas a time namespace does. If a task joins a new time namespace it'll clean the VVAR page tables and refault them with the new layout after the timens change. That affects all tasks which use the same task->mm. Since CLONE_THREAD implies CLONE_VM this would affect the whole thread-group behind their back. All threads would suddenly change timens. No such issues exist for pid namespaces; they don't need to alter task->mm.

>
> >
> > If during fork:
> >
> > parent_process->time != parent_process->time_for_children
> >
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> >
> > You can thus replicate the same error via:
> >
> > unshare(CLONE_NEWTIME)
> >
> > and a
> >
> > clone() or clone3() call with CLONE_VM or CLONE_THREAD.
>
> So, to test my doubts, I wrote this similar program (and also similar
> programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME,
> and one with CLONE_NEWNS):
>
> [...]
>
> The first thing I noted is that usage of vfork(2) differs considerably from
> fork(2), and that's something that's not clear from reading the manual page.
> It says that the parent process is suspended until the child calls
> execve(2), but I expected it to mean that vfork(2) doesn't return to the
> parent until that happened, but was otherwise transparent. I was wrong and
> my tests showed me that.
>
> I was going to propose an example program for the manual page, when I
> decided to try a slightly different thing: call vfork() instead of
> syscall(SYS_vfork); that changed the behavior to the same one as with
> fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the
> child).
>
> Is that also intended? I couldn't find the glibc wrapper source code, so I
> don't know what glibc is doing here, but I straced the processes, and
> they're all calling vfork(), so the behavior should be consistent; it's
> quite weird. I'm very confused at this point.

glibc does vfork() via inline assembly massaging. There's probably atfork handlers and a bunch of other stuff involved so it's difficult to do a remote diagnosis.

(And note that calling anything other than execve() or _exit() after vfork() is basically undefined behavior.)

>
>
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess that's an implementation detail that I don't need to
> care about that much.

See earlier in the thread.

(In reply to brauner from comment #10)
> glibc does vfork() via inline assembly massaging. There's probably
> atfork handlers and a bunch of other stuff involved so it's difficult to
> do a remote diagnosis.

glibc does not run fork handlers for vfork.

> (And note that calling anything other than execve() or _exit() after
> vfork() is basically undefined behavior.)

Historically, glibc supports calling malloc after vfork, so that applications can implement their own form of close_range by reading /proc/self/fd.

I wonder if we need to handle CLONE_NEWTIME in posix_spawn in some way. Surely that clone(CLONE_VM | CLONE_VFORK | SIGCHLD) call should fail as well if vfork is blocked after CLONE_NEWTIME, so CLONE_NEWTIME probably breaks posix_spawn.

> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.

I definitely think this is a kernel (or glibc) bug. execve(2) is supposed to _never_ return 0 (and errno 0).

I submitted a new bug to discuss it. Please see <https://bugzilla.kernel.org/show_bug.cgi?id=215813>

Please help me to change the Bugzilla fields in this issue. I think this is a kernel bug. Nothing should be changed in the documentation.

And also, please stop discussing how to correctly use vfork() (regarding modifying the stack, glibc and so on). This issue is actually about vfork() + CLONE_NEWTIME.

Regarding CLONE_VM: we have no problems for, say, the PID namespace. Suppose we have a parent process with two threads. Let the second thread call vfork(); it is stopped as expected. So the child process is running, and so is the first thread of the parent process. They share VM, but getpid() will give different values in them, right? If YES, I don't see anything that prevents doing the same for CLONE_NEWTIME. If NO, this is a bug.

Actually, yes:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>      // unshare(), CLONE_NEWPID
#include <time.h>       // nanosleep(), CLOCK_MONOTONIC
#include <syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <pthread.h>

static void *showpid (void *restrict arg)
{
    (void) arg;
    for (int i = 0; i < 25; i++)
    {
        printf ("thr pid=%lu\n", (unsigned long) syscall (SYS_getpid));
        struct timespec ts = { 0, 100000000 };  // 0.1 sec
        nanosleep (&ts, NULL);
    }
    return NULL;
}

int main (void)
{
    pthread_t thr;
    if (pthread_create (&thr, NULL, showpid, NULL))
        abort ();

    if (unshare (CLONE_NEWPID) == -1)
        err (EXIT_FAILURE, "unshare(newpid)");

    sleep (1);  // allow the thread to work for 1 second

    pid_t p = vfork ();
    if (!p)
    {
        static char qwe[100];
        static struct timespec ts = { 1, 0 };
        syscall (SYS_write, 1, qwe, sprintf (qwe, "child: %lu\n", (unsigned long) syscall (SYS_getpid)));
        syscall (SYS_clock_nanosleep, CLOCK_MONOTONIC, 0, &ts, NULL);
        syscall (SYS_write, 1, qwe, sprintf (qwe, "child sleep complete\n"));
        _exit (0);
    }
    if (p == -1)
        err (EXIT_FAILURE, "vfork");

    printf("Waiting for child\n");
    waitpid (p, NULL, 0);
    pthread_join (thr, NULL);
}

$ sudo ./a.out
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
child: 1
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
child sleep complete
Waiting for child
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371
thr pid=46371

I think the fundamental issue is that a time namespace needs a new data segment for the vDSO, to store the changed offset. So fully shared virtual memory just is not possible with CLONE_NEWTIME, which is why vfork will not work in this situation.

Maybe the effect of CLONE_NEWTIME should have been deferred to the next execve?

I missed a bunch of messages here. Apparently they weren't sent out as mails. I only saw the replies from Alejandro.

Please change the "Assignee". It's not about documentation.

What assignee or component do you want to change it to?

It's a bug or missing feature in the Linux namespaces subsystem. I don't know to whom it should be assigned. It's some core kernel functionality. It's not about documentation.