The current POSIX-wide implementation of system() in glibc does the following in the parent process:
1.a it sets the process-wide signal handlers for SIGINT and SIGQUIT to ignore,
1.b and it blocks SIGCHLD.
The current Linux-specific implementation of posix_spawn() in glibc blocks all signals in the parent process.
What are the reasons for these signal handling manipulations?
system() has a rather specific use case: think of :read !<cmd> in vim, or shell <cmd> in gdb. From POSIX 2018:
The last paragraph refers to the two mechanisms for learning about child processes: waitpid() (or similar) and SIGCHLD.

It is possible to use both or neither, but the child process will (unless configured otherwise) become a zombie process until the parent calls waitpid(). I would therefore consider SIGCHLD an extra option, useful e.g. for single-threaded processes that don't want to block waiting for child processes (consider shell background tasks).

Since the parent might have a signal handler installed for SIGCHLD, and that signal handler might call waitpid() or similar before system() internally calls waitpid(), the system() implementation might lose the information about the child's exit: only the first waitpid() receives that information. Note that we assume the parent process is single-threaded, so blocking SIGCHLD in the parent thread is sufficient.
You can also set options on any signal handler you install for SIGCHLD that influence this mess, which is something I have not looked into.
This is because posix_spawn() uses something like vfork(). From glibc's spawni.c:

CLONE_VFORK is meant to provide one aspect of the venerable vfork() function. vfork() creates a child process, but the parent and the child process share the same memory:

Spawning a new process on Unix systems was originally based on the model of fork() + exec(): first, the parent process calls fork(), which literally creates a "fork" in the execution by spawning a child process that runs on a copy of the parent's memory and starts execution by returning from fork(). Both threads, the spawner in the parent and the new child thread, return from fork(), but they operate on different copies of the same memory contents. The child process then does some minor set-up, such as arranging the stdin/stdout/stderr file descriptors, and finally calls exec(). The exec() call replaces the child's memory with an image of the executable to be run.

Since fork() needs to provide a copy of the memory, it was originally slow and expensive (there was no copy-on-write support at the time). Therefore, vfork() was provided: here, the parent and the child process operate on the same memory. That is, modifications to memory made by the child affect the parent and vice versa. Because it is dangerous for two threads to operate on the same stack, vfork() blocks the spawner thread (in the parent process) until the child has either called exec() or exited. The exec() call ends the memory sharing; in other words, the child process has its own memory after exec().

In Linux, fork() and vfork() are essentially special cases of the clone() system call (some architectures do not even provide separate fork or vfork system calls), and the two aspects of vfork() are selected via flags to clone(): CLONE_VFORK and CLONE_VM. CLONE_VFORK implements the blocking aspect of vfork(), that is, it blocks the spawner thread until the child calls exec() or exits. CLONE_VM controls the sharing of memory: if set, both processes operate on the same memory (until exec()); if unset, the child gets a (copy-on-write) clone of the parent's memory.

According to the comment, glibc is mostly concerned about the child clobbering the stack of the parent. To avoid this, it allocates a new stack for the child and calls clone() with CLONE_VM | CLONE_VFORK. The clone() syscall has an additional parameter for setting the stack of the child. For vfork() proper, that would be the stack of the parent; glibc passes its dedicated allocation instead.

Any signal handler executed in the child also operates on the memory of the parent, but the state of the child process is a bit strange:
- The child and the threads of the parent process are not part of the same thread group (process), yet they access the same data and locks.
- The child's thread-local storage belongs to the parent thread, so it does not match the child's own thread ID and process ID (glibc shares the TLS between parent and child by not setting CLONE_SETTLS).
- The child shares the memory with the parent, but other resources such as file descriptors are not shared. File descriptors in particular are only inherited, which means that, for example, closing a file descriptor in the child only affects the child.
If the parent (thread) receives a signal while waiting for the child in clone(CLONE_VFORK), it will only handle that signal after the child has exited or called exec(). (This is documented in the man page for vfork() but not in the one for clone().)

Signals can be sent to the child directly (for whatever reason), originate within the child (e.g. SIGSEGV or SIGTTIN), or affect an entire process group (e.g. SIGINT). Some of these are not entirely under the control of the code between clone() and exec() (like SIGINT and SIGTTIN), which explains why these signals should not be handled in the child by the parent's existing signal handlers. posix_spawn() therefore resets every custom signal handler to its default disposition in the child.

This still doesn't quite explain why signals are blocked in the parent, though. I assume glibc wants to avoid a race condition where a signal arrives in the child between the clone() and the set-up of the signal handlers in the child: since the child inherits the parent's signal mask, blocking all signals in the parent before clone() means the child starts with all signals blocked. Glibc restores the signal mask in the child only after resetting the signal handlers (and after performing the posix_spawn() file actions). Linux 5.5 added another flag, CLONE_CLEAR_SIGHAND (usable only via clone3()), which at least takes care of resetting the signal dispositions, i.e. uninstalling any custom signal handlers. I wonder if this means you could also get rid of the signal blocking (in the parent).

Since we have allocated a stack for the child, we also need to free it. That stack lives in the parent's memory space, and the child cannot (easily) free it because it runs on that stack until exec(), which might even return on failure. So the parent needs to wait until the child is done with the stack (exec() succeeded or the child exited) before releasing the allocation. I'm not sure why waitpid() was not used as the synchronization mechanism; the comment in glibc merely states that this synchronization is the reason for using CLONE_VFORK. Presumably, part of the reason is that waitpid() only reports the child's termination, not a successful exec(), so the parent would have to wait for the spawned program to finish.