Traditional unix systems had a vfork function, which works like fork,
but without creating a new virtual address space; the parent and child
run in the same address space. Unlike with pthread_create, where the
new thread runs on its own stack, vfork behaves like fork and “returns
twice”, once in the child and once in the parent. This seems
impossible, since the parent and child would clobber one another’s
stacks, but a clever trick saves the day: the parent process is
suspended until the child performs exec or _exit, breaking the
shared-memory-space relation between the two processes.
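
To make the calling pattern concrete, here is a minimal sketch of the only thing the traditional interface really allows: vfork, then exec or _exit in the child. The helper name run_sh_with_vfork and the choice of /bin/sh as the program to run are my own assumptions for illustration, not anything taken from musl or glibc.

```c
#define _DEFAULT_SOURCE   /* assumption: needed for the vfork declaration on glibc/musl */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `sh -c "exit 7"` through vfork and return the
 * child's exit status, or -1 on error. */
int run_sh_with_vfork(void)
{
    char *const argv[] = { "sh", "-c", "exit 7", NULL };
    int status;

    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: still running in the parent's address space, so it does
         * nothing except exec, falling back to _exit if exec fails. */
        execv("/bin/sh", argv);
        _exit(127);
    }

    /* Parent: only gets here once the child has called exec or _exit. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the child calls _exit rather than exit; running atexit handlers or flushing stdio buffers in the child would modify the parent's memory.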

vfork was omitted from POSIX and modern standards because it’s
difficult to use; the original specification for the function left it
undefined to do basically anything except exec or _exit after vfork in
the child. However, many systems (including Linux) still provide a
similar or identical interface at the kernel level, and interest in it
has revived for two reasons: huge processes cannot fork on systems
with strict commit charge accounting, because there is not enough
memory to commit for the copy, and copying the virtual memory layout
of a process can be expensive if the process has a huge number of maps
created by mmap. As such, both musl and glibc use vfork to implement
posix_spawn, the modern interface for executing external programs as a
new process.
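
For readers who have not used posix_spawn, a minimal sketch of a caller follows. The helper name spawn_and_wait is hypothetical, and passing NULL for the file actions and attributes is just the simplest possible configuration, not a recommendation.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn `path` with `argv`, wait for it, and return
 * its exit status, or -1 on error. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    /* posix_spawn returns 0 on success or an errno value on failure;
     * it does not return -1 and set errno. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The caller never sees whether the library used fork, vfork, or something else underneath, which is exactly why the implementation has to get the details below right on the caller's behalf.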

While working on vfork usage in musl’s posix_spawn implementation, I
found that using it is a lot trickier (and more dangerous!) than I’d
realized before. Here are some of the issues.

Signal handlers and vfork

Since the vfork child runs in the same address space as the parent,
care needs to be taken to ensure that it does not modify the parent’s
memory in unwanted/unsafe ways. This seems easy enough, until you
realize that the calling program might have installed signal handlers,
and these signal handlers could get invoked in the child. The most
likely way this could happen is when signals are sent to entire
process groups, for example, as a result of events like pressing the
interrupt/quit key on the controlling terminal, or resizing the
terminal. However, signals can also be sent explicitly to a process
group.

If a signal handler runs in the child after vfork, there are several
different ways it could corrupt the parent’s state:

As such, if vfork is to be used in code where the caller might have
signal handlers which could be broken by the above issues, it’s
necessary to ensure that the parent’s signal handlers don’t get
invoked in the child. This amounts to:

1. Block all signals before calling vfork.
2. In the child, reset all signal dispositions which are not SIG_IGN
   to SIG_DFL, and then restore the old signal mask.
3. After vfork returns in the parent, restore the signal mask.

Unfortunately, step 2 was illegal (undefined behavior) in the original
specification of vfork, which basically meant vfork was impossible to
use. Fortunately, on systems where vfork is supported, more specific
semantics are provided/guaranteed, and step 2 works.
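
The sequence above (block signals, reset handlers in the child, restore the mask on both sides) can be sketched roughly as follows. This is an illustration under assumptions, not musl's actual code: the function name vfork_exec_signal_safe is made up, it relies on Linux-style behavior where the child may call sigaction and sigprocmask after vfork, and a multi-threaded caller would want pthread_sigmask instead of sigprocmask.

```c
#define _DEFAULT_SOURCE   /* assumption: for vfork and NSIG on glibc/musl */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the child's pid, or -1 if vfork failed. */
pid_t vfork_exec_signal_safe(const char *path, char *const argv[])
{
    sigset_t all, old;
    struct sigaction cur;
    struct sigaction dfl = { .sa_handler = SIG_DFL };
    int sig;
    pid_t pid;

    sigemptyset(&dfl.sa_mask);

    /* Step 1: block all signals so no handler can run in the child. */
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &old);

    pid = vfork();
    if (pid == 0) {
        /* Step 2: reset every caught signal to SIG_DFL, leaving SIG_IGN
         * alone. Signal numbers the libc rejects are simply skipped. */
        for (sig = 1; sig < NSIG; sig++) {
            if (sigaction(sig, NULL, &cur) != 0)
                continue;
            if (cur.sa_handler == SIG_IGN || cur.sa_handler == SIG_DFL)
                continue;
            sigaction(sig, &dfl, NULL);
        }
        /* Only now is it safe to unblock signals in the child. */
        sigprocmask(SIG_SETMASK, &old, NULL);
        execv(path, argv);
        _exit(127);
    }

    /* Step 3: back in the parent (or on vfork failure), restore the mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}
```

The sketch depends on the child's signal dispositions being separate from the parent's even though memory is shared, which is the case for vfork on Linux; on a system without that guarantee, the loop in step 2 would wipe out the parent's handlers too.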

Threads and vfork

Formally, vfork was pretty much history by the time people started
caring about threads. But in real-world implementations (Linux), we
can observe that vfork doesn’t suspend the whole parent process (which
would be really difficult to do, anyway), but instead just suspends
the calling thread until the child calls exec/_exit. This means that
concurrency issues exist, and the vfork child is actually sharing
memory with other running code, not just a suspended parent.

This leads us to...

setuid and vfork

Now we get to the worst of it. Threads and vfork allow you to get into
a situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.

The easy out would be just giving up on vfork. But in musl, a major
target is systems where robustness is required (no overcommit) and
memory might be constrained; therefore, fork is not a good option.
Also, using vfork to implement posix_spawn might eventually allow us
to support no-MMU targets.

In musl, there’s already a global lock that controls calls to the
setuid family of functions; it was needed because Linux leaves it to
the userspace process to synchronize all of its threads when making
the setuid, etc. syscalls, rather than doing the synchronization in
kernelspace. Thus, it was easy to just reuse this lock to prevent
uid/gid changes while posix_spawn is running. I believe glibc could do
the same, since it has an equivalent locking mechanism in NPTL. musl
may have a slight advantage here at present, in that the lock we’re
using is a reader-writer lock, and callers of posix_spawn only count
as readers, not writers.

I’ve filed and reopened several glibc bug reports related to the above issues:

These are security-relevant, but the rarity of multi-threaded programs
that change their uid/gid after going threaded, and the rarity of
real-world programs using posix_spawn, make the impact extremely low
at present.

Linux provides a CLONE_VFORK flag to clone which provides similar
semantics to the traditional behavior of vfork, but allows the new
process to run on a separate stack with its own entry point, instead
of utilizing the returns-twice idiom of fork. However, this does not
solve any of the above problems; the signal handler and setuid issues
still exist, and code in the child still has to tiptoe around anything
that might upset the parent’s state. As such, I don’t see it as being
any more useful than “traditional” vfork for implementing things like
posix_spawn.
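
For comparison, a minimal Linux-only sketch of the clone variant is below. The helper names (child_entry, clone_vfork_exec), the 64 KiB stack size, and the use of malloc for the child stack are assumptions chosen for brevity. The signal and setuid precautions discussed earlier are deliberately left out, so this shows the interface, not a safe posix_spawn.

```c
#define _GNU_SOURCE       /* for clone() and the CLONE_* flags */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct child_args {
    const char *path;
    char *const *argv;
};

/* Entry point for the child; runs on its own stack but in the
 * parent's address space. */
static int child_entry(void *p)
{
    struct child_args *a = p;
    execv(a->path, a->argv);
    _exit(127);
}

/* Hypothetical helper: returns the child's exit status, or -1 on error. */
int clone_vfork_exec(const char *path, char *const argv[])
{
    enum { STACK_SIZE = 64 * 1024 };
    struct child_args args = { path, argv };
    int status;

    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;

    /* The stack grows downward on common architectures, so pass its top.
     * CLONE_VM shares memory; CLONE_VFORK suspends the calling thread
     * until the child execs or exits; SIGCHLD makes waitpid work. */
    pid_t pid = clone(child_entry, stack + STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

    /* The child is no longer using the stack once clone has returned. */
    free(stack);
    if (pid < 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The separate stack removes the returns-twice awkwardness, but everything the child touches through args, and any global state, is still the parent's memory.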