EWONTFIX

vfork considered dangerous

21 Oct 2012 21:20:22 GMT

Traditional unix systems had a vfork function, which works like fork, but without creating a new virtual address space; the parent and child run in the same address space. Unlike with pthread_create, where the new thread runs on its own stack, vfork behaves like fork and “returns twice”, once in the child and once in the parent. This seems impossible, since the parent and child would clobber one another’s stacks, but a clever trick saves the day: the parent process is suspended until the child performs exec or _exit, breaking the shared-memory-space relation between the two processes.
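As a minimal sketch of that pattern (the helper name `run_with_vfork` is hypothetical, not from any library), the child below does nothing except exec or _exit, and the parent only continues once one of those has happened:

```c
#define _DEFAULT_SOURCE   /* exposes vfork() in glibc headers under strict -std modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (searched in PATH) and return its exit
 * status, or -1 if the child could not be created or did not exit normally. */
int run_with_vfork(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = vfork();            /* "returns twice", like fork */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: running in the parent's address space, on the parent's
         * stack. The only safe moves are exec or _exit. */
        execvp(argv[0], argv);
        _exit(127);                 /* reached only if exec failed */
    }

    /* Parent: was suspended until the child exec'd or exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the child calls _exit rather than exit: running atexit handlers or flushing stdio buffers in the child would modify the parent's memory.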

vfork was omitted from POSIX and modern standards because it’s difficult to use; the original specification for the function left it undefined to do basically anything except exec or _exit after vfork in the child. However, many systems (including Linux) still provide a similar or identical interface at the kernel level, and interest in it has revived for two reasons. First, huge processes cannot fork on systems with strict commit charge accounting, because there is not enough memory to account for a second copy. Second, copying the virtual memory layout of a process can be expensive if the process has a huge number of maps created by mmap. As such, both musl and glibc use vfork to implement posix_spawn, the modern interface for executing external programs as a new process.
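For reference, this is roughly what a caller of posix_spawn looks like. It is only a sketch: the wrapper name `spawn_and_wait` is made up for illustration, and it uses the PATH-searching variant posix_spawnp with default attributes and no file actions.

```c
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical wrapper: spawn argv[0] with the caller's environment and
 * return its exit status, or -1 on failure. */
int spawn_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid;
    /* NULL file actions and NULL attributes: the child inherits everything as-is. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;                  /* the error number is the return value */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The caller never sees fork or vfork at all, which is what lets the libc pick whichever mechanism is cheapest and safest underneath.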

While working on vfork usage in musl’s posix_spawn implementation, I found that using it is a lot trickier (and more dangerous!) than I’d realized before. Here are some of the issues.

Signal handlers and `vfork`

Since the vfork child runs in the same address space as the parent, care needs to be taken to ensure that it does not modify the parent’s memory in unwanted/unsafe ways. This seems easy enough, until you realize that the calling program might have installed signal handlers, and these signal handlers could get invoked in the child. The most likely way this could happen is when signals are sent to entire process groups, for example, as a result of events like pressing the interrupt/quit key on the controlling terminal, or resizing the terminal. However, signals can also be sent explicitly to a process group.

If a signal handler runs in the child after vfork, there are several different ways it could corrupt the parent’s state:

- Modifications to memory will be seen by the parent, but modifications to the state of open file descriptors, signal dispositions, and other process state will not. This could leave the parent in an inconsistent state.
- The same signal might be seen twice “by the parent” (per what’s recorded in the parent’s memory) even though it should have been seen only once.
- The signal handler in the child process may store properties of the process (e.g. its pid) somewhere in the shared memory, where the parent’s values of these properties, not the child’s, should be stored.

As such, if vfork is to be used in code where the caller might have signal handlers which could be broken by the above issues, it’s necessary to ensure that the parent’s signal handlers don’t get invoked in the child. This amounts to:

1. Block all signals before calling vfork.
2. In the child, reset all signal dispositions that aren’t SIG_IGN to SIG_DFL, and then restore the old signal mask.
3. After vfork returns in the parent, restore the signal mask.

Unfortunately, step 2 was illegal (undefined behavior) in the original specification of vfork, which basically meant vfork was impossible to use. Fortunately, on systems where vfork is supported, more specific semantics are provided/guaranteed, and step 2 works.
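The three steps can be sketched as follows. This is an illustration under stated assumptions, not musl's actual code: the function name is hypothetical, it assumes a system such as Linux where the child may safely call sigaction and sigprocmask after vfork, and a multi-threaded caller would use pthread_sigmask in place of sigprocmask.

```c
#define _DEFAULT_SOURCE   /* exposes vfork() and NSIG under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] so that none of the caller's signal
 * handlers can run in the vfork child. Returns the exit status or -1. */
int run_with_signals_reset(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    sigset_t all, old;
    sigfillset(&all);

    /* Step 1: block all signals before calling vfork. */
    if (sigprocmask(SIG_BLOCK, &all, &old) < 0)
        return -1;

    pid_t pid = vfork();
    if (pid == 0) {
        /* Step 2: reset every caught signal to SIG_DFL, leaving SIG_IGN alone. */
        for (int sig = 1; sig < NSIG; sig++) {
            struct sigaction sa;
            if (sigaction(sig, NULL, &sa) != 0)
                continue;           /* not a signal number we can query */
            if (sa.sa_handler == SIG_IGN || sa.sa_handler == SIG_DFL)
                continue;
            sa.sa_handler = SIG_DFL;
            sa.sa_flags = 0;
            sigaction(sig, &sa, NULL);
        }
        /* Only now is it safe to restore the caller's mask in the child. */
        sigprocmask(SIG_SETMASK, &old, NULL);
        execvp(argv[0], argv);
        _exit(127);
    }

    /* Step 3: restore the mask in the parent (also when vfork failed). */
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The dispositions changed in the child do not leak back into the parent, since signal dispositions are among the process state that is not shared; only memory is.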

Threads and `vfork`

Formally, vfork was pretty much history by the time people started caring about threads. But in real-world implementations (Linux), we can observe that vfork doesn’t suspend the whole parent process (which would be really difficult to do, anyway), but instead just suspends the calling thread until the child calls exec or _exit. This means that concurrency issues exist, and the vfork child is actually sharing memory with other running code, not just a suspended parent.
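On Linux this is easy to observe. The sketch below is a demonstration only, and all of its names are hypothetical. It deliberately lets the child do more than exec or _exit, which relies on Linux's more specific vfork semantics. A second thread keeps incrementing a counter, and the vfork child measures how far that counter moves while the calling thread is suspended:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_long ticks;           /* bumped continuously by the other thread */
static atomic_int stop;

static void *ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop))
        atomic_fetch_add(&ticks, 1);
    return NULL;
}

/* Returns how far `ticks` advanced while the vfork child was running,
 * or -1 on error. */
long ticks_seen_by_vfork_child(void)
{
    pthread_t t;
    atomic_store(&stop, 0);
    if (pthread_create(&t, NULL, ticker, NULL) != 0)
        return -1;
    while (atomic_load(&ticks) == 0)
        sched_yield();              /* wait until the ticker is really running */

    volatile long seen = -1;        /* written by the child, read by the parent */
    struct timespec nap = { 0, 50 * 1000 * 1000 };     /* 50 ms */

    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: the calling thread is suspended, the ticker thread is not. */
        long start = atomic_load(&ticks);
        nanosleep(&nap, NULL);
        seen = atomic_load(&ticks) - start;
        _exit(0);
    }

    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    if (pid < 0)
        return -1;
    waitpid(pid, NULL, 0);
    return seen;
}
```

A positive return value means shared memory changed underneath the child while it ran, so the parent was not fully suspended.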

This leads us to...

`setuid` and `vfork`

Now we get to the worst of it. Threads and vfork allow you to get into a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.

Working around the issues

The easy out would be just giving up on vfork. But in musl, a major target is systems where robustness is required (no overcommit) and memory might be constrained; therefore, fork is not a good option. Also, using vfork to implement posix_spawn might eventually allow us to support no-MMU targets.

In musl, there’s already a global lock that controls calls to the setuid family of functions; it was needed because Linux requires a userspace process to synchronize all its threads to make the setuid, etc. syscalls rather than doing the synchronization in kernelspace. Thus, it was easy to just reuse this lock to prevent uid/gid changes while posix_spawn is running. I believe glibc could do the same, since it has an equivalent locking mechanism in NPTL. musl may have a slight advantage here at present, in that the lock we’re using is a reader-writer lock, and callers of posix_spawn only count as readers, not writers.
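The idea can be imitated at the application level. The sketch below is not musl's internal code; the lock and both wrapper names are hypothetical, and it only helps if every spawn and every credential change in the program goes through the wrappers. Spawners take the lock as readers, so they can run concurrently with one another, while a credential change takes it as a writer and therefore waits until no spawn is in flight:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical process-wide lock: spawns are readers, uid changes are writers. */
static pthread_rwlock_t cred_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Spawn argv[0] while holding the lock as a reader; return its exit status
 * or -1 on failure. */
int locked_spawn_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid;
    pthread_rwlock_rdlock(&cred_lock);
    /* With a vfork-based posix_spawn, the window in which the child shares
     * our memory lies inside this call, so the lock covers it. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    pthread_rwlock_unlock(&cred_lock);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Change uid only when no spawn is in progress. */
int locked_setuid(uid_t uid)
{
    pthread_rwlock_wrlock(&cred_lock);
    int r = setuid(uid);
    pthread_rwlock_unlock(&cred_lock);
    return r;
}
```

Doing this inside libc, as musl does, is more robust than an application-level wrapper, because it also covers library code that calls posix_spawn or setuid directly.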

Bug reports

I’ve filed and reopened several glibc bug reports related to the above issues:

These are security-relevant, but the rarity of multi-threaded programs that change their uid/gid after going threaded, and the rarity of real-world programs using posix_spawn, make the impact extremely low at present.

Some final thoughts

Linux provides a CLONE_VFORK flag to clone, which gives semantics similar to the traditional behavior of vfork but allows the new process to run on a separate stack with its own entry point, instead of using the returns-twice idiom of fork. However, this does not solve any of the above problems; the signal handler and setuid issues still exist, and code in the child still has to tiptoe around anything that might upset the parent’s state. As such, I don’t see it as being any more useful than “traditional” vfork for implementing things like posix_spawn.
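For comparison, here is a Linux-only sketch of that interface, using the glibc clone wrapper. The helper names and the 64 KiB stack size are assumptions made for illustration, and the signal and setuid precautions discussed above are left out to keep it short, so it is no safer than the plain vfork version.

```c
#define _GNU_SOURCE             /* clone() and the CLONE_* flags are Linux extensions */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)    /* assumed ample for execvp; not tuned */

/* Entry point of the child; runs on its own stack but in our address space. */
static int child_entry(void *arg)
{
    char *const *argv = arg;
    execvp(argv[0], argv);
    return 127;                 /* becomes the exit status if exec failed */
}

/* Hypothetical helper: run argv[0] via clone(CLONE_VM | CLONE_VFORK) and
 * return its exit status, or -1 on failure. */
int run_with_clone_vfork(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Stacks grow downward on the usual Linux targets, so pass the top. */
    pid_t pid = clone(child_entry, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, (void *)argv);

    /* CLONE_VFORK: we resume only after the child exec'd or exited, so its
     * stack is no longer in use. */
    free(stack);
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The separate stack removes the returns-twice awkwardness, but the child still shares everything else with the parent.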