EWONTFIX

Multi-threaded setxid on Linux

15 Jan 2015 16:12:00 GMT

Background

Linux has a legacy of treating threads like processes that share memory. The situation was a lot worse about 15 years ago, but it's still far from perfect. Despite lots of fixes to the way signals, process termination and replacement via execve, etc. are handled to make threads behave like threads, plenty of ugly remnants of the idea that "threads are just processes sharing memory" remain; the big areas are:

Today I want to focus on the last area, permissions, namely real, effective, and saved user and group ids.

Per POSIX, the real, effective, and saved user and group ids are a property of the process. This is sometimes a good thing and sometimes a bad thing -- it makes the traditional idiom of temporarily changing effective uid to access a file "as if by another user" impossible or at least impractical in multi-threaded programs -- but as we'll see, the dangers of doing it any other way are so severe that the POSIX way is the only right way. (Besides, Linux offers an extension, fsuid/fsgid, which can be used to reproduce that old idiom and which glibc has kept thread-local.)

On the other hand, Linux treats these ids as a property local to each thread, not a property of the process (or, in Linux kernel terminology, the "thread group").
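One place the kernel's per-thread view is visible is procfs: every task under /proc/self/task has its own status file, and its "Uid:" line lists that task's real, effective, saved, and filesystem uids. Here's a minimal sketch that reads that line for a single task. The helper name read_task_uids is my own invention for illustration, not part of any libc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fetch the real, effective, saved and fs uid that the
   kernel records for one task (thread) of the calling process.
   Returns 0 on success, -1 if the task or its "Uid:" line was not found. */
int read_task_uids(pid_t tid, unsigned uids[4])
{
    char path[64], line[256];
    int result = -1;

    snprintf(path, sizeof path, "/proc/self/task/%d/status", (int)tid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Uid:", 4) == 0) {
            /* Field order in the status file: real, effective, saved, fs. */
            if (sscanf(line + 4, "%u %u %u %u",
                       &uids[0], &uids[1], &uids[2], &uids[3]) == 4)
                result = 0;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Called with the main thread's tid (which equals getpid()), the first two values should match getuid() and geteuid(). Called with the tid of another thread, it reports whatever that particular thread currently has.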

Resolving the Linux/POSIX discrepancy

Since any remotely reasonable libc on Linux is going to want to provide POSIX semantics for setuid and related functions, userspace emulation of a single process-wide set of ids is necessary. The approach pioneered by glibc, which musl roughly followed, and which I've reported bugs on in the past, is to use signals to asynchronously capture control of all threads in the process, and direct them all to perform the desired id-setting syscall at the same time.
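To make the shape of that approach concrete, here's a rough sketch of the signal broadcast. It is not glibc's or musl's actual code, and it rests on several assumptions of mine: SIGUSR1 stands in for whatever signal a libc reserves internally, the workers array stands in for the libc's internal thread list, and all the function names are made up. The per-thread step is the raw setresuid syscall with every argument set to -1, which leaves the ids unchanged, so the sketch runs without privileges. It has to be the raw syscall, since the libc wrapper would itself try to broadcast.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define NWORKERS 4

static pthread_t workers[NWORKERS]; /* stand-in for libc's thread list */
static sem_t ack;                   /* one post per thread that responded */
static atomic_int calls, failures, stop;

/* The per-thread step: a raw syscall that affects only the calling thread.
   With all arguments -1 it changes nothing. */
static void do_id_syscall(void)
{
    if (syscall(SYS_setresuid, (uid_t)-1, (uid_t)-1, (uid_t)-1) != 0)
        atomic_fetch_add(&failures, 1);
    atomic_fetch_add(&calls, 1);
}

/* Runs in each signaled thread; uses only async-signal-safe operations. */
static void on_signal(int sig)
{
    int saved_errno = errno;
    (void)sig;
    do_id_syscall();
    sem_post(&ack);
    errno = saved_errno;
}

/* Idle worker: sleeps in short steps until told to stop. */
static void *worker(void *arg)
{
    struct timespec ts = { 0, 1000000 };
    (void)arg;
    while (!atomic_load(&stop))
        nanosleep(&ts, NULL);
    return NULL;
}

/* Start NWORKERS threads, make every thread perform the syscall, and return
   how many threads performed it (workers plus the caller), or -1 on error. */
int broadcast_demo(void)
{
    struct sigaction sa;
    int made = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0 || sem_init(&ack, 0, 0) != 0)
        return -1;

    while (made < NWORKERS &&
           pthread_create(&workers[made], NULL, worker, NULL) == 0)
        made++;

    /* Signal every known thread, then wait for each acknowledgement. */
    for (int i = 0; i < made; i++)
        pthread_kill(workers[i], SIGUSR1);
    for (int i = 0; i < made; i++)
        while (sem_wait(&ack) != 0 && errno == EINTR)
            ;
    do_id_syscall(); /* finally, the calling thread itself */

    atomic_store(&stop, 1);
    for (int i = 0; i < made; i++)
        pthread_join(workers[i], NULL);
    sem_destroy(&ack);

    if (made != NWORKERS || atomic_load(&failures))
        return -1;
    return atomic_load(&calls);
}
```

Calling broadcast_demo() should return NWORKERS + 1 once every worker and the caller have made the syscall. Note what the sketch quietly relies on: the list of threads to signal is complete, and every thread on it can still receive signals.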

Potential pitfalls

An exploitable scenario

For this example, suppose we have a program that performs some privileged operations like authenticating with a host key or binding to a privileged port, then drops root and executes user-provided code -- either C code obtained by dlopen, or code in some sort of interpreted language that can perform low-level operations. Further, suppose that before dropping root, there has been, at least momentarily, more than one thread, and at least one thread is "exiting" (past the point of no return) at the time of the setuid call.

Now, setuid begins signaling and collecting threads to change their ids.

If you're glibc, you explicitly ignore the exiting thread:

static void
internal_function
setxid_mark_thread (struct xid_command *cmdp, struct pthread *t)
{
  int ch;

  /* Wait until this thread is cloned.  */
  if (t->setxid_futex == -1
      && ! atomic_compare_and_exchange_bool_acq (&t->setxid_futex, -2, -1))
    do
      lll_futex_wait (&t->setxid_futex, -2, LLL_PRIVATE);
    while (t->setxid_futex == -2);

  /* Don't let the thread exit before the setxid handler runs.  */
  t->setxid_futex = 0;

  do
    {
      ch = t->cancelhandling;

      /* If the thread is exiting right now, ignore it.  */
      if ((ch & EXITING_BITMASK) != 0)
        {
          /* Release the futex if there is no other setxid in
             progress.  */
          if ((ch & SETXID_BITMASK) == 0)
            {
              t->setxid_futex = 1;
              lll_futex_wake (&t->setxid_futex, 1, LLL_PRIVATE);
            }
          return;
        }
    }
  while (atomic_compare_and_exchange_bool_acq (&t->cancelhandling,
                                               ch | SETXID_BITMASK, ch));
}

Versions of musl up through 1.1.6 don't explicitly ignore exiting threads, but instead simply lack a means of seeing them; they're not included in the current thread count, and not signalable since they've already blocked all signals in preparation for exiting (so that the application can't observe a signal handler running in the thread after it formally exited, and in the case of detached threads, so the thread can unmap its own stack before exiting).

Now we get to the fun (or scary) part: setuid returns success and the untrusted code begins to run, but there's a dying thread still waiting to get scheduled and perform its last few instructions before making the syscall to terminate itself. Since our untrusted code is evil, it calls mmap with MAP_FIXED to map over the entire memory space with a NOP slide ending in shellcode. Now, when the exiting thread that still has its privileges gets scheduled, it runs the shellcode as root.

Oops.

Fixing the problem

There are various solutions to the problem. The one I've opted for in musl is not to trust its own view of the number or identity of live threads, since there could be exiting threads we don't see. Instead, we're switching to using /proc/self/task to identify all threads in the kernel's view of the thread-group (process), and waiting for all of them either to respond to signals, or exit. There are lots of ugly obstacles to making this work, but they've all proven solvable.
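The enumeration step on its own is simple; the obstacles are everything around it. As a minimal sketch (my own illustration, not musl's code, with made-up function names), this counts the tasks the kernel currently lists for the process, and checks the count against a known number of extra threads parked on a semaphore:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <pthread.h>
#include <semaphore.h>

#define MAX_EXTRA 16

/* Count entries in /proc/self/task, i.e. the threads the kernel currently
   considers part of this thread group. Returns -1 if procfs is unavailable. */
int count_kernel_tasks(void)
{
    DIR *d = opendir("/proc/self/task");
    struct dirent *e;
    int n = 0;

    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL)
        if (isdigit((unsigned char)e->d_name[0])) /* skips "." and ".." */
            n++;
    closedir(d);
    return n;
}

static sem_t release_sem;

/* Thread body: block until released, so the thread stays alive while the
   caller scans /proc/self/task. */
static void *parked(void *arg)
{
    (void)arg;
    while (sem_wait(&release_sem) != 0)
        ;
    return NULL;
}

/* Create n parked threads, count tasks while they are alive, then release
   and join them. Returns the count seen, or -1 on failure. */
int count_with_extra_threads(int n)
{
    pthread_t t[MAX_EXTRA];
    int made = 0, seen;

    if (n < 0 || n > MAX_EXTRA || sem_init(&release_sem, 0, 0) != 0)
        return -1;
    while (made < n && pthread_create(&t[made], NULL, parked, NULL) == 0)
        made++;

    seen = (made == n) ? count_kernel_tasks() : -1;

    for (int i = 0; i < made; i++)
        sem_post(&release_sem);
    for (int i = 0; i < made; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&release_sem);
    return seen;
}
```

If count_kernel_tasks() returns some base value before any extra threads exist, count_with_extra_threads(3) should return base + 3. A real implementation can't stop at counting, of course: it has to act on each task id it finds and cope with threads appearing or exiting during the scan, which this sketch ignores entirely.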

The main alternative I see from a userspace side is having a supervisor thread to monitor the lifetimes of all threads, including detached ones with caller-provided stacks, using the CLONE_CHILD_CLEARTID futex to ensure that any threads not caught by signals have fully exited. I opted not to take this approach in musl because it imposes costly architectural constraints on the threads implementation, as well as heavy additional synchronization and syscall overhead at thread creation/exit time, for the sake of supporting one ugly corner case. But other implementors may see it differently.
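For reference, the CLONE_CHILD_CLEARTID primitive looks roughly like this when used by hand. This is only a sketch of the primitive, not of a supervisor thread, and it makes assumptions: it goes through glibc's clone() wrapper, the task it creates shares memory but is not a real pthread (no CLONE_THREAD), and the names are mine. The kernel zeroes the tid word and does a futex wake on it as the task exits, so a waiter that sees zero knows that task is on its way out of the shared address space.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static _Alignas(16) char child_stack[64 * 1024];
static volatile int child_ran;
static pid_t tid_word; /* set to the child's tid, cleared by the kernel at exit */

/* Child body. It shares the parent's TLS, so it avoids libc calls and only
   touches shared memory before returning (which exits the task). */
static int child_fn(void *arg)
{
    (void)arg;
    child_ran = 1;
    return 0;
}

/* Create a memory-sharing task, then block on the futex until the kernel
   clears tid_word. Returns 0 if the child ran and its exit was observed,
   -1 otherwise. */
int wait_for_task_exit_demo(void)
{
    /* CLONE_PARENT_SETTID stores the tid before clone() returns, so the
       word is nonzero from the waiter's point of view from the start. */
    int flags = CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD;
    pid_t pid = clone(child_fn, child_stack + sizeof child_stack, flags,
                      NULL, &tid_word, NULL, &tid_word);
    pid_t v;

    if (pid < 0)
        return -1;

    /* Sleep only while the word still holds the value we last read; the
       loop also absorbs spurious wakeups and EINTR. */
    while ((v = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &tid_word, FUTEX_WAIT, v, NULL, NULL, 0);

    waitpid(pid, NULL, 0); /* reap the task so it doesn't linger as a zombie */
    return child_ran ? 0 : -1;
}
```

wait_for_task_exit_demo() should return 0 on a typical Linux system. Doing this for every thread in a process, including detached ones, is where the architectural and synchronization costs described above come from.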

In any case, the right solution is to fix the kernel and eliminate the need for the costly and error-prone emulation in userspace. Linux should introduce new syscalls which atomically set the user or group ids for all threads in the process (thread-group), and deprecate per-thread ids (except fsuid/fsgid). Depending on how it's done (especially with a need to keep the old syscalls too), this may have significant implementation cost on the kernel side, but given the security risks that keep popping up from having this wrong, it's unreasonable for the kernel not to fix it.