Linux has a legacy of treating threads like processes that share
memory. The situation was a lot worse about 15 years ago, but it's
still far from perfect. Despite lots of fixes to the way signals,
process termination and replacement via execve, etc. are handled to
make threads behave like threads, plenty of ugly remnants of the idea
that "threads are just processes sharing memory" remain; the big areas
are:
Today I want to focus on the last area, permissions, namely real, effective, and saved user and group ids.
Per POSIX, the real, effective, and saved user and group ids are a property of the process. This is sometimes a good thing and sometimes a bad thing -- it makes the traditional idiom of temporarily changing effective uid to access a file "as if by another user" impossible or at least impractical in multi-threaded programs -- but as we'll see, the dangers of doing it any other way are so severe that the POSIX way is the only right way. (Besides, Linux offers an extension, fsuid/fsgid, which can be used to reproduce that old idiom, and which glibc has kept thread-local.)
On the other hand, Linux treats these ids as a property local to each thread, not a property of the process (or, in Linux kernel terminology, the "thread group").
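The per-thread behavior can be observed by issuing the id-changing syscall directly, bypassing libc's process-wide emulation. What follows is a minimal sketch and not code from any libc: the names `per_thread_ids_demo` and `raw_change` and the target uid 65534 are assumptions made purely for illustration. It can only show a difference when run with enough privilege to change ids; otherwise it reports that nothing could be observed.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_UID 65534  /* assumption: an unprivileged uid ("nobody" on many systems) */

static uid_t thread_euid;   /* effective uid seen inside the helper thread */
static int thread_changed;  /* nonzero if the raw syscall succeeded */

/* Change only the effective uid with a raw syscall, so that no libc
   emulation gets a chance to propagate it to other threads.
   Note: on some 32-bit ABIs SYS_setresuid is a legacy 16-bit variant. */
static void *raw_change(void *arg)
{
    (void)arg;
    if (syscall(SYS_setresuid, (uid_t)-1, (uid_t)DEMO_UID, (uid_t)-1) == 0)
        thread_changed = 1;
    thread_euid = (uid_t)syscall(SYS_geteuid);
    return NULL;
}

/* Returns 1 if the helper thread's euid diverged from the caller's,
   0 if no change could be made (e.g. unprivileged), -1 on error. */
int per_thread_ids_demo(void)
{
    pthread_t t;
    uid_t before = geteuid();

    thread_changed = 0;
    if (pthread_create(&t, NULL, raw_change, NULL) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    if (!thread_changed || thread_euid == before)
        return 0;
    /* The calling thread's euid should be untouched by the other thread. */
    return geteuid() == before ? 1 : -1;
}
```

Run as root, `per_thread_ids_demo()` would be expected to return 1: the helper thread ends up with a different effective uid while the calling thread keeps its own. Run unprivileged, the raw syscall fails and the function returns 0.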
Since any remotely reasonable libc on Linux is going to want to
provide POSIX semantics for setuid and related functions, userspace
emulation of a single process-wide set of ids is necessary. The
approach pioneered by glibc, which musl roughly followed, and which
I've reported bugs on in the past, is to use signals to asynchronously
capture control of all threads in the process, and direct them all to
perform the desired id-setting syscall at the same time.
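The general shape of this signal-based emulation can be sketched as follows. This is a deliberately simplified illustration, not glibc's or musl's code: the names `broadcast_id_change`, `setxid_handler`, and `worker`, the fixed thread count, and the use of SIGUSR1 are all assumptions. To stay runnable without privileges, each thread performs `setresuid(-1, -1, -1)`, which leaves every id unchanged. The sketch also only signals threads it created itself, which sidesteps the hard part of knowing which threads exist.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define NWORKERS 4  /* hypothetical fixed number of extra threads */

static sem_t ready_sem, ack_sem;
static atomic_int failures, stop_flag;

/* The per-thread id syscall; (-1,-1,-1) means "change nothing". */
static long do_id_syscall(void)
{
    return syscall(SYS_setresuid, (uid_t)-1, (uid_t)-1, (uid_t)-1);
}

/* Runs in each signaled thread: perform the syscall, then acknowledge. */
static void setxid_handler(int sig)
{
    int saved_errno = errno;
    (void)sig;
    if (do_id_syscall() != 0)
        atomic_fetch_add(&failures, 1);
    sem_post(&ack_sem);  /* sem_post is async-signal-safe */
    errno = saved_errno;
}

static void *worker(void *arg)
{
    (void)arg;
    sem_post(&ready_sem);
    while (!atomic_load(&stop_flag))
        usleep(1000);
    return NULL;
}

static void wait_sem(sem_t *s)
{
    while (sem_wait(s) != 0 && errno == EINTR)
        ;
}

/* Returns the number of threads that ran the handler, or -1 on failure. */
int broadcast_id_change(void)
{
    pthread_t th[NWORKERS];
    struct sigaction sa;
    int i, created = 0, result;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = setxid_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    sem_init(&ready_sem, 0, 0);
    sem_init(&ack_sem, 0, 0);
    atomic_store(&failures, 0);
    atomic_store(&stop_flag, 0);

    for (i = 0; i < NWORKERS; i++, created++)
        if (pthread_create(&th[i], NULL, worker, NULL) != 0)
            break;
    for (i = 0; i < created; i++)
        wait_sem(&ready_sem);

    /* Capture every known thread, then wait until each has acknowledged. */
    for (i = 0; i < created; i++)
        pthread_kill(th[i], SIGUSR1);
    for (i = 0; i < created; i++)
        wait_sem(&ack_sem);
    if (do_id_syscall() != 0)  /* the calling thread changes its own ids too */
        atomic_fetch_add(&failures, 1);

    atomic_store(&stop_flag, 1);
    for (i = 0; i < created; i++)
        pthread_join(th[i], NULL);
    result = (created == NWORKERS && atomic_load(&failures) == 0) ? created : -1;
    return result;
}
```

A real implementation additionally has to hold off thread creation and exit during the broadcast and decide what to do when only some threads succeed; neither is attempted here.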
Atomicity of changes: It needs to be impossible for the application to observe mid-call states where some threads have one set of ids and other threads have a different set of ids.
Atomicity of failure: If some threads succeed in changing their ids
while others fail, the process is left in an inconsistent and
dangerous state with no way to back out. What happens then? musl
versions up through 1.1.5 attempted to avoid the main cause of
failure in old kernels (pre-3.1) by raising RLIMIT_NPROC to
RLIM_INFINITY during the operation, but this had issues
and still failed to cover failures due to other causes (mainly,
Linux bugs due to id-change operations requiring kernel memory
allocation). The only safe answer is to raise SIGKILL when this
happens.
Failure to report failure: Does the application even know if some threads failed to change their ids? This was glibc bug 13347.
Applications that ignore the return value: There's not much that can be done for them, but the complexities of emulating correct multi-threaded id-change operations increase the chance of an application getting a failure on an operation that, conceptually, should not be able to fail (e.g. dropping root).
Async-signal-safety: POSIX requires setuid and setgid to be
async-signal-safe. This is a difficult requirement to reconcile
efficiently with any mechanism that involves locking. glibc simply
ignores the AS-safety requirement, and musl is only addressing it
now.
Finality of privilege-dropping: Is there any possibility of
(potentially untrusted) code in the application that runs after
dropping privileges with setuid re-gaining the dropped privileges?
Unfortunately, the answer is yes, and I'll construct a scenario
demonstrating the issue below.
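Before turning to that scenario, the two failure-related items in the list above can be made concrete. The following is a hedged sketch of how the per-thread results of an emulated id change might be combined into a single outcome; the names `classify_results`, `finish_setxid`, and `enum xid_outcome` are hypothetical and do not come from glibc or musl. Uniform success or uniform failure is reported to the caller, while a mixed result is treated as unrecoverable and ends the process with SIGKILL.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

enum xid_outcome { XID_OK, XID_FAIL, XID_INCONSISTENT };

/* results[i] is 0 if thread i changed its ids, or an errno value if not.
   On XID_FAIL, *err_out receives the first error seen. */
enum xid_outcome classify_results(const int *results, size_t n, int *err_out)
{
    size_t ok = 0;
    int first_err = 0;

    for (size_t i = 0; i < n; i++) {
        if (results[i] == 0)
            ok++;
        else if (first_err == 0)
            first_err = results[i];
    }
    if (ok == n)
        return XID_OK;
    if (ok == 0) {
        *err_out = first_err;
        return XID_FAIL;  /* nothing changed anywhere: safe to report */
    }
    return XID_INCONSISTENT;  /* some threads changed, some did not */
}

/* Returns 0 on success, -1 with errno set on uniform failure.
   Does not return if the threads ended up with differing ids. */
int finish_setxid(const int *results, size_t n)
{
    int err = 0;

    switch (classify_results(results, n, &err)) {
    case XID_OK:
        return 0;
    case XID_FAIL:
        errno = err;
        return -1;
    default:
        raise(SIGKILL);  /* no way to back out of a mixed state */
        abort();         /* fallback; not expected to be reached */
    }
}
```

An application calling the real `setuid` still has to check the return value itself; a function like `finish_setxid` can only ensure that a failure is either reported honestly or fatal.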
For this example, suppose we have a program that performs some
privileged operations like authenticating with a host key or binding
to a privileged port, then drops root and executes user-provided code
-- either C code obtained by dlopen, or code in some sort of
interpreted language that can perform low-level operations. Further,
suppose that before dropping root, there has been, at least
momentarily, more than one thread, and at least one thread is
"exiting" (past the point of no return) at the time of the setuid
call.
Now, setuid begins signaling and collecting threads to change their
ids.
If you're glibc, you explicitly ignore the exiting thread:
    static void
    internal_function
    setxid_mark_thread (struct xid_command *cmdp, struct pthread *t)
    {
      int ch;

      /* Wait until this thread is cloned.  */
      if (t->setxid_futex == -1
          && ! atomic_compare_and_exchange_bool_acq (&t->setxid_futex, -2, -1))
        do
          lll_futex_wait (&t->setxid_futex, -2, LLL_PRIVATE);
        while (t->setxid_futex == -2);

      /* Don't let the thread exit before the setxid handler runs.  */
      t->setxid_futex = 0;

      do
        {
          ch = t->cancelhandling;

          /* If the thread is exiting right now, ignore it.  */
          if ((ch & EXITING_BITMASK) != 0)
            {
              /* Release the futex if there is no other setxid in
                 progress.  */
              if ((ch & SETXID_BITMASK) == 0)
                {
                  t->setxid_futex = 1;
                  lll_futex_wake (&t->setxid_futex, 1, LLL_PRIVATE);
                }
              return;
            }
        }
      while (atomic_compare_and_exchange_bool_acq (&t->cancelhandling,
                                                   ch | SETXID_BITMASK, ch));
    }
Versions of musl up through 1.1.6 don't explicitly ignore exiting threads, but instead simply lack a means of seeing them; they're not included in the current thread count, and not signalable since they've already blocked all signals in preparation for exiting (so that the application can't observe a signal handler running in the thread after it formally exited, and in the case of detached threads, so the thread can unmap its own stack before exiting).
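Why a thread that has blocked all signals cannot be captured this way can be shown with a small experiment. This is a sketch under assumptions, not musl's exit path: the names `signal_blocked_thread`, `exiting_thread`, and `left_pending` and the choice of SIGUSR1 are illustrative. The helper thread blocks every signal, as a thread preparing to exit would; the signal sent to it then stays pending and the handler never runs.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

static atomic_int handler_runs;  /* how many times the handler executed */
int left_pending;                /* 1 if SIGUSR1 was still pending in the thread */
static sem_t masked_sem, sent_sem;

static void on_signal(int sig)
{
    (void)sig;
    atomic_fetch_add(&handler_runs, 1);
}

/* Stand-in for a thread past the point of no return: it blocks all
   signals first, then finishes its remaining work. */
static void *exiting_thread(void *arg)
{
    sigset_t all, pending;
    (void)arg;

    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, NULL);
    sem_post(&masked_sem);  /* tell the caller the mask is in place */
    sem_wait(&sent_sem);    /* wait until the signal has been sent */
    sigpending(&pending);
    left_pending = sigismember(&pending, SIGUSR1);
    return NULL;
}

/* Returns how many times the handler ran (0 is the interesting result),
   or -1 if setup failed. */
int signal_blocked_thread(void)
{
    pthread_t t;
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    sem_init(&masked_sem, 0, 0);
    sem_init(&sent_sem, 0, 0);
    atomic_store(&handler_runs, 0);
    left_pending = 0;

    if (pthread_create(&t, NULL, exiting_thread, NULL) != 0)
        return -1;
    sem_wait(&masked_sem);
    pthread_kill(t, SIGUSR1);  /* directed at the thread, but it is blocked */
    sem_post(&sent_sem);
    pthread_join(t, NULL);
    return atomic_load(&handler_runs);
}
```

The thread keeps whatever ids it had when it blocked signals, which is exactly the window the scenario relies on.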
Now we get to the fun (or scary) part: setuid returns success and
the untrusted code begins to run, but there's a dying thread still
waiting to get scheduled and perform its last few instructions before
making the syscall to terminate itself. Since our untrusted thread
code is evil, it calls mmap with MAP_FIXED to map over the entire
memory space with a NOP slide ending in shellcode. Now, when the
exiting thread that still has its privileges gets scheduled, it runs
the shellcode as root.
Oops.
There are various solutions to the problem. The one I've opted for in
musl is not to trust its own view of the number or identity of live
threads, since there could be exiting threads we don't see. Instead,
we're switching to using /proc/self/task to identify all threads in
the kernel's view of the thread-group (process), and waiting for all
of them either to respond to signals, or exit. There are lots of ugly
obstacles to making this work, but they've all proven solvable.
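The enumeration step alone can be sketched like this. It is only a starting point under stated assumptions, not musl's implementation: the function names `list_task_ids` and `caller_is_listed` are hypothetical, and the sketch ignores the obstacles mentioned above, such as threads appearing or exiting while the directory is being read. Each numeric entry in /proc/self/task is the kernel thread id of one thread in the thread group.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Store up to cap thread ids in out. Returns the total number of threads
   found (which may exceed cap), or -1 if the directory can't be opened. */
int list_task_ids(pid_t *out, size_t cap)
{
    DIR *d = opendir("/proc/self/task");
    struct dirent *e;
    int n = 0;

    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        char *end;
        long tid = strtol(e->d_name, &end, 10);

        /* Skip "." and ".." and anything else that is not a number. */
        if (end == e->d_name || *end != '\0')
            continue;
        if ((size_t)n < cap)
            out[n] = (pid_t)tid;
        n++;
    }
    closedir(d);
    return n;
}

/* Returns 1 if the calling thread shows up in the kernel's list. */
int caller_is_listed(void)
{
    pid_t tids[256];
    pid_t self = (pid_t)syscall(SYS_gettid);
    int n = list_task_ids(tids, 256);

    for (int i = 0; i < n && i < 256; i++)
        if (tids[i] == self)
            return 1;
    return 0;
}
```

Unlike a libc-internal thread count, this list reflects the kernel's view, so a thread that has blocked signals and is about to exit still appears in it until it is really gone.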
The main alternative I see from a userspace side is having a
supervisor thread to monitor the lifetimes of all threads, including
detached ones with caller-provided stacks, using the
CLONE_CHILD_CLEARTID futex to ensure that any threads not caught by
signals have fully exited. I opted not to take this approach in musl
because it imposes costly architectural constraints on the threads
implementation, as well as heavy additional synchronization and
syscall overhead at thread creation/exit time, for the sake of
supporting one ugly corner case. But other implementors may see it
differently.
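For readers unfamiliar with the CLONE_CHILD_CLEARTID mechanism, here is a small sketch of the kernel behavior such a supervisor would rely on: when the task exits, the kernel writes 0 to a designated word and performs a futex wake on it. The sketch makes several simplifying assumptions that a real threads implementation would not: it uses glibc's `clone` wrapper to create a memory-sharing child without CLONE_THREAD, the child does nothing, and the names `wait_for_cleartid`, `child_fn`, and `child_tid_word` are invented for illustration.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The kernel stores the child's tid here at creation and clears it to 0,
   followed by a futex wake, once the child has released the shared memory. */
static pid_t child_tid_word;

static int child_fn(void *arg)
{
    (void)arg;
    return 0;  /* exit immediately; the return value becomes the exit status */
}

/* Returns 0 once the kernel has cleared the tid word, -1 on failure. */
int wait_for_cleartid(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid, v;

    if (!stack)
        return -1;
    /* CLONE_PARENT_SETTID fills the word before clone() returns, so the
       parent can never mistake "not yet started" for "already exited". */
    pid = clone(child_fn, stack + stack_size,
                CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, &child_tid_word, NULL, &child_tid_word);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    /* Sleep until the word becomes 0; FUTEX_WAIT returns immediately if
       the word no longer holds the value we last read. */
    while ((v = __atomic_load_n(&child_tid_word, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &child_tid_word, FUTEX_WAIT, v, NULL, NULL, 0);

    waitpid(pid, NULL, 0);  /* reap the child before freeing its stack */
    free(stack);
    return 0;
}
```

A supervisor built on this would need one such word per thread, including detached threads on caller-provided stacks, which hints at the bookkeeping and synchronization cost mentioned above.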
In any case, the right solution is to fix the kernel and eliminate the need for the costly and error-prone emulation in userspace. Linux should introduce new syscalls which atomically set the user or group ids for all threads in the process (thread-group), and deprecate per-thread ids (except fsuid/fsgid). Depending on how it's done (especially with a need to keep the old syscalls too), this may have significant implementation cost on the kernel side, but given the security risks that keep popping up from having this wrong, it's unreasonable for the kernel not to fix it.