32-bit x86 Position Independent Code - It's that bad

15 Apr 2015 03:23:06 GMT

Let's start by looking at a simple C function to be compiled as position-independent code (i.e. -fPIC, for use in a shared library):

void bar(void);

void foo(void)

And now, what GCC compiles it to (listing 2):

    pushl   %ebx
    subl    $24, %esp
    call    __x86.get_pc_thunk.bx
    addl    $_GLOBAL_OFFSET_TABLE_, %ebx
    movl    32(%esp), %eax
    movl    %eax, (%esp)
    call    bar@PLT
    addl    $24, %esp
    popl    %ebx
    movl    (%esp), %ebx

Wow. What went wrong?

Obviously what we'd like to see is:

    jmp bar

And that's what we'd get with non-PIC code generation, or with PIC and applying hidden visibility to bar. That ideal form is not attainable with PIC, because the PC-relative address of the callee (bar) may not be fixed at link-time; it may reside in another shared library or the main program.

Readers familiar with the principles of position-independent code and GOTs/PLTs might wonder why we can't achieve this (listing 4):

    jmp bar@PLT

Here, the symbol@PLT notation in the assembly tells the assembler to generate a special type of relocation, which the linker will use to resolve the relative address in the call instruction to a “procedure linkage table” thunk it generates in output. This thunk exists at a fixed location in the shared library (and thus a fixed relative address from the caller, no matter what address the library is loaded at) and is responsible for loading the actual address of the function and jumping to it.

Here's where things start to go wrong.

In order to be able to jump to the actual function bar, the PLT thunk needs to be able to access global data: in particular, a pointer in the “global offset table” (GOT) which it will use as its jump destination. The PLT thunk code looks like this (listing 5):

    jmp *bar@GOT(%ebx)
    push $0
    jmp <PLT slot -1>

The second and third instructions are related to lazy binding (more on that later) but the first one is the one we're interested in right now. Since 32-bit x86 provides no direct way to do memory loads/stores relative to the current instruction pointer, the SysV ABI provides that, when code generated as PIC calls into a PLT thunk, it must pass a pointer to the GOT as a hidden argument in %ebx.

So why does that preclude the code in listing 4 above?

Well, per the ABI, the only call-clobbered registers on x86 are %eax, %ecx, and %edx. The register used for the hidden GOT argument, %ebx, is call-saved. That means foo is responsible for backing up and restoring %ebx if it modifies it. So we have a horrible cascade of inefficiency:

  1. foo has to load %ebx as an argument to bar@PLT.
  2. foo has to save %ebx on the stack and restore it before returning because it's a call-saved register.
  3. The call to bar can't be a tail-call because foo has work to do after bar returns: it has to restore %ebx.

Thus the monstrosity in listing 2.

What can be done to fix it?

Well, a first thought might be to change the register used for the hidden argument, but that can't be done without breaking the ABI contract between the compiler and linker, so it's out.

What about getting rid of the requirement for the hidden argument to bar@PLT? Well, that would also be an ABI change, but at least not an incompatible one. In any case it's not really practical. There's no simple way to make the PLT thunk load the GOT address without clobbering registers, and the only three registers which are call-clobbered are also used for argument-passing in certain non-default but supported "regparm" calling conventions. The choice of %ebx was almost certainly intentional, to avoid clashing with registers that could potentially be used as arguments.

So what's left?

How about getting rid of the PLT thunk entirely? Instead of aiming to generate the code in listing 4, let's aim for this (listing 6):

    call    __x86.get_pc_thunk.cx
    jmp *bar@GOT+_GLOBAL_OFFSET_TABLE_(%ecx)
    movl    (%esp), %ecx

Not only does this eliminate nearly all of the overhead/bloat in foo; it also eliminates the one-instruction visit to an extra cache line where the PLT thunk resides. Seems like a big win.

Why wasn't this done from the beginning?

Unfortunately, there's a reason, and it's a really bad one.

The original purpose for having a PLT was for the main program, compiled for a fixed load address (think pre-PIE era) and not using position-independent code, to be able to call shared library functions. The type of PLT that appears in the main program does not need the hidden GOT argument in %ebx, because, being at a fixed address, it's free to use absolute addresses for accessing its own GOT. The main program does need the PLT, though, because it's incorporating “legacy” (non-PIC) object files that don't know the functions they're calling could be loaded at varying locations at run-time. (It's the linker that's responsible, when producing a dynamic-linked program from such object files, for generating the appropriate glue in the form of PLT thunks.)

Position-independent code does not need a PLT. As in my example in listing 6, such code can load the target address from the GOT itself and make the indirect call/jump. Rather, the use of a PLT in position-independent shared library code was instituted to exploit another property of having a PLT: lazy binding.

Basics of lazy binding

When lazy binding is used, the dynamic linker gets to skip looking up callee symbol names at load time; the lookup is deferred to the first time the function is called. In theory, this trades runtime overhead, in the form of complexity and major latency at the time of the first call to a function, for a moderate savings in startup-time.

At least, that was the theory a few decades ago when all this machinery was designed.

Nowdays, lazy binding is a huge liability for security, and its performance benefits have also come under question. The biggest problem is that, for lazy binding to work, the GOT must be writable at runtime, and that makes it an attack vector for arbitrary code execution. Modern hardened systems use relro, which protects part or all of the GOT read-only after loading, but GOT slots subject to lazy-binding in the PLT are excluded from this protection. To get significant benefit from the relro link feature, lazy binding must also be disabled, with the following link options:

-Wl,-z,relro -Wl,-z,now

So basically, lazy binding is, or should be considered, deprecated.

Incidentally, musl libc does not support lazy binding at all for these and other reasons.

Lazy binding and the PLT

Remember lines 2 and 3 of the sample PLT thunk in listing 5? Well, the way they work is that bar@GOT(%ebx) initially (prior to lazy binding) contains a pointer to line 2, setup by the dynamic linker. The constant 0 pushed in line 2 is the PLT/GOT slot number, and the code jumped to is a thunk that invokes the code to resolve the lazy binding, using the slot number that was pushed into the stack as its argument.

There's no easy way to achieve the same thing with the code in listing 6; attempting to do so would slow down the caller and require some code duplication at each call site.

So, the reason we don't have efficient x86 PIC function calls is to support an obsolete misfeature.

Fortunately, it's fixable!

If we can (optionally) give up lazy binding, that is.

Alexander Monakov has prepared this simple patch for GCC, which lets you disable PIC calls via PLT, and which probably has a chance of making it upstream:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3263656..cd5f246 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5451,7 +5451,8 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
       && !TARGET_64BIT
       && flag_pic
-      && (!decl || !targetm.binds_local_p (decl)))
+      && flag_plt
+      && (decl && !targetm.binds_local_p (decl)))
     return false;

   /* If we need to align the outgoing stack, then sibcalling would
@@ -25577,15 +25578,23 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
       /* Static functions and indirect calls don't need the pic register.  */
       if (flag_pic
          && (!TARGET_64BIT
+             || !flag_plt
              || (ix86_cmodel == CM_LARGE_PIC
                  && DEFAULT_ABI != MS_ABI))
          && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF
          && ! SYMBOL_REF_LOCAL_P (XEXP (fnaddr, 0)))
-         use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
-         if (ix86_use_pseudo_pic_reg ())
-           emit_move_insn (gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM),
-                           pic_offset_table_rtx);
+         if (flag_plt)
+           {
+             use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
+             if (ix86_use_pseudo_pic_reg ())
+               emit_move_insn (gen_rtx_REG (Pmode,
+                                            REAL_PIC_OFFSET_TABLE_REGNUM),
+                               pic_offset_table_rtx);
+           }
+         else
+           fnaddr = gen_rtx_MEM (QImode,
+                                 legitimize_pic_address (XEXP (fnaddr, 0), 0));

diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 301430c..aacc668 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -572,6 +572,10 @@ mprefer-avx128
 Target Report Mask(PREFER_AVX128) SAVE
 Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.

+Target Report Var(flag_plt) Init(0)
+Use PLT for PIC calls (-mno-plt: load the address from GOT at call site)
 ;; ISA support


I've been playing around with some similar changes to my local gcc 4.7.3 tree and was able to achieve the following output:

    call    __x86.get_pc_thunk.cx
    addl    $_GLOBAL_OFFSET_TABLE_, %ecx
    movl    bar@GOT(%ecx), %eax
    jmp *%eax
    movl    (%esp), %ecx

It's still a ways off from the ideal 2 functions, but much better than the original output.

Compared to listing 6, there are two differences. Loading bar@GOT(%ecx) into %eax to make the indirect call is utterly useless and just bad codegen that's hopefully fixed on newer GCC versions. Failure to combine the constants bar@GOT and _GLOBAL_OFFSET_TABLE_ (which actually resolves to _GLOBAL_OFFSET_TABLE_-.) into a single constant is a more fundamental problem, though. It would take a new relocation type to resolve: a relocation that resolves not to the fixed GOT-base-relative offset of the GOT slot for bar, but rather to the fixed instruction-pointer-relative offset of the GOT slot for bar. Having this new relocation type would make all GOT accesses mildly cheaper.

Post or read comments...

Multi-threaded setxid on Linux

15 Jan 2015 16:12:00 GMT


Linux has a legacy of treating threads like processes that share memory. The situation was a lot worse about 15 years ago, but it's still far from perfect. Despite lots of fixes to the way signals, process termination and replacement via execve, etc. are handled to make threads behave like threads, plenty of ugly remnants of the idea that "threads are just processes sharing memory" remain; the big areas are:

Today I want to focus on the last area, permissions, namely real, effective, and saved user and group ids.

Per POSIX, the real, effective, and saved user and group ids are a property of the process. This is sometimes a good thing and sometimes a bad thing -- it makes the traditional idiom of temporarily changing effective uid to access a file "as if by another user" impossible or at least impractical in multi-threaded programs -- but as we'll see, the dangers of doing it any other way are so severe that the POSIX way is the only right way. (Besides, Linux offers an extension, fsuid/fsgid, which can be used to reproduce that old idiom, and which have been kept as thread-local by glibc.)

On the other hand, Linux treats these ids as a property local to each thread, not a property of the process (or, in Linux kernel terminology, the "thread group").

Resolving the Linux/POSIX discrepency

Since any remotely reasonable libc on Linux is going to want to provide POSIX semantics for setuid and related functions, userspace emulation of a single process-wide set of ids is necessary. The approach pioneered by glibc, which musl roughly followed, and which I've reported bugs on in the past, is to use signals to asynchronously capture control of all threads in the process, and direct them all to perform the desired id-setting syscall at the same time.

Potential pitfalls

An exploitable scenario

For this example, suppose we have a program that performs some privileged operations like authenticating with a host key or binding to a privileged port, then drops root and executes user-provided code -- either C code obtained by dlopen, or code in some sort of interpreted language that can perform low-level operations. Further, suppose that before dropping root, there has been, at least momentarily, more than one thread, and at least one thread is "exiting" (past the point of no return) at the time of the setuid call.

Now, setuid begins signaling and collecting threads to change their ids.

If you're glibc, you explicitly ignore the exiting thread:

static void
setxid_mark_thread (struct xid_command *cmdp, struct pthread *t)
  int ch;

  /* Wait until this thread is cloned.  */
  if (t->setxid_futex == -1
      && ! atomic_compare_and_exchange_bool_acq (&t->setxid_futex, -2, -1))
      lll_futex_wait (&t->setxid_futex, -2, LLL_PRIVATE);
    while (t->setxid_futex == -2);

  /* Don't let the thread exit before the setxid handler runs.  */
  t->setxid_futex = 0;

      ch = t->cancelhandling;

      /* If the thread is exiting right now, ignore it.  */
      if ((ch & EXITING_BITMASK) != 0)
          /* Release the futex if there is no other setxid in
             progress.  */
          if ((ch & SETXID_BITMASK) == 0)
              t->setxid_futex = 1;
              lll_futex_wake (&t->setxid_futex, 1, LLL_PRIVATE);
  while (atomic_compare_and_exchange_bool_acq (&t->cancelhandling,
                                               ch | SETXID_BITMASK, ch));

Versions of musl up through 1.1.6 don't explicitly ignore exiting threads, but instead simply lack a means of seeing them; they're not included in the current thread count, and not signalable since they've already blocked all signals in preparation for exiting (so that the application can't observe a signal handler running in the thread after it formally exited, and in the case of detached threads, so the thread can unmap its own stack before exiting).

Now we get to the fun (or scary) part: setuid returns success and the untrusted code begins to run, but there's a dying thread still waiting to get scheduled and perform its last few instructions before making the syscall to terminate itself. Since our untrusted thread code is evil, it calls mmap with MAP_FIXED to map over the entire memory space with a NOP slide ending in shellcode. Now, when the exiting thread that still has its privileges gets scheduled, it runs the shellcode as root.


Fixing the problem

There are various solutions to the problem. The one I've opted for in musl is not to trust its own view of the number or identity of live threads, since there could be exiting threads we don't see. Instead, we're switching to using /proc/self/task to identify all threads in the kernel's view of the thread-group (process), and waiting for all of them either to respond to signals, or exit. There are lots of ugly obstacles to making this work, but they've all proven solvable.

The main alternative I see from a userspace side is having a supervisor thread to monitor the lifetimes of all threads, including detached ones with caller-provided stacks, using the CLONE_CHILD_CLEARTID futex to ensure that any threads not caught by signals have fully exited. I opted not to take this approach in musl because it imposes costly architectural constraints on the threads implementation, as well as heavy additional synchronization and syscall overhead at thread creation/exit time, for the sake of supporting one ugly corner case. But other implementors may see it differently.

In any case, the right solution is to fix the kernel and eliminate the need for the costly and error-prone emulation in userspace. Linux should introduce new syscalls which atomically set the user or group ids for all threads in the process (thread-group), and deprecate per-thread ids (except fsuid/fsgid). Depending on how it's done (especially with a need to keep the old syscalls too), this may have significant implementation cost on the kernel side, but given the security risks that keep popping up from having this wrong, it's unreasonable for the kernel not to fix it.

Post or read comments...

Open race-condition bugs in glibc

06 Mar 2014 19:10:09 GMT

As part of the freeze announcement for musl 1.0, we mentioned longstanding open race condition bugs in glibc. Shortly after the announcement went out on Phoronix, I got a request for details on which bugs we were referring to, so I put together a list.

The most critical (in my opinion) open race bugs are the ones I described on this blog back in 2012; they are reported in glibc issue 12683:

They make it virtually impossible to use thread cancellation safely; at the very least you would have to block cancellation around all cancellation points which allocate or free resources.

Since the original post on EWONTFIX was written, attempts to follow up with the Austin Group seem to have clarified (see issue 614, and issue 529 which was referenced in the response to it) that the requirements on side effects in the event of cancellation are as I interpreted them. A related glibc issue (symptom of the same design problem) is:

A few other race-related issues present in glibc (and which musl's implementation of pthreads avoids) are:

And one which musl shares (musl's POSIX aio is very immature and due for an overhaul):

The above are all bugs which I reported; in addition to these, there are at least a few other race-related bugs open against glibc:

In many ways, the above bugs are what inspired the creation of EWONTFIX, and while I haven't gotten a chance to write a detailed analysis of them all, it's nice to have them at least catalogued. I hope this list can serve as a catalyst for interest in getting them fixed.

Post or read comments...

Systemd has 6 service startup notification types, and they're all wrong

27 Feb 2014 04:14:26 GMT

In my last post, Broken by design: systemd, I covered technical aspects of systemd outside its domain of specialization that make it a poor choice for the future of the Linux userspace's init system. Since then, it's come to my attention as a result of a thread on the glibc development list that systemd can't even get things right in its own problem domain: service supervision.

Per the manual, systemd has the following 6 "types" that can be used in a service file to control how systemd will supervise the service (daemon):

The whole idea of systemd's service supervision and activation system is built on being able to start services asynchronously as soon as their dependencies are met (and no sooner). However, none of the above choices actually make it possible to do this with a daemon that was not written specifically to interact with systemd!

In the case of simple, there is no way for systemd to determine when the daemon is actually active and providing the service that subsequent services may depend on. If using "socket activation" (a feature by which systemd allocates the sockets a daemon will listen on and passes them to the daemon to use), this may not matter. However, most daemons not written for systemd are not able to accept preexisting sockets, and even if they can, this might preclude some of their functionality.

In the case of forking, systemd assumes that, after the original process exits, the forked daemon is already initialized and ready to provide its service. Not only is this unlikely to be true; attempting to make it true is likely to lead to buggy daemon code. If you're going to fork in a daemon, doing so needs to be one of the first things your program does; otherwise, if anything you do (e.g. calling third-party library code) creates additional threads, a subsequent fork puts the child in an async-signal context and the child basically cannot do anything but execve or _exit without invoking undefined behavior. So it's almost certainly wrong to write a daemon that forks at the last step after setting itself up successfully. You could instead fork right away but use a synchronization primitive to prevent the parent from exiting before the child signals it to do so; however, I have not seen this done in practice. And no matter what you do, if your daemon forks, you're subject to all the race issues of using pid files.

The remaining nontrivial options are dbus and notify; both of these depend on daemons being written as part of the Freedesktop.org/systemd library framework. There is no documented, stable way for a daemon to use either of these options without linking to D-Bus's library and/or systemd's library (and thereby, for binary packages, pulling in a dependency on these packages even if the user is not using them). Furthermore, there are issues of accessing the notification channel. If the daemon has to sandbox itself (e.g. chroot, namespace/container, dropping root, etc.) before it finishes initializing, it may not even have a means to access to notification channel to inform systemd of its success, or any means to prove its identity even if it could access the channel.

So in short, the only way to make systemd's asynchronous service activation reliable is to add systemd-specific (or D-Bus specific) code into the daemon, and even these may not work reliably for all usage cases.

There are several ways this could have been avoided:

Option 1: A simple notification mechanism

Rather than requiring library code to notify systemd that the daemon is ready, use some existing trivial method. The simplest would be asking daemons to add an option to write (anything; the contents don't matter) to and close a particular file descriptor once they're ready. Then systemd could detect success as a non-empty pipe, and the default case (closing the pipe or exiting without writing anything) would be interpreted as failure.

Option 2: Polling

Despite it being against the "spirit" of systemd, this is perhaps the cleanest and most reliable: have systemd poll whatever service the daemon is supposed to provide. For example, if the service is starting sshd on port 22, systemd could repeatedly try connecting to port 22, with exponential backoff, until it succeeds. This approach requires no modification to existing daemons, and if implemented correctly, would have minimal cost (only at daemon start time) in cpu load and startup latency.

Thankfully, this approach is already possible, albeit in a very convoluted way, without modifying systemd: you can wrap daemons with a wrapper utility that performs the polling and reports back to systemd using the sd_notify API.

The current situation

As it stands, my view is that systemd has failed to solve the problem everybody thinks it's solved: making dependency-based service startup work robustly without the traditional hacks (like sleep 1) all over the place in ugly init scripts. What it has instead done is setup a situation where major daemons are going to come under pressure to link to systemd's library and/or integrate themselves with D-Bus in order to make systemd's promises into a reality. And this of course leads to more entangled cross-dependency and more platform-specific behavior working its way into cross-platform software.

Post or read comments...

Broken by design: systemd

09 Feb 2014 19:56:09 GMT

Recently the topic of systemd has come up quite a bit in various communities in which I'm involved, including the musl IRC channel and on the Busybox mailing list.

While the attitude towards systemd in these communities is largely negative, much of what I've seen has been either dismissable by folks in different circles as mere conservatism, or tempered by an idea that despite its flaws, "the design is sound". This latter view comes with the notion that systemd's flaws are fixable without scrapping it or otherwise incurring major costs, and therefore not a major obstacle to adopting systemd.

My view is that this idea is wrong: systemd is broken by design, and despite offering highly enticing improvements over legacy init systems, it also brings major regressions in terms of many of the areas Linux is expected to excel: security, stability, and not having to reboot to upgrade your system.

The first big problem: PID 1

On unix systems, PID 1 is special. Orphaned processes (including a special case: daemons which orphan themselves) get reparented to PID 1. There are also some special signal semantics with respect to PID 1, and perhaps most importantly, if PID 1 crashes or exits, the whole system goes down (kernel panic).

Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.

Unfortunately, it also gets the other properties, including bringing down the whole system when it crashes. This matters because systemd is complex. A lot more complex than traditional init systems. When I say complex, I don't mean in a lines-of-code sense. I mean in terms of the possible inputs and code paths that may be activated at runtime. While legacy init systems basically deal with no inputs except SIGCHLD from orphaned processes exiting and manual runlevel changes performed by the administrator, systemd deals with all sorts of inputs, including device insertion and removal, changes to mount points and watched points in the filesystem, and even a public DBus-based API. These in turn entail resource allocation, file parsing, message parsing, string handling, and so on. This brings us to:

The second big problem: Attack Surface

On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root. Using systemd then more than doubles the attack surface.

This increased and unreasonable risk is not inherent to systemd's goal of fixing legacy init. However it is inherent to the systemd design philosophy of putting everything into the init process.

The third big problem: Reboot to Upgrade

Windows Update rebooting

Fundamentally, upgrading should never require rebooting unless the component being upgraded is the kernel. Even then, for security updates, it's ideal to have a "hot-patch" that can be applied as a loadable kernel module to mitigate the security issue until rebooting with the new kernel is appropriate.

Unfortunately, by moving large amounts of functionality that's likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting. This leads to "Linux" becoming the laughing stock of Windows fans, as happened with Ubuntu a long time ago.

Possible counter-arguments

With regards to security, one could ask why can't desktop systems use systemd, and leave server systems to find something else. But I think this line of reasoning is flawed in at least three ways:

  1. Many of the selling-point features of systemd are server-oriented. State-of-the-art transaction-style handling of daemon starting and stopping is not a feature that's useful on desktop systems. The intended audience for that sort of thing is clearly servers.

  2. The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications. While the desktop made the unix distinction of local user accounts largely irrelevant, the coming of mobile app ecosystems full of potentially-malicious apps makes "local security" more important than ever.

  3. The crowd pushing systemd, possibly including its author, is not content to have systemd be one choice among many. By providing public APIs intended to be used by other applications, systemd has set itself up to be difficult not to use once it achieves a certain adoption threshold.

With regards to upgrades, systemd's systemctl has a daemon-reexec command to make systemd serialize its state, re-exec itself, and continue uninterrupted. This could perhaps be used to switch to a new version without rebooting. Various programs already use this technique, such as the IRC client irssi which lets you /upgrade without dropping any connections. Unfortunately, this brings us back to the issue of PID 1 being special. For normal applications, if re-execing fails, the worst that happens is the process dies and gets restarted (either manually or by some monitoring process) if necessary. However for PID 1, if re-execing itself fails, the whole system goes down (kernel panic).

For common reasons it might fail, the execve syscall returns failure in the original process image, allowing the program to handle the error. However, failure of execve is not entirely atomic:

In addition, systemd might fail to restore its serialized state due to resource allocation failures, or if the old and new versions have diverged sufficiently that the old state is not usable by the new version.

So if not systemd, what? Debian's discussion of whether to adopt systemd or not basically devolved into a false dichotomy between systemd and upstart. And except among grumpy old luddites, keeping legacy sysvinit is not an attractive option. So despite all its flaws, is systemd still the best option?


None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. Said differently, there's nothing great and revolutionary about systemd. Its popularity is purely the result of an aggressive, dictatorial marketing strategy including elements such as:

So how should init be done right?

The Unix way: with simple self-contained programs that do one thing and do it well.

First, get everything out of PID 1:

The systemd way: Take advantage of special properties of pid 1 to the maximum extent possible. This leads to ever-expanding scope creep and exacerbates all of the problems described above (and probably many more yet to be discovered).

The right way: Do away with everything special about pid 1 by making pid 1 do nothing but start the real init script and then just reap zombies:

#define _XOPEN_SOURCE 700
#include <signal.h>
#include <unistd.h>

int main()
    sigset_t set;
    int status;

    if (getpid() != 1) return 1;

    sigprocmask(SIG_BLOCK, &set, 0);

    if (fork()) for (;;) wait(&status);

    sigprocmask(SIG_UNBLOCK, &set, 0);

    setpgid(0, 0);
    return execve("/etc/rc", (char *[]){ "rc", 0 }, (char *[]){ 0 });

Yes, that's really all that belongs in PID 1. Then there's no way it can fail at runtime, and no need to upgrade it once it's successfully running.

Next, from the init script, run a process supervision system outside of PID 1 to manage daemons as immediate child processes (no backgrounding). As mentioned above are several existing choices here. It's not clear to me that any of them are sufficiently polished or robust to satisfy major distributions at this time. But neither is systemd; its backers are just better at sweeping that under the rug.

What the existing choices do have, though, is better design, mainly in the way of having clean, well-defined scope rather than Katamari Damacy.

If none of them are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what's wrong with systemd.

Whatever system is chosen, the most important criterion is that it be transparent to applications. For 30+ years, the choice of init system used has been completely irrelevant to everybody but system integrators and administrators. User applications have had no reason to know or care whether you use sysvinit with runlevels, upstart, my minimal init with a hard-coded rc script or a more elaborate process-supervision system, or even /bin/sh. Ironically, this sort of modularity and interchangibility is what made systemd possible; if we were starting from the kind of monolithic, API-lock-in-oriented product systemd aims to be, swapping out the init system for something new and innovative would not even be an option.

Update: license on code

Added December 21, 2014.

There has been some interest in having a proper free software license on the trivial init code included above. I originally considered it too trivial to even care about copyright or need a license on it, but I don't want this to keep anyone from using or reusing it, so I'm explicitly licensing it under the following terms (standard MIT license):

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.


Post or read comments...

Incorrect configure checks for availability of functions

14 Aug 2013 00:35:06 GMT

The short version: using functions without prototypes is dangerous, and bad configure script recipes directly encourage this practice.

One of the most basic and important checks configure scripts perform is checking for the availability of library functions (either in the standard library or third-party libraries) that are optional, present only in certain versions, wrongly missing on some systems, and so on. In a sense, this is the main purpose of having a configure script, and one would think this kind of check would be hard to get wrong. Unfortunately, these checks are not only possible to get wrong, they're usually wrong, especially in GNU software or other software using gnulib.

The basic problem is that most configure scripts check only for the existence of a symbol whose name matches that of the interface they want to use. The basic formula for this test is to make a C program which does nothing but call the desired function. There are several general strategies that are used:

  1. Including the headers needed to use the function, then calling the function from main.
  2. Omitting all headers, prototyping the function correctly, then calling it from main.
  3. Omitting all headers, declaring the function with a bogus (prototype or non-prototype) declaration, then calling it from main.
  4. Omitting all headers and calling the function from main with no declaration whatsoever.

All of the above are wrong, with varying degrees of wrongness, but the basic common problems they all share are that:

In addition, the various approaches are likely to give false positives and/or false negatives depending on factors like the version of C the compiler is providing, and user-provided CFLAGS such as -Werror=implicit-function-declaration.

Here is an excerpt of the latest configure check breakage we encountered building GNU coreutils on musl 0.9.12:

#undef strtod_l

/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply.  */
#ifdef __cplusplus
extern "C"
char strtod_l ();
/* The GNU C library defines this for functions which it implements
    to always fail with ENOSYS.  Some functions are actually named
    something starting with __ and the normal name is an alias.  */
#if defined __stub_strtod_l |    defined __stub___strtod_l
choke me

main ()
return strtod_l ();
  return 0;

Note the usage of an incorrect declaration; this could actually fail to link under an advanced linker performing LTO, resulting in a false-negative. And, as mentioned above, it can result in a false-positive if there is no proper declaration of strtod_l for the application to use. This issue affected musl 0.9.12 because we added strtod_l purely as a compatibility symbol for loading binary libraries which use it, but did not intend for it to be part of the API for linking new programs. GNU coreutils happily attempted to use strtod_l with no prototype, resulting in treatment of the (junk) contents of the integer return-value register as the result of the conversion.

One legitimate question to ask would be: Is this musl's fault? In other words, is it a bug to lack a public declaration for a symbol which exists and which is a public interface on some systems? I believe the answer is no, for multiple reasons:

What it comes down to is that the only public interfaces are those which are documented as public, and short of human-readable documentation indicating otherwise, the documentation of public interfaces is the headers associated with a library. Assuming the existence of a public interface based on the existence of a symbol with a particular name is not reasonable, especially when the whole point of having a configure script is to detect rather than assume.

So what would a correct check for strtod_l look like? A good start would be:

#include <stdlib.h>
#include <locale.h>
int main()
    char *p;
    return (strtod_l)("", &p, (locale_t)0);

Enclosing the function name in parentheses serves two purposes: it suppresses any function-like macro, and it forces the compiler to produce an error when there is no declaration for the function.

Unfortunately, the argument check logic above seems difficult to handle in an automated check generator. So a better approach might be something like:

#include <stdlib.h>
#include <locale.h>
int main()
    double (*p)(const char *, char **, locale_t) = strtod_l;

This version has even stronger protection against implicit function declarations, since strtod_l is never called; instead, its address is taken. The pointer assignment should then yield an error if the types are not compatible, giving us a clean formulatic check for types.

Unfortunately, these types of configure bugs are endemic, and for musl, the solution is going to be providing public prototypes (under the appropriate feature tests). The alternative, taking explicit action to prevent linking against the symbols, involves hacks that are even worse and more fragile than what configure is doing.

Post or read comments...


04 Jul 2013 16:46:23 GMT

A little-known part of GCC's build process is a script called "fixincludes", or fixinc.sh. Purportedly, the purpose of this script is to fix "non-ANSI system header files" which GCC "cannot compile". This description seems to correspond roughly to the original intended purpose of fixincludes, but the scope of what it does has since ballooned into all sorts of unrelated changes. Let's look at the first few rules in fixincludes' inclhack.def:

(Source: fixincludes/inclhack.def)

Of the first 8 hacks (using GCC's terminology here) cited above, only one deals with fixing pre-ANSI-C headers. One more is fixing serious C++ breakage that would probably make it impossible to use C++ at all with the system headers, but the rest seem to be fixing, or attempting to fix, unrelated bugs that have nothing to do with making the compiler or compilation environment usable. And at least one has introduced major header breakage that might or might not be worse than what was in the vendor's original header.

In other words, what fixincludes evolved into is the GCC developers forcibly applying their own, often arguably incorrect or buggy, bug fix patches to system headers (and in some cases, non-system headers that happen to be in /usr/include) with no outside oversight or input from the maintainers of the software they're patching.

So how does fixincludes work? Basically, it iterates over each header file it finds under /usr/include (or whatever the configured include directory is), applies a set of heuristics based on the filename, machine tuple, and a regular expression matched against the file contents, and if these rules match, it applies a sequence of sed commands to the file. As of this writing, there are 228 such "hacks" that may be applied. The output is then stored in GCC's private include-fixed directory (roughly /usr/lib/gcc/$MACH/$VER/include-fixed), which GCC searches before the system include directory, thus "replacing" the original header when the new GCC is used.

In case it's not already obvious what a bad idea this whole concept is, here are a few serious flaws:

  1. Fixincludes prevents library upgrades from working. Suppose for example you have libfoo version 1.0 installed at the time GCC is built and installed. The fixincludes script decides to patch foo.h, and puts its patched version in GCC's include-fixed directory. Now suppose you install libfoo version 2.0, which comes with a new foo.h and which is incompatible with the definitions in the old version of foo.h. Due to GCC's include path order, the new version of the header will be silently ignored and GCC will keep using the old header from the version of libfoo that was present when GCC was installed. Moreover, since fixincludes does not take any precautions to avoid applying its changes to files other than the original broken file they were intended to fix, library authors who want to avoid the danger of having their users get stuck with old headers must take on the burden of ensuring that their header files don't match any of the patterns in fixincludes.

  2. Fixincludes can lead to unintended copyright infringement or leakage of private data. Unless you are fully aware of fixincludes, when building GCC, you would not expect an unbounded amount of local header files, some of which may be part of proprietary programs or site-local private headers, to end up in the GCC directory. Now, if you package up the GCC directory (think of people building cross compiler binaries and such), you could inadvertently ship copies of these headers in a public release.

  3. Many of the fixes are actually incorrect or fail to achieve what they're trying to achieve. For example, the VxWorks stdint.h "fix" creates a badly broken stdint.h. Another example, which came up in our development of musl, is the fix for va_list exposure in the prototypes in stdio.h and wchar.h. Per ANSI/ISO C, va_list is not defined in these headers (POSIX, on the other hand, requires it to be defined), so GCC uses bad heuristic regex matches to find such exposure and change it to __gnuc_va_list. Somehow (we never determined the reason), the resulting headers were interfering with the definition of mbstate_t and preventing libstdc++ from compiling successfully. In addition, we found that, while attempting to remedy an extremely minor "namespace pollution" issue in these headers, fixincludes was making a new namespace violation: for its double-inclusion guard macro, it used FIXINC_WRAP_STDIO_H_STDIO_STDARG_H, a name that happens to be in the namespace reserved for the application, not the implementation.

  4. The rules for whether and how to apply the "hacks" are poor heuristics, and no effort is made to avoid false positives. The README for fixincludes even states (line 118) their policy of applying hacks even when they might be a false positive, with no consideration for how incorrectly applying them (after all, they are hackish sed replacements, not anything robust or sophisticated) might break proper, working headers.

How could this situation be fixed? The GCC developers claim fixincludes is still needed (see also here), and while I'm fairly skeptical of this claim, I don't think it's a matter where they'll be convinced otherwise in the near future, so I'd like to look for other more constructive approaches. Here are the steps I think would be needed to fix fixincludes:

  1. Remove all outdated hacks, i.e. hacks for systems which GCC no longer supports. While not strictly necessary, cleaning up the list of hacks in this manner should make the next steps more practical.

  2. Remove all hacks for files that are none-of-GCC's-business. That means anything that doesn't absolutely need to be fixed to successfully compile GCC or provide a working (not necessarily fully conforming, if the underlying system was non-conforming, but "working") build environment after installation.

  3. Eliminate false positives and buggy sed replacements by adding to the hack definitions in inclhack.def a list of hashes for known-bad files the hack is meant to be applied to. If necessary, include a configure option, off by default, that would ignore the hashes.

  4. Where some of the "fixes" made by fixincludes themselves have bugs like namespace violations or macros that do not meet the requirements for being usable in the preprocessor, they should be changed to output something more correct. There is no justification for replacing one broken header with another, potentially worse, broken header.

  5. Add a --disable-fixincludes option to configure so that fixincludes can be completely turned off. This would be ideal for system integrators, packagers, and basically anyone installing GCC from source on a modern system. It's especially important for the case where the user is installing GCC on a system that already has many third-party library headers in /usr/include, some of which may be "broken" in the eyes of fixincludes, where "fixing" them would have the dangerous consequence of preventing future library upgrades from working properly.

Finally, I suppose one might wonder why something that seems so broken, as I've described fixincludes, might go undetected for so long. The explanation is simple: distros. Most users of GCC use binary packages prepared for a particular OS distribution, where the packager has already cleaned up most of the mess, either by building GCC in a sterile environment where it can't find any headers to pick up and hack up, or by pruning the resulting include-fixed directory. Thus, the only people who have to deal with fixincludes are people who build GCC from the source packages, or who are setting up build scripts for their own deployment/distribution.

For the curious, here are some links to the tricks distros do to overcome fixincludes:

It's unclear to me exactly what Debian does, but as their installed include-fixed directory is very minimal, they must also be doing something. I have not inspected the other major binary distributions with complex build and package systems, but casual experience suggests they are taking some measures to contain the breakage of fixincludes.

Post or read comments...

NULL considered harmful

04 Jul 2013 03:25:02 GMT

The C and C++ languages define the macro NULL, widely taught as the correct way to write a literal null pointer. The motivations for using NULL are well-meaning; largely they come down to fact that it documents the intent, much like how a macro named FALSE might better document boolean intent than a literal 0 would do. Unfortunately, use of the NULL macro without fully understanding it can lead to subtle bugs and portability issues, some of which are difficult for compilers and static analysis tools to diagnose.

Despite it being superceded by the 2011 standard, I'm going to quote C99 because it's what I'm most familar with, and I suspect most readers are in the same situation. 7.17 specifies the NULL macro as:


which expands to an implementation-defined null pointer constant; Pointers in turn defines null pointer constant as:

An integer constant expression with the value 0, or such an expression cast to type void *

And finally, for completeness, 6.6 defines integer constant expression as:

Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.

An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.

What's important to realize is that this allows a great deal of freedom with regard to how an implementation defines NULL. Some examples:

#define NULL 0

#define NULL ((void *)0)

#define NULL 0L

typedef void *__voidp;
#define NULL ((__voidp)0)

#define NULL (1ULL-1ULL)

#define NULL ('9'-'0'-9)

#define NULL (sizeof(void *)/(sizeof(char)-2))

In particular, the set of possible types NULL could have includes all integer types and void *. Fortunately, since these are all null pointer constants, in most contexts using NULL will get you The Right Thing no matter which type NULL might have. Pointer comparisons, assignments to pointers, the ?: operator, and so forth all treat null pointer constants in special ways such that the type of the null pointer constant mostly doesn't matter.

The main place the type of NULL does become an issue is with variadic (or non-prototyped) functions. Consider these examples:

printf("%p", NULL); // WRONG!
execl("/bin/true", "true", NULL); // WRONG!

The first is wrong, in general, because the %p specifier for printf requires an argument of type void *. The second example is always wrong because the execl function requires arguments of type char *, and NULL can never have type char *.

One might object that, despite being formally incorrect, the above should work, especially if NULL is defined with the typical C definition of ((void *)0). Assuming also that pointers of different types are treated identically by the calling convention, having type void * instead of char * would not matter when calling execl. (Arguably it may not matter anyway, since C specified that char * and void * have the same representation, but it's not entirely clear how this applies to arguments to variadic functions.)

Things get a lot uglier with C++. In C++, NULL was traditionally defined as plain 0, on account of the fact that C++ has no implicit conversion from void * to other pointer types. (Note that the current C++ standard also allows NULL to be defined as nullptr, and some non-conforming implmentations define it in other ways, such as __null.) So in many historical C++ implementations, if you pass NULL to a variadic function, bad things might happen.

And bad things did happen. We were encountering several hard-to-track-down crash bugs in certain applications (including evince and audacity) built against musl, which at the time defined NULL as 0 for C++ programs, and it turned out the source of the problem was that the C++ code was making calls to variadic C functions using NULL as a sentinel. The original report and further details are available in the mailing list archives for musl.

The crashes that were happening were on 64-bit archs (at the time, for us that meant just x86_64) and were, if I recall, "heisenbugs": inserting additional code to inspect what was going on made some or all of them go away. As it turns out, while x86_64 (and most 64-bit archs) uses entire registers or 64-bit stack slots for all arguments, even ones smaller than 64-bit, there's no guarantee in the ABI that the unused upper bits will be cleared by the caller. So, when the caller passed 0 as an int to a variadic function, the upper 32 bits of the stack slot contained whatever junk happened to be there before, and the callee's va_arg, attempting to read an argument of type void *, pulled the junk in as the upper bits of the pointer. Thus, manifestation of the bug was highly dependent both on the code executed just prior to the variadic call, and the compiler's choice of how to setup the arguments (e.g. push versus mov).

As far as what we would do about the issue, I've been on the fence for a long time about the relative merits of trying to work around incorrect, non-portable applications versus actively breaking them and pressuring their maintainers to fix them. However, the widespread incorrect use of NULL seemed overwhelming, and developers seemed very reluctant to fix the issue. Fortunately, we came up with what I believe is a really good solution on musl's side: replacing this:

#ifdef __cplusplus
#define NULL 0
#define NULL ((void *)0)

with this:

#define NULL 0L

(Note that musl assumes a type model where long and pointers have the same size; this is also assumed by Linux and is fundamental to the way system calls work, so it's not an unreasonable assumption for us.)

By making this change, all programs that assume they can pass NULL as a pointer argument to variadic functions (sentinel role or otherwise) continue to work as if NULL were defined as ((void *)0). However, GCC does not consider 0L a valid "sentinel" value (since it has the wrong type) for functions declared with the sentinel attribute. So what we have achieved is making incorrect programs work, but produce warnings where the code is incorrect.

One initial concern I had with this approach is that the matter of whether NULL has pointer or integer type might be observable to programs in such a way that the change from ((void *)0) to 0L could break some incorrect C programs. However, after some of the smart folks over at Stack Overflow tried to find ways a program might detect whether NULL is defined with integer or pointer type and failed, I deemed the change safe, and our developers and users have not run into any problems since.

It's been some time now (about 6 months) since the issue was raised and addressed in musl, but I just recently reported the sentinel issue in Busybox and submitted a patch to fix it. While getting the patch accepted was painless, I mention this anecdote as a possible explanation for why the NULL sentinel bug is so prevalent: after the patch was accepted, I received several follow-up emails chipping in with misconceptions about NULL. The impression I was left with is that everybody seems to think NULL is a much simpler topic than it really is.

So what would I recommend programmers do with NULL? After all this trouble with variadic function arguments, I've just basically stopped using NULL completely. A simple 0 is shorter and usually clear when initializing or returning a pointer, and for null pointer checks, I usually use the ! operator rather than ==NULL anyway. However, I really don't have any opinion on what others should do; from a correctness standpoint, what matters is simply not making assumptions about the type of NULL. Anywhere the type of the expression matters, you really need to use a cast, which you could apply to NULL, or simply to 0.

As a final remark, I don't think this article would be complete without mentioning a similar article, albeit much shorter and less detailed, written by Jens Gustedt over two years ago.

Post or read comments...

Non-invasive printf debugging

12 Dec 2012 16:09:05 GMT

This post is not about any particular bug or bad programming practice, just a new “printf debugging” technique I came up with.

Often when tracking down a bug, it’s useful to add extra output to track the state of the program in the moments leading up to the crash or incorrect behavior, aka “printf debugging”. However, this technique is “invasive” in the sense that it interleaves unwanted data into the program output. Using stderr instead of stdout can alleviate the problem to some extent, but when you’re inserting the debugging code into a widely-used library (in my case, libc.so) or a shell, even having unwanted output on stderr can be a problem.

A previous approach I had used was sending the output to an arbitrary high file descriptor instead of stdout or stderr. For example, dprintf(666, "message here"); This avoids breaking anything but a program using a huge number of file descriptors, and allows reading the output via adding 666>&2 to the command line.

However, let’s take that a bit further. I don’t really need the program itself to print any output. I just need to be able to see the message. Enter my other favorite debugging tool, strace. Using strace, I can see the above output on a high file descriptor even if the descriptor isn’t actually open; the syscall still happens and returns with EBADF. But there’s no reason it needed to be a valid number that could be a file descriptor in the first place. I can instead just use dprintf(-42, "message here\n"); and there’s no risk of ever producing any output, but the strace log (with the appropriate options to increase string length limits) will show the writes.

The result is a form of “printf debugging” that can safely be added in library code and other code that shouldn’t touch any file descriptors.

Post or read comments...

Stubborn and ignorant use of int where size_t is needed

25 Oct 2012 23:48:02 GMT

What’s wrong with this C function?

char *my_strchr(char *s, int c)
    int i;
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];

Unless its interface contract requires that the caller pass a string no longer than INT_MAX, it can invoke undefined behavior due to integer overflow, most likely resulting in a crash. Even if you change the type to unsigned instead of int to avoid the signed overflow issue, the function will still be buggy; now, it just goes into an infinite loop if the input string is longer than UINT_MAX.

The solution, simple enough, is to use the right type for i: size_t. Strictly speaking, this could still fail in the same way, since the C language places no requirement on implementations that they not support objects too large for size_t; however, given that the standard library functions all use size_t, even on such an implementation one would have to go out of the way to produce such an over-sized object (for example, using calloc or a non-standard allocation function). And any reasonable implementation of C will have size_t sufficiently large to store any object size.

Unfortunately, I regularly encounter a surprising degree of resistence to the idea that this type of code must use size_t to be correct. The most infamous example is the 64-bit vulnerability in qmail, which D. J. Bernstein refused to acknowledge and award bounty, claiming that it’s not a bug, but even to this day I encounter quite a few programmers who claim they should be allowed to use int instead of size_t for loop counters involving array element counts.

The argument for using int basically comes down to a claim that inputs with more than 2 billion elements are (1) pathological, (2) malicious, and (3) should have already been rejected at some earlier stage. In most cases, I mostly agree with the first two points. Unless you’re writing software that’s intended to deal with gigantic volumes of data (in which case, I’ll presume you know what you’re doing), gigantic inputs are probably bad inputs. Okay, fair enough.

The problem in the argument for using int instead of size_t arises when you get to the third point — the idea that large inputs should already have been rejected. This is basically a cheap way of pushing the blame/responsibility onto somebody else, and while valid, if you write your library functions this way, it will make them extremely inefficient to interoperate with code that’s not making the same assumptions.

As an example, suppose the above my_strchr is being called from code that can’t assume its inputs were already length-limited below INT_MAX. The only way to safely pass this input to my_strchr is to call strlen (or strnlen, with n==INT_MAX) on it first to validate that it’s not too long. Even if the typical runtime of my_strchr is low (because you expect to find the character in the first hundred bytes or so), you now have to scan the whole string (or at least up to some large limit) first — not because you’ll actually be using the data, but just to check that the string meets the awkward interface contract of my_strchr. You’ve replaced the good typical-case runtime with a guaranteed worst-case runtime, just for the sake of gluing together interfaces that don’t play nice with each other.

One could argue here that I’ve attacked a straw man — certainly the my_strchr function could be improved to bail out early if i passes some arbitrary limit. This actually does solve the problem, and used consistently, this sort of approach works just as well as using size_t as long as you never intend to support objects larger than INT_MAX. Of course, the same bounded functions using size_t in place of int would work just as well, and they would be more general and flexible, in that they could be reused in applications requiring support for huge data.

Post or read comments...

Earlier posts