EWONTFIX - Broken by design: systemd

Broken by design: systemd

09 Feb 2014 19:56:09 GMT

Recently the topic of systemd has come up quite a bit in various communities in which I'm involved, including the musl IRC channel and on the Busybox mailing list.

While the attitude towards systemd in these communities is largely negative, much of what I've seen has been either dismissable by folks in different circles as mere conservatism, or tempered by an idea that despite its flaws, "the design is sound". This latter view comes with the notion that systemd's flaws are fixable without scrapping it or otherwise incurring major costs, and therefore not a major obstacle to adopting systemd.

My view is that this idea is wrong: systemd is broken by design, and despite offering highly enticing improvements over legacy init systems, it also brings major regressions in terms of many of the areas Linux is expected to excel: security, stability, and not having to reboot to upgrade your system.

The first big problem: PID 1

On unix systems, PID 1 is special. Orphaned processes (including a special case: daemons which orphan themselves) get reparented to PID 1. There are also some special signal semantics with respect to PID 1, and perhaps most importantly, if PID 1 crashes or exits, the whole system goes down (kernel panic).

Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.

Unfortunately, it also gets the other properties, including bringing down the whole system when it crashes. This matters because systemd is complex. A lot more complex than traditional init systems. When I say complex, I don't mean in a lines-of-code sense. I mean in terms of the possible inputs and code paths that may be activated at runtime. While legacy init systems basically deal with no inputs except SIGCHLD from orphaned processes exiting and manual runlevel changes performed by the administrator, systemd deals with all sorts of inputs, including device insertion and removal, changes to mount points and watched points in the filesystem, and even a public DBus-based API. These in turn entail resource allocation, file parsing, message parsing, string handling, and so on. This brings us to:

The second big problem: Attack Surface

On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root. Using systemd then more than doubles the attack surface.

This increased and unreasonable risk is not inherent to systemd's goal of fixing legacy init. However it is inherent to the systemd design philosophy of putting everything into the init process.

The third big problem: Reboot to Upgrade

Windows Update rebooting

Fundamentally, upgrading should never require rebooting unless the component being upgraded is the kernel. Even then, for security updates, it's ideal to have a "hot-patch" that can be applied as a loadable kernel module to mitigate the security issue until rebooting with the new kernel is appropriate.

Unfortunately, by moving large amounts of functionality that's likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting. This leads to "Linux" becoming the laughing stock of Windows fans, as happened with Ubuntu a long time ago.

Possible counter-arguments

With regards to security, one could ask why can't desktop systems use systemd, and leave server systems to find something else. But I think this line of reasoning is flawed in at least three ways:

Many of the selling-point features of systemd are server-oriented. State-of-the-art transaction-style handling of daemon starting and stopping is not a feature that's useful on desktop systems. The intended audience for that sort of thing is clearly servers.
The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications. While the desktop made the unix distinction of local user accounts largely irrelevant, the coming of mobile app ecosystems full of potentially-malicious apps makes "local security" more important than ever.
The crowd pushing systemd, possibly including its author, is not content to have systemd be one choice among many. By providing public APIs intended to be used by other applications, systemd has set itself up to be difficult not to use once it achieves a certain adoption threshold.

With regards to upgrades, systemd's systemctl has a daemon-reexec command to make systemd serialize its state, re-exec itself, and continue uninterrupted. This could perhaps be used to switch to a new version without rebooting. Various programs already use this technique, such as the IRC client irssi which lets you /upgrade without dropping any connections. Unfortunately, this brings us back to the issue of PID 1 being special. For normal applications, if re-execing fails, the worst that happens is the process dies and gets restarted (either manually or by some monitoring process) if necessary. However for PID 1, if re-execing itself fails, the whole system goes down (kernel panic).

For common reasons it might fail, the execve syscall returns failure in the original process image, allowing the program to handle the error. However, failure of execve is not entirely atomic:

The kernel may fail setting up the VM for the new process image after the original VM has already been destroyed; the main situation under which this would happen is resource exhaustion.
Even after the kernel successfully sets up the new VM and transfers execution to the new process image, it's possible to have failures prior to the transfer of control to the actual application program. This could happen in the dynamic linker (resource exhaustion or other transient failures mapping required libraries or loading configuration files) or libc startup code. Using musl libc with static linking or even dynamic linking with no additional libraries eliminates these failure cases, but systemd is intended to be used with glibc.

In addition, systemd might fail to restore its serialized state due to resource allocation failures, or if the old and new versions have diverged sufficiently that the old state is not usable by the new version.

So if not systemd, what? Debian's discussion of whether to adopt systemd or not basically devolved into a false dichotomy between systemd and upstart. And except among grumpy old luddites, keeping legacy sysvinit is not an attractive option. So despite all its flaws, is systemd still the best option?

No.

None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. Said differently, there's nothing great and revolutionary about systemd. Its popularity is purely the result of an aggressive, dictatorial marketing strategy including elements such as:

Engulfing other "essential" system components like udev and making them difficult or impossible to use without systemd (but see eudev).
Setting up for API lock-in (having the DBus interfaces provided by systemd become a necessary API that user-level programs depend on).
Dictating policy rather than being scoped such that the user, administrator, or systems integrator (distribution) has to provide glue. This eliminates bikesheds and thereby fast-tracks adoption at the expense of flexibility and diversity.

So how should init be done right?

The Unix way: with simple self-contained programs that do one thing and do it well.

First, get everything out of PID 1:

The systemd way: Take advantage of special properties of pid 1 to the maximum extent possible. This leads to ever-expanding scope creep and exacerbates all of the problems described above (and probably many more yet to be discovered).

The right way: Do away with everything special about pid 1 by making pid 1 do nothing but start the real init script and then just reap zombies:

#define _XOPEN_SOURCE 700
#include <signal.h>
#include <unistd.h>

int main()
{
    sigset_t set;
    int status;

    if (getpid() != 1) return 1;

    sigfillset(&set);
    sigprocmask(SIG_BLOCK, &set, 0);

    if (fork()) for (;;) wait(&status);

    sigprocmask(SIG_UNBLOCK, &set, 0);

    setsid();
    setpgid(0, 0);
    return execve("/etc/rc", (char *[]){ "rc", 0 }, (char *[]){ 0 });
}

Yes, that's really all that belongs in PID 1. Then there's no way it can fail at runtime, and no need to upgrade it once it's successfully running.

Next, from the init script, run a process supervision system outside of PID 1 to manage daemons as immediate child processes (no backgrounding). As mentioned above are several existing choices here. It's not clear to me that any of them are sufficiently polished or robust to satisfy major distributions at this time. But neither is systemd; its backers are just better at sweeping that under the rug.

What the existing choices do have, though, is better design, mainly in the way of having clean, well-defined scope rather than Katamari Damacy.

If none of them are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what's wrong with systemd.

Whatever system is chosen, the most important criterion is that it be transparent to applications. For 30+ years, the choice of init system used has been completely irrelevant to everybody but system integrators and administrators. User applications have had no reason to know or care whether you use sysvinit with runlevels, upstart, my minimal init with a hard-coded rc script or a more elaborate process-supervision system, or even /bin/sh. Ironically, this sort of modularity and interchangibility is what made systemd possible; if we were starting from the kind of monolithic, API-lock-in-oriented product systemd aims to be, swapping out the init system for something new and innovative would not even be an option.

Update: license on code

Added December 21, 2014.

There has been some interest in having a proper free software license on the trivial init code included above. I originally considered it too trivial to even care about copyright or need a license on it, but I don't want this to keep anyone from using or reusing it, so I'm explicitly licensing it under the following terms (standard MIT license):

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.