In my last post, Broken by design: systemd, I covered technical aspects of systemd outside its domain of specialization that make it a poor choice for the future of the Linux userspace's init system. Since then, it's come to my attention as a result of a thread on the glibc development list that systemd can't even get things right in its own problem domain: service supervision.
Per the manual, systemd has the following 6 "types" that can be used in a service file to control how systemd will supervise the service (daemon):
simple - manages the lifetime of the daemon via the pid, and depends on the daemon not forking so that it's a direct child process. This roughly corresponds to the correct supervision practices of other systems like runit, s6, etc. The service is considered activated immediately.

forking - assumes the original process invoked to start the daemon will exit once the daemon is successfully initialized, and not earlier. Requires a pid file and is subject to all the traditional flaws of pid files (but systemd can mitigate them somewhat by being the process that inherits orphans).

oneshot - used for non-daemon "services" that run without forking and exit when finished; systemd waits for them to exit before considering them activated.

dbus - like simple, but systemd does not consider the service activated until it acquires a name on D-Bus.

notify - like simple, but systemd does not consider the service activated until it makes a call to the C function sd_notify, part of the systemd library.

idle - like simple but defers running the service until other jobs have finished; a hack to avoid interleaved spam on the console.
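For reference, the type is selected with the Type= directive in the [Service] section of a unit file. A minimal example using the notify type might look something like the following (the daemon path here is made up for illustration):

    [Unit]
    Description=Example daemon

    [Service]
    Type=notify
    ExecStart=/usr/sbin/exampled

    [Install]
    WantedBy=multi-user.target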
The whole idea of systemd's service supervision and activation system is built on being able to start services asynchronously as soon as their dependencies are met (and no sooner). However, none of the above choices actually make it possible to do this with a daemon that was not written specifically to interact with systemd!
In the case of simple, there is no way for systemd to determine when
the daemon is actually active and providing the service that
subsequent services may depend on. If using "socket activation" (a
feature by which systemd allocates the sockets a daemon will listen on
and passes them to the daemon to use), this may not matter. However,
most daemons not written for systemd are not able to accept
preexisting sockets, and even if they can, this might preclude some of
their functionality.
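To make concrete what "accepting preexisting sockets" involves, here is a rough sketch of the daemon-side change, based on the documented fd-passing convention (the $LISTEN_PID and $LISTEN_FDS environment variables, with passed descriptors starting at 3); the fallback port is just a placeholder:

    /* Rough sketch: take a listening socket passed in under systemd's
     * fd-passing convention instead of always binding one ourselves. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define LISTEN_FDS_START 3

    int get_listen_socket(void)
    {
        const char *pid = getenv("LISTEN_PID");
        const char *nfds = getenv("LISTEN_FDS");

        /* Only trust the environment if it is addressed to this process. */
        if (pid && nfds && (pid_t)atol(pid) == getpid() && atoi(nfds) >= 1)
            return LISTEN_FDS_START;

        /* Otherwise create and bind our own socket as usual. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        struct sockaddr_in sa = {
            .sin_family = AF_INET,
            .sin_port = htons(2222),          /* placeholder port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(fd, 128) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }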
In the case of forking, systemd assumes that, after the original
process exits, the forked daemon is already initialized and ready to
provide its service. Not only is this unlikely to be true; attempting
to make it true is likely to lead to buggy daemon code. If you're
going to fork in a daemon, doing so needs to be one of the first
things your program does; otherwise, if anything you do (e.g. calling
third-party library code) creates additional threads, a subsequent
fork puts the child in an async-signal context and the child basically cannot do anything but execve or _exit without invoking
undefined behavior. So it's almost certainly wrong to write a daemon
that forks at the last step after setting itself up successfully. You
could instead fork right away but use a synchronization primitive to
prevent the parent from exiting before the child signals it to do so;
however, I have not seen this done in practice. And no matter what you
do, if your daemon forks, you're subject to all the race issues of
using pid files.
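For what it's worth, here is a minimal sketch of that fork-then-synchronize pattern, using a pipe as the synchronization primitive; it is illustrative only and glosses over the rest of daemonization:

    /* Sketch: fork early, but have the parent exit only after the child
     * reports that initialization finished, so a forking-style supervisor
     * observes the exit at the right moment. */
    #include <unistd.h>

    int main(void)
    {
        int pfd[2];
        if (pipe(pfd))
            return 1;

        pid_t child = fork();
        if (child < 0)
            return 1;

        if (child > 0) {
            /* Parent: wait for one byte from the child, then exit. */
            close(pfd[1]);
            char c;
            ssize_t n = read(pfd[0], &c, 1);
            return n == 1 ? 0 : 1;   /* EOF with no byte means init failed */
        }

        /* Child: detach and perform all daemon initialization here. */
        close(pfd[0]);
        setsid();
        /* ... open sockets, load config, drop privileges, etc. ... */

        /* Only now tell the parent it may exit. */
        if (write(pfd[1], "x", 1) != 1)
            return 1;
        close(pfd[1]);

        /* Main service loop would go here. */
        pause();
        return 0;
    }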
The remaining nontrivial options are dbus and notify; both of
these depend on daemons being written as part of the
Freedesktop.org/systemd library framework. There is no documented,
stable way for a daemon to use either of these options without linking
to D-Bus's library and/or systemd's library (and thereby, for binary
packages, pulling in a dependency on these packages even if the user
is not using them). Furthermore, there are issues of accessing the
notification channel. If the daemon has to sandbox itself (e.g.
chroot, namespace/container, dropping root, etc.) before it finishes
initializing, it may not even have a means to access the notification
channel to inform systemd of its success, or any means to prove its
identity even if it could access the channel.
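For comparison, this is roughly what the notify type asks of a daemon author: link against systemd's library and call sd_notify once initialization has actually finished. A minimal sketch, where init_service is a hypothetical stand-in for the daemon's real setup (build with -lsystemd):

    /* Sketch of the daemon-side requirement for Type=notify: report
     * readiness via sd_notify once setup is really complete. */
    #include <unistd.h>
    #include <systemd/sd-daemon.h>

    static int init_service(void)
    {
        /* hypothetical: open sockets, load config, etc. */
        return 0;
    }

    int main(void)
    {
        if (init_service() < 0)
            return 1;

        /* Tell systemd the service may now be depended on. */
        sd_notify(0, "READY=1");

        /* ... main service loop ... */
        for (;;)
            pause();
    }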
So in short, the only way to make systemd's asynchronous service activation reliable is to add systemd-specific (or D-Bus-specific) code into the daemon, and even these mechanisms may not work reliably in all use cases.
There are at least two ways this could have been avoided:
Rather than requiring library code to notify systemd that the daemon is ready, use some existing trivial method. The simplest would be asking daemons to add an option to write (anything; the contents don't matter) to and close a particular file descriptor once they're ready. Then systemd could detect success as a non-empty pipe, and the default case (closing the pipe or exiting without writing anything) would be interpreted as failure.
Despite it being against the "spirit" of systemd, this is perhaps the cleanest and most reliable approach: have systemd poll whatever service the daemon is supposed to provide. For example, if the service is starting sshd on port 22, systemd could repeatedly try connecting to port 22, with exponential backoff, until it succeeds. This approach requires no modification to existing daemons and, if implemented correctly, would have minimal cost (only at daemon start time) in CPU load and startup latency.
Thankfully, this approach is already possible, albeit in a very convoluted way, without modifying systemd: you can wrap daemons with a wrapper utility that performs the polling and reports back to systemd using the sd_notify API.
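A rough sketch of such a wrapper follows, assuming a TCP service, a hardcoded example port, and minimal error handling; a real wrapper would take the address to probe and a timeout as arguments (build with -lsystemd):

    /* Sketch of a readiness wrapper: start the real daemon as a child,
     * poll its TCP port with exponential backoff, then report READY=1
     * to systemd via sd_notify. */
    #include <systemd/sd-daemon.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int port_is_up(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return 0;
        struct sockaddr_in sa = {
            .sin_family = AF_INET,
            .sin_port = htons(port),
            .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
        };
        int up = connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0;
        close(fd);
        return up;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        pid_t child = fork();
        if (child < 0)
            return 1;
        if (child == 0) {
            /* Child: run the real daemon, e.g. "wrapper /usr/sbin/sshd -D". */
            execvp(argv[1], argv + 1);
            _exit(127);
        }

        /* Parent: poll until the service answers, then report readiness. */
        unsigned delay_ms = 50;
        while (!port_is_up(22)) {              /* example port */
            usleep(delay_ms * 1000);
            if (delay_ms < 4000)
                delay_ms *= 2;
        }
        sd_notify(0, "READY=1");

        /* Remain as the supervised process and pass on the exit status. */
        int status = 0;
        waitpid(child, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }

Keeping the wrapper alive as the unit's main process means the readiness notification comes from the process systemd is already tracking, rather than from a helper it knows nothing about.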
As it stands, my view is that systemd has failed to solve the problem
everybody thinks it's solved: making dependency-based service startup
work robustly without the traditional hacks (like sleep 1) all over the place in ugly init scripts. What it has instead done is set up a
situation where major daemons are going to come under pressure to link
to systemd's library and/or integrate themselves with D-Bus in order
to make systemd's promises into a reality. And this of course leads to
more entangled cross-dependency and more platform-specific behavior
working its way into cross-platform software.