In preparing to release musl 1.2.0, I worked with distro maintainers from Adélie Linux and Yoe to find serious application compatibility problems users would hit when upgrading, so that we could have patches ready and reduce user frustration with the upgrade. Here are some of the findings.
By far the most dangerous type of app compatibility issue we found was in Berkeley DB 5.x, which defines its own wrong version of the timespec struct to pass to clock_gettime:
typedef struct {
    time_t tv_sec;    /* seconds */
#ifdef HAVE_MIXED_SIZE_ADDRESSING
    int32_t tv_nsec;
#else
    long tv_nsec;     /* nanoseconds */
#endif
} db_timespec;
The type of tv_sec in their redefinition was right (time_t), using the libc type for time_t. However, the layout of the rest of the struct failed to match. The libc timespec struct is (prettified and removing conditionals):
struct timespec {
    time_t tv_sec;
    long tv_nsec;
    long :32;
};
with the padding and tv_nsec swapped for big-endian archs, so as to line up with the low bits of a 64-bit field. However, the padding matters even on little-endian archs, since otherwise the structure may be only 12 bytes long instead of 16 (on archs that don't impose 8-byte alignment). This includes i386, sh, or1k, m68k, and others. So, when Berkeley DB declared its own version without the padding, it passed a 12-byte object to a function expecting a 16-byte one, resulting in stack smashing. Thankfully, stack-protector caught it; otherwise it might have yielded a branch to a location determined by the current time in nanoseconds mod 1 second.
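To make the size difference concrete, here is a small standalone test (my own illustration, not Berkeley DB's code) comparing the libc struct timespec against a padding-less redefinition. On ABIs where time_t is 32-bit, or where 64-bit types force 8-byte struct alignment, the two sizes happen to agree, which is why the mismatch only surfaces once time_t becomes 64-bit on a 32-bit arch like i386:

#include <stdio.h>
#include <time.h>

/* Hypothetical redefinition without the trailing padding. On a 32-bit arch
   with 64-bit time_t and only 4-byte alignment for 64-bit types (e.g. i386),
   this struct is 12 bytes while the libc struct timespec is 16. */
typedef struct {
    time_t tv_sec;
    long tv_nsec;
} bad_timespec;

int main(void)
{
    printf("libc struct timespec: %zu bytes\n", sizeof(struct timespec));
    printf("bad_timespec:         %zu bytes\n", sizeof(bad_timespec));
    return 0;
}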
Note that even without the size/layout mismatch, defining a custom type to pass to clock_gettime was undefined behavior and should never have been done.
(Reportedly there are later versions of Berkeley DB, which may or may not have already fixed the bug, but after taking over maintenance of it Oracle went and relicensed the code to be license-incompatible with basically every piece of software ever to use it, so in practice everyone uses 5.x.)
A similarly dangerous variant of this kind of issue is bypassing libc declarations to declare time-related functions manually. This is forbidden by C/POSIX except for functions that can be declared without the use of any types defined by the standard library, and in most cases should produce compile-time errors if the declarations mismatch, since making the types visible usually also makes the libc declarations visible. However, some Qt 2.x code copy-and-pasted into Doxygen managed to do this without an error by using the preprocessor to suppress the declaration of gettimeofday, then subsequently declaring the function itself:
#define gettimeofday __hide_gettimeofday
#include "qdatetime.h"
#include "qdatastream.h"
#include <stdio.h>
#include <time.h>
#if defined(_OS_WIN32_)
#if defined(_CC_BOOL_DEF_)
#undef bool
#include <windows.h>
#define bool int
#else
#include <windows.h>
#endif
#elif defined(_OS_MSDOS_)
#include <dos.h>
#elif defined(_OS_OS2_)
#include <os2.h>
#elif defined(_OS_UNIX_) || defined(_OS_MAC_)
#include <sys/time.h>
#include <unistd.h>
#undef gettimeofday
extern "C" int gettimeofday( struct timeval *, struct timezone * );
#endif
Purportedly the reason for this hack was to deal with ambiguity over whether the second argument is struct timezone * or void *, since C++ can't deal with implicit conversions from void *. However, that's irrelevant since a null pointer implicitly converts to either, and they pass a null pointer (because a null pointer is the only valid argument value).
Fortunately, this error manifests in the opposite direction: causing the old time32 function to get called, resulting in a smaller write to the correct-sized timeval type rather than an oversized write to an incorrect timeval type. But it still produced a crash, since the Qt code passed the resulting junk time_t value to localtime, then dereferenced the result without checking for error (and localtime returned null because the bogus time_t value overflowed tm_year when converting to struct tm).
Other issues we found were a lot less spectacular, but may merit a second (or third) post later.
Let's start by looking at a simple C function to be compiled as position-independent code (i.e. -fPIC, for use in a shared library):
void bar(void);
void foo(void)
{
    bar();
}
And now, what GCC compiles it to (listing 2):
foo:
pushl %ebx
subl $24, %esp
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl 32(%esp), %eax
movl %eax, (%esp)
call bar@PLT
addl $24, %esp
popl %ebx
ret
__x86.get_pc_thunk.bx:
movl (%esp), %ebx
ret
Obviously what we'd like to see is:
foo:
jmp bar
And that's what we'd get with non-PIC code generation, or with PIC and applying hidden visibility to bar. That ideal form is not attainable with PIC, because the PC-relative address of the callee (bar) may not be fixed at link time; it may reside in another shared library or the main program.
Readers familiar with the principles of position-independent code and GOTs/PLTs might wonder why we can't achieve this (listing 4):
foo:
jmp bar@PLT
Here, the symbol@PLT notation in the assembly tells the assembler to generate a special type of relocation, which the linker will use to resolve the relative address in the call instruction to a “procedure linkage table” thunk it generates in output. This thunk exists at a fixed location in the shared library (and thus a fixed relative address from the caller, no matter what address the library is loaded at) and is responsible for loading the actual address of the function and jumping to it.
In order to be able to jump to the actual function bar, the PLT thunk needs to be able to access global data: in particular, a pointer in the “global offset table” (GOT) which it will use as its jump destination. The PLT thunk code looks like this (listing 5):
bar@PLT:
jmp *bar@GOT(%ebx)
push $0
jmp <PLT slot -1>
The second and third instructions are related to lazy binding (more on that later), but the first one is the one we're interested in right now. Since 32-bit x86 provides no direct way to do memory loads/stores relative to the current instruction pointer, the SysV ABI provides that, when code generated as PIC calls into a PLT thunk, it must pass a pointer to the GOT as a hidden argument in %ebx.
So why does that preclude the code in listing 4 above?
Well, per the ABI, the only call-clobbered registers on x86 are %eax, %ecx, and %edx. The register used for the hidden GOT argument, %ebx, is call-saved. That means foo is responsible for backing up and restoring %ebx if it modifies it. So we have a horrible cascade of inefficiency:

- foo has to load %ebx as an argument to bar@PLT.
- foo has to save %ebx on the stack and restore it before returning, because it's a call-saved register.
- bar can't be a tail-call because foo has work to do after bar returns: it has to restore %ebx.

Thus the monstrosity in listing 2.
Well, a first thought might be to change the register used for the hidden argument, but that can't be done without breaking the ABI contract between the compiler and linker, so it's out.
What about getting rid of the requirement for the hidden argument to bar@PLT? Well, that would also be an ABI change, but at least not an incompatible one. In any case it's not really practical. There's no simple way to make the PLT thunk load the GOT address without clobbering registers, and the only three registers which are call-clobbered are also used for argument-passing in certain non-default but supported "regparm" calling conventions. The choice of %ebx was almost certainly intentional, to avoid clashing with registers that could potentially be used as arguments.
So what's left?
How about getting rid of the PLT thunk entirely? Instead of aiming to generate the code in listing 4, let's aim for this (listing 6):
foo:
call __x86.get_pc_thunk.cx
jmp *bar@GOT+_GLOBAL_OFFSET_TABLE_(%ecx)
__x86.get_pc_thunk.cx:
movl (%esp), %ecx
ret
Not only does this eliminate nearly all of the overhead/bloat in foo; it also eliminates the one-instruction visit to an extra cache line where the PLT thunk resides. Seems like a big win.
Unfortunately, there's a reason GCC doesn't generate code like this, and it's a really bad one.
The original purpose for having a PLT was for the main program, compiled for a fixed load address (think pre-PIE era) and not using position-independent code, to be able to call shared library functions. The type of PLT that appears in the main program does not need the hidden GOT argument in %ebx, because, being at a fixed address, it's free to use absolute addresses for accessing its own GOT. The main program does need the PLT, though, because it's incorporating “legacy” (non-PIC) object files that don't know the functions they're calling could be loaded at varying locations at run-time. (It's the linker that's responsible, when producing a dynamic-linked program from such object files, for generating the appropriate glue in the form of PLT thunks.)
Position-independent code does not need a PLT. As in my example in listing 6, such code can load the target address from the GOT itself and make the indirect call/jump. Rather, the use of a PLT in position-independent shared library code was instituted to exploit another property of having a PLT: lazy binding.
When lazy binding is used, the dynamic linker gets to skip looking up callee symbol names at load time; the lookup is deferred to the first time the function is called. In theory, this trades runtime overhead, in the form of complexity and major latency at the time of the first call to a function, for a moderate savings in startup-time.
At least, that was the theory a few decades ago when all this machinery was designed.
Nowadays, lazy binding is a huge liability for security, and its performance benefits have also come under question. The biggest problem is that, for lazy binding to work, the GOT must be writable at runtime, and that makes it an attack vector for arbitrary code execution. Modern hardened systems use relro, which makes part or all of the GOT read-only after loading, but GOT slots subject to lazy binding in the PLT are excluded from this protection. To get significant benefit from the relro link feature, lazy binding must also be disabled, with the following link options:
-Wl,-z,relro -Wl,-z,now
So basically, lazy binding is, or should be considered, deprecated.
Incidentally, musl libc does not support lazy binding at all for these and other reasons.
Remember lines 2 and 3 of the sample PLT thunk in listing 5? Well, the way they work is that bar@GOT(%ebx) initially (prior to lazy binding) contains a pointer to line 2, set up by the dynamic linker. The constant 0 pushed in line 2 is the PLT/GOT slot number, and the code jumped to is a thunk that invokes the code to resolve the lazy binding, using the slot number that was pushed onto the stack as its argument.
There's no easy way to achieve the same thing with the code in listing 6; attempting to do so would slow down the caller and require some code duplication at each call site.
So, the reason we don't have efficient x86 PIC function calls is to support an obsolete misfeature.
If we can (optionally) give up lazy binding, that is.
Alexander Monakov has prepared this simple patch for GCC, which lets you disable PIC calls via PLT, and which probably has a chance of making it upstream:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3263656..cd5f246 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5451,7 +5451,8 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
if (!TARGET_MACHO
&& !TARGET_64BIT
&& flag_pic
- && (!decl || !targetm.binds_local_p (decl)))
+ && flag_plt
+ && (decl && !targetm.binds_local_p (decl)))
return false;
/* If we need to align the outgoing stack, then sibcalling would
@@ -25577,15 +25578,23 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
/* Static functions and indirect calls don't need the pic register. */
if (flag_pic
&& (!TARGET_64BIT
+ || !flag_plt
|| (ix86_cmodel == CM_LARGE_PIC
&& DEFAULT_ABI != MS_ABI))
&& GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF
&& ! SYMBOL_REF_LOCAL_P (XEXP (fnaddr, 0)))
{
- use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
- if (ix86_use_pseudo_pic_reg ())
- emit_move_insn (gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM),
- pic_offset_table_rtx);
+ if (flag_plt)
+ {
+ use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
+ if (ix86_use_pseudo_pic_reg ())
+ emit_move_insn (gen_rtx_REG (Pmode,
+ REAL_PIC_OFFSET_TABLE_REGNUM),
+ pic_offset_table_rtx);
+ }
+ else
+ fnaddr = gen_rtx_MEM (QImode,
+ legitimize_pic_address (XEXP (fnaddr, 0), 0));
}
}
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 301430c..aacc668 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -572,6 +572,10 @@ mprefer-avx128
Target Report Mask(PREFER_AVX128) SAVE
Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.
+mplt
+Target Report Var(flag_plt) Init(0)
+Use PLT for PIC calls (-mno-plt: load the address from GOT at call site)
+
;; ISA support
m32
I've been playing around with some similar changes to my local gcc 4.7.3 tree and was able to achieve the following output:
foo:
call __x86.get_pc_thunk.cx
addl $_GLOBAL_OFFSET_TABLE_, %ecx
movl bar@GOT(%ecx), %eax
jmp *%eax
__x86.get_pc_thunk.cx:
movl (%esp), %ecx
ret
It's still a ways off from the ideal two-instruction form, but much better than the original output.
Compared to listing 6, there are two differences. Loading bar@GOT(%ecx) into %eax to make the indirect call is utterly useless and just bad codegen that's hopefully fixed in newer GCC versions. Failure to combine the constants bar@GOT and _GLOBAL_OFFSET_TABLE_ (which actually resolves to _GLOBAL_OFFSET_TABLE_-.) into a single constant is a more fundamental problem, though. It would take a new relocation type to resolve: a relocation that resolves not to the fixed GOT-base-relative offset of the GOT slot for bar, but rather to the fixed instruction-pointer-relative offset of the GOT slot for bar. Having this new relocation type would make all GOT accesses mildly cheaper.
Linux has a legacy of treating threads like processes that share memory. The situation was a lot worse about 15 years ago, but it's still far from perfect. Despite lots of fixes to the way signals, process termination and replacement via execve, etc. are handled to make threads behave like threads, plenty of ugly remnants of the idea that "threads are just processes sharing memory" remain in a few big areas. Today I want to focus on one of them: permissions, namely real, effective, and saved user and group ids.
Per POSIX, the real, effective, and saved user and group ids are a property of the process. This is sometimes a good thing and sometimes a bad thing -- it makes the traditional idiom of temporarily changing effective uid to access a file "as if by another user" impossible or at least impractical in multi-threaded programs -- but as we'll see, the dangers of doing it any other way are so severe that the POSIX way is the only right way. (Besides, Linux offers an extension, fsuid/fsgid, which can be used to reproduce that old idiom, and which have been kept as thread-local by glibc.)
On the other hand, Linux treats these ids as a property local to each thread, not a property of the process (or, in Linux kernel terminology, the "thread group").
Since any remotely reasonable libc on Linux is going to want to provide POSIX semantics for setuid and related functions, userspace emulation of a single process-wide set of ids is necessary. The approach pioneered by glibc, which musl roughly followed, and which I've reported bugs on in the past, is to use signals to asynchronously capture control of all threads in the process, and direct them all to perform the desired id-setting syscall at the same time.
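To make that concrete, here is a greatly simplified sketch of the signal-broadcast idea. It is not glibc's or musl's actual code: the thread bookkeeping, error handling, signal blocking, and async-signal-safety concerns discussed below are all omitted, and the broadcast_setuid name and the use of SIGRTMIN are my own choices for illustration.

#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static atomic_int pending;
static uid_t target_uid;

static void setxid_handler(int sig)
{
    (void)sig;
    syscall(SYS_setuid, target_uid);   /* raw syscall: changes this thread's ids only */
    atomic_fetch_sub(&pending, 1);
}

int broadcast_setuid(uid_t uid, pthread_t *threads, int nthreads)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = setxid_handler;
    sigaction(SIGRTMIN, &sa, 0);       /* assumes SIGRTMIN is free for this use */

    target_uid = uid;
    atomic_store(&pending, nthreads);
    for (int i = 0; i < nthreads; i++)
        pthread_kill(threads[i], SIGRTMIN);

    while (atomic_load(&pending) > 0)  /* crude spin; real implementations use futexes */
        sched_yield();

    return syscall(SYS_setuid, uid);   /* finally, the calling thread itself */
}

This emulation has to contend with a number of nasty problems: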
Atomicity of changes: It needs to be impossible for the application to observe mid-call states where some threads have one set of ids and other threads have a different set of ids.
Atomicity of failure: If some threads succeed in changing their ids while others fail, the process is left in an inconsistent and dangerous state with no way to back out. What happens then? musl versions up through 1.1.5 attempted to avoid the main cause of failure on old kernels (pre-3.1) by raising RLIMIT_NPROC to RLIM_INFINITY during the operation, but this had issues and still failed to cover failures due to other causes (mainly, Linux bugs due to id-change operations requiring kernel memory allocation). The only safe answer is to raise SIGKILL when this happens.
Failure to report failure: Does the application even know if some threads failed to change their ids? This was glibc bug 13347.
Applications that ignore the return value: There's not much that can be done for them, but the complexities of emulating correct multi-threaded id-change operations increases the chance of an application getting a failure on an operation that, conceptually, should not be able to fail (e.g. dropping root).
Async-signal-safety: POSIX requires setuid and setgid to be async-signal-safe. This is a difficult requirement to reconcile efficiently with any mechanism that involves locking. glibc simply ignores the AS-safety requirement, and musl is only addressing it now.
Finality of privilege-dropping: Is there any possibility of (potentially untrusted) code in the application that runs after dropping privileges with setuid re-gaining the dropped privileges? Unfortunately, the answer is yes, and I'll construct a scenario demonstrating the issue below.
For this example, suppose we have a program that performs some privileged operations like authenticating with a host key or binding to a privileged port, then drops root and executes user-provided code -- either C code obtained by dlopen, or code in some sort of interpreted language that can perform low-level operations. Further, suppose that before dropping root, there has been, at least momentarily, more than one thread, and at least one thread is "exiting" (past the point of no return) at the time of the setuid call.
Now, setuid begins signaling and collecting threads to change their ids. If you're glibc, you explicitly ignore the exiting thread:
static void
internal_function
setxid_mark_thread (struct xid_command *cmdp, struct pthread *t)
{
  int ch;
  /* Wait until this thread is cloned.  */
  if (t->setxid_futex == -1
      && ! atomic_compare_and_exchange_bool_acq (&t->setxid_futex, -2, -1))
    do
      lll_futex_wait (&t->setxid_futex, -2, LLL_PRIVATE);
    while (t->setxid_futex == -2);
  /* Don't let the thread exit before the setxid handler runs.  */
  t->setxid_futex = 0;
  do
    {
      ch = t->cancelhandling;
      /* If the thread is exiting right now, ignore it.  */
      if ((ch & EXITING_BITMASK) != 0)
        {
          /* Release the futex if there is no other setxid in
             progress.  */
          if ((ch & SETXID_BITMASK) == 0)
            {
              t->setxid_futex = 1;
              lll_futex_wake (&t->setxid_futex, 1, LLL_PRIVATE);
            }
          return;
        }
    }
  while (atomic_compare_and_exchange_bool_acq (&t->cancelhandling,
                                               ch | SETXID_BITMASK, ch));
}
Versions of musl up through 1.1.6 don't explicitly ignore exiting threads, but instead simply lack a means of seeing them; they're not included in the current thread count, and not signalable since they've already blocked all signals in preparation for exiting (so that the application can't observe a signal handler running in the thread after it formally exited, and in the case of detached threads, so the thread can unmap its own stack before exiting).
Now we get to the fun (or scary) part: setuid returns success and the untrusted code begins to run, but there's a dying thread still waiting to get scheduled and perform its last few instructions before making the syscall to terminate itself. Since our untrusted thread code is evil, it calls mmap with MAP_FIXED to map over the entire memory space with a NOP slide ending in shellcode. Now, when the exiting thread that still has its privileges gets scheduled, it runs the shellcode as root.
Oops.
There are various solutions to the problem. The one I've opted for in musl is not to trust its own view of the number or identity of live threads, since there could be exiting threads we don't see. Instead, we're switching to using /proc/self/task to identify all threads in the kernel's view of the thread-group (process), and waiting for all of them either to respond to signals, or exit. There are lots of ugly obstacles to making this work, but they've all proven solvable.
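A bare-bones sketch of that enumeration (musl's real implementation is considerably more involved, and the helper name here is mine):

#include <dirent.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Enumerate the kernel's view of the thread group via /proc/self/task and
   signal every thread except the caller. */
void signal_all_threads(int sig)
{
    DIR *d = opendir("/proc/self/task");
    if (!d) return;

    pid_t self = syscall(SYS_gettid);
    struct dirent *de;
    while ((de = readdir(d))) {
        if (de->d_name[0] == '.') continue;
        pid_t tid = atoi(de->d_name);
        if (tid != self)
            syscall(SYS_tgkill, getpid(), tid, sig);  /* may fail if the thread just exited */
    }
    closedir(d);
}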
The main alternative I see from a userspace side is having a supervisor thread to monitor the lifetimes of all threads, including detached ones with caller-provided stacks, using the CLONE_CHILD_CLEARTID futex to ensure that any threads not caught by signals have fully exited. I opted not to take this approach in musl because it imposes costly architectural constraints on the threads implementation, as well as heavy additional synchronization and syscall overhead at thread creation/exit time, for the sake of supporting one ugly corner case. But other implementors may see it differently.
In any case, the right solution is to fix the kernel and eliminate the need for the costly and error-prone emulation in userspace. Linux should introduce new syscalls which atomically set the user or group ids for all threads in the process (thread-group), and deprecate per-thread ids (except fsuid/fsgid). Depending on how it's done (especially with a need to keep the old syscalls too), this may have significant implementation cost on the kernel side, but given the security risks that keep popping up from having this wrong, it's unreasonable for the kernel not to fix it.
As part of the freeze announcement for musl 1.0, we mentioned longstanding open race condition bugs in glibc. Shortly after the announcement went out on Phoronix, I got a request for details on which bugs we were referring to, so I put together a list.
The most critical (in my opinion) open race bugs are the ones I described on this blog back in 2012; they are reported in glibc issue 12683:
They make it virtually impossible to use thread cancellation safely; at the very least you would have to block cancellation around all cancellation points which allocate or free resources.
Since the original post on EWONTFIX was written, attempts to follow up with the Austin Group seem to have clarified (see issue 614, and issue 529 which was referenced in the response to it) that the requirements on side effects in the event of cancellation are as I interpreted them. A related glibc issue (symptom of the same design problem) is:
A few other race-related issues present in glibc (and which musl's implementation of pthreads avoids) are:
- pthread_kill
- vfork in posix_spawn
- raise() is not async-signal-safe

And one which musl shares (musl's POSIX aio is very immature and due for an overhaul):
The above are all bugs which I reported; in addition to these, there are at least a few other race-related bugs open against glibc:
- sem_post/sem_wait race causing sem_post to return EINVAL
- pthread_cond_wait() can consume a signal that was sent before it started waiting

In many ways, the above bugs are what inspired the creation of EWONTFIX, and while I haven't gotten a chance to write a detailed analysis of them all, it's nice to have them at least catalogued. I hope this list can serve as a catalyst for interest in getting them fixed.
In my last post, Broken by design: systemd, I covered technical aspects of systemd outside its domain of specialization that make it a poor choice for the future of the Linux userspace's init system. Since then, it's come to my attention as a result of a thread on the glibc development list that systemd can't even get things right in its own problem domain: service supervision.
Per the manual, systemd has the following 6 "types" that can be used in a service file to control how systemd will supervise the service (daemon):
simple - manages the lifetime of the daemon via the pid, and depends on the daemon not forking so that it's a direct child process. This roughly corresponds to the correct supervision practices of other systems like runit, s6, etc. The service is considered activated immediately.

forking - assumes the original process invoked to start the daemon will exit once the daemon is successfully initialized, and not earlier. Requires a pid file and is subject to all the traditional flaws of pid files (but systemd can mitigate them somewhat by being the process that inherits orphans).

oneshot - used for non-daemon "services" that run without forking and exit when finished; systemd waits for them to exit before considering them activated.

dbus - like simple, but systemd does not consider the service activated until it acquires a name on D-Bus.

notify - like simple, but systemd does not consider the service activated until it makes a call to the C function sd_notify, part of the systemd library.

idle - like simple but defers running the service until other jobs have finished; a hack to avoid interleaved spam on the console.
The whole idea of systemd's service supervision and activation system is built on being able to start services asynchronously as soon as their dependencies are met (and no sooner). However, none of the above choices actually make it possible to do this with a daemon that was not written specifically to interact with systemd!
In the case of simple, there is no way for systemd to determine when the daemon is actually active and providing the service that subsequent services may depend on. If using "socket activation" (a feature by which systemd allocates the sockets a daemon will listen on and passes them to the daemon to use), this may not matter. However, most daemons not written for systemd are not able to accept preexisting sockets, and even if they can, this might preclude some of their functionality.
In the case of forking, systemd assumes that, after the original process exits, the forked daemon is already initialized and ready to provide its service. Not only is this unlikely to be true; attempting to make it true is likely to lead to buggy daemon code. If you're going to fork in a daemon, doing so needs to be one of the first things your program does; otherwise, if anything you do (e.g. calling third-party library code) creates additional threads, a subsequent fork puts the child in an async-signal context and the child basically cannot do anything but execve or _exit without invoking undefined behavior. So it's almost certainly wrong to write a daemon that forks at the last step after setting itself up successfully. You could instead fork right away but use a synchronization primitive to prevent the parent from exiting before the child signals it to do so; however, I have not seen this done in practice. And no matter what you do, if your daemon forks, you're subject to all the race issues of using pid files.
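For what it's worth, here is a minimal sketch of that fork-early-and-signal-readiness idea, assuming a pipe as the synchronization primitive; it is an illustration, not code from any real daemon:

#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd)) return 1;

    pid_t pid = fork();
    if (pid < 0) return 1;

    if (pid > 0) {
        /* Parent: wait for the readiness byte, then exit. */
        char c;
        close(pfd[1]);
        return read(pfd[0], &c, 1) == 1 ? 0 : 1;  /* nonzero if the child died first */
    }

    /* Child (the daemon): initialize fully, then report readiness. */
    close(pfd[0]);
    setsid();
    /* ... bind sockets, load configuration, etc. ... */
    if (write(pfd[1], "", 1) != 1) _exit(1);
    close(pfd[1]);

    for (;;) pause();  /* ... main service loop ... */
}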
The remaining nontrivial options are dbus and notify; both of these depend on daemons being written as part of the Freedesktop.org/systemd library framework. There is no documented, stable way for a daemon to use either of these options without linking to D-Bus's library and/or systemd's library (and thereby, for binary packages, pulling in a dependency on these packages even if the user is not using them). Furthermore, there are issues of accessing the notification channel. If the daemon has to sandbox itself (e.g. chroot, namespace/container, dropping root, etc.) before it finishes initializing, it may not even have a means to access the notification channel to inform systemd of its success, or any means to prove its identity even if it could access the channel.
So in short, the only way to make systemd's asynchronous service activation reliable is to add systemd-specific (or D-Bus specific) code into the daemon, and even these may not work reliably for all usage cases.
There are several ways this could have been avoided:
Rather than requiring library code to notify systemd that the daemon is ready, use some existing trivial method. The simplest would be asking daemons to add an option to write (anything; the contents don't matter) to and close a particular file descriptor once they're ready. Then systemd could detect success as a non-empty pipe, and the default case (closing the pipe or exiting without writing anything) would be interpreted as failure.
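A daemon-side sketch of such a protocol might look like the following; the READY_FD environment variable is purely my own invention for illustration, not an existing interface:

#include <stdlib.h>
#include <unistd.h>

void notify_ready(void)
{
    const char *s = getenv("READY_FD");
    if (!s) return;                  /* not running under such a supervisor */
    int fd = atoi(s);
    ssize_t r = write(fd, "1", 1);   /* contents don't matter, per the proposal */
    (void)r;
    close(fd);                       /* closing without writing would mean failure */
}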
Despite it being against the "spirit" of systemd, this is perhaps the cleanest and most reliable: have systemd poll whatever service the daemon is supposed to provide. For example, if the service is starting sshd on port 22, systemd could repeatedly try connecting to port 22, with exponential backoff, until it succeeds. This approach requires no modification to existing daemons, and if implemented correctly, would have minimal cost (only at daemon start time) in cpu load and startup latency.
Thankfully, this approach is already possible, albeit in a very convoluted way, without modifying systemd: you can wrap daemons with a wrapper utility that performs the polling and reports back to systemd using the sd_notify API.
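A sketch of the polling-and-notify portion of such a wrapper might look like this, assuming it is linked against libsystemd for sd_notify(); the port number and backoff parameters are arbitrary choices of mine:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

/* Try to connect to the given TCP port on loopback; nonzero on success. */
static int port_ready(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return 0;
    struct sockaddr_in sa = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };
    int ok = connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0;
    close(fd);
    return ok;
}

int main(void)
{
    /* Exponential backoff: 100ms, 200ms, 400ms, ... capped at 5s. */
    for (unsigned ms = 100; !port_ready(22); ms = ms < 5000 ? ms * 2 : 5000) {
        struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
        nanosleep(&ts, 0);
    }
    sd_notify(0, "READY=1");
    return 0;
}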
As it stands, my view is that systemd has failed to solve the problem everybody thinks it's solved: making dependency-based service startup work robustly without the traditional hacks (like sleep 1) all over the place in ugly init scripts. What it has instead done is set up a situation where major daemons are going to come under pressure to link to systemd's library and/or integrate themselves with D-Bus in order to make systemd's promises into a reality. And this of course leads to more entangled cross-dependency and more platform-specific behavior working its way into cross-platform software.
Recently the topic of systemd has come up quite a bit in various communities in which I'm involved, including the musl IRC channel and on the Busybox mailing list.
While the attitude towards systemd in these communities is largely negative, much of what I've seen has been either dismissable by folks in different circles as mere conservatism, or tempered by an idea that despite its flaws, "the design is sound". This latter view comes with the notion that systemd's flaws are fixable without scrapping it or otherwise incurring major costs, and therefore not a major obstacle to adopting systemd.
My view is that this idea is wrong: systemd is broken by design, and despite offering highly enticing improvements over legacy init systems, it also brings major regressions in terms of many of the areas Linux is expected to excel: security, stability, and not having to reboot to upgrade your system.
On unix systems, PID 1 is special. Orphaned processes (including a special case: daemons which orphan themselves) get reparented to PID 1. There are also some special signal semantics with respect to PID 1, and perhaps most importantly, if PID 1 crashes or exits, the whole system goes down (kernel panic).
Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.
Unfortunately, it also gets the other properties, including bringing down the whole system when it crashes. This matters because systemd is complex. A lot more complex than traditional init systems. When I say complex, I don't mean in a lines-of-code sense. I mean in terms of the possible inputs and code paths that may be activated at runtime. While legacy init systems basically deal with no inputs except SIGCHLD from orphaned processes exiting and manual runlevel changes performed by the administrator, systemd deals with all sorts of inputs, including device insertion and removal, changes to mount points and watched points in the filesystem, and even a public DBus-based API. These in turn entail resource allocation, file parsing, message parsing, string handling, and so on. This brings us to the question of attack surface.
On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root. Using systemd then more than doubles the attack surface.
This increased and unreasonable risk is not inherent to systemd's goal of fixing legacy init. However it is inherent to the systemd design philosophy of putting everything into the init process.
Fundamentally, upgrading should never require rebooting unless the component being upgraded is the kernel. Even then, for security updates, it's ideal to have a "hot-patch" that can be applied as a loadable kernel module to mitigate the security issue until rebooting with the new kernel is appropriate.
Unfortunately, by moving large amounts of functionality that's likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting. This leads to "Linux" becoming the laughing stock of Windows fans, as happened with Ubuntu a long time ago.
With regards to security, one could ask why desktop systems couldn't just use systemd, leaving server systems to find something else. But I think this line of reasoning is flawed in at least three ways:
Many of the selling-point features of systemd are server-oriented. State-of-the-art transaction-style handling of daemon starting and stopping is not a feature that's useful on desktop systems. The intended audience for that sort of thing is clearly servers.
The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications. While the desktop made the unix distinction of local user accounts largely irrelevant, the coming of mobile app ecosystems full of potentially-malicious apps makes "local security" more important than ever.
The crowd pushing systemd, possibly including its author, is not content to have systemd be one choice among many. By providing public APIs intended to be used by other applications, systemd has set itself up to be difficult not to use once it achieves a certain adoption threshold.
With regards to upgrades, systemd's systemctl has a daemon-reexec command to make systemd serialize its state, re-exec itself, and continue uninterrupted. This could perhaps be used to switch to a new version without rebooting. Various programs already use this technique, such as the IRC client irssi, which lets you /upgrade without dropping any connections. Unfortunately, this brings us back to the issue of PID 1 being special. For normal applications, if re-execing fails, the worst that happens is the process dies and gets restarted (either manually or by some monitoring process) if necessary. However for PID 1, if re-execing itself fails, the whole system goes down (kernel panic).
For common reasons it might fail, the execve syscall returns failure in the original process image, allowing the program to handle the error. However, failure of execve is not entirely atomic:
The kernel may fail setting up the VM for the new process image after the original VM has already been destroyed; the main situation under which this would happen is resource exhaustion.
Even after the kernel successfully sets up the new VM and transfers execution to the new process image, it's possible to have failures prior to the transfer of control to the actual application program. This could happen in the dynamic linker (resource exhaustion or other transient failures mapping required libraries or loading configuration files) or libc startup code. Using musl libc with static linking or even dynamic linking with no additional libraries eliminates these failure cases, but systemd is intended to be used with glibc.
In addition, systemd might fail to restore its serialized state due to resource allocation failures, or if the old and new versions have diverged sufficiently that the old state is not usable by the new version.
So if not systemd, what? Debian's discussion of whether to adopt systemd or not basically devolved into a false dichotomy between systemd and upstart. And except among grumpy old luddites, keeping legacy sysvinit is not an attractive option. So despite all its flaws, is systemd still the best option?
No.
None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. Said differently, there's nothing great and revolutionary about systemd. Its popularity is purely the result of an aggressive, dictatorial marketing strategy including elements such as:
Engulfing other "essential" system components like udev and making them difficult or impossible to use without systemd (but see eudev).
Setting up for API lock-in (having the DBus interfaces provided by systemd become a necessary API that user-level programs depend on).
Dictating policy rather than being scoped such that the user, administrator, or systems integrator (distribution) has to provide glue. This eliminates bikesheds and thereby fast-tracks adoption at the expense of flexibility and diversity.
The Unix way: with simple self-contained programs that do one thing and do it well.
First, get everything out of PID 1:
The systemd way: Take advantage of special properties of pid 1 to the maximum extent possible. This leads to ever-expanding scope creep and exacerbates all of the problems described above (and probably many more yet to be discovered).
The right way: Do away with everything special about pid 1 by making pid 1 do nothing but start the real init script and then just reap zombies:
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    sigset_t set;
    int status;

    if (getpid() != 1) return 1;

    sigfillset(&set);
    sigprocmask(SIG_BLOCK, &set, 0);

    if (fork()) for (;;) wait(&status);

    sigprocmask(SIG_UNBLOCK, &set, 0);

    setsid();
    setpgid(0, 0);
    return execve("/etc/rc", (char *[]){ "rc", 0 }, (char *[]){ 0 });
}
Yes, that's really all that belongs in PID 1. Then there's no way it can fail at runtime, and no need to upgrade it once it's successfully running.
Next, from the init script, run a process supervision system outside of PID 1 to manage daemons as immediate child processes (no backgrounding). As mentioned above, there are several existing choices here. It's not clear to me that any of them are sufficiently polished or robust to satisfy major distributions at this time. But neither is systemd; its backers are just better at sweeping that under the rug.
What the existing choices do have, though, is better design, mainly in the way of having clean, well-defined scope rather than Katamari Damacy.
If none of them are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what's wrong with systemd.
Whatever system is chosen, the most important criterion is that it be transparent to applications. For 30+ years, the choice of init system used has been completely irrelevant to everybody but system integrators and administrators. User applications have had no reason to know or care whether you use sysvinit with runlevels, upstart, my minimal init with a hard-coded rc script or a more elaborate process-supervision system, or even /bin/sh. Ironically, this sort of modularity and interchangeability is what made systemd possible; if we were starting from the kind of monolithic, API-lock-in-oriented product systemd aims to be, swapping out the init system for something new and innovative would not even be an option.
Added December 21, 2014.
There has been some interest in having a proper free software license on the trivial init code included above. I originally considered it too trivial to even care about copyright or need a license on it, but I don't want this to keep anyone from using or reusing it, so I'm explicitly licensing it under the following terms (standard MIT license):
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The short version: using functions without prototypes is dangerous, and bad configure script recipes directly encourage this practice.
One of the most basic and important checks configure scripts perform is checking for the availability of library functions (either in the standard library or third-party libraries) that are optional, present only in certain versions, wrongly missing on some systems, and so on. In a sense, this is the main purpose of having a configure script, and one would think this kind of check would be hard to get wrong. Unfortunately, these checks are not only possible to get wrong, they're usually wrong, especially in GNU software or other software using gnulib.
The basic problem is that most configure scripts check only for the existence of a symbol whose name matches that of the interface they want to use. The basic formula for this test is to make a C program which does nothing but call the desired function. There are several general strategies that are used:
All of the above are wrong, with varying degrees of wrongness, but the basic common problems they all share are that:
They do not check whether the function is actually an implementation of the interface the program wants, or just a similarly-named function. This could be a major problem with interfaces like qsort_r which have different signatures on different historical systems (the two incompatible prototypes are shown after this list).
They do not check that the function is actually usable by the program; without a present and correct prototype in the headers, the function is not usable, and attempts to use it will result in subtle bugs.
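For reference, these are the two incompatible historical qsort_r prototypes alluded to above; a check that only looks for the symbol cannot tell them apart:

#include <stddef.h>

/* glibc / GNU form: context pointer last, passed as the comparator's third argument. */
void qsort_r(void *base, size_t nmemb, size_t size,
             int (*compar)(const void *, const void *, void *),
             void *arg);

/* Historical BSD form (incompatible):
   void qsort_r(void *base, size_t nmemb, size_t size, void *thunk,
                int (*compar)(void *, const void *, const void *)); */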
In addition, the various approaches are likely to give false positives and/or false negatives depending on factors like the version of C the compiler is providing, and user-provided CFLAGS such as -Werror=implicit-function-declaration.
Here is an excerpt of the latest configure check breakage we encountered building GNU coreutils on musl 0.9.12:
#undef strtod_l
/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply. */
#ifdef __cplusplus
extern "C"
#endif
char strtod_l ();
/* The GNU C library defines this for functions which it implements
   to always fail with ENOSYS. Some functions are actually named
   something starting with __ and the normal name is an alias. */
#if defined __stub_strtod_l | defined __stub___strtod_l
choke me
#endif
int
main ()
{
return strtod_l ();
;
return 0;
}
Note the usage of an incorrect declaration; this could actually fail to link under an advanced linker performing LTO, resulting in a false-negative. And, as mentioned above, it can result in a false-positive if there is no proper declaration of strtod_l for the application to use. This issue affected musl 0.9.12 because we added strtod_l purely as a compatibility symbol for loading binary libraries which use it, but did not intend for it to be part of the API for linking new programs. GNU coreutils happily attempted to use strtod_l with no prototype, resulting in treatment of the (junk) contents of the integer return-value register as the result of the conversion.
One legitimate question to ask would be: Is this musl's fault? In other words, is it a bug to lack a public declaration for a symbol which exists and which is a public interface on some systems? I believe the answer is no, for multiple reasons:
The function may be public, but simply not exposed in the namespace corresponding to the requested feature test macros, in which case it is not usable without defining different feature test macros. For example, on glibc, splice is not available without defining _GNU_SOURCE, and a configure script would be wrong to detect it as available when it's not.
The function may be private, and only unintentionally exposed. This is the closest to the situation with coreutils and musl 0.9.12, but a better example might be the __secure_getenv function in glibc which some applications were incorrectly attempting to use as if it were a public interface.
The function may be part of a legacy interface that is preserved only for binary compatibility, but not supported for linking new applications. For example, gets in a C11 build environment would still exist as a symbol in the standard library (for supporting old hopelessly-broken applications) but would be explicitly excluded from the interfaces available to an application.
What it comes down to is that the only public interfaces are those which are documented as public, and short of human-readable documentation indicating otherwise, the documentation of public interfaces is the headers associated with a library. Assuming the existence of a public interface based on the existence of a symbol with a particular name is not reasonable, especially when the whole point of having a configure script is to detect rather than assume.
So what would a correct check for strtod_l look like? A good start would be:
#include <stdlib.h>
#include <locale.h>
int main()
{
    char *p;
    return (strtod_l)("", &p, (locale_t)0);
}
Enclosing the function name in parentheses serves two purposes: it suppresses any function-like macro, and it forces the compiler to produce an error when there is no declaration for the function.
Unfortunately, the argument check logic above seems difficult to handle in an automated check generator. So a better approach might be something like:
#include <stdlib.h>
#include <locale.h>
int main()
{
    double (*p)(const char *, char **, locale_t) = strtod_l;
}
This version has even stronger protection against implicit function declarations, since strtod_l is never called; instead, its address is taken. The pointer assignment should then yield an error if the types are not compatible, giving us a clean, formulaic check for types.
Unfortunately, these types of configure bugs are endemic, and for musl, the solution is going to be providing public prototypes (under the appropriate feature tests). The alternative, taking explicit action to prevent linking against the symbols, involves hacks that are even worse and more fragile than what configure is doing.
A little-known part of GCC's build process is a script called "fixincludes", or fixinc.sh. Purportedly, the purpose of this script is to fix "non-ANSI system header files" which GCC "cannot compile". This description seems to correspond roughly to the original intended purpose of fixincludes, but the scope of what it does has since ballooned into all sorts of unrelated changes. Let's look at the first few rules in fixincludes' inclhack.def:
- Changing AIX's _LARGE_FILES redirection of open to open64, etc. to use GCC's __asm__ keyword rather than #define, as the latter breaks C++.
- Exposing the long double math functions in math.h on Mac OS 10.3.9, which inexplicably omitted declarations for them.
- Adding a workaround for a Linux 2.2 and earlier kernel bug with the direction flag to FD_ZERO macros.
- Doing something inexplicable with Solaris's nonstandard header sys/varargs.h.
- Removing incorrect char *-based (rather than void *-based) prototypes for memcpy, etc. on Sun OS 4.x, and replacing them with the correct prototypes.
- Replacing VxWorks' assert.h with the GCC developers' version. No explanation given for the reason.
- Modifying some nonstandard VxWorks regs.h header to include another header whose definitions it depends on.
- Replacing VxWorks's stdint.h, which is claimed to be broken, with one full of incorrect definitions, for example #define UINT8_MAX (~(uint8_t)0) (which, contrary to the requirements of C, is not valid for use in preprocessor conditionals).
- etc.
(Source: fixincludes/inclhack.def)
Of the first 8 hacks (using GCC's terminology here) cited above, only one deals with fixing pre-ANSI-C headers. One more is fixing serious C++ breakage that would probably make it impossible to use C++ at all with the system headers, but the rest seem to be fixing, or attempting to fix, unrelated bugs that have nothing to do with making the compiler or compilation environment usable. And at least one has introduced major header breakage that might or might not be worse than what was in the vendor's original header.
In other words, what fixincludes evolved into is the GCC developers forcibly applying their own, often arguably incorrect or buggy, bug fix patches to system headers (and in some cases, non-system headers that happen to be in /usr/include) with no outside oversight or input from the maintainers of the software they're patching.
So how does fixincludes work? Basically, it iterates over each header file it finds under /usr/include (or whatever the configured include directory is), applies a set of heuristics based on the filename, machine tuple, and a regular expression matched against the file contents, and if these rules match, it applies a sequence of sed commands to the file. As of this writing, there are 228 such "hacks" that may be applied. The output is then stored in GCC's private include-fixed directory (roughly /usr/lib/gcc/$MACH/$VER/include-fixed), which GCC searches before the system include directory, thus "replacing" the original header when the new GCC is used.
In case it's not already obvious what a bad idea this whole concept is, here are a few serious flaws:
Fixincludes prevents library upgrades from working. Suppose for example you have libfoo version 1.0 installed at the time GCC is built and installed. The fixincludes script decides to patch foo.h, and puts its patched version in GCC's include-fixed directory. Now suppose you install libfoo version 2.0, which comes with a new foo.h and which is incompatible with the definitions in the old version of foo.h. Due to GCC's include path order, the new version of the header will be silently ignored and GCC will keep using the old header from the version of libfoo that was present when GCC was installed. Moreover, since fixincludes does not take any precautions to avoid applying its changes to files other than the original broken file they were intended to fix, library authors who want to avoid the danger of having their users get stuck with old headers must take on the burden of ensuring that their header files don't match any of the patterns in fixincludes.
Fixincludes can lead to unintended copyright infringement or leakage of private data. Unless you are fully aware of fixincludes, when building GCC, you would not expect an unbounded amount of local header files, some of which may be part of proprietary programs or site-local private headers, to end up in the GCC directory. Now, if you package up the GCC directory (think of people building cross compiler binaries and such), you could inadvertently ship copies of these headers in a public release.
Many of the fixes are actually incorrect or fail to achieve what they're trying to achieve. For example, the VxWorks stdint.h "fix" creates a badly broken stdint.h. Another example, which came up in our development of musl, is the fix for va_list exposure in the prototypes in stdio.h and wchar.h. Per ANSI/ISO C, va_list is not defined in these headers (POSIX, on the other hand, requires it to be defined), so GCC uses bad heuristic regex matches to find such exposure and change it to __gnuc_va_list. Somehow (we never determined the reason), the resulting headers were interfering with the definition of mbstate_t and preventing libstdc++ from compiling successfully. In addition, we found that, while attempting to remedy an extremely minor "namespace pollution" issue in these headers, fixincludes was making a new namespace violation: for its double-inclusion guard macro, it used FIXINC_WRAP_STDIO_H_STDIO_STDARG_H, a name that happens to be in the namespace reserved for the application, not the implementation.
The rules for whether and how to apply the "hacks" are poor heuristics, and no effort is made to avoid false positives. The README for fixincludes even states (line 118) their policy of applying hacks even when they might be a false positive, with no consideration for how incorrectly applying them (after all, they are hackish sed replacements, not anything robust or sophisticated) might break proper, working headers.
How could this situation be fixed? The GCC developers claim fixincludes is still needed (see also here), and while I'm fairly skeptical of this claim, I don't think it's a matter where they'll be convinced otherwise in the near future, so I'd like to look for other more constructive approaches. Here are the steps I think would be needed to fix fixincludes:
Remove all outdated hacks, i.e. hacks for systems which GCC no longer supports. While not strictly necessary, cleaning up the list of hacks in this manner should make the next steps more practical.
Remove all hacks for files that are none-of-GCC's-business. That means anything that doesn't absolutely need to be fixed to successfully compile GCC or provide a working (not necessarily fully conforming, if the underlying system was non-conforming, but "working") build environment after installation.
Eliminate false positives and buggy sed replacements by adding to the hack definitions in inclhack.def a list of hashes for known-bad files the hack is meant to be applied to. If necessary, include a configure option, off by default, that would ignore the hashes.
Where some of the "fixes" made by fixincludes themselves have bugs like namespace violations or macros that do not meet the requirements for being usable in the preprocessor, they should be changed to output something more correct. There is no justification for replacing one broken header with another, potentially worse, broken header.
Add a --disable-fixincludes option to configure so that fixincludes can be completely turned off. This would be ideal for system integrators, packagers, and basically anyone installing GCC from source on a modern system. It's especially important for the case where the user is installing GCC on a system that already has many third-party library headers in /usr/include, some of which may be "broken" in the eyes of fixincludes, where "fixing" them would have the dangerous consequence of preventing future library upgrades from working properly.
Finally, I suppose one might wonder why something that seems so broken, as I've described fixincludes, might go undetected for so long. The explanation is simple: distros. Most users of GCC use binary packages prepared for a particular OS distribution, where the packager has already cleaned up most of the mess, either by building GCC in a sterile environment where it can't find any headers to pick up and hack up, or by pruning the resulting include-fixed directory. Thus, the only people who have to deal with fixincludes are people who build GCC from the source packages, or who are setting up build scripts for their own deployment/distribution.
For the curious, here are some links to the tricks distros do to overcome fixincludes:
OpenSDE gcc package, whose comments claim there is another problem in fixincludes related to cross compiling, of which I am not aware.
It's unclear to me exactly what Debian does, but as their installed include-fixed directory is very minimal, they must also be doing something. I have not inspected the other major binary distributions with complex build and package systems, but casual experience suggests they are taking some measures to contain the breakage of fixincludes.
The C and C++ languages define the macro NULL
, widely taught as the
correct way to write a literal null pointer. The motivations for using
NULL
are well-meaning; largely they come down to the fact that it
documents the intent, much like how a macro named FALSE
might better
document boolean intent than a literal 0
would do. Unfortunately,
use of the NULL
macro without fully understanding it can lead to
subtle bugs and portability issues, some of which are difficult for
compilers and static analysis tools to diagnose.
Despite it being superseded by the 2011 standard, I'm going to quote
C99 because it's what I'm most familiar with, and I suspect most
readers are in the same situation. 7.17 specifies the NULL
macro as:
NULL
which expands to an implementation-defined null pointer constant;
6.3.2.3 Pointers in turn defines null pointer constant as:
An integer constant expression with the value 0, or such an expression cast to type void *
And finally, for completeness, 6.6 defines integer constant expression as:
Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.
An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.
What's important to realize is that this allows a great deal of
freedom with regard to how an implementation defines NULL
. Some
examples:
#define NULL 0
#define NULL ((void *)0)
#define NULL 0L
typedef void *__voidp;
#define NULL ((__voidp)0)
#define NULL (1ULL-1ULL)
#define NULL ('9'-'0'-9)
#define NULL (sizeof(void *)/(sizeof(char)-2))
In particular, the set of possible types NULL
could have includes
all integer types and void *
. Fortunately, since these are all null
pointer constants, in most contexts using NULL
will get you The
Right Thing no matter which type NULL
might have. Pointer
comparisons, assignments to pointers, the ?:
operator, and so forth
all treat null pointer constants in special ways such that the type of
the null pointer constant mostly doesn't matter.
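As a minimal illustration (the function is mine, not from the standard), all of the following are valid C and behave identically no matter which of the above definitions of NULL an implementation picks:
#include <stddef.h>

void example(void)
{
    char *p = NULL;      /* initializing/assigning a pointer */
    if (p == NULL)       /* comparing against a null pointer constant */
        p = "fallback";
    p = 1 ? NULL : p;    /* ?: with a null pointer constant still yields char * */
    (void)p;
}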
The main place the type of NULL
does become an issue is with
variadic (or non-prototyped) functions. Consider these examples:
printf("%p", NULL); // WRONG!
execl("/bin/true", "true", NULL); // WRONG!
The first is wrong, in general, because the %p
specifier for
printf
requires an argument of type void *
. The second example is
always wrong because the execl
function requires arguments of type
char *
, and NULL
can never have type char *
.
One might object that, despite being formally incorrect, the above
should work, especially if NULL
is defined with the typical C
definition of ((void *)0)
. Assuming also that pointers of different
types are treated identically by the calling convention, having type
void *
instead of char *
would not matter when calling execl
.
(Arguably it may not matter anyway, since C specifies that char *
and void *
have the same representation, but it's not entirely clear
how this applies to arguments to variadic functions.)
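In any case, versions of the two calls above that are correct no matter how NULL is defined just cast the argument to the pointer type the callee will read with va_arg; a minimal example:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* %p requires an argument of type void *; the cast makes that explicit. */
    printf("%p\n", (void *)NULL);

    /* execl reads its arguments, including the terminator, as char *. */
    execl("/bin/true", "true", (char *)0);

    return 1; /* reached only if execl fails */
}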
Things get a lot uglier with C++. In C++, NULL
was traditionally
defined as plain 0
, on account of the fact that C++ has no implicit
conversion from void *
to other pointer types. (Note that the
current C++ standard also allows NULL
to be defined as nullptr
,
and some non-conforming implementations define it in other ways, such
as __null
.) So in many historical C++ implementations, if you pass
NULL
to a variadic function, bad things might happen.
And bad things did happen. We were encountering several
hard-to-track-down crash bugs in certain applications (including
evince and audacity) built against musl,
which at the time defined NULL
as 0
for C++ programs, and it
turned out the source of the problem was that the C++ code was making
calls to variadic C functions using NULL
as a sentinel. The
original report and
further details are available in the mailing list archives for musl.
The crashes that were happening were on 64-bit archs (at the time, for
us that meant just x86_64
) and were, if I recall, "heisenbugs":
inserting additional code to inspect what was going on made some or
all of them go away. As it turns out, while x86_64
(and most 64-bit
archs) uses entire registers or 64-bit stack slots for all arguments,
even ones smaller than 64-bit, there's no guarantee in the ABI that
the unused upper bits will be cleared by the caller. So, when the
caller passed 0
as an int
to a variadic function, the upper 32
bits of the stack slot contained whatever junk happened to be there
before, and the callee's va_arg
, attempting to read an argument of
type void *
, pulled the junk in as the upper bits of the pointer.
Thus, manifestation of the bug was highly dependent both on the code
executed just prior to the variadic call, and the compiler's choice of
how to set up the arguments (e.g. push
versus mov
).
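A minimal sketch of the failure mode, deliberately reproducing the bug just described (whether garbage actually shows up depends on the compiler's argument setup and whatever was previously in the register or stack slot):
#include <stdarg.h>
#include <stdio.h>

/* The callee expects a pointer argument... */
static void show_ptr(int n, ...)
{
    va_list ap;
    va_start(ap, n);
    void *p = va_arg(ap, void *); /* reads a full 64-bit slot on x86_64 */
    va_end(ap);
    printf("%p\n", p);
}

int main(void)
{
    /* ...but the caller passes an int 0, which only guarantees that the low
       32 bits of the slot are zero; the upper bits may be leftover junk. */
    show_ptr(1, 0);
    return 0;
}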
As far as what we would do about the issue, I've been on the fence for
a long time about the relative merits of trying to work around
incorrect, non-portable applications versus actively breaking them and
pressuring their maintainers to fix them. However, the widespread
incorrect use of NULL
seemed overwhelming, and developers seemed
very reluctant to fix the issue. Fortunately, we came up with what I
believe is a really good solution on musl's side: replacing this:
#ifdef __cplusplus
#define NULL 0
#else
#define NULL ((void *)0)
#endif
with this:
#define NULL 0L
(Note that musl assumes a type model where long
and pointers have
the same size; this is also assumed by Linux and is fundamental to the
way system calls work, so it's not an unreasonable assumption for us.)
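That assumption is easy to pin down at build time; a one-line C11 static assertion (shown purely as an illustration, not something taken from musl) captures it:
#include <assert.h> /* static_assert (C11) */

/* Fails to compile on any target where long and pointers differ in size,
   i.e. anywhere passing 0L where a pointer is expected could go wrong. */
static_assert(sizeof(long) == sizeof(void *),
              "NULL as 0L relies on long and pointers having the same size");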
By making this change, all programs that assume they can pass NULL
as a pointer argument to variadic functions (sentinel role or
otherwise) continue to work as if NULL
were defined as ((void
*)0)
. However, GCC does not consider 0L
a valid "sentinel" value
(since it has the wrong type) for functions declared with the
sentinel
attribute.
So what we have achieved is making incorrect programs work, but
produce warnings where the code is incorrect.
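To make the effect concrete, here is a small sketch using a hypothetical execl-style function of my own (not any real API), declared with GCC's sentinel attribute:
#include <stdarg.h>
#include <stdio.h>

/* Prints its arguments up to a null pointer terminator. */
static void print_all(char *first, ...) __attribute__((sentinel));

static void print_all(char *first, ...)
{
    va_list ap;
    va_start(ap, first);
    for (char *s = first; s; s = va_arg(ap, char *))
        puts(s);
    va_end(ap);
}

int main(void)
{
    /* With NULL defined as 0L, this works at runtime on musl, but GCC
       warns: "missing sentinel in function call". */
    print_all("one", "two", NULL);

    /* An explicitly pointer-typed terminator is both correct and silent. */
    print_all("one", "two", (char *)0);
    return 0;
}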
One initial concern I had with this approach is that the matter of
whether NULL
has pointer or integer type might be observable to
programs in such a way that the change from ((void *)0)
to 0L
could break some incorrect C programs. However, after some of the
smart folks over at Stack Overflow tried
to find ways a program might detect whether NULL is defined with
integer or pointer
type
and failed, I deemed the change safe, and our developers and users
have not run into any problems since.
It's been some time now (about 6 months) since the issue was raised
and addressed in musl, but I just recently reported the sentinel
issue
in Busybox and submitted a patch to fix it.
While getting the patch accepted was painless, I mention this anecdote
as a possible explanation for why the NULL
sentinel bug is so
prevalent: after the patch was accepted, I received several follow-up
emails
chipping in with misconceptions about NULL
. The impression I was
left with is that everybody seems to think NULL
is a much simpler
topic than it really is.
So what would I recommend programmers do with NULL
? After all this
trouble with variadic function arguments, I've just basically stopped
using NULL
completely. A simple 0 is shorter and usually clear when
initializing or returning a pointer, and for null pointer checks, I
usually use the !
operator rather than ==NULL
anyway. However, I
really don't have any opinion on what others should do; from a
correctness standpoint, what matters is simply not making
assumptions about the type of NULL
. Anywhere the type of the
expression matters, you really need to use a cast, which you could
apply to NULL
, or simply to 0
.
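Put concretely, that style looks something like this (the function is purely illustrative):
#include <string.h>

/* Returns a pointer to the first vowel in s, or a null pointer. */
static const char *first_vowel(const char *s)
{
    const char *p = 0;   /* plain 0 to initialize a pointer */
    if (s)
        p = strpbrk(s, "aeiou");
    if (!p)              /* ! rather than == NULL for the null check */
        return 0;        /* plain 0 to return a null pointer */
    return p;
}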
As a final remark, I don't think this article would be complete without mentioning a similar article, albeit much shorter and less detailed, written by Jens Gustedt over two years ago.
This post is not about any particular bug or bad programming practice, just a new “printf debugging” technique I came up with.
Often when tracking down a bug, it’s useful to add extra output to
track the state of the program in the moments leading up to the crash
or incorrect behavior, aka “printf debugging”. However, this technique
is “invasive” in the sense that it interleaves unwanted data into the
program output. Using stderr
instead of stdout
can alleviate the
problem to some extent, but when you’re inserting the debugging code
into a widely-used library (in my case, libc.so
) or a shell, even
having unwanted output on stderr
can be a problem.
A previous approach I had used was sending the output to an arbitrary
high file descriptor instead of stdout
or stderr
. For example,
dprintf(666, "message here");
This avoids breaking anything but a
program using a huge number of file descriptors, and allows reading
the output via adding 666>&2
to the command line.
However, let’s take that a bit further. I don’t really need the
program itself to print any output. I just need to be able to see the
message. Enter my other favorite debugging tool, strace
. Using
strace
, I can see the above output on a high file descriptor even if
the descriptor isn’t actually open; the syscall still happens and
returns with EBADF
. But there’s no reason it needed to be a valid
number that could be a file descriptor in the first place. I can
instead just use dprintf(-42, "message here\n");
and there’s no risk
of ever producing any output, but the strace
log (with the
appropriate options to increase string length limits) will show the
writes.
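Wrapped up as a macro (the name DBG is mine, not anything standard), the whole technique looks like this:
#define _POSIX_C_SOURCE 200809L /* for dprintf */
#include <stdio.h>

/* Writes to an invalid (negative) file descriptor: the write fails with
   EBADF and produces no output, but the attempted syscall, formatted
   message included, shows up in the strace log (use strace -s to raise
   the string length limit). */
#define DBG(...) dprintf(-42, __VA_ARGS__)

int main(void)
{
    DBG("about to do the thing, x=%d\n", 42);
    return 0;
}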
The result is a form of “printf debugging” that can safely be added in library code and other code that shouldn’t touch any file descriptors.