EWONTFIX

Open race-condition bugs in glibc

06 Mar 2014 19:10:09 GMT

As part of the freeze announcement for musl 1.0, we mentioned longstanding open race condition bugs in glibc. Shortly after the announcement went out on Phoronix, I got a request for details on which bugs we were referring to, so I put together a list.

The most critical (in my opinion) open race bugs are the ones I described on this blog back in 2012; they are reported in glibc issue 12683:

They make it virtually impossible to use thread cancellation safely; at the very least you would have to block cancellation around all cancellation points which allocate or free resources.
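
As a rough illustration of that workaround (the helper name is mine, and error handling is omitted), blocking cancellation around a resource-allocating cancellation point looks something like this:

#include <pthread.h>
#include <fcntl.h>

/* Disable cancellation across open() so a cancellation request cannot
   both consume the new file descriptor and skip the code that records
   it; any pending cancel is acted on at a later cancellation point,
   after the caller has had a chance to register cleanup for fd. */
int open_uncancelable(const char *path, int flags)
{
    int old, fd;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    fd = open(path, flags);
    pthread_setcancelstate(old, 0);
    return fd;
}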

Since the original post on EWONTFIX was written, attempts to follow up with the Austin Group seem to have clarified (see issue 614, and issue 529 which was referenced in the response to it) that the requirements on side effects in the event of cancellation are as I interpreted them. A related glibc issue (symptom of the same design problem) is:

A few other race-related issues present in glibc (and which musl's implementation of pthreads avoids) are:

And one which musl shares (musl's POSIX aio is very immature and due for an overhaul):

The above are all bugs which I reported; in addition to these, there are at least a few other race-related bugs open against glibc:

In many ways, the above bugs are what inspired the creation of EWONTFIX, and while I haven't gotten a chance to write a detailed analysis of them all, it's nice to have them at least catalogued. I hope this list can serve as a catalyst for interest in getting them fixed.

Systemd has 6 service startup notification types, and they're all wrong

27 Feb 2014 04:14:26 GMT

In my last post, Broken by design: systemd, I covered technical aspects of systemd outside its domain of specialization that make it a poor choice for the future of the Linux userspace's init system. Since then, it's come to my attention as a result of a thread on the glibc development list that systemd can't even get things right in its own problem domain: service supervision.

Per the manual, systemd has the following 6 "types" that can be used in a service file to control how systemd will supervise the service (daemon):

The whole idea of systemd's service supervision and activation system is built on being able to start services asynchronously as soon as their dependencies are met (and no sooner). However, none of the above choices actually make it possible to do this with a daemon that was not written specifically to interact with systemd!

In the case of simple, there is no way for systemd to determine when the daemon is actually active and providing the service that subsequent services may depend on. If using "socket activation" (a feature by which systemd allocates the sockets a daemon will listen on and passes them to the daemon to use), this may not matter. However, most daemons not written for systemd are not able to accept preexisting sockets, and even if they can, this might preclude some of their functionality.

In the case of forking, systemd assumes that, after the original process exits, the forked daemon is already initialized and ready to provide its service. Not only is this unlikely to be true; attempting to make it true is likely to lead to buggy daemon code. If you're going to fork in a daemon, doing so needs to be one of the first things your program does; otherwise, if anything you do (e.g. calling third-party library code) creates additional threads, a subsequent fork leaves the child in a context where only async-signal-safe operations are permitted, and the child basically cannot do anything but execve or _exit without invoking undefined behavior. So it's almost certainly wrong to write a daemon that forks at the last step after setting itself up successfully. You could instead fork right away but use a synchronization primitive to prevent the parent from exiting before the child signals it to do so (sketched below); however, I have not seen this done in practice. And no matter what you do, if your daemon forks, you're subject to all the race issues of using pid files.
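
For what it's worth, here is a minimal sketch (details mine, error handling mostly omitted) of that fork-right-away approach, using a pipe as the synchronization primitive:

#include <unistd.h>

int main(void)
{
    int pfd[2];
    char c;
    pid_t pid;

    if (pipe(pfd)) return 1;
    pid = fork();
    if (pid < 0) return 1;

    if (pid > 0) {
        /* Parent: exit successfully only after the fully initialized
           child writes a byte; a pipe that closes empty means failure. */
        close(pfd[1]);
        return read(pfd[0], &c, 1) == 1 ? 0 : 1;
    }

    /* Child: we forked before creating any threads, so it is still
       safe to do arbitrary initialization work here. */
    close(pfd[0]);
    setsid();
    /* ... daemon initialization ... */
    write(pfd[1], "", 1);   /* tell the parent we are ready */
    close(pfd[1]);
    /* ... main daemon loop ... */
    return 0;
}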

The remaining nontrivial options are dbus and notify; both of these depend on daemons being written as part of the Freedesktop.org/systemd library framework. There is no documented, stable way for a daemon to use either of these options without linking to D-Bus's library and/or systemd's library (and thereby, for binary packages, pulling in a dependency on these packages even if the user is not using them). Furthermore, there are issues of accessing the notification channel. If the daemon has to sandbox itself (e.g. chroot, namespace/container, dropping root, etc.) before it finishes initializing, it may not even have a way to access the notification channel to inform systemd of its success, or any means to prove its identity even if it could access the channel.

So in short, the only way to make systemd's asynchronous service activation reliable is to add systemd-specific (or D-Bus specific) code into the daemon, and even these may not work reliably for all usage cases.

There are several ways this could have been avoided:

Option 1: A simple notification mechanism

Rather than requiring library code to notify systemd that the daemon is ready, use some existing trivial method. The simplest would be asking daemons to add an option to write (anything; the contents don't matter) to and close a particular file descriptor once they're ready. Then systemd could detect success as a non-empty pipe, and the default case (closing the pipe or exiting without writing anything) would be interpreted as failure.
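
The daemon-side support for such a mechanism would be nearly trivial. In the sketch below, the fd number arrives in a hypothetical READY_FD environment variable (whether it comes via an environment variable or a command-line option is immaterial; the name and convention are made up for illustration):

#include <stdlib.h>
#include <unistd.h>

/* Call once initialization is complete: write anything to the
   designated descriptor and close it, so the supervisor sees a
   non-empty pipe and treats the service as started. */
static void notify_ready(void)
{
    const char *s = getenv("READY_FD");
    if (!s) return;
    int fd = atoi(s);
    write(fd, "\n", 1);
    close(fd);
}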

Option 2: Polling

Despite it being against the "spirit" of systemd, this is perhaps the cleanest and most reliable: have systemd poll whatever service the daemon is supposed to provide. For example, if the service is starting sshd on port 22, systemd could repeatedly try connecting to port 22, with exponential backoff, until it succeeds. This approach requires no modification to existing daemons, and if implemented correctly, would have minimal cost (only at daemon start time) in cpu load and startup latency.

Thankfully, this approach is already possible, albeit in a very convoluted way, without modifying systemd: you can wrap daemons with a wrapper utility that performs the polling and reports back to systemd using the sd_notify API.
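
A sketch of such a wrapper's core loop might look like the following; it assumes libsystemd is available for sd_notify, and the port number, backoff limits, and loopback address are illustrative only:

#include <systemd/sd-daemon.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Poll TCP port 22 on loopback with exponential backoff; once a
   connect succeeds, report readiness to systemd and stop. */
int wait_and_notify(void)
{
    struct sockaddr_in sa = {
        .sin_family = AF_INET,
        .sin_port = htons(22),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };
    unsigned delay_ms;
    for (delay_ms = 10; delay_ms < 60000; delay_ms *= 2) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0 && connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0) {
            close(fd);
            return sd_notify(0, "READY=1");
        }
        if (fd >= 0) close(fd);
        usleep(delay_ms * 1000);
    }
    return -1;   /* service never came up */
}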

The current situation

As it stands, my view is that systemd has failed to solve the problem everybody thinks it's solved: making dependency-based service startup work robustly without the traditional hacks (like sleep 1) all over the place in ugly init scripts. What it has instead done is set up a situation where major daemons are going to come under pressure to link to systemd's library and/or integrate themselves with D-Bus in order to make systemd's promises into a reality. And this of course leads to more entangled cross-dependencies and more platform-specific behavior working its way into cross-platform software.

Broken by design: systemd

09 Feb 2014 19:56:09 GMT

Recently the topic of systemd has come up quite a bit in various communities in which I'm involved, including the musl IRC channel and on the Busybox mailing list.

While the attitude towards systemd in these communities is largely negative, much of what I've seen has been either dismissible by folks in different circles as mere conservatism, or tempered by an idea that despite its flaws, "the design is sound". This latter view comes with the notion that systemd's flaws are fixable without scrapping it or otherwise incurring major costs, and therefore not a major obstacle to adopting systemd.

My view is that this idea is wrong: systemd is broken by design, and despite offering highly enticing improvements over legacy init systems, it also brings major regressions in terms of many of the areas Linux is expected to excel: security, stability, and not having to reboot to upgrade your system.

The first big problem: PID 1

On unix systems, PID 1 is special. Orphaned processes (including a special case: daemons which orphan themselves) get reparented to PID 1. There are also some special signal semantics with respect to PID 1, and perhaps most importantly, if PID 1 crashes or exits, the whole system goes down (kernel panic).

Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.

Unfortunately, it also gets the other properties, including bringing down the whole system when it crashes. This matters because systemd is complex. A lot more complex than traditional init systems. When I say complex, I don't mean in a lines-of-code sense. I mean in terms of the possible inputs and code paths that may be activated at runtime. While legacy init systems basically deal with no inputs except SIGCHLD from orphaned processes exiting and manual runlevel changes performed by the administrator, systemd deals with all sorts of inputs, including device insertion and removal, changes to mount points and watched points in the filesystem, and even a public DBus-based API. These in turn entail resource allocation, file parsing, message parsing, string handling, and so on. This brings us to:

The second big problem: Attack Surface

On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root. Using systemd then more than doubles the attack surface.

This increased and unreasonable risk is not inherent to systemd's goal of fixing legacy init. However it is inherent to the systemd design philosophy of putting everything into the init process.

The third big problem: Reboot to Upgrade

[Image: Windows Update rebooting]

Fundamentally, upgrading should never require rebooting unless the component being upgraded is the kernel. Even then, for security updates, it's ideal to have a "hot-patch" that can be applied as a loadable kernel module to mitigate the security issue until rebooting with the new kernel is appropriate.

Unfortunately, by moving large amounts of functionality that's likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting. This leads to "Linux" becoming the laughing stock of Windows fans, as happened with Ubuntu a long time ago.

Possible counter-arguments

With regards to security, one could ask why desktop systems can't use systemd, leaving server systems to find something else. But I think this line of reasoning is flawed in at least three ways:

  1. Many of the selling-point features of systemd are server-oriented. State-of-the-art transaction-style handling of daemon starting and stopping is not a feature that's useful on desktop systems. The intended audience for that sort of thing is clearly servers.

  2. The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications. While the desktop made the unix distinction of local user accounts largely irrelevant, the coming of mobile app ecosystems full of potentially-malicious apps makes "local security" more important than ever.

  3. The crowd pushing systemd, possibly including its author, is not content to have systemd be one choice among many. By providing public APIs intended to be used by other applications, systemd has set itself up to be difficult not to use once it achieves a certain adoption threshold.

With regards to upgrades, systemd's systemctl has a daemon-reexec command to make systemd serialize its state, re-exec itself, and continue uninterrupted. This could perhaps be used to switch to a new version without rebooting. Various programs already use this technique, such as the IRC client irssi which lets you /upgrade without dropping any connections. Unfortunately, this brings us back to the issue of PID 1 being special. For normal applications, if re-execing fails, the worst that happens is the process dies and gets restarted (either manually or by some monitoring process) if necessary. However for PID 1, if re-execing itself fails, the whole system goes down (kernel panic).

For the common reasons it might fail, the execve syscall returns an error in the original process image, allowing the program to handle the failure. However, failure of execve is not entirely atomic:

In addition, systemd might fail to restore its serialized state due to resource allocation failures, or if the old and new versions have diverged sufficiently that the old state is not usable by the new version.

So if not systemd, what? Debian's discussion of whether to adopt systemd or not basically devolved into a false dichotomy between systemd and upstart. And except among grumpy old luddites, keeping legacy sysvinit is not an attractive option. So despite all its flaws, is systemd still the best option?

No.

None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. Said differently, there's nothing great and revolutionary about systemd. Its popularity is purely the result of an aggressive, dictatorial marketing strategy including elements such as:

So how should init be done right?

The Unix way: with simple self-contained programs that do one thing and do it well.

First, get everything out of PID 1:

The systemd way: Take advantage of special properties of pid 1 to the maximum extent possible. This leads to ever-expanding scope creep and exacerbates all of the problems described above (and probably many more yet to be discovered).

The right way: Do away with everything special about pid 1 by making pid 1 do nothing but start the real init script and then just reap zombies:

#define _XOPEN_SOURCE 700
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h> /* for wait() */

int main()
{
    sigset_t set;
    int status;

    if (getpid() != 1) return 1;

    /* Block all signals; as PID 1 we should never be taken down by one. */
    sigfillset(&set);
    sigprocmask(SIG_BLOCK, &set, 0);

    /* Parent (PID 1) does nothing but reap orphaned zombies forever. */
    if (fork()) for (;;) wait(&status);

    /* Child: restore signals, become session leader, run the real init script. */
    sigprocmask(SIG_UNBLOCK, &set, 0);

    setsid();
    setpgid(0, 0);
    return execve("/etc/rc", (char *[]){ "rc", 0 }, (char *[]){ 0 });
}

Yes, that's really all that belongs in PID 1. Then there's no way it can fail at runtime, and no need to upgrade it once it's successfully running.

Next, from the init script, run a process supervision system outside of PID 1 to manage daemons as immediate child processes (no backgrounding). As mentioned above, there are several existing choices here. It's not clear to me that any of them are sufficiently polished or robust to satisfy major distributions at this time. But neither is systemd; its backers are just better at sweeping that under the rug.

What the existing choices do have, though, is better design, mainly in the way of having clean, well-defined scope rather than Katamari Damacy.

If none of them are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what's wrong with systemd.

Whatever system is chosen, the most important criterion is that it be transparent to applications. For 30+ years, the choice of init system used has been completely irrelevant to everybody but system integrators and administrators. User applications have had no reason to know or care whether you use sysvinit with runlevels, upstart, my minimal init with a hard-coded rc script or a more elaborate process-supervision system, or even /bin/sh. Ironically, this sort of modularity and interchangeability is what made systemd possible; if we were starting from the kind of monolithic, API-lock-in-oriented product systemd aims to be, swapping out the init system for something new and innovative would not even be an option.

Incorrect configure checks for availability of functions

14 Aug 2013 00:35:06 GMT

The short version: using functions without prototypes is dangerous, and bad configure script recipes directly encourage this practice.

One of the most basic and important checks configure scripts perform is checking for the availability of library functions (either in the standard library or third-party libraries) that are optional, present only in certain versions, wrongly missing on some systems, and so on. In a sense, this is the main purpose of having a configure script, and one would think this kind of check would be hard to get wrong. Unfortunately, these checks are not only possible to get wrong, they're usually wrong, especially in GNU software or other software using gnulib.

The basic problem is that most configure scripts check only for the existence of a symbol whose name matches that of the interface they want to use. The basic formula for this test is to make a C program which does nothing but call the desired function. There are several general strategies that are used:

  1. Including the headers needed to use the function, then calling the function from main.
  2. Omitting all headers, prototyping the function correctly, then calling it from main.
  3. Omitting all headers, declaring the function with a bogus (prototype or non-prototype) declaration, then calling it from main.
  4. Omitting all headers and calling the function from main with no declaration whatsoever.

All of the above are wrong, with varying degrees of wrongness, but the basic common problems they all share are that:

In addition, the various approaches are likely to give false positives and/or false negatives depending on factors like the version of C the compiler is providing, and user-provided CFLAGS such as -Werror=implicit-function-declaration.

Here is an excerpt of the latest configure check breakage we encountered building GNU coreutils on musl 0.9.12:

#undef strtod_l

/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply.  */
#ifdef __cplusplus
extern "C"
#endif
char strtod_l ();
/* The GNU C library defines this for functions which it implements
    to always fail with ENOSYS.  Some functions are actually named
    something starting with __ and the normal name is an alias.  */
#if defined __stub_strtod_l || defined __stub___strtod_l
choke me
#endif

int
main ()
{
return strtod_l ();
  ;
  return 0;
}

Note the usage of an incorrect declaration; this could actually fail to link under an advanced linker performing LTO, resulting in a false-negative. And, as mentioned above, it can result in a false-positive if there is no proper declaration of strtod_l for the application to use. This issue affected musl 0.9.12 because we added strtod_l purely as a compatibility symbol for loading binary libraries which use it, but did not intend for it to be part of the API for linking new programs. GNU coreutils happily attempted to use strtod_l with no prototype, resulting in treatment of the (junk) contents of the integer return-value register as the result of the conversion.

One legitimate question to ask would be: Is this musl's fault? In other words, is it a bug to lack a public declaration for a symbol which exists and which is a public interface on some systems? I believe the answer is no, for multiple reasons:

What it comes down to is that the only public interfaces are those which are documented as public, and short of human-readable documentation indicating otherwise, the documentation of public interfaces is the headers associated with a library. Assuming the existence of a public interface based on the existence of a symbol with a particular name is not reasonable, especially when the whole point of having a configure script is to detect rather than assume.

So what would a correct check for strtod_l look like? A good start would be:

#include <stdlib.h>
#include <locale.h>
int main()
{
    char *p;
    return (strtod_l)("", &p, (locale_t)0);
}

Enclosing the function name in parentheses serves two purposes: it suppresses any function-like macro, and it forces the compiler to produce an error when there is no declaration for the function.

Unfortunately, the argument check logic above seems difficult to handle in an automated check generator. So a better approach might be something like:

#include <stdlib.h>
#include <locale.h>
int main()
{
    double (*p)(const char *, char **, locale_t) = strtod_l;
}

This version has even stronger protection against implicit function declarations, since strtod_l is never called; instead, its address is taken. The pointer assignment should then yield an error if the types are not compatible, giving us a clean formulaic check for types.

Unfortunately, these types of configure bugs are endemic, and for musl, the solution is going to be providing public prototypes (under the appropriate feature tests). The alternative, taking explicit action to prevent linking against the symbols, involves hacks that are even worse and more fragile than what configure is doing.

Breakincludes

04 Jul 2013 16:46:23 GMT

A little-known part of GCC's build process is a script called "fixincludes", or fixinc.sh. Purportedly, the purpose of this script is to fix "non-ANSI system header files" which GCC "cannot compile". This description seems to correspond roughly to the original intended purpose of fixincludes, but the scope of what it does has since ballooned into all sorts of unrelated changes. Let's look at the first few rules in fixincludes' inclhack.def:

(Source: fixincludes/inclhack.def)

Of the first 8 hacks (using GCC's terminology here) cited above, only one deals with fixing pre-ANSI-C headers. One more is fixing serious C++ breakage that would probably make it impossible to use C++ at all with the system headers, but the rest seem to be fixing, or attempting to fix, unrelated bugs that have nothing to do with making the compiler or compilation environment usable. And at least one has introduced major header breakage that might or might not be worse than what was in the vendor's original header.

In other words, what fixincludes evolved into is the GCC developers forcibly applying their own, often arguably incorrect or buggy, bug fix patches to system headers (and in some cases, non-system headers that happen to be in /usr/include) with no outside oversight or input from the maintainers of the software they're patching.

So how does fixincludes work? Basically, it iterates over each header file it finds under /usr/include (or whatever the configured include directory is), applies a set of heuristics based on the filename, machine tuple, and a regular expression matched against the file contents, and if these rules match, it applies a sequence of sed commands to the file. As of this writing, there are 228 such "hacks" that may be applied. The output is then stored in GCC's private include-fixed directory (roughly /usr/lib/gcc/$MACH/$VER/include-fixed), which GCC searches before the system include directory, thus "replacing" the original header when the new GCC is used.

In case it's not already obvious what a bad idea this whole concept is, here are a few serious flaws:

  1. Fixincludes prevents library upgrades from working. Suppose for example you have libfoo version 1.0 installed at the time GCC is built and installed. The fixincludes script decides to patch foo.h, and puts its patched version in GCC's include-fixed directory. Now suppose you install libfoo version 2.0, which comes with a new foo.h and which is incompatible with the definitions in the old version of foo.h. Due to GCC's include path order, the new version of the header will be silently ignored and GCC will keep using the old header from the version of libfoo that was present when GCC was installed. Moreover, since fixincludes does not take any precautions to avoid applying its changes to files other than the original broken file they were intended to fix, library authors who want to avoid the danger of having their users get stuck with old headers must take on the burden of ensuring that their header files don't match any of the patterns in fixincludes.

  2. Fixincludes can lead to unintended copyright infringement or leakage of private data. Unless you are fully aware of fixincludes, when building GCC, you would not expect an unbounded amount of local header files, some of which may be part of proprietary programs or site-local private headers, to end up in the GCC directory. Now, if you package up the GCC directory (think of people building cross compiler binaries and such), you could inadvertently ship copies of these headers in a public release.

  3. Many of the fixes are actually incorrect or fail to achieve what they're trying to achieve. For example, the VxWorks stdint.h "fix" creates a badly broken stdint.h. Another example, which came up in our development of musl, is the fix for va_list exposure in the prototypes in stdio.h and wchar.h. Per ANSI/ISO C, va_list is not defined in these headers (POSIX, on the other hand, requires it to be defined), so GCC uses bad heuristic regex matches to find such exposure and change it to __gnuc_va_list. Somehow (we never determined the reason), the resulting headers were interfering with the definition of mbstate_t and preventing libstdc++ from compiling successfully. In addition, we found that, while attempting to remedy an extremely minor "namespace pollution" issue in these headers, fixincludes was making a new namespace violation: for its double-inclusion guard macro, it used FIXINC_WRAP_STDIO_H_STDIO_STDARG_H, a name that happens to be in the namespace reserved for the application, not the implementation.

  4. The rules for whether and how to apply the "hacks" are poor heuristics, and no effort is made to avoid false positives. The README for fixincludes even states (line 118) their policy of applying hacks even when they might be a false positive, with no consideration for how incorrectly applying them (after all, they are hackish sed replacements, not anything robust or sophisticated) might break proper, working headers.

How could this situation be fixed? The GCC developers claim fixincludes is still needed (see also here), and while I'm fairly skeptical of this claim, I don't think it's a matter where they'll be convinced otherwise in the near future, so I'd like to look for other more constructive approaches. Here are the steps I think would be needed to fix fixincludes:

  1. Remove all outdated hacks, i.e. hacks for systems which GCC no longer supports. While not strictly necessary, cleaning up the list of hacks in this manner should make the next steps more practical.

  2. Remove all hacks for files that are none-of-GCC's-business. That means anything that doesn't absolutely need to be fixed to successfully compile GCC or provide a working (not necessarily fully conforming, if the underlying system was non-conforming, but "working") build environment after installation.

  3. Eliminate false positives and buggy sed replacements by adding to the hack definitions in inclhack.def a list of hashes for known-bad files the hack is meant to be applied to. If necessary, include a configure option, off by default, that would ignore the hashes.

  4. Where some of the "fixes" made by fixincludes themselves have bugs like namespace violations or macros that do not meet the requirements for being usable in the preprocessor, they should be changed to output something more correct. There is no justification for replacing one broken header with another, potentially worse, broken header.

  5. Add a --disable-fixincludes option to configure so that fixincludes can be completely turned off. This would be ideal for system integrators, packagers, and basically anyone installing GCC from source on a modern system. It's especially important for the case where the user is installing GCC on a system that already has many third-party library headers in /usr/include, some of which may be "broken" in the eyes of fixincludes, where "fixing" them would have the dangerous consequence of preventing future library upgrades from working properly.

Finally, I suppose one might wonder why something that seems so broken, as I've described fixincludes, might go undetected for so long. The explanation is simple: distros. Most users of GCC use binary packages prepared for a particular OS distribution, where the packager has already cleaned up most of the mess, either by building GCC in a sterile environment where it can't find any headers to pick up and hack up, or by pruning the resulting include-fixed directory. Thus, the only people who have to deal with fixincludes are people who build GCC from the source packages, or who are setting up build scripts for their own deployment/distribution.

For the curious, here are some links to the tricks distros do to overcome fixincludes:

It's unclear to me exactly what Debian does, but as their installed include-fixed directory is very minimal, they must also be doing something. I have not inspected the other major binary distributions with complex build and package systems, but casual experience suggests they are taking some measures to contain the breakage of fixincludes.

NULL considered harmful

04 Jul 2013 03:25:02 GMT

The C and C++ languages define the macro NULL, widely taught as the correct way to write a literal null pointer. The motivations for using NULL are well-meaning; largely they come down to the fact that it documents the intent, much like how a macro named FALSE might better document boolean intent than a literal 0 would do. Unfortunately, use of the NULL macro without fully understanding it can lead to subtle bugs and portability issues, some of which are difficult for compilers and static analysis tools to diagnose.

Despite it being superseded by the 2011 standard, I'm going to quote C99 because it's what I'm most familiar with, and I suspect most readers are in the same situation. 7.17 specifies the NULL macro as:

NULL

which expands to an implementation-defined null pointer constant;

6.3.2.3 Pointers in turn defines null pointer constant as:

An integer constant expression with the value 0, or such an expression cast to type void *

And finally, for completeness, 6.6 defines integer constant expression as:

Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.

An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.

What's important to realize is that this allows a great deal of freedom with regard to how an implementation defines NULL. Some examples:

#define NULL 0

#define NULL ((void *)0)

#define NULL 0L

typedef void *__voidp;
#define NULL ((__voidp)0)

#define NULL (1ULL-1ULL)

#define NULL ('9'-'0'-9)

#define NULL (sizeof(void *)/(sizeof(char)-2))

In particular, the set of possible types NULL could have includes all integer types and void *. Fortunately, since these are all null pointer constants, in most contexts using NULL will get you The Right Thing no matter which type NULL might have. Pointer comparisons, assignments to pointers, the ?: operator, and so forth all treat null pointer constants in special ways such that the type of the null pointer constant mostly doesn't matter.

The main place the type of NULL does become an issue is with variadic (or non-prototyped) functions. Consider these examples:

printf("%p", NULL); // WRONG!
execl("/bin/true", "true", NULL); // WRONG!

The first is wrong, in general, because the %p specifier for printf requires an argument of type void *. The second example is always wrong because the execl function requires arguments of type char *, and NULL can never have type char *.
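
For comparison, the same calls written so that the argument has the right pointer type no matter how NULL is defined (casting NULL itself, e.g. (char *)NULL, works equally well):

printf("%p", (void *)0);                 /* %p requires a void * argument */
execl("/bin/true", "true", (char *)0);   /* execl's terminator must be a char * */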

One might object that, despite being formally incorrect, the above should work, especially if NULL is defined with the typical C definition of ((void *)0). Assuming also that pointers of different types are treated identically by the calling convention, having type void * instead of char * would not matter when calling execl. (Arguably it may not matter anyway, since C specified that char * and void * have the same representation, but it's not entirely clear how this applies to arguments to variadic functions.)

Things get a lot uglier with C++. In C++, NULL was traditionally defined as plain 0, on account of the fact that C++ has no implicit conversion from void * to other pointer types. (Note that the current C++ standard also allows NULL to be defined as nullptr, and some non-conforming implementations define it in other ways, such as __null.) So in many historical C++ implementations, if you pass NULL to a variadic function, bad things might happen.

And bad things did happen. We were encountering several hard-to-track-down crash bugs in certain applications (including evince and audacity) built against musl, which at the time defined NULL as 0 for C++ programs, and it turned out the source of the problem was that the C++ code was making calls to variadic C functions using NULL as a sentinel. The original report and further details are available in the mailing list archives for musl.

The crashes that were happening were on 64-bit archs (at the time, for us that meant just x86_64) and were, if I recall, "heisenbugs": inserting additional code to inspect what was going on made some or all of them go away. As it turns out, while x86_64 (and most 64-bit archs) uses entire registers or 64-bit stack slots for all arguments, even ones smaller than 64-bit, there's no guarantee in the ABI that the unused upper bits will be cleared by the caller. So, when the caller passed 0 as an int to a variadic function, the upper 32 bits of the stack slot contained whatever junk happened to be there before, and the callee's va_arg, attempting to read an argument of type void *, pulled the junk in as the upper bits of the pointer. Thus, manifestation of the bug was highly dependent both on the code executed just prior to the variadic call, and the compiler's choice of how to setup the arguments (e.g. push versus mov).

As far as what we would do about the issue, I've been on the fence for a long time about the relative merits of trying to work around incorrect, non-portable applications versus actively breaking them and pressuring their maintainers to fix them. However, the widespread incorrect use of NULL seemed overwhelming, and developers seemed very reluctant to fix the issue. Fortunately, we came up with what I believe is a really good solution on musl's side: replacing this:

#ifdef __cplusplus
#define NULL 0
#else
#define NULL ((void *)0)
#endif

with this:

#define NULL 0L

(Note that musl assumes a type model where long and pointers have the same size; this is also assumed by Linux and is fundamental to the way system calls work, so it's not an unreasonable assumption for us.)

By making this change, all programs that assume they can pass NULL as a pointer argument to variadic functions (sentinel role or otherwise) continue to work as if NULL were defined as ((void *)0). However, GCC does not consider 0L a valid "sentinel" value (since it has the wrong type) for functions declared with the sentinel attribute. So what we have achieved is making incorrect programs work, but produce warnings where the code is incorrect.
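
As a rough illustration (the function and its attribute usage here are mine, not from musl or glibc), this is the kind of call GCC's sentinel check now flags:

#include <stddef.h>

void list_end(const char *first, ...) __attribute__((sentinel));

void f(void)
{
    /* With NULL defined as 0L this still behaves as if (void *)0 were
       passed on LP64 targets, but GCC warns:
       "missing sentinel in function call". */
    list_end("a", "b", NULL);
}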

One initial concern I had with this approach is that the matter of whether NULL has pointer or integer type might be observable to programs in such a way that the change from ((void *)0) to 0L could break some incorrect C programs. However, after some of the smart folks over at Stack Overflow tried to find ways a program might detect whether NULL is defined with integer or pointer type and failed, I deemed the change safe, and our developers and users have not run into any problems since.

It's been some time now (about 6 months) since the issue was raised and addressed in musl, but I just recently reported the sentinel issue in Busybox and submitted a patch to fix it. While getting the patch accepted was painless, I mention this anecdote as a possible explanation for why the NULL sentinel bug is so prevalent: after the patch was accepted, I received several follow-up emails chipping in with misconceptions about NULL. The impression I was left with is that everybody seems to think NULL is a much simpler topic than it really is.

So what would I recommend programmers do with NULL? After all this trouble with variadic function arguments, I've just basically stopped using NULL completely. A simple 0 is shorter and usually clear when initializing or returning a pointer, and for null pointer checks, I usually use the ! operator rather than ==NULL anyway. However, I really don't have any opinion on what others should do; from a correctness standpoint, what matters is simply not making assumptions about the type of NULL. Anywhere the type of the expression matters, you really need to use a cast, which you could apply to NULL, or simply to 0.

As a final remark, I don't think this article would be complete without mentioning a similar article, albeit much shorter and less detailed, written by Jens Gustedt over two years ago.

Non-invasive printf debugging

12 Dec 2012 16:09:05 GMT

This post is not about any particular bug or bad programming practice, just a new “printf debugging” technique I came up with.

Often when tracking down a bug, it’s useful to add extra output to track the state of the program in the moments leading up to the crash or incorrect behavior, aka “printf debugging”. However, this technique is “invasive” in the sense that it interleaves unwanted data into the program output. Using stderr instead of stdout can alleviate the problem to some extent, but when you’re inserting the debugging code into a widely-used library (in my case, libc.so) or a shell, even having unwanted output on stderr can be a problem.

A previous approach I had used was sending the output to an arbitrary high file descriptor instead of stdout or stderr, for example dprintf(666, "message here"). This avoids breaking anything but a program using a huge number of file descriptors, and allows reading the output by adding 666>&2 to the command line.

However, let’s take that a bit further. I don’t really need the program itself to print any output. I just need to be able to see the message. Enter my other favorite debugging tool, strace. Using strace, I can see the above output on a high file descriptor even if the descriptor isn’t actually open; the syscall still happens and returns with EBADF. But there’s no reason it needed to be a valid number that could be a file descriptor in the first place. I can instead just use dprintf(-42, "message here\n"); and there’s no risk of ever producing any output, but the strace log (with the appropriate options to increase string length limits) will show the writes.
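
Wrapped up as a macro (the name and fd value are arbitrary), the technique looks like this:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Debug output that can never reach a real file descriptor; it is only
   visible as failed write() syscalls in an strace log. */
#define TRACE(...) dprintf(-42, __VA_ARGS__)

int main(void)
{
    TRACE("state before the interesting call: x=%d\n", 42);
    return 0;
}

Running the program under something like strace -f -e trace=write -s 4096 then shows the messages, while the program's own output remains untouched.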

The result is a form of “printf debugging” that can safely be added in library code and other code that shouldn’t touch any file descriptors.

Stubborn and ignorant use of int where size_t is needed

25 Oct 2012 23:48:02 GMT

What’s wrong with this C function?

char *my_strchr(char *s, int c)
{
    int i;
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}

Unless its interface contract requires that the caller pass a string no longer than INT_MAX, it can invoke undefined behavior due to integer overflow, most likely resulting in a crash. Even if you change the type to unsigned instead of int to avoid the signed overflow issue, the function will still be buggy; now, it just goes into an infinite loop if the input string is longer than UINT_MAX.

The solution, simple enough, is to use the right type for i: size_t. Strictly speaking, this could still fail in the same way, since the C language places no requirement on implementations that they not support objects too large for size_t; however, given that the standard library functions all use size_t, even on such an implementation one would have to go out of the way to produce such an over-sized object (for example, using calloc or a non-standard allocation function). And any reasonable implementation of C will have size_t sufficiently large to store any object size.
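
The fixed function is identical except for the type of i:

#include <stddef.h>

char *my_strchr(char *s, int c)
{
    size_t i;   /* size_t can index any object the implementation supports */
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}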

Unfortunately, I regularly encounter a surprising degree of resistance to the idea that this type of code must use size_t to be correct. The most infamous example is the 64-bit vulnerability in qmail, which D. J. Bernstein refused to acknowledge or award a bounty for, claiming that it's not a bug. Even to this day I encounter quite a few programmers who claim they should be allowed to use int instead of size_t for loop counters involving array element counts.

The argument for using int basically comes down to a claim that inputs with more than 2 billion elements are (1) pathological, (2) malicious, and (3) should have already been rejected at some earlier stage. In most cases, I mostly agree with the first two points. Unless you’re writing software that’s intended to deal with gigantic volumes of data (in which case, I’ll presume you know what you’re doing), gigantic inputs are probably bad inputs. Okay, fair enough.

The problem in the argument for using int instead of size_t arises when you get to the third point — the idea that large inputs should already have been rejected. This is basically a cheap way of pushing the blame/responsibility onto somebody else, and while valid, if you write your library functions this way, it will make them extremely inefficient to interoperate with code that’s not making the same assumptions.

As an example, suppose the above my_strchr is being called from code that can’t assume its inputs were already length-limited below INT_MAX. The only way to safely pass this input to my_strchr is to call strlen (or strnlen, with n==INT_MAX) on it first to validate that it’s not too long. Even if the typical runtime of my_strchr is low (because you expect to find the character in the first hundred bytes or so), you now have to scan the whole string (or at least up to some large limit) first — not because you’ll actually be using the data, but just to check that the string meets the awkward interface contract of my_strchr. You’ve replaced the good typical-case runtime with a guaranteed worst-case runtime, just for the sake of gluing together interfaces that don’t play nice with each other.

One could argue here that I’ve attacked a straw man — certainly the my_strchr function could be improved to bail out early if i passes some arbitrary limit. This actually does solve the problem, and used consistently, this sort of approach works just as well as using size_t as long as you never intend to support objects larger than INT_MAX. Of course, the same bounded functions using size_t in place of int would work just as well, and they would be more general and flexible, in that they could be reused in applications requiring support for huge data.
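
For illustration, here is one such bounded variant (the name and interface are mine); using size_t for both the counter and the limit costs nothing and keeps the function reusable for huge data:

#include <stddef.h>

char *my_strnchr(char *s, int c, size_t limit)
{
    size_t i;
    for (i=0; i<limit; i++) {
        if (s[i]==c) return &s[i];
        if (!s[i]) return 0;
    }
    return 0;   /* gave up: not found within the first limit bytes */
}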

Unexpected observability of lock states

25 Oct 2012 03:20:38 GMT

This post is going to be the first that’s about one of my own bugs, in musl. For a long time, I’ve had certain stdio functions such as feof and ferror forgoing any locking, and simply relying on the fact that, per the memory model that’s assumed, reading the associated flags is safe without any locks. The issue with doing this is that, while it’s safe, it’s not correct; it leads to observably incorrect behavior in some cases.

Per POSIX,

All functions that reference ( FILE *) objects shall behave as if they use flockfile() and funlockfile() internally to obtain ownership of these ( FILE *) objects.

(This requirement is tucked away in the specification of flockfile, so it’s not immediately apparent if you start out just reading the General Information in XSH Chapter 2.)

What this means is that, even if locking is not required to make a particular operation safe, there’s a requirement on the observable behavior of the program that it be as if the locking happened. Obviously, the existence of flockfile and funlockfile makes it so that an observably-wrong behavior can happen if feof does not wait for the lock; for example, if thread A never unlocks a file F except when F is at end-of-file, then an observably-wrong behavior occurs if thread B ever observes feof returning zero for F.
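
Here is a minimal sketch of that scenario (the details are mine, not from the original analysis): the main thread holds the stream lock until it has driven the stream to end-of-file, so a conforming feof in the second thread can only ever observe the end-of-file flag already set.

#include <stdio.h>
#include <pthread.h>

static FILE *f;

static void *checker(void *arg)
{
    /* Must behave as if it locks the stream, so it blocks until the
       main thread calls funlockfile(); returning 0 here would be the
       observably-wrong behavior a lock-free feof could produce. */
    return (void *)(long)feof(f);
}

int main(void)
{
    pthread_t t;
    void *eof_seen;

    f = fopen("/etc/hosts", "r");
    if (!f) return 1;

    flockfile(f);                       /* hold the lock from the start */
    pthread_create(&t, 0, checker, 0);

    while (getc_unlocked(f) != EOF) ;   /* read until end-of-file */
    funlockfile(f);                     /* unlock only once EOF is set */

    pthread_join(t, &eof_seen);
    fclose(f);
    return eof_seen ? 0 : 1;
}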

This is the boring case. What’s interesting is when we ask, if flockfile and funlockfile were omitted (for example, in a statically linked program that does not use them), would it be valid to skip locking in functions like feof? (This could easily be achieved with weak symbols.) It turns out the answer is still no, and the reason is rather surprising: it’s actually possible to deduce that a thread is in the middle of a stdio operation. This is contrary to the general principle that it’s usually impossible to deduce, without race conditions, whether a thread is executing a function or just about to execute or just finished executing it.

As one example, consider a program with two threads, A and B, where thread A has filled up the writing end of a pipe and observed (via select or poll or EWOULDBLOCK) that further writes would block. Now, thread B calls a stdio function to read from a stream attached to the reading end of the pipe. If thread A then observes that the pipe has become writable again, and if there are no other readers, then it’s possible to deduce that the operations thread B is performing on the stream have started (in which case the lock must have been obtained). If thread A has not written a sufficient amount of data to cause thread B’s operation on the stream to finish (for example, if the function is fread and thread A wrote fewer than the requested number of bytes), it’s also possible to deduce that the operation could not have completed. Thus, the lock must be held by thread B, and any operation performed on the stream by thread A must deadlock. In particular, it would be wrong for feof to return zero, since by the time thread B’s operation completes, the stream may in fact have reached end-of-file status.

Another interesting issue that arises is the need for even fclose to perform locking. I originally reasoned that locking should not be needed in fclose, since any use of the stream pointer after fclose is called results in undefined behavior. However, if another thread provably holds the lock (either by having called flockfile, or by an argument such as the above example), the standard assigns well-defined behavior to the fclose call; fclose must wait to obtain the lock before closing the stream and freeing the associated resources. As long as the thread that held the lock does not make any further attempt to use the stream after the lock is released, the program has well-defined behavior, and implementations must support this usage.

As a consequence of this analysis, I’m fixing all the affected interfaces in musl to use locking.

vfork considered dangerous

21 Oct 2012 21:20:22 GMT

Traditional unix systems had a vfork function, which works like fork, but without creating a new virtual address space; the parent and child run in the same address space. Unlike with pthread_create, where the new thread runs on its own stack, vfork behaves like fork and “returns twice”, once in the child and once in the parent. This seems impossible, since the parent and child would clobber one another’s stacks, but a clever trick saves the day: the parent process is suspended until the child performs exec or _exit, breaking the shared-memory-space relation between the two processes.

vfork was omitted from POSIX and modern standards because it’s difficult to use; the original specification for the function left it undefined to do basically anything except exec or _exit after vfork in the child. However, many systems (including Linux) still provide a similar or identical interface at the kernel level, and new interest in its use has arisen again due to the fact that huge processes cannot fork on systems with strict commit charge accounting due to lack of memory, and the fact that copying the virtual memory layout of a process can be expensive if the process has a huge number of maps created by mmap. As such, both musl and glibc use vfork to implement posix_spawn, the modern interface for executing external programs as a new process.

While working on vfork usage in musl’s posix_spawn implementation, I realized that using it is a lot trickier (and more dangerous!) than I’d realized before. Here are some of the issues.

Signal handlers and vfork

Since the vfork child runs in the same address space as the parent, care needs to be taken to ensure that it does not modify the parent's memory in unwanted/unsafe ways. This seems easy enough, until you realize that the calling program might have installed signal handlers, and these signal handlers could get invoked in the child. The most likely way this could happen is when signals are sent to entire process groups, for example as a result of events like pressing the interrupt/quit key on the controlling terminal, or resizing the terminal. However, signals can also be sent to a process group explicitly.

If a signal handler runs in the child after vfork, there are several different ways it could corrupt the parent’s state:

  1. Modifications to memory will be seen, but modifications to the state of open file descriptors, signal dispositions, and other process state, will not be seen in the parent. This could leave the parent in an inconsistent state.
  2. The same signal might be seen twice “by the parent” (per what’s recorded in the parent’s memory) even though it should have been seen only once.
  3. The signal handler in the child process may store properties of the process (e.g. its pid) somewhere in the shared memory, where the parent’s values of these properties, not the child’s, should be stored.

As such, if vfork is to be used in code where the caller might have signal handlers which could be broken by the above issues, it’s necessary to ensure that the parent’s signal handlers don’t get invoked in the child. This amounts to:

  1. Block all signals before calling vfork.
  2. In the child, reset all signal dispositions that aren’t SIG_IGN to SIG_DFL, and then restore the old signal mask.
  3. After vfork returns in the parent, restore the signal mask.

Unfortunately, step 2 was illegal (undefined behavior) in the original specification of vfork, which basically meant vfork was impossible to use. Fortunately, on systems where vfork is supported, more specific semantics are provided/guaranteed, and step 2 works.
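
Here is a rough sketch of those three steps (error handling omitted; the loop bound of 64 signal numbers is a Linux-specific assumption, and a real posix_spawn has further concerns discussed below):

#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>

pid_t spawn_true(void)
{
    sigset_t all, old;
    struct sigaction sa;
    pid_t pid;
    int i;

    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &old);      /* step 1: block everything */

    pid = vfork();
    if (pid == 0) {
        /* step 2: reset every handler that isn't SIG_IGN or SIG_DFL,
           then restore the original signal mask. */
        for (i = 1; i <= 64; i++) {
            if (sigaction(i, 0, &sa) == 0 && sa.sa_handler != SIG_IGN
                && sa.sa_handler != SIG_DFL) {
                sa.sa_handler = SIG_DFL;
                sa.sa_flags = 0;
                sigemptyset(&sa.sa_mask);
                sigaction(i, &sa, 0);
            }
        }
        sigprocmask(SIG_SETMASK, &old, 0);
        execl("/bin/true", "true", (char *)0);
        _exit(127);                          /* exec failed */
    }

    sigprocmask(SIG_SETMASK, &old, 0);       /* step 3: restore in parent */
    return pid;
}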

Threads and vfork

Formally, vfork was pretty much history by the time people started caring about threads. But in real-world implementations (Linux), we can observe that vfork doesn’t suspend the whole parent process (which would be really difficult to do, anyway), but instead just suspends the calling thread until the child calls exec/exit. This means that concurrency issues exist, and the vfork child is actually sharing memory with other running code, not just a suspended parent.

This leads us to...

setuid and vfork

Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.

Working around the issues

The easy out would be just giving up on vfork. But in musl, a major target is systems where robustness is required (no overcommit) and memory might be constrained; therefore, fork is not a good option. Also, using vfork to implement posix_spawn might eventually allow us to support no-MMU targets.

In musl, there’s already a global lock that controls calls to the setuid family of functions; it was needed because Linux requires a userspace process to synchronize all its threads to make the setuid, etc. syscalls rather than doing the synchronization in kernelspace. Thus, it was easy to just reuse this lock to prevent uid/gid changes while posix_spawn is running. I believe glibc could do the same, since it has an equivalent locking mechanism in NPTL. musl may have a slight advantage here at present, in that the lock we’re using is a reader-writer lock, and callers of posix_spawn only count as readers, not writers.

Bug reports

I’ve filed and reopened several glibc bug reports related to the above issues:

These are security-relevant, but the rarity of multi-threaded programs that change their uid/gid after going threaded, and the rarity of real-world programs using posix_spawn, makes the impact extremely low at present.

Some final thoughts

Linux provides a CLONE_VFORK flag to clone which provides similar semantics to the traditional behavior of vfork, but allowing the new process to run on a separate stack with its own entry point, instead of utilizing the returns-twice idiom of fork. However, this does not solve any of the above problems; the signal handler and setuid issues still exist, and code in the child still has to tiptoe around anything that might upset the parent’s state. As such, I don’t see it as being any more useful than “traditional” vfork for implementing things like posix_spawn.
