EWONTFIX

Non-invasive printf debugging

12 Dec 2012 16:09:05 GMT

This post is not about any particular bug or bad programming practice, just a new “printf debugging” technique I came up with.

Often when tracking down a bug, it’s useful to add extra output to track the state of the program in the moments leading up to the crash or incorrect behavior, aka “printf debugging”. However, this technique is “invasive” in the sense that it interleaves unwanted data into the program output. Using stderr instead of stdout can alleviate the problem to some extent, but when you’re inserting the debugging code into a widely-used library (in my case, libc.so) or a shell, even having unwanted output on stderr can be a problem.

A previous approach I had used was sending the output to an arbitrary high file descriptor instead of stdout or stderr. For example, dprintf(666, "message here"); This avoids breaking anything but a program using a huge number of file descriptors, and allows reading the output via adding 666>&2 to the command line.

However, let’s take that a bit further. I don’t really need the program itself to print any output. I just need to be able to see the message. Enter my other favorite debugging tool, strace. Using strace, I can see the above output on a high file descriptor even if the descriptor isn’t actually open; the syscall still happens and returns with EBADF. But there’s no reason it needed to be a valid number that could be a file descriptor in the first place. I can instead just use dprintf(-42, "message here\n"); and there’s no risk of ever producing any output, but the strace log (with the appropriate options to increase string length limits) will show the writes.

The result is a form of “printf debugging” that can safely be added in library code and other code that shouldn’t touch any file descriptors.

Post or read comments...

Stubborn and ignorant use of int where size_t is needed

25 Oct 2012 23:48:02 GMT

What’s wrong with this C function?

char *my_strchr(char *s, int c)
{
    int i;
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}

Unless its interface contract requires that the caller pass a string no longer than INT_MAX, it can invoke undefined behavior due to integer overflow, most likely resulting in a crash. Even if you change the type to unsigned instead of int to avoid the signed overflow issue, the function will still be buggy; now, it just goes into an infinite loop if the input string is longer than UINT_MAX.

The solution, simple enough, is to use the right type for i: size_t. Strictly speaking, this could still fail in the same way, since the C language places no requirement on implementations that they not support objects too large for size_t; however, given that the standard library functions all use size_t, even on such an implementation one would have to go out of the way to produce such an over-sized object (for example, using calloc or a non-standard allocation function). And any reasonable implementation of C will have size_t sufficiently large to store any object size.

Unfortunately, I regularly encounter a surprising degree of resistence to the idea that this type of code must use size_t to be correct. The most infamous example is the 64-bit vulnerability in qmail, which D. J. Bernstein refused to acknowledge and award bounty, claiming that it’s not a bug, but even to this day I encounter quite a few programmers who claim they should be allowed to use int instead of size_t for loop counters involving array element counts.

The argument for using int basically comes down to a claim that inputs with more than 2 billion elements are (1) pathological, (2) malicious, and (3) should have already been rejected at some earlier stage. In most cases, I mostly agree with the first two points. Unless you’re writing software that’s intended to deal with gigantic volumes of data (in which case, I’ll presume you know what you’re doing), gigantic inputs are probably bad inputs. Okay, fair enough.

The problem in the argument for using int instead of size_t arises when you get to the third point — the idea that large inputs should already have been rejected. This is basically a cheap way of pushing the blame/responsibility onto somebody else, and while valid, if you write your library functions this way, it will make them extremely inefficient to interoperate with code that’s not making the same assumptions.

As an example, suppose the above my_strchr is being called from code that can’t assume its inputs were already length-limited below INT_MAX. The only way to safely pass this input to my_strchr is to call strlen (or strnlen, with n==INT_MAX) on it first to validate that it’s not too long. Even if the typical runtime of my_strchr is low (because you expect to find the character in the first hundred bytes or so), you now have to scan the whole string (or at least up to some large limit) first — not because you’ll actually be using the data, but just to check that the string meets the awkward interface contract of my_strchr. You’ve replaced the good typical-case runtime with a guaranteed worst-case runtime, just for the sake of gluing together interfaces that don’t play nice with each other.

One could argue here that I’ve attacked a straw man — certainly the my_strchr function could be improved to bail out early if i passes some arbitrary limit. This actually does solve the problem, and used consistently, this sort of approach works just as well as using size_t as long as you never intend to support objects larger than INT_MAX. Of course, the same bounded functions using size_t in place of int would work just as well, and they would be more general and flexible, in that they could be reused in applications requiring support for huge data.

Post or read comments...

Unexpected observability of lock states

25 Oct 2012 03:20:38 GMT

This post is going to be the first that’s about one of my own bugs, in musl. For a long time, I’ve had certain stdio functions such as feof and ferror forgoing any locking, and simply relying on the fact that, per the memory model that’s assumed, reading the associated flags is safe without any locks. The issue with doing this is that, while it’s safe, it’s not correct; it leads to observably incorrect behavior in some cases.

Per POSIX,

All functions that reference ( FILE *) objects shall behave as if they use flockfile() and funlockfile() internally to obtain ownership of these ( FILE *) objects.

(This requirement is tucked away in the specification of flockfile, so it’s not immediately apparent if you start out just reading the General Information in XSH Chapter 2.)

What this means is that, even if locking is not required to make a particular operation safe, there’s a requirement on the observable behavior of the program that it be as if the locking happened. Obviously, the existence of flockfile and funlockfile makes it so that an observably-wrong behavior can happen if feof does not wait for the lock; for example, if thread A never unlocks a file F except when F is at end-of-file, then an observably-wrong behavior occurs if thread B ever observes feof returning zero for F.

This is the boring case. What’s interesting is when we ask, if flockfile and funlockfile were omitted (for example, in a statically linked program that does not use them), would it be valid to skip locking in functions like feof? (This could easily be achieved with weak symbols.) It turns out the answer is still no, and the reason is rather surprising: it’s actually possible to deduce that a thread is in the middle of a stdio operation. This is contrary to the general principle that it’s usually impossible to deduce, without race conditions, whether a thread is executing a function or just about to execute or just finished executing it.

As one example, consider a program with two threads, A and B, where thread A has filled up the writing end of a pipe and observed (via select or poll or EWOULDBLOCK) that further writes would block. Now, thread B calls a stdio function to read from a stream attached to the reading end of the pipe. If thread A then observes that the pipe has become writable again, and if there are no other readers, then it’s possible to deduce that the operations thread B is performing on the stream have started (in which case the lock must have been obtained). If thread A has not written a sufficient amount of data to cause thread B’s operation on the stream to finish (for example, if the function is fread and thread A wrote fewer than the requested number of bytes), it’s also possible to deduce that the operation could not have completed. Thus, the lock must be held by thread B, and any operation performed on the stream by thread A must deadlock. In particular, it would be wrong for feof to return zero, since by the time thread B’s operation completes, the stream may in fact have reached end-of-file status.

Another interesting issue that arises is the need for even fclose to perform locking. I originally reasoned that locking should not be needed in fclose, since any use of the stream pointer after fclose is called results in undefined behavior. However, if another thread provably holds the lock (either by having called flockfile, or by an argument such as the above example), the standard assigns well-defined behavior to the fclose call; fclose must wait to obtain the lock before closing the stream and freeing the associated resources. As long as the thread that held the lock does not make any further attempt to use the stream after the lock is released, the program has well-defined behavior, and implementations must support this usage.

As a consequence of this analysis, I’m fixing all the affected interfaces in musl to use locking.

Post or read comments...

vfork considered dangerous

21 Oct 2012 21:20:22 GMT

Traditional unix systems had a vfork function, which works like fork, but without creating a new virtual address space; the parent and child run in the same address space. Unlike with pthread_create, where the new thread runs on its own stack, vfork behaves like fork and “returns twice”, once in the child and once in the parent. This seems impossible, since the parent and child would clobber one another’s stacks, but a clever trick saves the day: the parent process is suspended until the child performs exec or _exit, breaking the shared-memory-space relation between the two processes.

vfork was omitted from POSIX and modern standards because it’s difficult to use; the original specification for the function left it undefined to do basically anything except exec or _exit after vfork in the child. However, many systems (including Linux) still provide a similar or identical interface at the kernel level, and new interest in its use has arisen again due to the fact that huge processes cannot fork on systems with strict commit charge accounting due to lack of memory, and the fact that copying the virtual memory layout of a process can be expensive if the process has a huge number of maps created by mmap. As such, both musl and glibc use vfork to implement posix_spawn, the modern interface for executing external programs as a new process.

While working on vfork usage in musl’s posix_spawn implementation, I realized that using it is a lot trickier (and more dangerous!) than I’d realized before. Here are some of the issues.

Signal handlers and vfork

Since the vfork child runs in the same address space as the parent, care needs to be taken to ensure that it does not modify the parent’s memory in unwanted/unsafe ways. This seems easy enough, until you realize that the calling program might have installed signal handlers, and these signal handlers could get invoked in the child. The most likely way this could happen is when signals are sent to entire process groups, for example, as a result of events like pressing the interrupt/quit key on the controlling terminal, or resizing the terminal. However, signals can also be sent explicitly to a process group as well.

If a signal handler runs in the child after vfork, there are several different ways it could corrupt the parent’s state:

  1. Modifications to memory will be seen, but modifications to the state of open file descriptors, signal dispositions, and other process state, will not be seen in the parent. This could leave the parent in an inconsistent state.
  2. The same signal might be seen twice “by the parent” (per what’s recorded in the parent’s memory) even though it should have been seen only once.
  3. The signal handler in the child process may store properties of the process (e.g. its pid) somewhere in the shared memory, where the parent’s values of these properties, not the child’s, should be stored.

As such, if vfork is to be used in code where the caller might have signal handlers which could be broken by the above issues, it’s necessary to ensure that the parent’s signal handlers don’t get invoked in the child. This amounts to:

  1. Block all signals before calling vfork.
  2. In the child, reset all signal dispositions that aren’t SIG_IGN to SIG_DFL, and then restore the old signal mask.
  3. After vfork returns in the parent, restore the signal mask.

Unfortunately, step 2 was illegal (undefined behavior) in the original specification of vfork, which basically meant vfork was impossible to use. Fortunately, on systems where vfork is supported, more specific semantics are provided/guaranteed, and step 2 works.

Threads and vfork

Formally, vfork was pretty much history by the time people started caring about threads. But in real-world implementations (Linux), we can observe that vfork doesn’t suspend the whole parent process (which would be really difficult to do, anyway), but instead just suspends the calling thread until the child calls exec/exit. This means that concurrency issues exist, and the vfork child is actually sharing memory with other running code, not just a suspended parent.

This leads us to...

setuid and vfork

Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.

Working around the issues

The easy out would be just giving up on vfork. But in musl, a major target is systems where robustness is required (no overcommit) and memory might be constrained; therefore, fork is not a good option. Also, using vfork to implement posix_spawn might eventually allow us to support no-MMU targets.

In musl, there’s already a global lock that controls calls to the setuid family of functions; it was needed because Linux requires a userspace process to synchronize all its threads to make the setuid, etc. syscalls rather than doing the synchronization in kernelspace. Thus, it was easy to just reuse this lock to prevent uid/gid changes while posix_spawn is running. I believe glibc could do the same, since it has an equivalent locking mechanism in NPTL. musl may have a slight advantage here at present, in that the lock we’re using is a reader-writer lock, and callers of posix_spawn only count as readers, not writers.

Bug reports

I’ve filed and reopened several glibc bug reports related to the above issues:

These are security-relevant, but the rarity of multi-threaded programs that change their uid/gid after going threaded, and the rarity of real-world programs using posix_spawn, makes the impact extremely low at present.

Some final thoughts

Linux provides a CLONE_VFORK flag to clone which provides similar semantics to the traditional behavior of vfork, but allowing the new process to run on a separate stack with its own entry point, instead of utilizing the returns-twice idiom of fork. However, this does not solve any of the above problems; the signal handler and setuid issues still exist, and code in the child still has to tiptoe around anything that might upset the parent’s state. As such, I don’t see it as being any more useful than “traditional” vfork for implementing things like posix_spawn.

Post or read comments...

AS + DC = AC

15 Oct 2012 00:24:57 GMT

Where A.S. are asynchronous signals, D.C. is deferred cancellation, and A.C. is asynchronous cancellation. In the previous post, I discussed asychronous versus deferred cancellation in POSIX threads, and issues that make it hard to use asynchronous cancellation well. I also mentioned that there are almost no functions which are async-cancel-safe. What if you want to cheat and get the behavior of asynchronous cancellation, but without having to follow the rules?

Enter asynchronous signals. Particularly, I’m thinking of signals sent to a specific thread using pthread_kill, but really the signal could be coming from another source like pressing the interrupt or quit key on a terminal.

Suppose the main flow of excution in a thread uses only async-signal-safe functions. These are not to be confused with async-cancel-safe functions, of which only three exist; the set of async-signal-safe functions is relatively large and includes lots of powerful tools. Cancellation mode is left as deferred (the default).

Now suppose you have a signal handler that interrupts the main flow of execution and calls pthread_testcancel. Despite pthread_testcancel being async-signal-unsafe, this is legal because no other async-signal-unsafe function was interrupted by the signal.

Under this setup, a call to pthread_cancel followed by sending the signal gives you the equivalent of asynchronous cancellation, but rather than being restricted to only calling async-cancel-safe functions, it seems it’s now legal to call arbitrary async-signal-safe functions.

On the one hand, this seems to be an argument that the concept of async-cancel-safety is misguided, and that async-signal-safety should be the condition for which functions an application can call when in asynchronous cancellation mode.

However, let’s take it a step further. pthread_testcancel was not async-signal-safe, but close is. So, close(-1) (a no-op, aside from setting errno) is an async-signal-safe version of pthread_testcancel! Now, we’re no longer restricted to only calling async-signal-safe functions in the main flow of execution, since the signal handler is async-signal safe. But this is obviously wrong. Cancelling a thread while it’s in the middle of malloc or printf is not something that’s intended to work.

On the other hand, this brings up serious concerns for applications which are not trying to get around the async-cancel-safety rules, but which just happen to call cancellation points from their signal handlers. Doing so will cause the interrupted code to get cancelled exactly as if asynchronous cancellation had been enabled. And this is dangerous and generally unwanted.

A well-behaved application would like to just call pthread_setcancelstate in its signal handlers to set PTHREAD_CANCEL_DISABLED for the duration of the signal handler. But that’s in general not possible, since pthread_setcancelstate is async-signal-unsafe. This leaves me with a conclusion that, unless/until some improvement or clarification is added to the standard, applications need to ensure that signal handler containing cancellation points do not get run in threads that are potential targets of cancellation.

I’ve filled issues #615 and #622 on the Austin Group bug tracker. Ultimately I think this just comes down to an omission in the standard of any text to forbid this madness, hopefully an omission which can be quickly and easily resolved. But it’s provided some nice insight into the non-obvious complexity of cancellation and its interaction with other features of the POSIX standard with which it was probably never intended to be used.

Post or read comments...

Asynchronous cancellation pitfalls

07 Oct 2012 04:24 GMT

In the past few posts, I’ve introduced thread cancellation and some of the implementation and application usage difficulties in making cancellation robust. One topic I haven’t yet touched on is asynchronous cancellation. POSIX threads support two cancellation types: asynchronous and deferred. The latter, deferred cancellation, is the default, whereby a cancellation request is only acted upon immediately if the thread to be cancelled is suspended at a cancellation point, and otherwise remains pending until the next call to a cancellation point.

The other option, asynchronous cancellation, allows (but does not require) the implementation to act on cancellation requests at any time. This obviously has the potential, to leave data in a horribly inconsistent state, so rules are imposed; the application cannot call any functions from the standard library except those designated “async-cancel safe” while asynchronous cancellation is in effect. Essentially, one can think of asynchronous cancellation as a feature to be used only when a thread is performing a long-running pure computation.

One obvious problem with asynchronous cancellation that limits its utility is the fact that it does not obligate an implementation to act on cancellation requests immediately; the language of the standard is such that the request need only be acted upon at some point between when it’s made and the next call to a cancellation point. Unfortunately, there do not exist any cancellation points which are also async-cancel-safe, so a pathological but conforming implementation could actually treat async cancellation mode the same as it treats disabling cancellation.

OK, let’s forget about pathological and gratuitously broken implementations for a moment. Even if asynchronous cancellation mode does cause requests arriving to be acted upon immediately, there’s still the question of what happens when a cancellation request is already pending upon switching to asynchronous cancellation. POSIX does not say anything about this case, and if cancellation is implemented as (or analogously to) a signal, it would be an easy oversight for an implementation to forget to check for alread-pending cancellation requests when switching to asynchronous cancellation.

What if an application wants to account for this possible quality of implementation issue? In principle, an application should just be able to call pthread_testcancel right after switching on asynchronous cancellation. But it can’t do this, because pthread_testcancel is not async-cancel-safe. And if it calls pthread_testcancel before switching modes, there’s a race condition whereby a request that arrives between these two calls won’t be picked up.

What we’re stuck with is a situation where the standard allows low-quality implementations and also forbids the obvious workaround applications could use to handle such low-quality implementations.

Do any such low-quality implementations exist? If not, POSIX should probably standardize that switching to asynchronous cancellation acts on any pending cancellation request (for example, by specifying pthread_setcanceltype to be a cancellation point when its argument is PTHREAD_CANCEL_ASYNCHRONOUS).

Post or read comments...

Updates on close, EINTR, & cancellation

03 Oct 2012 00:05 GMT

Last week I took the time to file a report with the Austin Group (responsible for POSIX) about the close issue. It is Issue #614, and it turns out the problem was already solved by fixing the specification of close when interrupted by a signal. Whew. I thought that latter would be a lot more controversial and harder to get fixed

Some basic history on the issue: Apparently, there was a historical disagreement over the behavior of close when interrupted by a signal. Some implementations (e.g. HPUX) had it leave the file descriptor open when returning with EINTR; others (Linux, AIX) closed it unconditionally, but returned with EINTR if a signal arrived while close() was interrupted before returning. This ambiguity was acceptable for single-threaded applications, which could just unconditionally call close() again on EINTR, possibly getting EBADF if the file descriptor no longer existed due to the Linux/AIX behavior. It's not acceptable for multi-threaded applications, where calling close again could close a file descriptor just obtained by another thread - which can have far-reaching and extremely dangerous consequences.

Last December it was refered to the Austin Group for interpretation as Issue #529. The issue was resolved by requiring the HPUX behavior if close returns with EINTR, but allowing implementations which want to deallocate the file descriptor even if interrupted by a signal to return with the EINPROGRESS error instead of EINTR.

This is really the best possible solution, and the latter choice is the choice high-quality implementations should make. The Linux developers objected vehemently to the HPUX behavior on many grounds (see this thread for details), but the most important is that, under the HPUX behavior, it becomes impossible for a program that needs to deallocate a file descriptor to make forward progress when blocked on close; retrying the close will just block again. The situation is even worse with cancellation: the cleanup handler must retry the close, now with cancellation disabled, and it will now block indefinitely. But the Linux behavior is wrong too: in all other cases, EINTR means the application should retry the operation if it needs the effects to have completed, and given how POSIX defines the side effects on cancellation in terms of the side effects on EINTR, it yields unacceptable behavior with respect to cancellation. Avoiding this would have required POSIX to special-case close in the rules for side effects at cancellation.

Unfortunately, Linux (the kernel) is still doing the same thing it always did, returning -EINTR from the close syscall despite having the EINPROGRESS semantics. We're now working around this in musl, and as far as I know, have the first Linux-based libc where close conforms to the amended POSIX requirements. See commit 82dc1e2e783815e00a90cd3f681436a80d54a314 for details. Avoiding the cancellation issue (acting on cancellation after close deallocated the file descriptor) requires additional work at the point where cancellation is processed, but I already took care of this a long time ago; it's in src/thread/cancel_impl.c.

By the way, why does close ever block at all? Some genius way back thought it would be clever to overload the close function with responsibility for rewinding tape devices, and block until rewinding is complete. Never mind that there are perfectly good ways to rewind the device before closing it. If not for this historical blunder, close would never have been a blocking function, would never have been subject to interruption-by-signal, would never fail with EINTR, and would not have been specified by POSIX as a cancellation point. And this would have in turn made programming with file descriptors, signals, and thread cancellation a lot easier and less error-prone...

Post or read comments...

To overcommit or not to overcommit

23 Sep 2012 01:43 GMT

I’ve written in the past on the topic of overcommit, which depending on your perspective, is either a feature of Linux and some other kernels, or a bug left over from a time when folks didn’t know how to do virtual memory accounting properly. I’m a serious proponent of strict commit accounting (opposite of overcommit), but for this article, I want to look at the state of the software ecosystem and how it often leaves us overcommit-enabled Linux systems being more failproof than their strict-accounting brothers and sisters.

The idea of strict commit accounting is that malloc never reports success only to let your program crash when you actually try to use the memory. If the kernel cannot ensure that there’s no possible sequence of paging events that would cause it to run out of physical storage for all to-be-mapped pages, then allocating new pages fails, and malloc returns a null pointer. This gives your program the power, but also the responsibility, to check for out-of-memory conditions and handle them, which is in principle, a very good thing.

Where problems begin to arise is when programs don’t check the return value of malloc, or use it only for the sake of calling abort when allocation fails.

Let’s consider a possible (rather likely, these days) scenario: you have several core system components (things like init/upstart/systemd, inetd, sshd, etc.) that would leave the system in a crippled, unusable, or even kernel-panic state if they die, and these programs are making use of dynamic allocation. What happens when your machine runs out of memory?

If strict commit accounting is in effect, one of their calls to malloc fails, resulting in one of the following:

  1. Dealing with the failure gracefully, but failing to provide service at this time, and possibly also failing in other areas like logging. This is what a well-behaved program would do.
  2. Immediately calling abort on the first allocation failure. A core system component is not likely to do this itself, but it may inadvertently do so by relying on a library (such as glib) that unconditionally aborts the calling program on allocation failure.
  3. Ignoring the fact that malloc could fail, and dereferencing relative to a null pointer. This is the worst possible behavior, but there’s plenty of software that buggy.

If on the other hand overcommit is enabled, along with Linux’s heuristic OOM killer, there’s only one likely result: this critical system component was either manually marked as not a candidate for OOM killing, or was naturally not a candidate since it never engaged in allocation behaviors that the OOM killer judged as abusive. Some bloated desktop app like Firefox or OpenOffice if this is a desktop system, or some runaway PHP program if it’s a webserver, gets OOM-killed instead, and the system is back to “normal” (minus the user perhaps being angry about losing his or her session).

Does this mean I’ve changed my mind about overcommit and it’s actually a good thing? No, not really. What it means, at least in my mind, is that there’s a great deal of work that needs to be done auditing core system components for robustness and fail-safe behavior. In particular:

This is hard work, but I still believe it’s better than the current situation where stability of the essential core components of the system depends on sacrificing (OOM-killing) user applications which might have valuable unsaved data. It just means we still have a long way to go towards a rock-solid, crash-free FOSS platform...

Post or read comments...

Thread cancellation and resource leaks

21 Sep 2012 02:00 GMT

In a multi-threaded C program where threads share address space and may be operating on shared objects as long as they use the proper synchronization tools, it’s unsafe to asynchronously kill an individual thread without killing the whole process. Stale locks may be left behind and data being modified under those locks may be in an inconsistent state. This includes even internal heap management structures used by malloc.

As such, the POSIX threads standard does not even offer a mechanism for forcible termination of individual threads. Instead, it offers thread cancellation, a mechanism by which early termination of a thread whose work is no longer needed can be negotiated in such a way that the thread to be cancelled cleans up any shared state and/or private resources it may be using before it terminates.

The way cancellation works, from an application standpoint, is that when thread A no longer needs the work that thread B is performing, thread A calls pthread_cancel on thread B. Under normal circumstances, this does not immediately kill thread B - remember, asynchronously killing threads is fundamentally unsafe. Instead, it causes as cancellation request to be pending for thread B. The next time thread B calls a function which is specified by POSIX as a cancellation point, this request is acted upon, and the thread terminates (after possibly running cancellation cleanup handlers, which are analogous to exception handlers in languages with exceptions). Alternatively, if thread B happened to be blocked at a cancellation point already when thread A made the cancellation request, thread B would be immediately cancelled.

This latter case is what makes cancellation a more powerful tool than simply setting an exit flag for the target thread to inspect: cancellation can interrupt functions that block waiting for an event such as input. This is important because, under many circumstances such as non-responsive network peers or fifos where the other end is not open, the event for which the thread is blocked waiting may never happen.

On first consideration, one might consider cancellation unnecessary. A cancellation-like mechanism seems possible with interrupting signals, as in:

do {
    pthread_mutex_lock(&ctx->exitflag_lock);
    int flag = ctx->exitflag;
    pthread_mutex_unlock(&ctx->exitflag_lock);
    if (flag) pthread_exit(0);
} while ((n = read(fd, buf, sizeof buf)) == -1 && errno == EINTR);

Unfortunately, this code has a race condition; if a signal is sent to the thread after ctx->exitflag is checked, but before read is called, read will then block indefinitely. This issue can be worked around by bombarding the thread with such signals until it exits (possibly with exponential backoff), as in:

unsigned ns = 255;
pthread_mutex_lock(&ctx->exitflag_lock);
ctx->exitflag = 1;
pthread_mutex_unlock(&ctx->exitflag_lock);
while (pthread_kill(ctx->thread_desc, sig) != ESRCH)
    nanosleep({.tv_nsec=(ns+=ns+1)},0);

However this solution is not only inelegant but also quite costly, and the cost grows severely under load when scheduling delays prevent the target thread from waking up right away.

In an ideal world, thread cancellation would be the definitive solution to this problem, but low-quality implementations and possibly even a defect in the standard make it difficult to use cancellation robustly. Most of the bad implementations are just obviously bad, like the one in Darwin (MacOSX/iOS) where the developers literally just invented their own semantics based on the names of the cancellation-related functions without even looking at what they’re specified to do. Those are topics for another post. What I’d like to examine now is a much more subtle problem in the NPTL implementation of POSIX threads used by the GNU C Library (glibc), which depending on how one interprets the standard, may be a conformance bug or may just be a flaw in the standard that allows for very-low-quality implementations that are impossible to use safely.

POSIX specifies the general semantics of thread cancellation in Chapter 2 (XSH) 2.9.5 Thread Cancellation, as follows:

The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]. Any such side-effects occur before any cancellation cleanup handlers are called.

Whenever a thread has cancelability enabled and a cancellation request has been made with that thread as the target, and the thread then calls any function that is a cancellation point (such as pthread_testcancel() or read()), the cancellation request shall be acted upon before the function returns. If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon. It is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution if:

before the cancellation request is acted upon.

The interesting part is side effects. POSIX specifies quite a few functions as cancellation points, and their most interesting side effects are:

File IO is of course also a major side effect, but possibly the least-interesting one in the sense that if you’re cancelling a thread doing IO, you probably don’t care about the final state of the open file except that it get closed and possibly deleted in the cleanup routines. The others are a lot more critical to safe and correct program operation.

With regards to the interaction of cancellation and side effects, what you would like, as the application programmer, is that either:

  1. the side effects of the function occur, and control returns to the calling function, or
  2. no side effects of the function occur, and control never returns to the caller, instead passing into cancellation cleanup handlers and thread termination.

The other possibility to be concerned with, however, is that the side effects of the function occur, but control never returns. Consider what this would mean for open or accept: a new file descriptor is allocated, but the descriptor is never returned to the application. Thus it becomes impossible to close it. The application is left with a resource leak.

A much worse case is close. If the side effects of close take place without control returning, the application has no way of knowing the file descriptor was deallocated. If it assumes it’s still valid and attempts to close it again in cleanup handlers or elsewhere in the program, it may in fact close as different file that was opened later by another thread and assigned the same file descriptor number. If it assumes the file descriptor was closed already and this assumption turns out to be wrong, it leaks a file descriptor.

Are these just theoretical possibilities? Unfortunately, no. The way NPTL (used by glibc and uClibc) implements cancellable system calls is essentially (in pseudo-code):

ENABLE_ASYNC_CANCEL();
ret = DO_SYSCALL(...);
RESTORE_OLD_ASYNC_CANCEL();
return ret;

In other words, it temporarily turns on asynchronous cancellation, whereby any cancellation request will take place immediately, for the duration of the system call. This unfortunately leaves a race condition window after the side effects have taken place (in kernelspace) but before asynchronous cancellation is turned off, during which a cancellation request can arrive and be acted upon.

Here is a simple test case which demonstrates the issue:

#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

void *writeopener(void *arg)
{
    int fd;
    for (;;) {
        fd = open(arg, O_WRONLY);
        close(fd);
    }
}

void *leaker(void *arg)
{
    int fd = open(arg, O_RDONLY);
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
    close(fd);
    return 0;
}


#define ITER_COUNT 10000

int main()
{
    pthread_t td, bg;
    struct stat st;
    char tmp[] = "/tmp/cancel_race_XXXXXX";
    int i, leaks=0;

    mkstemp(tmp);
    unlink(tmp);
    mkfifo(tmp, 0600);
    srand(1);
    pthread_create(&bg, 0, writeopener, tmp);
    for (i=0; i<ITER_COUNT; i++) {
        pthread_create(&td, 0, leaker, tmp);
        nanosleep((&(struct timespec){ .tv_nsec=rand()%100000 }), 0);
        pthread_cancel(td);
        pthread_join(td, 0);
    }
    unlink(tmp);
    for (i=4; i<1024; i++) {
        if (!fstat(i, &st)) leaks++, printf("leaked fd %d\n", i);
    }
    return !!leaks;
}

So, is this a conformance issue, or just a show-stoppingly-bad quality of implementation issue? Let’s look again at the language in POSIX. The only text that explicitly gives an implementation freedom with regard to the behavior of cancellation is:

It is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution if:

before the cancellation request is acted upon.

The way I read this, an implementation is definitely allowed to act on a cancellation request that arrives after the event for which it’s waiting occurs. Such events might be the availability of input on a pipe or socket, the arrival of a signal that would satisfy sigwaitinfo, the availability of free space in a pipe or network buffer that would make writing possible, etc. I don’t however see anywhere POSIX allows action on a cancellation request that has arrived not only after the event for which the thread is waiting, but also after the side effects of the function (such as IO, consuming a signal, allocating or deallocating a file descriptor, etc.). The relevant text seems to be:

The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]. Any such side-effects occur before any cancellation cleanup handlers are called.

My reading of this is that, since a call to open that fails with EINTR does not allocate a file descriptor, a cancelled call to open must not do so either. However, the issue remains unresolved for close (which happens also to be the most dangerous issue) since POSIX gives implementations the freedom to choose whether to deallocate the file descriptor on EINTR:

If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to [EINTR] and the state of fildes is unspecified.

Even worse, the ‘correct’ behavior for close under EINTR (or any error) is to always deallocate the file descriptor. This is what Linux does anyway, and the reasons why it’s the preferred behavior are way beyond the scope of this post. However, the ‘correct’ behavior for close under cancellation is obviously to have no side effects, since an application has no way to distinguish between cases where a pending cancellation request was acted upon before attempting to close the file (thus having no side effects) and cases where the cancellation request arrived during close.

As such, it seems the only safe way to use close with cancellation is by wrapping it in calls to pthread_setcancelstate that disable cancellation for the duration of the close call. This is probably harmless since close cannot in fact block except on special devices that hook the close event. Even POSIX makes it clear that the possibility of interruptable blocking in close should be avoided by implementations, stating in the Rationale (non-normative) section for the close function:

The use of interruptible device close routines should be discouraged to avoid problems with the implicit closes of file descriptors by exec and exit(). This volume of POSIX.1-2008 only intends to permit such behavior by specifying the [EINTR] error condition.

The ideal remedies I would like to see for this issue are:

  1. For glibc/NPTL to fix this issue, regardless of whether it’s a bug or just a quality-of-implementation issue. Various non-invasive fixes are possible, but they all amount to having the cancellation-request handling code examine the program counter register for the point at which the thread was interrupted to see if the system call has completed yet at the point the request was received. In musl libc, we do that by having labels in the assembly language and comparing the saved PC against them, but a DWARF2-annotation-based approach could also be devised which might be more appealing to glibc developers.

  2. For POSIX to clarify that cancellation must not be acted upon when any side effects of the function have already taken place. This will allow application programmers targetting conforming systems to actually use the cancellation interfaces for the purposes they were intended for.

  3. For POSIX to remove close from the list of cancellation points. I’ve never seen a situation where its being a cancellation point is beneficial, and the ambiguity of its behavior under EINTR, combined with the way cancellation side effects are specified in terms of EINTR behavior, just makes it unnecessarily ugly and difficult to use.

So far, I’ve filed bug #12683 with glibc, but have not yet pursued anything with the Austin Group for clarifying the requirements of the standard.

Status: OPEN

Post or read comments...

Introducing EWONTFIX

22 Sep 2012 22:47 GMT

Welcome to EWONTFIX, a blog about, well, bugs. Especially longstanding unfixed ones in C code for Linux or Unix-like systems. The idea for this blog grew out of conversations during the development of musl libc. Aside from the fact that longstanding bugs in glibc were one of the original motivations for musl, it turns out that developing a libc leads to spending a lot of time building and testing applications. And in the process of testing, one ends up reading a lot of source. And a lot of source is appallingly bad.

Most low-quality source code just isn’t that interesting to write about. It’s more just a matter of identifying the problems, submitting them to bug trackers, and following up until somebody fixes things. However there are also a good deal of cases where buggy code is interesting to discuss. These fall mostly under two major categories:

  1. The defect is endemic, in the sense that it’s become a widely-copied idiom for other software written in the same language.

  2. The defect is in widely-used library code and has serious implications for any application using that code; however, most users of the library are at best vaguely aware that the defect even exists.

In addition to providing analysis of bugs and design flaws, I aim to make the posts on EWONTFIX informative with regard to the broader context in which the flaw appears, covering topics such as thread cancellation, commit charge and overcommit, and async-signal-safety.

This blog is presently utilizing Disqus for the comment system. Disqus is a third-party service that handles all the dynamic content off-site so I don’t have to worry about server load, denial of service, and privilege escalation issues on the actual server EWONTFIX runs on. Anonymous comments are permitted. I’m still looking for a way to offer (at least read-only) access to the comments to visitors without JS/AJAX style web browsers. I’m sure some of this blog’s target audience is unhappy with the current comment situation, but on the bright side, at least you guys can’t post to flame about it.

Attempts at humor aside, I would love a ready-made solution to pull static comments from Disqus in a form that’s ready for pasting into the page’s <noscript> block, or even better, a gateway to submit comments from a pure-HTML client. Readers, if you have any leads on this or want to contribute scripts for it, please post in the comments (after finding access to a browser that’s capable of posting).

Apart from using Disqus for comments, EWONTFIX is running entirely on static content built using GNU make and simple, portable shell scripts. All content source is written in Markdown. The site is styled with clean, simple CSS. At some point I may take a detour into meta-blogging to discus the design and implementation of the site itself — including, of course, bugs found in the process.

Post or read comments...