This post is not about any particular bug or bad programming practice, just a new “printf debugging” technique I came up with.
Often when tracking down a bug, it’s useful to add extra output to
track the state of the program in the moments leading up to the crash
or incorrect behavior, aka “printf debugging”. However, this technique
is “invasive” in the sense that it interleaves unwanted data into the
program output. Using stderr instead of stdout can alleviate the
problem to some extent, but when you’re inserting the debugging code
into a widely-used library (in my case, libc.so) or a shell, even
having unwanted output on stderr can be a problem.
A previous approach I had used was sending the output to an arbitrary
high file descriptor instead of stdout or stderr. For example,
dprintf(666, "message here"); This avoids breaking anything but a
program using a huge number of file descriptors, and allows reading
the output via adding 666>&2 to the command line.
However, let’s take that a bit further. I don’t really need the
program itself to print any output. I just need to be able to see the
message. Enter my other favorite debugging tool, strace. Using
strace, I can see the above output on a high file descriptor even if
the descriptor isn’t actually open; the syscall still happens and
returns with EBADF. But there’s no reason it needed to be a valid
number that could be a file descriptor in the first place. I can
instead just use dprintf(-42, "message here\n"); and there’s no risk
of ever producing any output, but the strace log (with the
appropriate options to increase string length limits) will show the
writes.
The result is a form of “printf debugging” that can safely be added in library code and other code that shouldn’t touch any file descriptors.
What’s wrong with this C function?
char *my_strchr(char *s, int c)
{
int i;
for (i=0; s[i]!=c; i++)
if (!s[i]) return 0;
return &s[i];
}
Unless its interface contract requires that the caller pass a string
no longer than INT_MAX, it can invoke undefined behavior due to
integer overflow, most likely resulting in a crash. Even if you change
the type to unsigned instead of int to avoid the signed overflow
issue, the function will still be buggy; now, it just goes into an
infinite loop if the input string is longer than UINT_MAX.
The solution, simple enough, is to use the right type for i:
size_t. Strictly speaking, this could still fail in the same way,
since the C language places no requirement on implementations that
they not support objects too large for size_t; however, given that
the standard library functions all use size_t, even on such an
implementation one would have to go out of the way to produce such an
over-sized object (for example, using calloc or a non-standard
allocation function). And any reasonable implementation of C will have
size_t sufficiently large to store any object size.
Unfortunately, I regularly encounter a surprising degree of resistence
to the idea that this type of code must use size_t to be correct.
The most infamous example is the 64-bit vulnerability in
qmail,
which D. J. Bernstein refused to
acknowledge and award
bounty, claiming that it’s not a bug, but even to this day I encounter
quite a few programmers who claim they should be allowed to use int
instead of size_t for loop counters involving array element counts.
The argument for using int basically comes down to a claim that
inputs with more than 2 billion elements are (1) pathological, (2)
malicious, and (3) should have already been rejected at some earlier
stage. In most cases, I mostly agree with the first two points.
Unless you’re writing software that’s intended to deal with gigantic
volumes of data (in which case, I’ll presume you know what you’re
doing), gigantic inputs are probably bad inputs. Okay, fair enough.
The problem in the argument for using int instead of size_t arises
when you get to the third point — the idea that large inputs should
already have been rejected. This is basically a cheap way of pushing
the blame/responsibility onto somebody else, and while valid, if you
write your library functions this way, it will make them extremely
inefficient to interoperate with code that’s not making the same
assumptions.
As an example, suppose the above my_strchr is being called from code
that can’t assume its inputs were already length-limited below
INT_MAX. The only way to safely pass this input to my_strchr is to
call strlen (or strnlen, with n==INT_MAX) on it first to
validate that it’s not too long. Even if the typical runtime of
my_strchr is low (because you expect to find the character in the
first hundred bytes or so), you now have to scan the whole string (or
at least up to some large limit) first — not because you’ll actually
be using the data, but just to check that the string meets the awkward
interface contract of my_strchr. You’ve replaced the good
typical-case runtime with a guaranteed worst-case runtime, just for
the sake of gluing together interfaces that don’t play nice with each
other.
One could argue here that I’ve attacked a straw man — certainly the
my_strchr function could be improved to bail out early if i passes
some arbitrary limit. This actually does solve the problem, and used
consistently, this sort of approach works just as well as using
size_t as long as you never intend to support objects larger than
INT_MAX. Of course, the same bounded functions using size_t in
place of int would work just as well, and they would be more general
and flexible, in that they could be reused in applications requiring
support for huge data.
This post is going to be the first that’s about one of my own bugs, in
musl. For a long time, I’ve had certain
stdio functions such as feof and ferror forgoing any locking, and
simply relying on the fact that, per the memory model that’s assumed,
reading the associated flags is safe without any locks. The issue with
doing this is that, while it’s safe, it’s not correct; it leads to
observably incorrect behavior in some cases.
Per POSIX,
All functions that reference ( FILE *) objects shall behave as if they use flockfile() and funlockfile() internally to obtain ownership of these ( FILE *) objects.
(This requirement is tucked away in the specification of
flockfile,
so it’s not immediately apparent if you start out just reading the
General Information in XSH Chapter 2.)
What this means is that, even if locking is not required to make a
particular operation safe, there’s a requirement on the observable
behavior of the program that it be as if the locking happened.
Obviously, the existence of flockfile and funlockfile makes it so
that an observably-wrong behavior can happen if feof does not wait
for the lock; for example, if thread A never unlocks a file F except
when F is at end-of-file, then an observably-wrong behavior occurs if
thread B ever observes feof returning zero for F.
This is the boring case. What’s interesting is when we ask, if
flockfile and funlockfile were omitted (for example, in a
statically linked program that does not use them), would it be valid
to skip locking in functions like feof? (This could easily be
achieved with weak symbols.) It turns out the answer is still no, and
the reason is rather surprising: it’s actually possible to deduce that
a thread is in the middle of a stdio operation. This is contrary to
the general principle that it’s usually impossible to deduce, without
race conditions, whether a thread is executing a function or just
about to execute or just finished executing it.
As one example, consider a program with two threads, A and B, where
thread A has filled up the writing end of a pipe and observed (via
select or poll or EWOULDBLOCK) that further writes would block.
Now, thread B calls a stdio function to read from a stream attached to
the reading end of the pipe. If thread A then observes that the pipe
has become writable again, and if there are no other readers, then
it’s possible to deduce that the operations thread B is performing on
the stream have started (in which case the lock must have been
obtained). If thread A has not written a sufficient amount of data to
cause thread B’s operation on the stream to finish (for example, if
the function is fread and thread A wrote fewer than the requested
number of bytes), it’s also possible to deduce that the operation
could not have completed. Thus, the lock must be held by thread B, and
any operation performed on the stream by thread A must deadlock. In
particular, it would be wrong for feof to return zero, since by the
time thread B’s operation completes, the stream may in fact have
reached end-of-file status.
Another interesting issue that arises is the need for even fclose to
perform locking. I originally reasoned that locking should not be
needed in fclose, since any use of the stream pointer after fclose
is called results in undefined behavior. However, if another thread
provably holds the lock (either by having called flockfile, or by an
argument such as the above example), the standard assigns well-defined
behavior to the fclose call; fclose must wait to obtain the lock
before closing the stream and freeing the associated resources. As
long as the thread that held the lock does not make any further
attempt to use the stream after the lock is released, the program has
well-defined behavior, and implementations must support this usage.
As a consequence of this analysis, I’m fixing all the affected interfaces in musl to use locking.
Traditional unix systems had a vfork function, which works like
fork, but without creating a new virtual address space; the parent
and child run in the same address space. Unlike with pthread_create,
where the new thread runs on its own stack, vfork behaves like
fork and “returns twice”, once in the child and once in the parent.
This seems impossible, since the parent and child would clobber one
another’s stacks, but a clever trick saves the day: the parent process
is suspended until the child performs exec or _exit, breaking the
shared-memory-space relation between the two processes.
vfork was omitted from POSIX and modern standards because it’s
difficult to use; the original specification for the function left it
undefined to do basically anything except exec or _exit after
vfork in the child. However, many systems (including Linux) still
provide a similar or identical interface at the kernel level, and new
interest in its use has arisen again due to the fact that huge
processes cannot fork on systems with strict commit charge
accounting due to lack of memory, and the fact that copying the
virtual memory layout of a process can be expensive if the process has
a huge number of maps created by mmap. As such, both musl and glibc
use vfork to implement posix_spawn, the modern interface for
executing external programs as a new process.
While working on vfork usage in musl’s posix_spawn implementation,
I realized that using it is a lot trickier (and more dangerous!) than
I’d realized before. Here are some of the issues.
vforkSince the vfork child runs in the same address space as the parent,
care needs to be taken to ensure that it does not modify the parent’s
memory in unwanted/unsafe ways. This seems easy enough, until you
realize that the calling program might have installed signal handlers,
and these signal handlers could get invoked in the child. The most
likely way this could happen is when signals are sent to entire
process groups, for example, as a result of events like pressing the
interrupt/quit key on the controlling terminal, or resizing the
terminal. However, signals can also be sent explicitly to a process
group as well.
If a signal handler runs in the child after vfork, there are several
different ways it could corrupt the parent’s state:
As such, if vfork is to be used in code where the caller might have
signal handlers which could be broken by the above issues, it’s
necessary to ensure that the parent’s signal handlers don’t get
invoked in the child. This amounts to:
vfork.SIG_IGN
to SIG_DFL, and then restore the old signal mask.vfork returns in the parent, restore the signal mask.Unfortunately, step 2 was illegal (undefined behavior) in the original
specification of vfork, which basically meant vfork was impossible
to use. Fortunately, on systems where vfork is supported, more
specific semantics are provided/guaranteed, and step 2 works.
vforkFormally, vfork was pretty much history by the time people started
caring about threads. But in real-world implementations (Linux), we
can observe that vfork doesn’t suspend the whole parent process
(which would be really difficult to do, anyway), but instead just
suspends the calling thread until the child calls exec/exit.
This means that concurrency issues exist, and the vfork child is
actually sharing memory with other running code, not just a
suspended parent.
This leads us to...
setuid and vforkNow we get to the worst of it. Threads and vfork allow you to get in
a situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork,
to run an external command. It doesn’t care if this command runs as
root or with low privileges, since it’s a fixed command line with
fixed environment and can’t do anything harmful. (As a stupid example,
let’s say it’s running date as an external command because the
programmer couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.
The easy out would be just giving up on vfork. But in musl, a major
target is systems where robustness is required (no overcommit) and
memory might be constrained; therefore, fork is not a good option.
Also, using vfork to implement posix_spawn might eventually allow
us to support no-MMU targets.
In musl, there’s already a global lock that controls calls to the
setuid family of functions; it was needed because Linux requires a
userspace process to synchronize all its threads to make the setuid,
etc. syscalls rather than doing the synchronization in kernelspace.
Thus, it was easy to just reuse this lock to prevent uid/gid changes
while posix_spawn is running. I believe glibc could do the same,
since it has an equivalent locking mechanism in NPTL. musl may have a
slight advantage here at present, in that the lock we’re using is a
reader-writer lock, and callers of posix_spawn only count as
readers, not writers.
I’ve filed and reopened several glibc bug reports related to the above issues:
These are security-relevant, but the rarity of multi-threaded programs
that change their uid/gid after going threaded, and the rarity of
real-world programs using posix_spawn, makes the impact extremely
low at present.
Linux provides a CLONE_VFORK flag to clone which provides similar
semantics to the traditional behavior of vfork, but allowing the new
process to run on a separate stack with its own entry point, instead
of utilizing the returns-twice idiom of fork. However, this does not
solve any of the above problems; the signal handler and setuid issues
still exist, and code in the child still has to tiptoe around anything
that might upset the parent’s state. As such, I don’t see it as being
any more useful than “traditional” vfork for implementing things
like posix_spawn.
Where A.S. are asynchronous signals, D.C. is deferred cancellation, and A.C. is asynchronous cancellation. In the previous post, I discussed asychronous versus deferred cancellation in POSIX threads, and issues that make it hard to use asynchronous cancellation well. I also mentioned that there are almost no functions which are async-cancel-safe. What if you want to cheat and get the behavior of asynchronous cancellation, but without having to follow the rules?
Enter asynchronous signals. Particularly, I’m thinking of signals sent
to a specific thread using pthread_kill, but really the signal could
be coming from another source like pressing the interrupt or quit key
on a terminal.
Suppose the main flow of excution in a thread uses only async-signal-safe functions. These are not to be confused with async-cancel-safe functions, of which only three exist; the set of async-signal-safe functions is relatively large and includes lots of powerful tools. Cancellation mode is left as deferred (the default).
Now suppose you have a signal handler that interrupts the main flow of
execution and calls pthread_testcancel. Despite pthread_testcancel
being async-signal-unsafe, this is legal because no other
async-signal-unsafe function was interrupted by the signal.
Under this setup, a call to pthread_cancel followed by sending the
signal gives you the equivalent of asynchronous cancellation, but
rather than being restricted to only calling async-cancel-safe
functions, it seems it’s now legal to call arbitrary async-signal-safe
functions.
On the one hand, this seems to be an argument that the concept of async-cancel-safety is misguided, and that async-signal-safety should be the condition for which functions an application can call when in asynchronous cancellation mode.
However, let’s take it a step further. pthread_testcancel was not
async-signal-safe, but close is. So, close(-1) (a no-op, aside
from setting errno) is an async-signal-safe version of
pthread_testcancel! Now, we’re no longer restricted to only calling
async-signal-safe functions in the main flow of execution, since the
signal handler is async-signal safe. But this is obviously wrong.
Cancelling a thread while it’s in the middle of malloc or printf
is not something that’s intended to work.
On the other hand, this brings up serious concerns for applications which are not trying to get around the async-cancel-safety rules, but which just happen to call cancellation points from their signal handlers. Doing so will cause the interrupted code to get cancelled exactly as if asynchronous cancellation had been enabled. And this is dangerous and generally unwanted.
A well-behaved application would like to just call
pthread_setcancelstate in its signal handlers to set
PTHREAD_CANCEL_DISABLED for the duration of the signal handler. But
that’s in general not possible, since pthread_setcancelstate is
async-signal-unsafe. This leaves me with a conclusion that,
unless/until some improvement or clarification is added to the
standard, applications need to ensure that signal handler containing
cancellation points do not get run in threads that are potential
targets of cancellation.
I’ve filled issues #615 and #622 on the Austin Group bug tracker. Ultimately I think this just comes down to an omission in the standard of any text to forbid this madness, hopefully an omission which can be quickly and easily resolved. But it’s provided some nice insight into the non-obvious complexity of cancellation and its interaction with other features of the POSIX standard with which it was probably never intended to be used.
In the past few posts, I’ve introduced thread cancellation and some of the implementation and application usage difficulties in making cancellation robust. One topic I haven’t yet touched on is asynchronous cancellation. POSIX threads support two cancellation types: asynchronous and deferred. The latter, deferred cancellation, is the default, whereby a cancellation request is only acted upon immediately if the thread to be cancelled is suspended at a cancellation point, and otherwise remains pending until the next call to a cancellation point.
The other option, asynchronous cancellation, allows (but does not require) the implementation to act on cancellation requests at any time. This obviously has the potential, to leave data in a horribly inconsistent state, so rules are imposed; the application cannot call any functions from the standard library except those designated “async-cancel safe” while asynchronous cancellation is in effect. Essentially, one can think of asynchronous cancellation as a feature to be used only when a thread is performing a long-running pure computation.
One obvious problem with asynchronous cancellation that limits its utility is the fact that it does not obligate an implementation to act on cancellation requests immediately; the language of the standard is such that the request need only be acted upon at some point between when it’s made and the next call to a cancellation point. Unfortunately, there do not exist any cancellation points which are also async-cancel-safe, so a pathological but conforming implementation could actually treat async cancellation mode the same as it treats disabling cancellation.
OK, let’s forget about pathological and gratuitously broken implementations for a moment. Even if asynchronous cancellation mode does cause requests arriving to be acted upon immediately, there’s still the question of what happens when a cancellation request is already pending upon switching to asynchronous cancellation. POSIX does not say anything about this case, and if cancellation is implemented as (or analogously to) a signal, it would be an easy oversight for an implementation to forget to check for alread-pending cancellation requests when switching to asynchronous cancellation.
What if an application wants to account for this possible quality of
implementation issue? In principle, an application should just be able
to call pthread_testcancel right after switching on asynchronous
cancellation. But it can’t do this, because pthread_testcancel is
not async-cancel-safe. And if it calls pthread_testcancel before
switching modes, there’s a race condition whereby a request that
arrives between these two calls won’t be picked up.
What we’re stuck with is a situation where the standard allows low-quality implementations and also forbids the obvious workaround applications could use to handle such low-quality implementations.
Do any such low-quality implementations exist? If not, POSIX should
probably standardize that switching to asynchronous cancellation acts
on any pending cancellation request (for example, by specifying
pthread_setcanceltype to be a cancellation point when its argument
is PTHREAD_CANCEL_ASYNCHRONOUS).
Last week I took the time to file a report with the Austin Group
(responsible for POSIX) about the close issue. It
is Issue #614, and it
turns out the problem was already solved by fixing the specification
of close when interrupted by a signal. Whew. I thought that latter
would be a lot more controversial and harder to get fixed
Some basic history on the issue: Apparently, there was a historical
disagreement over the behavior of close when interrupted by a
signal. Some implementations (e.g. HPUX) had it leave the file
descriptor open when returning with EINTR; others (Linux, AIX) closed
it unconditionally, but returned with EINTR if a signal arrived while
close() was interrupted before returning. This ambiguity was
acceptable for single-threaded applications, which could just
unconditionally call close() again on EINTR, possibly getting EBADF if
the file descriptor no longer existed due to the Linux/AIX behavior.
It's not acceptable for multi-threaded applications, where calling
close again could close a file descriptor just obtained by another
thread - which can have far-reaching and extremely dangerous
consequences.
Last December it was refered to the Austin Group for interpretation as
Issue #529. The issue
was resolved by requiring the HPUX behavior if close returns with
EINTR, but allowing implementations which want to deallocate the
file descriptor even if interrupted by a signal to return with the
EINPROGRESS error instead of EINTR.
This is really the best possible solution, and the latter choice is
the choice high-quality implementations should make. The Linux
developers objected vehemently to the HPUX behavior on many grounds
(see this thread for
details), but the most important is that, under the HPUX behavior, it
becomes impossible for a program that needs to deallocate a file
descriptor to make forward progress when blocked on close; retrying
the close will just block again. The situation is even worse with
cancellation: the cleanup handler must retry the close, now with
cancellation disabled, and it will now block indefinitely. But the
Linux behavior is wrong too: in all other cases, EINTR means the
application should retry the operation if it needs the effects to have
completed, and given how POSIX defines the side effects on
cancellation in terms of the side effects on EINTR, it yields
unacceptable behavior with respect to cancellation. Avoiding this
would have required POSIX to special-case close in the rules for
side effects at cancellation.
Unfortunately, Linux (the kernel) is still doing the same thing it
always did, returning -EINTR from the close syscall despite having
the EINPROGRESS semantics. We're now working around this in
musl, and as far as I know, have the first
Linux-based libc where close conforms to the amended POSIX
requirements. See commit
82dc1e2e783815e00a90cd3f681436a80d54a314
for details. Avoiding the cancellation issue (acting on cancellation
after close deallocated the file descriptor) requires additional work
at the point where cancellation is processed, but I already took care
of this a long time ago; it's in
src/thread/cancel_impl.c.
By the way, why does close ever block at all? Some genius way back
thought it would be clever to overload the close function with
responsibility for rewinding tape devices, and block until rewinding
is complete. Never mind that there are perfectly good ways to rewind
the device before closing it. If not for this historical blunder,
close would never have been a blocking function, would never have
been subject to interruption-by-signal, would never fail with EINTR,
and would not have been specified by POSIX as a cancellation point.
And this would have in turn made programming with file descriptors,
signals, and thread cancellation a lot easier and less
error-prone...
I’ve written in the past on the topic of overcommit, which depending on your perspective, is either a feature of Linux and some other kernels, or a bug left over from a time when folks didn’t know how to do virtual memory accounting properly. I’m a serious proponent of strict commit accounting (opposite of overcommit), but for this article, I want to look at the state of the software ecosystem and how it often leaves us overcommit-enabled Linux systems being more failproof than their strict-accounting brothers and sisters.
The idea of strict commit accounting is that malloc never reports
success only to let your program crash when you actually try to use
the memory. If the kernel cannot ensure that there’s no possible
sequence of paging events that would cause it to run out of physical
storage for all to-be-mapped pages, then allocating new pages fails,
and malloc returns a null pointer. This gives your program the
power, but also the responsibility, to check for out-of-memory
conditions and handle them, which is in principle, a very good thing.
Where problems begin to arise is when programs don’t check the return
value of malloc, or use it only for the sake of calling abort when
allocation fails.
Let’s consider a possible (rather likely, these days) scenario: you have several core system components (things like init/upstart/systemd, inetd, sshd, etc.) that would leave the system in a crippled, unusable, or even kernel-panic state if they die, and these programs are making use of dynamic allocation. What happens when your machine runs out of memory?
If strict commit accounting is in effect, one of their calls to
malloc fails, resulting in one of the following:
abort on the first allocation failure. A core
system component is not likely to do this itself, but it may
inadvertently do so by relying on a library (such as glib) that
unconditionally aborts the calling program on allocation
failure.malloc could fail, and dereferencing
relative to a null pointer. This is the worst possible behavior, but
there’s plenty of software that buggy.If on the other hand overcommit is enabled, along with Linux’s heuristic OOM killer, there’s only one likely result: this critical system component was either manually marked as not a candidate for OOM killing, or was naturally not a candidate since it never engaged in allocation behaviors that the OOM killer judged as abusive. Some bloated desktop app like Firefox or OpenOffice if this is a desktop system, or some runaway PHP program if it’s a webserver, gets OOM-killed instead, and the system is back to “normal” (minus the user perhaps being angry about losing his or her session).
Does this mean I’ve changed my mind about overcommit and it’s actually a good thing? No, not really. What it means, at least in my mind, is that there’s a great deal of work that needs to be done auditing core system components for robustness and fail-safe behavior. In particular:
This is hard work, but I still believe it’s better than the current situation where stability of the essential core components of the system depends on sacrificing (OOM-killing) user applications which might have valuable unsaved data. It just means we still have a long way to go towards a rock-solid, crash-free FOSS platform...
In a multi-threaded C program where threads share address space and
may be operating on shared objects as long as they use the proper
synchronization tools, it’s unsafe to asynchronously kill an
individual thread without killing the whole process. Stale locks may
be left behind and data being modified under those locks may be in an
inconsistent state. This includes even internal heap management
structures used by malloc.
As such, the POSIX threads standard does not even offer a mechanism for forcible termination of individual threads. Instead, it offers thread cancellation, a mechanism by which early termination of a thread whose work is no longer needed can be negotiated in such a way that the thread to be cancelled cleans up any shared state and/or private resources it may be using before it terminates.
The way cancellation works, from an application standpoint, is that
when thread A no longer needs the work that thread B is performing,
thread A calls pthread_cancel on thread B. Under normal
circumstances, this does not immediately kill thread B - remember,
asynchronously killing threads is fundamentally unsafe. Instead, it
causes as cancellation request to be pending for thread B. The next
time thread B calls a function which is specified by POSIX as a
cancellation point, this request is acted upon, and the thread
terminates (after possibly running cancellation cleanup handlers,
which are analogous to exception handlers in languages with
exceptions). Alternatively, if thread B happened to be blocked at a
cancellation point already when thread A made the cancellation
request, thread B would be immediately cancelled.
This latter case is what makes cancellation a more powerful tool than simply setting an exit flag for the target thread to inspect: cancellation can interrupt functions that block waiting for an event such as input. This is important because, under many circumstances such as non-responsive network peers or fifos where the other end is not open, the event for which the thread is blocked waiting may never happen.
On first consideration, one might consider cancellation unnecessary. A cancellation-like mechanism seems possible with interrupting signals, as in:
do {
pthread_mutex_lock(&ctx->exitflag_lock);
int flag = ctx->exitflag;
pthread_mutex_unlock(&ctx->exitflag_lock);
if (flag) pthread_exit(0);
} while ((n = read(fd, buf, sizeof buf)) == -1 && errno == EINTR);
Unfortunately, this code has a race condition; if a signal is sent to
the thread after ctx->exitflag is checked, but before read is
called, read will then block indefinitely. This issue can be worked
around by bombarding the thread with such signals until it exits
(possibly with exponential backoff), as in:
unsigned ns = 255;
pthread_mutex_lock(&ctx->exitflag_lock);
ctx->exitflag = 1;
pthread_mutex_unlock(&ctx->exitflag_lock);
while (pthread_kill(ctx->thread_desc, sig) != ESRCH)
nanosleep({.tv_nsec=(ns+=ns+1)},0);
However this solution is not only inelegant but also quite costly, and the cost grows severely under load when scheduling delays prevent the target thread from waking up right away.
In an ideal world, thread cancellation would be the definitive solution to this problem, but low-quality implementations and possibly even a defect in the standard make it difficult to use cancellation robustly. Most of the bad implementations are just obviously bad, like the one in Darwin (MacOSX/iOS) where the developers literally just invented their own semantics based on the names of the cancellation-related functions without even looking at what they’re specified to do. Those are topics for another post. What I’d like to examine now is a much more subtle problem in the NPTL implementation of POSIX threads used by the GNU C Library (glibc), which depending on how one interprets the standard, may be a conformance bug or may just be a flaw in the standard that allows for very-low-quality implementations that are impossible to use safely.
POSIX specifies the general semantics of thread cancellation in Chapter 2 (XSH) 2.9.5 Thread Cancellation, as follows:
The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]. Any such side-effects occur before any cancellation cleanup handlers are called.
Whenever a thread has cancelability enabled and a cancellation request has been made with that thread as the target, and the thread then calls any function that is a cancellation point (such as pthread_testcancel() or read()), the cancellation request shall be acted upon before the function returns. If a thread has cancelability enabled and a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon. It is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution if:
The thread is suspended at a cancellation point and the event for which it is waiting occurs
A specified timeout expired
before the cancellation request is acted upon.
The interesting part is side effects. POSIX specifies quite a few functions as cancellation points, and their most interesting side effects are:
accept, open, ...)fcntl, lockf, sem_wait, ...)close)pthread_join)sigwaitinfo, ...)File IO is of course also a major side effect, but possibly the least-interesting one in the sense that if you’re cancelling a thread doing IO, you probably don’t care about the final state of the open file except that it get closed and possibly deleted in the cleanup routines. The others are a lot more critical to safe and correct program operation.
With regards to the interaction of cancellation and side effects, what you would like, as the application programmer, is that either:
The other possibility to be concerned with, however, is that the side
effects of the function occur, but control never returns. Consider
what this would mean for open or accept: a new file descriptor is
allocated, but the descriptor is never returned to the application.
Thus it becomes impossible to close it. The application is left with a
resource leak.
A much worse case is close. If the side effects of close take
place without control returning, the application has no way of knowing
the file descriptor was deallocated. If it assumes it’s still valid
and attempts to close it again in cleanup handlers or elsewhere in the
program, it may in fact close as different file that was opened
later by another thread and assigned the same file descriptor number.
If it assumes the file descriptor was closed already and this
assumption turns out to be wrong, it leaks a file descriptor.
Are these just theoretical possibilities? Unfortunately, no. The way NPTL (used by glibc and uClibc) implements cancellable system calls is essentially (in pseudo-code):
ENABLE_ASYNC_CANCEL();
ret = DO_SYSCALL(...);
RESTORE_OLD_ASYNC_CANCEL();
return ret;
In other words, it temporarily turns on asynchronous cancellation, whereby any cancellation request will take place immediately, for the duration of the system call. This unfortunately leaves a race condition window after the side effects have taken place (in kernelspace) but before asynchronous cancellation is turned off, during which a cancellation request can arrive and be acted upon.
Here is a simple test case which demonstrates the issue:
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
void *writeopener(void *arg)
{
int fd;
for (;;) {
fd = open(arg, O_WRONLY);
close(fd);
}
}
void *leaker(void *arg)
{
int fd = open(arg, O_RDONLY);
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
close(fd);
return 0;
}
#define ITER_COUNT 10000
int main()
{
pthread_t td, bg;
struct stat st;
char tmp[] = "/tmp/cancel_race_XXXXXX";
int i, leaks=0;
mkstemp(tmp);
unlink(tmp);
mkfifo(tmp, 0600);
srand(1);
pthread_create(&bg, 0, writeopener, tmp);
for (i=0; i<ITER_COUNT; i++) {
pthread_create(&td, 0, leaker, tmp);
nanosleep((&(struct timespec){ .tv_nsec=rand()%100000 }), 0);
pthread_cancel(td);
pthread_join(td, 0);
}
unlink(tmp);
for (i=4; i<1024; i++) {
if (!fstat(i, &st)) leaks++, printf("leaked fd %d\n", i);
}
return !!leaks;
}
So, is this a conformance issue, or just a show-stoppingly-bad quality of implementation issue? Let’s look again at the language in POSIX. The only text that explicitly gives an implementation freedom with regard to the behavior of cancellation is:
It is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution if:
The thread is suspended at a cancellation point and the event for which it is waiting occurs
A specified timeout expired
before the cancellation request is acted upon.
The way I read this, an implementation is definitely allowed to act on
a cancellation request that arrives after the event for which it’s
waiting occurs. Such events might be the availability of input on a
pipe or socket, the arrival of a signal that would satisfy
sigwaitinfo, the availability of free space in a pipe or network
buffer that would make writing possible, etc. I don’t however see
anywhere POSIX allows action on a cancellation request that has
arrived not only after the event for which the thread is waiting, but
also after the side effects of the function (such as IO, consuming a
signal, allocating or deallocating a file descriptor, etc.). The
relevant text seems to be:
The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]. Any such side-effects occur before any cancellation cleanup handlers are called.
My reading of this is that, since a call to open that fails with
EINTR does not allocate a file descriptor, a cancelled call to
open must not do so either. However, the issue remains unresolved
for close (which happens also to be the most dangerous issue) since
POSIX gives implementations the freedom to choose whether to
deallocate the file descriptor on EINTR:
If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to [EINTR] and the state of fildes is unspecified.
Even worse, the ‘correct’ behavior for close under EINTR (or any
error) is to always deallocate the file descriptor. This is what Linux
does anyway, and the reasons why it’s the preferred behavior are way
beyond the scope of this post. However, the ‘correct’ behavior for
close under cancellation is obviously to have no side effects, since
an application has no way to distinguish between cases where
a pending cancellation request was acted upon before attempting to
close the file (thus having no side effects) and cases where the
cancellation request arrived during close.
As such, it seems the only safe way to use close with cancellation
is by wrapping it in calls to pthread_setcancelstate that disable
cancellation for the duration of the close call. This is probably
harmless since close cannot in fact block except on special devices
that hook the close event. Even POSIX makes it clear that the
possibility of interruptable blocking in close should be avoided by
implementations, stating in the Rationale (non-normative) section for
the close function:
The use of interruptible device close routines should be discouraged to avoid problems with the implicit closes of file descriptors by exec and exit(). This volume of POSIX.1-2008 only intends to permit such behavior by specifying the [EINTR] error condition.
The ideal remedies I would like to see for this issue are:
For glibc/NPTL to fix this issue, regardless of whether it’s a bug or just a quality-of-implementation issue. Various non-invasive fixes are possible, but they all amount to having the cancellation-request handling code examine the program counter register for the point at which the thread was interrupted to see if the system call has completed yet at the point the request was received. In musl libc, we do that by having labels in the assembly language and comparing the saved PC against them, but a DWARF2-annotation-based approach could also be devised which might be more appealing to glibc developers.
For POSIX to clarify that cancellation must not be acted upon when any side effects of the function have already taken place. This will allow application programmers targetting conforming systems to actually use the cancellation interfaces for the purposes they were intended for.
For POSIX to remove close from the list of cancellation points.
I’ve never seen a situation where its being a cancellation point is
beneficial, and the ambiguity of its behavior under EINTR, combined
with the way cancellation side effects are specified in terms of
EINTR behavior, just makes it unnecessarily ugly and difficult to
use.
So far, I’ve filed bug #12683 with glibc, but have not yet pursued anything with the Austin Group for clarifying the requirements of the standard.
Status: OPEN
Welcome to EWONTFIX, a blog about, well, bugs. Especially longstanding unfixed ones in C code for Linux or Unix-like systems. The idea for this blog grew out of conversations during the development of musl libc. Aside from the fact that longstanding bugs in glibc were one of the original motivations for musl, it turns out that developing a libc leads to spending a lot of time building and testing applications. And in the process of testing, one ends up reading a lot of source. And a lot of source is appallingly bad.
Most low-quality source code just isn’t that interesting to write about. It’s more just a matter of identifying the problems, submitting them to bug trackers, and following up until somebody fixes things. However there are also a good deal of cases where buggy code is interesting to discuss. These fall mostly under two major categories:
The defect is endemic, in the sense that it’s become a widely-copied idiom for other software written in the same language.
The defect is in widely-used library code and has serious implications for any application using that code; however, most users of the library are at best vaguely aware that the defect even exists.
In addition to providing analysis of bugs and design flaws, I aim to make the posts on EWONTFIX informative with regard to the broader context in which the flaw appears, covering topics such as thread cancellation, commit charge and overcommit, and async-signal-safety.
This blog is presently utilizing Disqus for the comment system. Disqus is a third-party service that handles all the dynamic content off-site so I don’t have to worry about server load, denial of service, and privilege escalation issues on the actual server EWONTFIX runs on. Anonymous comments are permitted. I’m still looking for a way to offer (at least read-only) access to the comments to visitors without JS/AJAX style web browsers. I’m sure some of this blog’s target audience is unhappy with the current comment situation, but on the bright side, at least you guys can’t post to flame about it.
Attempts at humor aside, I would love a ready-made solution to pull
static comments from Disqus in a form that’s ready for pasting into
the page’s <noscript> block, or even better, a gateway to submit
comments from a pure-HTML client. Readers, if you have any leads on
this or want to contribute scripts for it, please post in the comments
(after finding access to a browser that’s capable of posting).
Apart from using Disqus for comments, EWONTFIX is running entirely on static content built using GNU make and simple, portable shell scripts. All content source is written in Markdown. The site is styled with clean, simple CSS. At some point I may take a detour into meta-blogging to discus the design and implementation of the site itself — including, of course, bugs found in the process.