EWONTFIX

Unexpected observability of lock states

25 Oct 2012 03:20:38 GMT

This post is the first one about one of my own bugs, in musl. For a long time, I’ve had certain stdio functions, such as feof and ferror, forgo locking entirely and simply rely on the fact that, under the assumed memory model, reading the associated flags without a lock is safe. The problem is that, while this is safe, it’s not correct: in some cases it leads to observably incorrect behavior.
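A toy model may make the safe-but-not-correct distinction concrete. The `toy_stream` type and the `toy_*` functions below are hypothetical, not musl’s actual FILE internals: an atomic EOF flag stands in for the stream’s flags, and a mutex stands in for the stream lock. The lock-free query has no data race, yet it answers in the middle of an operation that is about to set the flag, while the locked query waits for that operation to finish.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a FILE: a lock plus an EOF flag. */
struct toy_stream {
    pthread_mutex_t lock;
    atomic_int eof;   /* atomic, so a lock-free read is not a data race */
};

/* Old approach: read the flag without locking.  Safe, but it may
   report state from the middle of another thread's operation. */
static int toy_feof_lockfree(struct toy_stream *s)
{
    return atomic_load(&s->eof);
}

/* Fixed approach: take the stream lock first, so the answer reflects
   only operations that have completed. */
static int toy_feof_locked(struct toy_stream *s)
{
    pthread_mutex_lock(&s->lock);
    int r = atomic_load(&s->eof);
    pthread_mutex_unlock(&s->lock);
    return r;
}

struct toy_probe {
    struct toy_stream *s;
    int (*query)(struct toy_stream *);
    int seen;
};

static void *toy_probe_run(void *p)
{
    struct toy_probe *pr = p;
    pr->seen = pr->query(pr->s);
    return NULL;
}

/* The calling thread plays an operation that holds the lock and ends
   at EOF; two other threads query the flag while it is in progress. */
static int toy_demo(int *lockfree_seen, int *locked_seen)
{
    struct toy_stream s = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct toy_probe a = { &s, toy_feof_lockfree, -1 };
    struct toy_probe b = { &s, toy_feof_locked, -1 };
    pthread_t ta, tb;

    pthread_mutex_lock(&s.lock);           /* operation begins */
    if (pthread_create(&ta, NULL, toy_probe_run, &a) != 0)
        goto fail;
    pthread_join(ta, NULL);                /* lock-free query saw 0 */
    if (pthread_create(&tb, NULL, toy_probe_run, &b) != 0)
        goto fail;
    atomic_store(&s.eof, 1);               /* operation reaches EOF */
    pthread_mutex_unlock(&s.lock);         /* operation completes */
    pthread_join(tb, NULL);                /* locked query waited, saw 1 */

    *lockfree_seen = a.seen;
    *locked_seen = b.seen;
    return 0;
fail:
    pthread_mutex_unlock(&s.lock);
    return -1;
}
```

The locked probe is started while the "operation" still holds the mutex, so it can only read the flag after the operation has set it.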

Per POSIX,

All functions that reference (FILE *) objects shall behave as if they use flockfile() and funlockfile() internally to obtain ownership of these (FILE *) objects.

(This requirement is tucked away in the specification of flockfile, so it’s not immediately apparent if you start out just reading the General Information in XSH Chapter 2.)

What this means is that, even if locking is not required to make a particular operation safe, the program’s observable behavior must be as if the locking happened. Obviously, since flockfile and funlockfile exist, observably wrong behavior can result if feof does not wait for the lock: for example, if thread A never unlocks a file F except when F is at end-of-file, then the behavior is observably wrong if thread B ever sees feof return zero for F.
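A minimal sketch of this scenario, assuming a POSIX threads environment: thread A (the caller) locks a temporary file, starts thread B, reads the file to end-of-file with getc_unlocked, and only then unlocks. Because feof must behave as if it called flockfile, B’s feof cannot run until A releases the lock, so it should report end-of-file; an implementation whose feof skips the lock could report zero instead. The names `probe_eof` and `lock_until_eof_demo` are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct probe {
    FILE *f;
    int saw_eof;
};

/* Thread B: feof behaves as if it calls flockfile first, so it
   cannot run while thread A owns the stream. */
static void *probe_eof(void *p)
{
    struct probe *pr = p;
    pr->saw_eof = feof(pr->f) != 0;
    return NULL;
}

/* Thread A: never releases F except at end-of-file.  Returns what B's
   feof reported (1 = EOF), or -1 on setup failure. */
static int lock_until_eof_demo(void)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    fputs("some data\n", f);
    rewind(f);

    struct probe pr = { f, -1 };
    pthread_t b;
    flockfile(f);                      /* A owns F from here on */
    if (pthread_create(&b, NULL, probe_eof, &pr) != 0) {
        funlockfile(f);
        fclose(f);
        return -1;
    }
    while (getc_unlocked(f) != EOF)    /* drain F while holding the lock */
        ;
    funlockfile(f);                    /* only now may B's feof proceed */
    pthread_join(b, NULL);
    fclose(f);
    return pr.saw_eof;
}
```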

This is the boring case. What’s interesting is whether, if flockfile and funlockfile were omitted (for example, in a statically linked program that does not use them), it would be valid to skip locking in functions like feof. (This could easily be achieved with weak symbols.) It turns out the answer is still no, and the reason is rather surprising: it’s actually possible to deduce that a thread is in the middle of a stdio operation. This runs contrary to the general principle that it’s usually impossible to deduce, without race conditions, whether a thread is executing a function, is just about to execute it, or has just finished executing it.

As one example, consider a program with two threads, A and B, where thread A has filled up the writing end of a pipe and observed (via select, poll, or an EWOULDBLOCK error) that further writes would block. Now thread B calls a stdio function to read from a stream attached to the reading end of the pipe. If thread A then observes that the pipe has become writable again, and there are no other readers, it can deduce that thread B’s operation on the stream has started, and therefore that thread B has obtained the lock. If thread A has not written enough data for thread B’s operation to finish (for example, if the function is fread and thread A wrote fewer than the requested number of bytes), it can also deduce that the operation has not completed. Thus the lock must be held by thread B, and any operation thread A performs on the stream must deadlock, since thread B is waiting for data that only thread A can supply. In particular, it would be wrong for feof to return zero, since by the time thread B’s operation completes, the stream may in fact have reached end-of-file status.
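The pipe argument can be sketched as a small program. This assumes Linux pipe semantics (a pipe filled until a nonblocking write fails with EAGAIN is not reported writable by poll until a reader drains it), and it uses ftrylockfile as a non-blocking probe in place of an operation that would deadlock. The 1 MiB request size and the names `reader` and `pipe_deduction_demo` are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum { REQUEST = 1 << 20 };   /* far more than the pipe can hold */

struct reader_arg {
    FILE *f;
    size_t got;               /* bytes returned by thread B's fread */
};

/* Thread B: one large fread on the read end of the pipe. */
static void *reader(void *p)
{
    static char buf[REQUEST];
    struct reader_arg *a = p;
    a->got = fread(buf, 1, sizeof buf, a->f);
    return NULL;
}

/* Thread A.  Returns 1 if B provably held the lock mid-fread and the
   stream was at EOF afterward, 0 otherwise, -1 on setup failure. */
static int pipe_deduction_demo(size_t *bytes_read)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Fill the pipe until a nonblocking write reports EAGAIN. */
    int fl = fcntl(fds[1], F_GETFL);
    if (fl < 0 || fcntl(fds[1], F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;
    char chunk[4096];
    memset(chunk, 'x', sizeof chunk);
    while (write(fds[1], chunk, sizeof chunk) > 0)
        ;
    if (errno != EAGAIN && errno != EWOULDBLOCK)
        return -1;

    struct reader_arg a = { fdopen(fds[0], "r"), 0 };
    if (!a.f)
        return -1;
    pthread_t b;
    if (pthread_create(&b, NULL, reader, &a) != 0)
        return -1;

    /* Wait for the pipe to become writable.  With no other readers,
       only B's fread can have drained it, so B's call has started. */
    struct pollfd pfd = { .fd = fds[1], .events = POLLOUT };
    int ok = poll(&pfd, 1, -1) == 1;

    /* B asked for REQUEST bytes and the pipe held far fewer, so its
       fread cannot have finished: B must still own the stream lock.
       Calling feof here instead would hang, since B waits on A. */
    int held_by_b = ok && ftrylockfile(a.f) != 0;
    if (ok && !held_by_b)
        funlockfile(a.f);     /* reached only on a nonconforming fread */

    close(fds[1]);            /* EOF lets B's fread return short */
    pthread_join(b, NULL);
    int at_eof = feof(a.f) != 0;
    *bytes_read = a.got;
    fclose(a.f);
    return held_by_b && at_eof;
}
```

On a conforming implementation the function should return 1 with a positive byte count smaller than the request.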

Another interesting issue that arises is the need for even fclose to perform locking. I originally reasoned that locking should not be needed in fclose, since any use of the stream pointer after fclose is called results in undefined behavior. However, if another thread provably holds the lock (either by having called flockfile, or by an argument such as the above example), the standard assigns well-defined behavior to the fclose call; fclose must wait to obtain the lock before closing the stream and freeing the associated resources. As long as the thread that held the lock does not make any further attempt to use the stream after the lock is released, the program has well-defined behavior, and implementations must support this usage.
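A sketch of the fclose case, under the same POSIX threads assumptions: thread B takes the stream lock with flockfile, does its work, tells thread A that it holds the lock, and after funlockfile never touches the stream again. Thread A calls fclose once it knows B holds the lock; a conforming fclose must wait for B’s funlockfile, so B’s `done` flag should already be set when fclose returns. The 50 ms sleep only widens the window in which a non-waiting fclose would be caught; the struct and function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

struct holder {
    FILE *f;
    pthread_mutex_t m;
    pthread_cond_t cv;
    int locked;   /* B has acquired the stream lock */
    int done;     /* B finished its work under the lock */
};

/* Thread B: own the stream for a while, then release it and never
   use the FILE again. */
static void *holder_thread(void *p)
{
    struct holder *h = p;
    flockfile(h->f);
    fputs("written while holding the lock\n", h->f);

    pthread_mutex_lock(&h->m);
    h->locked = 1;
    pthread_cond_signal(&h->cv);
    pthread_mutex_unlock(&h->m);

    struct timespec ts = { 0, 50 * 1000 * 1000 };  /* 50 ms */
    nanosleep(&ts, NULL);

    pthread_mutex_lock(&h->m);
    h->done = 1;
    pthread_mutex_unlock(&h->m);
    funlockfile(h->f);        /* last use of the stream by B */
    return NULL;
}

/* Thread A: once B provably holds the lock, call fclose.  Returns 1 if
   B had finished by the time fclose returned, 0 if not, -1 on error. */
static int fclose_waits_demo(void)
{
    struct holder h = { tmpfile(), PTHREAD_MUTEX_INITIALIZER,
                        PTHREAD_COND_INITIALIZER, 0, 0 };
    if (!h.f)
        return -1;
    pthread_t b;
    if (pthread_create(&b, NULL, holder_thread, &h) != 0) {
        fclose(h.f);
        return -1;
    }

    pthread_mutex_lock(&h.m);
    while (!h.locked)
        pthread_cond_wait(&h.cv, &h.m);
    pthread_mutex_unlock(&h.m);

    fclose(h.f);              /* must block until B's funlockfile */

    pthread_mutex_lock(&h.m);
    int finished = h.done;
    pthread_mutex_unlock(&h.m);
    pthread_join(b, NULL);
    return finished;
}
```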

As a consequence of this analysis, I’m fixing all the affected interfaces in musl to use locking.