EWONTFIX

The worst of time64 breakage

15 Feb 2020 19:25:16 GMT

In preparing to release musl 1.2.0, I worked with distro maintainers from Adélie Linux and Yoe to find serious application compatibility problems users would hit when upgrading, so that we could have patches ready and reduce user frustration with the upgrade. Here are some of the findings.

By far the most dangerous type of app compatibility issue we found was in Berkeley DB 5.x, which defines its own wrong version of the timespec struct to pass to clock_gettime:

typedef struct {
    time_t  tv_sec;             /* seconds */
#ifdef HAVE_MIXED_SIZE_ADDRESSING
    int32_t tv_nsec;
#else
    long    tv_nsec;            /* nanoseconds */
#endif
} db_timespec;

The type of tv_sec in their redefinition was right (time_t), using the libc type for time_t. However, the layout of the rest of the struct failed to match. The libc timespec struct is (prettified and removing conditionals):

struct timespec {
    time_t tv_sec;
    long tv_nsec;
    long :32;
};

with the padding and tv_nsec swapped for big-endian archs, so as to line up with the low bits of a 64-bit field. However the padding matters even on little-endian archs, since otherwise the structure may be only 12 bytes long instead of 16 bytes (on archs that don't impose 8-byte alignment). This includes i386, sh, or1k, m68k, and others. So, when Berkeley DB declared its own version without the padding, it passed a 12-byte object to a function expecting a 16-byte one, resulting in stack smashing. Thankfully, stack-protector caught it; otherwise it might have yielded branch to a location specified by current time in nanoseconds mod 1 second.

Note that even without the size/layout mismatch, defining a custom type to pass to clock_gettime was undefined behavior and should never have been done.

(Reportedly there are later versions of Berkeley DB, which may or may not have already fixed the bug, but after taking over maintenance of it Oracle went and relicensed the code to be license-incompatible with basically every piece of software ever to use it, so in practice everyone uses 5.x.)

A similarly dangerous variant of this kind of issue is bypassing libc declarations to declare time-related functions manually. This is forbidden by C/POSIX except for functions that can be declared without the use of any types defined by the standard library, and in most cases should produce compile-time errors if the declarations mismatch, since making the types visible usually also makes the libc declarations visible. However, some Qt 2.x code copy-and-pasted into Doxygen managed to do this without an error by using the preprocessor to suppress the declaration of gettimeofday, then subsequently declaring the function itself:

#define gettimeofday    __hide_gettimeofday
#include "qdatetime.h"
#include "qdatastream.h"
#include <stdio.h>
#include <time.h>
#if defined(_OS_WIN32_)
#if defined(_CC_BOOL_DEF_)
#undef  bool
#include <windows.h>
#define bool int
#else
#include <windows.h>
#endif
#elif defined(_OS_MSDOS_)
#include <dos.h>
#elif defined(_OS_OS2_)
#include <os2.h>
#elif defined(_OS_UNIX_) || defined(_OS_MAC_)
#include <sys/time.h>
#include <unistd.h>
#undef  gettimeofday
extern "C" int gettimeofday( struct timeval *, struct timezone * );
#endif

Purportedly the reason for this hack was to deal with ambiguity over whether the secont argument is struct timespec * or void *, since C++ can't deal with implicit conversions from void *. However that's irrelevant since a null pointer implicitly converts to either and they pass a null pointer (because a null pointer is the only valid argument value).

Fortunately, this error manifests in the opposite direction: causing the old time32 function to get called, resulting in a smaller write to the correct-sized timeval type rather than an oversized write to an incorrect timeval type. But it still produced a crash, since the Qt code passed the resulting junk time_t value to localtime, then dereferenced the reult without checking for error (and localtime returned null because the bogus time_t value overflowed tm_year when converting to struct tm).

Other issues we found were a lot less spectacular, but may merit a second (or third) post later.