EWONTFIX - NULL considered harmful

NULL considered harmful

04 Jul 2013 03:25:02 GMT

The C and C++ languages define the macro NULL, widely taught as the correct way to write a literal null pointer. The motivations for using NULL are well-meaning; largely they come down to fact that it documents the intent, much like how a macro named FALSE might better document boolean intent than a literal 0 would do. Unfortunately, use of the NULL macro without fully understanding it can lead to subtle bugs and portability issues, some of which are difficult for compilers and static analysis tools to diagnose.

Despite it being superceded by the 2011 standard, I'm going to quote C99 because it's what I'm most familar with, and I suspect most readers are in the same situation. 7.17 specifies the NULL macro as:

NULL

which expands to an implementation-defined null pointer constant;

6.3.2.3 Pointers in turn defines null pointer constant as:

An integer constant expression with the value 0, or such an expression cast to type void *

And finally, for completeness, 6.6 defines integer constant expression as:

Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.

An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.

What's important to realize is that this allows a great deal of freedom with regard to how an implementation defines NULL. Some examples:

#define NULL 0

#define NULL ((void *)0)

#define NULL 0L

typedef void *__voidp;
#define NULL ((__voidp)0)

#define NULL (1ULL-1ULL)

#define NULL ('9'-'0'-9)

#define NULL (sizeof(void *)/(sizeof(char)-2))

In particular, the set of possible types NULL could have includes all integer types and void *. Fortunately, since these are all null pointer constants, in most contexts using NULL will get you The Right Thing no matter which type NULL might have. Pointer comparisons, assignments to pointers, the ?: operator, and so forth all treat null pointer constants in special ways such that the type of the null pointer constant mostly doesn't matter.

The main place the type of NULL does become an issue is with variadic (or non-prototyped) functions. Consider these examples:

printf("%p", NULL); // WRONG!
execl("/bin/true", "true", NULL); // WRONG!

The first is wrong, in general, because the %p specifier for printf requires an argument of type void *. The second example is always wrong because the execl function requires arguments of type char *, and NULL can never have type char *.

One might object that, despite being formally incorrect, the above should work, especially if NULL is defined with the typical C definition of ((void *)0). Assuming also that pointers of different types are treated identically by the calling convention, having type void * instead of char * would not matter when calling execl. (Arguably it may not matter anyway, since C specified that char * and void * have the same representation, but it's not entirely clear how this applies to arguments to variadic functions.)

Things get a lot uglier with C++. In C++, NULL was traditionally defined as plain 0, on account of the fact that C++ has no implicit conversion from void * to other pointer types. (Note that the current C++ standard also allows NULL to be defined as nullptr, and some non-conforming implmentations define it in other ways, such as __null.) So in many historical C++ implementations, if you pass NULL to a variadic function, bad things might happen.

And bad things did happen. We were encountering several hard-to-track-down crash bugs in certain applications (including evince and audacity) built against musl, which at the time defined NULL as 0 for C++ programs, and it turned out the source of the problem was that the C++ code was making calls to variadic C functions using NULL as a sentinel. The original report and further details are available in the mailing list archives for musl.

The crashes that were happening were on 64-bit archs (at the time, for us that meant just x86_64) and were, if I recall, "heisenbugs": inserting additional code to inspect what was going on made some or all of them go away. As it turns out, while x86_64 (and most 64-bit archs) uses entire registers or 64-bit stack slots for all arguments, even ones smaller than 64-bit, there's no guarantee in the ABI that the unused upper bits will be cleared by the caller. So, when the caller passed 0 as an int to a variadic function, the upper 32 bits of the stack slot contained whatever junk happened to be there before, and the callee's va_arg, attempting to read an argument of type void *, pulled the junk in as the upper bits of the pointer. Thus, manifestation of the bug was highly dependent both on the code executed just prior to the variadic call, and the compiler's choice of how to setup the arguments (e.g. push versus mov).

As far as what we would do about the issue, I've been on the fence for a long time about the relative merits of trying to work around incorrect, non-portable applications versus actively breaking them and pressuring their maintainers to fix them. However, the widespread incorrect use of NULL seemed overwhelming, and developers seemed very reluctant to fix the issue. Fortunately, we came up with what I believe is a really good solution on musl's side: replacing this:

#ifdef __cplusplus
#define NULL 0
#else
#define NULL ((void *)0)
#endif

with this:

#define NULL 0L

(Note that musl assumes a type model where long and pointers have the same size; this is also assumed by Linux and is fundamental to the way system calls work, so it's not an unreasonable assumption for us.)

By making this change, all programs that assume they can pass NULL as a pointer argument to variadic functions (sentinel role or otherwise) continue to work as if NULL were defined as ((void *)0). However, GCC does not consider 0L a valid "sentinel" value (since it has the wrong type) for functions declared with the sentinel attribute. So what we have achieved is making incorrect programs work, but produce warnings where the code is incorrect.

One initial concern I had with this approach is that the matter of whether NULL has pointer or integer type might be observable to programs in such a way that the change from ((void *)0) to 0L could break some incorrect C programs. However, after some of the smart folks over at Stack Overflow tried to find ways a program might detect whether NULL is defined with integer or pointer type and failed, I deemed the change safe, and our developers and users have not run into any problems since.

It's been some time now (about 6 months) since the issue was raised and addressed in musl, but I just recently reported the sentinel issue in Busybox and submitted a patch to fix it. While getting the patch accepted was painless, I mention this anecdote as a possible explanation for why the NULL sentinel bug is so prevalent: after the patch was accepted, I received several follow-up emails chipping in with misconceptions about NULL. The impression I was left with is that everybody seems to think NULL is a much simpler topic than it really is.

So what would I recommend programmers do with NULL? After all this trouble with variadic function arguments, I've just basically stopped using NULL completely. A simple 0 is shorter and usually clear when initializing or returning a pointer, and for null pointer checks, I usually use the ! operator rather than ==NULL anyway. However, I really don't have any opinion on what others should do; from a correctness standpoint, what matters is simply not making assumptions about the type of NULL. Anywhere the type of the expression matters, you really need to use a cast, which you could apply to NULL, or simply to 0.

As a final remark, I don't think this article would be complete without mentioning a similar article, albeit much shorter and less detailed, written by Jens Gustedt over two years ago.