EWONTFIX

Stubborn and ignorant use of int where size_t is needed

25 Oct 2012 23:48:02 GMT

What’s wrong with this C function?

char *my_strchr(char *s, int c)
{
    int i;
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}

Unless its interface contract requires that the caller pass a string no longer than INT_MAX, it can invoke undefined behavior due to signed integer overflow, most likely resulting in a crash. Even if you change the type of i to unsigned instead of int to avoid the signed-overflow issue, the function is still buggy; now it just goes into an infinite loop if the input string is longer than UINT_MAX and the sought character does not appear among the first UINT_MAX+1 bytes.

The solution, simple enough, is to use the right type for i: size_t. Strictly speaking, this could still fail in the same way, since the C language does not forbid implementations from supporting objects too large for size_t; however, given that the standard library functions all take sizes as size_t, even on such an implementation one would have to go out of one’s way to produce such an over-sized object (for example, by using calloc or a non-standard allocation function). And any reasonable implementation of C will make size_t large enough to represent the size of any object it supports.
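
Concretely, the fix changes only the declaration of i. A minimal corrected sketch:

#include <stddef.h>

char *my_strchr(char *s, int c)
{
    size_t i;  /* size_t can index any object the implementation supports */
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}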

Unfortunately, I regularly encounter a surprising degree of resistance to the idea that this type of code must use size_t to be correct. The most infamous example is the 64-bit vulnerability in qmail, which D. J. Bernstein refused to acknowledge or pay a bounty for, claiming that it’s not a bug; but even to this day I encounter quite a few programmers who claim they should be allowed to use int instead of size_t for loop counters involving array element counts.

The argument for using int basically comes down to a claim that inputs with more than 2 billion elements are (1) pathological, (2) malicious, and (3) should have already been rejected at some earlier stage. I mostly agree with the first two points. Unless you’re writing software that’s intended to deal with gigantic volumes of data (in which case, I’ll presume you know what you’re doing), gigantic inputs are probably bad inputs. Okay, fair enough.

The problem with the argument for using int instead of size_t arises when you get to the third point: the idea that large inputs should already have been rejected. This is basically a cheap way of pushing the blame and responsibility onto somebody else, and while it may be valid in principle, writing your library functions this way makes them extremely inefficient to use from code that’s not making the same assumptions.

As an example, suppose the above my_strchr is being called from code that can’t assume its inputs were already length-limited below INT_MAX. The only way to safely pass this input to my_strchr is to call strlen (or strnlen, with n==INT_MAX) on it first to validate that it’s not too long. Even if the typical runtime of my_strchr is low (because you expect to find the character in the first hundred bytes or so), you now have to scan the whole string (or at least up to some large limit) first — not because you’ll actually be using the data, but just to check that the string meets the awkward interface contract of my_strchr. You’ve replaced the good typical-case runtime with a guaranteed worst-case runtime, just for the sake of gluing together interfaces that don’t play nice with each other.
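
To make the cost concrete, here is a rough sketch of what such a caller ends up doing; the wrapper name and the exact rejection policy are mine, purely for illustration:

#include <limits.h>
#include <string.h>

char *my_strchr(char *s, int c);

/* Sketch: before my_strchr can be called safely, the whole string has to
   be scanned (up to the INT_MAX bound) just to verify it is short enough.
   The bound is chosen so that a result greater than INT_MAX definitively
   means the string is too long. */
char *checked_strchr(char *s, int c)
{
    if (strnlen(s, (size_t)INT_MAX + 1) > (size_t)INT_MAX)
        return 0;               /* reject over-long input */
    return my_strchr(s, c);
}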

One could argue here that I’ve attacked a straw man — certainly the my_strchr function could be improved to bail out early if i passes some arbitrary limit. This actually does solve the problem, and used consistently, this sort of approach works just as well as using size_t as long as you never intend to support objects larger than INT_MAX. Of course, the same bounded functions using size_t in place of int would work just as well, and they would be more general and flexible, in that they could be reused in applications requiring support for huge data.
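
For completeness, one such bounded variant might look like the following sketch; the function name and the explicit limit parameter are illustrative assumptions, not part of any existing interface:

#include <stddef.h>

/* Sketch of a bounded search: give up once the index reaches an explicit
   limit, so the worst case is capped no matter how long the input is. */
char *my_strchr_bounded(char *s, int c, size_t limit)
{
    size_t i;
    for (i=0; i<limit; i++) {
        if (s[i]==c) return &s[i];
        if (!s[i]) return 0;
    }
    return 0;   /* limit reached without finding c or the terminator */
}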