What’s wrong with this C function?
char *my_strchr(char *s, int c)
{
    int i;
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}
Unless its interface contract requires that the caller pass a string no longer than INT_MAX, it can invoke undefined behavior due to integer overflow, most likely resulting in a crash. Even if you change the type to unsigned instead of int to avoid the signed overflow issue, the function will still be buggy; now, it just goes into an infinite loop if the input string is longer than UINT_MAX.
The solution, simple enough, is to use the right type for i: size_t. Strictly speaking, this could still fail in the same way, since the C language places no requirement on implementations that they not support objects too large for size_t; however, given that the standard library functions all use size_t, even on such an implementation one would have to go out of the way to produce such an over-sized object (for example, using calloc or a non-standard allocation function). And any reasonable implementation of C will have size_t sufficiently large to store any object size.
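For reference, here is a minimal sketch of that fix; the only change from the function above is the declaration of i, everything else is left exactly as it was:

char *my_strchr(char *s, int c)
{
    size_t i; /* size_t can index any ordinary object, so i cannot wrap before reaching the terminator */
    for (i=0; s[i]!=c; i++)
        if (!s[i]) return 0;
    return &s[i];
}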
Unfortunately, I regularly encounter a surprising degree of resistance to the idea that this type of code must use size_t to be correct.
The most infamous example is the 64-bit vulnerability in qmail, which D. J. Bernstein refused to acknowledge and award a bounty for, claiming that it’s not a bug. Yet even to this day I encounter quite a few programmers who claim they should be allowed to use int instead of size_t for loop counters involving array element counts.
The argument for using int basically comes down to a claim that inputs with more than 2 billion elements are (1) pathological, (2) malicious, and (3) should have already been rejected at some earlier stage. I mostly agree with the first two points. Unless you’re writing software that’s intended to deal with gigantic volumes of data (in which case, I’ll presume you know what you’re doing), gigantic inputs are probably bad inputs. Okay, fair enough.
The problem with the argument for using int instead of size_t arises when you get to the third point: the idea that large inputs should already have been rejected. This is basically a cheap way of pushing the blame/responsibility onto somebody else, and while that may be valid, writing your library functions this way makes them extremely inefficient to interoperate with code that’s not making the same assumptions.
As an example, suppose the above my_strchr is being called from code that can’t assume its inputs were already length-limited below INT_MAX. The only way to safely pass this input to my_strchr is to call strlen (or strnlen, with n==INT_MAX) on it first to validate that it’s not too long. Even if the typical runtime of my_strchr is low (because you expect to find the character in the first hundred bytes or so), you now have to scan the whole string (or at least up to some large limit) first, not because you’ll actually be using the data, but just to check that the string meets the awkward interface contract of my_strchr. You’ve replaced the good typical-case runtime with a guaranteed worst-case runtime, just for the sake of gluing together interfaces that don’t play nice with each other.
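To make that cost concrete, here is a hedged sketch of what such a caller would have to do on every call; the wrapper name is hypothetical, and the INT_MAX bound comes purely from my_strchr’s use of int, not from anything about the data:

#include <limits.h>
#include <string.h>

char *my_strchr(char *s, int c); /* the function above */

/* A caller that cannot assume short inputs must pre-scan up to INT_MAX
   bytes just to prove the string satisfies my_strchr's contract, even
   when the character would have been found in the first few bytes. */
char *careful_strchr(char *s, int c)
{
    if (strnlen(s, INT_MAX) == INT_MAX)
        return 0; /* too long to pass safely; treat as not found or report an error */
    return my_strchr(s, c);
}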
One could argue here that I’ve attacked a straw man; certainly the my_strchr function could be improved to bail out early if i passes some arbitrary limit. This actually does solve the problem, and used consistently, this sort of approach works just as well as using size_t, as long as you never intend to support objects larger than INT_MAX. Of course, the same bounded functions using size_t in place of int would work just as well, and they would be more general and flexible, in that they could be reused in applications requiring support for huge data.
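For illustration, here is a hedged sketch of such a bounded variant written with size_t; the name and the explicit length parameter are my own, not a standard interface:

#include <stddef.h>

/* Search at most n bytes of s for c, stopping early at the terminator.
   Because the counter and the bound are size_t, the same code serves
   callers that want a small sanity limit and callers that genuinely
   need huge objects. */
char *my_strchr_bounded(char *s, int c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (s[i] == c) return &s[i];
        if (!s[i]) return 0;
    }
    return 0;
}

Here the caller chooses the bound explicitly (for example, n equal to INT_MAX to mimic the int-based limit, or SIZE_MAX for no practical limit) instead of having it hard-coded by the type of the loop counter.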