190 Programming

Spend a few moments thinking about this problem and you can convince yourself that it is theoretically impossible to modify the Unix mv command so that it would have the functionality of the MS-DOS “rename” command.

So much for software tools.

Robustness, or “All Lines Are Shorter Than 80 Characters”

There is an amusing article in the December 1990 issue of Communications of the ACM entitled “An Empirical Study of the Reliability of Unix Utilities” by Miller, Fredriksen, and So. They fed random input to a number of Unix utility programs and found that they could make 24-33% (depending on which vendor’s Unix was being tested) of the programs crash or hang. Occasionally the entire operating system panicked.

The whole article started out as a joke. One of the authors was trying to get work done over a noisy phone connection, and the line noise kept crashing various utility programs. He decided to do a more systematic investigation of this phenomenon.

Most of the bugs were due to a number of well-known idioms of the C programming language. In fact, much of the inherent brain damage in Unix can be attributed to the C language. Unix’s kernel and all its utilities are written in C. The noted linguistic theorist Benjamin Whorf said that our language determines what concepts we can think. C has this effect on Unix: it prevents programmers from writing robust software by making such a thing unthinkable.

The C language is minimal. It was designed to be compiled efficiently on a wide variety of computer hardware and, as a result, has language constructs that map easily onto computer hardware.

At the time Unix was created, writing an operating system’s kernel in a high-level language was a revolutionary idea. The time has come to write one in a language that has some form of error checking.

C is a lowest-common-denominator language, built at a time when the lowest common denominator was quite low. If a PDP-11 didn’t have it, then C doesn’t have it.
The last few decades of programming language research have shown that adding linguistic support for things like error handling, automatic memory management, and abstract data types can make it dramatically easier to produce robust, reliable software. C incorporates none of these findings. Because of C’s popularity, there has been little motivation to add features such as data tags or hardware support for garbage collection into the last, current and next generation of microprocessors: these
features would amount to nothing more than wasted silicon since the majority of programs, written in C, wouldn’t use them.

Recall that C has no way to handle integer overflow. The solution when using C is simply to use integers that are larger than the problem you have to deal with, and hope that the problem doesn’t get larger during the lifetime of your program.

C doesn’t really have arrays either. It has something that looks like an array but is really a pointer to a memory location. There is an array indexing expression, array[index], that is merely shorthand for the expression (*(array + index)). Therefore it’s equally valid to write index[array], which is also shorthand for (*(array+index)). Clever, huh? This duality can be commonly seen in the way C programs handle character arrays. Array variables are used interchangeably as pointers and as arrays.

To belabor the point, if you have:

    char *str = "bugy";

…then the following equivalencies are also true:

    0[str] == 'b'
    *(str+1) == 'u'
    *(2+str) == 'g'
    str[3] == 'y'

Isn’t C grand?

The problem with this approach is that C doesn’t do any automatic bounds checking on the array references. Why should it? The arrays are really just pointers, and you can have pointers to anywhere in memory, right? Well, you might want to ensure that a piece of code doesn’t scribble all over arbitrary pieces of memory, especially if the piece of memory in question is important, like the program’s stack.

This brings us to the first source of bugs mentioned in the Miller paper. Many of the programs that crashed did so while reading input into a character buffer that was allocated on the call stack. Many C programs do this; the following C function reads a line of input into a stack-allocated array and then calls do_it on the line of input.