Mission Control at the Johnson Space Center in Houston is switching to Unix systems for real-time data acquisition. Hmmm.

Catching Bugs Is Socially Unacceptable

Not checking for and not reporting bugs makes a manufacturer’s machine seem more robust and powerful than it actually is. More importantly, if Unix machines reported every error and malfunction, no one would buy them! This is a real phenomenon.

    Date: Thu, 11 Jan 90 09:07:05 PST
    From: Daniel Weise daniel@mojave.stanford.edu
    To: UNIX-HATERS
    Subject: Now, isn’t that clear?

    Due to HP engineering, my HP Unix boxes REPORT errors on the net that they see that affect them. These HPs live on the same net as SUN, MIPS, and DEC workstations. Very often we will have a problem because of another machine, but when we inform the owner of the other machine (who, because his machine throws away error messages, doesn’t know his machine is hosed and spending half its time retransmitting packets), he will claim the problem is at our end because our machine is reporting the problem!

    In the Unix world the messenger is shot.
If You Can’t Fix It, Restart It!

So what do system administrators and others do with vital software that doesn’t properly handle errors, bad data, and bad operating conditions? Well, if it runs OK for a short period of time, you can make it run for a long period of time by periodically restarting it. The solution isn’t very reliable, nor scalable, but it is good enough to keep Unix creaking along.

Here’s an example of this type of workaround, which was put in place to keep mail service running in the face of an unreliable named program:

    Date: 14 May 91 05:43:35 GMT
    From: tytso@athena.mit.edu (Theodore Ts’o)4
    Subject: Re: DNS performance metering: a wish list for bind 4.8.4
    Newsgroups: comp.protocols.tcp-ip.domains

    This is what we do now to solve this problem: I’ve written a program called “ninit” that starts named in nofork mode and waits for it to exit. When it exits, ninit restarts a new named. In addition, every 5 minutes, ninit wakes up and sends a SIGIOT to named. This causes named to dump statistical information to /usr/tmp/named.stats. Every 60 seconds, ninit tries to do a name resolution using the local named. If it fails to get an answer back in some short amount of time, it kills the existing named and starts a new one.

    We are running this on the MIT nameservers and our mailhub. We find that it is extremely useful in catching nameds that die mysteriously or that get hung for some unknown reason. It’s especially useful on our mailhub, since our mail queue will explode if we lose name resolution even for a short time.

Of course, such a solution leaves open an obvious question: how to handle a buggy ninit program? Write another program to fork ninits when they die for “unknown reasons”? But how do you keep that program running?

Such an attitude toward errant software is not unique. The following man page recently crossed our desk. We still haven’t figured out whether it’s a joke or not.
The BUGS section is revealing, as the bugs it lists are the usual bugs that Unix programmers never seem to be able to expunge from their server code:

NANNY(8)                Unix Programmer's Manual                NANNY(8)

4 Forwarded to UNIX-HATERS by Henry Minsky.