If You Can’t Fix It, Restart It!

So what do system administrators and others do with vital software that doesn’t properly handle errors, bad data, and bad operating conditions? Well, if it runs OK for a short period of time, you can make it run for a long period of time by periodically restarting it. The solution isn’t very reliable, nor scalable, but it is good enough to keep Unix creaking along.

Here’s an example of this type of workaround, which was put in place to keep mail service running in the face of an unreliable named program:

    Date: 14 May 91 05:43:35 GMT
    From: tytso@athena.mit.edu (Theodore Ts’o)4
    Subject: Re: DNS performance metering: a wish list for bind 4.8.4
    Newsgroups: comp.protocols.tcp-ip.domains

    This is what we do now to solve this problem: I’ve written a program called “ninit” that starts named in nofork mode and waits for it to exit. When it exits, ninit restarts a new named. In addition, every 5 minutes, ninit wakes up and sends a SIGIOT to named. This causes named to dump statistical information to /usr/tmp/named.stats. Every 60 seconds, ninit tries to do a name resolution using the local named. If it fails to get an answer back in some short amount of time, it kills the existing named and starts a new one.

    We are running this on the MIT nameservers and our mailhub. We find that it is extremely useful in catching nameds that die mysteriously or that get hung for some unknown reason. It’s especially useful on our mailhub, since our mail queue will explode if we lose name resolution even for a short time.

Of course, such a solution leaves open an obvious question: how to handle a buggy ninit program? Write another program to fork ninits when they die for “unknown reasons”? But how do you keep that program running?

Such an attitude toward errant software is not unique. The following man page recently crossed our desk. We still haven’t figured out whether it’s a joke or not.
The BUGS section is revealing, as the bugs it lists are the usual bugs that Unix programmers never seem to be able to expunge from their server code:

4 Forwarded to UNIX-HATERS by Henry Minsky.

NANNY(8)              Unix Programmer's Manual              NANNY(8)
NAME
    nanny - A server to run all servers

SYNOPSIS
    /etc/nanny [switch [argument]] [...switch [argument]]

DESCRIPTION
    Most systems have a number of servers providing utilities for the system and its users. These servers, unfortunately, tend to go west on occasion and leave the system and/or its users without a given service. Nanny was created and implemented to oversee (babysit) these servers in the hopes of preventing the loss of essential services that the servers are providing without constant intervention from a system manager or operator.

    In addition, most servers provide logging data as their output. This data has the bothersome attribute of using up the disk space where it is being stored. On the other hand, the logging data is essential for tracing events and should be retained when possible. Nanny deals with this overflow by being a go-between and periodically redirecting the logging data to new files. In this way, the logging data is partitioned such that old logs are removable without disturbing the newer data.

    Finally, nanny provides several control functions that allow an operator or system manager to manipulate nanny and the servers it oversees on the fly.

SWITCHES
    ....

BUGS
    A server cannot do a detaching fork from nanny. This causes nanny to think that the server is dead and start another one time and time again.

    As of this time, nanny can not tolerate errors in the configuration file. Thus, bad file names or files that are not really configuration files will make nanny die.

    Not all switches are implemented.

    Nanny relies very heavily on the networking facilities provided by the system to communicate between processes. If the network code produces errors, nanny can not tolerate the errors and will either wedge or loop.
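
The restart-on-failure loop that both the ninit message and the nanny page describe can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of either program: the function name supervise, the is_healthy callback, and the max_restarts limit are all hypothetical, and the periodic statistics signal that ninit sends is left out. The child is restarted when it exits, or when it is still running but the health probe fails.

```python
import subprocess


def supervise(cmd, is_healthy, check_interval=60.0, max_restarts=3):
    """Run cmd, restarting it whenever it exits or fails is_healthy().

    Hypothetical sketch: returns the number of restarts performed once
    the restart budget (max_restarts) is used up.
    """
    restarts = 0
    child = subprocess.Popen(cmd)
    try:
        while True:
            try:
                # Sleep until the next probe, waking early if the child exits.
                child.wait(timeout=check_interval)
                exited = True
            except subprocess.TimeoutExpired:
                exited = False

            if not exited and is_healthy():
                continue  # still running and still answering

            if not exited:
                # Running but not answering: treat it as hung and kill it.
                child.kill()
                child.wait()

            if restarts >= max_restarts:
                return restarts
            restarts += 1
            child = subprocess.Popen(cmd)
    finally:
        # Do not leave a child behind if the loop is interrupted.
        if child.poll() is None:
            child.kill()
            child.wait()
```

For a name server, is_healthy might attempt a lookup against the local server and report failure if no answer arrives within a short timeout. The sketch also inherits the first bug on nanny's list: a command that forks and detaches makes wait() return at once, so the supervisor concludes the server has died and starts another, which is presumably why ninit runs named in nofork mode. And, as with ninit, nothing here restarts the supervisor itself.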