Mission Control at the Johnson Space Center in Houston is switching to Unix systems for real-time data acquisition. Hmmm.

Catching Bugs Is Socially Unacceptable

Not checking for and not reporting bugs makes a manufacturer’s machine seem more robust and powerful than it actually is. More importantly, if Unix machines reported every error and malfunction, no one would buy them! This is a real phenomenon.

    Date: Thu, 11 Jan 90 09:07:05 PST
    From: Daniel Weise daniel@mojave.stanford.edu
    To: UNIX-HATERS
    Subject: Now, isn’t that clear?

    Due to HP engineering, my HP Unix boxes REPORT errors on the net that they see that affect them. These HPs live on the same net as SUN, MIPS, and DEC workstations. Very often we will have a problem because of another machine, but when we inform the owner of the other machine (who, because his machine throws away error messages, doesn’t know his machine is hosed and spending half its time retransmitting packets), he will claim the problem is at our end because our machine is reporting the problem!

In the Unix world the messenger is shot.

If You Can’t Fix It, Restart It!

So what do system administrators and others do with vital software that doesn’t properly handle errors, bad data, and bad operating conditions? Well, if it runs OK for a short period of time, you can make it run for a long period of time by periodically restarting it. The solution isn’t very reliable, nor scalable, but it is good enough to keep Unix creaking along.

Here’s an example of this type of workaround, which was put in place to keep mail service running in the face of an unreliable named program:

    Date: 14 May 91 05:43:35 GMT
    From: tytso@athena.mit.edu (Theodore Ts’o) [4]
    Subject: Re: DNS performance metering: a wish list for bind 4.8.4
    Newsgroups: comp.protocols.tcp-ip.domains

    This is what we do now to solve this problem: I’ve written a program called “ninit” that starts named in nofork mode and waits for it to exit. When it exits, ninit restarts a new named. In addition, every 5 minutes, ninit wakes up and sends a SIGIOT to named. This causes named to dump statistical information to /usr/tmp/named.stats. Every 60 seconds, ninit tries to do a name resolution using the local named. If it fails to get an answer back in some short amount of time, it kills the existing named and starts a new one.

    We are running this on the MIT nameservers and our mailhub. We find that it is extremely useful in catching nameds that die mysteriously or that get hung for some unknown reason. It’s especially useful on our mailhub, since our mail queue will explode if we lose name resolution even for a short time.

Of course, such a solution leaves open an obvious question: how to handle a buggy ninit program?

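The restart loop described in the message above can be made concrete with a short sketch. This is not the ninit source, which the message does not show; it is a minimal Python illustration assuming a child that stays in the foreground and a caller-supplied probe. The names Supervisor, health_check, check_every, stats_every, stats_signal, and demo are all hypothetical.

```python
import subprocess
import sys
import time


class Supervisor:
    """Restart-on-failure loop in the spirit of the ninit description (a sketch)."""

    def __init__(self, argv, health_check, check_every=60.0,
                 stats_every=300.0, stats_signal=None):
        self.argv = argv                  # command that runs in the foreground
        self.health_check = health_check  # hypothetical probe: returns True if healthy
        self.check_every = check_every    # the message probes every 60 seconds
        self.stats_every = stats_every    # the message signals every 5 minutes
        self.stats_signal = stats_signal  # the message uses SIGIOT; None disables it
        self.child = None
        self.starts = 0

    def start(self):
        self.child = subprocess.Popen(self.argv)
        self.starts += 1

    def kill(self):
        # Only kill a child that is still running, then reap it.
        if self.child is not None and self.child.poll() is None:
            self.child.kill()
            self.child.wait()

    def run(self, duration):
        """Supervise for `duration` seconds, stop the child, return the start count."""
        self.start()
        now = time.monotonic()
        deadline = now + duration
        next_check = now + self.check_every
        next_stats = now + self.stats_every
        while time.monotonic() < deadline:
            time.sleep(0.05)
            now = time.monotonic()
            if self.child.poll() is not None:
                # The child exited on its own: start a new one.
                self.start()
                continue
            if self.stats_signal is not None and now >= next_stats:
                self.child.send_signal(self.stats_signal)
                next_stats = now + self.stats_every
            if now >= next_check:
                if not self.health_check():
                    # The child is alive but not answering: replace it.
                    self.kill()
                    self.start()
                next_check = now + self.check_every
        self.kill()
        return self.starts


def demo():
    # Hypothetical stand-in for a flaky server: a child that exits after 0.2 s.
    flaky = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    sup = Supervisor(flaky, health_check=lambda: True)
    print("child was started", sup.run(2.0), "times")
```

Calling demo() supervises a deliberately short-lived child for two seconds and reports how many times it had to be started. In a setup like the one in the message, health_check would attempt a lookup against the local server with a short timeout. The sketch polls on a fixed interval for simplicity, and nothing in it watches the supervisor itself.
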
Write another program to fork ninits when they die for “unknown reasons”? But how do you keep that program running?

Such an attitude toward errant software is not unique. The following man page recently crossed our desk. We still haven’t figured out whether it’s a joke or not. The BUGS section is revealing, as the bugs it lists are the usual bugs that Unix programmers never seem to be able to expunge from their server code:

[4] Forwarded to UNIX-HATERS by Henry Minsky.

NANNY(8)                Unix Programmer’s Manual                NANNY(8)