It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all.
This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit.
Adapted from their website:
Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.
It is often inconvenient to need to write "rc.d" scripts for every single process instance. rc.d scripts are a great lowest-common-denominator form of process initialization/autostart/management, but they can be painful to write and maintain. Additionally, rc.d scripts cannot automatically restart a crashed process and many programs do not restart themselves properly on a crash. Supervisord starts processes as its subprocesses, and can be configured to automatically restart them on a crash. It can also automatically be configured to start processes on its own invocation.
It's often difficult to get accurate up/down status on processes on UNIX. Pidfiles often lie. Supervisord starts processes as subprocesses, so it always knows the true up/down status of its children and can be queried conveniently for this data.
Users who need to control process state often need only to do that. They don't want or need full-blown shell access to the machine on which the processes are running. Supervisorctl allows a very limited form of access to the machine, essentially allowing users to see process status and control supervisord-controlled subprocesses by emitting "stop", "start", and "restart" commands from a simple shell or web UI.
Users often need to control processes on many machines. Supervisor provides a simple, secure, and uniform mechanism for interactively and automatically controlling processes on groups of machines.
Processes which listen on "low" TCP ports often need to be started and restarted as the root user (a UNIX misfeature). It's usually the case that it's perfectly fine to allow "normal" people to stop or restart such a process, but providing them with shell access is often impractical, and providing them with root access or sudo access is often impossible. It's also (rightly) difficult to explain to them why this problem exists. If supervisord is started as root, it is possible to allow "normal" users to control such processes without needing to explain the intricacies of the problem to them.
Processes often need to be started and stopped in groups, sometimes even in a "priority order". It's often difficult to explain to people how to do this. Supervisor allows you to assign priorities to processes, and allows user to emit commands via the supervisorctl client like "start all", and "restart all", which starts them in the preassigned priority order. Additionally, processes can be grouped into "process groups" and a set of logically related processes can be stopped and started as a unit.
Supervisor also has a web interface and an XMP-RPC interface:
A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an internet socket. Visit the server URL (e.g. http://localhost:9001/) to view and control process status through the web interface after activating the configuration file's [inet_http_server] section.
The same HTTP server which serves the web UI serves up an XML-RPC interface that can be used to interrogate and control supervisor and the programs it runs. To use the XML-RPC interface, connect to supervisor's http port with any XML-RPC client library and run commands against it. An example of doing this using Python's xmlrpclib client library is as follows.
PyCon Presentation: Supervisor as a Platform
Monitor Pylons application with supervisord