Paper: Making reliable distributed systems in the presence of software errors

Joe Armstrong is a co-inventor of Erlang and an all-around renaissance software tinkerer, as shown by his excellent work on writing a C compiler and his voluminous work on GitHub.

Given the success of Erlang, it's probably no surprise that he wrote his thesis on the groundbreaking ideas behind it: Making reliable distributed systems in the presence of software errors.

Even if you have yet to join the cult of Erlang, the principles behind it are universal and well worth exploring for your own designs. Highly recommended.

Introduction:

How can we program systems which behave in a reasonable manner in the presence of software errors? This is the central question that I hope to answer in this thesis. Large systems will probably always be delivered containing a number of errors in the software, nevertheless such systems are expected to behave in a reasonable manner.
To make a reliable system from faulty components places certain requirements on the system. The requirements can be satisfied, either in the programming language which is used to solve the problem, or in the standard libraries which are called by the application programs to solve the problem.
In this thesis I identify the essential characteristics which I believe are necessary to build fault-tolerant software systems. I also show how these characteristics are satisfied in our system.
Some of the essential characteristics are satisfied in our programming language (Erlang), others are satisfied in library modules written in Erlang. Together the language and libraries form a basis for building reliable software systems which function in an adequate manner even in the presence of programming errors.
Having said what my thesis is about, I should also say what it is not about. The thesis does not cover in detail many of the algorithms used as building blocks for constructing fault-tolerant systems—it is not the algorithms themselves which are the concern of this thesis, but rather the programming language in which such algorithms are expressed. I am also not concerned with hardware aspects of building fault-tolerant systems, nor with the software engineering aspects of fault-tolerance.
The concern is with the language, libraries and operating system requirements for software fault-tolerance. Erlang belongs to the family of pure message passing languages—it is a concurrent process-based language having strong isolation between concurrent processes. Our programming model makes extensive use of fail-fast processes. Such techniques are common in hardware platforms for building fault-tolerant systems but are not commonly used in software solutions. This is mainly because conventional languages do not permit different software modules to co-exist in such a way that there is no interference between modules. The commonly used threads model of programming, where resources are shared, makes it extremely difficult to isolate components from each other—errors in one component can propagate to another component and damage the internal consistency of the system.
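
To make the fail-fast, isolated-process idea concrete outside Erlang, here is a rough sketch in Python. It is not from the thesis and not how Erlang implements anything: it assumes ordinary OS processes as the unit of isolation (far heavier than Erlang processes) and line-based pipes as the message channel. The names WORKER_SOURCE, start_worker, reap and supervise are made up for this illustration. The worker does no defensive error handling and simply dies on bad input; the supervisor notices the death, records the failed request, and starts a fresh worker.

```python
import subprocess
import sys

# Hypothetical worker: reads one integer per line and replies with its
# reciprocal. It never catches errors; any bad input kills the process.
WORKER_SOURCE = r"""
import sys
for line in sys.stdin:
    n = int(line)               # ValueError on non-numeric input
    print(1 / n, flush=True)    # ZeroDivisionError on 0
"""


def start_worker():
    """Start the worker as a separate OS process that shares no memory with us."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the crash traceback in this sketch
        text=True,
    )


def reap(worker):
    """Close our ends of the pipes and wait for the process to exit."""
    for pipe in (worker.stdin, worker.stdout):
        try:
            pipe.close()
        except OSError:
            pass
    worker.wait()


def supervise(requests):
    """Send each request as a message; replace the worker whenever it dies."""
    worker = start_worker()
    results = []
    restarts = 0
    for req in requests:
        try:
            worker.stdin.write(req + "\n")
            worker.stdin.flush()
            reply = worker.stdout.readline()
        except BrokenPipeError:
            reply = ""
        if reply == "":
            # End of file on the reply pipe means the worker crashed.
            reap(worker)
            results.append(("error", req))
            worker = start_worker()
            restarts += 1
        else:
            results.append(("ok", float(reply)))
    reap(worker)
    return results, restarts
```

Calling supervise(["4", "0", "oops", "2"]) sends four messages; the second and third each crash a worker, yet the fourth is still answered because the crash could not damage the supervisor's state. In a threads model with shared memory, the same failure could leave shared data half-updated, which is the interference the passage above describes.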