System Failure Strategies

Recently, while heading into the DC Metro system, I saw something that reminded me of system failure, and I thought I'd address it here. Let's face it: no matter what industry, technology, or business we're in, at some point something is going to fail. The question is: how does it fail? As I see it, there are three modes of failure: fail completely, fail partially, or fail transparently. Within those three categories, a given failure can have different attributes… which is often the most interesting part.

Transparent failure is often the best strategy, even though it is still annoying. If the radio in your car stops working, you can still drive to work; the trip will just be a bit quieter. This generally only works for features that are unrelated to the underlying goal. A software equivalent might be tabbed browsing: despite the muscle memory, if your tabs stop working, you can always open new windows. Annoying, but the system is still functional because the “nice to haves” are trumped by the “gotta haves”.
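To make the car radio idea a bit more concrete, here is a minimal sketch of one way to keep a “nice to have” from taking down a “gotta have”. Everything in it is hypothetical and invented for illustration: run_optional, render_page, and broken_highlighter are not from any real application.

```python
import logging

logger = logging.getLogger("app")


def run_optional(feature, *args, default=None):
    """Run a nice-to-have feature; if it fails, log it and carry on."""
    try:
        return feature(*args)
    except Exception:
        # The failure is recorded, but it never reaches the core path.
        logger.warning("optional feature %s failed; continuing without it",
                       feature.__name__)
        return default


def render_page(text, highlighter):
    # Core goal: return the text. Highlighting is only a nice-to-have.
    highlighted = run_optional(highlighter, text)
    return highlighted if highlighted is not None else text


def broken_highlighter(text):
    # Stand-in for a feature that has stopped working (the dead radio).
    raise RuntimeError("highlighter crashed")
```

With the broken highlighter, render_page("hello", broken_highlighter) still returns the plain text: the drive is quieter, but you still get to work.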

We're all familiar with the next failure strategy: fail completely. When you leave your car's headlights on all night, you see an example of failing completely: the battery is drained and you are effectively stuck. Of all the ways to fail, this is often the most distressing, but realistically it is one of the safest, since a car that won't start isn't going to hurt anyone. This mode is most common in products that perform one specific action, surrounded by supporting functionality and tools that are unrelated to that core purpose.

Despite this beautifully simple example, software generally follows a “fail completely” strategy while ignoring the safety aspect. When Word, Photoshop, or Firefox runs into a problem, it generally collapses into a small black hole, promptly sucking your work into nothingness. Some of this damage has been mitigated by auto-save and similar features, but the underlying problem is still there and still annoys us on a regular basis.
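As a rough sketch of what failing completely but safely could look like, the snippet below still lets the failure stop the program, but it writes the user's work to a recovery file first. The names run_with_recovery, operation, and recovery_path are assumptions made up for this example, and the document is assumed to be a plain JSON-serializable dict.

```python
import json
import os
import tempfile


def run_with_recovery(document, operation, recovery_path):
    """Run operation(document); if it fails, save the document, then re-raise."""
    try:
        operation(document)
    except Exception:
        # Write to a temporary file in the same directory, then swap it into
        # place so a half-written recovery file is never left behind.
        directory = os.path.dirname(os.path.abspath(recovery_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(document, handle)
        os.replace(tmp_path, recovery_path)
        # Still fail completely: the caller sees the original error.
        raise
```

The program still goes down, much like the car with the dead battery, but the work survives the crash instead of disappearing with it.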

The final failure category is the most interesting of all: fail partially. This doesn't work in many scenarios, but it works in just enough of them to strike a balance between the two extremes. Think of a stopped escalator, conveniently the genesis of this article. Even when an escalator completely fails, it still has value: you can walk up it like a staircase. Best of all, no lives are risked, no damage is done, and all we lose is a bit of convenience. In the software world, I can't think of many examples of this that don't fit into the groupings above… if the core functionality of the software stops working, it's worthless. It doesn't even become a paperweight.
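One pattern that arguably comes close to the stopped escalator is falling back to stale data when a live source is down. This is only a sketch under my own assumptions: fetch_with_fallback, Result, and the dict used as a cache are hypothetical names for illustration, not a real library's API.

```python
from dataclasses import dataclass


@dataclass
class Result:
    value: str
    degraded: bool  # True when the value came from the stale fallback


def fetch_with_fallback(key, fetch_live, cache):
    """Prefer the live source; if it fails, serve the last known value."""
    try:
        value = fetch_live(key)
    except Exception:
        if key in cache:
            # The escalator has stopped, but it still works as stairs.
            return Result(cache[key], degraded=True)
        # Nothing to fall back on, so the failure is complete after all.
        raise
    cache[key] = value  # remember the latest good value for next time
    return Result(value, degraded=False)
```

The degraded flag matters here: the caller can tell the user that the data may be out of date, rather than pretending nothing went wrong.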

As developers, I think we need to stop and ponder this one a bit more… how do we build software that not only fails safely but still has some value after it fails? I believe there are various development strategies that can help. Unit testing, the various agile methodologies, and the concept of multiple iterations each teach us to think in terms of “useful functionality” and to make sure it works.