Fighting the Last War

As a group, we software developers can be reactive. Generally we learn things the hard way. Every one of us has a story where we accidentally dropped the wrong database or deployed the wrong code to production or only got part of the garbage file from the Gibson…

How many of you have ever gotten a call in the middle of the night along the lines of "omg!? the site/system is down!? customers are dying!"?

Once the problem is fixed and people calm down a bit, the logical question is asked:

How can we prevent this from happening again?

And here's the dirty little secret… you may not be able to. Sometimes stuff just breaks. Having a call list of ever-more qualified/authorized people can help. Having a checklist/process for deployment and fixes can help. Having a backup and recovery plan in place can help. Having a rollback plan can help. Having every configuration change be monitored and tracked can help. Having redundant servers can help. Having redundant power, etc. can help. Having physically separate datacenters with automatic fail-over DNS can help.

See the problem?

Despite all of this, there are so many places and possibilities for errors, and so many things outside of your control, that at some point you just need to let go and accept that problems will happen. The key is how you respond.