This post was originally a response to a forum discussion. I felt obliged to cut it down for the forum but have posted it in full here.
For those of you who have not been following the thread, the original post asked about the form Emergency Fix processes should take (and Billy Langston responded most thoroughly). The suggestion was made by Joe Townsend that all Emergency Fixes originated from bad planning...
I agree with Joe that there is always a root cause, and arguably we may be able to plan to avoid problems stemming from these root causes. It is certainly true that on the whole IT organisations are poor at planning ahead to avoid problems.
That said, the observation that "all emergency fixes are due to bad planning" suggests that if only we could plan better we would have no need of an emergency fix system. This is simply impractical in any real-world environment. "All emergency fixes could have been avoided if they had been foreseen and planned for" is strictly true, but avoiding them all is seldom economically viable, and the observation is never helpful once the problem has occurred.
It comes down to a simple economic question: is it more costly to plan to avoid the problem than to plan to resolve the problem when it happens?
Risk and Cost vs Benefit
Take Joe's interview example. Whenever I go to a client meeting I always plan to arrive an hour before the meeting is due to start (I can always have a relaxing coffee once I arrive, and having my laptop means I don't waste any time). Now, is it bad planning on my part when a lorry load of marbles sheds its load onto the highway, causing a tanker to jackknife, leak, and explode, blocking the highway for eight hours and leaving me stranded in the middle of nowhere? Well, you could argue that I should have travelled down the day before, shame on me for planning so badly. Didn't I realise that this could have happened? Well, yes, of course it could happen, but what are the chances? Well, when it happened the chances were obviously 100%, but that observation would get anyone who made it a punch and no mistake.
In Britain we have notoriously unpredictable weather and every winter there seems to be a furore about road closures. "Why can't we keep the roads free of ice and snow?" the tabloids cry. "They manage in Finland and it snows all the time there!" And that is precisely the point. In Finland they know it is going to snow, so they treat and plough the roads all of the time and the roads stay open. If the authorities in Britain did the same (and they could, the infrastructure is there) the cost to the taxpayer would be huge, and everyone would wring their hands and lament the fact that the roads were treated every single day despite the fact it only snowed on a few and potentially iced over on a few others. So, the authorities assess the likelihood that snow will fall or ice will form. Based on this assessment they may treat the roads. Sometimes they gamble and lose (by treating when it is unnecessary, or not treating when it is necessary), but overall we get a happy medium (although it is hard to see that when you're stuck on a snowbound highway).
If it costs me $100,000 to plan for and mitigate a problem that is likely to occur only once in one hundred years and costs $500,000 when it does, then I am not going to bother. Sure, it could happen tomorrow and some smug bugger could say "Ah, you see, planning for this could have avoided it". Had we spent the money and the event not happened, though, everyone would point out what a waste of money it was to plan for such an unlikely event. We based our assessment on the balance of probability. We gambled. We lost. This sort of assessment is precisely what actuaries do all the time in the insurance business; sometimes they gamble and a freak event means they lose, but on average they win big using this sort of technique.
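As a rough sketch of that arithmetic (not a real actuarial model), the comparison can be written down in a few lines of Python. The function names and the ten-year planning horizon are my own assumptions for illustration; only the $100,000, $500,000 and once-in-a-hundred-years figures come from the example above, and the $100,000 is treated here as a one-off cost.

```python
def expected_loss(cost_per_event, return_period_years, horizon_years):
    """Expected loss over the horizon, assuming on average one event
    per return period (a deliberately crude model)."""
    return cost_per_event * horizon_years / return_period_years


def cheaper_to_mitigate(mitigation_cost, cost_per_event,
                        return_period_years, horizon_years):
    """True when the one-off mitigation cost is below the expected loss."""
    return mitigation_cost < expected_loss(
        cost_per_event, return_period_years, horizon_years)


MITIGATION_COST = 100_000      # from the example above
COST_PER_EVENT = 500_000       # from the example above
RETURN_PERIOD_YEARS = 100      # "once in one hundred years"
HORIZON_YEARS = 10             # assumption: how far ahead we are planning

loss = expected_loss(COST_PER_EVENT, RETURN_PERIOD_YEARS, HORIZON_YEARS)
print(f"Expected loss over {HORIZON_YEARS} years: ${loss:,.0f}")
print("Build the fence" if cheaper_to_mitigate(
    MITIGATION_COST, COST_PER_EVENT, RETURN_PERIOD_YEARS, HORIZON_YEARS)
    else "Rely on the ambulance")
```

Under these assumptions the expected loss over ten years is $50,000, half the cost of the fence, so I rely on the ambulance. The answer is sensitive to the horizon, though: with these figures the break-even point is twenty years, and anyone planning further ahead than that would find the fence starts to pay for itself.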
Fences and Ambulances
It is the difference between fences and ambulances. If you have a cliff edge you can either build fences at the top to stop people falling off, or station ambulances at the bottom to pick up the pieces when they go over. Obviously it is better for all concerned if the ambulances are never used, but maintaining a perfect fence may be impractical.
Personally I would recommend building fences at the top that prevent all but the most improbable problems (planning), and stationing ambulances at the bottom to pick up the pieces if an unlikely event occurs (emergency fix). Once the ambulances have done their bit you do a root cause analysis and decide whether it is economically viable to build a bigger fence.
This setup gives me the best balance between avoiding as many foreseeable problems as possible (my fences) and acknowledging that no plan is perfect (the ambulances). When things do go wrong I have a strategy for dealing with the problem that lets me recover as quickly as practicable (which is actually part of my plan, so in fact I have planned for this eventuality), and I can then assess my planning (fence building) to decide whether I want to invest in avoiding the problem again.
It is always uneconomical to build the ultimate fence and dispense with the ambulances.
In fact I would say no fence, no matter how carefully planned, could possibly be 100% fall proof. You would end up with ridiculous contingencies. Meteor strike? Build a mirrored data centre on the other side of the planet. No, that's not enough, what if it's a meteor shower that lasts 24 hours? We need an extraterrestrial data centre that is outside any common meteor track with Earth. And so it goes on. It just cannot practically be done. You cannot avoid every problem; no matter how good you think your planning is, there will always be something you didn't think of.
Finally...
So, is every Emergency Fix down to bad planning? No. Is every Emergency Fix down to a lack of planning? No. I can have a plan which says something like this: "Look, we've covered as many scenarios as we can think of right now. Some of these are so wildly unlikely that the cost of avoiding them is too high compared with the cost of fixing them, so we'll let them be handled by the emergency fix process. As for all the things we cannot see at the moment, we'll leave those to the emergency fix process too." This is not a lack of planning. I am deliberately planning to handle unplanned events. I am stationing ambulances. That is my plan.
How much planning should you do? Simple, do as much as you can afford to do (or perhaps as much as you cannot afford not to do ;) ). Have an emergency fix process in place to catch the rest and rely on a continuous improvement loop to improve your contingencies as problems occur.
Where people go wrong is in being too lazy about preparing their fences and relying solely on their ambulances.
Mark Bools