“You can fool some of the people all of the time, and all of the people some of the time, but you cannot fool all of the people all of the time” – Abraham Lincoln.
Without casting aspersions on the figures people quote for the reliability of their IT systems, many of the outage times claimed look ludicrously small on examination.
The first issue to look at is the mean time to repair (MTTR) of a particular failure. This may be a short or a long time, but the real problem is that it says nothing about the extra time needed to “get the show back on the road”.
If you substitute the word ‘recover’ for ‘repair’ in that term, you will be closer to the truth. It may take a minute to decide that you have run two supposedly sequential jobs in the wrong order and two minutes to restart them in the correct order. However, your database will almost certainly be out of kilter as far as consistency is concerned and the ‘repair’ of that will take much longer.
In too many cases, financial bodies (banks, stock dealers) have repaired faults but have taken many hours to recover normal working conditions. “The system was repaired at 11am and trading commenced normally at 2.30pm” is a typical report.
This leaves us with an equation for recovery: MTTR = mean time to fix error + mean time to recover to full working mode.
The last part of the equation I have called ‘ramp-up’ time, representing the time needed to put the systems back into operational mode as viewed by the end business user and not the network specialist who took three minutes to repair a failing network module. A decent service-level agreement will include the ramp-up time in the recovery time specification.
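To make the equation concrete, here is a minimal Python sketch that splits an outage into fix time and ramp-up time. It is illustrative only: the function name `recovery_breakdown` is hypothetical, and the 10.30am failure time and the calendar date are assumptions, since the report quoted earlier gives only the repair and resumption times.

```python
from datetime import datetime, timedelta


def recovery_breakdown(failed_at, repaired_at, service_restored_at):
    """Return (fix time, ramp-up time, total recovery time) as timedeltas."""
    fix = repaired_at - failed_at
    ramp_up = service_restored_at - repaired_at
    return fix, ramp_up, fix + ramp_up


day = datetime(2024, 1, 15)                       # arbitrary date (assumption)
failed_at = day.replace(hour=10, minute=30)       # assumed failure time
repaired_at = day.replace(hour=11, minute=0)      # "repaired at 11am"
restored_at = day.replace(hour=14, minute=30)     # "trading commenced at 2.30pm"

fix, ramp_up, total = recovery_breakdown(failed_at, repaired_at, restored_at)
print(f"fix: {fix}, ramp-up: {ramp_up}, total recovery: {total}")
```

Under these assumed times the fix accounts for half an hour of a four-hour recovery; the remaining three and a half hours are ramp-up, which is the part a repair-only figure leaves out.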
The recovery time should emerge from a business impact analysis (BIA), which specifies how long a business service can be out of ‘normal’ action before the situation becomes critical or otherwise untenable.
It is possible for the repair to take place while the application is still running, for example when a part is repaired while a parallel redundant part takes over its job. In such a case, there will be a repair time of X minutes but zero outage time to report, because the end user sees no interruption to service.
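The difference between repair time and outage time for a redundant pair can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name `user_outage_minutes` and all the times are hypothetical, and failover to the standby part is treated as instantaneous.

```python
def user_outage_minutes(primary_down, standby_down=None):
    """Minutes the end user is without service for a redundant pair.

    Each argument is a (start, end) pair in minutes since midnight, or None
    if that part never went down. The user loses service only while BOTH
    parts are down at the same time.
    """
    if primary_down is None or standby_down is None:
        return 0
    start = max(primary_down[0], standby_down[0])
    end = min(primary_down[1], standby_down[1])
    return max(0, end - start)


repair_window = (600, 620)  # primary repaired 10:00 to 10:20 (hypothetical)
repair_minutes = repair_window[1] - repair_window[0]

# Standby healthy throughout: 20 minutes of repair, no outage to report.
print(repair_minutes, user_outage_minutes(repair_window))

# Standby also down 10:10 to 10:40: the user sees the overlap only.
print(repair_minutes, user_outage_minutes(repair_window, (610, 640)))
```

In both cases the repair time is 20 minutes, but the outage the end user sees is zero in the first case and 10 minutes in the second.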
This leads me to the penultimate point: only by understanding all the steps in a failure and its recovery can you plan to minimise the times involved in each stage.
The simple diagram below illustrates this:
The final point to make is that there are several viewpoints of an outage or period of downtime, depending on your place in an organisation.
The end user’s view will be that the outage lasts as long as he or she is prevented from using IT to do the job he or she is supposed to do.
The server specialist’s view might be that the outage of his hardware was a mere minute or two before it was fixed, whereas the network person will say: “What’s all the fuss about? Everything is working fine.”
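The gap between these viewpoints shows up as soon as the same incident is measured per component and per service. The Python sketch below uses an entirely hypothetical incident log (the names, dates and times are assumptions for illustration) and reports the outage each party would quote.

```python
from datetime import datetime

day = datetime(2024, 1, 15)  # arbitrary date (assumption)

# Hypothetical incident log: viewpoint -> (went down, came back), or None
# if that party saw no failure at all.
incident = {
    "server": (day.replace(hour=9, minute=0), day.replace(hour=9, minute=2)),
    "network": None,
    "end-user service": (day.replace(hour=9, minute=0),
                         day.replace(hour=11, minute=45)),
}


def outage_minutes(window):
    """Length of a (down, up) window in minutes; 0 if nothing went down."""
    if window is None:
        return 0.0
    down, up = window
    return (up - down).total_seconds() / 60


report = {name: outage_minutes(window) for name, window in incident.items()}
for name, minutes in report.items():
    print(f"{name}: {minutes:.0f} minutes")
```

With these assumed figures the server specialist reports two minutes, the network person reports none, and the end user reports 165 minutes for the same incident.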
It all depends on your viewpoint, and I know which viewpoint the company CEO and board will take. Do you?