
How to avoid your own BA-style outage

BA’s reputation and brand have suffered immense damage from the outage, which left 400 planes unable to fly and up to 75,000 passengers stranded at Heathrow and Gatwick airports. The negative headlines and stories have given competitors a stick to beat the airline with and probably caused many prospective passengers to think again about booking a flight with BA.

BA’s explanation is that the whole debacle started when an engineer accidentally unplugged a power supply, and that reconnecting it led to a power surge which caused major damage. In the aftermath of an IT outage like this there is rarely an explanation that will satisfy outraged customers, but this particular reasoning merely serves to underline the inadequacy of the measures the airline had in place to protect its systems from failure, and to recover them.

The irony is that the damage caused to BA’s reputation is likely to be far more costly than the measures the airline would have needed to put in place to recover from any IT failure or outage.

BA has blamed the whole sorry mess on “human error” and claimed that because the problem was caused by a power surge taking out its IT systems, this was not an IT failure. Human error is frequently cited as a cause of IT outages and failures. Even so, it speaks volumes about the lack of thought and planning that goes into business continuity (BC) and disaster recovery (DR), even at the biggest companies such as BA, that their systems can be offline for many hours, even days, because of a mistake by a single person.

Lessons need to be learned from the very public, global IT meltdown that BA experienced, if companies in a similar situation do not want to face the same fate. The issue of cost may be a deterrent to some businesses, but they need to consider the costs of the damage to their brand and IT systems if they don’t take steps to mitigate the consequences of an outage.

Here are a number of factors that organisations should consider if they want to avoid having their brand bracketed with BA the next time someone writes or talks about the embarrassing consequences of IT failures.

Multiple data centres: Where systems are mission-critical and organisations cannot afford any downtime, they need to look at dual data centre platforms and geo-redundant hosting. In the event of a failure at the primary data centre, an organisation can maintain business continuity by switching to the secondary data centre. It also makes sense to house the data centres in different locations so that a problem affecting the primary data centre, such as a power failure, does not also take out the secondary data centre.
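As a rough illustration of the switching logic described above, the sketch below picks which site should serve traffic. It is a minimal example only: the `DataCentre` class, the `healthy` flag and the `choose_active` function are hypothetical names, and a real platform would base the decision on monitoring, replication state and network routing rather than a single flag.

```python
from dataclasses import dataclass


@dataclass
class DataCentre:
    """Hypothetical record of a site and its last health-check result."""
    name: str
    healthy: bool


def choose_active(primary: DataCentre, secondary: DataCentre) -> DataCentre:
    """Prefer the primary site; fall back to the secondary if it has failed."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    # Both sites down: geo-redundancy cannot help, so surface the problem.
    raise RuntimeError("no healthy data centre available")


# Illustrative scenario: a power failure has taken out the primary site.
primary = DataCentre("primary", healthy=False)
secondary = DataCentre("secondary", healthy=True)
print(choose_active(primary, secondary).name)
```

The final branch is the reason the two sites should sit in different locations: if one power failure can mark both as unhealthy, there is nothing left to switch to.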

Staff that follow procedures: With human error being cited as the cause of so many high-profile IT failures, it is very important that organisations do their best to remove the potential for mistakes by ensuring staff follow the correct procedures. This includes documenting who is involved and their specific roles, and providing training so that everyone understands exactly what is expected of them.

A fully drafted and understood DR and BC plan: In the event of a disaster or failure, organisations need to have a fully documented process or set of procedures to recover and protect the business IT infrastructure. It should outline the actions to be taken during and after a disaster, in a specific order and with detailed instructions. This document should also include contact information for all vendors and support lines.

Continuous backup: In the event of a disaster, organisations will be keen to recover systems to a point in time as close as possible to the failure. A scheduled backup can only restore data as it stood when the backup was taken, which could be an hour or 24 hours earlier. Continuous backup eliminates the need for scheduled backups by asynchronously writing any data created in the first data centre to the second location.
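The difference can be put in numbers. The sketch below compares the worst-case data loss for each approach against a recovery point objective. All of the figures are assumptions chosen for illustration: the 15-minute objective, the backup intervals and the five-second replication lag are not taken from BA or any real system.

```python
from datetime import timedelta


def worst_case_loss(interval_between_copies: timedelta) -> timedelta:
    """If a failure hits just before the next copy is taken, everything
    written since the previous copy is lost, so the worst case is roughly
    one full interval."""
    return interval_between_copies


# Hypothetical recovery point objective agreed with the business.
RPO = timedelta(minutes=15)

# Assumed intervals; continuous backup is modelled as a short replication lag.
strategies = {
    "daily scheduled backup": timedelta(hours=24),
    "hourly scheduled backup": timedelta(hours=1),
    "continuous backup": timedelta(seconds=5),
}

for name, interval in strategies.items():
    loss = worst_case_loss(interval)
    verdict = "meets" if loss <= RPO else "misses"
    print(f"{name}: up to {loss} of data lost, {verdict} the RPO of {RPO}")
```

Under these assumed figures, only the continuous approach stays inside the objective. An organisation running the same comparison with its own numbers can see how much recent data each option puts at risk.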

Choose the right partner: Many businesses, including BA, prefer to outsource their data centre operations, along with their management and maintenance, to third parties. While this may make sense from a cost point of view, it should be a well-informed decision. For example, there have to be very clear and agreed procedures and SLAs between the business and its chosen partner covering what happens in the event of a disaster, including expected recovery time objectives (RTO) and recovery point objectives (RPO). The partner should be able to prove, through previous examples of helping other customers, that it can deliver this BC/DR plan if, or when, it needs to be implemented.

The steps outlined above can go a long way to mitigating any disaster and help businesses to keep their IT systems (and brand) up and available to customers at all times. In the aftermath of the BA incident, many companies will be re-evaluating their BC/DR plan, but it shouldn’t take a scare to make it a priority. Planning for outages should be part of an IT team’s everyday work to ensure business success.

Jon Lucas, Director at Hyve Managed Hosting

This was posted in Bdaily's Members' News section by Hyve Managed Hosting.