In May, BA suffered a catastrophic IT failure when the power supply to a key data centre was lost and the backup system was rendered ineffective. The failure shut down the airline’s IT systems, causing passenger chaos worldwide. BA has yet to explain the precise cause and sequence of events that resulted in the failure of two of its data centres.
The incident caused consternation in the data-centre sector, with many experts surprised that BA’s systems were not more resilient, and that the procedures which should have been in place to prevent this type of meltdown failed. The surprise was not only the scale and duration of the power cut, but that the failure brought down both a key data centre and the backup data centre. ‘What was the most surprising aspect of this, for me, was that BA couldn’t restart their data processors somewhere else,’ says Alan Beresford, managing director at EcoCooling.
So how are data centres designed to be resilient – and what is it about the way they are engineered that should prevent downtime and failures from occurring?
To understand resilience, you first need to appreciate how a typical data centre is arranged. The most critical area contains the data halls – rooms in which the data processing units, or servers, are housed in rows of cabinets or racks. These servers need a continuous supply of power and cooling, which is why data centres are designed with a robust set of systems to deal with power failures and to ensure cooling is always available. The measurement of how vulnerable your system is to failure determines its resilience.
The Uptime Institute, an organisation focused on business-critical systems, defines four tiers of data centre resilience, Tier I to Tier IV. These are described here as N, N+1, 2N and 2N+1 arrangements, where N is the base and 2N+1 is the most resilient. This terminology is best explained using the example of standby generators serving a 1MW data centre (see panel, ‘Tiers of data resilience’).
CIBSE launches data centre performance award
A new category has been added to the CIBSE Building Performance Awards 2018: Project of the Year – Data Centre. Entries, for projects completed between 1 June 2014 and 31 August 2016, should demonstrate how a new-build or a refurbishment of a data centre meets high levels of user satisfaction and comfort. The entry also needs to demonstrate how outstanding measured building performance, energy efficiency and reduced carbon emissions have been achieved. Visit the Building Performance Awards 2018 website for more information.
It is important to note that this tiering makes no reference to the type of systems employed; it does not state which type of uninterruptible power supply should be used, or how a data centre is to be cooled. Tiering is about how the systems are arranged.
The other thing to note is that the tiering designation is about the maintainability of systems. ‘Most people would argue that a Tier III data centre is concurrently maintainable, because you can take out a piece of kit to maintain it and you don’t lose anything,’ says Robert Thorogood, executive director at consultant Hurley Palmer Flatt. ‘Some banks specify Tier IV, which means the systems are not only concurrently maintainable, but you can have a fault anywhere on the system and you still won’t lose anything, because there is more redundancy.’
Not all businesses will require the same level of resilience as a bank. Thorogood says they have to ask: ‘What will happen to my business if the data centre goes down?’
Some organisations can deal with an organised period of downtime once a year. However, increased reliance on the internet means access to it is becoming critical for more and more businesses. Many retailers, for example, now have a 24/7 web presence and can no longer accept downtime overnight.
‘It used to be that research organisations did not require a high level of data-centre resilience; if the data centre went down, it went down. These days, because everybody relies on email and the internet, even universities want access to a Tier III data centre,’ explains Thorogood.
However, it is important to remember that not all areas in a Tier III data centre will be serviced to the same level of resilience.
‘A typical data centre will have the hall housing the computer racks, accompanied by support areas – such as storage, loading bays, security and plant, and the uninterruptible power supply (UPS); the infrastructure serving these areas will not have to be nearly as reliable as that serving the data hall,’ says Don Beaty, CEO at DLB Associates Consulting Engineers in the US, and the person responsible for starting the ASHRAE technical committee on mission critical facilities, TC9.9.
Beaty warns that, even with multiple systems inside a data centre, the building can still be vulnerable to single points of failure externally, particularly in the data network. ‘Data centres are nothing without connectivity to the outside world; you want diverse fibre routes from different carriers coming into the building from diagonally opposite corners,’ he says. ‘However, if those fibres converge upstream, then that will become a single point of failure.’
The same issue is true for power, where it can be difficult to avoid a common supply. Very few data centres have two discrete power supplies, but it is common to have two incoming power supplies from different substations – although these can come from the same power source further upstream – with supplies entering the building on different sides.
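The check Beaty describes, whether supposedly diverse routes converge somewhere upstream, can be sketched in a few lines of Python. This is a minimal illustration only: the route names and node labels are hypothetical, and a real survey would work from carrier and utility records rather than hand-written lists.

```python
# Hypothetical route data: each route lists the points it passes through,
# from the building entry to the wider network. None of these names come
# from a real site.
def shared_nodes(routes):
    """Return the nodes that every route passes through."""
    if not routes:
        return set()
    common = set(routes[0])
    for route in routes[1:]:
        common &= set(route)
    return common


fibre_routes = [
    ["north_entry", "carrier_a_duct", "exchange_1"],
    ["south_entry", "carrier_b_duct", "exchange_1"],
]
power_routes = [
    ["east_intake", "substation_a", "grid_point_x"],
    ["west_intake", "substation_b", "grid_point_y"],
]

print(shared_nodes(fibre_routes))  # {'exchange_1'}
print(shared_nodes(power_routes))  # set()
```

Any node returned by the function is shared by every route, so it is a single point of failure however far apart the routes enter the building. In this made-up example, the two fibre routes meet at the same exchange, while the two power routes stay separate.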
In a Tier III data centre, for example, each supply – once inside the building – is kept separate, passing through a dedicated set of transformers and a UPS, and then down dedicated cables until it reaches the server. So each computer server is fed from two independent power supplies. ‘The UPS will be supported by standby generation, so – if the mains go down – the UPS batteries will take over until the standby generators fire up, synchronise and supply power,’ says Beresford.
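The handover Beresford describes only works if the UPS batteries last longer than the generators take to start and synchronise. The Python sketch below lays out that sequence as a simple timeline. The function names and all the timings are assumptions chosen for illustration; they are not figures from any of the operators quoted here.

```python
def ride_through_ok(ups_autonomy_s, generator_start_s, sync_s):
    """True if the UPS batteries can carry the load until generators take over."""
    return ups_autonomy_s >= generator_start_s + sync_s


def failover_timeline(ups_autonomy_s, generator_start_s, sync_s):
    """List (time in seconds, event) pairs for a mains failure at t = 0."""
    takeover = generator_start_s + sync_s
    events = [
        (0, "mains lost, UPS batteries carry the load"),
        (generator_start_s, "standby generators running"),
        (takeover, "generators synchronised and supplying power"),
    ]
    if ups_autonomy_s < takeover:
        # Batteries run out before the generators are ready to supply power.
        events.append((ups_autonomy_s, "UPS batteries exhausted, load dropped"))
    return sorted(events)


# Illustrative values only: 10 minutes of battery, 15 s start, 10 s to synchronise.
for t, event in failover_timeline(600, 15, 10):
    print(f"t={t:>3}s  {event}")
print("ride-through OK:", ride_through_ok(600, 15, 10))
```

Running the same functions with a shorter battery autonomy than the combined start and synchronisation time adds a ‘load dropped’ event to the timeline, which is the outcome the arrangement is designed to avoid.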
Tiers of data resilience
Definitions are based on an example of standby generators serving a 1MW data centre
- Tier I (N) Normal, the data centre has 2 x 500kW generators
- Tier II (N+1) The data centre has a spare generator – so, 3 x 500kW
- Tier III (2N) The data centre has two power supply systems, A and B, and each stream has two 500kW generators – 4 x 500kW in total
- Tier IV (2N+1) Each A and B stream has 2 x 500kW generators, plus a spare 500kW generator – so, 6 x 500kW generators in total
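The arithmetic in the panel can be reproduced with a short Python sketch. It assumes identical generators and follows the panel’s own reading of 2N+1, in which each of the two streams gets its own spare, giving 2 x (N+1) units. The function name and scheme labels are illustrative.

```python
import math


def generators_required(load_kw, unit_kw, scheme):
    """Number of identical generators needed for a given redundancy scheme."""
    n = math.ceil(load_kw / unit_kw)  # N: units needed just to carry the load
    counts = {
        "N": n,               # no spare
        "N+1": n + 1,         # one spare unit
        "2N": 2 * n,          # two full streams, A and B
        "2N+1": 2 * (n + 1),  # a spare in each stream, as in the panel
    }
    return counts[scheme]


# The panel's example: a 1MW data centre served by 500kW generators.
for scheme in ("N", "N+1", "2N", "2N+1"):
    print(scheme, generators_required(1000, 500, scheme))
# N 2
# N+1 3
# 2N 4
# 2N+1 6
```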
Beresford adds that not all systems require the same level of resilience. ‘Power and fibre optics systems might be 2N, but cooling might be N+1, because it’s a lot simpler,’ he says. ‘You play tunes on the level of redundancy according to the type of technology.’
When considering resilience, it is important to ensure that, should a system fail, the operator understands how to deal with the situation. ‘When you get big data centres with multiple levels of redundancy, their operation can become very complicated,’ warns Beresford. ‘There is an alternative view that very simple systems can actually prove to be more resilient and more reliable than complicated ones.’
The ‘keep the engineering simple’ mantra has been embraced by data-centre developer and operator DigiPlex, which engages with the operational team when it puts together a design. ‘If you put a design in front of the operations guys and they don’t get it, then scrap it, because it must be easily understandable for them to operate in an emergency,’ says Geoff Fox, DigiPlex’s chief innovation and engineering officer. ‘If technicians don’t understand the system, your resilience is super weak.’
DigiPlex’s philosophy means it designs to minimise the opportunity for human error by following a 2N – rather than an N+1 – solution for data centre electrical infrastructure. ‘We found that trying to save on the cost of a generator builds in complexity to the design, results in additional costs for the switchboards and cross-connects, which makes it harder to maintain,’ says Fox. Resilience is further enhanced by using factory-manufactured, prefabricated switchrooms and plantrooms, so that quality can be controlled and the units fully tested before they arrive on site.
Sophia Flucker, director at consultant Operational Intelligence, believes commissioning the data centre before it is operational is fundamental to its resilience. She lists what she terms the ‘five levels of commissioning’ necessary to achieve resilience: factory acceptance; dead testing on site; start up on site; systems testing; and integrated systems testing.
Flucker says a comprehensive approach to commissioning is to ‘test all the components, then test the systems and their failure modes’. Sound advice, which perhaps BA will follow in the future – particularly testing in failure mode.
Read more in CIBSE’s Data Centres: an introduction to concepts and design at www.cibse.org/knowledge