Cloud Outages Might Seem Trivial, But Show That Our Data Infrastructure Is Fragile

From Email glitch to large-scale outage?

The computing cloud we have created supports much of our day-to-day office and leisure activity, from office email to online shopping and sharing holiday photos. Even health, social care and government functions are moving towards digital delivery over the internet.

However, we should be wary that as we become more dependent on it, the cracks will show. The systems are often a patchwork of interconnected services provided by various companies and industry partnerships. A failure of one can lead to a failure in others.

The Move To Cloud

And so our digital world continues to move to the Cloud. It generally provides a low-cost alternative to on-premises computing resources. But three major things have generally held it back: performance, security and robustness. While the first has been addressed with ever-increasing performance levels from AWS and Azure, there are still questions about the other two. Can we scale our existing ways of doing things into the Cloud, and still be secure? And will the Cloud be able to give us almost 100% uptime?
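To put "almost 100% uptime" into perspective, it helps to turn an availability percentage into the hours of downtime it allows over a year. The following is a minimal sketch of that arithmetic; the function name and the list of percentages are my own illustrative choices, not figures quoted by any cloud provider.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a service can be down at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative availability targets (assumed, not taken from any provider).
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime a year")
```

Even a service that is up 99.9% of the time can be down for the best part of a working day each year, which is why the gap between "almost" and "100%" matters.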

Many organisations have moved their server infrastructure into the Cloud and adopted Infrastructure-as-a-Service (IaaS). Increasingly, too, we are using Platform-as-a-Service (PaaS), where we do not need to run our own server infrastructure but adopt whole platforms. When this happens we become highly dependent on a whole service area providing key applications for our organisations.

Email glitch

And so for the past two days, the email infrastructure of Microsoft 365 has been glitching, with some users reporting that it was taking around three hours to send and receive emails (and that some emails were going missing).

After an investigation, Microsoft discovered that part of its Domain Controller infrastructure was down. The main areas of the world affected were identified with Downdetector.

No power … no data

In 2017, Capita’s systems were knocked out for at least two days, affecting council and NHS infrastructures around the UK (including Sheffield City Council). The outage related to a power failure in West Malling, where the generators failed and shut down the whole of the data centre. Capita is the largest public sector IT provider in the UK, with around 50 per cent of the overall market (£1.9bn in 2016).

On 8 July 2015, all United Airlines (UA) flights were grounded, followed by a computer crash on the New York Stock Exchange (NYSE), and then the Wall Street Journal site crashed. These are the kinds of things that would signal the start of a major problem, such as a major Internet outage or a large-scale cyber-attack. Wired classified it as “Cyber Armageddon”, and McAfee has since pointed out that it was suspicious that it all happened on the same day, and that there could have been a major cyber attack.

No matter if it was a cyber attack or not, it does show:

how dependent our world is on information technology, and that a failure in any part of it could be devastating to both the economy and our lives.

Overall the NYSE was down for over three hours, and it was reported to be a technical glitch (costing around $400 million in trades). The thing that it highlighted is that airlines and the stock exchange are two key parts of our critical infrastructure, and problems in either of these, on a long-term basis, could have a devastating effect. Unfortunately few designers of systems take failover systems into account, as they can considerably increase the costs. Imagine you are quoting for an IT contract, and you say:

“Well that’ll be a million to build, but we’ll need another million to build it somewhere else and then there’s the systems to flip them over, and then there’s the load balancers … and then … hello … are you still there?”

Terrorists and cyber attacks are often cast as the threat actors that trigger this chaos, but in most cases the likely causes will be a lack of thought, a lack of investment and/or human error, and these things should not be forgotten. While, in the UK, the banks have been toughening their infrastructure against attack, the whole back-end infrastructure needs to be examined, especially the security of power sources, the loss of which will bring everyone down.

There are still two major things that most systems are not resilient against:

  • Long-term power failure.
  • Sustained DDoS (Distributed Denial of Service).

Lightning too …

Skype recently went down for almost an entire day, while Facebook was down for more than an hour (the second time in a week), meaning that many sites that depend on Facebook accounts for authentication were locked out too.

Losing Facebook is an annoyance, but interruptions to major health and social care services or energy supply management systems can lead to real damage to the economy and people’s lives.
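The knock-on effect of a shared login service going down can be sketched as a walk over a dependency map. The example below is only an illustration: the service names and their dependencies are hypothetical, and a real estate would have a far larger and messier map.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it relies on.
DEPENDS_ON = {
    "power": [],
    "network": ["power"],
    "dns": ["network"],
    "login": ["dns", "network"],
    "email": ["login", "dns"],
    "shop": ["login"],
    "photos": ["login"],
}


def affected_by(failed: str, depends_on: dict) -> set:
    """Return every service that goes down, directly or indirectly, when `failed` fails."""
    # Invert the map so we can walk from a service to the services relying on it.
    dependants = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependants[dep].append(service)

    # Breadth-first walk outwards from the failed service.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependants[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


if __name__ == "__main__":
    print("login fails ->", sorted(affected_by("login", DEPENDS_ON)))
    print("power fails ->", sorted(affected_by("power", DEPENDS_ON)))
```

In this toy map, losing the login service takes out everything that authenticates through it, while losing power takes out everything.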

In 2015, Google’s data centres in Belgium (europe-west1-b) lost power after the local power grid was struck by lightning four times. While most servers were protected by battery backup and redundant storage, there was still an estimated 0.000001% loss of disk space — which for Google’s huge data stores meant a few gigabytes of data.

The lesson is not to trust cloud providers to store and provide backups for your data. Your backups need backups too. What it also shows is our dependence on power supply systems which, as long runs of conductive metal, are more prone to lightning strikes than you might imagine.
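A backup is only useful if it still matches the original, so one simple habit is to compare checksums across every copy you hold. The sketch below assumes the copies are reachable as ordinary files; the function names and file names are hypothetical, and copies held with a cloud provider would need that provider's own tooling to fetch or hash them.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(original: Path, copies: list) -> dict:
    """Map each backup path to True if it exists and matches the original exactly."""
    expected = sha256_of(original)
    return {
        str(copy): copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }


if __name__ == "__main__":
    # Demonstration with throwaway files standing in for real backup locations.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "data.db"
        original.write_bytes(b"holiday photos and office email")
        good = root / "offsite_copy.db"
        good.write_bytes(original.read_bytes())
        damaged = root / "second_provider_copy.db"
        damaged.write_bytes(b"holiday photos and office emai")  # truncated copy
        missing = root / "tape_copy.db"  # never written
        for name, ok in check_copies(original, [good, damaged, missing]).items():
            print(Path(name).name, "OK" if ok else "FAILED")
```

A check like this says nothing about whether the copies sit on independent power and network infrastructure, which is the harder part of the problem.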

When the lights go out

Former US secretary of defence, William Cohen, outlined how the US power grid was vulnerable to a large-scale outage: “The possibility of a terrorist attack on the nation’s power grid — an assault that would cause coast-to-coast chaos,” he said, “is a very real one.”

As a former electrical engineer, I understand well the need for a safe and robust power supply, and that control systems can fail. It’s not uncommon to have alternative or redundant power supplies for important equipment. Single points of failure are accidents waiting to happen. Back up your backup.

The electrical supply grid will try to provide alternative power whenever any part of it fails. The power supply system needs to be built with redundancy in case of problems, and monitoring and control systems that can respond to failures and keep the electricity supply balanced.
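The value of redundancy can be shown with a little arithmetic. If one supply is available a fraction a of the time, then n redundant supplies, any one of which is enough, are available 1 - (1 - a)^n of the time. The sketch below makes the strong assumption that the supplies fail independently, which rarely holds in practice (a shared grid feed or a shared generator fault breaks it), so treat the results as an upper bound. The 99% figure is an illustrative value, not a measurement.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units where any one unit is enough.

    Assumes the units fail independently of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    single = 0.99  # assumed availability of one power supply
    for copies in (1, 2, 3):
        print(f"{copies} supply(ies): {combined_availability(single, copies):.6f}")
```

Each extra independent supply cuts the chance of a total failure by the same factor, but only for as long as the supplies do not share a hidden single point of failure.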

Cohen fears a major power outage could lead to civil unrest. Janet Napolitano, former Department of Homeland Security secretary, said a cyber-attack on the power grid was a case of “when,” not “if”. And former senior CIA analyst Peter Vincent Pry went so far as to say that an attack on the US electrical power supply network could “take the lives of every nine out of ten Americans”. The damage that an electromagnetic pulse (EMP) could cause, such as from a nuclear weapon air-burst, is well known. But many now think the complex and interconnected nature of industrial control systems, known as SCADA, could be the major risk.

An example of the potential problem is the north-east US blackout on August 14 2003, which affected 508 generating units at 265 separate power plants, cutting off power to 45m people in eight US states and 10m people in Ontario. It was caused by a software flaw in an alarm system in an Ohio control room which failed to warn operators about an overload, leading to a domino effect of failures. It took two days to restore power.

As the world becomes increasingly internet-dependent, we have created a network that provides redundant routes to carry traffic from point to point, but electrical supply failures can still take out core routing systems.

Control systems — the weakest link

Often it’s the less obvious elements of infrastructure that are most open to attack. For example, air conditioning failures in data centres can cause overheating sufficient to melt equipment, especially the tape drives used to store vast amounts of data. This could affect anything from banking transactions worth billions to the routing of traffic around a busy city or an emergency services call centre.

As we become more dependent on data and data-processing, so we are more vulnerable to their loss. Safety critical systems are built with failsafe control mechanisms, but those mechanisms can also be attacked and compromised.

The cloud we have created and upon which we increasingly depend is not as hardy as we think. The internet itself, and the way we use it, is not as distributed as it was designed to be. We still rely too heavily on key physical locations where data and network interconnections are concentrated, creating unacceptable points of failure that could lead to a domino-effect collapse. The DNS infrastructure is a particular weak point, where just 13 root server addresses (each now served by many replicated instances) sit at the top of the entire web’s address book.

I don’t think governments have fully thought this through. Without power, without internet connectivity, there is no cloud. And without the cloud we have big problems.