One of the tenets of the Cloud religion is that it should be possible - through the use of intelligent software - to build reliable systems on top of unreliable hardware.
Just like you can build reliable and affordable storage systems using RAID (Redundant Arrays of Inexpensive Disks).
One of the largest cloud providers even evangelises to its application development customers that they should assume that “everything that can go wrong, will go wrong”.
In fact, their SLA only kicks in once at least two zones have become unavailable. Quite surprising, but nonetheless a typical cloud approach.
Nowadays most of the large cloud providers buy very reliable hardware. When running several hundred thousand servers, a failure rate of 1 PPM versus 2 PPM (parts per million) makes quite a difference.
And memory chips that are too cheap can cause a lot of problems that are very difficult to pinpoint.
These providers also increase uptime by buying simpler (purpose-optimised) equipment and by thinking carefully about what exactly is important for reliability.
For example: one of the big providers routinely removes the overload protection from its transformers.
They would rather have a transformer costing a few thousand dollars break down occasionally than regularly have whole aisles lose power because a transformer manufacturer was worried about possible warranty claims. And not to worry, they do not remove the fire safety breakers.
That does not mean that assuring reliability at levels of the stack above the hardware is no longer necessary. Even the best quality hardware can (and will) fail sometimes.
Not to mention human errors (Oops, wrong plug!) that still regularly take complete data centres off the air (or rather, out of the cloud).
The real question continues to be what happens to your application when something like this happens.
Does it simply remain operational, does it degrade gracefully to a slightly simpler, slightly slower but still usable version of itself, or does it just crash and burn? And for how long? For end users bringing their own applications to the cloud it is clear where the responsibility lies for addressing this (with themselves).
But end users who outsource their applications to a so-called “managed cloud provider” may (and should) expect that the provider of that management takes responsibility.
Recently several customers of a reputable IT provider - which earned its stripes largely in the pre-cloud era and which now offers cloud services from a large number of regionally distributed DCs - lost access to their applications for several days because one operator in one data centre did something fairly stupid with just one plug.
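For readers who like to see the idea in code: the three outcomes mentioned above (stay fully operational, degrade gracefully, or crash and burn) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular provider does it. The names `fetch_with_fallback`, `zone_a`, `zone_b` and `cached_copy` are hypothetical, and a real zone call would be a network request rather than a local function.

```python
from typing import Callable, Sequence


def fetch_with_fallback(
    zones: Sequence[Callable[[], str]],
    degraded: Callable[[], str],
) -> tuple[str, str]:
    """Try each zone in order; serve a simpler answer if all of them fail.

    Returns (mode, payload), where mode is "full" or "degraded".
    """
    for call_zone in zones:
        try:
            return "full", call_zone()
        except ConnectionError:
            # Assumption: a ConnectionError means this zone is down,
            # so we pessimistically move on to the next one.
            continue
    # Every zone failed: fall back to the simpler, still usable version.
    return "degraded", degraded()


# Hypothetical stand-ins for two zones and a local cache.
def zone_a() -> str:
    raise ConnectionError("zone A unreachable")


def zone_b() -> str:
    return "live catalogue"


def cached_copy() -> str:
    return "cached catalogue"


if __name__ == "__main__":
    print(fetch_with_fallback([zone_a, zone_b], cached_copy))
    print(fetch_with_fallback([zone_a], cached_copy))
```

The point of the sketch is the mindset rather than the mechanics: the caller decides up front what "still usable" means, instead of finding out during an outage.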
Luckily we do see the rate of such human mistakes decline as cloud providers gain more experience (and add more process). Experience counts, especially in the cloud. But an outage like this simply is not acceptable.
If a provider boasts it has more local cloud data centres than others, but then is unable to move specific customer workloads to those other data centres within an acceptable timeframe, it is not really a “managed” cloud provider.
Simply lifting and shifting customer applications to a cloud instance without “pessimistically” looking at what could go wrong is as stupid as putting all your data on a single inexpensive disk without RAID and without backup.
And if reengineering the applications is too expensive to create a feasible cloud business case, then users should ask themselves whether cloud in that case is really the right solution.
In the words of R.E.M.: “I think I thought I saw you try” is really not enough assurance for success. The cloud is not about technology or hardware, it’s about mindset.
And providers who do not change their mindset may see their customers losing faith in the cloud (or at least in their cloud). Quite quickly.
Losing My Religion (1991) was the biggest commercial hit of alternative rock band R.E.M. The song was written more or less accidentally as the band's guitarist was trying to teach himself to play a second-hand mandolin he had bought on sale.
By Gregor Petri - Research Analyst, Gartner