Just think, what if there was a run on resources at AWS and Azure?
- 21 April, 2015 05:20
During a recent inquiry, a client asked Elias Khnaser, research director at Gartner, how they could purchase “guaranteed capacity” at AWS in the event of a disaster.
“Frankly, I had never even considered such a scenario,” Khnaser admits.
“After asking the client for clarification, I discovered that they were concerned about AWS’ ability to guarantee capacity when/if a large number of organisations tried to simultaneously provision or power-on instances.
“This is assuming, of course, that the disaster affected a large geographic area and, consequently, a large number of organisations.
“That immediately reminded me of the old way of doing Disaster Recovery, where organisations would pay to have physical servers reserved and, in the event of a disaster, would be guaranteed access to those servers to rebuild their environment.”
So Khnaser began to answer the question…
“Well, with AWS, you could purchase Reserved Instances which would guarantee that your instances would power on, but there are design best practices that are intended to avoid those types of situations.
“For starters, you would need to deploy your instances into two Availability Zones within the same Region in order to adhere to AWS’ compute SLA.”
Now, of course the client is thinking about DR, and as such, may not be willing to deploy instances to two Availability Zones.
Nonetheless, Khnaser maintains that this is the correct deployment method to avoid such an outage. But the client then asked, “What if the disaster affected the entire Region?”
“I then explained that AWS Regions have at least two AZs, and some have more,” Khnaser recalls.
“Furthermore, the likelihood of a Region running out of capacity is extremely low.
“And of course, you can always architect your environment to work across multiple Regions; you sacrifice synchronous replication capabilities and some other things, but it is doable.”
Khnaser then took the client through the importance of a Business Impact Analysis (BIA), which would not only consider the types of disasters the business needs to protect against, but would also identify the RTOs and RPOs needed to design an effective DR plan.
Khnaser also explained to the client that at some level, they must consider the social aspects of a disaster, not just the technical aspects.
“If the disaster is that big, the last thing on anyone’s mind is how to bring back services,” he adds. “You are in survival mode at that point.
“If you are trying to protect against more than a hurricane, a tornado, an earthquake of a certain magnitude, or even against terrorist attacks of a specific caliber, you face many more challenges than whether or not your instances will power on.”
Food for thought?
But that got Khnaser thinking – could there be a run on resources at large cloud service providers like AWS, Azure, or others in the event of a really large disaster? And if so, what would that look like?
Putting the social and survival instincts aside, and assuming a disaster was large enough that organisations would look to AWS and Azure to quickly rebuild their environments, what could these providers do to avoid running out of capacity?
“Sure, I know what you will say: ‘the cloud offers infinite capacity,’ but we all know that isn’t true, and at some point, there is a limit to that capacity, even if it is very high,” Khnaser adds.
If the disaster was large enough to, say, affect the East Coast of the United States, and therefore push customers to redeploy their systems on AWS, Khnaser assumes AWS could take the following measures, provided that organisations decide to deploy in a particular Region or two:
• I would assume that AWS would immediately inventory instances and services that they are consuming, which are not critical and can be shut down to allow incoming organisations to power on and use services.
• I would almost guarantee that AWS’ Spot instances would immediately be powered off, and there are a lot of those Spot instances available. This would free up a significant amount of extra capacity in short order.
• Depending on the severity of the outage or expected resource usage, Amazon.com (being one of the bigger consumers of AWS) would also power down certain services and instances that serve the affected area, considering that business would be affected.
If not power down, it would at least reduce the number of instances, as their business would inevitably be affected and, therefore, they would not need that capacity.
• If need be, I would also suspect that AWS would contact clients on the platform, asking them to shut down non-critical instances or reduce instances, if possible.
• As a final resort, I would not be surprised if AWS, facing a capacity “drought,” would consider helping organisations deploy on another cloud provider’s infrastructure, similar to how airlines will leverage each other in times of need.
Of course, that is a final desperate measure, but if the disaster is that grave, I am sure the greater good would be put forth.
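The measures above read as an escalation ladder: free up capacity one step at a time and stop as soon as incoming demand is covered. A minimal Python sketch of that idea follows. It is an illustration only, not a description of how AWS actually operates; the `Measure` type, the `escalate` function and every capacity figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    name: str
    reclaimable: int  # instance slots this step could free (made-up figure)


def escalate(demand, free, measures):
    """Apply measures in order until free capacity covers demand.

    Returns (names of the measures applied, remaining shortfall).
    """
    applied = []
    for m in measures:
        if free >= demand:
            break  # demand is covered, no need to escalate further
        free += m.reclaimable
        applied.append(m.name)
    return applied, max(0, demand - free)


# Hypothetical ladder mirroring the list above; all numbers are invented.
LADDER = [
    Measure("shut down the provider's own non-critical services", 2_000),
    Measure("power off Spot instances", 5_000),
    Measure("scale down Amazon.com workloads for the affected area", 3_000),
    Measure("ask customers to stop non-critical instances", 1_500),
]

applied, shortfall = escalate(demand=9_000, free=1_000, measures=LADDER)
print(applied, shortfall)
```

A shortfall that is still above zero after the last step corresponds to the final resort Khnaser describes, where some workloads would have to land on another provider.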
But, what would Azure do?
“Well, for the sake of not being repetitive, they can adopt very similar measures,” Khnaser adds. “They can also reduce the instances consumed by their other businesses (Xbox comes to mind immediately).”
Khnaser’s analysis is based on a disaster the industry has not yet seen, while not considering the social factor.
“In the event of that perfect storm, I think that the cloud service providers still have better options at their disposal than what would be available to enterprises not leveraging the cloud today,” he concludes.