ec2 - News, Features, and Slideshows

News about ec2
  • VMware causes second outage while recovering from first

    VMware's attempt to recover from an outage in its brand-new cloud computing service inadvertently caused a second outage the next day, the company said.
    VMware's new Cloud Foundry service - which is still in beta - suffered downtime over the course of two days last week, not long after the more highly publicized outage that hit Amazon's Elastic Compute Cloud.
    Cloud Foundry, a platform-as-a-service offering for developers to build and host Web applications, was announced April 12 and suffered "service interruptions" on April 25 and April 26.
    The first downtime incident was caused by an outage in the power supply for a storage cabinet. Applications remained online, but developers weren't able to perform basic tasks such as logging in or creating new applications. The outage lasted nearly 10 hours and was fixed by the afternoon.
    But the next day, VMware officials accidentally caused a second outage while developing an early detection plan to prevent the kind of problem that hit the service the previous day.
    VMware official Dekel Tankel explained that the April 25 power outage is "something that can and will happen from time to time," and that VMware has to ensure that its software, monitoring systems and operational practices are robust enough to prevent power outages from taking customer systems offline.
    With that in mind, VMware began developing "a full operational playbook for early detection, prevention and restoration" the very next day.
    "At 8am [April 26] this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon," Tankel wrote. "This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."
    The second-day outage was the more serious of the two.
    "This was our first total outage, which is an event where we need to put up a maintenance page," Tankel continued. "During this outage, all applications and system components continued to run. However, with the front-end network down, we were the only ones that knew that the system was up. By 11:30 a.m. PDT the front end network was fully operational."
    VMware's second-day problem illustrated the element of human error in cloud networks, just as the root-cause analysis of Amazon's cloud outage did. In the case of Amazon, a mistake made during a system upgrade led to trouble that took several days to fully correct. (See also: "Amazon: Bad execution during planned upgrade caused outage")
    VMware, which is best known for its server virtualization technology, is a new player in offering a publicly available cloud service. Previously, VMware sold technology to help customers and service providers build their own clouds.
    Because Cloud Foundry is so new, the customer impact was not as severe as that of Amazon's outage, which knocked offline numerous websites that rely on Amazon infrastructure. But VMware is getting a taste of what it's like to be a service provider when things go wrong.

  • Opinion: The failure behind the Amazon outage isn't just Amazon's

    When Amazon.com's outage last week - specifically, the failure of its EBS (Elastic Block Store) subsystem - left popular websites and services such as Reddit, Foursquare, and Hootsuite crippled or outright disabled, the blogosphere blew up with noise around the risks of using the cloud. Although a few defenders spoke up, most of these instant experts panned the cloud and Amazon.com. The story was huge, covered by the New York Times and the national business press; Amazon.com is now "enjoying" the same limelight that fell on Microsoft in the 1990s. It will be watched carefully for any weakness and rapidly kicked when issues occur.
    It's the same situation we've seen since we began to use computers: They are not perfect, and from time to time, hardware and software fails in such a way that outages occur. Most cloud providers, including Amazon.com, have spent a lot of time and money to create advanced multitenant architectures and advanced infrastructures to reduce the number and severity of outages. But to think that all potential problems are eliminated is just being naive.
    Some of the blame around the outage has to go to those who made Amazon.com a single point of failure for their organizations. You have to plan and create architectures that can work around the loss of major components to protect your own services, as well as make sure you live up to your own SLA requirements.
    Although this incident does indeed show weakness in the Amazon.com cloud, it also highlights liabilities in those who've become overly dependent on Amazon.com. The affected companies need to create solutions that can fail over to a secondary cloud or locally hosted system - or they will again risk a single outage taking down their core moneymaking machines. I suspect the losses around this outage will easily track into the millions of dollars.
    Never trust a single system component, be it a cloud, a network, a router, a database, or whatever. Figure out what to do when a component goes offline or fails in other ways. The typical solution is to fail over to secondary components that can operate until the primary is back online. That used to be a given in IT. Unfortunately, many organizations have put too much trust in clouds, pushing their systems out to providers with the incorrect thought that a third party will provide the resiliency and the redundancy they require.
    As we've seen so dramatically, clouds have limitations, too. Don't get mad at that fact - just deal with it.
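
    The failover pattern described above, trying the primary and falling back to a secondary when it fails, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the backend names and the functions `primary_cloud` and `secondary_host` are hypothetical stand-ins, not part of any provider's API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any secondary could serve the call."""


def call_with_failover(backends: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each backend in priority order and return the first success.

    `backends` is an ordered list of (name, callable) pairs: primary first,
    then one or more secondaries.
    """
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch()
        except Exception as exc:  # any failure means "try the next component"
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical backends, for illustration only.
def primary_cloud() -> str:
    raise ConnectionError("primary cloud unreachable")


def secondary_host() -> str:
    return "served from secondary"


used, result = call_with_failover(
    [("primary", primary_cloud), ("secondary", secondary_host)]
)
print(used, "->", result)
```

    A real deployment would also need health checks, data replication between the two sites, and a way to fail back once the primary recovers; the sketch covers only the routing decision.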

  • Amazon investigating cloud service outage

    Several days after Amazon.com's cloud outage knocked some high-profile websites offline, the company said its cloud service was largely back up and running. Now Amazon is trying to track down the root of the problem.
    The outage partially disabled or knocked out popular websites including Quora, Foursquare and Reddit.
    On Saturday, two days after Amazon suffered a failure in its web hosting services, the company announced that it had fixed most of the problem. However, the latest update on Amazon's Service Health Dashboard noted that engineers are still working on some remaining issues with its EBS, or Elastic Block Store.
    At 10:35 pm ET on Sunday, Amazon reported, "We're in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes."
    Users still having problems with their hosted Web sites should contact Amazon on their Web Services site. Users should select Amazon Elastic Compute Cloud in the "Services" field. And in the description field, they should list the instance and volume IDs and describe the issue they're experiencing.
    The company also noted on its dashboard that workers are "digging deeply into the root causes of this event" and will post their findings in a post mortem.
    The trouble started a little after 5 a.m. Eastern on Thursday when the company's Service Health Dashboard reported connectivity problems that were affecting its Relational Database Service, which is used to manage a relational database in the cloud, across multiple zones in the eastern U.S.
    Because of server problems at Amazon's data center, which handles the company's EC2 Web hosting services, some websites, including popular Web 2.0 sites, were left staggering or disabled.
    Web sites Reddit, Foursquare, Quora and HootSuite, which suffered through Amazon's outage, are back up today.
