Computerworld

Startup claims it saw early signs of Amazon's cloud outage

Almost two hours before Amazon Web Services publicly acknowledged an outage that brought down websites such as Reddit, Imgur and Heroku, application monitoring startup Boundary claims it started noticing the problem.

The company is developing an early-warning system for cloud outages, and it says AWS's most recent incident proved for the first time that its service works at scale.

Boundary is an application performance management (APM) tool that installs an agent that monitors second-by-second performance of virtual machines running in a public cloud or data center. The information from the VMs is sent into Boundary's cloud where it is analyzed. Boundary relays information to customers about the health of their system and aggregates data from its hundreds of customers to monitor trends in the cloud.

On Monday, almost two hours before AWS officially announced an issue in its Elastic Block Store (EBS), a volume storage service used in conjunction with its Elastic Compute Cloud (EC2), Boundary started noticing abnormal activity in AWS's cloud, the company says.

During the next two hours, nearly a third of the agents among Boundary's more than 300 customers stopped reporting back to Boundary's cloud at one time or another. Data transfer from AWS's cloud to Boundary's data analysis servers dropped 27% from 38Mbps to 28Mbps.

Round-trip latencies between the agents and Boundary's cloud increased by three times their normal levels, the company says. The latency in VM reporting continued until 2 a.m. PT on Tuesday, when AWS reported that the issue had mostly been resolved. Boundary's Stamos detailed the company's findings in a blog post.

Boundary only tracks the performance of the VMs, so she says there's no way to know what caused the issue on Monday. The decreased network traffic could mean customers were experiencing performance problems on their own instances, which were then being reflected in Boundary's tools, or that there was a problem with the VMs' ability to send tracking data to Boundary from AWS. Either way, it was enough of an abnormal spike, Stamos says, that they knew something was up. "There's no way to go inside Amazon's infrastructure, what we're trying to do is be a leading indicator, alerting customers that there is a problem developing," she says.
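A leading indicator of this kind can be illustrated with a simple threshold check over the three signals Boundary describes: agents going silent, falling throughput and rising latency. The sketch below is only an illustration under stated assumptions. The FleetSample type, the warning_signals function, the threshold values and the 100 ms baseline latency are all hypothetical and do not describe Boundary's actual product or data.

```python
from dataclasses import dataclass


@dataclass
class FleetSample:
    """One aggregate observation across all monitored agents (hypothetical shape)."""
    agents_total: int        # agents expected to report (assumed non-zero)
    agents_reporting: int    # agents that actually reported
    throughput_mbps: float   # data arriving at the analysis servers
    latency_ms: float        # round-trip latency between agents and the collector


def warning_signals(baseline, current,
                    max_silent_fraction=0.25,
                    max_throughput_drop=0.20,
                    max_latency_ratio=2.0):
    """Return the names of the signals that crossed their (assumed) thresholds."""
    signals = []

    # Fraction of agents that have stopped reporting.
    silent = 1 - current.agents_reporting / current.agents_total
    if silent >= max_silent_fraction:
        signals.append("agents_silent")

    # Relative drop in throughput compared with the baseline.
    drop = 1 - current.throughput_mbps / baseline.throughput_mbps
    if drop >= max_throughput_drop:
        signals.append("throughput_drop")

    # Latency expressed as a multiple of the baseline.
    if current.latency_ms / baseline.latency_ms >= max_latency_ratio:
        signals.append("latency_spike")

    return signals


# Illustrative numbers only: throughput uses the article's 38Mbps and 28Mbps,
# while the agent counts and latencies are invented for the example.
baseline = FleetSample(300, 300, 38.0, 100.0)
current = FleetSample(300, 200, 28.0, 300.0)
print(warning_signals(baseline, current))
```

Fixed thresholds like these are the simplest possible design and are prone to the false alarms discussed later in the article; a production system would more likely compare against a learned baseline.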

Boundary hasn't fully developed that functionality yet, though. In the most recent incident, Boundary didn't actually inform customers that AWS was experiencing abnormal activity; it just reported the results on its website. The company hopes to use this data in the future to create that early-warning system for users.

Having that knowledge could be critical, she argues. If customers were alerted to performance issues within an Availability Zone, they could switch workloads out of it and into another, unaffected Availability Zone, to another cloud provider, or into their own data center.

None of this is easy to do, let alone to do right, says Jim Frey, managing research director for Enterprise Management Associates. The predictive analytics industry is still very young. Frey says that if he had seen the change in activity that Boundary noticed on Monday, it might have raised an eyebrow, but it's difficult to predict whether such anomalies will lead to a significant event, such as Monday's outage. "Many cases in IT there's smoke before there's fire," Frey says. "The problem is when you cry wolf."

Boundary is not alone in offering predictive analytics capabilities for the cloud, but the company does take a unique approach to the issue. Traditionally, APM tools have measured the output performance of virtual machines. A variety of players do this, from NetScout to Riverbed to Network Instruments. Big-name tech companies such as IBM, HP and CA have APM tools as well.

Boundary, by contrast, installs an agent that tracks individual VMs, which Frey says provides both a more intricate and holistic view of a cloud environment. Plus, the agent is able to follow the VM wherever it goes, whether on a public cloud or in the data center.

Also, if an administrator does want to transfer workloads out of a certain Availability Zone, or across to another cloud provider, the system has to be architected to support that transition beforehand. Load balancers have to be in place, the application has to be horizontally scalable and the new VM instances have to be onboarded quickly. If the system has been architected that way, then at the first sign of a problem, the workloads could theoretically be transferred out of the impacted Availability Zone. Just how many cloud users have such a system set up today is unclear, Frey says.

As for the false positives, Boundary is collecting a lot of data, and the more data it collects, the better it will be at knowing which issues are real problems and which are insignificant hiccups.

Boundary released its product in April, and it hopes to roll out additional features in the coming weeks. Customers currently receive a 10-minute history of their system's performance at one-second intervals. The goal is to offer a 24-hour look-back of system performance, plus one month's worth of minute-by-minute data, or a year's worth of hourly data. Alerts warning customers of potential outage events are a goal of the company as well.
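Tiered retention like this is commonly built by rolling fine-grained samples up into coarser averages as they age. The following sketch shows one way that could look; it is not Boundary's implementation. The rollup and points_stored functions are hypothetical, averaging is an assumed aggregation, and the example assumes a 30-day month and one-second resolution for the 24-hour tier, which the company has not specified.

```python
from statistics import mean

# (label, resolution in seconds, retention in seconds); values are assumptions
# loosely based on the goals described in the article.
TIERS = [
    ("second", 1, 24 * 3600),
    ("minute", 60, 30 * 24 * 3600),
    ("hour", 3600, 365 * 24 * 3600),
]


def rollup(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % bucket_seconds  # align to the start of the bucket
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def points_stored(resolution_seconds, retention_seconds):
    """Number of data points one metric needs in a tier."""
    return retention_seconds // resolution_seconds


# Three one-second samples collapse into two one-minute averages.
print(rollup([(0, 1.0), (30, 3.0), (60, 5.0)], 60))

# Storage cost per metric in each tier.
for label, resolution, retention in TIERS:
    print(label, points_stored(resolution, retention))
```

The point of the tiers is storage cost: under these assumptions a year of hourly data takes far fewer points per metric than a single day of per-second data.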

Meanwhile, AWS has not yet released details of exactly what caused this week's incident, but the outage represents the third significant incident in two years. About two weeks after the company's last major outage in July, it issued a detailed post-mortem report explaining that power outages, bugs and bottlenecks caused the problem. The company may do the same for this one.

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be reached at BButler@nww.com and found on Twitter at @BButlerNWW.