EBS fails to bounce back
Weekend cloud outage puts Amazon in the shade
For 49 minutes last Sunday, a northern Virginia-based Amazon data storage center suffered an outage. The crash rippled out to the center’s clients, taking down some behemoths of the app space, including Instagram, Netflix, and Twitter’s Vine. Heroku and IFTTT were also impacted. With another crash in the eastern US a few days earlier bringing down Amazon’s e-commerce center and potentially wiping out thousands of dollars’ worth of profit, it’s been a hectic few weeks for the company’s engineers.
Frustratingly for Amazon, its Elastic Block Store (EBS) was cited as the crash culprit for the third time in two years. This network-attached, block-level storage service is an all-singing, all-dancing entity, used by applications for databases, file systems, or access to raw block-level storage. It is an integral part of the Amazon Web Services arsenal for attracting a diverse range of clients, including the CIA, whose recent $600 million investment saw Amazon beat out established computing rivals such as IBM and Oracle.
In total, the company’s global infrastructure-as-a-service and platform-as-a-service offerings generated $2.25 billion in revenue in the second quarter of 2013, a colossal 28 percent of the market. Ubiquity doesn’t necessarily translate into infallibility, however, and with a hat trick of major crashes now blotting Amazon’s EBS copybook, it’s clear that some serious rethinking needs to be done.
It’s not entirely Amazon’s fault, though. The company is an open advocate of geographical redundancy, encouraging customers to spread their data across multiple regions to guard against all-out system failure. The number of high-profile sites impacted by Sunday’s glitch demonstrates that, for now, even major traffic hubs are shunning this advice. Although some impacted sites, such as Facebook-owned Instagram, no doubt have some degree of redundancy through in-house data services, clearly it wasn’t enough in this instance.
Amazon noted the problems on the company’s status dashboard at 1.22 Pacific Time, informing users that it was “investigating degraded performance for some volumes in a single [Availability Zone] in the US-EAST-1 Region.”
The company is now performing a ‘forensic’ investigation into how the outage came about, attributing the issues to a network problem that triggered elevated EBS-related API error rates in a single availability zone.
One of the key issues with EBS is that any gremlins in the machine can trigger a domino effect across an entire data cluster. As EBS-dependent company Awe.sm noted in the wake of last December’s crash, “if it goes down when connected to an image when running Ubuntu it fails severely.”
Although all systems now appear to be back online, Amazon’s history suggests this should be a wake-up call for users to rejig their systems to work around potential future failures in the company’s data centers.
Photo by Sektordua