EBS fails to bounce back

Weekend cloud outage puts Amazon in the shade

Lucy Carey

Users of Amazon’s cloud storage system were greeted with an unfortunately familiar scenario on Sunday when the service went down, the third such incident in two years.


For 49 minutes last Sunday, a northern Virginia-based Amazon data storage center suffered an outage. The crash sent a ripple effect through the center’s clients, taking down some behemoths of the app space, including Instagram, Netflix, and Twitter’s Vine; Heroku and IFTTT were also impacted. Coming just days after another crash in the eastern US brought down Amazon’s e-commerce site and potentially wiped out thousands of dollars’ worth of profit, it’s been a hectic few weeks for the company’s engineers.

Frustratingly for Amazon, for the third time in two years, its Elastic Block Store (EBS) was cited as the crash culprit. This network-attached, block-level storage service is an all-singing, all-dancing entity, used by applications for databases, file systems, or access to raw block-level storage. It is an integral part of the Amazon Web Services arsenal for attracting a diverse range of clients, including a recent $600 million cloud contract with the CIA, won by beating out established computing rivals such as IBM and Oracle.
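
For readers unfamiliar with the service, the sketch below shows roughly how an EBS volume is provisioned and attached to an EC2 instance using the boto3 Python SDK. The instance ID, size, and device name are placeholders for illustration, not details from the outage itself.

import boto3

# Minimal sketch: create an EBS volume and attach it to a running
# EC2 instance as a raw block device. IDs and sizes are illustrative.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # EBS volumes live in a single AZ
    Size=100,                       # size in GiB
    VolumeType="gp2",
)

# Wait until the volume is ready before attaching it
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
    Device="/dev/sdf",                 # exposed to the OS as a block device
)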

In total, the company’s global infrastructure-as-a-service and platform-as-a-service offerings generated $2.25 billion in revenue in the second quarter of 2013, a colossal 28 percent of the market. Ubiquity doesn’t necessarily translate into infallibility, however, and with a hat trick of major crashes now blotting Amazon’s EBS copybook, it’s clear that some serious rethinking needs to be done.

It’s not entirely Amazon’s fault, though. The company is an open advocate of geographical redundancy, encouraging customers to spread their data across multiple regions to guard against all-out system failure. The number of high-profile sites impacted by Sunday’s glitch demonstrates that, for now, even major traffic hubs are shunning this advice. Although some impacted sites, such as Facebook-owned Instagram, no doubt have some degree of redundancy thanks to in-house data services, it clearly wasn’t enough in this instance.
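
By way of illustration, one simple form of the cross-region redundancy Amazon recommends is to snapshot an EBS volume and copy the snapshot to a second region, so data can be restored even if an entire region goes dark. The sketch below uses the boto3 Python SDK; the volume ID and region names are hypothetical.

import boto3

# Minimal sketch of cross-region redundancy for an EBS volume:
# snapshot it in the source region, then copy the snapshot elsewhere.
east = boto3.client("ec2", region_name="us-east-1")
west = boto3.client("ec2", region_name="us-west-2")

snapshot = east.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # hypothetical volume ID
    Description="Routine backup before cross-region copy",
)

# Wait for the snapshot to finish before copying it
east.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# copy_snapshot is called against the destination region
copy = west.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snapshot["SnapshotId"],
    Description="Cross-region copy for disaster recovery",
)
print("Copied snapshot:", copy["SnapshotId"])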

Amazon noted the problems on the company’s status dashboard at 1:22 Pacific Time, informing users that it was “investigating degraded performance for some volumes in a single [Availability Zone] in the US-EAST-1 Region.”

The company is now performing a ‘forensic’ investigation into how the outage came about, attributing the issues to a network problem that triggered elevated EBS-related API error rates in a single availability zone within the US-EAST-1 region.

One of the key issues with EBS is that any gremlins in the machine can trigger a domino effect across an entire data cluster, and, as EBS-dependent company Awe.sm noted in the wake of last December’s crash, “if it goes down when connected to an image when running Ubuntu it fails severely.”

Although all systems now appear to be back online, given Amazon’s history, this might be a wake-up call for users to rejig their systems to work around future failures in the company’s data center hubs.

Photo by Sektordua
