Meet the ringleader of the Simian Army

Netflix unleashes Chaos Monkey

Chris Mayer

Netflix continues its open source pledge, bringing an AWS drill sergeant to make sure your infrastructure is ready for a blackout.

Over the past few months, video streaming behemoth Netflix have been giving back to the developer community, open sourcing a number of interesting tools, including the two Cassandra helpers Astyanax and Priam.

Now the technical team say it’s the right time to uncage the leader of their so-called ‘Simian Army’, Chaos Monkey, a veteran that tests how resilient their Amazon Web Services infrastructure is during times of strife.

Available through GitHub, Chaos Monkey hunts down groups of systems and randomly terminates virtual machine instances within them, simulating what would happen in a disaster scenario. Writing on the Netflix technical blog, Cory Bennett and Ariel Tseitlin tell us why Chaos Monkey should be present in your architecture:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.

Failures happen and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that “simple fix” you put in place last week could have undesired consequences.
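To make the idea concrete, here is a minimal in-memory sketch of random instance termination, written in Python. It is not Chaos Monkey's actual code and makes no calls to AWS: the function names `pick_victims` and `terminate`, the `probability` parameter and the instance IDs are all hypothetical, chosen purely for illustration. Each group of instances has a chance of losing one randomly chosen member, and the survivors are what your application would have to keep running on.

```python
import random
from typing import Dict, List, Optional


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """For each non-empty group, with the given probability, pick one instance at random.

    Hypothetical helper for illustration only; not Chaos Monkey's real logic.
    """
    if rng is None:
        rng = random.Random()
    victims: Dict[str, str] = {}
    for name, instances in groups.items():
        # random() returns a float in [0.0, 1.0), so 1.0 always fires and 0.0 never does.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups: Dict[str, List[str]], victims: Dict[str, str]) -> Dict[str, List[str]]:
    """Return a copy of the groups with the chosen victims removed.

    This stands in for a real cloud API call, which is deliberately left out.
    """
    return {
        name: [instance for instance in instances if instance != victims.get(name)]
        for name, instances in groups.items()
    }


if __name__ == "__main__":
    # Made-up group names and instance IDs.
    groups = {"api": ["i-a1", "i-a2", "i-a3"], "web": ["i-w1", "i-w2"]}
    victims = pick_victims(groups, probability=0.5)
    survivors = terminate(groups, victims)
    for name in groups:
        print(name, "terminated:", victims.get(name), "remaining:", survivors[name])
```

Running a loop like this on a schedule during office hours is the spirit of the argument above: the failure arrives when someone is around to notice what broke.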

Whilst this might sound like some sort of mercenary taking out your infrastructure, the benefits of having Chaos Monkey there are numerous. It keeps engineers responsive and alert for when it’s all hands to the pump, and ensures that your infrastructure can cope and that you have a battle plan for Code Red scenarios.

If anyone is well placed to deal with situations like this, it’s Netflix with their massive cloud infrastructure. Crucially, as scrutiny of Amazon Web Services grows with each blackout, it always seems to be Netflix who come out unscathed, no doubt in part due to their Simian Army – a ruthless squadron of tools that puts the Netflix architecture through its paces.

The stats are impressive too:

There are many failure scenarios that Chaos Monkey helps us detect. Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.

With some tweaking of Chaos Monkey’s REST API, you can set up a testing scheme to make sure you aren’t caught out. The great news is that Netflix intends to release the rest of its army in due course, including environment tidier Janitor Monkey and something they call Chaos Gorilla, which simulates an entire AWS zone outage. Get this tool now to make sure AWS doesn’t make a monkey out of you.

Image courtesy of FHGitarre on Flickr
