Nothing to do with beer, sadly

Facebook improves MapReduce job scheduling with Corona

Elliot Bentley

Social network’s engineers open source scheduling framework designed to optimise job scheduling efficiency.

The first social network to obtain one billion active users is likely to have a big data problem, and indeed generates “over half a petabyte of new data” every 24 hours.

While the company use MapReduce as the foundation of their data crunching infrastructure, they found themselves “reaching the limits of that system” last year, mostly in the capabilities of MapReduce’s scheduling framework.

Their solution was Corona, which in an introductory blog post they describe as a “scheduling framework that separates cluster resource management from job coordination”. A cluster manager tracks nodes in the cluster and free resources, while each job has its own dedicated job tracker, either sharing the client’s process or having a process of its own.

The team report that, since deploying Corona, slot refill times have reduced from 10 seconds to 600 milliseconds, while a test job run every four minutes has seen its latency drop by roughly half.

Facebook is notorious for building everything in-house, often finding existing technologies too slow for their needs. For example, since 2010 the site has run not on PHP itself, but PHP transformed by HipHop into C++.

Facebook aren’t the first to work on improving handling of MapReduce tasks: YARN, also known as MapReduce 2.0, also splits the JobTracker into two separate daemons – a global ResourceManager and per-application ApplicationMaster. However, citing incompatibilities with their modified version of Hadoop Distributed File System that would be “time-prohibitive and risky to fix”, Facebook instead opted to build their own version from scratch.

If you’re interested in the gritty details, the source has been made available on GitHub.

Photo by plasticpeople.

Inline Feedbacks
View all comments