Facebook improves MapReduce job scheduling with Corona
Social networks engineers open source scheduling framework designed to optimise job scheduling efficiency.
The first social network to obtain one billion active users is likely to have a big data problem, and indeed generates “over half a petabyte of new data” every 24 hours.
While the company use MapReduce as the foundation of their data
crunching infrastructure, they found themselves “reaching the
limits of that system” last year, mostly in the capabilities of
MapReduce’s scheduling framework.
Their solution was Corona, which in an introductory blog post they describe as a “scheduling framework that separates cluster resource management from job coordination”. A cluster manager tracks nodes in the cluster and free resources, while each job has its own dedicated job tracker, either sharing the client’s process or having a process of its own.
The team report that, since deploying Corona, slot refill times have reduced from 10 seconds to 600 milliseconds, while a test job run every four minutes has seen its latency drop by roughly half.
Facebook is notorious for building everything in-house, often finding existing technologies too slow for their needs. For example, since 2010 the site has run not on PHP itself, but PHP transformed by HipHop into C++.
Facebook aren’t the first to work on improving handling of
MapReduce tasks: YARN, also known as MapReduce 2.0, also splits the
JobTracker into two separate daemons – a global ResourceManager and
per-application ApplicationMaster. However, citing
incompatibilities with their modified version of Hadoop Distributed
File System that would be “time-prohibitive and risky to fix”,
Facebook instead opted to build their own version from
If you’re interested in the gritty details, the source has been made available on GitHub.
Photo by plasticpeople.