LinkedIn open sourcing optimization tool for Hadoop and Spark
LinkedIn has open sourced Dr. Elephant, a tool that helps Hadoop users understand and optimize their flows and that solves about 80 percent of problems through simple diagnosis.
LinkedIn open sourced Dr. Elephant, a performance monitoring and tuning tool which helps Hadoop and Spark users understand, analyze and improve their flows’ performance. Dr. Elephant was first presented to the community in 2015 during the eighth annual Hadoop Summit.
The goal of Dr. Elephant is to improve developer productivity and boost cluster efficiency. It analyzes Hadoop and Spark jobs using a collection of pluggable, configurable, rule-based heuristics that offer insight into how a job performed, then uses the results to suggest how to tune the job so that it runs more efficiently.
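To make "pluggable, configurable, rule-based heuristics" more concrete, here is a minimal Python sketch of the idea. It is not Dr. Elephant's actual code: the registry, the `Finding` type, the `mapper_input_size` rule, its `min_avg_input_mb` parameter and the job fields are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    heuristic: str
    triggered: bool
    suggestion: Optional[str]


# Registry of heuristics; adding a rule means registering one more function.
HEURISTICS: Dict[str, Callable[[dict, dict], Finding]] = {}


def heuristic(name: str):
    """Decorator that plugs a rule into the registry under `name`."""
    def register(func):
        HEURISTICS[name] = func
        return func
    return register


@heuristic("mapper_input_size")
def mapper_input_size(job: dict, params: dict) -> Finding:
    # Hypothetical rule: flag jobs whose mappers each read very little data.
    avg_mb = job["map_input_mb"] / job["map_tasks"]
    min_mb = params.get("min_avg_input_mb", 128)  # configurable threshold
    if avg_mb < min_mb:
        return Finding(
            "mapper_input_size",
            True,
            f"Average mapper input is {avg_mb:.0f} MB; consider fewer, larger splits.",
        )
    return Finding("mapper_input_size", False, None)


def analyze(job: dict, config: dict) -> List[Finding]:
    """Run every registered heuristic with its own section of the config."""
    return [rule(job, config.get(name, {})) for name, rule in HEURISTICS.items()]


job = {"map_input_mb": 1000, "map_tasks": 100}
for finding in analyze(job, {}):
    print(finding)
```

The rule itself stays simple. What makes the approach flexible is that rules can be added or re-tuned through configuration without touching the analysis loop.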
Why does LinkedIn need Dr. Elephant?
The majority of Hadoop optimization tools are designed to collect system resource metrics and monitor cluster resources, and they focus on simplifying the deployment and management of Hadoop clusters. Only a handful of tools focus on helping Hadoop users optimize their flows, and those that exist are either inactive or have failed to scale and support the growing number of Hadoop frameworks. Dr. Elephant supports Hadoop with a collection of frameworks and can be extended to newer ones. It also supports Spark.
LinkedIn's employees have different levels of Hadoop experience and use different frameworks to run their Hadoop jobs. As the number of Hadoop users grew, holding regular sessions for different users on distinct frameworks stopped working. The company could not verify that jobs achieved optimal performance or guarantee performance coverage, which is why it needed to standardize and automate the process.
What’s Dr. Elephant’s expertise and how does it work?
The tool has continued to grow since its birth in mid-2014. Its capabilities now include job-level comparison of flows, out-of-the-box integration with the Azkaban scheduler, and support for adding any other Hadoop scheduler, such as Oozie. Dr. Elephant also offers pluggable and configurable rule-based heuristics that diagnose a job, a representation of the historic performance of jobs and flows, diagnostic heuristics for MapReduce and Spark, and a REST API to fetch all the information. It is easily extensible to newer job types, applications, and schedulers.
At regular intervals, the tool receives a list of all recently succeeded and failed applications from the YARN resource manager, while the metadata for every application is fetched from the Job History server. After it gathers the metadata, the tool runs a series of heuristics on it and produces a diagnostic report on how the individual heuristics and the job as a whole performed. Each heuristic result and each job is tagged with one of five severity levels to indicate potential performance issues.
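The cycle described above (list recent applications, fetch their metadata, run the heuristics, tag the results) can be sketched roughly as follows. This is an illustrative Python outline under stated assumptions, not the project's implementation: the two fetcher callables stand in for the YARN resource manager and the Job History server, the metric names and thresholds are made up, the five level names are illustrative, and taking the worst heuristic result as the job's overall severity is an assumption.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Five levels, ordered from no concern to most urgent (names illustrative).
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def score(value, thresholds):
    """Map a metric to a severity by counting how many of the four
    ascending thresholds it meets or exceeds."""
    return Severity(sum(value >= limit for limit in thresholds))


def run_cycle(list_recent_apps, fetch_metadata, heuristics):
    """One polling cycle: diagnose every recently finished application."""
    reports = {}
    for app_id in list_recent_apps():
        metadata = fetch_metadata(app_id)
        per_heuristic = {name: check(metadata) for name, check in heuristics.items()}
        # Assumption: the job as a whole is as bad as its worst heuristic.
        overall = max(per_heuristic.values(), default=Severity.NONE)
        reports[app_id] = (per_heuristic, overall)
    return reports


# Hypothetical in-memory stand-ins for the two Hadoop services.
FAKE_HISTORY = {
    "app_1": {"gc_ms": 25, "cpu_ms": 1000, "spilled": 100, "output": 100},
    "app_2": {"gc_ms": 1, "cpu_ms": 1000, "spilled": 400, "output": 100},
}

heuristics = {
    "gc_ratio": lambda m: score(m["gc_ms"] / m["cpu_ms"], (0.01, 0.02, 0.03, 0.04)),
    "spill_ratio": lambda m: score(m["spilled"] / m["output"], (2.0, 2.5, 3.0, 3.5)),
}

reports = run_cycle(lambda: list(FAKE_HISTORY), FAKE_HISTORY.__getitem__, heuristics)
for app_id, (details, overall) in reports.items():
    print(app_id, overall.name, {k: v.name for k, v in details.items()})
```

In a real deployment the two callables would be replaced by clients for the cluster's services, and `run_cycle` would be invoked on a timer.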
LinkedIn announced the expansion of its open source software portfolio roughly two weeks after open sourcing Simoorg, the company's failure induction framework. Now that it is open source, Dr. Elephant can be tailored to the needs of the analytics cluster on which it is deployed.