New Workflow Scheduling and Coordination System

Yahoo!’s Hadoop-Based Project Proposed for Apache Incubator

Jessica Thornsby

“Oozie is built on Hadoop to solve the most common requirements for handling job dependencies, data dependencies, and time dependencies.”

Last month, ‘Oozie’ was proposed as a new Apache Incubator project. After starting life as a Yahoo!-internal project, Oozie was released on GitHub in early 2010, and a proposal has now been submitted for the project to enter the Apache Incubator. In this interview, we speak with Angelo Huang, a senior software engineer at Yahoo! Hadoop Cloud Computing, about what the Oozie project is looking to bring to the Apache ecosystem.

JAXenter: Oozie has just been proposed as a new
project under the Apache Incubator. Can you give us an introduction
to Oozie?

Angelo Huang: Oozie is a server-based workflow scheduling and coordination system that manages data processing jobs for Apache Hadoop. Oozie is built on Hadoop to solve the most common requirements for handling job dependencies, data dependencies, and time dependencies.

The dependencies between jobs can be expressed as a Directed Acyclic Graph (DAG), which is called a workflow in Oozie. A node in a workflow is typically a job, such as a MapReduce, Streaming, Pipes, Pig, or Hive job, an HDFS operation, or a Java program. Oozie provides an execution framework for defining workflows and scheduling jobs, so jobs can be executed sequentially, each one starting after the previous job has finished.
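
For illustration, a minimal Oozie workflow definition might look like the following sketch. The application name, cluster addresses, mapper and reducer classes, and directories are hypothetical placeholders, and the exact schema version varies between Oozie releases.

<!-- Minimal sketch of an Oozie workflow: a single map-reduce action followed by end.
     The names, hosts, classes, and paths are illustrative placeholders. -->
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ExampleMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.ExampleReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map-reduce action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>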

Furthermore, some applications or workflows need to run at periodic intervals or when dependent data is available. For example, a workflow could be executed every day as soon as the output data from the previous 24 instances of another, hourly workflow is available. The workflow coordinator provides such scheduling features, along with prioritization, load balancing, and throttling to optimize the utilization of resources in the cluster. This makes it easier to maintain, control, and coordinate complex data applications.
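
A coordinator definition for that daily-on-24-hourly-instances example could be sketched roughly as follows; the application and dataset names, URI template, dates, and paths are hypothetical placeholders.

<!-- Sketch of an Oozie coordinator: run a daily workflow once the previous
     24 hourly instances of an input dataset are available.
     Names, dates, and paths are illustrative placeholders. -->
<coordinator-app name="daily-aggregate" frequency="${coord:days(1)}"
                 start="2011-06-01T00:00Z" end="2011-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <!-- Hourly dataset produced by another workflow. -->
        <dataset name="hourly-logs" frequency="${coord:hours(1)}"
                 initial-instance="2011-06-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://namenode/data/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- Wait for the 24 most recent hourly instances. -->
        <data-in name="input" dataset="hourly-logs">
            <start-instance>${coord:current(-23)}</start-instance>
            <end-instance>${coord:current(0)}</end-instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://namenode/apps/daily-aggregate-wf</app-path>
        </workflow>
    </action>
</coordinator-app>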

Nearly three years ago, a team of Yahoo! developers addressed these critical requirements for Hadoop-based data processing systems by developing a new workflow management and scheduling system called Oozie. While it was initially developed as a Yahoo!-internal project, it was designed and implemented with the intention of open-sourcing it. Oozie was released as a GitHub project in early 2010. Oozie is used in production within Yahoo!, and since being open-sourced it has been gaining adoption among external developers.

JAXenter: What unique requirements does Oozie address for Hadoop-based data processing systems?

Angelo: Applications that run on Hadoop commonly require multiple Hadoop jobs to obtain the desired results, and these jobs are typically a combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs, and shell scripts.

Because of this, developers find themselves writing ad-hoc glue programs to combine these Hadoop jobs, and those ad-hoc programs are difficult to schedule, manage, monitor, and recover.

Oozie addresses this challenge by providing an execution framework that lets users flexibly specify job, data, and time dependencies. In addition, Oozie provides a centralized, multi-tenant service and the opportunity to optimize load and utilization while respecting SLAs.
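
As a rough sketch of how such a mixed pipeline can be expressed declaratively rather than with glue code, the following workflow chains a Pig job and a Java program; the script, class, and directory names are hypothetical placeholders.

<!-- Sketch of a workflow that chains two different job types: a Pig job
     that cleans raw data, then a Java program that builds a report from it.
     Script, class, and directory names are illustrative placeholders. -->
<workflow-app name="mixed-pipeline" xmlns="uri:oozie:workflow:0.2">
    <start to="pig-cleanup"/>
    <action name="pig-cleanup">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanup.pig</script>
            <param>INPUT=${rawDir}</param>
            <param>OUTPUT=${cleanDir}</param>
        </pig>
        <ok to="java-report"/>
        <error to="fail"/>
    </action>
    <action name="java-report">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.ReportBuilder</main-class>
            <arg>${cleanDir}</arg>
            <arg>${reportDir}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>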

JAXenter: Who is the Oozie project
targeting?

Angelo: Oozie provides a workflow management and scheduling system for any users and companies with data processing needs. Most organizations have relied on scripts or cron jobs to handle data and time dependencies across a combination of Hadoop jobs. Oozie is a clean and cost-efficient way for any organization to run multiple types of Hadoop jobs without the hassle of handling dependencies and execution flows in their own applications. Moreover, Oozie is an important project in the Apache Hadoop ecosystem for managing and scheduling miscellaneous jobs. As more and more projects and tools are added to the Hadoop ecosystem, Oozie gives users the option to plug in a new type of job and include it in an existing or newly created workflow application.
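
For example, once the server knows about an additional action type such as Hive, it can be dropped into a workflow alongside the built-in actions. The snippet below is only a sketch: the schema URI follows later Oozie releases and may differ, and the node and script names are hypothetical placeholders.

<!-- Sketch of a Hive action used inside a workflow next to built-in actions.
     Schema URI, node name, and script name are illustrative placeholders. -->
<action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>daily-report.q</script>
        <param>INPUT=${inputDir}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>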

JAXenter: What potential benefits do you see if the project successfully joins the Apache ecosystem?

Angelo: Oozie is built on Apache Hadoop and schedules jobs related to various Apache projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie is expected to attract the larger and more diverse community that already uses these Apache-sponsored projects. Other users of the Hadoop ecosystem will benefit from Oozie's existing features, can contribute to make Hadoop-related projects better, and can influence Oozie's roadmap. Likewise, Oozie, as part of the Apache Hadoop ecosystem, will be a great benefit to the current Hadoop/Pig/Hive/HBase/HCatalog community.
