New Workflow Scheduling and Coordination System

Yahoo!’s Hadoop-Based Project Proposed for Apache Incubator

Jessica Thornsby

“Oozie is built on Hadoop to address the most common requirements for handling job dependency, data dependency, and time dependency.”

Last month, ‘Oozie’ was proposed as a new Apache Incubator project. Oozie started life as a Yahoo!-internal project and was released on GitHub in early 2010. In this interview, we speak to Angelo Huang, senior software engineer at Yahoo! Hadoop Cloud Computing, about what the Oozie project is looking to bring to the Apache ecosystem.

JAXenter: Oozie has just been proposed as a new project under the Apache Incubator. Can you give us an introduction to Oozie?

Angelo Huang: Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop. Oozie is built on Hadoop to address the most common requirements for handling job dependency, data dependency, and time dependency.

Dependencies between jobs can be expressed as a directed acyclic graph (DAG), which Oozie calls a workflow. A node in a workflow is typically a job, such as a map-reduce, streaming, pipes, Pig, Hive, HDFS operation, or Java program. Oozie provides an execution framework for defining workflows and scheduling jobs, so that jobs can be executed sequentially, each starting only after the job it depends on has finished.
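To illustrate the idea of a workflow as a directed acyclic graph, here is a minimal Python sketch. It is not Oozie's workflow definition format or API. The job names and the `execution_order` helper are hypothetical and only show how an order of execution can be derived from declared dependencies.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+


def execution_order(dependencies):
    """Return job names in an order that respects every dependency.

    `dependencies` maps each job to the set of jobs it must wait for.
    Raises graphlib.CycleError if the graph contains a cycle, in which
    case it is not a valid workflow.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: two independent jobs feed a join, then a report.
workflow = {
    "pig_clean_logs": set(),
    "hive_load_users": set(),
    "mapreduce_join": {"pig_clean_logs", "hive_load_users"},
    "java_report": {"mapreduce_join"},
}

order = execution_order(workflow)
print(order)
```

The two jobs without dependencies may appear in either order, but the join always runs after both of them and the report always runs last.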

Furthermore, some applications or workflows need to run at periodic intervals or when the data they depend on becomes available. For example, a workflow could be executed every day as soon as output data from the previous 24 instances of another, hourly workflow is available. The workflow coordinator provides such scheduling features, along with prioritization, load balancing and throttling to optimize utilization of resources in the cluster. This makes it easier to maintain, control, and coordinate complex data applications.
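The daily-after-24-hourly example can be sketched in a few lines of Python. This is an illustration of the data-dependency check only, not Oozie's coordinator API. The function names, the use of timestamps to identify hourly outputs, and the `available` set are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def required_hourly_instances(nominal_time, count=24):
    """Timestamps of the `count` hourly instances preceding nominal_time,
    oldest first."""
    return [nominal_time - timedelta(hours=h) for h in range(count, 0, -1)]


def ready_to_run(nominal_time, available):
    """True once every required hourly output is present in `available`."""
    return all(ts in available for ts in required_hourly_instances(nominal_time))


# Hypothetical daily run at midnight, depending on the previous day's hours.
daily_run = datetime(2011, 6, 2, 0, 0)
available = {datetime(2011, 6, 1, hour, 0) for hour in range(24)}

print(ready_to_run(daily_run, available))  # all 24 hourly outputs exist

# If one hourly output is missing, the daily workflow has to keep waiting.
incomplete = available - {datetime(2011, 6, 1, 13, 0)}
print(ready_to_run(daily_run, incomplete))
```

A real coordinator would also poll for the data, apply timeouts, and throttle concurrent runs; the sketch covers only the readiness decision.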

Nearly three years ago, a team of Yahoo! developers addressed these critical requirements for Hadoop-based data processing systems by developing a new workflow management and scheduling system called Oozie. While it was initially developed as a Yahoo!-internal project, it was designed and implemented with the intention of open-sourcing it. Oozie was released as a GitHub project in early 2010. Oozie is used in production within Yahoo!, and since being open-sourced it has been gaining adoption among external developers.

JAXenter: What unique requirements does Oozie address, for Hadoop-based data processing systems?

Angelo: Commonly, applications that run on Hadoop require multiple Hadoop jobs in order to obtain the desired results. Furthermore, these Hadoop jobs are commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell scripts.

Because of this, developers find themselves writing ad-hoc glue programs to combine these Hadoop jobs. These ad-hoc programs are difficult to schedule, manage, monitor and recover.

Oozie addresses this challenge by providing an execution framework in which job, data, and time dependencies can be specified flexibly. In addition, Oozie provides a centralized, multi-tenant service and the opportunity to optimize load and utilization while respecting SLAs.

JAXenter: Who is the Oozie project targeting?

Angelo: Oozie provides a workflow management and scheduling system for any user or company with data processing needs. Most organizations used to run scripts or cron jobs to handle data and time dependencies across a combination of Hadoop jobs. Oozie is a clean and cost-efficient way for any organization to run multiple types of Hadoop jobs without the hassle of handling dependencies and execution flows in its own applications. Moreover, Oozie is an important project in the Apache Hadoop ecosystem for managing and scheduling miscellaneous jobs. As more and more projects and tools are added to the Hadoop ecosystem, Oozie gives users the option to plug in a new type of job and include it in an existing or newly created workflow application.

JAXenter: What potential benefits do you see, if the project successfully joins the Apache ecosystem?

Angelo: Oozie is built on Apache Hadoop to schedule jobs related to various Apache projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie is expected to attract the larger and more diverse community that currently uses such Apache-sponsored projects. Other users of the Hadoop ecosystem will benefit from existing features and can contribute to making Hadoop-related projects better. Additionally, users of the Hadoop ecosystem can influence Oozie’s roadmap and contribute to it. Likewise, Oozie, as part of the Apache Hadoop ecosystem, will be a great benefit to the current Hadoop/Pig/Hive/HBase/HCatalog community.
