Netflix open-sources Conductor, a microservices orchestration engine
Netflix has open sourced its “orchestration engine” Conductor, which helped orchestrate over 2.6 million process flows ranging from simple linear workflows to very complex dynamic workflows which run over multiple days. Let’s see what is under its hood.
Viren Baraiya, architect at Netflix, explained in a blog post that “the Netflix Content Platform Engineering team runs a number of business processes which are driven by asynchronous orchestration of tasks executing on microservices.” He revealed that some of these are long-running processes spanning several days and emphasized their role in getting titles ready for streaming to Netflix viewers across the globe.
Traditionally, some of these processes (such as content ingestion, encoding, and deployment to CDN; studio partner integration for content ingestion; setting up new titles within Netflix; and IMF-based content ingestion from partners) had been orchestrated in an ad hoc manner using a combination of pub/sub, direct REST calls, and a database to manage the state, Baraiya wrote. However, as the number of microservices increases and the complexity of the processes continues to grow, getting visibility into these distributed workflows becomes difficult without a central orchestrator.
The need for Conductor
This is where Conductor comes into play. It addresses these requirements (such as the ability to synchronously process all the tasks when needed, to scale to millions of concurrently running process flows, and to pause, resume, and restart processes), removes the need for boilerplate in apps, and offers a reactive flow. According to the developer documentation, Conductor was designed to help Netflix orchestrate microservices-based process flows with the following features:
- Allow creating complex process / business flows in which each individual task is implemented by a microservice.
- A JSON DSL based blueprint defines the execution flow.
- Provide visibility and traceability into these process flows.
- Expose control semantics around pause, resume, restart, etc., allowing for a better devops experience.
- Allow greater reuse of existing microservices providing an easier path for onboarding.
- User interface to visualize the process flows.
- Ability to synchronously process all the tasks when needed.
- Ability to scale to millions of concurrently running process flows.
- Backed by a queuing service abstracted from the clients.
- Be able to operate on HTTP or other transports e.g. gRPC.
In a microservices world, many business process automations are driven by orchestrating across services. Conductor enables orchestration across services while providing control and visibility into their interactions. According to the Netflix team, the ability to orchestrate across microservices also helped it reuse existing services to build new flows, or update existing flows to use Conductor, very quickly, which made adoption easier.
At the center of Conductor is a state machine service, also known as the Decider service, which combines the workflow blueprint with the current state of the workflow, identifies the next state, and schedules the appropriate tasks and/or updates the status of the workflow. The Netflix team has been using dyno-queues on top of Dynomite for managing distributed delayed queues.
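The decision step described above can be illustrated with a minimal sketch. It assumes a strictly linear workflow; the blueprint fields, the status strings, and the `decide` function are hypothetical names chosen for illustration and are not Conductor's actual schema or API.

```python
import json

# Hypothetical blueprint: field names are illustrative, not Conductor's schema.
BLUEPRINT = json.loads("""
{
  "name": "prepare_title",
  "tasks": ["ingest", "encode", "deploy_to_cdn"]
}
""")


def decide(blueprint, task_status):
    """Return (workflow_status, task_to_schedule) for the current state.

    task_status maps a task name to "COMPLETED", "FAILED", or "IN_PROGRESS";
    tasks that have not been scheduled yet are absent from the mapping.
    """
    for task in blueprint["tasks"]:
        status = task_status.get(task)
        if status == "COMPLETED":
            continue  # this step is done, look at the next one
        if status == "FAILED":
            return "FAILED", None  # a failed task fails the workflow
        if status == "IN_PROGRESS":
            return "RUNNING", None  # wait for the worker to report back
        return "RUNNING", task  # first unscheduled task: schedule it
    return "COMPLETED", None  # every task in the blueprint has completed


if __name__ == "__main__":
    print(decide(BLUEPRINT, {}))
    print(decide(BLUEPRINT, {"ingest": "COMPLETED"}))
```

A real decider would also have to handle forks, joins, dynamic tasks, retries, and timeouts, none of which this sketch attempts.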
The orchestration engine follows an RPC-based communication model in which workers run on separate machines from the server. Workers communicate with the server over HTTP-based endpoints and employ a polling model for managing work queues.
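As a rough sketch of the polling model, the example below replaces the HTTP endpoints with an in-memory stand-in so that it runs on its own. The `InMemoryServer` class, its `poll` and `update` methods, and `run_worker` are assumptions made for illustration; they do not mirror Conductor's real endpoints or client library.

```python
from collections import deque


class InMemoryServer:
    """Hypothetical stand-in for the server's task queue endpoints."""

    def __init__(self):
        self.queues = {}   # task type -> deque of (task_id, payload)
        self.results = {}  # task id -> output reported by a worker

    def add_task(self, task_type, task_id, payload):
        self.queues.setdefault(task_type, deque()).append((task_id, payload))

    def poll(self, task_type):
        """Return the next pending task, or None if the queue is empty."""
        queue = self.queues.get(task_type)
        return queue.popleft() if queue else None

    def update(self, task_id, output):
        """Record the result a worker reports for a task."""
        self.results[task_id] = output


def run_worker(server, task_type, handler, max_polls):
    """Poll up to max_polls times, executing any task that is returned."""
    done = 0
    for _ in range(max_polls):
        task = server.poll(task_type)
        if task is None:
            continue  # a real worker would sleep before polling again
        task_id, payload = task
        server.update(task_id, handler(payload))
        done += 1
    return done


if __name__ == "__main__":
    server = InMemoryServer()
    server.add_task("encode", "t1", "title-a")
    server.add_task("encode", "t2", "title-b")
    print(run_worker(server, "encode", str.upper, max_polls=5))
    print(server.results)
```

Because the worker pulls work rather than having it pushed, the server does not need to know where workers run or how many there are.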
Tasks (implemented by worker applications) communicate via the API layer; the orchestration engine provides APIs to inspect the workload size for each worker that can be used to autoscale worker instances. The APIs are exposed over HTTP, which allows for ease of integration with different clients.
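One way such a workload figure could feed an autoscaling decision is sketched below. The function name, the `tasks_per_worker` target, and the default bounds are all assumptions for illustration; the article does not describe how Netflix actually sizes its worker pools.

```python
import math


def desired_workers(queue_depth, tasks_per_worker, min_workers=1, max_workers=20):
    """Pick a worker count from the pending-task count for one task type.

    queue_depth is the workload size reported for the worker's queue;
    tasks_per_worker is a hypothetical target backlog per instance.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(queue_depth / tasks_per_worker)
    # Clamp to the configured bounds so the pool never scales to zero
    # or grows without limit.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for depth in (0, 95, 1000):
        print(depth, desired_workers(depth, tasks_per_worker=10))
```

In practice the queue depth would be read periodically from the engine's API and the result passed to whatever system manages the worker instances.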
Netflix uses Dynomite “as a storage engine” along with Elasticsearch for indexing the execution flows.
Baraiya explained that the team started with an early version built on Simple Workflow (SWF) from AWS but chose to build Conductor because of some limitations of SWF. The team needed blueprint-based orchestration (as opposed to the programmatic deciders required by SWF), more synchronous APIs when required (rather than purely message-based ones), and indexing of the inputs and outputs of workflows and tasks, with the ability to search workflows based on them. With SWF, the team also had to maintain a separate data store to hold workflow events in order to recover from failures, search, etc.
AWS Step Functions has since added some of the features Netflix was seeking in an orchestration engine, so Conductor might adopt its states language (the Amazon States Language) to define workflows.