Bringing data back – Overcoming common issues for DataOps teams
What most large organisations are missing is a unified view of their data applications. Without this, there is no magic bullet to improve your data operations, make applications reliable and improve overall system and application delivery efficiency. Kunal Agarwal, CEO of Unravel Data, talks about some common challenges for data operations teams and how to overcome them.
Broadly speaking, enterprises and other large organisations face data application challenges that fall into two areas: those seen from the application viewpoint and those seen from the operations perspective.
Whether their data infrastructure runs on tens or thousands of nodes, most organisations rely on multi-system pipelines using Spark, Kafka, Hadoop and NoSQL in a typical multi-tenant fashion, in which many different applications (ETL, business intelligence, machine learning and so on) depend on a common data infrastructure.
Two groups look after these systems and pipelines: the data operations team, which manages and maintains the platforms, and the engineers and application owners, who build applications on the data stack (with tools such as Hive, Spark, Tez, Kafka and Oozie).
If a business’s data applications get stuck in the doldrums, these groups need to overcome the challenges that keep them from making the most of their data assets.
The developer and data engineer doldrums
Some of the most common application challenges that I hear from people at the forefront of business delivery include:
- An ad-hoc application (likely submitted via Hive, Spark, Impala etc.) that gets stuck, making no forward progress, or fails after running for a while. For example, a Spark job that hangs on its last task and eventually fails, or a Spark job that fails with an executor out-of-memory (OOM) error at a particular stage.
- An application that suddenly starts performing poorly. For example, a Hive query that used to take ~6 hours now takes more than 10 hours.
- No clear understanding of which ‘gears’ (configuration parameters) to change to improve application performance and resource usage.
- The need for a self-serve platform to understand how their specific application(s) behave end to end.
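As a rough illustration of the "suddenly performing poorly" case above, a team could compare each new run against a baseline built from earlier runs. The sketch below is only one possible approach, not a description of any particular product: the function name `is_regression`, the 1.5x tolerance and the sample run times are all assumptions made for the example.

```python
from statistics import median

def is_regression(history_hours, latest_hours, tolerance=1.5):
    """Return True when the latest run exceeds `tolerance` times the median past run."""
    if not history_hours:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history_hours)
    return latest_hours > tolerance * baseline

# Hypothetical run times, in hours, for a recurring Hive query.
past_runs = [5.8, 6.1, 6.0, 6.2, 5.9]

print(is_regression(past_runs, 10.4))  # True: well above 1.5x the ~6 hour baseline
print(is_regression(past_runs, 6.3))   # False: within normal variation
```

The median is used here rather than the mean so that a single unusually slow run in the history does not shift the baseline much.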
To resolve these issues, engineers end up consulting five different sources (e.g. CM/Ambari UI, job history UI, application logs, Spark WebUI, AM/RM UI/logs) to get an end-to-end understanding of application behaviour and performance problems.
Even then, these sources may not be enough to truly understand an application's bottlenecks, because they lack details such as container execution profiles and visibility into the transformations that execute within a Spark stage.
To add to these challenges, many developers do not have access privileges to these systems and have to log tickets with operations teams, which significantly delays solving the issues in front of them.
The data operations doldrums
Further common challenges for data operations teams include the following:
- A lack of visibility into cluster usage from an application perspective. For example:
  - Which application(s) cause my cluster usage (CPU, memory) to spike?
  - Are queues being used optimally?
  - Are various data sets actually being used?
- The need for a comprehensive chargeback/showback view for planning and budgeting purposes.
- Poor visibility and understanding when data pipelines miss SLAs: where does the business start to triage these issues? For example:
  - An Oozie-orchestrated data pipeline that needs to complete by 4AM every morning is now consistently delayed and completes only by 6AM.
- An inability to control or manage runaway jobs that could end up taking far more resources than needed, affecting the overall cluster and starving other applications.
- No quick way to identify inefficient applications that can be reviewed and acted upon. For example:
  - Most operations team members do not have deep Hadoop application expertise, so a way to quickly triage and understand the root causes of application performance degradation would be helpful.
- A need to track and manage runaway jobs automatically.
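Two of the needs listed above, a showback view and automatic flagging of runaway jobs, can be sketched in a few lines once per-application usage is available in one place. The example below is a minimal sketch under stated assumptions: the record fields (`memory_gb_hours`, `running_hours`), the thresholds and the sample data are hypothetical, and a real cluster would need these figures exported from its own resource manager or monitoring tooling.

```python
from collections import defaultdict

# Hypothetical per-application usage records (field names are assumptions).
apps = [
    {"app": "etl_nightly", "queue": "etl", "memory_gb_hours": 120.0, "running_hours": 2.0},
    {"app": "bi_dashboard", "queue": "bi", "memory_gb_hours": 40.0, "running_hours": 0.5},
    {"app": "ml_training", "queue": "ml", "memory_gb_hours": 900.0, "running_hours": 14.0},
    {"app": "etl_backfill", "queue": "etl", "memory_gb_hours": 60.0, "running_hours": 1.0},
]

def showback_by_queue(records):
    """Sum memory usage per queue as a simple showback view."""
    totals = defaultdict(float)
    for record in records:
        totals[record["queue"]] += record["memory_gb_hours"]
    return dict(totals)

def runaway_apps(records, max_memory_gb_hours=500.0, max_running_hours=12.0):
    """List applications that exceed either (assumed) resource threshold."""
    return [
        record["app"]
        for record in records
        if record["memory_gb_hours"] > max_memory_gb_hours
        or record["running_hours"] > max_running_hours
    ]

print(showback_by_queue(apps))  # {'etl': 180.0, 'bi': 40.0, 'ml': 900.0}
print(runaway_apps(apps))       # ['ml_training']
```

Fixed thresholds are the simplest option; in practice a team might prefer limits set per queue or derived from each application's own history.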
These application and operations challenges are real pains that prevent enterprises from making their data applications production-ready. This, in turn, delays the return on investment (ROI) that organisations see from their data investments.
What most large organisations are missing is a unified view of their data applications: how they are performing, and whether they are meeting SLAs and cost targets. Without this, there is no magic bullet to improve your data operations, make applications reliable and improve overall system and application delivery efficiency.
To get out of the data doldrums, businesses cannot continue doing what was done in the past. Teams need a detailed appreciation of what they are doing today, what gaps they still have, and what steps they can take to improve business outcomes. It’s not uncommon to see 10x or more improvements in root cause analysis and remediation times for customers who gain a deep understanding of the current state of their big data strategy and make a plan for where they need to be.