Getting down to the root cause

Keeping up in real-time: Ensuring reliability across streaming and IoT applications

Shivnath Babu
© Shutterstock / majcot (modified)

If a streaming application starts to lag behind in processing data in real-time, then the root cause may be hard to find manually in the complex application and infrastructure stack. Learn more in this article by Shivnath Babu, co-founder/CTO at Unravel Data Systems, and Adjunct Professor of Computer Science at Duke University.

Most personalised, real-time, streaming applications are likely to run on Kafka, Spark, Kudu, Flink, or HBase to manage the heavy big data requirements of modern cloud-delivered services. To provide these around the clock needs a vast amount of easily accessible data, often comes from connected sensors, providing data such as customer sales, miles driven, GPS, humidity, temperature, and air quality, etc. Many businesses are struggling to get this right, and data pipelines become inefficient, undermining the foundation which app-driven services depend upon for instant consumer gratification.

This faulty foundation causes apps to lag in processing disrupting the entire business function as app depends upon app to deliver services internally and externally.

If a streaming application starts to lag behind in processing data in real-time, then the root cause may be hard to find manually in the complex application and infrastructure stack. Given the reliance on cloud in the delivery of a complex enterprise analytics process, a methodology based on machine learning and AI is one way to guarantee better performance, predictability, and reliability for streaming applications. Advances here apply statistical learning to the full-stack monitoring data available for streaming applications and provide much more detail and granularity when optimising the stack.

This monitoring data includes metrics and logs from applications, from systems like Kafka, Spark Streaming, and Kudu, as well as from on-premise and cloud infrastructure.

The root cause may be an application problem (e.g., poor data partitioning in Spark Streaming) or a system problem (e.g., suboptimal configuration of Kafka), or an infrastructure problem (e.g., multi-tenancy resource contention). Current practices rely on a manual process with popular tools whereas newer, machine learning and AI-based solutions automatically identify the root cause of application bottlenecks, slowdowns, and failures and invoke automated remediation.

As a further benefit, implementing such processes allows for more effective capacity planning to lower costs and raise reliability compared to the organisational guesstimates so common.

Typical use cases – cybersecurity

It’s critical for the well-being of enterprises and large organisations – and is one of the most effective and impactful use cases for big data i.e. fraud detection and security intelligence. Yet, many consider it the most challenging data use, illustrating the complexity of managing the real-time streaming data.

To analyse streaming traffic data, generate statistical features, and train machine learning models to help detect threats on large-scale networks, like malicious hosts in botnets, big data systems require complex and resource-consuming monitoring methods. Analysts may apply multiple detection methods simultaneously, to the same massive incoming data, for pre-processing, selective sampling, and feature generation, adding to complexity and performance challenges. Applications often span across multiple systems (e.g., interacting with Spark for computation, with YARN for resource allocation and scheduling, with HDFS or S3 for data access, with Kafka or Flink for streaming) and may contain independent, user-defined programs, making it inefficient to repeat data preprocessing and feature generation common in multiple applications, especially in large-scale traffic data.

SEE ALSO: DevOps Culture: the Neuroscience of Behavior

These create bottlenecks in execution, hog underlying systems, cause suboptimal resource utilisation, increase failures (e.g., due to out-of-memory errors), and may decrease the chances to detect a threat in time.

Learning new techniques

Newer techniques for enabling dependable workload management for cybersecurity analytics and a better understanding of how to address the operational challenges faced by modern data applications on-premises and in the cloud can transform DevOps.

Better workload management for analytics include being able to:

  • Identify applications sharing common characteristics and requirements and grouping them based on relevant data colocation (e.g., a combination of port usage entropy, IP region or geolocation, time or flow duration in security)
  • Segregate applications with different requirements (e.g., disk i/o heavy preprocessing tasks vs. computational heavy feature selection) submitted by different users (e.g., SOC level 1 vs. level 3 analysts)
  • Allocating applications with increased sharing opportunities and computational similarities to appropriate execution pools/queues

Enabling efficient, continuous metrics collection from logs and sensors through machine learning algorithms enables scrutiny of application execution, identifying the cause of potential failure, and generating recommendations for improving performance and resource usage. This does rely on algorithms for: Improving performance/resource utilisation; providing automatic fixes for failed applications using historical examples of successful/failed runs of the application (or similar applications); and trying out a limited number of alternative configurations to quickly get the application to a running state, followed by getting the application to a resource-efficient, well-running state.

Say a user wants to apply machine learning algorithms to generate insights and recommendations. These techniques are very common and using them boils down to understanding what the business goal is, what are the DevOps goals, and then mapping the right algorithms to the challenge.

As an easy-to-understand problem, take outlier detection in a Kafka world. It’s very common to have the load be imbalanced across brokers or across partitions. It would be common to have one broker only taking one-tenth of a load with respect to others, and often the other way, one broker becoming a hotspot. This can happen at partitions at many different levels. This is a problem which can be found or detected quickly with the simplest algorithm for outlier detection. But there are a couple of different algorithmic dimensions to consider.

There are great algorithms for analysing one-time series. Some are great for multidimensional analysis. The Z-score fits a distribution to data and any points that don’t match the distribution are outliers. Even simple algorithms like these can go a long way to quickly notify an operator that something requires immediate attention.

Sometimes the problem can fit into a single dimension, but often the user might want to identify brokers that are both outliers in terms of input data, as well as CPU or disk utilisation. There are algorithms that work for multidimensional and early detection.

SEE ALSO: Implement stream-aware, reactive, integration pipelines with Alpakka Kafka

The DBScan algorithm uses a density-based clustering, grouping points into clusters so that things that don’t satisfy part of the clusters can be identified. Recently there has been interest in outliers as it applies to any kind of data.

People have been extending these algorithms by bringing more interesting and sophisticated models like decision trees to be applied. Secondly, with deep learning, there are autoencoders which can recreate the data based on original data and identify which points don’t match.

It’s key to consider use cases and apply logic to make DevOps’ lives easier. It starts with considering the challenges in the environment, understanding the simple ways to chip away at these business problems, and applying a smart solution to take away the rotework from users via machine learning/AI. Delivery at scale relies on efficiency which can only come, in the big data, cloud-enabled world, from automation that augments the human problem-solving machine.


Shivnath Babu

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Inline Feedbacks
View all comments