March of progress
DataTorrent prepares for the Big Data interconnected eclipse
In Hadoopy news, today DataTorrent announced the general availability of its flagship product, DataTorrent Real Time Streaming (RTS) version 1.0. Built on top of Hadoop 2.0, the software offers real-time streaming analysis with performance exceeding one billion data events per second, a market first, says Team DT. (To put into context how far ahead of the current usage curve that is, Twitter currently averages about 6,000 Tweets a second.) We caught up with co-founder and CEO Phu Hoang for the inside story on the latest offering to hit the interconnected scene, and to find out why he believes IoT predictions should be taken with a pinch of salt.
JAX: Can you give us a technical deep dive into DataTorrent RTS?
Hoang: A technical deep dive into DataTorrent is hard to fit into a few lines. The two DataTorrent co-founders, Phu Hoang and Amol Kekre, each spent twelve years at Yahoo, where they built expertise in processing big data and streaming data in a massively scalable, fault-tolerant way (Yahoo Search, Yahoo Advertising, Yahoo Finance, Hadoop).
DataTorrent RTS is a native Hadoop 2.0 application built to leverage YARN as a resource manager. DataTorrent RTS enables a Hadoop cluster to do real-time streaming event processing at billions of events per second with stateful fault tolerance. By that, we mean DataTorrent RTS can recover from node outages in seconds, without any loss of state or data and without human intervention, while the rest of the application keeps running.
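To make the idea of stateful fault tolerance concrete, here is a minimal Python sketch of checkpoint-and-replay recovery. It is an illustration only and not DataTorrent code: the `CountingOperator` class, the `run_with_failure` helper and the assumption that upstream events can be replayed from a buffer are all hypothetical.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator that counts events per key."""

    def __init__(self):
        self.counts = {}
        self.processed = 0  # how many events this state already reflects

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.processed += 1

    def checkpoint(self):
        # Snapshot the state together with the stream position it reflects.
        return copy.deepcopy({"counts": self.counts, "processed": self.processed})

    @classmethod
    def restore(cls, snapshot):
        op = cls()
        op.counts = dict(snapshot["counts"])
        op.processed = snapshot["processed"]
        return op


def run_with_failure(events, checkpoint_every, fail_at):
    """Process events, simulate one crash at index fail_at, recover and finish.

    Assumes the upstream buffer can replay events from any checkpointed position.
    """
    op = CountingOperator()
    snapshot = op.checkpoint()
    i = 0
    crashed = False
    while i < len(events):
        if i == fail_at and not crashed:
            crashed = True
            # The in-memory state is lost; rebuild it from the last snapshot
            # and replay the events that arrived after that snapshot.
            op = CountingOperator.restore(snapshot)
            i = op.processed
            continue
        op.process(events[i])
        i += 1
        if i % checkpoint_every == 0:
            snapshot = op.checkpoint()
    return op.counts
```

In this sketch the final counts are the same whether or not the simulated crash happens, because the events processed after the last checkpoint are replayed rather than lost or double-counted.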
A DataTorrent RTS application is built by connecting stream Operators together into a data flow graph. DataTorrent has developed and open sourced over 400 Operators to enable rapid and easy application development. The application is then compiled, deployed and managed by DataTorrent to run 24/7 on a Hadoop cluster. DataTorrent takes care of buffering and transmitting events, executing operator code continuously, tracking and checkpointing operator states, and outputting the resulting data.
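The operator-graph model can be sketched in a few lines of Python. This is a toy, single-process illustration under our own assumptions, not the DataTorrent API: the `Graph` class, its `add_operator`, `add_stream` and `run` methods, and the three example operators are all hypothetical names.

```python
from collections import defaultdict, deque


class Graph:
    """Toy data flow graph: operators are nodes, streams are directed edges."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)

    def add_operator(self, name, fn):
        # fn takes one event and returns a list of zero or more output events.
        self.operators[name] = fn

    def add_stream(self, src, dst):
        self.streams[src].append(dst)

    def run(self, source, events):
        outputs = defaultdict(list)
        queue = deque((source, e) for e in events)
        while queue:
            name, event = queue.popleft()
            for result in self.operators[name](event):
                downstream = self.streams.get(name, [])
                if not downstream:
                    # No outgoing stream: this operator is a sink.
                    outputs[name].append(result)
                for dst in downstream:
                    queue.append((dst, result))
        return dict(outputs)


def split_words(line):
    return line.split()


def drop_short(word):
    return [word] if len(word) > 3 else []


def to_upper(word):
    return [word.upper()]


graph = Graph()
graph.add_operator("split", split_words)
graph.add_operator("filter", drop_short)
graph.add_operator("upper", to_upper)
graph.add_stream("split", "filter")
graph.add_stream("filter", "upper")

result = graph.run("split", ["real time streaming", "on hadoop yarn"])
```

Here `result["upper"]` holds the words longer than three characters, upper-cased. A real streaming platform would additionally distribute the operators across nodes, run them continuously and checkpoint their state, none of which this sketch attempts.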
How did you achieve one billion events per second?
Massive performance is achieved by running the application all in memory on a Hadoop cluster, distributing the computation and memory usage across many nodes. Here’s the blog where we took it to 1.6B events/sec on 37 nodes: https://www.datatorrent.com/scaling-up-event-ingestion/
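For a sense of scale, the figures quoted in this article can be turned into back-of-the-envelope numbers. The sketch below assumes, purely for illustration, that the load is spread evenly across nodes; the function names are our own.

```python
def per_node_rate(total_events_per_sec, nodes):
    """Average events per second per node, assuming an even spread of load."""
    return total_events_per_sec / nodes


def ratio(rate_a, rate_b):
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


# 1.6 billion events/sec on 37 nodes, as quoted from the DataTorrent blog.
node_rate = per_node_rate(1.6e9, 37)

# One billion events/sec versus roughly 6,000 Tweets/sec.
versus_twitter = ratio(1e9, 6000)

print(f"{node_rate / 1e6:.1f} million events/sec per node")
print(f"about {versus_twitter:,.0f} times Twitter's average Tweet rate")
```

That works out to roughly 43 million events per second per node, and a headline rate around 167,000 times the Tweet rate cited above.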
What were the biggest challenges in the development process?
The biggest challenge was meeting the high bar that we set for ourselves: enterprise-class performance, scalability and fault tolerance. We have built a world-class engineering team with deep experience in big data and in scalable, fault-tolerant streaming architectures (from our global streaming quotes platform in Yahoo Finance), and we have exceeded even our own lofty expectations.
Who would you say are your biggest competitors?
We are in the same space as open source technologies like Storm or Spark Streaming, and commercial solutions like IBM InfoSphere Streams. For architectural reasons, DataTorrent is orders of magnitude better in performance and scalability, and none of these solutions offers the enterprise-class fault tolerance that DataTorrent does.
There are huge numbers flying around, but do you think it’s feasible at this stage to predict the size of the IoT? According to a recent ZeroTurnaround study, only 5% of companies are making it a priority within the next two years.
We do not. The current Big Data trend is already massive, but its data is mostly human generated. The next wave of machine-generated data, from all the sensors, appliances and devices sending data back for processing, will dwarf it. Sensors can sense and report back thousands of events per second. It is unclear how fast the industry will embrace IoT, but we believe it is coming.
What's your focus now that DataTorrent RTS 1.0 is GA?
Our focus is on providing enterprise-class solutions to enterprises’ most demanding real-time big data applications. We are engaging deeply with many customers with massive real-time analytics projects, and are building tools and solutions to allow them to operate these mission critical applications at scale.
The DataTorrent platform works seamlessly on premises and in the cloud, and is certified on Cloudera, Hortonworks, MapR, Amazon AWS and Google Cloud. We will continue to invest in our platform and our operators to make sure enterprises can easily build and deploy applications and integrate them with their existing data infrastructure.