March of progress

DataTorrent prepares for the Big Data interconnected eclipse

Lucy Carey

As flagship product DataTorrent Real Time Streaming (RTS) goes live, we get the full story on the native Hadoop 2.0 application built to leverage YARN.


In Hadoopy news, DataTorrent today announced the general
availability of its flagship product, DataTorrent Real Time
Streaming (RTS) version 1.0. Built on top of Hadoop 2.0, the
software offers real-time streaming analysis with performance
exceeding one billion data events per second – a market first, says
Team DT. (To put into context how far ahead of the current usage
curve that is, Twitter currently averages about 6,000 Tweets a
second.) We caught up with co-founder and CEO Phu Hoang for the
inside story on the latest offering to hit the interconnected
scene, and to find out why he believes IoT predictions should be
taken with a pinch of salt.

JAX: Can you give us a technical deep dive
into DataTorrent RTS?

Hoang: A technical deep dive
into DataTorrent is hard to fit into a few lines. The two
DataTorrent co-founders, Phu Hoang and Amol Kekre, each spent
twelve years at Yahoo, where they built expertise in processing
big data and streaming data in a massively scalable, fault-tolerant way
(Yahoo Search, Yahoo Advertising, Yahoo Finance,

DataTorrent RTS is a native Hadoop
2.0 application built to leverage YARN as a resource manager.
DataTorrent RTS enables a Hadoop cluster to do real-time streaming
event processing at billions of events per second with stateful
fault tolerance. By that, we mean DataTorrent RTS can recover from
node outages in seconds without any loss of state or data and
without human intervention, while the rest of the application is
still running.
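
To make "stateful fault tolerance" concrete, here is a minimal sketch of the general checkpoint-and-replay idea. It is not DataTorrent code: the class names, the `interval` parameter, and the in-memory replay buffer are all assumptions made for illustration, and a real system would persist checkpoints somewhere that survives the node.

```python
import copy


class CountingOperator:
    """Toy stateful operator: counts how many times each key has been seen."""

    def __init__(self):
        self.counts = {}

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1


class CheckpointingRunner:
    """Hypothetical runner (not a DataTorrent API).

    Snapshots operator state every `interval` events and buffers the events
    seen since the last snapshot so they can be replayed after a failure.
    """

    def __init__(self, operator, interval):
        self.operator = operator
        self.interval = interval
        self.checkpoint = copy.deepcopy(operator.counts)
        self.replay_buffer = []

    def feed(self, event):
        self.operator.process(event)
        self.replay_buffer.append(event)
        if len(self.replay_buffer) >= self.interval:
            # State is now safely captured; buffered events are no longer needed.
            self.checkpoint = copy.deepcopy(self.operator.counts)
            self.replay_buffer.clear()

    def recover(self):
        """Rebuild the operator from the last checkpoint, then replay the buffer."""
        fresh = CountingOperator()
        fresh.counts = copy.deepcopy(self.checkpoint)
        for event in self.replay_buffer:
            fresh.process(event)
        self.operator = fresh
        return fresh


def demo():
    runner = CheckpointingRunner(CountingOperator(), interval=3)
    for event in ["a", "b", "a", "c", "a"]:
        runner.feed(event)
    before = dict(runner.operator.counts)
    runner.operator = None  # simulate the node dying
    after = runner.recover().counts
    return before, after
```

Calling `demo()` returns the operator's counts before the simulated outage and after recovery; the two match because the last checkpoint plus the replayed events reproduce the lost state.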

A DataTorrent RTS application is built by
connecting stream Operators together into a data flow graph.
DataTorrent has developed and open sourced over 400 Operators to
enable rapid and easy application development. The application is
then compiled, deployed, and managed by DataTorrent to run 24/7 on
a Hadoop cluster. DataTorrent takes care of buffering
and transmitting events, executing operator code continuously,
tracking and checkpointing operator states, and outputting
resultant data.
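
As a rough illustration of what "connecting operators into a data flow graph" means, the sketch below wires three toy operators together in a single process. Everything in it is hypothetical: `Operator`, `DataFlowGraph`, `add_operator`, `add_stream`, and `emit` are invented names for this example and say nothing about the actual DataTorrent API, which runs distributed on a Hadoop cluster.

```python
from collections import defaultdict


class Operator:
    """Hypothetical operator: turns one input event into zero or more events."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn


class DataFlowGraph:
    """Toy single-process data flow graph (illustration only)."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names
        self.outputs = defaultdict(list)  # events emitted by terminal operators

    def add_operator(self, name, fn):
        self.operators[name] = Operator(name, fn)
        return self

    def add_stream(self, source, sink):
        self.streams[source].append(sink)
        return self

    def emit(self, name, event):
        """Run `event` through operator `name` and push results downstream."""
        results = self.operators[name].fn(event)
        downstream = self.streams[name]
        for result in results:
            if not downstream:
                # No outgoing stream: this operator is a sink, so collect output.
                self.outputs[name].append(result)
            for target in downstream:
                self.emit(target, result)


def build_demo_app():
    graph = DataFlowGraph()
    graph.add_operator("split", lambda line: line.split())
    graph.add_operator("keep_long", lambda w: [w] if len(w) > 3 else [])
    graph.add_operator("upper", lambda w: [w.upper()])
    graph.add_stream("split", "keep_long")
    graph.add_stream("keep_long", "upper")
    return graph
```

Feeding a line in with `build_demo_app().emit("split", "real time streaming on hadoop")` splits it into words, drops the short ones, and upper-cases the rest. The application author only writes the operator functions and the wiring; moving events between operators is the graph's job.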

How did you achieve one billion events per second?

Massive performance is achieved by running the
application all in memory on a Hadoop cluster, distributing the
computation and memory usage across many nodes. Here’s the
blog where we took it to 1.6B events/sec on 37 nodes:

What were the biggest challenges in the
development process?

The biggest challenge was meeting
the high bar that we set for ourselves: enterprise-class
performance, scalability, and fault tolerance.
We have built a world-class engineering team with
deep experience in big data and in scalable, fault-tolerant
streaming architectures (from our global streaming quotes platform
in Yahoo Finance) and have exceeded even our lofty expectations.

Who would you say are your biggest competitors?

We are in the same space as open source
technologies like Storm or Spark Streaming, and commercial
solutions like IBM InfoSphere Streams. For architectural reasons,
DataTorrent is orders of magnitude better in performance and
scalability, and none of these solutions offers the enterprise-class
fault tolerance that DataTorrent does.

There are huge numbers flying around, but do
you think it’s feasible at this stage to predict the size of the
IoT? According to a recent ZeroTurnaround study, only 5% of
companies are making it a priority within the next two

We do not. The current Big Data trend is already massive, but it
is mostly human generated. The next wave of machine-generated
data, from all the sensors, appliances, and devices sending data
back for processing, will dwarf it. Sensors can sense and report
back thousands of events per second. It is unclear how fast the
industry will embrace IoT, but we believe it is coming.

What’s your focus now DataTorrent RTS 1.0 is

Our focus is on providing enterprise-class
solutions to enterprises’ most demanding real-time big data
applications. We are engaging deeply with many customers with
massive real-time analytics projects, and are building tools and
solutions to allow them to operate these mission-critical
applications at scale.

The DataTorrent platform works seamlessly on
premises and in the cloud, and is certified on Cloudera,
Hortonworks, MapR, Amazon AWS and Google Cloud. We will continue to
invest in our platform and our operators to make sure enterprises
can easily build, deploy, and integrate with their existing data
