Conscious uncoupling

Why it pays to plot new courses for MapReduce

Lucy Carey

Bill Bain, CEO of in-memory data grid provider ScaleOut Software, expands on the beauty of decoupling from batch, why we should all follow the millisecond rule, and more.

Bill Bain, CEO of ScaleOut
Software, has an axe to grind. For too long, he believes, people
have associated MapReduce exclusively with Hadoop MapReduce and
its batch implementation, imposing pointless limits on
the technology. Read on, and let Bill open your mind to the myriad
ways in which MapReduce could be making your life easier, the
technologies that are fostering the evolution of Hadoop, and where
to find the sweet spot for data analytics.

JAX: As an in-memory data grid
vendor, what distinguishes you in the market?

Bain: ScaleOut Software’s
in-memory data grid incorporates an in-memory computing platform
that has been carefully designed to support operational
intelligence. This platform keeps scheduling times to a few
milliseconds and takes great care to minimize data motion at all
steps within a computation. These characteristics enable it to
deliver results within milliseconds to seconds (depending on the
application).

Moreover, ScaleOut Software has extended its
execution platform to run pure Hadoop MapReduce applications on
highly available, memory-based data. This means that developers can
fully leverage their skill sets in MapReduce and avoid the use of
vendor-specific APIs. Because they can analyze highly available
data hosted within an in-memory data grid, they can employ
MapReduce in live, operational systems which require instant access
to individual data items. This cannot be accomplished with other
low latency execution platforms, such as Spark, whose data storage
models do not have these characteristics.
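
To make the "pure Hadoop MapReduce" point concrete, here is a minimal, word-count-style mapper and reducer written only against the standard org.apache.hadoop.mapreduce API, with no vendor-specific classes. This is an illustrative sketch, not code taken from ScaleOut's documentation.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A standard Hadoop mapper: emits (word, 1) for every token in a line of input.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// A standard Hadoop reducer: sums the counts emitted for each word.
class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Nothing in these classes refers to where the input lives; per the interview, that neutrality is what lets the same MapReduce code be fed from memory-based data instead of disk.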

Who is your biggest customer base, and can you
see this changing?

Financial services and e-commerce comprise our
primary vertical markets. Over time, we see the application of
operational intelligence in many additional vertical markets, such
as energy management, power systems, traffic handling,
telecommunications, etc.

What irks you about people’s perceptions of
MapReduce?

Since its first release, Hadoop MapReduce has
focused on “business intelligence,” that is, analyzing very large,
static data sets to extract patterns and trends. In fact, its
primary value lies in unlocking the potential to analyze very
large, static data sets in minutes or hours (instead of hours or
days), opening up countless new opportunities for employing data
analytics.

MapReduce’s very success in business
intelligence has created the impression that it is not suitable for
other applications, in particular, “operational intelligence,”
which analyzes fast-changing data in live environments to provide
immediate feedback. This perception is rooted in the fact that the
mainstream, open source implementation of MapReduce is implemented
using batch scheduling, which introduces delays unsuitable for
operational intelligence.

However, MapReduce’s design pattern of
data-parallel computation is well suited to operational
intelligence because it enables large volumes of data to be
analyzed quickly and easily. When run on a real-time, in-memory
execution platform, MapReduce applications can analyze live data
and provide results in less than a second.

Can you give us any examples of how people
could better utilize the technology? What would be an ideal
best-use-case scenario?

Using in-memory computing, MapReduce can provide
operational intelligence in numerous scenarios, such as real-time
portfolio tracking, e-commerce recommendation engines, logistics,
credit card fraud detection, wire transfer verification, and many
more. For example, in financial services, MapReduce’s data-parallel
execution can enable a hedge fund to update a large set of stock
portfolios in real time based on a market feed and continuously
analyze them for needed trades to maintain balanced positions with
regard to business rules customized for each portfolio.
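
To sketch how that scenario fits the pattern (the domain types and the rebalancing rule below are hypothetical, invented purely for illustration): each "map" step re-evaluates one in-memory portfolio against the current market feed, and the "reduce" step aggregates the resulting trades across all portfolios.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical domain types, for illustration only.
record Quote(String symbol, double price) {}
record Position(String symbol, int shares) {}
record Portfolio(String id, List<Position> positions) {}
record Trade(String portfolioId, String symbol, int shares) {}

public class PortfolioRebalance {

    // "Map" phase: re-price one portfolio against the live market feed and
    // emit any trades required by its (placeholder) balancing rule.
    static List<Trade> analyze(Portfolio p, Map<String, Quote> marketFeed) {
        return p.positions().stream()
                .filter(pos -> marketFeed.containsKey(pos.symbol()))
                // Placeholder rule: trim any position whose quoted price exceeds a threshold.
                .filter(pos -> marketFeed.get(pos.symbol()).price() > 100.0)
                .map(pos -> new Trade(p.id(), pos.symbol(), -pos.shares() / 10))
                .collect(Collectors.toList());
    }

    // "Reduce" phase: group the emitted trades by symbol so a downstream system
    // can act on the aggregate picture immediately.
    static Map<String, List<Trade>> aggregate(List<Portfolio> portfolios,
                                              Map<String, Quote> feed) {
        return portfolios.parallelStream()                 // data-parallel over portfolios
                .flatMap(p -> analyze(p, feed).stream())
                .collect(Collectors.groupingBy(Trade::symbol));
    }
}
```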

How do you decouple MapReduce from
batch?

Because MapReduce is a data-parallel design
pattern, it can be hosted on any execution platform that provides
mechanisms for data-parallel computing. For example, it can be run
on an in-memory data grid spanning a cluster of computers, and it
can use in-memory data storage to feed the mappers and hold results
generated by the reducers. It is important that the execution
platform’s scheduling and data transfer mechanisms be designed to
minimize latency and network congestion. In-memory data grids with
integrated data-parallel computing provide a highly efficient
execution platform by eliminating batch scheduling delays and data
motion to and from disk, both of which are inherent in
batch-scheduled implementations.
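
A single-process analogy of that architecture, with a ConcurrentHashMap standing in for the partitioned, replicated store a real data grid provides: the mappers are fed directly from memory, the intermediate results stay in memory, and the reduced output is written back into memory where live requests can read it, with no job scheduler or disk I/O between phases.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

// Stand-in for a partitioned in-memory store; a real in-memory data grid would
// shard this across a cluster and keep replicas for high availability.
class InMemoryStore<K, V> {
    final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
}

public class InMemoryMapReduce {

    // Map and reduce run directly over memory-resident data: mappers are fed
    // from the store, the intermediate "shuffle" never touches disk, and the
    // reduced output stays in memory for immediate access.
    static <K, V, MK, MV, R> InMemoryStore<MK, R> run(
            InMemoryStore<K, V> input,
            Function<V, Map<MK, MV>> mapper,     // map: one record -> keyed partial results
            Function<List<MV>, R> reducer) {     // reduce: all partials for a key -> result

        // Map phase, data-parallel over the in-memory records.
        ConcurrentHashMap<MK, List<MV>> shuffled = new ConcurrentHashMap<>();
        input.data.values().parallelStream()
                .map(mapper)
                .forEach(partials -> partials.forEach((k, v) ->
                        shuffled.computeIfAbsent(k, x -> new CopyOnWriteArrayList<>()).add(v)));

        // Reduce phase, also in memory and also data-parallel.
        InMemoryStore<MK, R> output = new InMemoryStore<>();
        shuffled.entrySet().parallelStream()
                .forEach(e -> output.data.put(e.getKey(), reducer.apply(e.getValue())));
        return output;
    }
}
```

The value of a real data grid over this toy version is that each node runs the map phase against its own partitions, which is what keeps data motion to a minimum.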

How does running MapReduce directly in memory
result in actionable intelligence on trends as they
occur?

Because in-memory MapReduce can produce results
in milliseconds to a few seconds instead of minutes to hours, these
analytic insights are immediately available to steer the behavior
of a live system. This capability is the essence of “operational
intelligence.” We saw an example of this benefit in the financial
services example above.

Humans respond to external stimuli in hundreds of
milliseconds to a few seconds, so this time frame is a particular
sweet spot for data analytics, enabling operational intelligence to
influence behavior while humans are focused on a task, whether it
be making a shopping decision, placing a credit card transaction
(that might be fraudulent), or deciding which direction to turn in
a congested neighborhood.

Back to ScaleOut: Who are your biggest rivals
on the market?

The Hadoop community tends to view Spark and
Spark Streaming as the enabling technology for real-time analytics,
and this technology indeed reduces execution time by hosting
intermediate results in memory. (It also offers new data-parallel
operators which take the next steps beyond MapReduce.) However,
Spark’s execution platform does not support operational systems
because it cannot store collections of highly available,
individually accessed objects. Also, MapReduce applications must be
modified to use Spark’s new operators.

Most in-memory data grid vendors either do not
support data-parallel computation for operational intelligence or
solely use vendor-specific APIs to provide this capability.
ScaleOut Software lets Hadoop MapReduce developers create
applications for operational intelligence by accessing the full
power of data-parallel computing on highly available, in-memory
data.

What do you say to people who think Hadoop is
outdated?

Hadoop’s execution model and platform will
continue to evolve as new technologies such as YARN, Spark, and Tez
emerge. However, this does not imply that the MapReduce design
pattern is outdated. This evolution recognizes that MapReduce
provides an excellent design pattern for some but not all
data-parallel applications. (It also recognizes the benefits of
refactoring the underlying execution platform.) What remains
important, as it has for over 35 years, is that data-parallel
computation provides a powerful yet surprisingly simple approach to
analyzing large data sets as fast as possible. Understanding the
tradeoffs in power and simplicity of this model vis-à-vis other
models, such as task-parallel computation, is key to placing
Hadoop’s evolution in the proper perspective.
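
A deliberately small contrast for readers less familiar with the terms: data-parallel code applies one operation across every partition of a data set and combines the results (the shape MapReduce generalizes), while task-parallel code runs unrelated operations concurrently.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ParallelismStyles {
    public static void main(String[] args) {
        List<int[]> partitions = List.of(new int[]{1, 2}, new int[]{3, 4}, new int[]{5, 6});

        // Data-parallel: the same operation (summing) applied to every
        // partition, then combined.
        int total = partitions.parallelStream()
                .mapToInt(p -> p[0] + p[1])
                .sum();

        // Task-parallel: unrelated tasks running concurrently, each with its own logic.
        CompletableFuture<Integer> sumTask = CompletableFuture.supplyAsync(() -> 1 + 2 + 3);
        CompletableFuture<String> labelTask =
                CompletableFuture.supplyAsync(() -> "report-" + java.time.LocalDate.now());

        System.out.println(total + " " + sumTask.join() + " " + labelTask.join());
    }
}
```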
