Why it pays to plot new courses for MapReduce
Bill Bain, CEO of in-memory data grid provider ScaleOut Software, expands on the beauty of decoupling from batch, why we should all follow the millisecond rule, and more.
Bill Bain, CEO of ScaleOut Software, has an axe to grind. For too long, he believes, people have associated MapReduce exclusively with Hadoop MapReduce and batch implementation, imposing pointless limits on the technology. Read on, and let Bill open your mind to the myriad ways in which MapReduce could be making your life easier, the technologies that are fostering the evolution of Hadoop, and where to find the sweet spot for data analytics.
JAX: As an in-memory data grid vendor, what distinguishes you in the market?
Bain: ScaleOut Software’s in-memory data grid incorporates an in-memory computing platform that has been carefully designed to support operational intelligence. This platform keeps scheduling times to a few milliseconds and takes great care to minimize data motion at all steps within a computation. These characteristics enable it to deliver results within milliseconds to seconds (depending on the application).
Moreover, ScaleOut Software has extended its execution platform to run pure Hadoop MapReduce applications on highly available, memory-based data. This means that developers can fully leverage their skill sets in MapReduce and avoid the use of vendor-specific APIs. Because they can analyze highly available data hosted within an in-memory data grid, they can employ MapReduce in live, operational systems that require instant access to individual data items. This cannot be accomplished with other low-latency execution platforms, such as Spark, whose data storage models do not have these characteristics.
Who is your biggest customer base, and can you see this changing?
Financial services and e-commerce are our primary vertical markets. Over time, we expect operational intelligence to be applied in many additional vertical markets, such as energy management, power systems, traffic handling, and telecommunications.
What irks you about people’s perceptions of MapReduce?
Since its first release, Hadoop MapReduce has focused on “business intelligence,” that is, analyzing very large, static data sets to extract patterns and trends. In fact, its primary value lies in unlocking the potential to analyze very large, static data sets in minutes or hours (instead of hours or days), opening up countless new opportunities for employing data analytics.
MapReduce’s very success in business intelligence has created the impression that it is not suitable for other applications, in particular, “operational intelligence,” which analyzes fast-changing data in live environments to provide immediate feedback. This perception is rooted in the fact that the mainstream, open source implementation of MapReduce relies on batch scheduling, which introduces delays unsuitable for operational intelligence.
However, MapReduce’s design pattern of data-parallel computation is well suited to operational intelligence because it enables large volumes of data to be analyzed quickly and easily. When run on a real-time, in-memory execution platform, MapReduce applications can analyze live data and provide results in less than a second.
Can you give us any examples of how people could better utilize the technology? What would be an ideal best-use-case scenario?
Using in-memory computing, MapReduce can provide operational intelligence in numerous scenarios, such as real-time portfolio tracking, e-commerce recommendation engines, logistics, credit card fraud detection, wire transfer verification, and many more. For example, in financial services, MapReduce’s data-parallel execution can enable a hedge fund to update a large set of stock portfolios in real-time based on a market feed and continuously analyze them for needed trades to maintain balanced positions with regard to business rules customized for each portfolio.
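To make the portfolio scenario concrete, here is a minimal Python sketch of the per-portfolio analysis step. It is not ScaleOut's or Hadoop's API: the `Portfolio` class, the `needed_trades` and `analyze` functions, the ticker symbols, and the "target weight plus tolerance" rule are all hypothetical stand-ins for the customized business rules mentioned above. The point it illustrates is that each portfolio can be evaluated independently against the latest prices, which is what makes the work data-parallel.

```python
from dataclasses import dataclass


@dataclass
class Portfolio:
    name: str
    holdings: dict    # symbol -> number of shares held
    targets: dict     # symbol -> target weight (fractions summing to 1)
    tolerance: float  # allowed drift from target weight before trading


def needed_trades(portfolio, prices):
    """Return symbol -> shares to buy (positive) or sell (negative)."""
    values = {s: n * prices[s] for s, n in portfolio.holdings.items()}
    total = sum(values.values())
    if total == 0:
        return {}
    trades = {}
    for symbol, target in portfolio.targets.items():
        drift = values.get(symbol, 0.0) / total - target
        if abs(drift) > portfolio.tolerance:
            # Trade just enough shares to bring the position back to target.
            trades[symbol] = round(-drift * total / prices[symbol])
    return trades


def analyze(portfolios, prices):
    """Evaluate every portfolio independently against the current prices."""
    return {p.name: needed_trades(p, prices) for p in portfolios}


# Hypothetical data: one balanced portfolio, then a price update from the feed.
fund = Portfolio(
    name="fund-a",
    holdings={"AAA": 100, "BBB": 50},
    targets={"AAA": 0.5, "BBB": 0.5},
    tolerance=0.05,
)
before = analyze([fund], {"AAA": 10.0, "BBB": 20.0})
after = analyze([fund], {"AAA": 30.0, "BBB": 20.0})
```

With the first set of prices the portfolio is within tolerance and no trades are proposed; after the simulated price jump in AAA, the sketch proposes selling AAA and buying BBB. On a data grid, the loop in `analyze` would be the part distributed across servers, with each server handling the portfolios it stores.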
How do you decouple MapReduce from batch?
Because MapReduce is a data-parallel design pattern, it can be hosted on any execution platform that provides mechanisms for data-parallel computing. For example, it can be run on an in-memory data grid spanning a cluster of computers, and it can use in-memory data storage to feed the mappers and hold results generated by the reducers. It is important that the execution platform’s scheduling and data transfer mechanisms be designed to minimize latency and network congestion. In-memory data grids with integrated data-parallel computing provide a highly efficient execution platform by eliminating batch scheduling delays and data motion to and from disk, both of which are inherent in batch-scheduled implementations.
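The idea that MapReduce is a design pattern rather than a batch system can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how ScaleOut's grid or Hadoop is implemented: the `map_reduce` function, the list-of-lists `partitions` (standing in for data held in memory on different grid nodes), and the sample trades are all hypothetical, and a thread pool stands in for a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(partitions, mapper, reducer, workers=4):
    """Map over in-memory partitions in parallel, then reduce by key."""

    def map_partition(partition):
        # Each partition is mapped in place; nothing is read from disk.
        local = defaultdict(list)
        for item in partition:
            for key, value in mapper(item):
                local[key].append(value)
        return local

    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_partition, partitions))

    # Shuffle: merge the per-partition groups by key.
    grouped = defaultdict(list)
    for local in mapped:
        for key, values in local.items():
            grouped[key].extend(values)

    # Reduce: the results are held in an in-memory dict.
    return {key: reducer(key, values) for key, values in grouped.items()}


def trade_mapper(trade):
    symbol, quantity = trade
    yield symbol, quantity


def sum_reducer(symbol, quantities):
    return sum(quantities)


# Hypothetical data: trades spread across three in-memory partitions.
partitions = [
    [("ACME", 100), ("INIT", 50)],
    [("ACME", -30), ("UMBR", 20)],
    [("INIT", 25)],
]
net_positions = map_reduce(partitions, trade_mapper, sum_reducer)
```

Here the mappers are fed from memory and the reducer output stays in memory, so there is no job queue and no disk round trip between stages. A real grid would additionally have to handle scheduling across servers, data placement, and high availability, which this sketch leaves out.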
How does running MapReduce directly in memory result in actionable intelligence on trends as they occur?
Because in-memory MapReduce can produce results in milliseconds to a few seconds instead of minutes to hours, these analytic insights are immediately available to steer the behavior of a live system. This capability is the essence of “operational intelligence.” We saw this benefit in the financial services example above.
Humans respond to external stimuli in hundreds of milliseconds to a few seconds, so this time frame is a particular sweet spot for data analytics, enabling operational intelligence to influence behavior while humans are focused on a task, whether it be making a shopping decision, placing a credit card transaction (that might be fraudulent), or deciding which direction to turn in a congested neighborhood.
Back to ScaleOut: Who are your biggest rivals on the market?
The Hadoop community tends to view Spark and Spark Streaming as the enabling technology for real-time analytics, and this technology indeed reduces execution time by hosting intermediate results in memory. (It also offers new data-parallel operators which take the next steps beyond MapReduce.) However, Spark’s execution platform does not support operational systems because it cannot store collections of highly available, individually accessed objects. Also, MapReduce applications must be modified to use Spark’s new operators.
Most in-memory data grid vendors either do not support data-parallel computation for operational intelligence or solely use vendor-specific APIs to provide this capability. ScaleOut Software lets Hadoop MapReduce developers create applications for operational intelligence by accessing the full power of data-parallel computing on highly available, in-memory data.
What do you say to people who think Hadoop is outdated?
Hadoop’s execution model and platform will continue to evolve as new technologies such as YARN, Spark, and Tez emerge. However, this does not imply that the MapReduce design pattern is outdated. This evolution recognizes that MapReduce provides an excellent design pattern for some but not all data-parallel applications. (It also recognizes the benefits of refactoring the underlying execution platform.) What remains important, as it has for over 35 years, is that data-parallel computation provides a powerful yet surprisingly simple approach to analyzing large data sets as fast as possible. Understanding the tradeoffs in power and simplicity of this model vis-à-vis other models, such as task-parallel computation, is key to placing Hadoop’s evolution in the proper perspective.