Spark vs Hadoop: Who wins?
After a survey on Hadoop revealed that 70 percent of respondents were most interested in Spark, a debate about the potential demise of Hadoop began to take shape.
Many IT professionals rushed to answer a question posed on Quora: Will Spark overtake Hadoop? Will Hadoop be replaced by Spark? Some pointed out the differences between the two, others insisted that they cannot be compared, but most agreed that both Spark and Hadoop have evolved greatly.
Matei Zaharia, the CTO of Databricks, the company created by the developers of Apache Spark, opined that the answer to this question depends on the way people look at Hadoop. “Some people take Hadoop to mean a whole ecosystem (HDFS, Hive, MapReduce, etc.), in which case Spark is designed to fit well within the ecosystem (reading from any input source that MapReduce supports through the InputFormat interface, being compatible with Hive and YARN, etc.). Others refer to Hadoop MapReduce in particular, in which case I think it’s very likely that non-MapReduce engines will take over in a lot of domains, and in many cases they already have,” he said.
Zaharia said in a Reddit AMA last year that Spark’s main benefits over the current layers are “1) unified programming model (you don’t need to stitch together many different APIs and languages, you can call everything as functions in one language) and 2) performance optimizations that you can get from seeing the code for a complete pipeline and optimizing across the different processing functions.”
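To make the two benefits Zaharia lists more concrete, here is a minimal sketch in plain Python. It is not Spark's API: the `Pipeline` class and its `map`, `filter` and `collect` methods are hypothetical names chosen for illustration. It only shows the general idea that, when every step is a function call in one language, the engine can see the complete pipeline before running it and, in this toy case, apply all steps in a single pass over the data.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class Pipeline:
    """Hypothetical lazy pipeline (illustration only, not Spark's API)."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Only record the step; nothing is computed yet.
        return Pipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "Pipeline":
        # Only record the step; nothing is computed yet.
        return Pipeline(self._source, self._steps + (("filter", pred),))

    def _run(self) -> Iterator[Any]:
        # The whole chain is known here, so every recorded step is applied
        # to each item in one pass, with no intermediate collections.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self) -> List[Any]:
        # Execution is triggered only when results are requested.
        return list(self._run())


result = (
    Pipeline(range(10))
    .map(lambda x: x * x)          # square each number
    .filter(lambda x: x % 2 == 0)  # keep the even squares
    .map(lambda x: x + 1)          # add one to each survivor
    .collect()
)
print(result)
```

A real engine can do far more with that whole-pipeline view than this sketch does; the point is only that the steps are ordinary function calls in one language rather than separate tools stitched together.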
Core differences between Spark and Hadoop
Hadoop and Spark may both be big data frameworks, but they do not necessarily serve the same purposes. Hadoop distributes huge data collections across the nodes of a cluster of commodity servers and keeps track of that data, while Spark operates on those distributed data collections but does not provide distributed storage of its own.
Many people believe that Spark and Hadoop are better together, but each can also be used without the other. Hadoop does not need Spark because it includes not only a storage component, but also a processing component (MapReduce); conversely, Spark can be used without Hadoop, but it can also be complementary to it.
One of the reasons why MapReduce may seem unattractive is that it is slow and comes with a high level of complexity. Spark’s speed may earn it extra points, but that advantage matters little if your data operations and reporting requirements are mainly static and you don’t mind waiting for batch-mode processing.
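As a rough illustration of the complexity mentioned above, the sketch below walks through the three stages of the MapReduce model (map, shuffle, reduce) for a simple word count. It is an in-memory toy in plain Python, not Hadoop's Java API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are assumptions made for this example. In a real cluster each stage would run across many nodes and exchange data through disk and the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all values that share the same key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Reduce: collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["spark and hadoop", "hadoop and yarn"])
print(counts)
```

Even this small job has to be expressed as separate map and reduce functions, which hints at why longer multi-stage jobs can become cumbersome to write in this style.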
Battle for dominance? Not necessarily!
Data expert and best-selling author Bernard Marr explained in a Forbes article that many vendors offer both Spark and Hadoop and advise companies on which of the two will suit them best. Marr pointed out that even though Spark is developing very quickly, it is still in its infancy, so its security and support infrastructure is not as advanced as Hadoop’s.
The data expert’s viewpoint is, to some extent, shared by Cloudera’s Sean Owen, who explained in the Quora discussion that Spark and its tools are “very good, given how much is offered from just this one project,” and that “the good news is that there is no either/or choice, not anymore.”
Although current trends favor Spark, the right choice depends on the use case, so neither framework should be chosen without first considering the workload it has to serve.