Databricks delivers Spark 2.0, focuses on speed, simplicity and more
Spark 2.0 can be summed up in three words: “easier, faster, and smarter.” This new release focuses on standard SQL support and a unified DataFrame/Dataset API, but that’s not all. Plus, we revisit our interview with Matei Zaharia, CTO and co-founder of Databricks and creator of Apache Spark.
The Spark Survey 2015 revealed that users initially chose Spark for its ease of use and performance. Spark 2.0 takes this one step further, focusing on three themes: “easier, faster, and smarter.” It also puts an emphasis on standard SQL support and a unified DataFrame/Dataset API. On the SQL side, Spark’s SQL support has been expanded thanks to a new ANSI SQL parser and support for subqueries. Since SQL has been one of the main interfaces to Spark, these expanded capabilities massively reduce the effort of porting legacy applications. On the programmatic side, the team behind Apache Spark has streamlined Spark’s APIs: DataFrames and Datasets are unified in Scala/Java, SparkSession becomes the single entry point, the Accumulator API is simpler and more performant, the DataFrame-based machine learning API emerges as the primary ML API, and more. Find all the details here.
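To make these API changes concrete, here is a minimal sketch of what they look like in Scala. It assumes a local Spark 2.0 installation; the application name and sample data are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the new single entry point, replacing SQLContext/HiveContext.
val spark = SparkSession.builder()
  .appName("Spark2Example")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// In Scala, DataFrame is now simply an alias for Dataset[Row],
// so the two APIs are unified.
case class Person(name: String, age: Long)
val people = Seq(Person("Ann", 34), Person("Bob", 29)).toDS()
people.createOrReplaceTempView("people")

// The new ANSI SQL parser supports subqueries, e.g. in a WHERE clause.
val aboveAverage = spark.sql(
  "SELECT name FROM people WHERE age > (SELECT AVG(age) FROM people)")
aboveAverage.show()
```

The same `spark.sql(...)` call works against existing SQL workloads, which is what makes porting legacy applications cheaper.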
Spark 2.0: 10X faster
Last year’s survey showed that 91 percent of Spark users consider performance the most important aspect of Apache Spark. The team wanted to make it even faster, so they rethought the way they build Spark’s physical execution layer. As Databricks explained: “When you look into a modern data engine (e.g. Spark or other MPP databases), the majority of the CPU cycles are spent on useless work, such as making virtual function calls or reading/writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted on this useless work has been a long-time focus of modern compilers.” Spark 2.0 ships with the second-generation Tungsten engine.
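The new engine’s whole-stage code generation can be observed directly from the query plan. A minimal sketch, assuming a local Spark 2.0 session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TungstenDemo")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// A simple scan-and-aggregate pipeline; in Spark 2.0 such operator chains
// are compiled into a single fused function instead of a chain of
// virtual function calls.
val q = spark.range(1000L * 1000).selectExpr("sum(id)")

// The physical plan marks fused operators with a WholeStageCodegen node
// (shown with '*' prefixes in the printed plan).
q.explain()
```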
Another addition to Spark 2.0 is Structured Streaming, a new API which allows applications to make decisions in real time. The three improvements are: an API integrated with batch jobs, transactional interaction with storage systems, and rich integration with the rest of Spark. Spark 2.0 ships with an initial, alpha version of Structured Streaming as an extension to the DataFrame/Dataset API.
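The “integrated API with batch jobs” point means a streaming query is written with the same DataFrame operations as a batch one. A minimal word-count sketch against the built-in socket source (the host and port here are placeholders); note this uses the alpha-stage API as shipped in 2.0:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingDemo")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// readStream declares an unbounded, streaming DataFrame.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")  // placeholder host
  .option("port", 9999)         // placeholder port
  .load()

// Exactly the same transformations a batch job would use.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// "complete" output mode re-emits the full updated result table each trigger.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```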
Matei Zaharia, CTO and co-founder of Databricks and creator of Apache Spark, said in a press release that what’s exciting for him “as a developer of Apache Spark is seeing how quickly users start to use new features and APIs we introduce, and in turn, offer almost instantaneous feedback, so that we can continue to improve them.”
Spark 2.0 is highly compatible with Spark 1.6, so migrating code should not be a problem.
“We designed Spark as a next-generation, more flexible MapReduce”
Matei Zaharia told JAX Magazine that Spark was designed “as a next-generation, more flexible MapReduce.”
Although Spark started out more optimized for some new use cases, such as interactive queries and advanced analytics, we have put a lot of energy into optimizing it even for the batch processing that MapReduce excelled at. Multiple large organizations migrated their batch workloads from MapReduce to Spark, and Spark is also generally outperforming MapReduce on batch benchmarks such as the sort competition.
Matei also revealed that the ultimate goal is to make working with large datasets in Spark “as easy as possible.” Click here to read the entire interview.