A taste of what’s to come: Apache Spark 2.0 technical preview
It’s been two years since the release of Apache Spark 1.0, and Databricks is now busy preparing Spark 2.0. To offer a taste of what’s coming, the team has announced a technical preview of Spark 2.0. Although the preview is not ready for production use, its goal is to gather community feedback in time for the general availability release.
The Apache Spark 2.0 technical preview is based on the upstream 2.0.0-preview release and focuses on feature improvements based on community feedback. The team emphasized that even though the final Spark 2.0 release “is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase.”
Spark 2.0 continues the tradition of simple, expressive and intuitive APIs and focuses on two areas: standard SQL support and a unified DataFrame/Dataset API. On the first front, the team claims to have significantly expanded Spark’s SQL capabilities with a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many SQL:2003 features.
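To make the subquery support concrete, here is the kind of SQL:2003-era construct the new ANSI parser can now plan. The snippet uses Python’s built-in sqlite3 purely to keep the example self-contained and runnable; in Spark 2.0 the same query text would be submitted via Spark SQL against registered tables, and the table and column names here are invented for illustration.

```python
import sqlite3

# Toy data: one row per order. SQLite stands in only to make the
# example self-contained; it is not Spark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30.0), ('alice', 70.0), ('bob', 20.0), ('carol', 90.0);
""")

# A scalar subquery in the WHERE clause -- one of the constructs
# that Spark 2.0's new ANSI SQL parser supports.
rows = conn.execute("""
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY customer
""").fetchall()

print(rows)  # orders whose amount exceeds the overall average
```

The subquery computes a single scalar (the average amount) that the outer query filters against, which requires the planner to correlate two query scopes rather than translate a single flat `SELECT`.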
On the programming API side, the team has streamlined several areas: DataFrames and Datasets are unified in Scala/Java; SparkSession arrives as a new entry point; the Accumulator API becomes simpler and more performant; the DataFrame-based machine learning API emerges as the primary ML API; machine learning pipelines can now be persisted; and R gains distributed algorithms.
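The "simpler, more performant" accumulator design boils down to a small contract: tasks accumulate partial results locally, and the driver merges them. The sketch below is not Spark’s actual accumulator class — the class name and the manual partition loop are stand-ins — it only illustrates the add/merge/value semantics in plain Python.

```python
# Toy illustration of the add/merge/value contract behind Spark 2.0's
# simplified accumulator API. NOT real Spark code: SumAccumulator and
# the explicit partition loop are invented to show the semantics.

class SumAccumulator:
    def __init__(self):
        self._value = 0

    def add(self, v):
        # Called per record inside a task, on that task's local copy.
        self._value += v

    def merge(self, other):
        # Called on the driver to combine per-task partial results.
        self._value += other._value

    @property
    def value(self):
        return self._value

# Simulate two tasks, each accumulating over its own partition.
partitions = [[1, 2, 3], [4, 5]]
partials = []
for part in partitions:
    acc = SumAccumulator()
    for record in part:
        acc.add(record)
    partials.append(acc)

# The driver merges the partials into the final result.
total = SumAccumulator()
for p in partials:
    total.merge(p)

print(total.value)  # 15
```

Splitting the API into a local `add` and a driver-side `merge` is what lets the same accumulator type work for sums, counters, or any other associative aggregation.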
SEE ALSO: Spark vs Hadoop – Who wins?
Performance optimization has always been a focal point of Spark development, so the obvious next step was to rethink how Spark’s physical execution layer is built. “Optimizing performance by reducing the amount of CPU cycles wasted in these useless work has been a long time focus of modern compilers,” the team wrote in the announcement.
Spark users have long appreciated its ease of use and performance, so it makes sense for Spark 2.0 to double down on both while extending support to an even wider range of workloads. Still, the team warned impatient users against migrating any production workload onto this preview package until the upstream Apache Spark 2.0 release is finalized.
The technical preview is now available on Databricks.