Two years after Spark 1.0's debut

A taste of what’s to come: Apache Spark 2.0 technical preview

Gabriela Motroc
Spark 2.0

Bright festive Christmas sparkler image via Shutterstock

It’s been two years since the release of Apache Spark 1.0, and Databricks is now busy preparing Spark 2.0. To offer a taste of what’s coming, the team has announced a technical preview of Spark 2.0. Although the preview is not ready for production, its goal is to gather feedback from the community ahead of the general availability release.

The Apache Spark 2.0 technical preview is based on the upstream 2.0.0-preview release and focuses on feature improvements based on community feedback. The team emphasized that even though the final Spark 2.0 release “is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase.”


Focal points

Spark 2.0 continues the tradition of simple, expressive and intuitive APIs and focuses on two areas: standard SQL support and a unified DataFrame/Dataset API. On the first front, the team claims it has significantly expanded Spark’s SQL capabilities with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many SQL:2003 features.
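
The expanded SQL support can be sketched as follows. This is an illustrative example only, assuming a Spark 2.0 preview environment with a `SparkSession` already in scope as `spark`; the `sales` view and its `region`/`amount` columns are hypothetical, not taken from the announcement.

```scala
// Hypothetical query against a registered temporary view `sales`.
// The scalar subquery in the HAVING clause is the kind of construct
// enabled by the new ANSI SQL parser and subquery support.
val topRegions = spark.sql("""
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
  HAVING SUM(amount) > (SELECT AVG(amount) FROM sales)
""")
topRegions.show()
```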

On the programming API side, the team has streamlined the following:

- DataFrames and Datasets unified in Scala/Java
- SparkSession as a new entry point
- a simpler, more performant Accumulator API
- the DataFrame-based machine learning API as the primary ML API
- machine learning pipeline persistence
- distributed algorithms in R
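
The API changes above can be sketched in Scala. This is a sketch only, assuming a Spark 2.0 preview on the classpath; the `Person` case class and the `people.json` file are illustrative names, not from the article.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession: the new single entry point for DataFrame and SQL work.
val spark = SparkSession.builder()
  .appName("Spark2Preview")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// In Spark 2.0, DataFrame is simply an alias for Dataset[Row],
// so the untyped and typed APIs share one implementation.
val df = spark.read.json("people.json") // DataFrame = Dataset[Row]
val ds = df.as[Person]                  // typed Dataset[Person]
ds.filter(_.age > 21).show()
```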

SEE ALSO: Spark vs Hadoop – Who wins?

Performance optimization has always been a focal point of Spark’s development, so the obvious next step was to rethink how Spark’s physical execution layer is built. “Optimizing performance by reducing the amount of CPU cycles wasted in this useless work has been a long time focus of modern compilers,” the team wrote in the announcement.
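
The idea behind this rethink, which Spark 2.0 calls whole-stage code generation, can be illustrated with a toy example in plain Scala (this is not Spark code): instead of evaluating each row through a chain of separate operators, the chain is collapsed into one tight loop, the way a compiler would fuse it.

```scala
object CodegenSketch {
  val data: Array[Long] = Array.tabulate(1000)(i => i.toLong)

  // Operator-at-a-time style: each step is a separate per-row call
  // through generic iterator machinery.
  def interpreted(rows: Array[Long]): Long =
    rows.iterator.filter(_ % 2 == 0).map(_ * 2).sum

  // "Generated" style: the same filter + map + sum fused into one
  // loop, with no per-row virtual dispatch between operators.
  def fused(rows: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < rows.length) {
      val v = rows(i)
      if (v % 2 == 0) sum += v * 2
      i += 1
    }
    sum
  }
}
```

Both functions compute the same result; the fused version simply spends far fewer cycles on interpretation overhead, which is the kind of waste the announcement refers to.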

Spark users have appreciated its ease of use and performance, which is why it makes sense for Spark 2.0 to double down on both while extending Spark to support an even wider range of workloads. The team warned impatient users not to migrate any production workload onto this preview package until the upstream Apache Spark 2.0 release is finalized.

The technical preview is now available on Databricks.

Author
Gabriela Motroc
Gabriela Motroc is editor of JAXenter.com and JAX Magazine. Before working at Software & Support Media Group, she studied International Communication Management at the Hague University of Applied Sciences.
