Focuses on usability, stability, and refinement

Apache Spark 2.2 marks major milestone for Structured Streaming: Experimental tag is off!

Gabriela Motroc

Apache Spark 2.2 is here and it comes bearing gifts. This release removes the experimental tag from Structured Streaming and focuses on usability, stability, and polish. Read on to find out what’s new in the third release on the 2.x line.

First things first! The star of Apache Spark 2.2 is Structured Streaming, a high-level API for building continuous applications that was introduced in Apache Spark 2.0. It allows applications to make decisions in real time. Databricks’ goal is “to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way,” according to the blog post announcing the new version.
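To give a flavor of what a continuous application looks like, here is a minimal Structured Streaming sketch in Scala, based on the canonical word-count example from the Spark documentation. The socket source on localhost:9999 is purely illustrative:

```scala
// Minimal Structured Streaming sketch: a running word count over a text stream.
// Assumes Spark 2.2 and a text source on localhost:9999 (illustrative only).
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StreamingWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines, split them into words, and keep a running count.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously write the updated counts to the console.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The same DataFrame operations you would write for a batch job (`flatMap`, `groupBy`, `count`) run incrementally over the stream, which is the core idea behind Structured Streaming.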

Structured Streaming is now production ready. The release also brings a number of higher-level changes, outlined below.

Apache Spark 2.2 overview

SQL and Core APIs

Apache Spark 2.2 adds a number of SQL functionalities:

  • API updates: unified CREATE TABLE syntax for data source and Hive serde tables, and broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries
  • Overall performance and stability:
    • Cost-based optimizer: cardinality estimation for filter, join, aggregate, project, and limit/sample operators, plus cost-based join re-ordering
    • TPC-DS performance improvements using star-schema heuristics
    • File listing/IO improvements for CSV and JSON
    • Partial aggregation support for HiveUDAFFunction
    • A new JVM-object-based aggregate operator
  • Other notable changes:
    • Support for parsing multi-line JSON and CSV files
    • ANALYZE TABLE command on partitioned tables
    • Staging directories and data files are dropped after completion of insertion/CTAS against Hive serde tables
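Two of these SQL-layer additions are easy to show in a short sketch. The table names and file path below are hypothetical, and a SparkSession `spark` is assumed:

```scala
// Broadcast hint: ask the planner to broadcast the (presumably small) `dims`
// table instead of shuffling both sides of the join.
val joined = spark.sql(
  """SELECT /*+ BROADCAST(dims) */ f.id, d.label
    |FROM facts f JOIN dims d ON f.id = d.id""".stripMargin)

// Multi-line JSON: records spanning several lines can now be parsed directly
// by enabling the multiLine option.
val people = spark.read
  .option("multiLine", "true")
  .json("/path/to/people.json")
```

Previously, each JSON record had to fit on a single line; the `multiLine` option lifts that restriction for both JSON and (via a similar option) CSV sources.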

SEE ALSO: Databricks delivers Spark 2.0, focuses on speed, simplicity and more

MLlib and SparkR

These new algorithms were added to MLlib and GraphX:

  • Locality Sensitive Hashing
  • Multiclass Logistic Regression
  • Personalized PageRank
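As one example, multiclass logistic regression is exposed through the multinomial family on the existing LogisticRegression estimator. The sketch below assumes a SparkSession `spark` and a labeled dataset in libsvm format; the file path is illustrative:

```scala
// Multiclass (multinomial) logistic regression with spark.ml.
// Assumes a DataFrame with "label" and "features" columns.
import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setFamily("multinomial") // fit a single multiclass model
  .setMaxIter(10)
  .setRegParam(0.3)

val model = lr.fit(training)
// The fitted model exposes one coefficient vector per class.
println(s"Coefficients:\n${model.coefficientMatrix}")
```

With `family = "multinomial"`, a single model covers all classes at once, rather than training one-vs-rest binary models.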

Spark 2.2 also adds support for the following distributed algorithms in SparkR:

  • ALS
  • Isotonic Regression
  • Multilayer Perceptron Classifier
  • Random Forest
  • Gaussian Mixture Model
  • LDA
  • Multiclass Logistic Regression
  • Gradient Boosted Trees
  • Structured Streaming API for R
  • Column functions to_json and from_json for R
  • Multi-column approxQuantile in R

If you want to read more about Apache MLlib, check out this interview with Xiangrui Meng, software engineer at Databricks.

MLlib’s mission is to make practical machine learning easy and scalable. We want to make it easy for data scientists and machine learning engineers to build real-world machine learning (ML) pipelines. Spark MLlib makes life easier for data scientists and machine learning engineers so that they can focus on building better ML models and applications. We also want MLlib to be capable of learning from large-scale datasets. There’s an enormous amount of data being collected every day. Having more data leads to better potential to extract more value. However, it is often limited by the scalability of implementations (slow or infeasible to handle big datasets).

Xiangrui Meng, software engineer at Databricks

For more information about what’s new in Apache Spark 2.2, read the official release notes.


Gabriela Motroc
Gabriela Motroc is an online editor. Before working at S&S Media she studied International Communication Management at The Hague University of Applied Sciences.
