Apache Spark 2.2 marks a major milestone for Structured Streaming: the experimental tag is off!
Apache Spark 2.2 is here and it comes bearing gifts. This release removes the experimental tag from Structured Streaming and focuses on usability, stability, and polish. Read on to find out what’s new in the third release in the 2.x line.
First things first! The star of Apache Spark 2.2 is Structured Streaming, a high-level API for building continuous applications introduced in Apache Spark 2.0. It allows applications to make decisions in real time. Databricks’ goal is “to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way,” according to the blog post announcing the new version.
Structured Streaming is now production ready and it comes bearing a few high-level changes:
- Kafka Source and Sink: Support for reading and writing data in streaming or batch to and from Apache Kafka
- Kafka Improvements: A cached producer for lower-latency Kafka-to-Kafka streams
- Additional Stateful APIs: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
- Run Once Triggers: Allows triggering a one-time execution of a streaming query, lowering cluster costs
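Taken together, the Kafka connector and the new run-once trigger might look roughly like the following Scala sketch (the broker address, topic names, and checkpoint path are placeholders; a running SparkSession and a reachable Kafka cluster are assumed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("KafkaRunOnce").getOrCreate()

// Read a stream from Kafka (broker address and topic are placeholders)
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Write back to Kafka; Trigger.Once() processes all available data in a
// single batch and then stops -- the "run once" trigger added in 2.2
val query = input
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "events-copy")
  .option("checkpointLocation", "/tmp/checkpoints/events-copy")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()
```

Because the checkpoint records progress between runs, such a job can be scheduled periodically like a batch job while keeping streaming’s exactly-once guarantees, which is where the cluster-cost savings come from.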
Apache Spark 2.2 overview
SQL and Core APIs
Apache Spark 2.2 adds a number of SQL functionalities:
- API Updates: Unified CREATE TABLE syntax for data source and Hive SerDe tables, plus broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries
- Overall performance and stability:
- Cost-based optimizer: cardinality estimation for filter, join, aggregate, project, and limit/sample operators, as well as cost-based join re-ordering
- TPC-DS performance improvements using star-schema heuristics
- File listing/IO improvements for CSV and JSON
- Partial aggregation support of HiveUDAFFunction
- A new JVM object-based aggregate operator
- Other notable changes:
- Support for parsing multi-line JSON and CSV files
- ANALYZE TABLE command support on partitioned tables
- Staging directories and data files are dropped after completion of INSERT/CTAS against Hive SerDe tables
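Two of these additions, the SQL broadcast hint and multi-line JSON parsing, can be sketched in Scala as follows (the table names and file path are placeholders, and a running SparkSession is assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Sql22Sketch").getOrCreate()

// Broadcast hint in SQL: ask the planner to broadcast the small table
// to all executors instead of shuffling the big one (placeholder tables)
val joined = spark.sql(
  "SELECT /*+ BROADCAST(small) */ * FROM big JOIN small ON big.id = small.id")

// New in 2.2: parse JSON records that span multiple lines
val people = spark.read
  .option("multiLine", true)
  .json("/path/to/people.json")
```

Previously, Spark’s JSON reader required one record per line; the multiLine option removes that restriction for pretty-printed files, at the cost of less parallel parsing per file.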
MLlib and SparkR
These new algorithms were added to MLlib and GraphX:
- Locality Sensitive Hashing
- Multiclass Logistic Regression
- Personalized PageRank
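As an illustration, multiclass (multinomial) logistic regression in spark.ml could be used along these lines, where `training` is a hypothetical DataFrame with the conventional `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Multinomial family handles more than two label classes directly;
// "training" is an assumed DataFrame with "label" and "features" columns
val lr = new LogisticRegression()
  .setFamily("multinomial")
  .setMaxIter(10)
  .setRegParam(0.01)

val model = lr.fit(training)

// One coefficient vector per class, stacked as a matrix
println(model.coefficientMatrix)
```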
Spark 2.2 also adds support for the following distributed algorithms and features in SparkR:
- Isotonic Regression
- Multilayer Perceptron Classifier
- Random Forest
- Gaussian Mixture Model
- Multiclass Logistic Regression
- Gradient Boosted Trees
- Structured Streaming API for R
- Column functions to_json and from_json for R
- Multi-column approxQuantile in R
If you want to read more about Apache MLlib, check out this interview with Xiangrui Meng, software engineer at Databricks.
MLlib’s mission is to make practical machine learning easy and scalable. We want to make it easy for data scientists and machine learning engineers to build real-world machine learning (ML) pipelines. Spark MLlib makes life easier for data scientists and machine learning engineers so that they can focus on building better ML models and applications. We also want MLlib to be capable of learning from large-scale datasets. There’s an enormous amount of data being collected every day. Having more data leads to better potential to extract more value. However, this potential is often limited by the scalability of implementations (slow or infeasible to handle big datasets).
Xiangrui Meng, software engineer at Databricks
For more information about what’s new in Apache Spark 2.2, read the official release notes.