Data analysis tool in a new version

Apache Spark 1.6 with Dataset API

Michael Thomas

After a preview version was published at the end of November 2015, the final version of Apache Spark 1.6 is at long last ready for download. The update contains over 1,000 changes; release highlights include a variety of performance improvements, the new Dataset API and expanded data science functionality.

Performance improvements

Since Parquet is one of the most commonly used data formats in Spark and scan performance can have a considerable influence on large applications, Spark 1.6 ships with a new Parquet reader that promises performance improvements of up to 50%: in benchmarks, scan throughput for five columns increased from 2.9 million to 4.5 million rows per second.

In earlier versions, Spark split the available memory statically into execution memory (sorting, hashing, shuffling) and cache memory (buffering of “hot” data). Version 1.6 introduces a new memory manager that automatically adjusts the two regions according to the requirements of the running application, making manual tuning unnecessary. The Spark developers claim that for a considerable number of applications this frees up a significant amount of memory that can be used elsewhere.
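The relevant knobs in version 1.6 are small in number; a sketch of the defaults (values as documented for the 1.6 release):

```
spark.memory.fraction        0.75   # share of the heap for the unified execution + storage region
spark.memory.storageFraction 0.5    # part of that region protected from eviction by execution
spark.memory.useLegacyMode   false  # set to true to restore the pre-1.6 static split
```

Applications that tuned spark.storage.memoryFraction and spark.shuffle.memoryFraction under the old model can usually drop those settings entirely.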

Since state management is a key piece of functionality for streaming applications, the state management API in Spark Streaming has been redesigned: a new API called mapWithState scales linearly with the number of updates rather than with the total size of the state, and is claimed to accelerate state management by as much as 10 times.
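A minimal sketch of the new API for a running word count, assuming a DStream of (word, count) pairs; the names wordDstream and updateTotal are illustrative:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// For each key, add the new batch's count to the stored running total.
val updateTotal = (word: String, count: Option[Int], state: State[Int]) => {
  val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // persist the new total for this key
  (word, sum)         // emit the updated pair downstream
}

// wordDstream: DStream[(String, Int)] built elsewhere in the application
val totals = wordDstream.mapWithState(StateSpec.function(updateTotal))
```

Unlike the older updateStateByKey, only keys that actually received data in a batch are touched, which is where the linear-in-updates scaling comes from.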

Dataset API

A further key feature is the new Dataset API for working with typed objects. As an extension of the DataFrame API, it combines the advantages of the RDD API and DataFrames: the static typing, user-defined lambda functions and compile-time type checking familiar from RDDs, together with the optimized execution engine of DataFrames.
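A minimal sketch against the 1.6 API; the Person class and the people.json path are illustrative, and sqlContext is assumed to be an existing SQLContext:

```scala
case class Person(name: String, age: Long)

// Implicits from an existing SQLContext provide the encoders for case classes
import sqlContext.implicits._

// Read an untyped DataFrame, then view it as a typed Dataset[Person]
val people = sqlContext.read.json("people.json").as[Person]

// Lambdas are checked at compile time against the Person type,
// unlike string-based DataFrame column references
val adultNames = people.filter(_.age >= 18).map(_.name)
```

A misspelled field name here fails at compile time rather than at runtime, which is the main practical gain over plain DataFrames.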

Data Science functionality

The new data science functionality includes persistence for machine learning pipelines, as well as new algorithms for machine learning.

Spark's ML Pipeline feature lets machine learning applications build learning pipelines. Until now, user-defined code was needed to store them externally; with the new pipeline API, users can save and load pipelines and apply models trained at an earlier time to new data later on. New machine learning algorithms include, among others, univariate and bivariate statistics and survival analysis.
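A short sketch of the new persistence API, assuming a fitted PipelineModel named model and an illustrative storage path:

```scala
import org.apache.spark.ml.PipelineModel

// Persist the fitted pipeline, including all its stages (path is illustrative)
model.write.overwrite().save("/models/spam-classifier")

// Later, possibly in a different application: reload and score fresh data
val reloaded = PipelineModel.load("/models/spam-classifier")
val predictions = reloaded.transform(newData)   // newData: a DataFrame
```

Because the whole pipeline (feature transformers plus model) is stored together, the loading application does not need to reconstruct the preprocessing steps in code.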

You can find a comprehensive list of all changes on Databricks.

About Apache Spark

In February 2014, Apache Spark graduated from the Apache Foundation's incubator and became a top-level project just a few weeks after Cloudera had announced commercial support for the in-memory framework. Only a few months later, Spark could already look back on a small success story: the US company MapR Technologies integrated Spark into its distributions, and DataStax, specialist for the Cassandra database, started a cooperation with Databricks to integrate Spark with Cassandra (see also the JAXenter interview with Martin Van Ryswyk of DataStax).

Lead picture: flying sparks By Shutterstock / Copyright: Bildagentur Zoonar GmbH

Michael Thomas
Michael Thomas studied Educational Science at the Johannes Gutenberg University in Mainz and has been working as a freelance author since 2013.
