Lightning-fast updates

Apache Spark 1.3 arrives with DataFrame API

JAX Editorial Team

The fourth release in the 1.x line of Spark is here: Apache Spark 1.3 brings usability improvements alongside a new DataFrame API offering powerful and convenient operators.

Apache Spark 1.3 is now available, with this release incorporating more than 1,000 individual patches, according to project developers. The highlights of the release are undoubtedly the new DataFrame API and the fact that Spark SQL has now officially graduated out of alpha stage.

New in: DataFrame API and Spark SQL

The new DataFrame API is designed to speed up the processing of structured data records and is an evolution of the base RDD API. The team says a DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

DataFrames are supported in Java, Scala and Python, and will also be available under the still-unpublished R API. Because DataFrames are built on the Spark SQL engine, operations on them pass through an optimizer that plans their physical execution so that they work well on large data sets.

SEE ALSO: Typesafe survey reveals increased exploration of Apache Spark

No longer in alpha stage with version 1.3, the Spark SQL component provides backwards compatibility for the HiveQL dialect and stable programmatic APIs. Spark SQL is now fully interoperable with the DataFrame component, which allows a DataFrame to be created from Hive tables, Parquet files and similar sources.

In addition, tables can now be read over a JDBC connection, which allows importing from and exporting to MySQL, Postgres and other relational databases.
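A minimal sketch of reading such a table, using the 1.3-era `load()` syntax; the connection URL, credentials and table name below are placeholders, and the matching JDBC driver jar must be on Spark's classpath:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "jdbc-demo")
sqlContext = SQLContext(sc)

# Pull a database table into a DataFrame over JDBC.
# URL and table name are hypothetical examples.
df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
    dbtable="schema.people",
)

df.printSchema()
sc.stop()
```

From there the table can be filtered and joined like any other DataFrame, with the source relation participating in Spark SQL's query planning.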

Other notable features in version 1.3 include a handful of usability improvements in the core engine, support for multi-level aggregation trees and SSL encryption for some communication endpoints.

Several new algorithms have also been added to Spark MLlib, along with a new direct Kafka API that “enables exactly-once delivery without the use of write ahead logs”.

An overview of all new features can be found in the official release notes.
