New Apache Spark library aims to make deep learning approachable
The Apache Spark Summit is almost over but one cannot deny that it’s been an interesting ride: Deep Learning Pipelines, Structured Streaming and Databricks Serverless are among the newest additions to the Spark universe. Let’s see why they deserve your attention.
Databricks, the company founded by the creators of the Apache Spark project, believes that Deep Learning Pipelines has the potential to accomplish “what Spark did to big data: make the deep learning ‘superpower’ approachable for everybody.”
Matei Zaharia, co-founder and chief technologist at Databricks, says that the new library “is a huge step in furthering Databricks’ mission to democratize artificial intelligence and data science. This work has the potential to accomplish for deep learning what Spark did for big data, which is to make it approachable to a much broader audience, from data scientists to business analysts.”
This open-source library aims to enable everyone, from machine learning practitioners to business analysts, to easily integrate scalable deep learning into their workflows. It builds on Apache Spark’s ML Pipelines for training and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be done efficiently in a few lines of code:
- Image loading
- Applying pre-trained models as transformers in a Spark ML pipeline
- Transfer learning
- Distributed hyperparameter tuning
- Deploying models in DataFrames and SQL
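To give a feel for what distributed hyperparameter tuning involves, here is a minimal sketch in plain Python. It is not the Deep Learning Pipelines API: a local thread pool stands in for a Spark cluster, and `validation_loss`, `grid` and `tune` are hypothetical names invented for this illustration. The idea is only that each parameter combination can be evaluated independently, so the work spreads naturally across workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    lr, reg = params["learning_rate"], params["reg"]
    # Toy loss surface with its minimum at learning_rate=0.1, reg=0.01.
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2


def grid(space):
    """Expand {name: [values]} into a list of parameter dictionaries."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def tune(space, evaluate, workers=4):
    """Evaluate every grid point in parallel and return (best_params, best_loss)."""
    candidates = grid(space)
    # Assumption: a thread pool plays the role of the cluster's executors.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(evaluate, candidates))
    best = min(range(len(candidates)), key=losses.__getitem__)
    return candidates[best], losses[best]


space = {"learning_rate": [0.01, 0.1, 1.0], "reg": [0.0, 0.01, 0.1]}
best_params, best_loss = tune(space, validation_loss)
print(best_params, best_loss)
```

In a real deployment the evaluation function would train and score an actual model, and the scheduler would be Spark rather than a thread pool.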
To find out more about Deep Learning Pipelines, check out the blog post announcing the new library.
Structured Streaming: Best-in-class performance and latency
Structured Streaming is one year old and it is already the simplest-to-use streaming engine and, for many workloads, also the fastest. Not only does it make it easy to build end-to-end streaming applications by exposing a single API, so that streaming queries are written the same way as batch queries, but it also handles the complexities of streaming by ensuring exactly-once semantics, performing incremental stateful aggregations, and providing data consistency across sources and sinks.
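The two ideas named above, incremental stateful aggregation and exactly-once semantics, can be sketched in a few lines of plain Python. This is a toy model rather than Spark code: `RunningCounts` and its `apply` method are hypothetical, and the sketch assumes each batch carries a monotonically increasing id. The state is updated one batch at a time, a batch that is delivered twice is applied only once, and the final result matches what a batch computation over the same data would return.

```python
from collections import Counter


class RunningCounts:
    """Keeps per-key counts and remembers the last batch it applied."""

    def __init__(self):
        self.counts = {}
        self.last_batch_id = -1

    def apply(self, batch_id, events):
        # A replayed batch (for example after a retry) is skipped,
        # so every event contributes to the counts exactly once.
        if batch_id <= self.last_batch_id:
            return False
        # Incremental update: only the new events are touched.
        for key in events:
            self.counts[key] = self.counts.get(key, 0) + 1
        self.last_batch_id = batch_id
        return True


# Batch 1 is delivered twice to simulate a retry.
batches = [(0, ["click", "view"]), (1, ["click"]), (1, ["click"]), (2, ["view", "view"])]
state = RunningCounts()
applied = [state.apply(batch_id, events) for batch_id, events in batches]

# Batch-style answer over the same data, with the duplicate delivery removed.
unique = {batch_id: events for batch_id, events in batches}
batch_counts = Counter(e for events in unique.values() for e in events)
print(state.counts, dict(batch_counts))
```

A real engine would also have to persist the counts and the last applied batch id together, so that a crash between the two cannot leave them out of step.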
Structured Streaming brings the efficiency of Spark SQL to real-time streaming. In Databricks’ benchmarks, it showed 5x or better throughput than other popular streaming engines on the widely used Yahoo! Streaming Benchmark. Furthermore, the company designed the Structured Streaming API to be agnostic to the underlying execution engine, eliminating the concept of batching from the API, and it has also been working to remove batching from the engine itself.
A new extension, continuous processing, which eliminates micro-batches from execution, was also announced at the Spark Summit. This new execution mode lets users achieve sub-millisecond end-to-end latency for many important workloads, with no change to their Spark application, said Databricks’ Michael Armbrust.
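Why dropping micro-batches helps latency can be shown with a little arithmetic. The sketch below is a deliberately simplified model, not a measurement of Spark: it assumes processing itself is instantaneous, that a micro-batch engine only picks up records at fixed trigger boundaries, and that a continuous engine handles each record on arrival. The function names and the arrival times are made up for the example.

```python
def micro_batch_latencies(arrivals_ms, trigger_ms):
    """Waiting time until the next trigger boundary for each record (toy model)."""
    latencies = []
    for t in arrivals_ms:
        # Ceiling division: the first trigger at or after the arrival time.
        next_trigger = -(-t // trigger_ms) * trigger_ms
        latencies.append(next_trigger - t)
    return latencies


def continuous_latencies(arrivals_ms):
    """Records are handled on arrival, so the waiting time is zero in this model."""
    return [0 for _ in arrivals_ms]


arrivals = [5, 120, 480, 990]  # hypothetical arrival times in milliseconds
micro = micro_batch_latencies(arrivals, trigger_ms=500)
continuous = continuous_latencies(arrivals)
print(micro)
print(continuous)
```

In this model a record can wait up to one full trigger interval before a micro-batch engine even looks at it, which is the overhead a continuous execution mode is meant to remove.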
Databricks clients can access the latest streaming features through the Databricks Runtime 3.0 beta. Structured Streaming is production ready.
Databricks Serverless: first fully managed computing platform for Apache Spark
Databricks Serverless allows teams to share a single pool of computing resources while it automatically isolates users and manages costs. It aims to remove the complexity and cost of users managing their own Spark clusters. Additional benefits of Databricks’ Serverless offering include auto-managed cluster configuration, scaling of local storage, adaptation to multiple users sharing a cluster, and security.
Ali Ghodsi, co-founder and chief executive officer at Databricks, will offer more details of the new Databricks Serverless platform today in his Spark Summit 2017 keynote.