“A machine learning model is only as good as the data it is fed”
Apache Spark 2.3 was released earlier this year; it marked a major milestone for Structured Streaming, but there are plenty of other interesting features that deserve your attention. We talked with Reynold Xin, co-founder and Chief Architect at Databricks, about the Databricks Runtime and other enhancements introduced in Apache Spark 2.3.
JAXenter: What’s next for Apache Spark?
Reynold Xin: Development on Apache Spark remains very active, with the goal of further unifying big data and AI for the masses. For example, Databricks continues to contribute significantly to the Apache Spark community, with the goal of making machine learning and deep learning more accessible to companies that may not have the engineering talent to do AI.
Also, there’s a lot of work going into helping organizations tap into the promise of continuous processing via Structured Streaming — a high-level, easy-to-use stream processing API for developing continuous applications that delivers millisecond-level latency. The goal here is to continue building upon what is already the easiest-to-use and fastest open-source streaming API in terms of performance per dollar and per machine.
Last, we are focused on making Spark easier to use from Python and R, on improving performance, and on making it easier to deploy on more types of platforms, such as Kubernetes, Apache Mesos, or various clouds.
JAXenter: The Databricks Runtime — built on top of Apache Spark — focuses on simplifying big data and AI for enterprise organizations. How does it plan to do that?
Reynold Xin: The Databricks Unified Analytics Platform is a fully managed platform, which unifies big data and AI — removing the operational complexities of Spark. At the core of the Databricks Unified Analytics Platform is Databricks Runtime.
Databricks Runtime enhances Apache Spark to be faster, more reliable, and scalable in the cloud. It improves on open-source Spark in a number of ways:
- We have a special I/O layer in Databricks Runtime that is optimized for easy, fast access to data in cloud object stores such as Azure Blob Storage and Amazon S3.
- We provide automated caching and indexing to accelerate query speeds at massive scale. Through the TPC-DS benchmark, we’ve seen up to 5x performance gains compared to vanilla Spark in the cloud, and we have observed up to 1200x speedups in customer workloads.
- Reliability: We can ensure data integrity and reliability with transactions for multiple concurrent writers of both batch and streaming jobs.
- Scalability: Built natively for the cloud, we are able to leverage the inherent benefits and massive scalability of the cloud.
Finally, to completely unify the end-to-end process, we offer a collaborative workspace that supports multiple programming languages (R, Scala, Python, SQL), offers integrated visualizations, and packages machine learning and deep learning libraries as well as integrations with common frameworks (e.g. MLlib, TensorFlow), making it easier to build, train, and deploy models into production. The collaborative workspace helps close the skills gap and can increase productivity by more than 5x by enabling data scientists with diverse skills to explore data and to build and train models collaboratively with other data teams and the business.
JAXenter: How does Databricks make it easier to deploy machine learning models?
Reynold Xin: Databricks provides a collaborative workspace with an interactive notebook environment in which data scientists can easily build, train, and deploy machine learning models — all from a single tool. The notebooks support multiple programming languages (R, Scala, Python, SQL) so data teams can be more productive. Also integrated into the platform are machine learning and deep learning libraries to facilitate the modeling process.
Last, a model export feature allows you to export models and full machine learning pipelines from Spark MLlib. These exported models and pipelines can be imported into other (Spark and non-Spark) platforms to do scoring and make predictions. This feature allows companies to build low-latency and lightweight machine learning-powered applications with ease.
JAXenter: Runtime 4.0 is twice as fast as 3.0. How did you manage that? What’s under the hood of Runtime 4.0?
Reynold Xin: We have done a number of optimizations in DBR 4.0, but the most impactful one is Databricks Caching. Databricks Caching in Databricks Runtime 4.0 automatically caches hot input data for a user and load balances across a cluster. It leverages advances in NVMe SSD hardware together with state-of-the-art columnar compression techniques and can significantly improve the performance of interactive and reporting workloads. It can cache 30 times more data than Spark’s in-memory cache. Together with other performance improvements, this makes Databricks Runtime 4.0 2x faster than Databricks Runtime 3.0 in the industry-standard TPC-DS benchmark.
JAXenter: How can the innovations brought by Spark 2.3 and Databricks Runtime 4.0 unite big data and machine learning?
Reynold Xin: The enhancements introduced in Spark 2.3, which is supported within the latest Databricks Runtime 4.0, focus on usability, stability, and refinement, all of which matter when data engineers tackle common big data challenges while building the performant and reliable data pipelines that machine learning applications depend on. Why is it so critical to address your big data challenges before trying to tackle machine learning use cases? Simply put, a machine learning model is only as good as the data it is fed. That data has to be cleansed and prepped in an efficient and reliable manner before being passed into the model. Without the data, your model is useless.
In addition, new features such as stream-to-stream joins will help companies more easily build continuous applications, and the extension of new functionality to SparkR, Python, MLlib, and GraphX will make it easier for data scientists to build and deploy machine learning models — all with a focus on helping companies unify big data and machine learning to accelerate AI innovation.
Finally, our new Pandas UDF substantially improves the performance and usability of user-defined functions (UDFs) in Python — providing a key way to connect machine learning frameworks in Python with Spark to unify big data and AI.