Spark and Cassandra team up to make database analytics super fast
In a pairing described as ‘magical’, Apache Spark and Cassandra have pooled their resources to deliver analytics up to “100 times” swifter in-memory, and 10 times speedier on disk.
In an industry precedent-setting move, last week DataStax trumpeted a partnership that will see the integration of Apache Spark into the Cassandra database. In this interview, Martin Van Ryswyk, Executive Vice President of Engineering at DataStax, talks integration essentials, benefits to users, and shifting enterprise approaches to big data.
JAX: How did this collaboration come about?
Van Ryswyk: We have been following the evolution of Spark and are very impressed with both the technology and the talent at Databricks. The two technologies are a natural fit and we decided that users deserved a strong solution backed by the leaders in both technologies.
Do you think there’s anything on the market comparable to this new solution?
DataStax is the first NoSQL vendor to offer faster analytics on real-time data. In the NoSQL space, the only thing somewhat comparable is Spark with HBase, but HBase is more of a Hadoop data warehouse component, whereas Cassandra is used for online, always-on, transactional applications.
What will be the biggest challenges for integration?
To ensure tight integration between Spark and Cassandra, mapping of data is necessary. After that come high availability and security considerations.
What will be the biggest benefits, and who in particular do you think it will be useful for?
This new integration gives modern businesses an alternative to relational databases for delivering near real-time data analytics in their online applications. Here is how two of our users describe the benefits:
“The new Spark/Shark functionality on Cassandra is giving our users a scalable and high-performance way to quickly analyze our constantly growing data set. By moving from a relational database, this new functionality will allow us to deliver real-time data analytics where before our users relied on time delayed reports.” - Chanan Braunstein, Director of Next Gen Homework Applications at Pearson Education
“What we all need is a generic way to run functions over data stored in Cassandra. Sure, you could go grab Hadoop, and be locked into articulating analytics/transformations as MapReduce constructs. But that just makes people sad. Instead, I’d recommend Spark. It makes people happy.” - Brian O’Neil, CTO at Health Market Science
Can you see any other players in the industry in particular following suit now you’ve set this precedent?
I can only speak to our own activity, but since this is a leading, industry-first collaboration, it is reasonable to assume that other players will follow our lead.
Are you seeing a change in how companies value and approach data in 2014?
We see more modern enterprises using data as a strategic asset to compete. Companies are moving towards more “near term” analytics that can provide data insights in real time, so they can respond quickly. Because of this, online applications that interact with customers and collect data have zero tolerance for downtime and must be able to reach and work with their customers’ data no matter where those customers are located.