Instaclustr: “We want to help grow the Apache Cassandra project”
Instaclustr, an open source-as-a-service company delivering reliability at scale, wants to improve and maintain Apache Cassandra. Its intent is to build a team of committers who are actively involved and can provide some real operational experience to the community. We talked with Ben Bromhead, CTO and Co-Founder at Instaclustr, about all this and more.
Reaction from Andrew Lampitt, Sr. Director, Product Marketing at DataStax:
DataStax has not departed the Apache Cassandra open source project. Some may not be aware that DataStax has built most of Apache Cassandra with more than 85% of code commits to date. At the request of the Apache Foundation, our co-founder and CTO stepped down as chair of the project. And we applaud the objective, to encourage broader community participation.
Still, we employ the lion’s share of committers, and we will likely be the biggest contributor of Apache Cassandra 4.0 when it releases. As a strong open source advocate, we are also driving Apache TinkerPop with nearly 100% of code contributions. Plus, developers, other distributed database professionals, and those working with DataStax Enterprise and Apache Cassandra are fully supported by DataStax’s Developer Relations team. We are additionally driving Apache Cassandra events in various cities around the world.
JAXenter: Since DataStax moved all full-time Apache Cassandra™ developers to the DSE team, someone needs to step in — and that someone might be Instaclustr. Does it want to be the next DataStax for Cassandra?
Ben Bromhead: We do not want to be the next DataStax for Cassandra. I don’t think the community wants that and neither do we; what we are looking to be is a positive and prominent citizen without dominating.
DataStax certainly helped to bootstrap the Apache Cassandra project, but over time a rich ecosystem of other vendors and operators started to grow around the community. Since DataStax moved all full-time Apache Cassandra™ developers to the DSE team, some of the machinery that helps run an open source project moved away (including test infrastructure and issue triage). One of the things we are doing is filling this gap through compute and human resources.
The end result, as we see it at Instaclustr, is a strong, sustainable community comprised of those who run open source Apache Cassandra in production and rely on this technology for some of their most important revenue-generating applications.
JAXenter: How does Instaclustr plan to fill DataStax’s shoes?
Ben Bromhead: We don’t believe the Apache Cassandra community wants somebody to replicate DataStax’s role as it had been historically structured, per se. Rather, we want to help grow the project by making sure there is engagement with the many users (large and small) of open source Apache Cassandra, and to improve and maintain this particularly powerful and reliable NoSQL database.
Within Instaclustr, we have now established a team of dedicated developers to work on community-related Apache Cassandra activities. This includes writing code, participating in project activities, fixing bugs, and writing new features that the community can take advantage of.
Our intent is to build a team of committers that are actively involved and can provide some real operational experience to the community. We have well over 15 million node hours of Apache Cassandra management experience and think it’s fair to say we have seen this amazing database deployed in both the most effective and the most ineffective ways – and from very large-scale deployments as well as smaller projects.
We believe that our unique operational experience is a real positive for the Apache Cassandra community, and we feel compelled to do our part and make that experience available.
JAXenter: What is the difference between DataStax’s Apache Cassandra and that with Instaclustr playing a larger role?
Ben Bromhead: The project has always been under the control of the Apache Software Foundation. It is just that some participants at times play significant roles in the community, which can cause that company or brand to be synonymous with the project. The community owns Apache Cassandra.
At Instaclustr, we have made a commitment to deploying the open source version of Apache Cassandra and other related and integrated solutions. While the database is important to our deployment, we also deploy an integrated suite of other open source technologies with their own projects and communities, such as Apache Spark and (soon) Apache Kafka.
DataStax Enterprise (DSE) is a proprietary database, built by DataStax using the open source Apache Cassandra as the foundation. The product is commercialized and, as such, has license fees associated with it – and much of the code base is proprietary.
With Instaclustr, there is no license fee for Apache Cassandra because we deploy open source. Our customers pay us for management of the infrastructure and database, and to be available 24/7 for technical support. This focuses us on adding value for our customers every day rather than having to justify hefty technology license fees.
JAXenter: Why is the open source Cassandra version more important than the commercialized version?
Ben Bromhead: Fundamentally, without the open source version there wouldn’t be a commercialized version. The community and the resulting capability came well before any commercialization of the technology. For many organizations choosing to deploy Apache Cassandra – and then deciding on either commercial or open source – it really comes down to a strategic technology decision.
Some traditional enterprises feel more comfortable with the approach of paying licensing fees. However, we see that so many of the technology leaders and the largest users of this technology will only ever deploy the open source version of Apache Cassandra. No technology lock-in, more transparency, and a large and vibrant community are the key reasons why. These large users need to have the decision-making in their own hands, rather than in the hands of a vendor promoting its own product.
JAXenter: What is the value of open source Cassandra? What does it take to make open source Cassandra succeed?
Ben Bromhead: I believe there is significant value in any open source technology that is supported by a strong community. This is especially true for data-related technologies such as Cassandra, Spark, ElasticSearch, and Kafka, to name just a few.
The community is the key and always has been for the strongest open source projects. The community is so powerful and exponential in what it can achieve compared to any commercially-driven organization. For Apache Cassandra, the continued participation of real operational users is so critical to its success.
JAXenter: If things go south, what would the world miss the most about open source Cassandra?
Ben Bromhead: Given the size and depth of the community (and user base), I think it is very unlikely that things will go south. Some things may change such that Cassandra in five years looks very different to what it looks like now, as the needs of its users and contributors adjust and evolve. The biggest and best thing about Cassandra is that it has an extensive community of users and companies who are solving scalability and availability challenges, and as long as those challenges exist, I believe Cassandra will exist.
JAXenter: What does Instaclustr have to do in order to prove that Cassandra’s performance is on par with its ability to scale and its availability?
Ben Bromhead: I think that’s a false scenario, as I believe Cassandra has proven to be very performant and is now the de facto benchmark when others want to demonstrate scale and performance. Cassandra certainly can and will improve in terms of resource efficiency and helping users get the most out of their hardware – which is a problem most databases face to varying degrees. Due to the challenges in this space, Cassandra certainly does fall victim to “vendor benchmarking.”
Let’s take this recent post from YugaByte as an example. The arrangement for this testing states that the machine setup was identical. That’s all well and good, but the configuration for Cassandra was terrible – 30GB maximum heap (probably too large for optimum performance) with 1600MB HEAP_NEWSIZE (~5% of total heap when 25% is the recommended starting point for tuning). We would never deploy Cassandra in this configuration, nor would any organization deploying a production-grade cluster.
That’s not to say that YugaByte hasn’t built an interesting or performant platform, but a Cassandra configuration like this does make the benchmark hard to take seriously.
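To make the heap arithmetic in that answer concrete, here is a minimal sketch that reproduces the roughly 5% figure from the numbers quoted above. The function name and constants are hypothetical, 1GB is assumed to be 1024MB, and the 25% value is simply the starting point named in the interview, not a tuning rule for every cluster.

```python
def new_gen_fraction(max_heap_mb: int, new_size_mb: int) -> float:
    """Return the new generation size as a fraction of the maximum heap."""
    if max_heap_mb <= 0:
        raise ValueError("max_heap_mb must be positive")
    return new_size_mb / max_heap_mb


# Figures quoted in the interview: 30GB maximum heap, 1600MB HEAP_NEWSIZE.
# Assumption: 1GB = 1024MB.
MAX_HEAP_MB = 30 * 1024
HEAP_NEWSIZE_MB = 1600

# Starting point for tuning as stated in the interview (an assumption here,
# not a universal recommendation).
SUGGESTED_FRACTION = 0.25

benchmark_fraction = new_gen_fraction(MAX_HEAP_MB, HEAP_NEWSIZE_MB)
suggested_new_size_mb = int(MAX_HEAP_MB * SUGGESTED_FRACTION)

print(f"New generation share in the benchmark: {benchmark_fraction:.1%}")
print(f"25% of a 30GB heap would be: {suggested_new_size_mb} MB")
```

Under these assumptions, the benchmark's new generation is about a fifth of the size that the 25% starting point would suggest for the same heap.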
JAXenter: What’s next for Cassandra? What should users expect to find in the next releases?
Ben Bromhead: So many cool things are happening right now! We just attended the Next Generation Cassandra Conference where a number of upcoming features were announced. Here’s an overview (with some additional details available in a blog post I recently wrote):
- Pluggable storage with RocksDB, delivering considerable performance improvements.
- Virtual tables. Some great work is being done on this front, which will allow API developers to create virtual tables in Cassandra.
- Change Data Capture (CDC) improvements. Uber is spending a ton of time improving CDC performance since they have built some in-process CDC mechanisms – which means this code path is going to be better tested.
- Decoupling redundancy from availability. Developers from both Instagram and Apple have ongoing work allowing Cassandra to have nodes acting as hint stores or lightweight replicas in specific situations (using different approaches to enable better cost efficiencies in high availability situations).