Making a blind “default” choice of a database will shape the application
Today, we’re taking a second look at a new database on the scene, PumpkinDB. Although this new event sourcing database engine is just getting started, we’re fascinated by what we can see so far. We talked to Yurii Rashkovskii, maintainer at pumpkindb.org, about this new project, its benefits and challenges, and databases in general.
The explosion in database flavors creates more interest
JAXenter: The findings of our annual survey show that data processing is a very relevant topic for our readers. Although databases haven’t exactly been the coolest things to talk about in tech for a while, it seems that they are back in the public eye. What’s your take on their comeback?
Yurii Rashkovskii: I think it’s a multifaceted phenomenon. Multiple things are at play here. The most obvious one is hype. There are a lot of new database vendors and they need to market their solutions, oftentimes aggressively, as it is getting harder and harder to get anybody’s attention. It is already a fairly saturated market. As a result of this massive marketing effort, the public is naturally more aware of the topic. Add the resulting decision-making anxiety (“what should I choose?”) and you have a recipe for additional interest and, again, more content.
Another aspect, I believe, is the changing environment and needs of our applications. Distributed computing, big data, complex business logic, high-performance and high-frequency operations, machine learning and other segments are creating new problems, new challenges, and new competing solutions. There is no “one right answer” anymore. Different domains drive different types of solutions, and we’re yet to see them converge. Again, naturally, this explosion in database flavors creates more interest. Sure, we also had different types of databases before, but they often were more niche or on the margins.
JAXenter: What is the biggest misconception about databases?
Yurii Rashkovskii: It’s a difficult question. I am going to be extremely biased here, I suppose. I feel like a lot of people treat databases as a somewhat “foreign” yet “complementary” entity in relation to their application. A place where the application stores data.
I personally believe that, philosophically, a database is just as integral a part of the application as the business logic. Even more, the line between them might at times get a bit blurry, and that’s a good thing. This might sound like a subtle argument about terminology, but I think the reality is that an application shapes the way it works with the data and the data, in turn, shapes the application. Oftentimes, making a blind “default” choice of a database will, in fact, shape the application. Whether that’s desirable or not, it kind of depends.
Another (longer-term) thing I want to mention is that I think there’s a lot of lust for scalability (and, sometimes, performance). While it’s not a bad thing by itself, it does make systems more complicated. Sure, right now we are living through the SaaS stage of computing and it kind of imposes this model onto us: we need to serve more and more customers from a relatively centralized facility. As things gradually become a bit more decentralized (the amount of computing power on the edges is staggering!), I suspect that a database’s needs can become, in a sense, less demanding on the scalability/performance front, and we’ll see other capabilities evolve.
PumpkinDB — “Data is gold. Don’t let it slip away”
JAXenter: PumpkinDB is the new kid on the block. What’s it about and what inspired you to create it?
Yurii Rashkovskii: Originally, I was looking for a way to rewrite a previous project of mine (Eventsourcing for Java) with better storage. I’ve always had a thing for MUMPS. However esoteric it was, it was a surprisingly productive tool for developing database applications. It wasn’t great for developing business logic though. After a few attempts to integrate GT.M (one of the MUMPS implementations) into a Rust library in a nice way, I gave up and decided to rethink this a bit.
My inkling was that if I just take some good K/V storage and integrate a programming language directly with it, I can replicate at least some of MUMPS’ productivity. But I didn’t want to be bothered to implement a sophisticated language. This is why I took LMDB and implemented the most barebones type of language I know: a Forth-like one. It is now called PumpkinScript and it is really, really simple. Just like M, it has only one data type (binary) and the only construct it has is an instruction, a thing that operates on a stack. I find that it is actually quite suitable for describing “data pipelines”, which is the most common use case for it anyway!
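To make the “one data type, one construct” idea concrete, here is a minimal sketch of a Forth-like stack machine in Python. It is an illustration only: the instruction names (`DUP`, `SWAP`, `CONCAT`, `LENGTH`) and the `run` function are assumptions chosen for this example and are not presented as PumpkinScript’s actual instruction set or implementation.

```python
from typing import Callable, Dict, List, Union

Stack = List[bytes]  # the only data type on the stack is a byte string


def op_dup(stack: Stack) -> None:
    """Duplicate the top item."""
    stack.append(stack[-1])


def op_swap(stack: Stack) -> None:
    """Exchange the two topmost items."""
    stack[-1], stack[-2] = stack[-2], stack[-1]


def op_concat(stack: Stack) -> None:
    """Pop two items and push their concatenation."""
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)


def op_length(stack: Stack) -> None:
    """Pop an item and push its length, itself encoded as bytes."""
    stack.append(len(stack.pop()).to_bytes(4, "big"))


# Hypothetical instruction table for this toy machine.
INSTRUCTIONS: Dict[str, Callable[[Stack], None]] = {
    "DUP": op_dup,
    "SWAP": op_swap,
    "CONCAT": op_concat,
    "LENGTH": op_length,
}


def run(program: List[Union[bytes, str]]) -> Stack:
    """Execute a program: bytes are pushed, strings name instructions."""
    stack: Stack = []
    for item in program:
        if isinstance(item, bytes):
            stack.append(item)
        else:
            INSTRUCTIONS[item](stack)
    return stack


result = run([b"user:", b"42", "CONCAT", "DUP", "LENGTH"])
```

Here the program reads left to right like a pipeline: two values are pushed, joined into `b"user:42"`, duplicated, and the copy is replaced by its length. Even the number ends up on the stack as bytes, which is what “only one data type” implies in practice.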
Another source of inspiration, and you can see it in PumpkinDB’s forced immutability, is functional programming and techniques like event sourcing. Mutable data is one of the things that make programming more complicated, especially when you have more than one application server (so, very often!). PumpkinDB is obviously not the first immutable database, but it is still not the most popular approach, so I wanted to explore it further.
Ultimately, the driving force behind this was to create an engine that would allow me to build applications that don’t lose the multifaceted nature of data by shaping it into domains too early. My articles on “lazy event sourcing” offer further clarification.
JAXenter: What are its benefits? How about its challenges?
Yurii Rashkovskii: Let me start with challenges. I think the biggest challenge is that PumpkinDB requires you to re-think how you record and query your data as you can’t delete it (at least not immediately, as we’ll have mechanisms for gradually retiring data). This effectively drives you away from “state of the world” domain modelling toward event- or fact-centric systems. It’s a big challenge as, on the surface, domain modelling is easier.
However, over time, as things grow and understanding changes or improves, maintaining a domain model becomes more challenging. And this is where PumpkinDB (or PumpkinDB-based databases) can really help. Focusing on records of facts or similar kinds of information gives you a much lower cost when it comes to changing domain models, as it mostly comes down to changing some queries rather than migrating or duplicating data for re-use in overlapping domain models.
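The trade-off described above can be sketched in a few lines of Python. This is a generic illustration of fact-centric recording, not PumpkinDB code: the `Fact` and `FactLog` types, the fact kinds, and both query functions are hypothetical names invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple


class Fact(NamedTuple):
    """An immutable record of something that happened."""
    kind: str
    account: str
    amount: int


class FactLog:
    """Append-only log: facts can be recorded and read, never changed."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def record(self, fact: Fact) -> None:
        self._facts.append(fact)

    def __iter__(self) -> Iterator[Fact]:
        # Iterate over a snapshot so readers cannot mutate the log.
        return iter(tuple(self._facts))


def balances(log: FactLog) -> Dict[str, int]:
    """One domain model: current balance per account."""
    totals: Dict[str, int] = defaultdict(int)
    for fact in log:
        if fact.kind == "deposited":
            totals[fact.account] += fact.amount
        elif fact.kind == "withdrew":
            totals[fact.account] -= fact.amount
    return dict(totals)


def withdrawal_counts(log: FactLog) -> Dict[str, int]:
    """A second domain model over the very same facts."""
    counts: Dict[str, int] = defaultdict(int)
    for fact in log:
        if fact.kind == "withdrew":
            counts[fact.account] += 1
    return dict(counts)


log = FactLog()
log.record(Fact("deposited", "alice", 100))
log.record(Fact("withdrew", "alice", 30))
log.record(Fact("deposited", "bob", 50))
```

Adding `withdrawal_counts` required no migration and no copy of the data, only a new query over the existing facts. A “state of the world” table holding just the current balances would have lost the information needed to answer it.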
JAXenter: Users can now embed PumpkinDB into their Rust applications. According to your blog post, C is next. Why this language? When should we expect Java or other languages?
Yurii Rashkovskii: Just to make sure we’re clear on this, when I refer to embedding I mean it is possible to put PumpkinDB’s engine inside of another application (no network connection required, it just becomes part of your application’s process). Like you said, it’s possible to do it for Rust applications now and at some point it’ll be relatively easy to extend this to C. Technically speaking, I think from that point on, we can allow embedding the engine into programs written in other languages. For Java, that’ll probably mean the JNI route. I am not intimately familiar with it, though.
On the other hand, any language can integrate with PumpkinDB over a TCP connection. It has a very simple protocol; in fact, it’s just a dumb framing protocol that one uses to send PumpkinScript programs in binary form over to the other side. We already have a pre-release Java library for it.
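For readers unfamiliar with the term, a framing protocol only marks where one message ends and the next begins in a byte stream. The Python sketch below shows one common approach, a length prefix. The 4-byte big-endian header and the function names are assumptions made for illustration; the interview does not specify PumpkinDB’s actual wire format.

```python
import struct
from typing import List, Tuple

# Assumed header layout for this sketch: unsigned 32-bit big-endian length.
HEADER = struct.Struct(">I")


def encode_frame(payload: bytes) -> bytes:
    """Prefix a payload (e.g. a compiled program) with its length."""
    return HEADER.pack(len(payload)) + payload


def decode_frames(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split a receive buffer into complete frames plus leftover bytes.

    The leftover is an incomplete frame that should be kept and
    prepended to the next chunk read from the socket.
    """
    frames: List[bytes] = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append(buffer[offset + HEADER.size:end])
        offset = end
    return frames, buffer[offset:]


# Two complete frames followed by a truncated third one.
partial = encode_frame(b"hello")[:6]
stream = encode_frame(b"abc") + encode_frame(b"") + partial
frames, rest = decode_frames(stream)
```

Because the protocol carries opaque programs rather than a fixed set of message types, the framing layer never needs to change when the database gains new capabilities.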
JAXenter: Is PumpkinDB for everybody?
Yurii Rashkovskii: PumpkinDB itself is definitely not for everybody. I’ll explain why. It’s rather a low-level database. In fact, and it has been mentioned a couple of times in its materials, it’s rather a “database engine”. Its original and core audience consists of those who’d like to build more suitable “last mile” databases on top of it. Usually, that would take the form of embedding the PumpkinDB engine and adding some functionality on top of it that would shape the way programs interact with it. These databases can also easily leverage its simple but powerful protocol that’s effectively a way to send small programs instead of having pre-determined message types. This gives database designers a lot of flexibility.
As an example of that, I am currently working on ViewDB. It’s a relatively thin layer on top of PumpkinDB that allows users to collect data in the form of so-called “facts” that are comprised of attributes and then query or stream them to obtain runtime domain model mappings, depending on what is needed. In a sense, ViewDB is a successor of one of my previous projects, Eventsourcing for Java, but it can be used with any programming language, has a much, much lighter core and is a lot more flexible.
I am also trying to break away from the event sourcing nomenclature with this project (the reasons for this are perhaps out of this interview’s scope). Some of the early followers of my work in this area mentioned that there are a lot of similarities with Datomic. I never invested time to learn it back when it was released, as it wasn’t and isn’t free/open source software and I tend to avoid putting proprietary software at the core of my applications. Nevertheless, I went back to their documentation and, indeed, there are a lot of similarities in some of our decisions. If anything, this helps me believe that our project is on the right track!
ViewDB hasn’t been released yet, but I am hoping to publish its initial prototype this summer. Those who want to track it can subscribe to this issue on GitHub.
Another category of PumpkinDB users is represented by application developers who want greater control over how they persist and search their data. Instead of writing declarative queries, they can write B-Tree navigation and aggregation algorithms themselves, leading to the shortest query path and no overhead from invoking a query plan optimizer (however small that might be). It sounds difficult, but for the most part it’s actually just iterating over cursors, so it’s not that complicated!
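What “just iterating over cursors” means can be shown with a small Python sketch. A sorted list stands in for the B-tree here, and the `SortedStore` class, its `scan_prefix` method, and the key layout are all hypothetical; this is not the PumpkinDB or LMDB API.

```python
import bisect
from typing import Iterable, Iterator, List, Tuple


class SortedStore:
    """Toy ordered key/value store; a sorted list stands in for a B-tree."""

    def __init__(self, items: Iterable[Tuple[bytes, bytes]]) -> None:
        pairs = sorted(items)
        self._keys: List[bytes] = [key for key, _ in pairs]
        self._values: List[bytes] = [value for _, value in pairs]

    def scan_prefix(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        """Seek to the first key >= prefix, then walk forward in key order."""
        index = bisect.bisect_left(self._keys, prefix)  # the "seek" step
        while index < len(self._keys) and self._keys[index].startswith(prefix):
            yield self._keys[index], self._values[index]
            index += 1  # the "next" step of a cursor


def total_for(store: SortedStore, prefix: bytes) -> int:
    """Hand-written aggregation: sum the numeric values under a key prefix."""
    return sum(int(value.decode()) for _, value in store.scan_prefix(prefix))


store = SortedStore([
    (b"order:bob:1", b"5"),
    (b"order:alice:2", b"30"),
    (b"order:alice:1", b"12"),
    (b"user:alice", b"x"),
])
alice_total = total_for(store, b"order:alice:")
```

The query path is explicit: one seek, then a forward walk that stops at the first key outside the prefix. The developer, rather than a query planner, decides how keys are laid out so that this walk touches only the relevant entries.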
JAXenter: What’s next for PumpkinDB?
Yurii Rashkovskii: Oh, it’s an interesting one. PumpkinDB is a very young project. It began in February 2017, so it’s just a few months old. As a consequence of that, there’s a lot in the “next” queue. Just to mention a few things:
1) Overhaul the PumpkinScript interpreter. Right now it’s fairly simplistic. It’s a good proof-of-concept and works relatively well in I/O-bound scenarios (it’s a database, after all), but going forward, we want a much more efficient intermediate representation, better control over transactional data lifetime, better abstraction over the stack, temporary resource management, better memory management, etc.
2) Introduce a Typed PumpkinScript layer in the form of a type checker. While at the core, it’ll still remain a single-type language, type checking will allow for much better error checking and perhaps optimizations that aren’t possible right now.
3) Experiment with storage backends. Right now we use LMDB, but this can be somewhat limiting. It’s not perfect for all scenarios (it’s read-optimized, to begin with). There are already some ongoing efforts to abstract LMDB away in PumpkinDB’s engine so we can easily use other existing K/V backends. And, perhaps more interestingly, I want to experiment more with NVMe/SSD-optimized storage. There’s this interesting SPDK library that allows you to talk to an NVMe controller directly, bypassing the kernel. I wrote an initial Rust binding for it back in April, and I am hoping to return to this topic. The question is: if, instead of a file system, one is given an entire raw SSD (and there’s also Intel Optane now!), with all the factors that come with it, such as no file system, no implicit caching and write amplification, how would one approach designing a storage layer? Especially considering that the database is immutable, which is a nice constraint to have.
But the most important next step for us is to keep attracting users and contributors. We are trying to make the project as accessible and inclusive as possible. For example, we have a very optimistic PR merge model (we’ll merge as fast as is humanly possible!). We also need to spend more time improving our documentation to make it more accessible. A lack of variety in examples and of thorough documentation has been mentioned as a stumbling block before. We want to articulate use cases better. We want to test PumpkinDB a lot more. So many things to be done, and this means we have to keep working on presenting our work to the groups of people who might be most interested in it.
JAXenter: Finally, a question that has nothing to do with PumpkinDB. Which open source database(s) do you prefer?
Yurii Rashkovskii: Aside from my own work (haha!), I am a big fan of PostgreSQL — it’s versatile, innovative and has been around for quite some time, so you know it has been tested fairly thoroughly!
Thank you very much!