Netflix OSS: Change the game with Hollow
Netflix Hollow is a Java library and comprehensive toolset for harnessing small to moderately sized in-memory datasets that are disseminated from a single producer to many consumers for read-only access. It is built for servers busily serving requests at or near maximum capacity, and it aims to address the scaling challenges of in-memory datasets. Let's see the advantages that come from using Netflix Hollow.
Software engineers often encounter problems that necessitate the dissemination of a dataset which doesn't fit the "big data" label, Drew Koszewnik of Netflix wrote in a blog post announcing Hollow. The most common ways to solve these problems are to keep the data in a centralized location, or to serialize it and disseminate it to consumers, each of which keeps a local copy.
Each path has its own challenges, which is why engineers often choose a hybrid approach — "cache the frequently accessed data locally and go remote for the 'long-tail' data." This approach is not without challenges either: a significant portion of the cache's heap footprint is consumed by bookkeeping data structures, and objects are often kept around just long enough to be promoted, which negatively impacts GC behavior.
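The hybrid pattern described above can be sketched as a small bounded LRU cache that falls back to a remote fetch for long-tail keys. This is a hypothetical, stdlib-only illustration — the `HybridCache` class and its `fetchRemote` stand-in are inventions for this sketch, not part of any Netflix library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * A minimal sketch of the hybrid caching pattern: frequently accessed
 * entries live in a bounded local LRU cache, and misses fall back to a
 * (simulated) remote data store for the "long-tail" data.
 */
class HybridCache {
    private final Map<String, String> local;
    private int remoteFetches = 0;

    HybridCache(int maxLocalEntries) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.local = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxLocalEntries;
            }
        };
    }

    public String get(String key) {
        String value = local.get(key);
        if (value == null) {           // long-tail miss: go remote
            value = fetchRemote(key);
            local.put(key, value);     // promote into the local cache
        }
        return value;
    }

    // Stand-in for a network round-trip to a centralized data store.
    private String fetchRemote(String key) {
        remoteFetches++;
        return "value-for-" + key;
    }

    public int remoteFetches() {
        return remoteFetches;
    }
}
```

Every miss here pays a remote round-trip and churns the local heap with short-lived entries — exactly the bookkeeping and GC pressure the article describes. Hollow's premise is that, for small-to-moderate datasets, it can be cheaper to drop this machinery and hold the whole dataset in memory in a compact encoding.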
Netflix's engineers realized that a hybrid approach often represents a false savings and came to a conclusion that eventually set Hollow (which replaced Zeno) into motion.
If you can cache everything in a very efficient way, you can often change the game — and get your entire dataset in memory using less heap and CPU than you would otherwise require to keep just a fraction of it.
The need for Hollow
Hollow focuses on keeping an entire, read-only dataset in-memory on consumers; it circumvents the consequences of updating and evicting data from a partial cache. Plus, it shifts the scale in terms of appropriate dataset sizes for an in-memory solution.
Agility & stability
Hollow is meant to boost teams' agility when dealing with data-related tasks. Although one of its advantages is its ability to automatically generate a custom API based on a specific data model, which lets consumers interact intuitively with the data and benefit from IDE code completion, "the real advantages come from using Hollow on an ongoing basis." According to Koszewnik, "once your data is Hollow, it has more potential."
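As a sketch of what working with Hollow looks like, the snippet below wires a producer and a consumer through the filesystem-backed publisher and blob retriever that ship with the library, then reads a record back generically. The `Movie` type, the `HollowSketch` class, and the `/tmp/hollow-demo` path are invented for this example, and exact signatures (e.g. whether the filesystem classes take a `Path` or a `File`) vary across Hollow versions — treat this as an outline under those assumptions, not a drop-in implementation:

```java
import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemBlobRetriever;
import com.netflix.hollow.api.objects.generic.GenericHollowObject;
import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;
import com.netflix.hollow.core.write.objectmapper.HollowInline;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HollowSketch {

    // A hypothetical data model; Hollow derives the schema from the POJO.
    static class Movie {
        long id;
        @HollowInline String title;  // store the string inline, not as a reference

        Movie(long id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Publishes one snapshot, then loads it on a consumer and reads it back. */
    static String publishAndReadBack(Path publishDir) {
        // Producer: publish a snapshot blob and announce the new state version.
        HollowProducer producer = HollowProducer
                .withPublisher(new HollowFilesystemPublisher(publishDir))
                .withAnnouncer(new HollowFilesystemAnnouncer(publishDir))
                .build();
        long version = producer.runCycle(state -> {
            state.add(new Movie(1L, "The Matrix"));
        });

        // Consumer: pull the published state into memory for read-only access.
        HollowConsumer consumer = HollowConsumer
                .withBlobRetriever(new HollowFilesystemBlobRetriever(publishDir))
                .build();
        consumer.triggerRefreshTo(version);

        // Generic access; a generated custom API would instead expose typed
        // methods such as movie.getTitle(), with IDE completion.
        int ordinal = consumer.getStateEngine()
                .getTypeState("Movie")
                .getPopulatedOrdinals()
                .nextSetBit(0);
        GenericHollowObject movie =
                new GenericHollowObject(consumer.getStateEngine(), "Movie", ordinal);
        return movie.getString("title");
    }

    public static void main(String[] args) {
        System.out.println(publishAndReadBack(Paths.get("/tmp/hollow-demo")));
    }
}
```

In ongoing use you would typically run Hollow's `HollowAPIGenerator` once to emit the typed client API mentioned above, and keep the producer cycling on a schedule so consumers pick up delta updates automatically.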
As far as stability is concerned, Hollow is not susceptible to environmental issues such as network outages, disk failures or noisy neighbors in a centralized data store. Netflix engineers use it to represent crucial datasets, essential to the fulfillment of the Netflix experience, on servers busily serving live customer requests at or near maximum capacity.
Hollow datasets are self-contained, which means that no use-case-specific code needs to accompany a serialized blob in order for it to be usable by the framework. Plus, Hollow is designed with backwards compatibility in mind, so deployments can happen less frequently. It comes with a multitude of prefabricated tools, and creating your own tools from the basic building blocks the library provides is straightforward.
“If something looks wrong about a specific record, you can pinpoint exactly what changed and when it happened with a simple query into the history tool. If disaster strikes and you accidentally publish a bad dataset, you can roll back your dataset to just before the error occurred, stopping production issues in their tracks. Because transitioning between states is fast, this action can take effect across your entire fleet within seconds.”
Despite Hollow’s multiple benefits, one should keep in mind that it is not appropriate for datasets of all sizes. If the data is large enough, keeping the entire dataset in memory is not feasible.
The code is available on GitHub.