5 reasons you should stop hosting your ELK stack locally
Alex Zhitnitsky shares some of the caveats of deploying and managing the ELK stack on your own, and introduces us to the world of hosted ElasticSearch to help solve problems of scalability, stability and maintenance.
This post was originally published on the Takipi blog – Java and Scala exception analysis and performance monitoring.
You don’t look a gift-horse in the mouth. Especially if it’s not a horse and actually an elk. It will poke you with its antlers. Or get drunk on fermented apples and trash your backyard. That’s an occupational hazard we’re willing to accept though so let’s take a look anyhow.
In this post we’d like to share some of our experience with the caveats of deploying and managing the ELK stack on your own and introduce you to the world of hosted ElasticSearch. Or in other words, when does it make sense to move on from managing your own ElasticSearch deployment, and what are the options you have when you decide to flip the switch. Having been on both sides, and understanding each team and system will have their own unique requirements, we wanted to share some insights to help you reach the right decision for your environment.
What’s wrong with hosting your ELK stack on your own?
When thinking about the open source software landscape, you can roughly place any project on a spectrum of how easy or hard it actually is to use: how complex and time consuming it is to tend to its quirks and make it do what you want. Ease of use can then be broken down into parameters like deployment and setup, getting started with the tool, integrations, UI, scalability, stability and… maintenance. Spoiler alert: these last 3 areas are where we got hit. So if you can place projects like, say, Logback, far on the easier side of the scale, then ELK goes deep into the hard side.
1. The pains of scaling up
We had been pretty happy with the in-house ELK deployment, but as more and more new users joined Takipi, all kinds of issues started popping up. In Takipi’s case, users attach a Java agent to their JVMs and are then able to monitor their production environment for exceptions and log errors, down to the variable state that caused them. Think about the scale of this data in your own application, billions of events flying by each day, and now multiply it by the thousands. The Kibana dashboards we built were able to give an overview of the system as a whole and also drill down to a view of a single user, letting us support users during their installation process if needed.
The tipping point occurred when the product reached a phase where companies with tens or hundreds of servers were able to get full teams on board and expand the visibility into their systems. The stats flew off the charts. And so did our poor Kibana dashboard, which started to hang far too often. At this stage we were experimenting with using the ELK stack mainly for BI purposes and occasional high touch support, so it was not defined as mission critical.
Lesson #1: Don’t leave your ELK stack behind when gearing up towards high scale.
2. Stability takes a tough hit
What started as an occasionally slow and hanging Kibana dashboard quickly turned into ElasticSearch itself crashing and tumbling down. Tough queries where the index size was bigger than the allocated RAM caused lots of OutOfMemory exceptions, finally resulting in a non-responsive database. The quick solution was just to restart it and take it easy on the queries.
Lesson #2: Watch out for those big queries when you work through your Kibana dashboard.
The real solution, though, would require tuning $ES_HEAP_SIZE and moving to stronger AWS instances with more RAM. Disk space was running out too, so we needed to compensate for that as well, either through cyclic logs, which would result in shorter historic records, or just more disk space if you don’t want to flush your DB once in a while.
Lesson #3: What started with an easy deployment and integration process quickly turned into an issue that requires constant monitoring and domain expertise.
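To make those two knobs a bit more concrete, here is a minimal Python sketch. It is not what we ran in production, and it rests on a few assumptions: the commonly cited rule of thumb of giving the heap about half of the machine’s RAM while staying below roughly 32GB, and daily indices named in the Logstash default style (logstash-YYYY.MM.DD). The function names, the 31GB cap and the retention window are made up for illustration.

```python
from datetime import date, datetime, timedelta

# Assumption: rule of thumb of ~50% of RAM for the heap, kept below ~32 GB.
MAX_HEAP_GB = 31


def suggested_heap_gb(total_ram_gb: int) -> int:
    """Return a heap size in GB, e.g. for a setting like ES_HEAP_SIZE=4g."""
    return max(1, min(total_ram_gb // 2, MAX_HEAP_GB))


def expired_indices(index_names, today: date, keep_days: int,
                    prefix: str = "logstash-"):
    """Pick daily indices (logstash-YYYY.MM.DD) older than the retention window.

    This only selects names; actually deleting the indices is left to
    whatever tooling you use.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a daily log index, leave it alone
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # unexpected suffix, skip rather than guess
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

With a 14 day window, anything dated before the cutoff is a candidate for deletion, which is exactly the trade-off mentioned above: less disk used, shorter history.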
3. Surprise! You’re now an ElasticSearch DBA
At this stage, especially if you’re not only using ELK for BI stats but also piping all your logging data through it, it might make sense to get on the paid subscription service to get dedicated support and monitoring / security / permission management capabilities. The whole shebang.
Lesson #4: When your Kibana dashboard needs to be accessed by members of different teams you’ll probably need to set up access and user control. Patching up a solution of your own might not be the smartest way to go here.
More things to keep tabs on include upgrades, backups, and managing sharding between nodes as your ElasticSearch cluster grows and gets… well… elastic. Before you know it, you’re moonlighting as an ElasticSearch DBA and it consumes more and more of your time. This of course depends on how big your dev team is and whether it makes sense to put more time into it in-house.
Lesson #5: Scaling up with more nodes for your ElasticSearch cluster is relatively easy and only requires a few settings, but don’t let it get out of hand. Another node is not necessarily the right solution.
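One way to see why another node is not always the answer is some back-of-the-envelope shard arithmetic. The Python sketch below is a simplified model, not how ElasticSearch actually allocates shards: it assumes a single index and a perfectly even spread of shard copies across nodes, and the function name is hypothetical.

```python
def shard_layout(primary_shards: int, replicas: int, nodes: int):
    """Return (total shard copies, copies on the busiest node, idle nodes).

    Simplified model: one index, shard copies spread evenly across nodes.
    """
    total = primary_shards * (1 + replicas)
    busiest = -(-total // nodes)  # ceiling division
    idle = max(0, nodes - total)  # nodes left with nothing from this index
    return total, busiest, idle


# An index with 5 primary shards and 1 replica each has 10 shard copies.
print(shard_layout(5, 1, 3))   # 3 nodes
print(shard_layout(5, 1, 10))  # 10 nodes
print(shard_layout(5, 1, 12))  # 12 nodes
```

Under this model, once every shard copy sits on its own node, extra nodes have nothing from that index to hold, so at that point the fix lies elsewhere: more shards, bigger machines or lighter queries.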
Enter hosted ElasticSearch services
In a nutshell, in the in-house setup we were already piping the logs through Logstash to an ElasticSearch cluster that was set up on a few AWS nodes with a Kibana dashboard on top, so we decided to move to a hosted ElasticSearch solution. In hosted mode, ElasticSearch cluster management is taken off your shoulders and you’re free to focus on other things. The two main questions here are: will it scale, and how much will it cost?
The basic requirement would be to use a service that hosts its servers on the same cloud hosting provider that you’re using in your day to day. So if you’re on AWS, you want a hosted ElasticSearch service that uses AWS, which saves costs and ensures better network performance.
Pricing: Using a hosted service would cost more than the infrastructure needed to run it on your own. The upside you’re getting here is freeing your time from managing your ElasticSearch deployment, with support from experts and DBAs.
Found and Qbox are probably the two most popular solutions currently available. Found was recently acquired by Elastic, and it will be interesting to see how this will affect their offering in the long run. As far as pricing goes, both services bill hourly with varying steps according to parameters like disk size (and type), memory and data retention. Pricing is also affected by the region you choose to host your ElasticSearch server in, and if you need to have longer data retention it wouldn’t be too farfetched to assume your bill will go well over $1,000 per month.
With so many moving parts, it would be best to experiment with a few of the solutions and get a customized quote that best reflects your needs. Installation and setup promise to be super quick, so pricing turns out to be the major factor in the decision making process here. As far as experimenting goes, Found has a 14 day trial (with 1GB memory, 8GB storage, on 2 AWS zones), and QBox delivers a $60 credit to new accounts. This is enough to get a feel for the product but probably won’t be sufficient for a full test run, which might require some negotiation or a paid POC. The cost of switching services is pretty low, just a matter of a few configuration changes, so you have a chance to experiment here with the only downside being the loss of some history.
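Since billing is hourly, it helps to translate between an hourly rate and a monthly bill before comparing quotes. The Python sketch below is plain arithmetic and assumes an average month of 730 hours (8,760 hours a year divided by 12). The rates in it are hypothetical and not taken from any provider’s price list.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def monthly_cost(hourly_rate: float, nodes: int = 1) -> float:
    """Estimate a monthly bill from a per-node hourly rate."""
    return round(hourly_rate * HOURS_PER_MONTH * nodes, 2)


def hourly_rate_for_budget(monthly_budget: float, nodes: int = 1) -> float:
    """Work backwards: the per-node hourly rate that fits a monthly budget."""
    return monthly_budget / (HOURS_PER_MONTH * nodes)


# Hypothetical numbers, for illustration only.
print(monthly_cost(0.50, nodes=3))
print(hourly_rate_for_budget(1000))
```

Seen this way, a $1,000 monthly bill corresponds to roughly $1.37 per hour in total, which is easy to reach once memory, disk and retention steps add up.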
- Logz.io (AWS)
- Bonsai (AWS)
- Compose (AWS, DigitalOcean, SoftLayer)
- FacetFlow (Azure)
- Sematext Logsene
Currently at Takipi we’re testing the waters with Logz.io, who also support shipping logs to ElasticSearch without necessarily using Logstash. Apart from the 2 bigger players in this space we see more services like Bonsai.io, Compose.io, FacetFlow and others, each providing their own management dashboards that extend Kibana as the visualization engine for ElasticSearch.
The ELK stack is an awesome open source platform that provides a complete solution for log management. It’s easy to get started with and the eye candy is super sweet, but when it comes to managing it in the long run, things get a bit awkward. While ElasticSearch is built well to scale, the effort you need to put into it may often outweigh the benefits of using a free open solution. That’s where you need to look into getting professional services involved, swallow the pill and let a hosted ElasticSearch-as-a-Service solution ease your pain, so you can keep enjoying the benefits of centralized log management and visualization.