Prometheus 1.0: “Prometheus and Kubernetes share spiritual ancestry”
Prometheus 1.0 was launched last week, delivering a stable API and user interface. In short, “Prometheus 1.0 means upgrades won’t break programs built atop the Prometheus API, and updates won’t require storage re-initialization or deployment changes.”
In this interview, Björn Rabenstein, engineer at SoundCloud and Prometheus core developer, talks about the features and benefits of Prometheus 1.0 and reveals what’s next for this open-source systems monitoring and alerting toolkit originally built at SoundCloud.
JAXenter: What problems does Prometheus 1.0 solve?
Björn Rabenstein: Depending on your use case, it could solve a lot of different problems. The one common problem that is usually part of the mix has been nailed by Jamie Wilkinson in Google’s Site Reliability Engineering book (O’Reilly 2016): “We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed.”
JAXenter: What are its main benefits?
Björn Rabenstein: As mentioned in the Cloud Native Computing Foundation post, three groups benefit in different ways from using Prometheus to monitor and analyze cloud native architectures and time series data:
- Cloud developers: Easy instrumentation of your code, which will not only help in production, but also during development to spot performance issues and other irregularities. Debugging with Prometheus helped increase Kubernetes scheduler performance tenfold. Additionally, Prometheus does not lock you in, as there are many integration points with other systems.
- DevOps: Finally, you can implement the alerting you want and ask the questions you always needed answers to.
- End users: The increased reliability and better performance of Prometheus is a major benefit to end users. As the first company using Prometheus, SoundCloud was able to detect and handle outages much better than before and increase the site’s availability significantly.
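To make the "easy instrumentation" point concrete, here is a minimal sketch of what instrumenting code with a labeled counter can look like. It deliberately does not use a real Prometheus client library: the `Counter` class, its `inc` and `render` methods, and the metric name `http_requests_total` are hypothetical stand-ins chosen for illustration. The output only imitates the style of the Prometheus text exposition format.

```python
class Counter:
    """Hypothetical stand-in for a client-library counter (not a real API)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running count.
        self._values = {}

    def inc(self, amount=1, **labels):
        """Increase the count for the time series identified by `labels`."""
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Render all series in the style of the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


# Assumed example metric; the name and labels are illustrative only.
REQUESTS = Counter("http_requests_total", "Total HTTP requests handled.")


def handle_request(method, status):
    # Application code increments the counter wherever the event happens.
    REQUESTS.inc(method=method, status=status)


handle_request("GET", "200")
handle_request("GET", "200")
handle_request("POST", "500")
print(REQUESTS.render())
```

In a real deployment, a client library would hold the counters and serve the rendered text over HTTP for the Prometheus server to scrape. The sketch only shows how little the application code itself has to do.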
JAXenter: What does Prometheus 1.0 mean for you personally?
Björn Rabenstein: The incremental changes between our previous release 0.20 and 1.0 are relatively small. The main change for us as developers is the stability guarantees we are now providing, as outlined in our blog post. Obviously, we didn’t just pick an arbitrary point in time for that. Instead, we waited until we had gathered the confidence that the crucial parts, such as the query language PromQL and the mainline APIs from exposition all the way up the stack to alerting, are mature enough to keep them stable. Having reached that state makes me really happy.
JAXenter: What does Prometheus mean for SoundCloud?
Björn Rabenstein: Prometheus has really paid off for SoundCloud, both in terms of what Prometheus has enabled (running a very complex site reliably) and what Prometheus has saved us (less operational effort to set up and run monitoring and to detect and investigate outages, less money paid to external monitoring providers), not to mention the more vague gains like tech credibility. But this success was in no way obvious at the time the project began. And the investment was huge compared to the size and available resources of the company. I only joined a year after the initial decision to invest in Prometheus, and I like to joke that I would have rejected the project if I had been in charge back then. Sometimes you just have to be bold, which is obviously easy to say in hindsight, when you already know you did the right thing.
JAXenter: What do Prometheus and Kubernetes have in common?
Björn Rabenstein: Prometheus and Kubernetes share spiritual ancestry: Kubernetes is inspired by Borg (Google’s internal cluster management solution), Prometheus is inspired by Borgmon (Google’s internal monitoring system). Despite being developed independently and under quite different circumstances, the shared spirit of how to run large-scale production systems shines through (a spirit that has now been codified in the excellent book “Site Reliability Engineering – How Google Runs Production Systems” quoted above). Both are “second systems” that try to learn from the lessons of their ancestors.
The most striking thing they have in common is the idea of labels. Everything is labeled in Kubernetes, and selections can happen along arbitrary label dimensions. The same is true for Prometheus, where time series are labeled (and everything in Prometheus is a time series or acts on time series, including alerts). The labels from Kubernetes easily translate into Prometheus labels. If needed, they propagate through the full stack. The page you receive on your mobile phone may very well feature a label you have assigned to a container at start-up time.
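The idea of selecting along arbitrary label dimensions can be sketched in a few lines. The example below is an illustration only: the `matches` and `select` helpers and the sample label sets are hypothetical, and only exact equality matching is modeled, which is a simplification of what real selectors offer.

```python
def matches(labels, selector):
    """Return True if every (name, value) pair in the selector is in the label set."""
    return all(labels.get(name) == value for name, value in selector.items())


def select(series, selector):
    """Return all series whose labels satisfy the selector."""
    return [s for s in series if matches(s["labels"], selector)]


# Hypothetical label sets, e.g. labels assigned to containers at start-up.
series = [
    {"labels": {"app": "api", "env": "prod", "zone": "eu"}, "value": 0.42},
    {"labels": {"app": "api", "env": "staging", "zone": "eu"}, "value": 0.10},
    {"labels": {"app": "web", "env": "prod", "zone": "us"}, "value": 0.73},
]

# Select along one dimension, then narrow by adding a second one.
prod = select(series, {"env": "prod"})
prod_api = select(series, {"env": "prod", "app": "api"})
print(len(prod), len(prod_api))
```

Because no dimension is privileged, the same data can be sliced by application, environment, zone, or any combination, without deciding on a hierarchy up front.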
The most important integration points are Prometheus’s support of the Kubernetes service discovery mechanism and the exposition of Prometheus metrics by Kubernetes components (like the API server or the Kubelet). The result is that you can use Prometheus to monitor services running on Kubernetes, you can use Prometheus to monitor Kubernetes itself, and finally you can run Prometheus on Kubernetes.
JAXenter: What’s next for Prometheus?
Björn Rabenstein: Personally, I’m most looking forward to the upcoming clustered, highly available Alertmanager because that completes Prometheus’s concept of extremely robust HA in monitoring and alerting.
The community is most keen on distributed long-term storage for time series data. Multiple efforts in this area are underway, so stay tuned for results in the not-too-distant future.
Thank you very much!