Grafana 7.0: “We’ve built one of the best visualisation tools and it’s not tied to any one database”
The open source platform Grafana is among the world’s most popular dashboarding tools – it currently has more than 550,000 active installations and millions of users across the globe. We spoke to Tom Wilkie, VP Product at Grafana Labs, as Grafana announces the general availability of version 7.0 of its observability platform. Grafana 7.0 is set to simplify the development of custom plugins and make it easier for you to visualise your data.
JAXenter: Grafana is an observability platform, could you give our readers a brief overview of what that means and how it works?
With Grafana, we’re focused on the visualisation component.
Tom Wilkie: A lot of people will think of an observability platform as an end-to-end solution, which includes the collection, storage and visualisation all in one place. With Grafana, we’re focused on the visualisation component. I think the thing that makes us different is that focus. Most other vendors will tie their visualisation tools to their database. Our approach is different in that we’ve built, I think, one of the best visualisation tools out there and it’s not tied to any one database. We’ve got over 60 different data sources that can talk to Grafana, and they are all treated as equal citizens.
But it’s important to understand why we’ve done that. It used to be that IT ops and so on were dominated by the log aggregation giants, right? What we would now consider observability problems were solved by emitting some logs, storing them in a central place and querying them. People started to look for more opportunities to improve their developers’ productivity and experience and generally get more done with fewer people. We saw this with the rise of cloud-native architectures and microservices, and with these trends we saw an explosion in data, in observability data, so new techniques were needed.
Metrics and monitoring have always been popular but mostly from the resource side of things. You mostly use metrics to look at CPU and memory consumption and so on. Now, I think what we’re seeing, over the past few years, is more usage of application or whitebox metrics, where we can peek inside the applications to look at the request rates and latencies and queue lengths – all kinds of information that previously would have been hidden if we treated an application as a black box. So now we’ve got two things: logs and metrics. They’re both useful for different things, and there is a whole group of vendors now that will either do both, or they’ll specialise in one and not the other. I think what you tend to find is there’s no one vendor out there, no one project or system, that’s best in breed for both.
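The application metrics described here are typically derived from monotonically increasing counters. As a rough illustration (not Grafana or Prometheus code, and with invented sample data), a per-second request rate can be computed from counter readings like this:

```python
# Hypothetical sketch: deriving a request-rate metric from a
# monotonically increasing counter, Prometheus-style.
# The sample readings below are invented for illustration.

def request_rate(samples):
    """Per-second rate over a window of (timestamp, counter) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# Counter readings taken over 60 seconds: 300 requests served in between.
window = [(1000.0, 12500), (1030.0, 12650), (1060.0, 12800)]
print(request_rate(window))  # 300 requests / 60 s = 5.0 req/s
```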
Microservices have become even more popular with distributed architectures, dynamic scheduling with Kubernetes and so on, and suddenly you need tracing as well.
Microservices have become even more popular with distributed architectures, dynamic scheduling with Kubernetes and so on, and suddenly you need tracing as well. If you’re going to do any kind of performance work in a microservices architecture, distributed tracing is an essential technique. Now, we have three things, and still, no one vendor is good at all of them. I think this is why Grafana and our take on building an observability platform has become so popular: it’s because we’re the only people out there to say, “Leave your data where it is.” You, the user, can pick your combination of tools that suit you best and you can express your opinions by choosing Graphite and Splunk or by choosing ElasticSearch, Prometheus and Zipkin. You can build that combination, bring it all together in Grafana, and get what you need.
You can have that single pane of glass that can talk to all of it, and where you can build incident response and debugging flows within Grafana that help you see what’s needed to reduce your mean-time to recovery and generally help your developers with their day-to-day experience. I think that’s why maybe the phrase ‘Grafana is an observability platform’ might be unexpected, but Grafana is the ‘glue’ that holds together your observability approach, and it enables users to own their own observability strategy and make their own choices about what suits them.
JAXenter: Moving on from that, what features will developers be excited to try out in Grafana 7.0?
Tom Wilkie: We do a yearly release cycle with Grafana. We’re now announcing 7.0 and there’s a year’s worth of work in there, so it’s hard to talk about quickly, but we generally tend to pick three top things that users might find interesting.
For a start, the scale that Grafana has attained as a project is somewhat astounding to us. We track well over 500,000 active installs. We’ve had 360 contributors to Grafana since 6.0. It’s an open source project. We’ve had more pull requests and issues fixed between 6.0 and 7.0 than in any other release. I think something like 18,000 commits; it’s the biggest release yet. But we’re really picking out three features. The first is about completing that observability vision we just talked about, and starting to bring tracing into Grafana, and building some of the early workflows that allow you to go from metrics to logs to traces – all within one user experience.
You, the user, can pick your combination of tools that suit you best and you can express your opinions by choosing Graphite and Splunk or by choosing ElasticSearch, Prometheus and Zipkin.
The second thing we like to draw people’s attention to is the CloudWatch logs data source that we’ve built, so you can bring more data sources into Grafana. This has been built in conjunction with Amazon and is going to be very popular. A lot of users have requested this. We’ve also enhanced the experience you can get from other data sources. A lot of people have tried to unify many different and disparate data sources in a single UI by going for a lowest common denominator and only exposing features that are common across all of them. In Grafana, we’ve done almost the exact opposite: the data sources get to own a part of the Grafana UI and get to really express what makes them unique and different, and expose all of their different features and full functionality.
Really, the only unification we’ve done in Grafana is on the data side. When you run that query, when you build that query in a custom UI for that particular data source, the results of that query are normalised into the Grafana format which can then be visualised. We are very proud of how data sources can own some of the UI and show what’s special about them. For instance, the Prometheus data source has a Prometheus query editor that has rich syntax highlighting and context-sensitive tab completion and all of these kinds of features. Whereas if you go to the Graphite UI, the Query Builder looks very different and is a series of dropdowns that enable you to compose together a query.
We’ve extended this to other areas of Grafana in 7.0. We’ve introduced an Inspect drawer that pops out from the side, where you can see what’s going on behind the scenes. You can see what query was sent to the data source, you can see how long it took, what the raw data was that came back – all sorts of metrics and metadata about the query. But the data source can also own part of that Inspect drawer. The data source can inject into its own tab and can, for example, say, ‘I’m a Metrictank data source and you hit the following roll-ups and pre-aggregations to execute part of this query,’ or ‘I’m a Cortex data source or a Loki data source and your query hit these cached records and was parallelised and sharded in this particular way.’ Maybe even in the future – which is not something we can do right now – the data sources could be extended to show the SQL query plan, and these kinds of things. So it’s powerful.
Users only really have to learn once how to handle data coming from different data sources. It really levels the playing field between different data sources.
The third thing we’d like to highlight is the unified data pipeline. Once you’ve built a query and executed it, that data source now outputs a new unified data frame format. This is really important because this new format is not only very performant – we use Apache Arrow – but it enables many things: It enables us to unify a lot of the data processing and transformation, whereas previously this used to live in individual data sources. Again, for instance, the Prometheus data source used to have a transpose operation to present Prometheus data in a tabular format – other data sources didn’t have that. Now, that’s part of the transformation pipeline, which means you can apply that to any data source.
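The idea of a unified frame can be sketched in plain Python. This is only an analogy – Grafana’s real data frames are Arrow-backed and far richer – and the field names and shapes below are invented:

```python
# Illustrative sketch only: a plain-Python analogy for normalising
# query results from different sources into one column-oriented frame,
# in the spirit of Grafana's unified data frame (which actually uses
# Apache Arrow). Field names and shapes here are invented.

def to_frame(name, rows):
    """Turn row-oriented (timestamp, value) results into columns."""
    times, values = zip(*rows) if rows else ((), ())
    return {"name": name, "fields": {"time": list(times), "value": list(values)}}

# Two sources, each queried through its own source-specific UI, end up
# in the same shape once normalised:
prom = to_frame("prometheus", [(1, 0.5), (2, 0.7)])
graphite = to_frame("graphite", [(1, 120), (2, 130)])

# Downstream transformations (transpose, joins, CSV export, ...) only
# ever see this one shape, regardless of the source.
assert prom["fields"].keys() == graphite["fields"].keys()
```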
Users only really have to learn once how to handle data coming from different data sources. It really levels the playing field between different data sources. Another example, a feature users have been requesting for many years now, is the ability to just do maths between data returned from different data sources. You might have your resource usage monitored in Prometheus, but your application-level metrics might come from appD. Now, in Grafana, you can take the two and get capacity planning metrics like CPU cycles per request, which help you predict, for example, how your CPU consumption is going to grow over time.
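The cross-source maths described here boils down to joining two series on timestamp and applying an operation. A minimal sketch, with invented numbers and nothing resembling Grafana’s actual transformation code:

```python
# Hypothetical sketch of "maths across data sources": join two series
# on timestamp and divide, e.g. CPU usage (from one source) over
# request rate (from another) to get a CPU-cost-per-request series.
# All numbers are invented for illustration.

def per_request(cpu_series, req_series):
    """Pointwise division of CPU usage by request rate, joined on time."""
    reqs = dict(req_series)
    return [(t, cpu / reqs[t]) for t, cpu in cpu_series if reqs.get(t)]

cpu = [(1, 400.0), (2, 500.0)]       # e.g. millicores of CPU
requests = [(1, 100.0), (2, 125.0)]  # e.g. requests per second

print(per_request(cpu, requests))  # [(1, 4.0), (2, 4.0)]
```

Plotting such a derived series over time is what makes it usable for capacity planning: if CPU per request is stable, CPU needs grow linearly with traffic.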
Right now, as I mentioned, we have over 60 first-class plugins which are either built into the core or part of Grafana Enterprise or built by us. But there are maybe 100 or so other data sources coming from the community. I fully expect with 7.0 to see an explosion in data sources and integrations, because it’s got easier, more reliable and more performant.
JAXenter: What do you think is the standout feature in 7.0? What do you think it’s going to be remembered for?
Tom Wilkie: I mean, I’m super biased, because I helped drive the development of the tracing features. So for me, it’s that. I think if you were to ask Torkel Ödegaard (creator of Grafana), it would be this data pipeline and that’s going to be really transformational going forward, because suddenly not only have we made it easier, but we’ve also built a really solid framework for doing some really exciting things in the future. For instance, previously – and some of these things might sound trivial – pretty much every panel in Grafana handled data differently.
Some panels, for instance, supported export to CSV and some didn’t, but with this unified data pipeline, not only can we do transposition, joining, and maths across series for any panel and for any data source, but we can now export CSV from any panel. That’s going to be key to the growth of Grafana and to the flexibility of projects. I have no idea where it’s going to go, but it’s exciting because developers are going to be able to do all sorts of crazy things with this that previously were quite hard!
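The point about CSV export is that once every panel consumes one frame shape, export becomes a single generic function rather than per-panel code. A sketch, using an invented stand-in for the frame layout:

```python
# Sketch: with one shared frame shape, CSV export is written once and
# works for any panel and any data source. The dict-of-columns layout
# here is an invented stand-in, not Grafana's real frame format.
import csv
import io

def frame_to_csv(fields):
    """fields: dict of column name -> list of values, equal lengths."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields.keys())
    writer.writerows(zip(*fields.values()))
    return buf.getvalue()

out = frame_to_csv({"time": [1, 2], "value": [0.5, 0.7]})
print(out)
```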
JAXenter: Presumably, that will make it accessible to the rest of the business? Is that the idea with the ease of CSV export?
We think Grafana should be able to easily connect to anything.
Tom Wilkie: One of the mantras of Grafana Labs is: ‘Don’t get in the way of the data’. If you want to export the data into a spreadsheet and work on it there and do what you like with it, then that is definitely something you should be able to do with Grafana and can do now.
But the flipside is also true. We think Grafana should be able to easily connect to anything. So we have a Google Sheets data source built on this new plugins platform and new unified data pipeline. Not only can we export a CSV and fiddle with it in Google Sheets, we can also bring that data back into Grafana, which is cool. You can combine that with live data coming from your monitoring system and so on. I think, if we look back at 6.0, I’d say the Explore feature was the big one, but I think when we look back at 7.0, it’s definitely going to be this data pipeline as it’s foundational.
JAXenter: You’ve already touched on it a little, but what does the tracing feature bring to the table? What are the benefits?
Tom Wilkie: Tracing is traditionally the third pillar of observability and Grafana 7.0 is our first release with support for tracing data sources. We’ve built support for Zipkin and Jaeger into Grafana. It’s a plugin interface, so we really hope to be able to add support for other tracing vendors and projects in the future.
In 6.0, you can automatically switch to logs and find the logs behind those metrics and really do that kind of correlation. In 7.0, you can now do an extra step to traces as well.
We’ve really been focused on a very simple use case for 7.0, and that’s one of incident response. Up until very recently, I was still getting paged in the middle of the night. That’s only occasionally. Our software is very reliable. But occasionally I get paged and have to wake up and go through this incident response workflow. Maybe the alerts push me into a dashboard where alerts are annotated. Usually, you have to come in and fiddle with that dashboard. That was something we enabled in 6.0 with the Explore view, and we built this great architecture for diving into a panel and exploring the data. With Loki we built this ability to switch between the metrics and logs experience, so now suddenly you go into a panel, you start fiddling with the data and focusing on, maybe, just the errors in this case and the time range you’re interested in. In 6.0, you can automatically switch to logs and find the logs behind those metrics and really do that kind of correlation. In 7.0, you can now do an extra step to traces as well.
This focus on one use case and experience adds a lot of value, even though it’s still pretty early days for tracing and Grafana. This feature got merged a month or so ago, and we’ve been using it a lot internally and it’s incredibly useful, especially when you get paged for a thing being slow. This helps you nail down why it’s slow, which bits are slow. It helps you get there so much faster than you used to be able to, if every single one of these transitions was copying and pasting queries and translating them in your head and copying identifiers between different user interfaces. Personally, I’m excited about this one. For me, this is the biggest deal and this is just the tip of the iceberg. This is one workflow to get between metrics, logs and traces. We’re already working on two or three more that, I think, are really going to help flesh out and complete this picture.
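The logs-to-traces hand-off he describes rests on a simple mechanic: a trace identifier embedded in a log line becomes a link into the tracing UI. A toy sketch of that idea, where the log format and URL are both assumptions rather than anything Grafana or Loki actually mandates:

```python
# Invented sketch of the logs -> traces hand-off: pull a trace ID out
# of a structured log line and turn it into a link to a tracing UI.
# The "traceID=" log format and the example URL are assumptions, not
# Grafana's or Loki's actual conventions.
import re

def trace_link(log_line, base="https://jaeger.example.com/trace/"):
    """Return a tracing-UI link for the trace ID in a log line, if any."""
    m = re.search(r"traceID=(\w+)", log_line)
    return base + m.group(1) if m else None

line = 'level=error msg="request failed" traceID=abc123def'
print(trace_link(line))  # https://jaeger.example.com/trace/abc123def
```

Automating exactly this kind of identifier-copying between tools is what removes the "translating them in your head" step during an incident.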
JAXenter: In your opinion, what sets Grafana apart from its competitors?
Tom Wilkie: Most of our competitors make their money from selling you a database, and I think what makes us different is we don’t. We live it internally with this ‘big tent culture’ where all of our data sources are treated equally. As you can see in Grafana 7 with the tracing, we’re not going to launch with just one data source; we’re not going to just launch with Jaeger. We’ve launched with Jaeger and Zipkin. When we launched Loki, we treated Elastic as a first-class integration in Explore, and living that is key to the Grafana experience.
I think if we started picking favourites and treating some data sources as better than others, the whole message of this composable observability platform where our users own their observability strategy would start to break down. It’s quite amazing to see it lived in the company. This is not something we just talk about with the press or only say externally; we argue about this internally all the time. We would love to live the simpler life where we only got to focus on a handful of data sources, but it’s not what our users want and it’s not what makes Grafana special. We are always checking ourselves and making sure we’re living up to this ‘big tent’ promise that we’ve made.
JAXenter: Observability seems to have become a hot topic of conversation in the last couple of years and everybody seems to be offering observability. Where do you think it comes from? Is it from the growth in cloud native technology?
Tom Wilkie: Yes, I think that’s the big driver. We want to figure out how to make our developers more productive. As a business, right? How can we do more with fewer people? Previously, you used to have to build your own operating systems and then more and more abstractions were built. Now software engineering is more like composing together pre-built packages. That drive for productivity amongst your engineers has led to architectural choices, I think, that have led to this explosion of complexity.
I’m happy about the fact that the emphasis has moved away from being right the first time to just being ‘Well, if it does break, let’s make sure we can fix it quickly’.
Suddenly, understanding the behaviour of a system is non-trivial. OK, observability is a bit of a bandwagon. But I think for me, that’s what it means: observability is the ability to understand the behaviour of a complicated system. Therefore, observability tools should always be about helping the developer understand what’s going on quickly and easily. And hopefully enjoyably as well. We put a lot of emphasis at Grafana Labs on this. We use these tools and we want to use tools that are easy to use, but also give you that ‘wow moment’. That’s why the Grafana team put so much effort into making Grafana beautiful. We want it to be this thing that’s a pleasure to use.
Yes, so I think observability is mainly driven by a developer productivity argument. Cloud native is also driven by the same desire. We want application architectures that allow developers to iterate quicker, that allow us to get that feedback cycle from user feedback to changes in the product down from years to months to days to hours through techniques like continuous deployment and so on.
With all those fast-moving parts and all that constant iteration, you need good tools to understand what broke and when it broke. I’m happy about the fact that the emphasis has moved away from being right the first time to just being ‘Well, if it does break, let’s make sure we can fix it quickly’. I love that because it enables me to carry out more experimentation; it enables me to try new things. You’ve probably seen the studies out of Google about high-performing teams being the ones where it’s acceptable to take risks and it’s acceptable to be wrong? I think observability solutions are about the same thing: it’s acceptable to take risks because you know you’ve got that safety net of tooling that enables you to figure out what went wrong and fix it quickly.