The JVM Immune System: Debugging distributed servers at scale
Assembling a time-critical response to errors is important in the context of keeping your application in check. Distributed logging and testing in production all form part of the solution for Alex Zhitnitsky. We must address vulnerabilities quicker.
This post was originally published on the Takipi blog – Java and Scala exception analysis and performance monitoring.
Today more than ever, speed plays a larger role in the software development lifecycle. We see R&D teams who want to push code faster to production environments with rising complexity, and this amplifies a vulnerability that must be addressed.
Those few hours after a new deployment set the tone for its success. Every once in a while, things go wrong, no matter how strict your tests are. When your code is out in production and it meets the real-world architecture and scale of your system, with real data flowing through the application, things can go south pretty quickly. In order to be resilient and stay on top of things, a strategy needs to be implemented that allows you to:
- Identify when there’s an error happening
- Assess the error’s severity to prioritise it
- Draw out the state that caused the error
- Trace back and solve the root cause
- Deploy a hotfix
In this post we’ll cover some of the most useful practices to allow you to assemble a time-critical response and “weaponise” your application.
The case for distributed logging
With production environments spread across multiple nodes and clusters, it’s easy for a transaction that starts on one machine or service to cause an error someplace else. When an exception happens, there’s a need to be able to trace back this type of distributed transaction, and the logs are often the first place to look for clues.
This is why, for every log line printed out, we need to be able to extract the full context to understand exactly what happened there. Some data might come from the logger itself and the location the log is created in; other data needs to be extracted at the moment of the event. A good way to trace such errors to their origin would be generating UUIDs at every thread’s application entry point.
A useful yet underutilised feature here is using thread names to provide a window for this precious context, right before the stack collapses and the data is lost. You can format your thread name to something like:
Thread.currentThread().setName(prettyFormat(threadName, getUUID(), message.getMsgType(), message.getMsgID(), getCurrentTime()));
So instead of an anonymous name like ”pool-1-thread-17” your application now produces smart stack traces that start this way:
“threadName: pool-1-thread-17, UUID: AB5CAD, MsgType: AnalyzeGraph, MsgID: 5678956, 30/08/2015 17:37”
This works well when handling caught exceptions, but what if there’s an uncaught exception involved? A good practice is to set a default uncaught exception handler, both to cover for that and to help extract any useful data you need. Other than thread names, additional places we can use to store hints about what happened are the TLS (Thread Local Storage) and the MDC (Mapped Diagnostic Context, which is provided by your logging framework). All other data gets lost as the stack frames collapse.
Leaning on the JVM Tool Belt
Some more complex bugs like deadlocks or heavy performance bottlenecks require a different approach. Take jstack, for instance: a powerful tool that ships together with the JDK. Most of you are probably already familiar with it in some way. Basically, jstack allows you to hook into a running process and output all the threads that are currently running in it. It will print each thread’s stack trace; frames—either Java or native; locks they’re holding; and all sorts of other metadata. It can also analyse heap dumps or core dumps of processes that have already ended. It’s a longstanding and super useful toolkit.
The problem here is that jstack is mostly used in retrospect. The condition you’re looking to debug has already happened, and now you’re left searching through the debris. The server isn’t responding, the throughput is dropping, database queries are taking forever: a typical output would be a few threads stuck on some nasty database query, with no clue of how we got there. A nice hack that would allow you to get the jstack output where it matters most is to activate it automatically when things start tumbling down. For example, you can set a certain throughput threshold and get jstack to run at the moment it drops.
Combined with using smart thread names, we can now know exactly which messages caused us to get stuck, and we can retrace our steps back, reproduce the error, isolate it, and solve it.
Using Java agents to skip logging altogether
The next step in this process is gaining visibility into your application during runtime. Logs are inspected in retrospect and only include the information that you’ve decided to put there in advance. We’ve seen how we can enrich them with stateful data, but we also need a way to access the exact variable values that caused each error to get down to the real root cause. Java agents give us the ability to get to the data we need straight from the source without writing to disk and using huge log files, so we can extract only the data we’ll actually be using.
One interesting approach is using BTrace, an open-source Java agent that hooks up to a JVM and opens up a scripting language that lets you query it during runtime. For instance, you can get access to things like ClassLoaders and their subclasses, and load up jstack whenever some troubled new class is instantiated. It’s a useful tool for investigating specific issues and requires you to write scripts for each case you want to cover.
You could also write your own custom Java agent, just like BTrace. One way this helped our team at Takipi was when a certain class was instantiating millions of new objects for some reason. We wrote an agent that hooks up to the constructor of that object. Anytime the object was allocated an instance, the agent would extract its stack trace. Later we analysed the results and understood where the load was coming from. These kind of problems really pique our team’s interest. On our day-to-day we’re building a production grade agent that knows how to extract the variable values that cause each exception or logged error, all across the stack trace, and across different machines. Check out a live demo right here.
Testing in production: Not for the faint hearted
Jokes aside, testing in production is a serious practice that many companies are taking part in. They don’t cancel the testing phase completely, but they understand that staging environments are not enough to mimic full blown distributed environments, no matter how much time and effort you put into setting them up. The only real testing takes place in production, with real data flowing through the system and unexpected use cases being thrown at it.
There are several approaches you can adopt for performing controlled production testing, depending on what kind of functionality is it that you’re trying to test. One option is duplicating and routing live traffic both through the current system and through the new version of the component that you’re testing. This way you can see how the new component behaves and compare it directly to the current version without risking the delivery of wrong results back to the user if, for example, it’s some data crunching task.
Another option is segmenting your traffic and releasing new features gradually. One way to do this is to use a canary server, which is a single node in your system updated with the new version you’d like to roll out (just like a canary in the coal mine). Sometimes it’s also helpful to add more fine-grained monitoring and logging to the canary server. Another option is to add more abstraction on top of the canary setup, implementing and making use of gradual rollouts with feature switches, or A/B testing small changes in your application to see how they impact performance.
Debugging Java applications in production requires a creative and forward-thinking mindset. Unless we prepare our applications and environment in advance, there will not be much insight to recover after getting hit by errors and performance issues.
Java/Scala Developer? Takipi replaces logging in production JVMs and lets you see the variable state that caused each log error and exception – Try Takipi.