Fixing Java Production Problems with Application Performance Management
In this article, which originally appeared in JAX Magazine, Dan Delany goes hunting in the haystack with New Relic.
Tracking down issues in a production system can be a nightmare, but application performance management systems such as New Relic – which combine isolated log files and network and database monitoring – can help.
Finding and fixing issues in a production system can be really difficult. Usually by the time the problem is visible, users are already complaining. Fixing these problems under the eye of management is no fun for anybody, especially when you don't know where the problems may be.
You may or may not have access to the servers in question, and you may have to diagnose an issue involving multiple servers. Sometimes there’s a third party involved, such as a database administrator (DBA) or hosting company, for whom your problem is not a priority. Depending on how detailed your log files are, you might be able to search through them and find some hints. Your code may also use third-party JARs that don't log the level of detail you need.
How APM can help
It's often possible to derive useful information from log files, network monitoring, database server monitoring, and the like. The problem there is that you're trying to infer things about your code's behavior from the information that you’ve already decided to log. If you change your logging to add more information, it's too late. The error has already happened.
Application Performance Management (APM) systems allow you to remotely instrument your code and log data to an external system continuously. This is advantageous for several reasons. Since this data collection and logging is happening in the background, you don’t need to think about logging metrics during software development. When you need information about the performance of your software in production, the information has already been gathered for you during the normal operation of the system. It has been gathered under real system load on the actual production environment, as opposed to data from a test system under simulated load. It also means that when an error occurs in production, such as a performance problem or a threading problem, data about it has already been gathered and is already available.
In addition to helping diagnose problems, an APM system can provide more visibility into your code's performance and usage patterns, with metrics about which pages are accessed most often and how much time the server takes to generate them. Once a page has been identified as needing improvement, an APM system can help you drill in and see where the server is spending the most time. This lets you prioritize your fixes.
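To make the idea concrete, here is a minimal sketch of the kind of per-page aggregation described above: record how long each request took, then rank pages by average response time. It is not New Relic's implementation or API. The class TransactionStats, its methods, and the timings in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;

/** Hypothetical in-memory aggregator; not part of any New Relic API. */
final class TransactionStats {
    // One running summary (count, sum, min, max) per transaction name.
    private final Map<String, LongSummaryStatistics> byName = new LinkedHashMap<>();

    /** Records one request: which transaction handled it and how long it took. */
    void record(String transactionName, long elapsedMillis) {
        byName.computeIfAbsent(transactionName, k -> new LongSummaryStatistics())
              .accept(elapsedMillis);
    }

    /** Number of recorded requests for a transaction (0 if never seen). */
    long callCount(String transactionName) {
        LongSummaryStatistics s = byName.get(transactionName);
        return s == null ? 0 : s.getCount();
    }

    /** Mean response time in milliseconds (0.0 if never seen). */
    double averageMillis(String transactionName) {
        LongSummaryStatistics s = byName.get(transactionName);
        return s == null ? 0.0 : s.getAverage();
    }

    /** Transaction names ordered from slowest to fastest average response time. */
    List<String> slowestFirst() {
        List<String> names = new ArrayList<>(byName.keySet());
        names.sort((a, b) -> Double.compare(averageMillis(b), averageMillis(a)));
        return names;
    }
}
```

For instance, recording made-up timings of 300 ms and 500 ms for PeopleController#phonenumbers and 40 ms for a hypothetical PeopleController#index would put phonenumbers first in slowestFirst(), with an average of 400 ms over two calls.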
For example, Figure 1 shows statistics for our office’s site, which lists people’s contact information. It’s a small site, but it gives a feel for what APM can tell you. We can see usage spikes and how much time is being spent in application code versus database code. The dashboard also identifies the PeopleController#phonenumbers page as the slowest on the site.
Figure 1: Summary dashboard: Shows general statistics about an app in New Relic
In this article, I’ll demonstrate using New Relic's APM system to help identify production performance issues. I created a demo app with a single servlet that takes a first name and a last name and uses Hibernate to search a database for entries with that name. Adding APM to a system is fairly simple: to get started, I only had to create one additional directory holding the code and configuration unpacked from a zip file downloaded from New Relic.
6:/opt/local/apache-tomcat-7.0.34/newrelic% ls
CHANGELOG          newrelic-extension-example.xml
LICENSE            newrelic-extension.xsd
README.text        newrelic.jar
logs               newrelic.yml
newrelic-api.jar
7:/opt/local/apache-tomcat-7.0.34/newrelic%
After the directory is created, you can activate New Relic with a simple change to the launch script. In this case, the change is in Tomcat’s catalina.sh script.
# ---- New Relic switch automatically added to start command on 2013 Jan 08, 11:43:26
NR_JAR=/opt/local/apache-tomcat-7.0.34/newrelic/newrelic.jar; export NR_JAR
JAVA_OPTS="$JAVA_OPTS -javaagent:$NR_JAR"; export JAVA_OPTS
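The -javaagent flag is a standard JVM option backed by the java.lang.instrument package: the JVM loads the named jar and calls a premain method in it before the application's own main method runs. The sketch below shows that hook with a toy agent that only counts classes as they are loaded. It assumes nothing about how New Relic's agent works internally, and CountingAgent and its members are hypothetical names.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical toy agent; illustrates the -javaagent hook only. */
final class CountingAgent {
    /** How many class definitions the transformer has been shown so far. */
    static final AtomicLong CLASSES_SEEN = new AtomicLong();

    /** Observes each class as it is loaded and leaves its bytecode unchanged. */
    static final class Observer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            CLASSES_SEEN.incrementAndGet();
            return null; // null tells the JVM that the class is not modified
        }
    }

    static final Observer OBSERVER = new Observer();

    /** Called by the JVM before the application's main method when -javaagent is used. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(OBSERVER);
    }
}
```

To run as a real agent, a class like this would typically be declared public and packaged in a jar whose manifest names it in a Premain-Class attribute. A production agent would return rewritten bytecode from transform to add timing code, which is why no change to the application's source is required.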
Figure 2: Web Transactions page showing four very slow servlet calls
Once your server has been launched with this new flag, it will report data to New Relic. The data can then be mined to help you monitor your code as it runs (see Figure 2).
In this case, the performance problem seen in Figure 3 is easy to spot. My single servlet is taking between 8000 and 9000 milliseconds every time it runs.
Figure 3: Transaction Trace page showing where time was spent in a specific servlet call
The dashboard shows that the problem lies in the QueryServlet, which is taking a long time to run. A single database query accounts for all but 6ms of the slow request. Since I used Hibernate in my persistence layer, it generates the SQL for me, so tweaking the SQL may not be a simple task (Figure 4).
Drilling a little deeper shows us exactly which query was slow:
select person0_.id as id0_, person0_.fname as fname0_, person0_.lname as lname0_, person0_.middlename as middlename0_ from person person0_ where fname=? and lname=?
Figure 4: SQL Detail Tab on the Transaction Trace showing the SQL as captured by New Relic
Now I can send this query to my DBA and ask what can be done to make it run faster. It turns out to be a simple fix: the query runs against a single table that has over 21 million rows, and none of the columns in its 'where' clause is indexed (Figure 5).
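As a rough analogy for why the missing indexes matter, the sketch below compares two ways of answering the same fname/lname lookup in memory: examining every row, and jumping straight to the matches through a precomputed lookup structure. This is an illustration only. Real database indexes are usually tree-based and live on disk, and the PersonLookup class, its methods, and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the person table. */
final class PersonLookup {
    static final class Person {
        final long id;
        final String fname;
        final String lname;

        Person(long id, String fname, String lname) {
            this.id = id;
            this.fname = fname;
            this.lname = lname;
        }
    }

    private final List<Person> rows = new ArrayList<>();
    // Plays the role of an index on (fname, lname): key -> matching rows.
    private final Map<List<String>, List<Person>> index = new HashMap<>();

    /** Adds a row and keeps the index in step with the table. */
    void insert(Person p) {
        rows.add(p);
        index.computeIfAbsent(Arrays.asList(p.fname, p.lname), k -> new ArrayList<>())
             .add(p);
    }

    /** No index: every row is examined, so cost grows with the table size. */
    List<Person> findByScan(String fname, String lname) {
        List<Person> matches = new ArrayList<>();
        for (Person p : rows) {
            if (p.fname.equals(fname) && p.lname.equals(lname)) {
                matches.add(p);
            }
        }
        return matches;
    }

    /** With an index: one lookup, regardless of how many rows the table holds. */
    List<Person> findByIndex(String fname, String lname) {
        return index.getOrDefault(Arrays.asList(fname, lname), Collections.emptyList());
    }
}
```

Both methods return the same rows. The difference is the work done to find them, which is what separates a scan of 21 million rows from an indexed lookup.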
Figure 5: DBA tool showing that the table being queried isn’t indexed for our query
The DBA has added some indexes to the table. Now I can run the app again and see the results of the change in Figure 6.
Figure 6: Transaction Trace showing the same servlet call after the table indexes were added
We improved the system response time from 8470ms to 20ms, a huge improvement in a simple case. But most importantly, I was able to get all the information I needed in an organized fashion in the browser. I didn’t waste any time logging into servers, viewing log files or anything like that. I also didn’t need to change anything in my source code to enable this data collection. I added the New Relic jar to the server launch scripts, and after that, my server logged information to New Relic in the background. From the New Relic website, I was able to track down my performance problem. I drilled through to the slow web transaction, looked at different parts of the transaction to see what was the slowest, and acted on those results.
This was a simple demonstration where the fix was obvious once the slow query was identified, but it illustrates the value of app performance management. Not only can it be used to find performance problems, it can also be used to measure your app in your production environment so you can know where to spend your time and money to make your system better.
Author Bio: Dan has been writing Java code since 1996, and is currently a senior software engineer at New Relic in Portland. When he is not at work, he enjoys playing with trains with his son and writing model train related software for his iPhone.
This article first appeared in JAX Magazine: Socket to Them.