The leak hunter faces his toughest challenge yet
Nikita Salnikov-Tarnovski recounts a nightmare twelve-hour search for the source of an application's memory leaks.
A week ago I was asked to fix a problematic webapp suffering from memory leaks. How hard can it be, I thought – considering that I have both seen and fixed hundreds of leaks over the past year or so.
But this one proved to be a challenge. 12 hours later I had discovered no fewer than five leaks in the application and had managed to fix four of them. I figured it would be an experience worth sharing.
The application at hand was a simple Java web application with a few datasources connecting to the relational databases, Spring in the middle to glue stuff together and simple JSP pages rendered to the end user. No magic whatsoever. Or so I thought. Boy, was I wrong.
First stop - MySQL drivers. Apparently the most common MySQL driver launches a background thread that cleans up your unused and unclosed connections. So far so good. But the catch is that the context classloader of this newly created thread is your web application classloader. This means that while this thread is running and you are trying to undeploy your webapp, its classloader is left dangling behind, with all the classes loaded in it.
Apparently it took from July 2012 to February 2013 to fix this after the bug was discovered. You can follow the discussion in the MySQL issue tracker. The solution finally implemented was to add a shutdown() method to the API, which you as a developer should know to invoke before redeploys. Well, I didn’t. And I bet 99% of you out there didn’t, either.
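The mechanism is easy to reproduce outside a container. The sketch below is not the driver's code: it uses an empty URLClassLoader as a stand-in for the web application classloader (an assumption made purely for illustration, as are the class and thread names) and shows that a thread created while that loader is the context classloader inherits it.

```java
import java.net.URL;
import java.net.URLClassLoader;

public final class InheritedContextLoaderDemo {
    private InheritedContextLoaderDemo() {}

    /**
     * Creates (but does not start) a thread while the given loader is the
     * context classloader of the calling thread, then restores the caller's loader.
     */
    public static Thread createThreadUnder(ClassLoader webappLoader) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(webappLoader);
        try {
            // A new thread copies the context classloader of the thread that creates it.
            return new Thread(() -> { }, "hypothetical-cleanup-thread");
        } finally {
            current.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the webapp classloader; a real container supplies its own.
        ClassLoader fakeWebappLoader = new URLClassLoader(new URL[0]);
        Thread worker = createThreadUnder(fakeWebappLoader);
        // As long as 'worker' is alive and referenced, so is fakeWebappLoader.
        System.out.println(worker.getContextClassLoader() == fakeWebappLoader);
    }
}
```

In a real deployment the thread keeps running after undeploy, so the reference from the thread to the loader is what pins every class the webapp loaded.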
There is a good place for such shutdown hooks in your typical Java web application, namely the contextDestroyed() method of the ServletContextListener interface. This method gets called each and every time the servlet context is destroyed, which most often happens during redeploys. Chances are that quite a few developers are aware this place exists, but how many actually realise the need to clean up in this particular hook?
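As a rough sketch of what such a hook might delegate to, the helper below calls the driver's shutdown method reflectively, so it compiles and runs even when the driver is absent. The class name com.mysql.jdbc.AbandonedConnectionCleanupThread is an assumption based on Connector/J 5.1.x and should be checked against the driver version in use; ShutdownHooks and its method names are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public final class ShutdownHooks {
    private ShutdownHooks() {}

    /**
     * Invokes a public static no-argument method by name if the class can be loaded.
     * Returns true when the method was found and completed normally, false otherwise.
     */
    public static boolean invokeStaticIfPresent(String className, String methodName) {
        try {
            Class<?> type = Class.forName(className);
            Method method = type.getMethod(methodName);
            if (!Modifier.isStatic(method.getModifiers())) {
                return false;
            }
            method.invoke(null);
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class or method missing, not accessible, or the method itself threw.
            return false;
        }
    }

    /** Intended to be called from ServletContextListener.contextDestroyed(). */
    public static boolean stopMySqlCleanupThread() {
        // Assumption: this is the class carrying the shutdown() method in Connector/J 5.1.x.
        return invokeStaticIfPresent("com.mysql.jdbc.AbandonedConnectionCleanupThread", "shutdown");
    }
}
```

Going through reflection is a design choice for the sketch only; if the driver is a compile-time dependency, a direct call is simpler and fails loudly when the API changes.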
Back to the application, which was still far from being fixed. My second discovery was also related to context classloaders and datasources. When you are using com.mysql.jdbc.Driver, it registers itself as a driver in the java.sql.DriverManager class. Again, this is done with good intentions. After all, this is what your application uses to choose the right driver for a given database URL when connecting. But as you might guess, there is a catch: DriverManager is loaded by the bootstrap classloader rather than your web application’s classloader, so it cannot be unloaded when redeploying your application, and it keeps holding on to the driver it was handed.
What makes things really peculiar is that there is no general way to unregister the driver by yourself. The reference to the class you are trying to unregister seems to be deliberately hidden from you. In this particular case I was lucky: the connection pool used in the application was able to unregister the driver, provided I remembered to ask. Looking back at similar cases in my past, this was the first time I saw such a feature implemented in a connection pool. Before that, I once had to enumerate all the JDBC drivers registered with DriverManager to figure out which ones I should unregister. Not an experience I can recommend to anyone.
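For readers stuck with a pool that lacks this feature, the enumeration approach might look roughly like the sketch below. It assumes the cleanup code itself is loaded by the web application classloader (DriverManager only shows a caller the drivers it is allowed to see); JdbcDriverCleanup is a hypothetical name, not part of any library.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class JdbcDriverCleanup {
    private JdbcDriverCleanup() {}

    /**
     * Deregisters every visible JDBC driver whose class was loaded by the given
     * classloader and returns the class names of the drivers that were removed.
     */
    public static List<String> deregisterDriversLoadedBy(ClassLoader loader) {
        List<String> removed = new ArrayList<>();
        // Copy the enumeration first so deregistration does not disturb iteration.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() != loader) {
                continue; // Shared or container-level driver: leave it registered.
            }
            try {
                DriverManager.deregisterDriver(driver);
                removed.add(driver.getClass().getName());
            } catch (SQLException e) {
                // Could not deregister this one; carry on with the rest.
            }
        }
        return removed;
    }
}
```

From contextDestroyed() this would typically be called with Thread.currentThread().getContextClassLoader(), on the assumption that the container sets it to the webapp's loader during that callback.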
This should be it, I thought. Two leaks in the same application is already more than one can tolerate. Wrong. The third issue staring right at me from the leak report was sun.awt.AppContext with its static field mainAppContext. What? I have no idea what this class is supposed to do, but I was pretty sure that the application at hand didn’t use AWT in any way. So I started a debugger to find out who loads this class (and why). Another surprise: it was com.sun.jmx.trace.Trace.out(). Can you think of a good reason why a com.sun.jmx class would call a sun.awt class? I certainly can’t. Nevertheless, that call stack originated from my connection pool, BoneCP. And there is absolutely no way to skip the line of code that leads to this particular memory leak. Solution? The following magic incantation in my ServletContextListener.contextInitialized():
Thread.currentThread().setContextClassLoader(null);
// Force the AppContext singleton to be created and initialized without holding a reference to WebAppClassLoader
sun.awt.AppContext.getAppContext();
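One caveat with the incantation above is that it leaves the thread's context classloader set to null afterwards. A possible variant, sketched here with a hypothetical helper name, swaps the loader only for the duration of the initialization and then puts the original back.

```java
public final class ContextClassLoaderSwap {
    private ContextClassLoaderSwap() {}

    /**
     * Runs the action with the given context classloader (which may be null)
     * installed on the current thread, then restores the previous one.
     */
    public static void runWith(ClassLoader temporary, Runnable action) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(temporary);
        try {
            action.run();
        } finally {
            // Restore even if the action throws.
            current.setContextClassLoader(original);
        }
    }
}
```

The AppContext call would then be passed in as the action, with null as the temporary loader. Note that sun.awt.AppContext is an internal JDK class, so whether application code can still reference it depends on the Java version in use.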
But I still wasn’t done: something was still leaking. In this case I found out that our application was binding this datasource to the InitialContext JNDI tree, a good, standardized way to bind your objects for future discovery. But again, when using this nice thing you have to clean up after yourself by unbinding the datasource from the JNDI tree in the very same contextDestroyed() method.
Well, so far we had pretty logical, albeit rare and somewhat obscure, problems which, with some reasoning and google-fu, were quickly fixed. My fifth and last problem was nothing like that. I still had that application crashing with OutOfMemoryError: PermGen. Both Plumbr and Eclipse MAT reported to me that the culprit, the one who had taken my classloader hostage, was a thread named com.google.common.base.internal.Finalizer.
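A minimal sketch of that cleanup step, using only the standard javax.naming API, could look like this. JndiCleanup is a hypothetical helper, and the JNDI name passed to it is whatever name the application used when binding.

```java
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public final class JndiCleanup {
    private JndiCleanup() {}

    /**
     * Unbinds the given name from the default initial context.
     * Returns true if the unbind call succeeded, false if JNDI reported an error
     * (for example because no naming provider is configured).
     */
    public static boolean unbindQuietly(String jndiName) {
        Context context = null;
        try {
            context = new InitialContext();
            context.unbind(jndiName);
            return true;
        } catch (NamingException e) {
            return false;
        } finally {
            if (context != null) {
                try {
                    context.close();
                } catch (NamingException ignored) {
                    // Nothing useful to do if closing the context fails.
                }
            }
        }
    }
}
```

Swallowing the exception keeps the rest of contextDestroyed() running; logging it instead would be a reasonable alternative.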
“Who the hell is this guy?” – was my last thought before the darkness engulfed me.
A couple of hours and four coffees later I found myself staring at three lines:
emf.close();
emf = null;
ds = null;
It is hard to recollect exactly what happened during the intervening hours. I have vague memories of WeakReferences, ReferenceQueues, Finalizers, Reflection and my first sighting of a PhantomReference in the wild. Even today I still cannot fully explain why and for what purpose my connection pool used finalizers tied to Google’s implementation of a reference queue running in a separate thread.
Nor can I explain why closing the javax.persistence.EntityManagerFactory (named emf in the code above and held in a static reference in one of the application’s own classes) was not enough, so that I had to manually null this reference, along with a similar static reference to the data source used by that factory. I was sure that Java’s GC could cope with circular references all day long, but it seems that this magical ring of classes, static references, objects, finalizers and reference queues was too hard even for it. And so, again for the first time in my long career, I had to null out a Java reference.
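Put together as a shutdown routine, the three lines might be wrapped as in the sketch below. The holder class and its method names are hypothetical, and AutoCloseable and Object are stand-ins for the EntityManagerFactory and the data source so that the example runs without JPA on the classpath.

```java
public final class PersistenceHolder {
    // Hypothetical stand-ins for the static EntityManagerFactory and DataSource fields.
    private static AutoCloseable emf;
    private static Object ds;

    private PersistenceHolder() {}

    public static synchronized void init(AutoCloseable factory, Object dataSource) {
        emf = factory;
        ds = dataSource;
    }

    public static synchronized boolean isReleased() {
        return emf == null && ds == null;
    }

    /** Closes the factory, then clears both static references. */
    public static synchronized void shutdown() {
        try {
            if (emf != null) {
                emf.close();
            }
        } catch (Exception e) {
            // Closing failed; the references are still cleared below.
        } finally {
            emf = null;
            ds = null;
        }
    }
}
```

Clearing the fields in a finally block means the references are dropped even when close() throws, which is the part that mattered in the story above.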
I am a humble guy and thus cannot claim that I was the most efficient in finding the cure for all of the above in a mere 12 hours. But I have to admit I have been dealing with memory leaks almost exclusively for the past three years. And I even had my own creation, Plumbr, helping me (in fact, four out of five of those leaks were discovered by Plumbr in 30 minutes or so). But to actually solve those leaks, it took me more than a full working day in addition.
Overall, something is apparently broken in the Java EE and/or classloader world. It cannot be normal that a developer must remember all those hooks and configuration tricks, because it simply isn’t possible. After all, we like to use our heads for something productive. And, as seen from the workarounds bundled with two popular servlet containers (Tomcat and Jetty), the problem is severe. Solving it, however, will require not merely alleviating some of the symptoms but curing the underlying design errors.
Photo by Blai Biosca.