Garbage Collection logs: Can we really use them like that?
Garbage collection logs are primarily used to troubleshoot memory problems. But, as Ram Lakshmanan explains, a few innovative enterprises have found other uses for their GC logs, some of which could even save you money!
Engineers use garbage collection logs primarily to troubleshoot memory-related problems and tune their GC settings. However, we are seeing several innovative enterprises use garbage collection logs for the following purposes:
Lowering AWS bills
Most applications saturate memory before they saturate other resources (CPU, network bandwidth, storage). Consequently, most teams upgrade their EC2 instance size to get additional memory rather than additional CPU or network bandwidth. With the right memory size settings and GC parameters, you can run effectively on basic EC2 instances; you don't have to upgrade to higher EC2 configurations, which directly cuts your AWS bill. Analyzing GC logs thoroughly will help you arrive at optimal memory size settings.
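As a sketch of how GC logs can feed right-sizing decisions, the snippet below scans log lines for the common `before->after(total)` heap summary (as emitted by JDK 9+ unified logging with `-Xlog:gc`) and reports the peak heap occupancy measured just after a collection. That post-GC "live" footprint is a reasonable starting point for choosing `-Xmx`; the log format and helper name are assumptions for illustration.

```python
import re

# Matches the "512M->128M(1024M)" heap summary printed on JDK 9+ GC lines.
HEAP_RE = re.compile(r"(\d+)M->(\d+)M\((\d+)M\)")

def peak_live_heap_mb(log_lines):
    """Return the highest post-GC heap occupancy (MB) seen in the log."""
    peak = 0
    for line in log_lines:
        m = HEAP_RE.search(line)
        if m:
            after_gc = int(m.group(2))  # heap left after the collection
            peak = max(peak, after_gc)
    return peak

sample = [
    "[2.345s][info][gc] GC(5) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 4.123ms",
    "[9.801s][info][gc] GC(6) Pause Young (Normal) (G1 Evacuation Pause) 640M->210M(1024M) 6.870ms",
]
print(peak_live_heap_mb(sample))  # prints 210
```

If the peak live set over a full business cycle is, say, 210 MB, a multi-gigabyte heap (and the instance size that comes with it) is likely wasted money.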
Micro metrics – catch performance problems in the test environment
Despite thorough stress testing in the test environment, performance problems still find their way to production. That is because many enterprises measure only macro metrics, i.e. CPU utilization, memory utilization, and response time. Macro metrics don't give visibility into acute degradations, and these acute degradations are the ones that manifest as major performance problems in production. If the proper micro metrics are measured in the test environment, several performance problems can be caught in the testing phase itself. You can gather all the memory-related micro metrics from the GC logs themselves.
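One example of such a micro metric is the object creation (allocation) rate, which can be derived from consecutive GC entries: the heap growth between the end of one GC and the start of the next, divided by the elapsed time. The sketch below assumes the JDK 9+ unified log shape `[<uptime>s] ... <before>M-><after>M(<total>M)`; the function name is illustrative.

```python
import re

# Captures the uptime and the before/after heap sizes from a GC line.
EVENT_RE = re.compile(r"\[(\d+\.\d+)s\].*?(\d+)M->(\d+)M\(\d+M\)")

def allocation_rates_mb_per_s(log_lines):
    """One rate per pair of consecutive GC events, in MB/s."""
    events = []
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m:
            events.append((float(m.group(1)), int(m.group(2)), int(m.group(3))))
    rates = []
    for (t0, _, after0), (t1, before1, _) in zip(events, events[1:]):
        # Memory allocated between two GCs = heap growth since the last GC ended.
        rates.append((before1 - after0) / (t1 - t0))
    return rates

sample = [
    "[10.0s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 500M->100M(1024M) 3.1ms",
    "[14.0s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 500M->120M(1024M) 3.4ms",
]
print(allocation_rates_mb_per_s(sample))  # (500-100)/(14-10) -> [100.0]
```

A sudden jump in this rate between two builds is exactly the kind of acute degradation that macro metrics miss.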
Production Monitoring & Alerting
The industry has seen several interesting Application Performance Monitoring (APM) tools. Yet none of these tools provides insightful metrics on garbage collection. By insightful metrics, we mean:
- Memory problem detection, such as memory leaks, consecutive full GCs, GC starvation…
- GC KPIs: Latency, throughput, footprint
- Object creation rate, promotion rate, reclamation rate…
- GC Pause time statistics: Duration distribution, average, count, average interval, min/max, standard deviation
- GC Causes statistics: Duration, Percentage, min/max, total
- GC phases related statistics: Each GC algorithm has several sub-phases. Example for G1: initial-mark, remark, young, full, concurrent mark, mixed
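Several of the statistics above fall straight out of the log. As one example, the sketch below computes the pause-time statistics (count, average, min/max, standard deviation) mentioned in the list, assuming pause durations appear as a trailing `<millis>ms` on each GC line, as in the JDK 9+ unified `-Xlog:gc` format.

```python
import re
from statistics import mean, pstdev

# Matches the trailing pause duration, e.g. "... 4.123ms".
PAUSE_RE = re.compile(r"(\d+\.\d+)ms\s*$")

def pause_stats(log_lines):
    pauses = [float(m.group(1)) for line in log_lines
              if (m := PAUSE_RE.search(line))]
    return {
        "count": len(pauses),
        "avg": mean(pauses),
        "min": min(pauses),
        "max": max(pauses),
        "stddev": pstdev(pauses),  # population std deviation of pause times
    }

sample = [
    "[1.0s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 100M->20M(512M) 2.000ms",
    "[2.0s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 120M->25M(512M) 4.000ms",
]
print(pause_stats(sample))  # count=2, avg=3.0, min=2.0, max=4.0, stddev=1.0
```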
It's not that APM tools are not interested in these metrics; rather, they don't have the data. That data is available only in GC logs and within the JVM runtime. With the advent of GC log analysis APIs, you can proactively monitor your application's GC logs and raise alerts whenever a threshold is breached. The machine learning capabilities of such a log analysis API can not only help you detect memory problems but also predict future ones.
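Once a GC log analysis API has turned the log into structured events, the alerting logic itself can be quite small. The sketch below is an illustrative rule, not any vendor's schema: a run of back-to-back full GCs is a classic symptom of a memory leak or sustained heap pressure, so we alert when the run length crosses a threshold.

```python
def longest_full_gc_run(gc_events):
    """gc_events: ordered list of GC type strings, e.g. 'Young' or 'Full'."""
    longest = current = 0
    for event in gc_events:
        current = current + 1 if event == "Full" else 0
        longest = max(longest, current)
    return longest

def should_alert(gc_events, max_consecutive_full=3):
    # Fire an alert once full GCs start happening back to back.
    return longest_full_gc_run(gc_events) >= max_consecutive_full

events = ["Young", "Full", "Full", "Full", "Young"]
print(should_alert(events))  # True: three consecutive full GCs
```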
Failing CI builds on GC regressions
Catching defects during the development phase is much cheaper than catching them during the testing phase. Several enterprises run stress tests as part of their Continuous Integration (CI) process, and the GC logs generated during those tests are analyzed programmatically through the GC log analysis API. If thresholds such as object creation rate, full GC count, or GC interval time are breached, the build fails automatically. It's a very powerful way to catch performance problems at code-commit time itself.
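The build gate itself can be a short script in the CI pipeline. The metric names and limits below are illustrative assumptions, not a fixed schema: the idea is simply to compare the metrics reported by the GC log analysis step against team-chosen thresholds and return a non-zero exit code on any breach.

```python
# Team-chosen limits; tune these per application (values here are examples).
THRESHOLDS = {
    "object_creation_rate_mb_per_s": 500,
    "full_gc_count": 0,
    "avg_pause_ms": 200,
}

def breaches(metrics, thresholds=THRESHOLDS):
    """Names of all metrics that exceed their configured limit."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

def gate(metrics):
    failed = breaches(metrics)
    for name in failed:
        print(f"GC threshold breached: {name}")
    return 1 if failed else 0  # non-zero exit code fails the CI build

exit_code = gate({"object_creation_rate_mb_per_s": 620, "full_gc_count": 0})
print(exit_code)  # 1: the allocation-rate limit was exceeded
```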