AWS CloudWatch + yCrash = Monitoring + RCA
This article goes into an outage the online application GCeasy experienced on October 11, and what the monitoring tool AWS CloudWatch showed. AWS CloudWatch clearly indicated two things: Problem and symptom, and the time frame since the problem started.
We had an outage in our online application GCeasy on Monday morning (PST) Oct’ 11, 2021. When customers uploaded their Garbage Collection logs for analysis, the application was returning an HTTP 504 error. An HTTP 504 status code indicates that transactions are timing out. In this post, we would like to document our journey to identify the root cause of the problem.
Here are the primary components of the technology stack of the application:
- AWS EC2 instance
- AWS Elastic Beanstalk
- Nginx Web Server
- AWS Elastic Load Balancer
- Java 8
- Tomcat 8
- MySQL (RDS Service)
- AWS S3
AWS CloudWatch – Monitoring tool
We use AWS CloudWatch as our monitoring tool. From Fig 1, you can see AWS CloudWatch clearly reporting that the CPU consumption and MYSQL DB connection started to climb up from Oct’ 09 (Friday). Actually, on Oct’ 09, we made the new code deployment to the production environment. So it was clear that the new code was the culprit, causing the instability in the production environment.
AWS CloudWatch clearly indicated two things:
- Problem symptom (i.e., CPU and DB connection count spiked up)
- Time frame since the problem started (Oct’ 09, Friday)
However, AWS cloud watch didn’t report which line of code (i.e., root cause) was causing the CPU or DB connections to spike up.
yCrash – RCA tool
We use yCrash as our root cause analysis tool. This tool captures GC log, thread dump, heap dump, netstat, vmstat, iostat, top, disk usage, kernel logs, and other system-level artifacts from the sick application, analyzes them, and generates root cause analysis reports instantly. Fig 2 shows the summary page of the yCrash report. Please refer to the red arrow mark in Fig 2, it points out that “20 threads are stuck waiting for a response from the external system“. It also gives a hyperlink to the thread report to examine those 20 BLOCKED threads stack traces. Clicking on the hyperlink shows the stack trace of those 20 threads, as shown in Fig 3.
Based on the stack trace, you can see that these threads were making MySQL Database calls. Look at the red arrow mark in Fig 3. It points to ‘com.tier1app.diamondgc.dao.GCReportDAO.selectReportById(GCReportDAO.java:335)‘. This is the line of code which is making the MYSQL Database call. We looked up the source code of this line. This line of code was making a ‘select’ SQL call to a table in the MySQL Database. This ‘select’ SQL query turned out to be quite inefficient. This inefficiency wasn’t exposed in the lower test environments because we had only a handful of records on the table. However, in production, this table had several million records. Thus ‘select’ SQL query started to perform poorly in the production environment. It took anywhere from 5 to 7 minutes to complete. During this time period, application threads were completely stuck, thus ultimately customer requests started to timeout with HTTP 504 error.
It turned out that this SQL was added in the recent release, which went live on Friday. Because yCrash pointed out the exact line of code causing this degradation, we commented out this ‘select’ SQL. Once new code was deployed, the application’s performance recovered right away.
In our earlier blog, we attempted to explain the difference between a monitoring tool (AWS Cloud watch) and a root cause analysis tool (yCrash) in theory. In this blog, we have once again attempted to explain the difference through a real-world production problem. Thank you for reading this post.