Overview of ycrash – finding the source of your problem
Take a tour of ycrash in this article by Ram Lakshmanan. ycrash helps capture critical artifacts, including garbage collection logs, thread dumps, core dumps, heap dumps, disk usage, and more when the problem happens. It applies machine learning algorithms and generates a report which gives you a complete view of the problem, down to the lines of code that caused it.
The industry has seen cutting-edge application performance monitoring tools (AppDynamics, NewRelic, Dynatrace…) and log analysis tools (DataDog, Splunk…). These are great tools for detecting problems: they can tell you that CPU spiked by x%, memory degraded by y%, or response time shot up by z seconds. But they don’t answer the questions that follow: Why has the CPU spiked? Why has memory degraded? Why has the response time increased? You still need to engage developers, architects, or vendors to troubleshoot the problem and identify its root cause.
ycrash captures critical artifacts (GC logs, thread dumps, core dumps, heap dumps, netstat, vmstat, lsof, iostat, top, disk usage…) when the problem happens, applies machine learning algorithms, and generates one unified root cause analysis report. This report gives you a 360-degree view of the problem and points out the exact class, method, and line of code that caused it.
ycrash has two primary components:
a. ycrash transmitter: This is a simple shell script. It can be triggered manually or automatically, based on alerts. The script integrates well with APM tools, which can trigger it directly. It captures the following artifacts from the application:
- GC Log
- Thread dump
- Heap Dump (if needed)
- Top -h <pid>
- Disk usage
The captured artifacts are transmitted to the ycrash server.
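To make the capture step concrete, here is a minimal Python sketch of what collecting a few of these artifacts could look like. It is not the ycrash transmitter itself (which is a shell script whose contents are not shown in this article). The function names and the choice of commands (`jstack`, `jmap`, `top`, `df`) are assumptions made for illustration; the GC log is omitted because it is a file the JVM writes, not the output of a command.

```python
import shutil
import subprocess
from pathlib import Path


def build_capture_commands(pid, include_heap_dump=False, heap_file="heap.hprof"):
    """Map artifact names to commands that could produce them (illustrative choices)."""
    commands = {
        "thread_dump": ["jstack", str(pid)],                    # JDK thread dump tool
        "top": ["top", "-b", "-n", "1", "-H", "-p", str(pid)],  # one batch-mode, per-thread sample (Linux)
        "disk_usage": ["df", "-h"],
    }
    if include_heap_dump:
        # Heap dumps are large and pause the JVM, so they are opt-in here.
        # jmap writes the dump to heap_file; only its status message goes to stdout.
        commands["heap_dump"] = ["jmap", f"-dump:format=b,file={heap_file}", str(pid)]
    return commands


def capture(commands, out_dir, timeout=60):
    """Run each command, save its stdout under out_dir, and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for name, cmd in commands.items():
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed on this host; skip rather than fail
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            continue  # a hung or failing tool should not block the other artifacts
        path = out_dir / f"{name}.txt"
        path.write_text(result.stdout)
        written[name] = path
    return written
```

Calling `capture(build_capture_commands(pid), "artifacts")` would leave one text file per available tool in the `artifacts` directory. Skipping missing tools reflects the idea that a partial capture is still better than none.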
b. ycrash server: The ycrash server analyzes all the artifacts using machine learning algorithms and generates one unified root cause analysis report. It uses the following Tier1app tools to analyze the artifacts:
- GCeasy: analyzes garbage collection logs
- FastThread: analyzes thread dumps
- HeapHero: analyzes heap dumps
The report is archived on the server and can be viewed online at any time. Dashboards are also provided, so reports can be searched and viewed in calendar mode.
What are the advantages of ycrash?
- One unified root cause analysis report: Most of the time, SRE engineers capture the artifacts and hand them to developers, architects, and sometimes vendors to troubleshoot the problem, and it can take them a long time to get to the root cause. ycrash produces one unified root cause analysis report that gives you a 360-degree view of the problem and points out the exact class, method, and line of code that caused it.
- Capturing the right artifacts at the right time: In the heat of the moment, most SRE engineers restart the application without capturing all the right artifacts, or capture only a partial set. Without the right artifacts, the problem is hard to debug. ycrash automates this process and captures all the artifacts at the right point in time.
Is ycrash a replacement for APM tools?
No, it is not a replacement for APM (Application Performance Monitoring) tools. It complements them: it is meant to address the shortcomings of APM tools, and it integrates well with them.
On-prem deployment model
Both the ycrash transmitter and the ycrash server run on the customer’s premises. No data is transmitted outside the customer’s corporate firewall.
Authentication and authorization
ycrash supports SAML-based authentication, so it can easily integrate with the client’s authentication and authorization solution.
API & integration with monitoring tools
ycrash provides a simple REST API. ycrash can easily integrate with industry-standard APM, Log analysis & CI/CD tools. More details about the ycrash APIs can be found here:
Agents or triggers
The ycrash transmitter is a simple shell script that captures and transmits various artifacts to the ycrash server. It should be executed on the host that is experiencing the problem. The script can be executed automatically from APM tools whenever an alert is generated.
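As a rough illustration of alert-driven triggering, the Python sketch below launches a capture script when an alert arrives. The alert payload shape (`severity`, `pid`), the rule that only critical alerts trigger a capture, and the idea of passing the PID as a command-line argument are all assumptions made for this example; the actual hook depends on your APM tool and on how the transmitter script expects to be invoked.

```python
import subprocess


def handle_alert(alert, script_path, run=subprocess.run):
    """Launch the capture script for the process named in an alert.

    Returns the command that was run, or None if the alert was ignored.
    `alert` is a hypothetical payload shape: {"severity": str, "pid": int}.
    """
    if alert.get("severity") != "critical":
        return None  # this sketch ignores lower-severity alerts
    # Passing the PID as the only argument is an assumption, not the script's documented usage.
    command = [script_path, str(alert["pid"])]
    run(command, check=True)
    return command
```

The `run` parameter exists only so the function can be exercised without actually executing a script, for example in a test of the alert wiring.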
If a customer has concerns about running the ycrash transmitter script, they can write a custom script that captures the above artifacts and sends them to the ycrash server with an HTTP(S) POST.
Data is transferred from the ycrash transmitter to the server through a simple HTTP(S) POST. Both the ycrash transmitter and the server are deployed within the customer’s network.
No data leaves the customer’s corporate firewall. All of the collected data is stored in the customer’s network only. Thus, data privacy is subject to the customer’s data policy.
Data is stored and archived at the customer’s site. Thus, data retention is subject to the customer’s data retention policy.
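For readers considering a custom script, a hedged sketch of the transfer step in Python follows. The `/upload` path, the `artifact` query parameter, and the content type are placeholders chosen for illustration, not the real ycrash API; consult the ycrash API documentation for the actual endpoint and parameters.

```python
import urllib.parse
import urllib.request


def build_upload_request(server_url, artifact_name, payload):
    """Build an HTTP(S) POST carrying one artifact as the request body."""
    # The path and query parameter below are placeholders, not the real ycrash API.
    query = urllib.parse.urlencode({"artifact": artifact_name})
    url = f"{server_url.rstrip('/')}/upload?{query}"
    return urllib.request.Request(
        url,
        data=payload,  # raw bytes of the captured artifact
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )


def upload(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes it easy to inspect what would be transmitted before anything goes over the wire. The server URL should point at a host inside the customer’s network.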
The server is platform agnostic. It can run on any platform (Unix, Windows, private data center, VM, cloud…). The ycrash server needs only Java 8 or above to run. Furthermore, the server can be made highly available by running it on multiple nodes behind any type of load balancer (F5, Apache, NGINX…).
Does the server & script need to be updated regularly?
ycrash server: We make one new release every quarter and notify you whenever a release is out. You can upgrade the ycrash server on your own schedule.
ycrash script: Not frequently. But if we plan to capture additional artifacts (which we do), then the script needs to be updated.
How long does the capture/analyze process take to run?
It takes 120 seconds to capture all the artifacts; this value is configurable. The analysis is done as soon as the artifacts are captured and transmitted to the ycrash server.
Are there any notifications when the script is triggered or when the analysis is complete?
An email notification will be sent out once all artifacts are captured and analysis is complete.