Monitoring serverless applications
The Thundra Application Observability and Security Platform was originally developed inside Opsgenie and spun off in 2018. In this article, we will go into detail about Thundra’s features. We will be using a real-life example of a serverless application and show how to boost its visibility by using distributed tracing.
In the early ‘60s, Joseph Carl Robnett Licklider came up with the idea to connect people and data anytime, anywhere through Arpanet, which was the predecessor of the internet. Then in the good old ‘90s, networks and virtualization became popular thanks to telecommunications companies. It was revolutionary when companies could cut costs by sharing resources.
The transition from bare metal to the cloud excited both small and large enterprise organizations. In the early years of the new millennium, vendors like Amazon, Google, Microsoft, IBM, and others started to introduce their managed cloud services. IT managers began to think about how fast they could shift from on-premise to the cloud, without even questioning whether the cloud was something they should be moving toward.
Cloud computing brought the phenomenon of everything as a service, from infrastructure and software to functions. IaaS, PaaS, and SaaS, then CaaS and FaaS, arrived in turn, enabling users to manage fewer operations and focus more on their business logic. The snowball kept rolling and turned into the container hype of the 2010s, which then led the way to serverless computing.
Serverless lets you develop microservices, mobile apps, and APIs quickly and cost-efficiently, without thinking about servers. This new cloud-computing execution model took ease of use to the next level, with both pros and cons. The biggest problems of cloud applications, such as scalability and cost, turned into advantages with the serverless approach for many use cases. Simplified back-end tasks boosted the productivity of many developers. But of course, there are challenges, such as resource and execution limits, monitoring and debugging, and security and privacy.
Serverless applications are highly distributed and work with asynchronous events, which makes them hard to track with the pile of logs provided by default. The cloud vendors’ environments are typically not open source; hence, analyzing the performance of a stateless function can be complicated in most cases. Observability is critical because if you do not keep an eye on your serverless applications, they can fail and lead to problems like unavailability or high cost. Traditional APMs add overhead to your invocations and can’t live in stateless environments.
What Is Thundra?
Thundra started as an internal project inside Opsgenie to enable the engineering team to optimize their resources and resolve problems while migrating to serverless in AWS. Then in 2018, Thundra spun off from Opsgenie, which was acquired by Atlassian. The Thundra Application Observability and Security Platform provides end-to-end visibility, anomaly detection, debugging, troubleshooting, alerting, and automated actions for serverless-centric, container, and virtual machine workloads. Thundra can offer application management, security, and compliance together in a manner that pinpoints issues down to lines of code.
Thundra helps to increase performance, optimize cost, and reduce MTTR by up to 80%. Customers can understand application behaviors and troubleshoot in just minutes. Thundra’s security guardrails and compliance features help prevent security vulnerabilities with whitelist/blacklist policies and automatically detect anomalies, so discovering bottlenecks and detecting issues in development and production is 90% faster. Thundra also enables customers to debug serverless-centric applications online and offline, line by line.
Thundra helps software teams resolve complex issues in asynchronous, distributed serverless environments. With flexible querying and alerting capabilities, along with rich visualizations of aggregated metrics, logs, and traces, software teams can quickly respond to incidents and solve performance problems in their serverless and containerized environments. All of this is done with easy setup and minimal additional overhead. Thundra enables its users to debug, monitor, troubleshoot, and secure their serverless and containerized environments in AWS. It collects data either by instrumenting code, wrapping AWS Lambda functions so that they generate monitoring data, or by reading that data from the Amazon CloudWatch service. Thundra also helps developers debug locally during the development phase of a product and monitor in real time during the production phase. With its presence in several US, EU, and APAC AWS regions, Thundra is able to minimize the latency added to AWS Lambda functions’ invocation durations, and it serves a wide range of developers by supporting the Java, Node.js, Python, C#, and Go runtimes.
For those who are concerned about overhead, which means an increase in the execution duration of the AWS Lambda function, Thundra has an alternative method of integration called “Asynchronous Monitoring.” Thundra publishes monitoring data through CloudWatch, which AWS considers a best practice, as explained in its “Serverless Architectures with AWS Lambda” whitepaper. By using asynchronous monitoring, customers do not have to worry about their function failing because of Thundra, since the Thundra Lambda that sends the data runs independently. The most important advantage of asynchronous monitoring is that writing logs to CloudWatch adds only negligible overhead to your Lambda.
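To make the idea concrete, here is a minimal sketch of wrapping a handler and publishing monitoring data as a log line. It is not Thundra's library or API: the `monitored` decorator, the record fields, and the `submit_post` handler are all hypothetical names chosen for illustration. The sketch relies only on the fact that anything a Lambda function writes to standard output ends up in CloudWatch Logs.

```python
import functools
import json
import time


def monitored(handler):
    """Wrap a Lambda-style handler and emit one structured log line per invocation.

    Hypothetical helper for illustration; not part of any Thundra library.
    """
    @functools.wraps(handler)
    def wrapper(event, context):
        started = time.perf_counter()
        record = {"type": "invocation", "function": handler.__name__, "error": None}
        try:
            return handler(event, context)
        except Exception as exc:
            # Record the error type, then let the original exception propagate.
            record["error"] = type(exc).__name__
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - started) * 1000, 3)
            # On AWS Lambda, stdout is forwarded to CloudWatch Logs, so a separate
            # consumer can pick this line up later without delaying the invocation.
            print(json.dumps(record))
    return wrapper


@monitored
def submit_post(event, context):
    # Hypothetical business logic for the example.
    return {"status": "saved", "title": event["title"]}
```

Because the wrapper only prints a line and never calls an external service, a failure in the monitoring backend cannot fail the function itself, which is the property the asynchronous approach is after.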
Let’s use a real-life example to demonstrate the benefits of Thundra. We will use a blog application to demonstrate how Thundra can boost the visibility of customers’ serverless applications by distributed tracing. This application has two different personas: blog authors and editors. Using the blog application, blog authors can submit their drafts, get feedback, and edit/delete their blog posts. The application also allows editors to review drafts, submit feedback, and publish blogs when ready.
The blog application is designed with a serverless paradigm and decomposed into many small Lambda functions, each with a narrow responsibility and least-privilege permissions. When a request first comes into the system, an asynchronous chain of invocations begins through events flowing between Lambda functions. This is called a “business flow,” which represents a meaningful chain of invocations that achieves a real job for the users of the software. As an example, when an author submits a blog post, a business flow of making that blog post ready for review gets started.
Let’s take a closer look.
The business flow can be seen in the screenshot above, which is taken from the Thundra console and is drawn automatically.
Let’s walk through those operations one by one:
- The blog application should immediately display a message explaining that the post is saved and will be reviewed by editors.
- The new blog post is ingested into an SQS queue.
- A worker Lambda takes over the post from SQS and publishes a message to SNS to notify the editors.
- The same Lambda saves the blog post to DynamoDB so that editors can use it for review.
- The new DynamoDB record triggers another Lambda, which writes the content to Elasticsearch with the necessary indexes so that it is searchable among millions of blog posts.
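The flow in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the queue, topic, table, and search index are replaced by in-memory stand-ins so the example runs anywhere, and the handler names and the post fields (`id`, `title`, `body`) are hypothetical rather than taken from the actual blog application.

```python
import json
from collections import deque

# In-memory stand-ins so the sketch runs without AWS access.
queue = deque()       # stands in for the SQS queue
notifications = []    # stands in for the SNS topic
posts_table = {}      # stands in for the DynamoDB table
search_index = {}     # stands in for the Elasticsearch index


def submit_handler(event):
    """Accept the post, enqueue it, and answer the author right away."""
    queue.append(json.dumps(event))
    return {"message": "Your post is saved and will be reviewed by editors."}


def worker_handler():
    """Take one post from the queue, notify the editors, and save it."""
    post = json.loads(queue.popleft())
    notifications.append(f"New post for review: {post['title']}")
    posts_table[post["id"]] = post
    return post["id"]


def indexer_handler(post_id):
    """React to the saved record and make the post searchable by word."""
    post = posts_table[post_id]
    for word in post["body"].lower().split():
        search_index.setdefault(word, set()).add(post_id)


response = submit_handler(
    {"id": "p1", "title": "Tracing 101", "body": "Serverless tracing basics"}
)
indexer_handler(worker_handler())
```

In the real system each step is triggered by an event rather than by a direct call, which is exactly why the chain is hard to follow from logs alone.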
When you look at all this from the developers’ perspective, it’s hard to locally replicate the issues happening in production to understand the behavior of the applications “under the hood.” These problems can all be summed up in one phrase: understanding application behavior and sustaining application health. We believe there are three pillars of understanding application health: observability, debugging/testing, and security/compliance.
Observability for Microservices
We have entered a new era of automatic observability. While enterprises migrate to serverless-centric workloads, event-based architectures are being used to cope with complex business logic and the asynchronous programming model.
Thundra defines three pillars of observability: metrics, traces, and logs.
Developers can reduce context-switching across multiple tools. Instead of using multiple tools to apply the practices of observability engineering, developers can use a true end-to-end management platform for distributed applications across serverless architectures. With 17 metrics included by default (some runtime specific, e.g., Java or Go), Thundra can provide coverage for:
- Invocation Counts
- Invocation Durations
- Memory Usages
- CPU Percentages
- Disk IO Bytes
- Process Memory Usages
- Thread Count
- GC Counts
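As a rough illustration of what gathering a few of these metrics inside a function can look like, the sketch below uses only the Python standard library. It is an assumption-laden example, not how Thundra collects its metrics: `collect_metrics` and the key names are hypothetical, CPU is reported as seconds rather than a percentage, and the `resource` module is available only on Unix-like systems such as the Linux-based Lambda runtime.

```python
import gc
import resource  # Unix-only module
import threading
import time


def collect_metrics(started_at):
    """Return a snapshot of process-level metrics for the current invocation.

    started_at is a time.perf_counter() value taken when the invocation began.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "duration_ms": (time.perf_counter() - started_at) * 1000,
        # User plus system CPU time consumed by this process so far.
        "cpu_seconds": usage.ru_utime + usage.ru_stime,
        # Peak resident set size: kilobytes on Linux, bytes on macOS.
        "max_rss": usage.ru_maxrss,
        "thread_count": threading.active_count(),
        # Number of collections run so far, one entry per GC generation.
        "gc_counts": [gen["collections"] for gen in gc.get_stats()],
    }
```

A wrapper around the handler could call `collect_metrics` at the end of each invocation and ship the result alongside the invocation's trace and logs.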
With compatibility for OpenTracing, Thundra provides full tracing capability, including local and distributed tracing, where you can check your Lambda interactions with other resources.
In the same dashboard, developers can quickly dive into logs and easily connect them with invocations and traces.
With the ability to view from different perspectives (e.g., duration, error, cost, resource consumption), Thundra helps users pinpoint the root cause of errors in AWS Lambda functions and other resources. Errors, cold starts, and timeouts can thus be dramatically reduced. Thundra enables you to understand the issues behind the errors in stateless environments and effortlessly track the health of third-party APIs and resources.
Real-Time Debugging for Serverless Applications
Since developers lost access to the underlying infrastructure, the only resources they now have to debug serverless applications are the logs printed as a result of an invocation. However, it takes a lot of time to write a new log line, deploy the function, and check the logs once again. Step-through debugging, setting breakpoints, and inspecting variables can increase developer efficiency. To some extent, the AWS Toolkit and SAM make it easy to perform step-through debugging in your local environment. However, recreating the same AWS environment locally can be hard due to permissions, triggers, and VPC configurations. Sometimes, troubleshooting an issue without live debugging can be complicated and time-consuming. Being able to step through live functions with the comfort of traditional debugging in your IDE offers a convenient way to accomplish that task.
To give developers the same debugging experience for Lambdas in the AWS environment that they have in their local environment, Thundra released its Online Debugger. This tool enables developers to debug the Lambda function in its native environment from their IDE, giving them the capability to perform traditional debugging tasks such as setting breakpoints and viewing and changing variables. Customers can do this with just a few configuration changes, and without modifying their code. The Online Debugger supports the Java, Node.js, and Python runtimes. To make debugging even easier, Thundra has published a VS Code plugin for Node.js and Python and an IntelliJ plugin for Java; for other environments, it has published a Python client.
Application teams needed a different monitoring method to track all the async messages traveling between functions, because old-school monitoring of metrics and logs doesn’t help with such distributed systems. Developers need more high-cardinality data in chronological order showing the behavior of the async architecture. Distributed tracing came to the rescue, addressing this pain point of understanding the behavior of modern applications and maintaining applications’ health.
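The core mechanism behind distributed tracing is small: every unit of work is recorded as a span, and the trace context travels inside the asynchronous message so the consumer can attach its own span to the same trace. The sketch below shows that idea with plain dictionaries. The function names (`start_span`, `inject`, `extract`) echo common tracing vocabulary, but they are hypothetical helpers written for this example, not the API of Thundra or of any tracing library.

```python
import time
import uuid

spans = []  # in a real system these would be shipped to a tracing backend


def start_span(operation, parent=None):
    """Create a span that joins the parent's trace, or start a new trace."""
    span = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "operation": operation,
        "start": time.time(),
    }
    spans.append(span)
    return span


def inject(span, message):
    """Copy the trace context into the message so the consumer can continue it."""
    message["trace_context"] = {
        "trace_id": span["trace_id"],
        "span_id": span["span_id"],
    }
    return message


def extract(message):
    """Read the trace context back out of an incoming message, if present."""
    return message.get("trace_context")


# Producer function: starts the trace and sends an asynchronous message.
producer_span = start_span("submit-post")
message = inject(producer_span, {"body": "draft"})

# Consumer function: runs later, possibly elsewhere, and joins the same trace.
consumer_span = start_span("process-post", parent=extract(message))
```

Sorting the collected spans by `start` and grouping them by `trace_id` yields the chronological, per-request view that plain metrics and logs cannot provide.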
Distributed tracing can fall short much of the time because most of the business logic still runs inside a function rather than between functions. When anything goes wrong in big Lambda functions, developers are left alone with only the logs; in these cases, there should be a structured logging mechanism carefully placed in the application to use the logs efficiently. However, this requires a ton of manual work and still pollutes the code. In this situation, a developer would be happy to debug these applications with an IDE-like experience, walking through the business logic to see whether anything unexpected occurred while the code was executing. To address this need, Thundra’s offline debugging capability enables developers to debug serverless functions after they finish executing.
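One way to picture debugging after the fact is to record the local variables at every executed line, then replay those snapshots once the function has finished. The sketch below does this with Python's standard `sys.settrace` hook. It illustrates the general technique only and makes no claim about how Thundra implements offline debugging; `record_execution` and `word_count` are hypothetical names.

```python
import sys


def record_execution(func, *args, **kwargs):
    """Run func and record (line number, local variables) at each traced step."""
    snapshots = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace calls into other functions
        if event in ("line", "return"):
            # Copy the locals so later changes do not alter this snapshot.
            snapshots.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, snapshots


def word_count(text):
    words = text.split()
    count = len(words)
    return count


result, snapshots = record_execution(word_count, "serverless is distributed")
```

Each entry in `snapshots` pairs a line number with the variables visible at that point, so a developer can step through the finished run without adding a single log statement. Tracing every line slows execution noticeably, so a real tool would need to limit what it records.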
Security is a crucial topic that should be handled with care even in managed environments like serverless. Serverless computing has evolved since its inception, and it allows every organization to move quickly without needing to think about what’s going on behind the scenes. That has resulted in an explosion of new services that fit into the serverless ecosystem. Serverless environments are less vulnerable than non-serverless environments; however, that does not mean security vulnerabilities don’t occur. In most cases, securing your serverless applications is all about making sure that you have tightly closed potential security holes, for example by distinguishing sensitive from non-sensitive data. A lot of security exploits with serverless-related technologies come from misconfiguration, which exposes non-public information publicly. This was a problem long before the arrival of serverless, but now we can simply press a couple of buttons and let the fully managed power of serverless technologies do most of the work.
When building serverless applications, a common area that slips through the cracks is limiting the inner API calls being made inside AWS Lambda functions. From the outside, you can limit these with AWS IAM permissions, but that only has an impact on AWS resource access.
What if you wanted to block all API URLs except for the ones that are approved by your security team? This would require some extra tooling on top of AWS Lambda to inspect what requests are being made inside of your AWS Lambda function, and then take action based on what you defined.
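A minimal sketch of such tooling is a guard placed around the function's outbound calls that compares the target host against an approved list. Everything named here is an assumption made for illustration: the `APPROVED_HOSTS` set, the `guarded` wrapper, and the `fake_send` stand-in for an HTTP client are hypothetical and do not represent Thundra's implementation.

```python
from urllib.parse import urlparse

# Hypothetical list of approved hosts; in practice the security team would own it.
APPROVED_HOSTS = {"api.example.com", "sqs.us-east-1.amazonaws.com"}


class BlockedRequest(Exception):
    """Raised when a function tries to call a host that is not approved."""


def guarded(send):
    """Wrap a send(url, ...) callable so that it only reaches approved hosts."""
    def wrapper(url, *args, **kwargs):
        host = urlparse(url).hostname
        if host not in APPROVED_HOSTS:
            raise BlockedRequest(f"outbound call to {host!r} is not approved")
        return send(url, *args, **kwargs)
    return wrapper


def fake_send(url):
    # Stand-in for a real HTTP client call so the sketch runs without a network.
    return f"sent request to {url}"


send = guarded(fake_send)
```

Calling `send("https://api.example.com/posts")` goes through, while a call to any host outside the set raises `BlockedRequest` before a request is made.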
In the above image, we can see that we have a few existing AWS IAM permissions on our AWS Lambda function. We can also see that we added a blacklisted resource which will block ALL READ requests to SQS.
As you can see, we have two options:
- Alert and Notify Me
- Block and Notify Me
If you select “Alert,” the AWS Lambda execution will continue to run; however, as the name and description imply, a message will be sent to your team letting them know what took place.
The other option of “Block” will actually stop the operation from taking place. This means that your AWS Lambda execution will stop. If you’re working on highly critical applications where you need high levels of security, and strict security policies are widespread throughout your company and development team, then this option may be worth it.
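The difference between the two options can be sketched as a small policy check: both modes send a notification, but only “Block” stops the operation. The policy format, the `enforce` function, and the in-memory `notifications` list are hypothetical constructs for this example; they are not Thundra's configuration format.

```python
from enum import Enum


class Action(Enum):
    ALERT = "alert"
    BLOCK = "block"


class OperationBlocked(Exception):
    """Raised when a policy in block mode matches the attempted operation."""


# Hypothetical policy mirroring the example in the text: READ requests to SQS.
POLICIES = [{"resource": "SQS", "operation": "READ", "action": Action.BLOCK}]

notifications = []  # stands in for the message sent to the team


def enforce(resource, operation, policies=POLICIES):
    """Notify on a policy match; raise only if the matching policy blocks."""
    for policy in policies:
        if policy["resource"] == resource and policy["operation"] == operation:
            action = policy["action"]
            notifications.append(f"{action.value}: {operation} on {resource}")
            if action is Action.BLOCK:
                raise OperationBlocked(f"{operation} on {resource} is not allowed")
            return action  # alert mode: the caller continues as normal
    return None  # no policy matched
```

With the policy above, `enforce("SQS", "READ")` records a notification and raises `OperationBlocked`, whereas the same policy with `Action.ALERT` records the notification and lets execution continue.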
The complexity of building microservices while benefiting from the latest cloud technologies often proves to be a burden for developers. Both engineering and DevOps teams can face observability challenges during the development or production stages, which can adversely impact cost and efficiency. Thundra’s serverless observability platform for testing, debugging, monitoring, troubleshooting, and securing your serverless applications offers rich visualizations of aggregated metrics, logs, and traces. We are happy to call Thundra a tool that gives you relief from juggling tools.
Thundra is an AWS Partner Network (APN) Advanced Technology Partner with the AWS DevOps Competency. Our cloud-native observability tool helps you test, debug, monitor, and troubleshoot AWS Lambda functions and their environment. Thundra is available on the AWS Marketplace, and its software-as-a-service (SaaS) offering supports the five most popular runtimes on the market, with observability libraries written for Java, Node.js, Python, C#, and Go.