Serverless chaos engineering – interview with Emrah Samdan
What is chaos engineering and what can it be used for in serverless setups? What is Thundra and how can it help secure your serverless architecture? JAXenter sat down with Serverless Architecture Conference speaker Emrah Samdan to find out the answers to these questions and learn more about chaos engineering in a serverless world.
JAXenter: Hi Emrah and thanks for taking the time for this interview. In your session at the Serverless Architecture Conference, you are speaking about chaos engineering. Can you for starters perhaps explain the basic concept?
Emrah: In its simplest definition, chaos engineering is breaking things on purpose to understand how your software system reacts to turbulent situations. With the rise of microservice patterns, software systems are becoming more and more distributed. Engineers now architect systems composed of resources that live in different regions and have different failure modes. These resources mostly interact with each other asynchronously. In this context, failures somewhere in the distributed architecture become inevitable.
Chaos engineering was introduced by Netflix in 2010 to create highly resilient software architectures even at the highest levels of distribution complexity. With chaos engineering, software practitioners can inject failures into their architecture in a controlled fashion to see whether or not the whole system stays robust. Injecting an error into a subsystem is common, but injecting latency is equally important. You need to know how your system will perform when a downstream service starts to respond more slowly than normal.
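As a rough illustration of the two injection types mentioned above, the following Python sketch wraps a function so that it can be delayed or made to fail on purpose. The names `chaos`, `InjectedFailure`, `error_rate` and `latency_s` are hypothetical and chosen only for this example; this is not Thundra's or Netflix's API.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised by the chaos wrapper in place of a real downstream error."""


def chaos(error_rate=0.0, latency_s=0.0, rng=random.random, sleep=time.sleep):
    """Decorator that injects latency and/or errors into a function call.

    error_rate: probability (0.0 to 1.0) of raising InjectedFailure.
    latency_s:  extra delay in seconds added before every call.
    rng, sleep: injectable so that an experiment can be made deterministic.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slowly responding dependency
            if rng() < error_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical downstream call: 20% of calls fail, every call is 50 ms slower.
@chaos(error_rate=0.2, latency_s=0.05)
def read_order(order_id):
    return {"id": order_id, "status": "shipped"}
```

Keeping the failure rate and delay as explicit parameters is one way to keep an experiment controlled: both can be set to zero to switch the injection off.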
JAXenter: Why is chaos engineering a perfect fit for serverless applications?
Emrah: Serverless brought a new paradigm to software development by dividing software components into the tiniest pieces with atomic responsibilities. Serverless has numerous advantages, like shorter time to market and easier development, but it also makes your software architecture even more distributed and increases the failure surface. In such a distributed architecture, developing a chaos engineering discipline is very important because there are now even more components with different failure modes, permissions, timeout settings and so on. To understand the complexity of serverless applications, you need to face the potential issues in advance and prepare yourself and your team for incidents.
JAXenter: What challenges do developers face when they are building highly resilient serverless apps?
Emrah: The first one that comes to my mind is improperly tuned timeouts. If you are making synchronous calls to other functions or to third-party APIs, you should be very careful when setting a timeout value for your function. If the downstream API or function that your function interacts with starts to respond slowly, your function may time out while waiting idly. For this reason, it’s crucial to run chaos engineering experiments that simulate slowly responding downstream services. This is done by injecting latency into downstream services.
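A minimal Python sketch of such an experiment, assuming a hypothetical `slow_downstream` function that stands in for a dependency with injected latency: the caller gives up after its own, shorter timeout instead of waiting idly until the platform kills it. The function names and the timeout values are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_downstream(delay_s):
    """Stand-in for a downstream service with injected latency."""
    time.sleep(delay_s)
    return "payload"


def call_with_timeout(func, timeout_s, *args):
    """Run func(*args) but stop waiting after timeout_s seconds.

    Returns the result, or None if the call did not finish in time.
    The worker thread is not cancelled; the caller only stops waiting.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # the caller can now fall back instead of hanging
    finally:
        pool.shutdown(wait=False)


# Healthy dependency: answers well within the timeout.
fast_result = call_with_timeout(slow_downstream, 1.0, 0.0)
# Experiment: 0.3 s of injected latency against a 0.05 s timeout.
slow_result = call_with_timeout(slow_downstream, 0.05, 0.3)
```

Running the same call with increasing injected delays shows the point at which the chosen timeout stops being sufficient.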
The second problem that developers might face is failures in the resources that they are using. For example, the database they are using might start throwing exceptions, so that they cannot retrieve or write their mission-critical data. It is vital that serverless developers have a Plan B for this type of issue. Examples of such plans are exponential back-off mechanisms and circuit breakers. Similarly, serverless functions can be triggered by many different resources. During trigger operations, your function may face an issue and produce an error. In such cases, the cloud vendor, AWS in our example, retries the invocation with the same input in case the previous error was temporary. However, every resource has a different retry mechanism. For example, AWS Kinesis retries until the data expires, whereas for AWS SNS it is only twice.
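To make the exponential back-off idea concrete, here is a small Python sketch under stated assumptions: `retry_with_backoff` and its parameters are hypothetical names for this example, and the randomized delay ("full jitter") is one common variant rather than something prescribed in the interview.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep):
    """Call operation() and retry on any exception.

    The upper bound of the wait doubles after each failed attempt
    (base, 2*base, 4*base, ...) and is capped at max_delay_s. The actual
    wait is a random value between 0 and that bound, so that many callers
    do not retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            bound = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, bound))


# Hypothetical flaky database write: fails twice, then succeeds.
attempts = {"count": 0}

def write_record():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database temporarily unavailable")
    return "written"

result = retry_with_backoff(write_record, base_delay_s=0.01)
```

When the platform already retries on its own, as described above for Kinesis and SNS triggers, retries inside the function add to those, so the total number of attempts is worth checking during an experiment.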
The last one is giving the correct roles and permissions to serverless functions. You can face an illegal access exception if someone on your team deploys a function with insufficient permissions. In this case, your function might lose access to the resources it uses or stop working altogether.
JAXenter: What is Thundra and how can it help to be prepared for unpredictable failure scenarios?
Emrah: Thundra is a serverless observability tool that helps developers to debug, monitor and troubleshoot serverless architectures. Thundra achieves this by aggregating distributed and local traces, performance metrics and application logs into a single pane of glass. Using Thundra, developers can detect the bottlenecks in their complex serverless architectures at a glance, thanks to a color-coded architectural view and smart alerts.
When a failure or a slowdown happens in your architecture, Thundra automatically creates an alert with an explanation of the error stack and the blast radius (the part of the serverless architecture that is impacted by the issue). Using this information, developers can resolve the issue a lot faster than with their traditional monitoring tools.
Although Thundra provides deep-dive granularity for incidents in serverless architectures, it encourages and evangelizes a proactive approach to resolving issues. For this reason, Thundra provides a toolset for running chaos experiments on serverless architectures and testing the resiliency of your architecture even before an issue happens. We frankly believe that you can tackle the issues if your team has enough experience, and we believe that this is only possible through controlled chaos experiments that are run periodically.
JAXenter: What is the key takeaway that attendees of your session should take home with them?
Emrah: I’m aiming to educate the audience about the importance of chaos engineering and why it is particularly important for serverless applications. The audience will leave my talk with the following takeaways:
- Chaos engineering is as critical as other testing methodologies for modern applications.
- Issues are more likely in serverless than in “traditional” microservices because serverless is even more distributed. So, the best way to get ourselves ready is to experience some issues in advance under our control.
- If you’re new to chaos engineering and serverless, breaking things on purpose in production may not be the best idea because you’re just getting started with this new paradigm. In such cases, you may prefer to start chaos engineering on staging.
- There are many failure modes in serverless, but you can simulate most of them by injecting either latency or errors into serverless functions and resources.
- If you’re new to chaos engineering, it’s better to use an automation tool that can make your first steps safer. Thundra is a unique tool to achieve this.
- After you discover issues as a result of a chaos engineering experiment, you can implement fixes such as exponential backoffs, tuned timeouts or circuit breakers.
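
One of the fixes named in the last takeaway, a circuit breaker, can be sketched in Python as follows. This is a simplified illustration under assumptions: the class name, the failure threshold and the reset timeout are hypothetical, and a production implementation would also need to handle concurrency.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Stops calling a failing dependency for a while, then tries again.

    After failure_threshold consecutive failures the circuit opens and
    calls fail fast. Once reset_timeout_s has passed, one trial call is
    allowed: success closes the circuit, failure opens it again.
    """

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, skipping downstream call")
            # Timeout elapsed: fall through and allow a single trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

A function would typically keep one breaker per downstream dependency and wrap each call, for example `breaker.call(lambda: client.get(key))`, where `client` stands for whatever resource the function talks to.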