Pick a serverless fight: A comparative study of AWS Lambda, Azure Functions & Google Cloud Functions
Here comes the serverless discussion! We’ve seen report after report taking the pulse of this increasingly popular technology, and we’ve seen the giants, one after another, rolling out serverless offerings. Now, it’s time to take a closer look at some academic research.
This time, however, we are not going to discuss any of the above. This article is going to be a bit more… academic!
The paper “Peeking Behind the Curtains of Serverless Platforms” is a comparative study of the three big serverless providers: AWS Lambda, Azure Functions, and Google Cloud Functions. Its authors (Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift) conducted the most in-depth study (so far) of resource management and performance isolation on these three platforms.
The study systematically examines a series of issues related to resource management including how quickly function instances can be launched, function instance placement strategies, and function instance reuse. What’s more, the authors examine the allocation of CPU, I/O and network bandwidth among functions and the ensuing performance implications, as well as a couple of exploitable resource accounting bugs.
Did I get your attention now?
In this article, we present an overview of the most interesting results from the original paper.
Let’s get started!
First things first. Let’s have a quick introduction to the methodology of this study.
The authors conducted this research by integrating all the necessary functionalities and subroutines into a single function that they call a measurement function.
According to the definition found in the paper, this function performs two tasks:
- Collect invocation timing and function instance runtime information
- Run specified subroutines (e.g., measuring local disk I/O throughput, network throughput) based on received messages
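To make the two tasks above concrete, here is a minimal sketch of what such a measurement function might look like. The subroutine names, event fields, and use of the process ID as an instance marker are my own illustrative assumptions, not the authors’ actual code:

```python
import os
import tempfile
import time

def measure_disk_io(size_bytes=1024 * 1024):
    """Time a sequential write + read to estimate local disk throughput in MB/s."""
    path = os.path.join(tempfile.gettempdir(), "io_probe")
    data = os.urandom(size_bytes)
    start = time.time()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        f.read()
    elapsed = time.time() - start
    os.remove(path)
    return (size_bytes / (1024 * 1024)) / elapsed

# Map of message names to probes; the real function ran more of these
# (e.g., network throughput, CPU timestamping).
SUBROUTINES = {"disk_io": measure_disk_io}

def handler(event, context=None):
    """Entry point: record timing/instance info, then run the requested probe."""
    invoked_at = time.time()
    result = {
        "invoked_at": invoked_at,
        # A per-process marker: if it repeats across invocations, the same
        # (warm) instance handled both requests.
        "instance_id": os.getpid(),
    }
    name = event.get("subroutine")
    if name in SUBROUTINES:
        result[name] = SUBROUTINES[name]()
    result["duration_ms"] = (time.time() - invoked_at) * 1000
    return result
```

Deployed with different memory sizes on each platform and driven by messages like `{"subroutine": "disk_io"}`, a function of this shape can produce the kind of comparative data the paper reports.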
In order to have a clear overview of the specifications for each provider, the following table provides a comparison of function configuration and billing in the three services.
The authors examined how instances and VMs are scheduled on the three serverless platforms in terms of instance coldstart latency, lifetime, scalability, and idle recycling, and the results are extremely interesting.
Scalability and instance placement
One of the most intriguing findings, in my opinion, concerns the scalability and instance placement of each provider. There is a significant discrepancy among the three services, with AWS offering by far the best support for concurrent execution:
AWS: “3,328MB was the maximum aggregate memory that can be allocated across all function instances on any VM in AWS Lambda. AWS Lambda appears to treat instance placement as a bin-packing problem, and tries to place a new function instance on an existing active VM to maximize VM memory utilization rates.”
Azure: Although Azure’s documentation states that it will automatically scale out to at most 200 instances for a single Nodejs-based function, launching at most one new function instance every 10 seconds, the authors’ tests of Nodejs-based functions showed “at most 10 function instances running concurrently for a single function”, no matter how the interval between invocations was changed.
Google: Contrary to Google’s claim that HTTP-triggered functions scale quickly to the desired invocation rate, the service failed to deliver the expected scalability in the study. “In general, only about half of the expected number of instances, even for a low concurrency level (e.g., 10), could be launched at the same time, while the remainder of the requests were queued.”
Coldstart and VM provisioning
Concerning coldstart (the process of launching a new function instance) and VM provisioning, AWS Lambda appears to be at the top of its game:
AWS: Two types of coldstart events were examined: “a function instance is launched (1) on a new VM that we have never seen before and (2) on an existing VM. Intuitively, case (1) should have significantly longer coldstart latency than (2) because case (1) may involve starting a new VM.” However, the study shows that “case (1) was only slightly longer than (2) in general. The median coldstart latency in case (1) was only 39 ms longer than (2) (across all settings). Plus, the smallest VM kernel uptime (from /proc/uptime) that was found was 132 seconds, indicating that the VM has been launched before the invocation.” Therefore, these results show that AWS has a pool of ready VMs! What’s more, concerning the extra delays in case (1), the authors argue that they are “more likely introduced by scheduling rather than launching a VM.”
Azure: According to the findings, it took much longer to launch a function instance in Azure, despite the fact that its instances are always assigned 1.5 GB of memory. The median coldstart latency in Azure was 3,640 ms.
Google: “The median coldstart latency in Google ranged from 110 ms to 493 ms. Google also allocates CPU proportionally to memory, but in Google memory size has a greater impact on coldstart latency than in AWS.”
In addition to the tests described above, the research team “collected the coldstart latencies of 128 MB, Python 2.7 (AWS) or Nodejs 6.* (Google and Azure) based functions every 10 seconds for over 168 hours (7 days), and calculated the median of the coldstart latencies collected in a given hour.” According to the results, “the coldstart latencies in AWS were relatively stable, as were those in Google (except for a few spikes). Azure had the highest network variation over time, ranging from about 1.5 seconds up to 16 seconds.” Take a look at the figure below:
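The `/proc/uptime` observation above lends itself to a tiny probe. The sketch below (my own illustration, with an assumed 60-second threshold) shows the kind of check a measurement function could run to spot pre-provisioned VMs:

```python
def vm_uptime_seconds(proc_uptime_text):
    """Parse the first field of /proc/uptime: seconds since the VM booted."""
    return float(proc_uptime_text.split()[0])

def looks_prewarmed(proc_uptime_text, threshold_s=60.0):
    """Heuristic behind the paper's observation: if a supposedly 'new' VM
    already reports a large uptime, it was booted well before our request,
    i.e. it likely came from a pool of ready VMs."""
    return vm_uptime_seconds(proc_uptime_text) > threshold_s

# Inside a Linux-based function instance you would read the real value:
#   with open("/proc/uptime") as f:
#       print(looks_prewarmed(f.read()))
```

The smallest uptime the authors ever saw was 132 seconds, which such a check would flag as pre-warmed.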
Instance lifetime
The research team defines instance lifetime as “the longest time a function instance stays active.”
Keeping in mind that users prefer longer lifetimes, Azure wins this one: Azure functions provide significantly longer lifetimes than AWS and Google, as you can see in the figures below:
Idle instance recycling
Instance maximum idle time is defined by the authors as “the longest time an instance can stay idle before getting shut down.” Specifically for each service provider, the results show:
AWS: An instance could usually stay inactive for at most 27 minutes; in fact, in 80% of the rounds, instances were shut down after 26 minutes.
Azure: No consistent maximum instance idle time was found.
Google: The idle time of instances could be more than 120 minutes. After 120 minutes, instances remained active in 18% of the experiments.
Inconsistent function usage
According to this paper, users of these serverless providers expect requests made after a function update to be handled by the new function code, “especially if the update is security-critical.” Nonetheless, the results of this research show that “in AWS there was a small chance that requests could be handled by an old version of the function”, which the authors refer to as inconsistent function usage.
Further testing made it clear that these inconsistency issues are caused by race conditions in the instance scheduler.
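Detecting this kind of staleness from the outside is straightforward in principle: bake a version marker into each deployment, echo it back with every response, and flag responses served by old code. The sketch below is my own illustration of that idea (the version string and naming convention are assumptions, not the authors’ setup):

```python
# Illustrative staleness probe: bump this constant on every deployment.
FUNCTION_VERSION = "2018-05-01-rev7"  # assumed versioning convention

def handler(event, context=None):
    """The deployed function echoes its baked-in version with each response."""
    return {"version": FUNCTION_VERSION}

def is_stale(response, expected_version):
    """True if a post-update request was served by an old deployment."""
    return response["version"] != expected_version
```

Firing requests right after an update and counting `is_stale` hits is essentially how one would observe the race the authors describe.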
CPU utilization
For this part, the research team defines the metric instance CPU utilization rate as “the fraction of the 1,000 ms for which a timestamp was recorded.” The tests showed the following for each provider:
AWS: Instances with higher memory get more CPU cycles. What’s more interesting is that, according to the study, “when there is no contention from other coresident instances, the CPU utilization rate of an instance can vary significantly, resulting in inconsistent application performance.”
Azure: Relatively high variance in the CPU utilization rates (14.1%–90%); the median was 66.9% and the standard deviation was 16%.
Google: The median instance CPU utilization rates ranged from 11.1% to 100% as function memory increased.
I/O and network
Tests on the I/O throughput for the three services showed the following:
AWS: Although the aggregate I/O and network throughput remains relatively stable, “each instance gets a smaller share of the I/O and network resources as colevel increases” (colevel being the number of coresident instances on the same VM). What’s more, “coresident instances get less share of the network with more contention.”
Azure: I/O and network throughput of an instance also drops as colevel increases. Also, it fluctuates due to contention from other coresident instances.
Google: Both the measured I/O and network throughput increase as function memory increases. However, the network throughput measured from different instances with the same memory size can vary significantly.
Resource accounting bugs
We have finally reached what is arguably one of the most interesting findings of the paper: the resource accounting issues discovered in this research.
The research team came across a couple of, let’s say, exploitable resource accounting loopholes. Namely:
Billing issue: The research uncovered a billing issue in Google that could be exploited to run sophisticated tasks at next to no cost! How is that possible, you ask? The authors found that in Google one could execute an external script in the background that continued to run even *after* the function invocation concluded. The more you know…
CPU boost: Google does it again! As the study uncovered, “there was an 80% chance that a just-launched function instance (of any memory size other than 2,048 MB) could temporally gain more CPU time than expected.”
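For intuition on the billing loophole, here is a deliberately harmless sketch of the mechanism: a handler spawns a detached child process and returns immediately, so billing stops while the child keeps running. This is my own illustration of the general technique, not the authors’ code, and the loophole was reported to the provider:

```python
import subprocess
import sys

def handler(event, context=None):
    """Launch a detached background child, then return right away.
    The invocation (and its billing window) ends here, but the child
    process can outlive it inside the function instance."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Return immediately; the child above is still running at this point.
    return {"background_pid": child.pid}
```

The authors measured the unbilled CPU time such background work could accumulate; on a correctly metered platform, the child would be frozen or killed the moment the invocation returns.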
As I have expressed before, I do appreciate and enjoy well-written scientific research with interesting, well-presented findings, and the paper by Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift is definitely such a piece!
This article just scratched the surface of the rich results included in this study. However, it is safe to say that AWS demonstrated a higher level of consistency and reliability.
I do encourage you to take a closer look at the official paper and draw your own conclusions.
Until then, take the time to picture the three giant serverless platforms engaging in an iconic minion-style fight like this one: