Who will win the serverless war?

Pick a serverless fight: A comparative study of AWS Lambda, Azure Functions & Google Cloud Functions

Eirini-Eleni Papadopoulou

Here comes the serverless discussion! We’ve seen report after report taking the pulse of this increasingly trending technology; we’ve seen the giants, one after the other, roll out serverless offerings. Now, it’s time to take a closer look at some academic research. In this article, we present an overview of the most interesting results from the paper titled “Peeking Behind the Curtains of Serverless Platforms”.

The saturation point is nowhere to be seen in the serverless discussion with tons of news coming online every day and numerous reports trying to take the pulse of one of the hottest topics out there.

This time, however, we are not going to discuss any of the above. This article is going to be a bit more…academic!

At the USENIX Annual Technical Conference ’18, which took place in Boston, USA in mid-July, a fascinating academic study was presented.

The paper “Peeking Behind the Curtains of Serverless Platforms” is a comparative analysis of the three big serverless providers: AWS Lambda, Azure Functions, and Google Cloud Functions. The authors (Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, Michael Swift) conducted the most in-depth (so far) study of resource management and performance isolation in these three providers.

SEE ALSO: The state of serverless computing: Current trends and future prospects

The study systematically examines a series of issues related to resource management including how quickly function instances can be launched, function instance placement strategies, and function instance reuse. What’s more, the authors examine the allocation of CPU, I/O and network bandwidth among functions and the ensuing performance implications, as well as a couple of exploitable resource accounting bugs.

Did I get your attention now?

In this article, we have an overview of the most interesting results presented in the original paper.

Let’s get started!


First things first.  Let’s have a quick introduction to the methodology of this study.

The authors conducted this research by integrating all the necessary functionalities and subroutines into a single function that they call a measurement function.

According to the definition found in the paper, this function performs two tasks:

  • Collect invocation timing and function instance runtime information
  • Run specified subroutines (e.g., measuring local disk I/O throughput, network throughput) based on received messages
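The paper does not publish the handler verbatim, but the two tasks above can be sketched as a single Python handler. This is my own illustration, assuming a hypothetical `SUBROUTINES` registry and event format; it is not the authors' actual code:

```python
import os
import time

# Hypothetical registry of benchmark subroutines; the real study measured
# local disk I/O throughput, network throughput, CPU, and more.
SUBROUTINES = {
    "noop": lambda: None,
    "cpu_spin": lambda: sum(i * i for i in range(100_000)),
}

def measurement_handler(event):
    """A single 'measurement function': record invocation timing and
    instance runtime info, then run whichever subroutine the incoming
    message requests."""
    start = time.time()
    subroutine = SUBROUTINES[event.get("subroutine", "noop")]
    subroutine()
    return {
        "invocation_start": start,
        "duration_ms": (time.time() - start) * 1000,
        # Runtime/instance information of the kind the study collected:
        "pid": os.getpid(),
    }
```

Packing everything into one function matters methodologically: it lets a single deployed artifact answer many different questions, instead of redeploying (and thereby perturbing) the platform for each experiment.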

For a clear overview of each provider’s specifications, the original paper includes a table comparing function configuration and billing across the three services.

The authors examined how instances and VMs are scheduled on the three serverless platforms in terms of instance coldstart latency, lifetime, scalability, and idle recycling, and the results are extremely interesting.

Scalability and instance placement

One of the most intriguing findings, in my opinion, concerns the scalability and instance placement of each provider. There is a significant discrepancy among the three big services, with AWS offering the best support for concurrent execution:

AWS: “3,328MB was the maximum aggregate memory that can be allocated across all function instances on any VM in AWS Lambda. AWS Lambda appears to treat instance placement as a bin-packing problem, and tries to place a new function instance on an existing active VM to maximize VM memory utilization rates.”
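The bin-packing behavior described above can be illustrated with a toy first-fit placement simulator. The 3,328 MB per-VM cap comes from the paper; the scheduler logic is my guess at the simplest policy consistent with the observation, not AWS's actual algorithm:

```python
VM_MEMORY_CAP_MB = 3328  # max aggregate instance memory per VM observed on AWS Lambda

def place_instance(vms, instance_mb):
    """First-fit placement: reuse an existing VM with spare memory before
    provisioning a new one, maximizing per-VM memory utilization."""
    for vm in vms:
        if sum(vm) + instance_mb <= VM_MEMORY_CAP_MB:
            vm.append(instance_mb)
            return vms
    vms.append([instance_mb])  # no active VM has room: start a new VM
    return vms

vms = []
for _ in range(13):            # thirteen 256 MB instances
    place_instance(vms, 256)   # 13 * 256 = 3328 MB, so all fit on one VM
```

Under this policy a fourteenth 256 MB instance would force a second VM, which is exactly the packing-to-the-cap behavior the memory-utilization finding below reflects.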

Azure: Although the Azure documentation states that it will automatically scale up to at most 200 instances for a single Nodejs-based function, launching at most one new function instance every 10 seconds, the authors’ tests of Nodejs-based functions showed “at most 10 function instances running concurrently for a single function”, no matter how the interval between invocations was changed.

Google: Contrary to Google’s claim that HTTP-triggered functions will quickly scale to the desired invocation rate, the service failed to deliver the desired scalability in the study. “In general, only about half of the expected number of instances, even for a low concurrency level (e.g., 10), could be launched at the same time, while the remainder of the requests were queued.”

Interesting fact: More than 89% of VMs tested achieved 100% memory utilization.

Coldstart and VM provisioning

Concerning coldstart (the process of launching a new function instance) and VM provisioning, AWS Lambda appears to be at the top of its game:

AWS: Two types of coldstart events were examined: “a function instance is launched (1) on a new VM that we have never seen before and (2) on an existing VM. Intuitively, case (1) should have significantly longer coldstart latency than (2) because case (1) may involve starting a new VM.” However, the study shows that “case (1) was only slightly longer than (2) in general. The median coldstart latency in case (1) was only 39 ms longer than (2) (across all settings). Plus, the smallest VM kernel uptime (from /proc/uptime) that was found was 132 seconds, indicating that the VM has been launched before the invocation.” Therefore, these results show that AWS has a pool of ready VMs! What’s more, concerning the extra delays in case (1), the authors argue that they are “more likely introduced by scheduling rather than launching a VM.”
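The study distinguishes "new VM" from "existing VM" coldstarts by fingerprinting the host on each invocation. One common trick, sketched here in Python (my own construction, not the paper's exact code), is to combine a module-level instance marker with the kernel uptime from /proc/uptime:

```python
import uuid

INSTANCE_ID = uuid.uuid4().hex  # set once per function instance (module scope)

def runtime_fingerprint():
    """Return info that lets the caller classify this invocation:
    a repeated INSTANCE_ID means a warm invocation; a new INSTANCE_ID on a
    VM with long uptime means a coldstart on an existing VM; a short uptime
    suggests a freshly launched VM."""
    try:
        with open("/proc/uptime") as f:
            vm_uptime_s = float(f.read().split()[0])
    except OSError:
        vm_uptime_s = None  # not on Linux, or /proc unavailable
    return {"instance_id": INSTANCE_ID, "vm_uptime_s": vm_uptime_s}
```

It is exactly this kind of uptime reading that produced the 132-second minimum quoted above, the tell-tale sign that the "new" VM had been running before the invocation arrived.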

Azure: According to the findings, it took much longer to launch a function instance in Azure, despite the fact that their instances are always assigned 1.5GB memory. The median coldstart latency was 3,640 ms in Azure.

Google: “The median coldstart latency in Google ranged from 110 ms to 493 ms. Google also allocates CPU proportionally to memory, but in Google memory size has a greater impact on coldstart latency than in AWS.”

SEE ALSO: What do developer trends in the cloud look like?

In addition to the tests described above, the research team “collected the coldstart latencies of 128 MB, Python 2.7 (AWS) or Nodejs 6.* (Google and Azure) based functions every 10 seconds for over 168 hours (7 days), and calculated the median of the coldstart latencies collected in a given hour.” According to the results, “the coldstart latencies in AWS were relatively stable, as were those in Google (except for a few spikes). Azure had the highest network variation over time, ranging from about 1.5 seconds up to 16 seconds.” Take a look at the figure below:

Source: “Peeking Behind the Curtains of Serverless Platforms”, Figure 8, p. 139

Instance lifetime

The research team defines instance lifetime as “the longest time a function instance stays active.”

Keeping in mind that users prefer longer lifetimes, Azure wins this one: Azure Functions provides significantly longer lifetimes than AWS and Google, as you can see in the figures below:

Source: “Peeking Behind the Curtains of Serverless Platforms”, Figure 9, p.140

Idle instance recycling

Instance maximum idle time is defined by the authors as “the longest time an instance can stay idle before getting shut down.” Specifically for each service provider, the results show:

AWS: An instance could usually stay inactive for at most 27 minutes. In fact, in 80% of the rounds instances were shut down after 26 minutes.

Azure: No consistent maximum instance idle time was found.

Google: The idle time of instances could be more than 120 minutes. After 120 minutes, instances remained active in 18% of the experiments.
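Measuring idle recycling boils down to invoking a function after progressively longer idle gaps and watching for the instance id to change. A minimal driver for that probing loop might look like this (a sketch of the methodology; `invoke` and `sleep` are hypothetical callables standing in for the real platform calls):

```python
def max_idle_time(invoke, gaps_minutes, sleep):
    """Probe for the longest idle gap a function instance survives.
    `invoke()` returns the id of the instance that handled the request;
    `sleep(minutes)` waits between invocations; `gaps_minutes` is an
    increasing list of idle gaps to try."""
    baseline = invoke()
    survived = 0
    for gap in gaps_minutes:
        sleep(gap)
        if invoke() != baseline:  # a new instance id => the idle one was recycled
            return survived
        survived = gap
    return survived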

Inconsistent function usage

According to this paper, users of these serverless providers expect the requests following a function update to be handled by the new function code, “especially if the update is security-critical.” Nonetheless, the results of this research show that “in AWS there was a small chance that requests could be handled by an old version of the function”, what the authors refer to in the paper as inconsistent function usage.

Further testing made it clear that these inconsistency issues are caused by race conditions in the instance scheduler.
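A simple way to observe such staleness (my own sketch, not the authors' harness) is to bake a version tag into each deployed function body and compare the tags returned after an update against the version you last pushed:

```python
DEPLOYED_VERSION = "v2"  # bumped on every code deployment

def handler(event):
    """Each deployed revision of the function reports its own version tag."""
    return {"version": DEPLOYED_VERSION}

def check_consistency(responses, expected_version):
    """After an update, every response should carry the new version tag;
    any stale tag means an old instance is still serving requests."""
    stale = [r for r in responses if r["version"] != expected_version]
    return len(stale) == 0, stale
```

If any post-update response still reports the old tag, the scheduler routed that request to an instance running the previous code, which is precisely the race the authors describe.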

CPU utilization

In this part, the research team defines the metric instance CPU utilization rate as “the fraction of the 1,000 ms for which a timestamp was recorded.”  The tests showed the following for each provider:
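That definition can be reproduced with a tight loop that stamps the wall clock for the measurement window and then counts how many distinct millisecond buckets received at least one stamp. This is a sketch of the metric's definition, not the paper's exact code:

```python
import time

def cpu_utilization_rate(window_ms=1000):
    """Fraction of the window's milliseconds in which this process was
    actually scheduled on a CPU: spin recording timestamps, then count
    which millisecond buckets got at least one stamp. Gaps appear when
    the hypervisor or cgroup throttles the instance's CPU share."""
    start = time.monotonic()
    stamped = set()
    while (elapsed := (time.monotonic() - start) * 1000) < window_ms:
        stamped.add(int(elapsed))  # bucket index of the current millisecond
    return len(stamped) / window_ms
```

On an uncontended machine this returns close to 1.0; on a throttled serverless instance the missing buckets directly reveal how small a CPU share the function was given.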

AWS: Instances with higher memory get more CPU cycles. What’s more interesting is that, according to the study, “when there is no contention from other coresident instances, the CPU utilization rate of an instance can vary significantly, resulting in inconsistent application performance.”

Azure: Relatively high variance in the CPU utilization rates (14.1%–90%), while the median was 66.9% and the SD was 16%.

Google: The median instance CPU utilization rates ranged from 11.1% to 100% as function memory increased.

I/O and network

Tests on the I/O throughput for the three services showed the following:
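The local disk measurements in the paper are dd-style sequential writes. A minimal Python equivalent (file location and sizes are illustrative choices of mine, not the study's parameters):

```python
import os
import tempfile
import time

def disk_write_throughput_mb_s(total_mb=64, block_kb=512):
    """Sequentially write `total_mb` of data in `block_kb` blocks to a
    temporary file, fsync to force it to disk, and return the achieved
    throughput in MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.monotonic()
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())       # don't let the page cache flatter the result
        elapsed = time.monotonic() - start
    os.unlink(f.name)
    return total_mb / elapsed
```

Running this from instances at different co-residency levels is how one would reproduce the "smaller share as colevel increases" effect described above.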

AWS: Despite the fact that the aggregate I/O and network throughput remains relatively stable, “each instance gets a smaller share of the I/O and network resources as colevel increases.” What’s more, “coresident instances get less share of the network with more contention.”

Azure: I/O and network throughput of an instance also drops as colevel increases. Also, it fluctuates due to contention from other coresident instances.

Google: Both the measured I/O and network throughput increase as function memory increases. However, the network throughput measured from different instances with the same memory size can vary significantly.

Resource accounting bugs

We finally reached what is, arguably, one of the most interesting findings of the paper. And I am referring to the resource accounting issues that were discovered in this research.

SEE ALSO: Cloud Foundry report: Serverless computing and container technologies are in full swing

The research team came across a couple of, let’s say exploitable, resource accounting loopholes. Namely:

Billing issue: The research uncovered a billing issue in Google that could be exploited to run sophisticated tasks at next to no cost! How is that possible, you ask? The authors found that in Google one could execute an external script in the background that continued to run even *after* the function invocation concluded. The more you know…
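In Python, launching such a detached background process from a handler amounts to something like the sketch below. Note that `long_task.sh` is a hypothetical script name of mine, and this is an illustration of the mechanism the authors describe, not a recipe that still works on current platforms:

```python
import subprocess

def handler(event):
    """Start a long-running script detached from the handler's session.
    Billing covered only the handler's own runtime, yet on the affected
    platform the child process kept running after the invocation returned."""
    subprocess.Popen(
        ["/bin/sh", "-c", "./long_task.sh"],  # hypothetical background workload
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,               # detach from the handler's process group
    )
    return {"status": "returned immediately; child may outlive this invocation"}
```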

CPU boost: Google does it again! As the study uncovered, “there was an 80% chance that a just-launched function instance (of any memory size other than 2,048 MB) could temporally gain more CPU time than expected.”


As I have said before, I do appreciate and enjoy a well-written scientific study with interesting and well-presented findings. The paper by Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift is definitely one of them!

This article just scratched the surface of the rich results included in this study. However, it is safe to say that AWS demonstrated a higher level of consistency and reliability.

I do encourage you to take a closer look at the official paper and draw your own conclusions.

Until then, take a moment to picture the three giant serverless platforms engaging in an iconic minion-style fight.

Eirini-Eleni Papadopoulou
Eirini-Eleni Papadopoulou was the editor for Coming from an academic background in East Asian Studies, she decided it was time to return to her high-school hobby, computer science, and dived into the development world. Her other hobbies include esports and League of Legends, although she has never managed to escape elo hell (yet), and she is a guest writer/analyst for competitive LoL at TGH.
