Mo’ developers, mo’ problems: How serverless has trouble with teams
Serverless is a tool like any other, but since it means a radical shift in your operations model, teams are often tempted to ‘dip a toe in’ with their first few services. This leads to many of the problems described in this article by Toby Fee, Community Developer for Stackery.
The team already uses AWS, but normally Annie relies on her team lead, Sasha, to deploy to AWS. Annie tries writing up a Lambda and asking Sasha to deploy it, but like any first draft of a complex service, it doesn’t work the first time, and sending a Slack message every time she wants to try out a new line of code gets pretty cumbersome.
The conference deadline looms, so Sasha gives Annie access to deploy to AWS. It still takes longer than expected to get the Lambda written, since every new deployment takes a few minutes, but eventually it’s up and running, and it passes the first few test cases. Annie deploys the final version to production and heads off for a few weeks’ vacation.
Unfortunately, on the day of the conference, hundreds of people try to sign up, and a query to check if the user is already registered starts eating up connections and causing latency for users logging in. Since the problem doesn’t happen during signup, it isn’t obvious that the new feature is the cause of the problem, and it’s over a day before the issue is resolved. The top Google suggestion for the company name is ‘down for everyone?’ for the next month.
What failed here?
We can say ‘the deadline should have been moved’ or ‘such a critical service should have had more resources,’ but high-quality processes should allow us to get better outcomes with the same resources. Here are some of the problems we can identify that apply to most teams early in the adoption of serverless:
- poor permissions management
Right now the only AWS permissions this team has are ‘can’t do anything’ or ‘can take down the site.’
- bus factor of 1
Exactly one person knows precisely what changed when this feature went out. Worse: the work was on the signup process, so making changes could drastically affect user growth. In the middle of an incident, even if the team sees a Lambda was recently deployed, the Lambda dashboard might not be that helpful:
What is the rate of requests to the DB, who is generating them, and which ones take the longest?
CloudWatch and X-Ray do a pretty good job of showing you performance once you know the service you’re investigating, but when your problem is ‘some DB requests are failing’ or ‘latency is way up,’ they can be a lot less helpful!
Annie was failed a few times by tooling. Having to ask someone to deploy your code to see if it even works is not a process modern developers are used to. Even with AWS permissions, waiting a few minutes to see your code update changes the rhythm of developers who are used to JS environments where changes are instantaneous.
Enough complaining, how do we fix the problem?
- Permissions are complex for a reason
The theoretical situation above didn’t have enough detail to say what the real problem with permissions was, but suffice it to say there should be some way to get the service working on a staging environment and then see it working in production.
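One way to picture a middle ground between ‘can’t do anything’ and ‘can take down the site’ is an IAM policy that only allows deploying and invoking staging functions. The sketch below builds such a policy document in Python. The `staging-` naming prefix and the helper name `staging_deploy_policy` are assumptions made for illustration, not something the team in the story actually used, and a real policy would need to be adapted to your account layout.

```python
import json


def staging_deploy_policy(prefix: str = "staging-") -> dict:
    """Build an IAM policy document limited to Lambdas with a given name prefix.

    The prefix convention is hypothetical; use whatever naming or
    account separation your team has agreed on.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DeployStagingFunctionsOnly",
                "Effect": "Allow",
                # Enough to read, update, and try out a function,
                # and nothing that touches production resources.
                "Action": [
                    "lambda:GetFunction",
                    "lambda:UpdateFunctionCode",
                    "lambda:InvokeFunction",
                ],
                "Resource": f"arn:aws:lambda:*:*:function:{prefix}*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before anyone attaches it.
    print(json.dumps(staging_deploy_policy(), indent=2))
```

With something like this in place, a developer in Annie’s position could iterate on a staging copy without messaging a team lead for every change, while production deploys stay behind review.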
In a bit of shameless self-promotion, I’ll note that Stackery makes it extremely easy to manage multiple AWS accounts and permissions, letting you deploy to some and just propose on others.
- Teams must buy into serverless
The first and most straightforward solution is that the whole team should have been working on this feature, or at least in this area. If multiple people had understood the work going on, there’d be a better chance someone would be available to help fix problems. Serverless makes it easy for a single developer to roll out a whole service on their own. But just because it’s easy doesn’t mean it’s the right way to add critical features. One-person hackathons are a great way to try new ideas, but they’re never the right way to change your signup process. Even though many aspects of serverless are unproven, it still takes buy-in from the entire team to prevent bottlenecks.
- Plan for observability
All Lambdas must export a handler function that is called each time the Lambda is triggered. This is a supremely good opportunity to wrap your code in something that will help you observe performance. Both Epsagon and IOpipe make tools to add observability in just one step.
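To make the ‘wrap your handler’ idea concrete, here is a minimal hand-rolled sketch in Python. It is not the Epsagon or IOpipe API. The decorator name `observed` and the example handler are hypothetical, and the only thing it records is duration and outcome, written as one JSON line per invocation.

```python
import functools
import json
import time


def observed(handler):
    """Wrap a Lambda handler and log one structured line per invocation."""

    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return handler(event, context)
        except Exception:
            outcome = "error"
            raise  # never swallow the error; just record that it happened
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            # In Lambda, anything printed to stdout ends up in CloudWatch Logs.
            print(json.dumps({
                "function": handler.__name__,
                "outcome": outcome,
                "duration_ms": round(duration_ms, 2),
            }))

    return wrapper


@observed
def handler(event, context):
    # Hypothetical stand-in for the 'is this user already registered?' check.
    return {"statusCode": 200, "body": json.dumps({"registered": False})}
```

Because every function logs the same fields, a question like ‘which Lambda got slow after the last deploy?’ becomes a log search rather than a guessing game.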
- Work locally
I, at least, cannot work by uploading my code to a Lambda to find out if it even works. To do the amount of testing and variation required to complete services, you must have a local version to play with. In AWS-land, this was recently made a lot simpler with the SAM CLI, which runs many AWS resources (like Lambdas and API endpoints) in a local container.
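Even before reaching for a local container, a handler is just a function, so the quickest feedback loop is to call it directly with a hand-built event. The sketch below assumes a hypothetical signup handler and a deliberately minimal stand-in for an API Gateway proxy event. Real events carry many more fields, so treat this as a first check rather than a substitute for running against a realistic environment.

```python
import json


def handler(event, context):
    """Hypothetical signup handler: rejects requests without an email."""
    body = json.loads(event.get("body") or "{}")
    email = body.get("email")
    if not email:
        return {"statusCode": 400, "body": json.dumps({"error": "email required"})}
    return {"statusCode": 200, "body": json.dumps({"email": email})}


def fake_api_event(payload: dict) -> dict:
    """Build a minimal stand-in for an API Gateway proxy event."""
    return {"httpMethod": "POST", "path": "/signup", "body": json.dumps(payload)}


if __name__ == "__main__":
    # Run the handler locally in milliseconds instead of waiting on a deploy.
    print(handler(fake_api_event({"email": "annie@example.com"}), None))
    print(handler(fake_api_event({}), None))
```

Running the file directly exercises both the happy path and the missing-email path without touching AWS at all.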
Serverless is a tool like any other, but since it means a radical shift in your operations model, teams are often tempted to ‘dip a toe in’ with their first few services. This leads to many of the problems described in this article. Along with the specific tips mentioned above, the overall takeaway should be that, like any other tool, the best way for a team to adopt it is to dive in, make sure the entire team builds competence, and design practices that plan on sticking with this tool for the long term.