A DevOps Kong Diary
Kong is an API platform for multi-cloud and hybrid organizations, available both as FOSS and as an enterprise platform. Dennis Kelly, former senior DevOps engineer at Zillow Group, implemented Kong in the Trulia DevOps group. In this article, he recounts how they vetted Kong, what solutions it offered, and how it fundamentally changed their strategy.
While working at Zillow Group, I took part in implementing Kong in the Trulia DevOps group. I had never heard of Kong, but when my manager presented the project to me, I immediately said yes. Since it was a huge undertaking and I wanted to retain what I learned, I decided to keep a diary of my experience.
The Kong project simply fell into my lap. It started with a request to share APIs between the Zillow and Trulia brands. Developers had requested a point-to-point VPN between data centers to expose a microservice, and in doing so grabbed the attention of the security, network, and systems engineering teams. From an operational perspective, it was clear this solution was neither secure nor scalable.
The Trulia DevOps group was the first to be exposed to the request and had done the initial research, including a proof of concept (PoC) using Kong. They did not have the bandwidth to take on support for all brands and make the PoC enterprise-ready for production traffic.
We didn’t spend much time reviewing Trulia’s decision to use Kong. It was clear that it was the right choice for all of us.
We were accustomed to leveraging open source solutions. Having source code and the ability to extend the product are appealing benefits. That said, we were also cautious about bringing new technology into our production environment, as open source projects come and go.
A handful of things allayed our concerns:
- Great documentation on both setup and configuration
- Release notes demonstrated active development and new features
- GitHub issues were being resolved
- Vibrant online community with not just responses but also working solutions to problems
- Built on the widely adopted and trusted web server, NGINX
- Highly performant
- Agnostic to data center infrastructure and development stacks
- Transparent proxy with no impact on the development experience
- Modular, pluggable features
- Commercial support if needed
I was given the following tenets:
- Build a service that could be consumed by all Zillow Group brands, including but not limited to Zillow, Trulia, Mortech, Hotpads, StreetEasy, and Naked Apartments
- Consistent, secure access to microservices across data centers
- Standards-based
- Quick, easy onboarding and transparent use
After developing an enterprise architecture and having it reviewed and tested by multiple DevOps teams, I presented Kong at our weekly engineering demos called “Zillow Brain Dumps” on Tuesday, August 15, 2017. The following week, I had 40+ Jira requests and meetings booked two weeks out for teams wanting to use Kong.
Over the next six months, we experienced rapid, almost-viral adoption:
- It was a DevOps win-win. Oftentimes, a solution helps either the development experience or production operations and has negative repercussions for the other group. It is not often you find something that helps developers and operations succeed at the same time.
- New opportunities arose from people consulting the documentation or presenting a case similar to what we were already doing. Kong became the solution due to its power and flexibility:
  - Public APIs
  - Cross-origin resource sharing (CORS)
  - Rate limiting
  - East-west authentication
- Kong gained a reputation for “yes, you can” instead of the typical “no” response from operations.
Around the six-month mark, when we wanted to use Kong for caching (at the time, a Kong Enterprise-only plugin), we considered moving from Kong’s free, open source product to Kong Enterprise. We had a read-only service that was struggling to keep up with production workloads. Its replacement was six months out, and we needed to either implement caching or disable the service on the website to avoid site errors.
Solving this one issue would pay for our enterprise license, and being able to use advanced rate limiting and having support for our growing production workloads made the purchase a no-brainer. As a side note, caching got us a 70 percent hit rate and lowered the average latency from 25 milliseconds to four milliseconds. And of course, it turned out to be much longer than six months before the service could be replaced. As far as I know, it is still running behind Kong today.
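To put the caching numbers in perspective, here is a back-of-the-envelope sketch in Python. It assumes a simple model in which average latency is a weighted mix of cache hits and misses. The 70 percent hit rate and the 25 ms and 4 ms averages come from the paragraph above; the 1 ms cost of a cache hit is purely an assumption for illustration.

```python
def average_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected latency when a fraction `hit_rate` of requests is served from cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def implied_miss_latency_ms(hit_rate: float, hit_ms: float, observed_avg_ms: float) -> float:
    """Solve the same equation for the miss latency that yields `observed_avg_ms`."""
    return (observed_avg_ms - hit_rate * hit_ms) / (1.0 - hit_rate)


HIT_RATE = 0.70        # from the article
BEFORE_MS = 25.0       # from the article: average latency before caching
AFTER_MS = 4.0         # from the article: average latency after caching
ASSUMED_HIT_MS = 1.0   # assumption: cost of serving a cached response

# If cache misses still cost 25 ms, the average could not fall below this floor,
# even with cache hits that cost nothing.
floor_ms = average_latency_ms(HIT_RATE, 0.0, BEFORE_MS)

# Miss latency consistent with the observed 4 ms average under the assumed hit cost.
miss_ms = implied_miss_latency_ms(HIT_RATE, ASSUMED_HIT_MS, AFTER_MS)

print(f"floor with unchanged misses: {floor_ms:.1f} ms")
print(f"implied miss latency: {miss_ms:.1f} ms")
```

Under this model, unchanged 25 ms misses would leave the average at roughly 7.5 ms, so a 4 ms average implies that uncached requests got faster too (about 11 ms with the assumed hit cost). One plausible reading, which I did not measure at the time, is that the struggling service responded faster once the cache took most of the load off it.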
Kong was considered a success for many reasons:
- It became an integral part of the core architecture and API management strategies.
- The company had struggled to foster strong relationships between development and operations and leverage expertise across brands. The project became the model for service delivery and built new relationships among development and operational teams throughout all brands.
- Being RESTful and programmatic, the Kong Admin API lends itself to DevOps practices of automation and CI/CD.
- Data-rich logs sent to an enterprise logging stack such as ELK or Splunk empower teams to own their service-level agreements and KPIs, because they gain visibility into their microservices that they did not have previously. I received a lot of positive feedback from teams when I finally integrated Kong with Splunk at Zillow Group. VPs and executives were blown away when they saw performance numbers from Splunk and watched the Vitals dashboard live in Kong Manager.
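To make the automation point concrete, here is a minimal Python sketch of onboarding a microservice through the Admin API from a script or CI/CD job. It is not the tooling we ran at Zillow Group. The endpoints (`/services`, `/services/{name}/routes`, `/services/{name}/plugins`) and the `rate-limiting` plugin fields follow Kong's open source Admin API documentation, but they vary between Kong versions, so treat them as assumptions to check against your own. The Admin URL, the `listings` service, its upstream address, and the limit of 60 requests per minute are all hypothetical.

```python
import json
import urllib.request

ADMIN_URL = "http://localhost:8001"  # assumption: Kong Admin API on its default port


def build_onboarding_requests(service_name, upstream_url, path, requests_per_minute):
    """Return (method, url, payload) tuples that register a service, a route, and a rate limit."""
    service_base = f"{ADMIN_URL}/services/{service_name}"
    return [
        # 1. Register the upstream microservice.
        ("POST", f"{ADMIN_URL}/services", {"name": service_name, "url": upstream_url}),
        # 2. Expose it on a path through the gateway.
        ("POST", f"{service_base}/routes", {"name": f"{service_name}-route", "paths": [path]}),
        # 3. Enable rate limiting for this service only.
        ("POST", f"{service_base}/plugins",
         {"name": "rate-limiting",
          "config": {"minute": requests_per_minute, "policy": "local"}}),
    ]


def send(method, url, payload):
    """Send one JSON request to the Admin API and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Dry run: print the planned calls. Swap print for send(...) against a real Kong node.
    for method, url, payload in build_onboarding_requests(
        "listings", "http://listings.internal:8080", "/listings", 60
    ):
        print(method, url, json.dumps(payload))
```

Keeping the request construction separate from the sending makes the plan easy to review in a pull request and to unit test without a running gateway, which is the kind of workflow a programmatic API makes possible.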
Beyond the success and value Kong brought to Zillow Group, my only wish is that I had learned about Kong sooner. As an engineer, it is easy to get tunnel vision: you are hyper-focused on the technology you are already using and building on; you feel like you can’t make substantial improvements without ripping everything out and starting over; and marketing material falls on deaf ears because it looks like just another company trying to solve one specific problem, or because you aren’t using the latest buzzword technology like Kubernetes, containers, or serverless.
Generally speaking, you do not look for solutions until you have a problem, nor do you add complexity without a business need. I think this is what makes Kong very different. While Kong is yet another moving part in your organization, it solves problems before you run into them and reduces complexity as you scale. What if we had simply introduced Kong as a load balancer early on, before microservices took off and we started using containers and serverless?
Even with a monolith, you are likely using load balancers. You may not think you need something as sophisticated as an API gateway; however, at some point, product teams introduce a handful of microservices and shortly after, there is talk of “destroying the monolith.” Sooner rather than later, you have hundreds of microservices and a monolith. You introduce services to handle routing and business logic but without building any intelligence into the network stack.
Product teams only have their logs and basic telemetry to debug their services. Operations has no visibility into the dependencies among services or what is causing latency and timeouts. Site errors increase, sometimes to the extent of an outage. Teams add more logic at the service layer, while operations chases problems through network statistics, packet traces, load balancer logs throughout the entire stack, and configurations at all levels. This goes on for three months, with morning and afternoon war rooms that include yelling, panic, and despair. Slight improvements are made, but the problem goes away without anyone understanding why and then resurfaces multiple times over the next 18 months. This happened at Zillow, and it was a nightmare.
So…what if we had used Kong as our load balancer?
- Migrating to a hybrid cloud environment and a new data center, we wanted to break away from our use of F5’s BIG-IP, but we struggled with the complexity and reliability of open source alternatives and did not know to consider Kong. We reluctantly purchased another pair of F5s for the new data center, which also meant divergence between our cloud and data center technologies (we used ELB/ALB in AWS). Our development and test environments were on premises with an F5, resulting in different configuration and behavior between test and production. The F5 purchase was also a significant expense.
- We introduced a gateway service, written in Java, to handle routing and cookie mangling in front of our monolith and microservices. It required significant compute resources to scale in production and came nowhere close to the functionality of Kong (metrics, authentication, rate limiting, caching, request tracing, and so on) that we would later need. The gateway cluster had four times as many nodes as our Kong cluster, and we needed to add more. The cookie handling logic could have been built entirely as a Kong plugin or, if we wanted to leverage development stacks we understood, at the very least as a small Kong plugin that contacted this service for data. With Kong in place and its capabilities understood, the teams involved agreed that Kong would have been the better solution if we had the resources to revisit it.
- Kong supports request tracing, and given that, maybe we would have had the foresight to implement it early on. I cannot stress enough how important this is in a multi-tier stack, even before you move to a full microservice architecture, with or without orchestration like Kubernetes. It is one of the most powerful debugging tools you will have in your stack.
- Even without request tracing, using Kong as the load balancer and shipping its logs to a log analyzer provides rich data and visibility in a central, immediate tool for debugging and performance tuning.
- As your microservices grow (and they will, quickly), you can onboard new features like canary and blue-green deployments, authentication, serverless integration, and circuit breaking, and even build your own intelligence into your service mesh.
- As you build APIs, you will have associated admin APIs to manage some of them. Instead of building access control for them into each microservice, which will result in divergent implementations among them, use the OpenID Connect plugin with a provider such as Okta, backed by Active Directory or your centralized authentication and authorization platform. Even in small corporations, SOX compliance will be required, and you will need controls, auditing, reporting, and documentation around your central system. By following this practice, you avoid duplicating that effort for authentication, authorization, and accounting (AAA) in every admin API implementation. Typically, meeting the requirements around your central AAA will suffice and save you a lot of time and money.
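Returning to the point above about shipping Kong's logs to a log analyzer, the following Python sketch shows the kind of per-service view those logs make possible. It assumes JSON log entries shaped like the ones Kong's logging plugins emit, with `service.name`, `response.status`, and a `latencies` object (`request` for total time, `proxy` for time spent in the upstream). The sample lines are made up for illustration, and in practice this aggregation would live in Splunk or ELK rather than in a script.

```python
import json
from collections import defaultdict
from statistics import mean

# Made-up sample lines in the shape of Kong's JSON log entries (latencies in ms).
SAMPLE_LOG_LINES = [
    '{"service": {"name": "listings"}, "response": {"status": 200}, "latencies": {"request": 12, "proxy": 9, "kong": 3}}',
    '{"service": {"name": "listings"}, "response": {"status": 200}, "latencies": {"request": 18, "proxy": 15, "kong": 3}}',
    '{"service": {"name": "listings"}, "response": {"status": 502}, "latencies": {"request": 30, "proxy": 27, "kong": 3}}',
    '{"service": {"name": "listings"}, "response": {"status": 200}, "latencies": {"request": 20, "proxy": 17, "kong": 3}}',
    '{"service": {"name": "search"}, "response": {"status": 200}, "latencies": {"request": 40, "proxy": 35, "kong": 5}}',
    '{"service": {"name": "search"}, "response": {"status": 504}, "latencies": {"request": 60, "proxy": 55, "kong": 5}}',
]


def summarize(lines):
    """Group log entries by service and report request count, 5xx rate, and mean latencies."""
    by_service = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        by_service[entry["service"]["name"]].append(entry)

    summary = {}
    for name, entries in by_service.items():
        errors = sum(1 for e in entries if e["response"]["status"] >= 500)
        summary[name] = {
            "requests": len(entries),
            "error_rate": errors / len(entries),
            "mean_request_ms": mean(e["latencies"]["request"] for e in entries),
            "mean_upstream_ms": mean(e["latencies"]["proxy"] for e in entries),
        }
    return summary


if __name__ == "__main__":
    for service, stats in summarize(SAMPLE_LOG_LINES).items():
        print(service, stats)
```

Because every request passes through the gateway, the same few fields answer questions that previously required packet traces: which service is slow, whether the time is spent in the upstream or in the proxy, and how often each service returns server errors.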
Although we did not have the foresight to do this, retrofitting it would still be easier and simpler than the alternatives. Had I stayed at Zillow, my plan was to migrate us to this architecture, as it was a worthwhile investment in our future goals and opportunities.
One of the first diagrams I saw of Kong showed how it saved developer time when building microservices by removing the need to build libraries, refactor code, and so on, but this only scratches the surface of what Kong can do. The bigger takeaway is moving intelligence into your network to give you greater depth and breadth of insight, telemetry, and decision-making.