How GitOps Meant Fewer App Stalls and Failures for an Online Bank
This article examines GitOps in a real-world environment, along with the issues encountered and the lessons learned about improving performance metrics for developers. GitOps provides an essential framework for DevSecOps, with security checks that extend throughout CI/CD as well as during the post-deployment stages of application management on Kubernetes clusters.
Mettle from NatWest falls squarely in the category of an online bank startup. Without any legacy equipment to manage, it started out many steps ahead of its competitors: established banks and financial service providers with data archives on often decades-old server equipment. The bank’s DevOps team was already in the position to reap the benefits a cloud native infrastructure offers for creating apps and delivering services more rapidly to online customers, in a way that many established financial institutions struggle to achieve today.
For a company that was fundamentally born cloud native, it could be assumed there wasn’t much room for improvements to its service-delivery productivity.
Not long after it began operations, the firm began to grow, and so did its developer and operations teams. However, instead of delivering new apps and application updates at an ever-faster cadence, the DevOps team began to suffer from a demonstrable lag in CI/CD productivity. One key weak point was that developers continued to create and commit code for production at an adequate pace, yet there were unacceptable and widening lags between the time a developer committed code and the time that code was deployed to production.
Many of the CI-related delays were due to testing. A developer would create code and then run preliminary tests on their laptop or workstation. Once the code ran as it should there, it was committed. Load-performance and other tests were then run in another environment, out of the developer’s hands.
Committing code, testing and deploying applications were largely manual processes. While Mettle’s operations team performed manual tests, the developer’s work would come to a stop until the code was either deployed or sent back for remediation, a back-and-forth process that added even more lag time to CI. When applications did not run as they should in production and were sent back to the developer, the lag in the development and deployment cycle increased further. The back and forth does not matter much when just a few developers are involved, but the losses in productivity compound quickly as more developers are added to the team to keep up with demand for a higher cadence of application updates and releases.
Mettle was largely suffering from a velocity problem. The DevOps Research and Assessment (DORA) program, now part of Google, defines a set of velocity metrics, and its research shows that companies that are more effective at developing and deploying software are twice as likely to meet their business and organizational goals. These metrics cover:
- Deployment frequency: How many deployments are completed per month or year.
- Lead time for changes: How quickly a change can be successfully deployed, measured from the time a commit is completed until deployment.
- Change failure rate: The percentage of deployments that fail and must be rolled back.
- Time to restore service: The mean time to restore (MTTR) service after an application failure in production.
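To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record shape and the `dora_metrics` function are hypothetical and are not taken from Mettle's tooling or from any DORA library. Real measurements would pull these timestamps from the CI/CD system and incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # True if it had to be rolled back
    restored_at: Optional[datetime] = None  # when service was restored after a failure


def dora_metrics(deployments: list, period_days: int) -> dict:
    """Compute the four metrics over a reporting window of `period_days`."""
    total = len(deployments)
    if total == 0 or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    # Lead time for changes: commit -> production, per deployment.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]

    # Failed deployments, and how long each one took to recover.
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "deployments_per_day": total / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / total,
        "change_failure_rate": len(failures) / total,
        # None when no failure in the window has a recorded recovery time.
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
    }
```

The sketch measures time to restore from the moment of the failed deployment, which is a simplification: in practice the clock usually starts when the failure is detected.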
GitOps Put to the Mettle
Mettle’s answer to its CI/CD lag problem was to adopt GitOps, in hopes of seeing quantifiable improvements in the velocity that DORA metrics measure. It began to rely on GitOps to standardize workflows and to deploy, configure, monitor, update and manage applications in production.
One of the more recent developments in GitOps is that it can be used not just for applications, but also for Kubernetes cluster configuration and management, including across multiple clusters. This capability follows from Git being the single source of truth, where the desired configuration is declared. A GitOps agent running inside Kubernetes continually compares the actual state of the cluster with the desired state stored in Git. Any new changes merged into the monitored branch in Git are automatically applied to Kubernetes. Conversely, any manual changes applied directly to Kubernetes are automatically reverted to the desired state declared in Git. Configuration drift is eliminated.
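The compare-and-correct behavior described above can be sketched in a few lines. This is an illustration only, not the code of any actual GitOps agent: `diff_state` and `reconcile` are hypothetical names, and both the desired state (what Git declares) and the actual state (what the cluster runs) are modeled as plain in-memory dictionaries rather than read from Git or the Kubernetes API.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Return the actions needed to make `actual` match `desired`."""
    actions = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in actual:
            actions["create"].append(name)   # declared in Git, missing in cluster
        elif actual[name] != spec:
            actions["update"].append(name)   # drifted from what Git declares
    for name in actual:
        if name not in desired:
            actions["delete"].append(name)   # exists only as a manual change
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the diff and return the new cluster state (Git always wins)."""
    actions = diff_state(desired, actual)
    new_state = dict(actual)
    for name in actions["create"] + actions["update"]:
        new_state[name] = desired[name]
    for name in actions["delete"]:
        del new_state[name]
    return new_state
```

An agent would run something like `reconcile` on a timer or whenever the monitored branch changes. Note one assumption baked into the sketch: resources that are not declared in Git are deleted. Whether a real agent prunes such resources is typically a configuration choice.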
With GitOps, Mettle began to see quantifiable improvements as measured by DORA metrics. Testing became automated in such a way that a commit not only started the build process, but could then also be used to update the deployment manifest in Git. While a developer can test a single container or application on their own, testing that application or container in concert with the other containers and microservices offers a more realistic assessment of how the code will run in production.
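As a rough sketch of the "commit updates the deployment manifest" step, the function below retags one container image in a manifest shaped like a Kubernetes Deployment, using the short commit SHA as the new tag. The function names, the tagging convention and the example registry are assumptions made for illustration. The source does not describe how Mettle's pipeline actually performs this step.

```python
import copy


def _repository(image: str) -> str:
    """Strip a trailing :tag, leaving any registry host:port intact."""
    head, sep, tail = image.rpartition(":")
    # A ':' that is followed by a '/' belongs to a registry port, not a tag.
    return head if sep and "/" not in tail else image


def set_image_tag(manifest: dict, container: str, commit_sha: str) -> dict:
    """Return a copy of a Deployment-shaped manifest with one image retagged."""
    updated = copy.deepcopy(manifest)  # never mutate the caller's manifest
    containers = updated["spec"]["template"]["spec"]["containers"]
    for c in containers:
        if c["name"] == container:
            # Assumed convention: tag images with the 7-character short SHA.
            c["image"] = f"{_repository(c['image'])}:{commit_sha[:7]}"
            return updated
    raise KeyError(f"container {container!r} not found in manifest")
```

In a GitOps workflow, CI would write the updated manifest back to the Git repository, and the in-cluster agent would pick up the change from there. The sketch does not handle images pinned by digest (`@sha256:...`).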
With the entire environment declared in Git, starting or maintaining an integration environment is simple. This environment can be “long running,” meaning the integration environment is accessible 24/7 (which is also useful for distributed developer teams in different time zones). An integration test environment can also be provisioned on demand.
Developers can also create the integration test environment themselves, rather than have a dedicated Ops team do this. In this case, the developers can move more quickly because they don’t have to wait for a test environment to be built for them.
With the applications and containers, as well as the cluster itself, declared with GitOps, managing integration testing is simplified and accelerated. Full integration testing can be managed by the developers or the DevOps teams directly, reducing the time needed to spin up the environment.
With a major software push that might include 20 or so development teams, the different containers from the various teams are tested, committed in Git and tagged. The tag represents the release as a whole: all of the container versions as well as all of the source code versions.
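One way to picture a tag that stands for a whole release is as a record mapping each team's component to the image and source commit that shipped. The sketch below is hypothetical (`ComponentVersion`, `build_release` and `changed_components` are illustrative names, not part of Git or any GitOps tool), but it shows why such a tag is useful: two releases can be compared component by component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentVersion:
    """Version of one team's container at release time (hypothetical shape)."""
    image: str   # container image reference, including its tag
    commit: str  # source commit the image was built from


def build_release(tag: str, components: dict) -> dict:
    """Bundle every component version under a single release tag."""
    if not components:
        raise ValueError("a release needs at least one component")
    return {"tag": tag, "components": dict(components)}


def changed_components(previous: dict, current: dict) -> list:
    """Names of components added, removed, or changed between two releases."""
    before, after = previous["components"], current["components"]
    return sorted(n for n in set(before) | set(after) if before.get(n) != after.get(n))
```

In practice, Git itself plays this role: the tagged commit contains the manifests, which in turn pin every image version, so the comparison amounts to a diff between two tags.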
The committed and tested code is then automatically pushed to production from Git. The updated code also remains independent and accessible, and can be audited in Git (which reflects the same code commit that runs in the cluster). As a result, deployments that might previously have taken hours or even days can be completed in a few minutes. In an online, cloud native world, that gain in productivity means features are completed and made available to the end user faster. Quantifiable improvements in productivity across CI/CD are thus achieved for both developer and operations teams.
Faster deployments and improved DORA metrics also do not come at the cost of security. This is because GitOps provides an essential framework for DevSecOps, with security checks that extend throughout CI/CD as well as during the post-deployment stages of application management on Kubernetes clusters. With the tag, a complete audit trail is available and accessible: the source code, the build, the deployment and the tests are all associated with the tag. Consequently, Mettle was able to test, secure and deploy code across its entire environment much faster, and this is where the biggest gain in productivity came from.
This entire process can be completed with Weave GitOps Enterprise, which has emerged as the first GitOps platform that automates continuous application delivery and operational control for Kubernetes at scale.
All told, Mettle reported substantial improvements across all of the DORA metrics. Developer productivity alone improved by 65%, measured in time savings, thanks to the DevOps team’s ability to develop, build, test and deploy much faster and, ultimately, to deliver improved services faster to banking customers.
This developer productivity improvement is especially important for cost savings, given the high salaries Kubernetes application developers typically command: the more productive each developer is, the fewer developers the organization needs to employ, which is good news for the CTO. It’s a win-win: measured in productivity, the developer is doing better, and the CIO can easily show how developer costs are reduced by half. Mettle is also not the only company seeing massive improvements in deployment frequency, developer productivity, MTTR, deployment speed and more, thanks to GitOps.