Atlassian: “DevOps is not a goal in and of itself”
“DevOps isn’t any single person’s job — it’s everyone’s job.” Practicing DevOps in the true sense of the word may be challenging, but once you’ve overcome the obstacles and preconceptions, everything becomes easier. We asked Atlassian’s Michael Knight and Nick Wright to tell us how this company benefits from (genuinely) putting DevOps into practice.
JAXenter talked to Michael Knight, Senior Support Engineer on Atlassian’s build engineering team, and Nick Wright, the head of Site Reliability Engineering at Atlassian, about the company’s DevOps journey and what they learned from the challenges.
JAXenter: What does DevOps look like at Atlassian and how did you get started?
Mike Knight: Our DevOps culture has grown organically over time. I think this was mainly due to our existing commitment to Agile methodologies. We gradually moved from just being an off-the-shelf software vendor to also being a services provider. Our existing Agile practices like cross-functional teams, small/iterative changes with automated testing and quick feedback loops saw teams make a natural progression to a DevOps-style culture as we expanded our ops capabilities to accommodate this change.
These days we have teams building and running many different external and internal services. Despite the varied nature of what the teams deliver, the culture is pretty consistent. Across all our engineering teams you will find Scrum or Kanban-based work processes, daily stand-ups, mature continuous integration (CI) pipelines and a heavy focus on automation at all levels. We have a streamlined way for all teams to easily and reliably spin up and run microservices (large or small) to solve their problems, and there is a lot of room to try out new technologies & tooling.
Nick Wright: At Atlassian, our core value of Be The Change You Seek has been instrumental in teams’ ability to shape the way they work. It is fair to say that there is a diversity of approaches to solving problems and by that token DevOps approaches have taken hold in different teams at different times over the past ~5 years. This is significantly different from there being a top-down decision that “we should do DevOps”.
On the one hand, we had some teams where the work being carried out would already have been classified as DevOps approaches when the term was popularized — I joined such a team in 2011 where a strong focus on automation and configuration management was the only way we could scale out thousands of hardware nodes in the datacenter. That team never particularly identified with the term DevOps but found themselves there as a matter of finding the path of least resistance to solving scaling challenges.
Other groups took a deliberate stance of adopting DevOps practices by looking externally and following examples we observed elsewhere. The focus on automation and tools as well as working directly with development teams to solve shared problems is at the heart of our movement toward Site Reliability Engineering and has helped to significantly shape the approach we’ve taken to scaling our operations team.
JAXenter: In the DevOps 101 report you mention that the key to faster, higher quality releases is a strong relationship between the dev and ops teams. How did you learn this lesson?
Mike Knight: During the early days of our cloud offering, our operations were outsourced to our managed hosting provider, since we did not have sufficient expertise in this area at the time (almost all the engineers were Java developers). This helped us get our offering off the ground, but we soon ran into a lot of friction. Miscommunication led to occasional outages (both minor and major) and we were constrained in our ability to experiment and scale. Eventually we decided that the only way to innovate and improve the quality and speed of our deployments at the pace we wanted was to bring our operations in-house. It was a big investment but necessary, and years down the track we have really reaped the rewards from it. In retrospect we probably should have done it much earlier.
Nick Wright: For me it has been about changing our mindset from that of two opposing forces with competing goals (stability in the case of ops, feature releases in the case of development teams) to one where all engineers are aligned on the goal of quickly and reliably releasing code. The biggest lesson has been one of seeking first to understand – taking the time to understand the perspective of those you work with and giving the benefit of the doubt that they are operating with the best of intentions.
It’s very easy to focus on an individual who has made a mistake in introducing a bug that has led to a production outage, particularly when it appears that a process did not get followed properly. People make mistakes – that’s expected. Building in safeguards and autonomous systems to catch these things, so that we are less reliant on people always doing the right thing, has been an important part of improving and achieving these outcomes while at the same time removing the blame factor that erodes trust.
JAXenter: What tools and processes do you use to support the relationship between the dev and ops teams?
Mike Knight: The main stressor to the relationships between dev and ops teams is conflicting responsibilities and goals. To align teams better, we imbue the developers with as much responsibility as is reasonably possible, but this must also come with the necessary power and autonomy to fulfill those responsibilities. This means the devs control all or most of their delivery pipelines, deployment environments and QA processes, and also means they’re on-call for production issues. This helps developers appreciate (and develop) operational stability while the ops teams can focus more on platform improvements. For some of our services, the dev and ops teams are one and the same.
We’re in a fortunate position as a maker of collaboration & development tools to be able to ‘eat our own dog food’ and use a lot of our own products to this end. We use JIRA to transparently track changes through their workflow, and it also links automatically to the associated code changes. Depending on where the changes are being made, pull requests for code changes will have reviewers from both dev and ops teams. This gives the ops team a heads-up about changes and enables problems to potentially be caught by ops before they hit production. Bamboo is used for continuous integration and deployments, allowing both dev and ops teams to keep track of what’s been deployed into which environments and potentially allowing either team to trigger a roll-back if necessary. HipChat ties it all together, providing a solid communication layer both within and across teams and serving as a ‘command centre’ where monitoring alerts and status updates are sent.
Nick Wright: We collaborate through HipChat and this is central to the way we work. Teams collaborate and receive notifications from builds, deployment events, and monitoring and alerting systems as part of their day-to-day activity. When things go wrong, incidents are managed from within HipChat, and the chat history provides a useful log of events leading up to the failure, allowing us to complete a post-incident review and agree between dev and ops teams on the right follow-up actions.
JAXenter: How easy/challenging has it been for Atlassian to actually practice DevOps in the true sense of the word?
Mike Knight: DevOps is not a goal in and of itself, but a means towards effective service delivery. Because different teams are in different situations, some aspects of DevOps culture have been a good fit and adopted quickly by one team but not another. For example, a team that runs an internal service that has few interactions with other services can usually own the entire thing from concept to delivery and be fully responsible for running it in production (including the infra). They’ll probably be using tooling popular in the DevOps space: containerization (e.g. Docker), configuration management (e.g. Puppet) and a cloud platform (e.g. AWS) with its associated tooling. Teams like these operate like what people typically imagine a ‘DevOps team’ to be.
On the other hand, a team that delivers a complex production service is necessarily more specialized and needs a lot of support from other teams. This may pragmatically preclude them from looking quite like the multi-disciplinary DevOps team mentioned above (for example, they may not manage their own infra or platform), but the same culture is there. You’ll still see automated CI pipelines facilitating fast releases, production monitoring for fast feedback, on-call rotations for outages and a close relationship with users (directly or via Support) to action feedback. They may not be using Docker, Puppet or AWS directly, but I think they still embody ‘DevOps’ in a true sense.
Nick Wright: I think that for us it has been relatively easy — I would suggest that it would only be difficult to make use of an approach in situations where it doesn’t fit. Our teams are quick to iterate and change when things are not working well so our flexibility has allowed teams to experiment with various approaches, keeping what works best and integrating these into our workflows.
One challenge I will mention is that teams sometimes feel like they are re-inventing the wheel. I’ve been involved in projects where we’ve built a new service to help others solve a key operational problem in managing their services (our centralized logging solution was an example of this), only to hear from another team that another new service is being built to solve the same problem in a different way.
JAXenter: What do you think of Jeff Bezos’ 2-pizza team rule? Do you put this concept into practice at Atlassian?
Mike Knight: Yes – in general our team sizes are fairly small. Teams larger than about half a dozen people are usually broken down into sub-teams, each with a team lead. This keeps communication overheads low and fosters strong connections between team members. However, people are also encouraged to go on secondment to other teams or even change teams entirely if they want to. This helps with inter-team relationships and knowledge sharing.
JAXenter: How do you feel about automation? How important is it in a DevOps context?
Mike Knight: It’s absolutely fundamental. Being able to programmatically build, test, deploy and monitor a service is a big investment but the pay-offs are even bigger. While time must be spent maintaining the automated systems, the ability to share expertise, track changes and push them through to production faster and more reliably is greatly enhanced. Automation can also help in some less obvious ways, for example, enhancing your customer service capabilities with chat bots, smart service status pages that automatically report in-progress incidents or tracking the number and type of support requests your team receives to help you avoid the need for them in future.
Nick Wright: Automation is how we scale without having to grow our team sizes linearly with the number of customers we have. There is no other way we could continue to operate at Atlassian, and this is becoming increasingly true for most of the world’s companies, whether they are in the tech industry or not.
When providing a platform as a service, automation is what empowers individuals to get things done themselves, rapidly and without the delay of having to go through a service desk or other people-driven process. We can experiment and try things out because the cost of doing so is significantly reduced. If you’d proposed spinning up 100 servers to perform load testing on a new component in a pre-DevOps world, that would have been a significant undertaking that might have taken days or even weeks. We now live in an environment where an engineer can do this in a matter of minutes across a far wider variety of use cases, and with the cost being so low, the critical behaviors that improve the overall reliability of our services are encouraged.
Thank you very much!