Interview with Mike Tria, Head of Infrastructure at Atlassian

“There’s a class of problems that go away when you start using containers”

JAX Editorial Team
© Atlassian

Couldn’t make it to the Atlassian Summit earlier this month? We’ve got you covered. JAXenter editor Gabriela Motroc chatted with Mike Tria, the Head of Infrastructure at Atlassian, about going serverless, containers, and the future of DevOps.

JAXenter: Serverless architecture is a bit of a trend right now; everyone is crazy about it. Is Atlassian into serverless architecture and, if so, what exactly are you doing with it?

Mike Tria: I have particular opinions on serverless that are both good and bad. I am the head of infrastructure for Atlassian, so I essentially do all of the care and feeding for the cloud platform that underlies our cloud products. On that cloud platform, one of the big components is actually a fair amount of serverless architecture and serverless things.

I’d say that the things that have worked really well for serverless have been things around data processing. Streaming data comes in; it can be processed by something that runs serverless and it can scale with the throughput and load of that workload. We also use serverless for some prototype or experimental new services that we’re building, because the deployment path to get them into our production environment is very quick for serverless.

I think the area where you really need to be careful as a business is the lifecycle of code for something like serverless. I think it’s still very nascent. Doing something like “blue/green” deploys or “canaries” or compliance or really good instrumentation, I think those things are starting to come into being, but aren’t really there yet. With the things that you get from running in containers, we’re already there. We have those things. For serverless, I think they’re not quite there yet, so I think that certain types of workloads, again, ones for data streaming and specifically experimental workloads, are a really good fit for serverless.

But for others, I think you need to keep in mind that the ecosystem around serverless is still developing and, especially if you are an enterprise, you need to make sure that you don’t need those things in the ecosystem around your code.
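
For readers unfamiliar with the “canary” deploys mentioned in this answer, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_version`, the `canary_percent` parameter, and the hash-based bucketing are all hypothetical and do not describe Atlassian's tooling or any serverless platform's API.

```python
import hashlib


def choose_version(request_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` percent of requests to the new version.

    Hypothetical helper: hashing the request id keeps the decision
    deterministic, so the same id is always routed the same way.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first two bytes of the hash onto a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    # With 0 percent nothing reaches the canary; with 100 percent everything does.
    print(choose_version("request-42", 0))
    print(choose_version("request-42", 100))
```

In a real rollout the percentage would be raised gradually while instrumentation watches the canary, which is exactly the kind of surrounding tooling the answer describes as still maturing for serverless.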

JAXenter: Do you think serverless means NoOps? Or is it the other way around?

Mike Tria: I think it can be. I think it can be for some kinds of things. I’ve heard a similar argument with virtualization when that came out: does that mean there’s no ops? Or with containers and things like Kubernetes: does that mean that you have no ops?

That’s just the nature of the changes, like we saw when we went from Ops to DevOps. With something like serverless, you’ll have something similar. I think the nature of what that operations team would be doing is different.

Think about it this way: before a team moves to microservices, you have one large monolithic application. The ops team really focuses on all the things around that monolithic application; there are lots of them because it’s so big. When you move to microservices, each one of those services actually requires a lot less operations than the original monolith. But a new thing is introduced, which is coordinating the orchestration of all of these services together.

With serverless, you will have far more little bits of code that are basically running in servers than you’d even have in microservices, so you can’t even know what all your services are. So I think what actually happens in operations is it changes again. You’re actually studying things en masse and building tools around managing fleets of different types of things that are running serverless and instrumentation. So, I don’t actually think Ops is going to go away. I think it’s going to do what it’s always done, which is, it’s going to evolve to fit that compute pattern.

JAXenter: Just like what happened with DevOps, right?

Mike Tria: Just like what happened with DevOps.

JAXenter: And how do Dev and Ops get along at Atlassian?

Mike Tria: Well, they both work for me. [Everyone laughs.] It sort of helps there. My team manages both our physical data centers and our centers running in the cloud. It’s been a transformation.

I think the thing that binds operations and engineering together is automation. For operators, especially when you hit scale, the idea of SSHing into a server goes away, because you realize that things are happening at such high scale and at such a high rate. The idea of doing anything in isolation just doesn’t make any sense. So, good operations teams invest heavily in automation, and automation is a core principle of what developers do.

The bridge for that at Atlassian has been the SRE team. Our Site Reliability Engineers are that rare blend of having an operational background, plus an engineering focus on how we run our services. For any company that is concerned about operations and engineering not getting along, I would put the focus on how you can develop an SRE skillset and an SRE mindset, where you have people who are very paranoid about services not working and will focus on the computer science of reliability. They’ll find friends in both operations and engineering that can help bridge that gap.

JAXenter: Do you build products with a DevOps approach in mind?

Mike Tria: Oh, we have to. Absolutely. I mean, if you think about something like JIRA or Confluence, I think Scott [Farquhar, Co-founder and Co-CEO, Atlassian] mentioned this in the keynote, but we have hundreds of thousands of containers. Those containers failing means that our products are simply not available to our customers, critical business functions are not available, which hits us directly.

And so for us, a DevOps focus means, “well, how can we ensure the reliability of all of this infrastructure, and all these customers and all these containers in a sustainable way without me having to hire like a hundred Ops people to sort of solve this problem?”

Well, the answer is DevOps. The answer is automation. And so, you know, it had to be a focus for us to scale as a business. So I’d say that it is something that we yearn to do, because our operations people have an automation mindset in the way that they basically structured the infrastructure.

Also, for us as a business, as we scale and do things for, let’s say, Europe, I physically won’t even have people running things from an operations perspective in that time zone, in that particular area, so we have to have even more automation and more monitoring and alerting. I’d say it’s been a very good and very necessary thing we’ve had to do as a business.

JAXenter: And how do you make sure it doesn’t make another silo? Some companies, when they adopt or do DevOps, basically, it creates another silo, even though that’s what they want to avoid.

Mike Tria: That’s a good question. So, the interesting thing at Atlassian, at least on the cloud business, is that all the engineers are DevOps engineers. We have a principle at Atlassian called “You build it, you run it” – YBIYRI.

If I had to hire enough operations or site reliability engineers to manage all the services that Atlassian builds, I would quickly become the bottleneck of the business. I don’t want to be the bottleneck of the business. I want us to have all of the features and I want to provide a platform that allows that to work.

What we’ve done is provide training to all of our engineers on how to run and administer services. Then, we have that site reliability engineering team that’s building a common set of tools that our engineers use. When an engineer in JIRA builds some service for what they’re working on, they’re the one who is actually going to run and administer that service.

We like that, because it builds empathy for them in terms of what the classic operations people had to deal with. Instead of “throwing it over the wall and letting operations take care of it”, they’ll have to wake up in the middle of the night. The engineers at Atlassian all feel that pain. They actually understand the reliability of their service because they feel it very viscerally. I think that it’s a good thing because it helps engineers at Atlassian build empathy for the customer. When their service goes down, the customer feels it. Now they understand that.

Also, it helps build a DevOps mindset across Atlassian. I don’t have to focus as much on building this bridge, because the engineers already want it anyway. Because they woke up at 3 am, they want better alerting and monitoring. I don’t have to say anything. They just want to do it themselves.

JAXenter: How do you handle incidents? For example, when something goes wrong at 3 am?

Mike Tria: I’ve heard a saying that goes, “Being on call isn’t a bad thing. Having your service break at 3 am is a bad thing.” I think that engineers who are working on services that are less mature than others within Atlassian have felt a lot of that pain, and they’re essentially on call. They would basically jump in and help with incidents with the rest of Atlassian.

I think the thing that we’ve done to ease that pain is that I’ve made my team available to help them. So they’re not doing it alone. The nice thing is that I’ve now seen some of those developers educate my team. They’ve now felt that pain, they’ve built such instrumentation and monitoring and alerting about those services, they’re educating us, which I think is the best possible situation.

Our company values actually come in handy when we deal with something like incidents. Our incidents are always blameless. We place a team on call – you might be a part of a service that wasn’t actually the one that failed – you’d still go online and you’d still jump in and help. That’s actually really, really helped. I’ve seen situations where those things are not the case, and it actually feels very terrible to go on to an incident. But within Atlassian as an engineer, it actually feels pretty good. You feel like you have a team there that is capable of helping you and supporting you.

JAXenter: Cool. So I see everything is about culture. But is culture more important than tools for DevOps?

Mike Tria: It’s hard for me to say whether one is more important than the other. I think you need both to be able to succeed.

Let’s say you don’t have culture. What happens is you have an individual team that is trying to do this. You’re always pushing against the tide of engineers who don’t see value in your work. In a sense, they might be working against you by building systems that don’t allow for automation because they’re not aware of it. If you don’t have automation, then you run into the problem that I mentioned before, which is that your operations or your infrastructure becomes the bottleneck of your business. You need to keep hiring operations people at the pace the business grows, since you don’t have that kind of automation.

I think you need a little bit of both. If you’re very, very early on in your business, you can maybe get by with choosing just one of those roads to go down, focusing on either culture or automation, and you’d probably be okay for a little while. Once you reach the kind of scale that Atlassian has reached in the cloud, I really do believe you need both to succeed.

JAXenter: Let’s talk about containers for a bit. How involved is Atlassian in the field of containers? 

Mike Tria: Atlassian started with containers in 2010, with OpenVZ. So we have seven and a half years of experience with containers, which is more, I think, than most companies out there. We’ve run them on bare iron, we’ve run them on the cloud, we’ve run them in a virtualized environment, we’ve done them every which way. We have a lot of people at Atlassian who have been on that journey for the last seven and a half years. So for us, all of our services internally are running on Docker and things like that.

We have a fair amount of focus on something right now like Kubernetes. Once you start having lots and lots of microservices, just using a container by itself isn’t good enough.  You’re managing a fleet of these things, and you have multiple clusters and multiple regions. We just announced with Europe, so actually getting good at managing those things is something that is very important to our business. Otherwise, we won’t know how our services are running which would be really bad.

We’re investing very heavily in containers. We’ve got whole teams inside of my organization that are focused purely on containers. We actually have some open source projects in this regard, because we’re advancing the state of the art on these things. We’ve got some blogs and some presentations at this conference coming up as well, so we’re very involved in the community and we try to give back.

JAXenter: What does Container 2.0 mean to a company?

Mike Tria: Container 2.0? Well, for some of the developers at Atlassian, it’s Container 1.0 because this is all a new thing to them. I think that how you think about running your instances, your services, your applications has changed a lot really in the last year or two years. Especially with a thing like containers.

It’s so easy just to build a small little thing and have it run and have it not consume that entire system because you’re using a container and you can pack a bunch together. That incentivizes a developer to write real small bits of code and have them out there. The downside of that is now you have this complex web of interdependencies.

There’s a picture sitting near my desk. It’s this graph of every single service we once did and what it communicates to and it looked like a subway map or metro map of the entire planet all hooked together. It was huge! And I was thinking, once upon a time, there were developers at Atlassian who could look at something like this and fit the whole thing inside of their head because there were only a few services. Now there are so many, we don’t even know where all the dots are.

Container 2.0 for me means building a system that allows you to have this complexity. It keeps the system running, puts in best practices, and isolates particular kinds of containers from others for compliance or security reasons. It helps the developer get things out there. In a sense, it’s a system where you can deploy containers safely into an environment that’s very complex. But to the developers it feels very simple.

JAXenter: I read the Atlassian report that said that containers are breaking out, especially among larger teams. 34% used containers to reduce IT overhead. 

Mike Tria: I think that 34% is a good number. It’s pretty high. I think part of the reason you’ve seen this is that the ecosystem of things around containers has come into play. Containers have been around forever; we had things like Solaris Zones way, way long ago. They’re not necessarily anything new. I think what’s new is running them in a virtualized environment in the cloud, and then all the tooling that comes around them makes it really, really easy to do that. So, I think that’s why you’re seeing it.

I see people approaching it from two directions. You can see that lots of people are using containers for cost management. You can run containers at high density on an individual instance and you can save a lot of money. You can also do this thing where you essentially pack your operating system, all its dependencies and your applications into them and it spins up very quickly, faster than a virtualized instance. So I’ve seen people come at it from either one of those directions. Either they already have a large infrastructure and they want to save on cost, so containers make sense, or they want to speed up the pace of their development while keeping it stable and reliable, basically with a deterministic thing that they are deploying, and they come at it from that angle.

We’re coming at it from a much more “developer/speed” perspective; we basically want to be able to get features out to our customers as fast as possible and in a safe manner. Containers let us do that. I imagine that a lot of our peers out there and a lot of companies that are doing lots of IT and operations should really look at that as well. I think that’s where we’re going to go.
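
The cost-management angle mentioned in this answer, packing many containers onto each instance, can be illustrated with a small Python sketch. It uses a first-fit-decreasing heuristic, which is one common way to approximate this kind of packing and is not necessarily what Atlassian or any particular scheduler does; the function name and the memory figures are hypothetical.

```python
from typing import List


def pack_first_fit_decreasing(container_mem_mb: List[int],
                              instance_mem_mb: int) -> List[List[int]]:
    """Group containers onto as few equally sized instances as the heuristic finds.

    Hypothetical illustration: only memory is considered, and every
    instance is assumed to offer the same `instance_mem_mb`.
    """
    instances: List[List[int]] = []  # containers placed on each instance
    free: List[int] = []             # remaining memory on each instance
    for size in sorted(container_mem_mb, reverse=True):
        if size > instance_mem_mb:
            raise ValueError(f"container needing {size} MB fits on no instance")
        for i, remaining in enumerate(free):
            if size <= remaining:
                # Place the container on the first instance with enough room.
                instances[i].append(size)
                free[i] -= size
                break
        else:
            # No existing instance has room, so start a new one.
            instances.append([size])
            free.append(instance_mem_mb - size)
    return instances


if __name__ == "__main__":
    layout = pack_first_fit_decreasing([512, 256, 256, 1024, 512], 1024)
    print(len(layout), layout)  # 3 [[1024], [512, 512], [256, 256]]
```

In this toy case five containers share three instances instead of occupying five, which is the density saving described above.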

JAXenter: Actually, about that. One of the top DevOps influencers said something like “Containers are the future of DevOps.” Do you think that’s correct?

Mike Tria: “Containers are the future of DevOps”? I think that there’s a class of problems that goes away when you start using containers. If you think about tools like Puppet and Chef and Ansible and Salt and all the rest of them, they’re really about configuring a running system. When you use containers and you have an application that’s stateless, you essentially can just take the container, wipe it out, and then put the new one in. And so, the fact that you have this deterministic state means that you don’t really need to focus on changing anything that’s inside of this container; you would just remove it and deploy a new one, which is a very quick and safe operation.

So, as someone in operations, I would say it wouldn’t remove what you do in DevOps, because now your problem is, “well, how do I get my containers out there in a safe way?” That becomes the new challenge. I think it just evolves. I’m very happy for this too. I’ve been in worlds where you may have a container or an instance that’s out there for weeks, months, years and it’s just been patched and patched and patched and patched. Then, when you have time to do something like a reboot or a big change, everyone is very frightened.

The fact that containers allow this very quick and rapid iteration means that you become less afraid of these changes because you’re kind of staging them always and you’re always throwing away your old infrastructure. For an operations person, I think that’s the blessing about containers because they don’t like these big changes. They would much rather have smaller, safer changes that are always happening.
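
The replace-rather-than-patch approach described in this answer can be sketched in Python. This is a toy model under stated assumptions: the `Container` class, the `deploy` function, and the image tags are hypothetical, and no real container runtime or orchestrator API is involved.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Container:
    """Hypothetical immutable record: the image is fixed when it is created."""
    name: str
    image: str


def deploy(fleet: Dict[str, Container], name: str,
           new_image: str) -> Dict[str, Container]:
    """Return a new fleet in which `name` runs a fresh container.

    The old container is never modified in place; it is simply dropped
    and a new one built from `new_image` takes its place.
    """
    updated = dict(fleet)
    updated[name] = Container(name=name, image=new_image)
    return updated


if __name__ == "__main__":
    fleet = {"web": Container("web", "shop:1.0")}
    new_fleet = deploy(fleet, "web", "shop:1.1")
    print(fleet["web"].image)      # shop:1.0
    print(new_fleet["web"].image)  # shop:1.1
```

Because the record is frozen, there is nothing to patch: every change is a new deployment, which mirrors the small, frequent, throwaway changes described above.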

Thank you very much!


