DevOpsCon takeaways: Ops is more important than ever & automation can lead you astray
DevOpsCon 2017 is coming to an end but we’ve gathered enough intel to keep us out of DevOps trouble until our next DevOps-related event next year. We learned some rules about chaos engineering, we discovered that HTTP is not an ideal solution for microservices, we noted down some DevOps landmines and we found out that we shouldn’t rely on anything being stable, especially in the cloud. Let’s see what went down at DevOpsCon 2017.
Russell Miles: Rules about chaos engineering
What better way to start a conference than with a live concert? Russell Miles, Director of Chaos (Engineering) and Principal Consultant at Russ Miles & Associates, and the first keynoter of DevOpsCon 2017, surprised attendees with a guitar solo. Luckily for us, he live-streamed the concert (Russ Miles, @russmiles, November 21, 2017).
Music aside, Russell explained that chaos engineering is not about causing chaos. It is about accepting that the system itself is chaotic and fighting to make sense of it. He opined that “we should be banned from naming things after what they are not.” In short, production is out to get you: failure in production is everywhere (hardware, functions, state transmissions, latency and resource exhaustion).
One of the most important takeaways from his keynote is the following: Even though chaos engineering is specifically about availability, you should keep in mind that there are a lot of factors and levels (infrastructure, platform, applications, people, practices & process) that may affect it.
Takeaway number two: Think about how much you need to affect in order to learn something. The conditions should be possible, even if a bit extreme, and ideally you should run experiments in production.
Rules about chaos engineering:
- You break things to learn something new. Chaos should never be a surprise, and if you already know the consequences, don’t do the experiment. What you should do instead is challenge your assumptions about the system.
- When adopting chaos, you should drop the term, communicate, limit the blast radius to put people’s minds at ease, grow the capability, etc.
- If you’re running microservices in production, you *need* to do chaos engineering.
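To make the rules above a bit more concrete, here is a minimal sketch of an experiment runner. It is not taken from the keynote or from any real chaos tool: `run_experiment`, `kill`, `majority_up` and the dictionary-based instances are all hypothetical names invented for illustration. The sketch assumes an experiment has a hypothesis to check, only touches a limited share of instances (the blast radius) and refuses to start if the system is already unhealthy.

```python
import random


def run_experiment(instances, blast_radius, inject_failure, steady_state_ok, rng=None):
    """Run one experiment against a limited share of the instances.

    All names here are hypothetical; nothing is taken from a real chaos tool.
    """
    rng = rng or random.Random()
    # Do not start an experiment on a system that is already unhealthy.
    if not steady_state_ok(instances):
        return {"ran": False, "targets": [], "hypothesis_held": None}
    # Limit the blast radius: a fraction of the fleet, but at least one instance.
    count = min(len(instances), max(1, int(len(instances) * blast_radius)))
    targets = rng.sample(instances, count)
    for instance in targets:
        inject_failure(instance)
    # The experiment only teaches something if the hypothesis is checked afterwards.
    return {"ran": True, "targets": targets, "hypothesis_held": steady_state_ok(instances)}


def kill(instance):
    """Simulated failure: mark one instance as down."""
    instance["up"] = False


def majority_up(instances):
    """Hypothesis: the service is fine while at least half the instances are up."""
    return sum(1 for i in instances if i["up"]) * 2 >= len(instances)
```

Calling `run_experiment(fleet, 0.2, kill, majority_up)` on a list of ten `{"name": ..., "up": True}` dictionaries would take down two of them and report whether the hypothesis still holds. In a real setup the failure injection and the steady-state check would talk to actual infrastructure, which this sketch deliberately leaves out.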
If you’d like to know more about why production hates you, don’t miss this interview.
Quentin Adam: “HTTP is not ideal for microservices”
Quentin Adam, CEO of Clever Cloud, talked about the problems you’ll face in the microservices world, such as:
- Authentication: distributed authentication is hard and there are many ways to achieve it.
- Configuration: this is the second issue to manage when you follow a distributed micro application strategy.
If you’re using microservices, you can have different scaling agendas.
Quentin opined that HTTP is not for everything and, more importantly, that it is not ideal for microservices. Message brokers like Kafka and Redis are better in this case. Furthermore, the right size of microservices is very important. A lot of people are building a “noisy microworker army”, but the network has two problems: it’s fragile and slow.
If you have a lot of microservices that depend on a bunch of libraries, that’s because you’re creating a monster, he concluded.
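Quentin’s point about message brokers can be illustrated with a small sketch. This is not how Kafka or Redis are actually used: an in-process `queue.Queue` stands in for the broker, and `publish` and `start_consumer` are hypothetical helpers invented for this example. The assumption being shown is only the shape of the interaction, where the producer hands off a message and moves on instead of waiting for an HTTP response from the other service.

```python
import queue
import threading


def start_consumer(topic, handle):
    """Consume messages from the topic until a None sentinel arrives."""
    def loop():
        while True:
            message = topic.get()
            if message is None:
                # Sentinel: the producer has nothing more to send.
                break
            handle(message)

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker


def publish(topic, message):
    """Producer side: enqueue and return immediately, without waiting for the consumer."""
    topic.put(message)
```

A producer would call `publish(topic, {"order_id": 1})` and carry on, while a worker started with `start_consumer(topic, handler)` processes messages at its own pace. With a real broker the queue would live outside both processes, so a slow or restarting consumer would not block the producer.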
Paul Reed: “Automation can lead you astray”
Paul Reed, the second keynoter, showed the audience that his approach to release engineering and DevOps is different, and he used aviation terms to describe it. He explained that, in both DevOps and aviation, we sometimes underestimate the importance of precise communication: people hesitate to use the appropriate terms for the situation at hand, or they misuse defined terminology.
He also explained what it means to drift into failure — team miscommunication and/or the lack of coordination can make the team members believe everything is going fine until something (bad) happens all of a sudden.
In addition, he pointed out the difference between the terms blameless and sanctionless and revealed why blameless postmortems can feel wrong (spoiler: “humans are hardwired to use the technique of blaming as a way to give voice to painful and uncomfortable feelings”). Read more about why we’re wired for blame here.
In short, DevOps is about changing the company culture (more communication, a sanctionless environment, etc.).
The DevOps landmines to watch out for:
- Organizational incompatibility
- Only certain groups “get” to fail
- Stopping the line is a “privilege”
- Forgetting to dampen failure (where possible)
- Only reviewing failure
- Forgetting about bias (we judge an event by the outcome, which should not happen)
- De-prioritizing retrospectives & learning processes
Jean-Francois Landreau and Gregor Hohpe: Ops is more important than ever
Jean-Francois Landreau, Private Cloud Architect at Allianz Technology, and Gregor Hohpe, Tech Director at Google Cloud, talked about the difference between risk aversion and change aversion. They explained that people assume change is risk, but the real fatal assumption is that no change means no risk.
Although developers might think they are the ones leading the DevOps show, this is not true. According to Jean-Francois and Gregor, operations is more important than ever, but in a different way, because operations teams now embrace development practices.
Furthermore, they pointed out that skipping steps in the DevOps transformation cycle will catch up with you.
Justin Vaughan-Brown: “Silos still exist in monitoring”
Justin Vaughan-Brown, Director of Product Marketing at AppDynamics, revealed that even though DevOps is trying to break down silos, they still exist in monitoring. Furthermore, silos in monitoring usually lead to a blame culture.
Jörg Schad: “You shouldn’t rely on anything being stable, especially in the cloud”
Mesosphere’s Jörg Schad shared some of his favorite and scariest support stories covering typical system-setup, configuration, and application pitfalls for new (and not-so-new) Mesos and DC/OS operators and gave some hints about how to debug those pitfalls.
Nightmares of a container orchestration system and how to make them more bearable:
- Container orchestration: don’t write the scripts yourself.
- Backing up state is very important. “You shouldn’t rely on anything being stable, especially in the cloud.”
- Private container registries: try running your own private registry as a container so you can control what’s being pushed.
- Health checks should be carefully specified.
- Cluster updates: check the health before you update and follow the upgrade instructions (e.g., some endpoints or characteristics might have changed).
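The last two points, careful health checks and checking health before an update, can be sketched as a simple gate. This is an illustration only and not a Mesos or DC/OS API: `unhealthy_nodes`, `safe_to_update`, the node names and the `check` callable are all hypothetical. The sketch assumes a node counts as unhealthy only if it fails every one of a few attempts, so that a single flaky probe does not block the update.

```python
def unhealthy_nodes(nodes, check, retries=3):
    """Return the nodes that fail the health check on every attempt."""
    failed = []
    for node in nodes:
        # any() stops at the first successful probe, so healthy nodes are cheap to check.
        if not any(check(node) for _ in range(retries)):
            failed.append(node)
    return failed


def safe_to_update(nodes, check, retries=3):
    """Gate for a cluster update: proceed only if every node is healthy."""
    failed = unhealthy_nodes(nodes, check, retries)
    return (len(failed) == 0, failed)
```

An update script could call `ok, failed = safe_to_update(nodes, check)` first and stop with the list of failing nodes when `ok` is false. What `check` actually probes (an HTTP endpoint, a command, a TCP port) is exactly the part that needs to be specified carefully for each service.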