What happens when you blame a developer for a problem?
What is the right technical and cultural response to the failure of a highly available system? At JAX London, software architecture expert Jeremy Deane led two sessions on the importance of resiliency and the challenges of technical change – we spoke to him about diffusion techniques and blameless work culture.
JAXenter: During your JAX London session you kindly reminded us all that failure is inevitable, particularly in highly available systems. What happens when they fail?
Jeremy Deane: It always depends on what you have in place when a system fails. You might not know at all if you don’t have monitoring in place. If you don’t have log monitoring in place, if you’re not monitoring, say, queues for messaging or middleware, there could be a case where the tree fell in the forest and nobody knows. And you don’t find out until you have irate customers coming in, berating your software and asking why it’s not working. And then you have to scramble to find out what’s going on.
Whereas if you implement resiliency principles and techniques, a lot of times, not only can you react very fast, in many cases you can add self-healing. And so you don’t even need to page that person in the middle of the night to come in and fix it. It just fixes itself.
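As a rough illustration of the self-healing idea described above, here is a minimal Python sketch, not anything shown in the session. It assumes a hypothetical health probe, restart action and paging hook (`is_healthy`, `restart`, `page_on_call`); the names, the `FlakyWorker` toy class and the retry limit are all invented for the example. The supervisor tries a bounded number of restarts and only escalates to a person if they fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Supervisor:
    """Checks a health probe and tries restarts before paging a human."""
    is_healthy: Callable[[], bool]        # hypothetical probe, e.g. queue depth under a limit
    restart: Callable[[], None]           # hypothetical recovery action
    page_on_call: Callable[[str], None]   # hypothetical escalation hook
    max_restarts: int = 3                 # assumed limit, chosen for illustration
    events: List[str] = field(default_factory=list)

    def check_once(self) -> bool:
        """Run one monitoring cycle; return True if the service ends up healthy."""
        if self.is_healthy():
            return True
        for attempt in range(1, self.max_restarts + 1):
            self.events.append(f"restart attempt {attempt}")
            self.restart()
            if self.is_healthy():
                self.events.append("recovered")
                return True
        # Self-healing failed, so a person has to look at it.
        self.page_on_call("service still unhealthy after restarts")
        return False


class FlakyWorker:
    """Toy component that is stuck until it is restarted."""

    def __init__(self) -> None:
        self.stuck = True

    def healthy(self) -> bool:
        return not self.stuck

    def restart(self) -> None:
        self.stuck = False


pages: List[str] = []
worker = FlakyWorker()
supervisor = Supervisor(worker.healthy, worker.restart, pages.append)
recovered = supervisor.check_once()
```

In this toy run the worker recovers after one restart, so nobody is paged. A real system would run the check on a schedule and would probably add back-off between restart attempts.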
You spoke of resiliency being a ‘neglected step-child’ – how is that?
I think we do a good job of creating highly available systems by doing things like creating multiple nodes in a cluster, load balancing, replication… but we don’t take the time to actually do threat modelling to see how the system might fail – both from a technology and a security standpoint. And then of course, the flipside of that is, when you do have a failure, how quickly can you react to it and make the system whole again – which is what resiliency is all about, making it whole.
Staying on the topic of failure, we’re hearing lots about companies like Etsy trying to create a blameless work culture where you accept and talk about failure. But isn’t it also good to always find the source of a problem and point the finger at some cause?
Absolutely. I mean if you can’t understand why something occurred, you’re not going to be able to prevent it in the future. And the only way, in many cases, you can find out is with honesty. The most principled software engineer can make a mistake. We move very fast in software, and it’s easy to make mistakes.
It’s not just the junior developer co-op who brought down production, but quite often it’s someone who wasn’t trying to bring down production…they just did. And so the worst thing you can do, and the most demoralising thing you can do, is to start blaming people, because they put up a wall and they’re not honest anymore. And then you can’t get to the root of the problem.
You’ve also extensively covered the topic of ‘change fatigue’ and the implications for software. What kind of changes are you seeing there?
So I see, especially since the global financial crisis in 2008, that organisations have had to make profound changes to their systems and architectures – both because of business drivers and also demand. And if you look at how our industry is changing right now, in terms of technology, you have all these new Platform-as-a-Service frameworks, from OpenShift to Kubernetes to Cloud Foundry to Amazon’s. And it’s almost overwhelming, the amount of change that we’re going through.
Organisations that traditionally have one set of developers, let’s say a bunch of developers that know Groovy or Rails or Spring, are now asking them to learn the Go language, or to learn Ruby if they’re using Chef. And then they have to learn the frameworks that do the deployments for these. And so organisations are going through a whirlwind of change. And with every new CTO or new chief architect there’s going to be some disruption.
So if you’re a developer and you go to a conference and you see something that’s profoundly going to make a difference, like here at JAX London we’ve seen a lot of things that would make a positive difference in organisations, and you want to introduce that change on Monday when you get back to the office…how are you going to do that? And that’s the gist of my talk: using diffusion techniques to be able to introduce that change.
So there’s a great book on patterns of change, modelled on architectural design patterns…same thing. It says for a given scenario you might want to hold a brown bag lunch. Or you might want to bring in a champion, someone who’s done something similar in the same domain at another organisation, and also understand where you are in the process of introducing that technology – have people even heard about it? One pattern is to ‘plant the seeds’: you start posting something on your intranet, you start hosting a ‘lunch and learn’. You’re not saying ‘let’s do it’, you’re just exposing it. And then moving on from there, if people already have knowledge of it, maybe creating a demo and showing them how it works.
Observability, meaning how visible the results are to other people, usually has a big impact on the decision to adopt the technology. And then as you move down through the process you get to the point where you might need funding. So you need some kind of ‘angel’, if you will, to come in and help you with the organisation and fund it. Because at a certain point it needs to be institutionalised and you can’t do that with just a pilot project. So it’s really about having an understanding of the type of organisation you work in, the type of people in that organisation, their perspectives and how they would adopt technology over time.