Talking About Failure
When I was younger, I went on several rock climbing and mountaineering expeditions. I was exposed to some instructors and practitioners who took staying safe in the mountains very seriously. One of them carried a book from the American Alpine Club on climbing accidents. The accidents were never the result of a single bad choice. They were caused by a series of decisions, taken over time, that combined to create the conditions for the accident.
What stuck in my memory, sitting beside rock faces that could very well be the scene for one of the stories, was that in certain cases we might make one of the same choices. Given a different time of day, or different weather, or different climbing partners, the same choice could carry a different level of risk. Reading and discussing those stories made me a more aware adventurer and a safer climber.
The public analysis and discussion of the chain of decisions and events that led to accidents helps beginners build experience, helps experts grow wisdom, and fosters a culture of safety. Other industries and subcultures have their own ways to learn from accidents. For example: the US military has after action reviews, medicine has accident review boards, and aviation has NTSB accident reviews.
I have read root cause analysis reports from outages at hyperscale cloud hosting providers. Often they read like the plaster footprints of a long-dead animal, the legal and marketing departments having sanitized and scrubbed any element of life or learning from them. They are the cargo culting of accident investigation and reporting. All of the motions and affectation with none of the substance.
The prevailing attitude toward errors in software seems to be "if you just", where errors are blamed on a lack of knowledge. If the operator had "just" known the impact of the config change, if the developer had "just" understood the interaction of their code change. It is a culture of expert knowledge. It is based on the unexamined belief that bugs only exist from lack of knowledge or lack of care. Or, taken to their offensive extremes: stupidity and laziness.
In the software industry, bugs and outages most often do not result in injury or death. However, as society becomes more dependent on the software systems we are building, the severity and impact of problems only grows. As our software systems continue to grow in complexity, and as the reach of the problems we write software to solve widens, we as a society must grapple with changing this culture.
As a step toward changing this culture, I am talking about failures I have seen or participated in. My goal is to make it easier for others to share where they have failed. This will create a feedback loop that will let us learn faster.
Join me in sharing stories of failure. Here are some talks and posts about failure from me and Cyclic:
- How to fail at Serverless (DevOpsDays Boston 2022 Conference video)
- How to Fail at Serverless: Serverless is Stateless (blog post)
- It's Always Sunny in us-east-1: The gang does business continuity (blog post)
- AWS S3: Why sometimes you should press the $100k button (blog post)
We would love to hear your stories.
Write something up.
Post it publicly.
Let us know.
We got your back :)