SRE Blog

Better living through reliability.

Postmortem Tip of the Day: Clarity

2021-05-14

It's important to have real clarity about incidents and to distill that clarity into a postmortem.

A bad postmortem is almost worse than no postmortem, as it lulls you into a false sense of how well the system is operating, how well you are operating the system, and how well you are able to adjust and adapt to make that system perform as you need it to for your business. Bad postmortems are bad for your customers.

Your ability to make meaningful change to your system depends on your ability to think clearly, simply, and critically about what really went wrong. Any extra cruft gets in the way of that understanding. That clear understanding is critical to having a comprehensive and exhaustive list of action items. If my understanding is muddy, then the action items will be too.

So you have to really whittle down what happened to its bare essentials ("just long enough but no longer") to have a clear narrative. This requires a lot of thinking and rethinking, which requires writing and rewriting. That's why we write docs: they are a tool for focusing our thoughts. For complicated and unfamiliar incidents, it takes me several days to fully understand which pieces of what actually happened were important and which were unimportant.

If only a surface-level understanding is written in the doc, then it's a huge missed opportunity for change and likely a very costly process: engineer time spent writing a poor doc, everyone's time lost reading it, a huge amount of time lost going over it in the postmortem meeting, and of course all the things you didn't fix this time around that will bite your customers later.