SRE Blog

Better living through reliability.

Postmortem Tip of the Day: Should Never Have Happened

2021-09-08

We had a recent incident where someone kept using the word "should" when talking about the system after the outage. They literally said "this should have been a non-issue" to try to avoid further discussion about it. This language is extremely unhelpful and actively harmful.

Language like this dismisses evidence out of hand. When someone insists that something was absolutely not a mistake or a possible contributing factor, you should look at it twice as carefully before ruling it out.

NOTE: Tread carefully if someone thinks they will be blamed for something they did to cause or exacerbate the incident. If the thing they want you to rule out was something they did, it may be better to de-emphasize it in the postmortem or otherwise assure them they aren't and won't be blamed. The factor's contribution to the incident should still be included in the postmortem, but action items and exact wording may need some thoughtful consideration.

The objective after an incident is to look at the collective understanding and decisions around a system and compare what we thought should have happened against what actually happened. Incidents are usually many things that "should not have happened" happening at the same time. These things that shouldn't have happened are exactly the things you're trying to see and understand. Don't overlook them.