Better living through reliability.
2020-08-07
The summary is the first thing your stakeholders read. Most of them are
busy and may not read much beyond it, so make it count.
A summary should be high level and at most a few sentences long. It should
clearly establish the primary effect (or impact), the primary cause, and the
fix (or resolution). Other sections of the postmortem cover each of these in
more depth, so keep the summary terse and dense.
A basic formula is something like:
- A sentence succinctly describing the most important thing that broke
(e.g. logins, database, pipelines). Include the high-level impact on that
thing (e.g. logins were failing, all microservices were crashlooping) and
specific impact numbers (e.g. how many users were affected and for how long
it was broken).
- A sentence or two that makes it clear what the root cause was (e.g. ran out
of database connections, typo in config, etc.).
- A very brief sentence outlining the fix/resolution (e.g. rolled back a
release, resized disk).
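
As a rough illustration of the formula above, here is a minimal Python sketch
that assembles a summary from its three parts. The `IncidentSummary` class,
its field names, and every incident detail in the example are hypothetical,
invented only to show the shape of a summary; they do not come from a real
postmortem or tool.

```python
from dataclasses import dataclass


@dataclass
class IncidentSummary:
    """Hypothetical container for the three parts of a postmortem summary."""

    impact: str      # what broke, how badly, with specific numbers
    cause: str       # the root cause, in a sentence or two
    resolution: str  # the fix, in one very brief sentence

    def render(self) -> str:
        # Join the parts in the order the formula lists them:
        # impact first, then cause, then resolution.
        parts = [self.impact, self.cause, self.resolution]
        return " ".join(part.strip() for part in parts)


# All incident details below are invented for illustration.
example = IncidentSummary(
    impact="Logins failed for roughly 12,000 users for 47 minutes.",
    cause="The login service ran out of database connections after a release.",
    resolution="We rolled back the release.",
)

summary = example.render()
print(summary)
```

Keeping the three parts as separate fields makes it easy to spot a summary
that is missing its impact numbers, cause, or resolution before it goes out.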