SRE Blog

Better living through reliability.

Incident Tip of the Day: Fast Fix Monitoring

2021-05-03

When an outage has gone on for several days and was found by external users posting on very public message boards, here's a pro tip: the oncall should at the very least recommend a change to the system instead of twiddling their thumbs.

In fact, what should happen is that one or, ideally, several monitoring changes are pushed to production that same day, so that external customers never again find the issue before your monitoring does.
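To make "a monitoring change you can push the same day" concrete, here is a minimal sketch of the kind of check I mean: alert when the error ratio over the last few samples crosses a threshold. Everything in it is an assumption for illustration (the names `Sample` and `ErrorRateAlert`, the 5% threshold, the window size); your real monitoring system will have its own way to express this.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape interval's worth of traffic (hypothetical shape)."""
    requests: int
    errors: int


class ErrorRateAlert:
    """Fires when the error ratio across the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float = 0.05, window: int = 5, min_requests: int = 20):
        self.threshold = threshold        # assumed value; tune to your service
        self.min_requests = min_requests  # avoid paging on a handful of requests
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def observe(self, sample: Sample) -> bool:
        """Record a sample and return True if the alert should fire."""
        self.samples.append(sample)
        total = sum(s.requests for s in self.samples)
        failed = sum(s.errors for s in self.samples)
        if total < self.min_requests:
            return False  # not enough traffic to judge
        return failed / total > self.threshold
```

Feed it one `Sample` per scrape interval and page when `observe` returns True. The point isn't that this is sophisticated. It's that a check this small, shipped today, beats a perfect one you never get around to.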

I understand that everyone's busy. That's exactly why you should fix it after the incident is mitigated and while you're still oncall, when you have explicit time devoted to fixing these sorts of issues. I know that if I don't fix it now, while it's top of mind and I have dedicated time, I won't fix it at all once other things come up. That's why I do it ASAP, and especially while the shame is still searing.

After a medium-sized incident, if you haven't changed anything, you've failed your customers.