SRE Blog

Better living through reliability.

Server500sTooHigh Alert

2021-03-29

Our next alert (alphabetically) in our series on base alerts is Server500sTooHigh. This fires when the number of server 5xx responses is larger than expected.

A server returning a 5xx response to a client should be an exceptional and rare occurrence: rare enough that it very likely indicates customer impact and requires immediate attention.

5xx responses MAY be measured at the individual server level, in aggregate across all server instances, or both. Measuring in aggregate is usually easier, less noisy, and a more accurate measure of customer impact (though alerting at the individual server instance level, with a higher threshold ratio, is acceptable).
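As a rough illustration of why the aggregate view tends to be less noisy, here is a minimal Python sketch. The InstanceCounts shape, the instance names, and the counts are all made up for the example; they don't come from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class InstanceCounts:
    """Response counts for one server instance over one window (hypothetical shape)."""
    name: str
    total: int
    errors_5xx: int


def error_ratio(errors: int, total: int) -> float:
    """Return errors / total, treating a window with no traffic as 0.0."""
    return errors / total if total > 0 else 0.0


def aggregate_ratio(instances) -> float:
    """5xx ratio across all instances: sum the counts first, then divide."""
    total = sum(i.total for i in instances)
    errors = sum(i.errors_5xx for i in instances)
    return error_ratio(errors, total)


def per_instance_ratios(instances) -> dict:
    """5xx ratio for each instance on its own."""
    return {i.name: error_ratio(i.errors_5xx, i.total) for i in instances}


# Made-up sample window: web-3 has very little traffic, so a handful of
# errors produces a huge per-instance ratio.
fleet = [
    InstanceCounts("web-1", total=1000, errors_5xx=2),
    InstanceCounts("web-2", total=1000, errors_5xx=3),
    InstanceCounts("web-3", total=10, errors_5xx=5),
]

print(f"aggregate: {aggregate_ratio(fleet):.4f}")
for name, ratio in per_instance_ratios(fleet).items():
    print(f"{name}: {ratio:.4f}")
```

In this made-up window the low-traffic instance shows a 50% error ratio while the fleet as a whole stays under 1%, which is one reason a per-instance alert usually needs a higher threshold than the aggregate one.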

It's best practice to alert on the ratio of 5xx responses to total responses rather than trying to pick a particular number of responses as a threshold. Also, you probably want to alert on the sum across all (non-healthcheck) URLs, as opposed to alerting on every URL separately. For URLs important to customer experience (such as login), additional individual alerts SHOULD be added so that errors on critical URLs aren't lost in the noise of other traffic.
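The ratio described above might be computed along these lines. This is only a sketch: the /healthz and /login paths, the (path, status) tuple shape, and the sample traffic are assumptions made for the example.

```python
from collections import Counter

# Assumed paths, for illustration only.
HEALTHCHECK_PATHS = {"/healthz"}
CRITICAL_PATHS = {"/login"}


def summarize(responses):
    """Return (overall_ratio, critical_ratios) for an iterable of (path, status).

    Healthcheck requests are excluded from both the numerator and the
    denominator. critical_ratios maps each critical path to its own 5xx ratio.
    """
    total = Counter()
    errors = Counter()
    for path, status in responses:
        if path in HEALTHCHECK_PATHS:
            continue
        total[path] += 1
        if 500 <= status <= 599:
            errors[path] += 1

    overall_total = sum(total.values())
    overall = sum(errors.values()) / overall_total if overall_total else 0.0
    critical = {
        p: (errors[p] / total[p] if total[p] else 0.0) for p in CRITICAL_PATHS
    }
    return overall, critical


# Made-up traffic: login is a small share of requests but is failing badly.
sample = (
    [("/healthz", 200)] * 50
    + [("/search", 200)] * 95
    + [("/login", 200)] * 3
    + [("/login", 503)] * 2
)

overall, critical = summarize(sample)
print(f"overall 5xx ratio: {overall:.2%}")
print(f"/login 5xx ratio: {critical['/login']:.2%}")
```

With this sample the overall ratio stays low even though a large share of login requests fail, which is the case the additional per-URL alert is meant to catch.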

The recommended threshold is a paging alert when the 5xx ratio exceeds 5%. The ratio should stay above the threshold for between 5 and 20 minutes before the alert fires.
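The "above the threshold for some minutes" condition could be sketched as follows. The function name, the one-sample-per-minute assumption, and the default of 5 minutes (the low end of the recommended range) are illustrative choices, not part of any specific alerting tool.

```python
def should_page(samples, threshold=0.05, hold_minutes=5):
    """Decide whether to page, given per-minute 5xx ratios (oldest first).

    Assumes exactly one sample per minute. Returns True only if each of the
    most recent hold_minutes samples is strictly above the threshold.
    """
    if len(samples) < hold_minutes:
        return False  # not enough history yet to say the condition has held
    return all(ratio > threshold for ratio in samples[-hold_minutes:])


# A sustained breach: the last five minutes are all above 5%.
sustained = [0.01, 0.06, 0.07, 0.08, 0.09, 0.10]

# A breach interrupted by one healthy minute inside the last five.
interrupted = [0.06, 0.07, 0.01, 0.08, 0.09, 0.10]

print(should_page(sustained))
print(should_page(interrupted))
```

A longer hold_minutes trades slower detection for fewer pages on short spikes; where in the 5 to 20 minute range to land depends on how quickly the service needs a human response.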