SRE Blog

Better living through reliability.

ProbeFailing Alert

2021-03-27

Our next alert (circling back a bit) in our series on base alerts is ProbeFailing. This fires when a single URL or instance of a server is unresponsive or returning errors.

Certain alerts, such as MemoryUsageTooHigh, rely on asking the server to describe itself (how much memory are you using, how many errors have you responded with, etc.). Other alerts mimic customer or API behavior to continually verify that certain behaviors still hold.

These behaviors could include but are not limited to:

checking HTTP 200 response code on critical endpoints
checking HTTP 200 response code on a ~health endpoint
checking a URL responds within a certain time period
ensuring a valid response using a username/password to log in to a website
ensuring certain HTML is returned in a response
using a certificate to log in to an SSH server
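As a rough illustration of the HTTP checks in the list above, here is a minimal Python sketch using only the standard library. The names ProbeResult, evaluate, and probe_http, along with the default 5 second limit, are assumptions made for this example and not part of any particular probing tool.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool
    status: Optional[int]
    elapsed_seconds: float
    reason: str


def evaluate(status, elapsed_seconds, body, max_seconds, expected_text=None):
    """Apply the probe's pass/fail rules to a response that was already fetched."""
    if status != 200:
        return ProbeResult(False, status, elapsed_seconds, f"unexpected status {status}")
    if elapsed_seconds > max_seconds:
        return ProbeResult(False, status, elapsed_seconds, "response too slow")
    if expected_text is not None and expected_text not in body:
        return ProbeResult(False, status, elapsed_seconds, "expected content missing")
    return ProbeResult(True, status, elapsed_seconds, "ok")


def probe_http(url, max_seconds=5.0, expected_text=None):
    """Fetch url once and report whether it behaves the way a customer would expect."""
    start = time.monotonic()
    try:
        # The timeout applies to each socket operation, so total time is checked again in evaluate().
        with urllib.request.urlopen(url, timeout=max_seconds) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as HTTPError and carry the status code.
        return ProbeResult(False, exc.code, time.monotonic() - start, f"HTTP {exc.code}")
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts (URLError is an OSError).
        return ProbeResult(False, None, time.monotonic() - start, f"request failed: {exc}")
    return evaluate(status, time.monotonic() - start, body, max_seconds, expected_text)
```

A scheduler would call probe_http against the load balancer or service URL and record the ok field of each result. Login and SSH probes follow the same pattern with a different client.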

Probes are generally not implemented on a per-server basis; they usually operate at the same level of abstraction as a customer. For replicated services, this means the checks are applied at the load balancer or service endpoint. This helps verify not only that individual servers are responding properly, but also that all upstream services (such as load balancers) are configured and working as expected.

As with Server500sTooHigh, a failing or flaky probe should be a rare occurrence. Since probes test replicated infrastructure, that infrastructure should present a very reliable interface to the customer. A failing probe is therefore very likely to indicate customer impact and to require immediate attention.

The recommended threshold is a paging alert at 100% failures. The failure rate should stay at that threshold for 5 minutes before the alert fires. Probes should be executed once every minute.
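To make the timing concrete, here is a small Python sketch of one way to read that rule: the alert fires only once every probe has failed for a continuous 5 minutes, and any success resets the clock. The ProbeFailingAlert class and its field names are hypothetical, and a real monitoring system would normally express this as an alert rule in its own configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeFailingAlert:
    # How long the probe must fail continuously before paging (5 minutes).
    for_seconds: float = 300.0
    # Timestamp of the first failure in the current failing streak, if any.
    failing_since: Optional[float] = None

    def observe(self, timestamp: float, probe_ok: bool) -> bool:
        """Record one probe result and return True if the alert should be firing."""
        if probe_ok:
            # Any success means the failure rate is below 100%, so reset the streak.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = timestamp
        return timestamp - self.failing_since >= self.for_seconds
```

With probes running once a minute and failing from time 0, observe returns False for the first five results and True on the sixth, at the 300 second mark.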