SRE Blog

Better living through reliability.

ProcessCrashlooping Alert

2021-03-24

Our next alert (skipping ahead a bit) in our series on base alerts is ProcessCrashlooping. It fires when instances within a logical group of servers or processes have unexpectedly restarted several times within the last several minutes. This behavior is sometimes also called "flapping".

A production server instance should never unexpectedly restart or crashloop. In distributed systems, errors and exceptions must always be handled gracefully and must not cause the server to restart. Most distributed systems run multiple identical copies of a server, with requests spread across them. This alert triggers on the sum of the restarts across all of these copies.
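As a rough illustration of summing restarts across copies, here is a minimal sketch in Python. The function name `restarts_per_group`, the event tuple layout, and the group and instance names are all hypothetical; a real monitoring system would usually express this as a query over a restart counter rather than as application code.

```python
from collections import defaultdict


def restarts_per_group(events, now, window_seconds):
    """Sum restart events per logical group within the trailing window.

    events: iterable of (group, instance, timestamp_seconds) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(int)
    for group, _instance, ts in events:
        # Count the restart no matter which copy it came from.
        if now - window_seconds < ts <= now:
            totals[group] += 1
    return dict(totals)


# Three different copies of "api" each restarted once.
events = [
    ("api", "api-0", 100),
    ("api", "api-1", 160),
    ("api", "api-2", 220),
    ("worker", "worker-0", 50),
]
totals = restarts_per_group(events, now=300, window_seconds=600)
```

Because the count is summed per group, three copies that each restart once look the same to the alert as one copy that restarts three times.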

Requests should be retried to recover from a server unexpectedly crashing, and servers should be configured to restart after a crash (backoff and jitter are out of scope for this document). So while crashlooping is often not an emergency in itself, it usually indicates a larger issue that needs to be addressed. Crashlooping is often a leading indicator of a near-term emergency.

This alert is intended not to fire for rolling (or other) restarts of processes. Picking how many times a process must restart, and within what time window, is more of an art than a science. You definitely do not want the alert to fire for releases, but you also don't want to raise either threshold so far that it only fires under egregious circumstances (e.g. 50 times or 4 hours). ProcessCrashloopingSlowly is a similar alert, with slightly different thresholds that catch roughly the same number of restarts over a longer period of time.

The recommended threshold is a paging alert at 3 or more restarts. The restart count should stay at or above that threshold for between 15 and 30 minutes before the alert fires.
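The "threshold plus hold time" rule can be sketched as follows. This is only an illustration under stated assumptions: `should_page`, `threshold`, and `hold_seconds` are hypothetical names, the samples are assumed to be periodic readings of the summed restart count for one group, and the 15-minute default is simply the lower end of the range given above.

```python
def should_page(samples, threshold=3, hold_seconds=15 * 60):
    """Return True if the restart count has stayed at or above `threshold`
    continuously for at least `hold_seconds`, up to the latest sample.

    samples: list of (timestamp_seconds, restart_count), oldest first
    (a hypothetical layout chosen for this sketch).
    """
    breach_start = None
    for ts, count in samples:
        if count >= threshold:
            # Remember when the current unbroken breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # Any dip below the threshold resets the hold timer.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


sustained = [(0, 3), (300, 4), (600, 3), (900, 5)]
interrupted = [(0, 3), (300, 1), (600, 3), (900, 5)]
```

With these inputs, `should_page(sustained)` is true because the count held at 3 or more for the full 15 minutes, while `should_page(interrupted)` is false because the dip at 300 seconds restarted the hold timer.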