SRE Blog

Better living through reliability.

CPUUsageTooHigh Alert

2020-07-30

The first alert (alphabetically) in our series on base alerts is CPUUsageTooHigh. It fires when a single server instance uses more CPU than expected.

Maintaining a robust distributed system requires enforcing limits on CPU usage, either implicitly (an EC2 instance has finite resources) or explicitly (a manually set limit on a k8s pod). Setting limits intentionally helps with application capacity planning (e.g., how much CPU is required to serve N requests per second?) and can help with cluster scheduling (such as k8s node capacity planning) by determining how many nodes you need to run all of production.

CPU capacity is finite, so it can be exhausted. In general, production systems should be load tested so that individual task utilization stays at or below 80%. This leaves headroom for unexpected traffic or expensive requests without impacting customers. Load testing and capacity planning can be advanced topics, so it's simpler to recommend more conservative alert thresholds.
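To make the capacity-planning question concrete, here is a minimal sketch of the arithmetic in Python. The function name, its parameters, and the example numbers are all hypothetical; the only figure taken from this post is the 80% utilization target. It assumes CPU cost scales linearly with request rate, which a real load test may not confirm.

```python
import math


def nodes_needed(target_rps, cpu_seconds_per_request, cores_per_node,
                 target_utilization=0.80):
    """Estimate how many nodes keep per-node CPU at or below the target.

    All inputs are illustrative assumptions; measure cpu_seconds_per_request
    with a load test rather than guessing it.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Total cores consumed at the target request rate.
    cores_required = target_rps * cpu_seconds_per_request
    # Cores we allow ourselves to use on each node, leaving headroom.
    usable_cores_per_node = cores_per_node * target_utilization
    return math.ceil(cores_required / usable_cores_per_node)


if __name__ == "__main__":
    # Hypothetical service: 2000 requests/s at 5 ms of CPU each, on 4-core nodes.
    print(nodes_needed(2000, 0.005, 4))
```

In this made-up example the service needs 10 cores in total, and each 4-core node contributes 3.2 usable cores at 80%, so the estimate rounds up to 4 nodes.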

Recommended thresholds are non-paging alerts at 70% and paging alerts at 90%. The 70% threshold allows ample time (weeks or months) to add capacity before the situation becomes an emergency or impacts customers. The 90% threshold pages because, under normal operating conditions, usage above this level is very likely to impact customers by running out of CPU. In both cases, the value should remain above the threshold for 10 to 20 minutes before the alert fires.
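The threshold logic can be sketched in a few lines of Python. This is an illustration, not the rule definition from any particular monitoring system: the function names are hypothetical, samples are assumed to arrive once per minute as CPU percentages, and the 15-minute window is simply one value picked from the 10 to 20 minute range.

```python
WARN_THRESHOLD = 70.0   # non-paging alert threshold, percent CPU
PAGE_THRESHOLD = 90.0   # paging alert threshold, percent CPU
SUSTAIN_MINUTES = 15    # assumption: any value from 10 to 20 fits the guidance


def sustained_above(samples, threshold, minutes):
    """Return True if the most recent `minutes` samples all exceed threshold.

    Assumes one sample per minute, oldest first.
    """
    if len(samples) < minutes:
        return False  # not enough history to call it sustained
    return all(s > threshold for s in samples[-minutes:])


def classify(samples, sustain_minutes=SUSTAIN_MINUTES):
    """Map recent CPU samples to 'page', 'warn', or 'ok'."""
    if sustained_above(samples, PAGE_THRESHOLD, sustain_minutes):
        return "page"
    if sustained_above(samples, WARN_THRESHOLD, sustain_minutes):
        return "warn"
    return "ok"


if __name__ == "__main__":
    print(classify([95.0] * 15))           # sustained above 90%
    print(classify([75.0] * 15))           # sustained above 70% only
    print(classify([95.0] * 14 + [50.0]))  # spike ended before the window closed
```

Requiring every sample in the window to exceed the threshold keeps a short spike from firing the alert, which is the point of the waiting period.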