SRE Blog

Better living through reliability.

MemoryUsageTooHigh Alert


The next alert (alphabetically) in our series on base alerts is MemoryUsageTooHigh, which fires when a single server or container instance is using more memory/RAM than expected.

Maintaining a robust distributed system requires enforcing limits on memory usage, whether implicitly (by running on an EC2 instance with fixed resources) or explicitly (by setting limits on a k8s pod). Intentionally setting limits helps with application capacity planning (how much memory is required to serve N requests per second?) and ensures that multiple applications on a single host will not collectively use more memory than the host/EC2 instance/VM has.
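As one concrete illustration of the explicit approach, here is a minimal k8s pod spec with a memory request and limit. The workload name, image, and sizes are hypothetical placeholders, not values from a real deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server              # hypothetical workload name
spec:
  containers:
  - name: app
    image: example/app:latest   # placeholder image
    resources:
      requests:
        memory: "512Mi"         # the scheduler reserves this much on the node
      limits:
        memory: "1Gi"           # the container is OOM-killed if it exceeds this
```

The gap between the request and the limit is the per-container headroom the kubelet allows before intervening.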

Memory capacity is fixed, so it can be exhausted. In general, production systems should be load tested such that individual task utilization is no more than 80%. This leaves headroom for unintended additional traffic or expensive requests without impacting customers. Load testing and capacity planning can be advanced topics, so it's simpler to recommend more conservative alert thresholds.
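The 80% rule can be expressed as a tiny headroom calculation. This is just a sketch of the arithmetic; the function name and the example numbers are made up for illustration:

```python
def memory_headroom(used_bytes: int, total_bytes: int, target: float = 0.80) -> float:
    """Fraction of total memory still available before hitting the target utilization.

    A negative result means the task is already past its utilization budget.
    """
    utilization = used_bytes / total_bytes
    return target - utilization

# Example: 6 GiB used on an 8 GiB host is 75% utilization,
# leaving 5% of headroom against an 80% target.
print(memory_headroom(6 * 2**30, 8 * 2**30))
```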

Recommended thresholds are a non-paging alert at 70%, which allows ample time (weeks or months) to add capacity before the situation becomes an emergency or customer-impacting, and a paging alert at 90%, since under normal operating conditions usage above that level is very likely to impact customers: once memory runs out, the kernel typically kills processes via the OOM killer, or the host starts using swap (if configured). Either alert should require the value to stay above its threshold for 10 to 20 minutes before firing, so that brief spikes don't page anyone.
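For a system scraped by Prometheus with node_exporter, the two thresholds above might be sketched as alerting rules like the following. This assumes the standard node_exporter memory metrics; the rule group name and severity labels are placeholders you would adapt to your own routing:

```yaml
groups:
- name: memory
  rules:
  - alert: MemoryUsageHigh            # non-paging: file a ticket, add capacity
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 70
    for: 15m                          # must hold for 15 minutes before firing
    labels:
      severity: warning
  - alert: MemoryUsageTooHigh         # paging: likely customer impact
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 15m
    labels:
      severity: page
```

The `for:` clause is what implements the "above the threshold for 10 to 20 minutes" requirement, filtering out transient spikes.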