SRE Blog

Better living through reliability.

DiskUsageTooHigh Alert

2020-08-01

Our next alert (alphabetically) in our series on base alerts is DiskUsageTooHigh. This fires when a single instance of a pod or server is using more disk than is expected.

Maintaining a robust distributed system requires enforcing limits on disk usage, either implicitly (by running on an EC2 instance with limited resources) or explicitly (by setting limits on a k8s pod). Setting limits intentionally helps with application capacity planning, and possibly with cluster scheduling (such as k8s node capacity planning) to determine how many nodes you need to run all of production.

Disk capacity is fixed, so it can be used up. In general, an individual production system's disk utilization should not exceed 80%. This leaves headroom for unexpected additional traffic that increases disk usage, or for additional logs written to disk, without impacting customers. Load testing and capacity planning can be advanced topics, so it's simpler to recommend more conservative alert thresholds.
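To make the 80% guideline concrete, here is a minimal Python sketch of how utilization and remaining headroom could be computed. The function names (`percent_used`, `headroom_bytes`, `disk_usage_percent`) are illustrative and not part of any monitoring product; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return disk utilization as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def headroom_bytes(total_bytes: int, used_bytes: int,
                   target_percent: float = 80.0) -> int:
    """Bytes that can still be written before utilization passes target_percent.

    A negative result means the disk is already over the target.
    """
    return int(total_bytes * target_percent / 100.0) - used_bytes


def disk_usage_percent(path: str = "/") -> float:
    """Utilization of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used)


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"used: {disk_usage_percent('/'):.1f}%")
    print(f"headroom to 80%: {headroom_bytes(usage.total, usage.used)} bytes")
```

The number printed here may differ slightly from what other tools report, since some filesystems reserve blocks that ordinary users cannot write to.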

Recommended thresholds are a non-paging alert at 70% and a paging alert at 90%. The 70% threshold allows ample time (weeks or months) to add capacity before the situation becomes an emergency or impacts customers. The 90% threshold pages because, under normal operating conditions, usage above this level is very likely to impact customers: running out of disk often causes operations to stop or fail. The value should remain above the threshold for between 10 and 20 minutes before the alert fires.
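The threshold logic can be sketched in Python as follows. This is an illustration under stated assumptions, not the configuration of any particular alerting system: the names `sustained_above` and `classify` are hypothetical, samples are assumed to arrive as `(timestamp_seconds, usage_percent)` pairs ordered oldest first, and the 15-minute hold is simply a value picked from the 10 to 20 minute range.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, disk usage in percent)

WARN_PERCENT = 70.0      # non-paging threshold
PAGE_PERCENT = 90.0      # paging threshold
HOLD_SECONDS = 15 * 60   # assumption: a value within the 10-20 minute range


def sustained_above(samples: List[Sample], threshold: float,
                    hold_seconds: float) -> bool:
    """True if usage has stayed above threshold for at least hold_seconds,
    measured up to the newest sample. Any dip resets the clock."""
    if not samples:
        return False
    breach_start = None
    for ts, pct in samples:
        if pct > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # usage dipped, so the breach is over
    newest_ts = samples[-1][0]
    return breach_start is not None and newest_ts - breach_start >= hold_seconds


def classify(samples: List[Sample], hold_seconds: float = HOLD_SECONDS) -> str:
    """Return "page", "warn", or "ok" for a series of usage samples."""
    if sustained_above(samples, PAGE_PERCENT, hold_seconds):
        return "page"
    if sustained_above(samples, WARN_PERCENT, hold_seconds):
        return "warn"
    return "ok"
```

Requiring the breach to be sustained keeps a short-lived spike, such as a temporary file written and then deleted, from triggering either alert.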

Services on servers/VMs and especially services like databases or file servers should have this alert.

Batch jobs, cron jobs, etc. (even though they may be production) should not have this alert configured. It is useful to know whether a job's disk usage is configured properly, but most often the primary metric for batch jobs is duration to completion. So if a job is completing in the time you expect, you probably don't care how much disk it's using.