SRE Blog

Better living through reliability.

JVMHeapUsageTooHigh Alert


The next alert (alphabetically) in our series on base alerts is JVMHeapUsageTooHigh. This fires when a single instance of a server is using more JVM heap than expected.

Maintaining a robust distributed system requires enforcing limits on memory usage, whether implicitly (by running on an EC2 instance with limited resources) or explicitly (by setting limits on a k8s pod, for example). Intentionally setting limits helps with application capacity planning (how much memory is required to serve N requests per second?), and it ensures that multiple applications on a single host will not collectively use more memory than the host/EC2 instance/VM has.

Enforcing memory limits happens at many levels of the stack (e.g. the VM or container level), and it has implications at the application runtime level as well, at least for runtimes that use garbage collection. So to ensure reliability when using the JVM, you MUST manually set certain garbage collection parameters (most importantly the maximum heap size), just as you must set resource-level limits.
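As a minimal sketch, assuming a HotSpot-style JVM started with an explicit maximum heap size (e.g. `java -Xmx2g ...`), you can verify the effective heap ceiling at runtime via the standard `MemoryMXBean`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapLimitCheck {
    public static void main(String[] args) {
        // Heap usage as reported by the JVM's own memory manager.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        // "max" is the ceiling the collector is allowed to grow the heap
        // to -- the value controlled by flags like -Xmx. It may report -1
        // if the JVM considers the maximum undefined.
        long maxBytes = heap.getMax();
        long usedBytes = heap.getUsed();

        System.out.printf("heap used: %d MiB of %d MiB max%n",
                usedBytes >> 20, maxBytes >> 20);
    }
}
```

Exporting these two numbers (used and max) from each instance is what gives the alert below its usage ratio.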

Memory capacity is fixed, so it can be exhausted. In general, production systems should be load tested such that individual task utilization is no more than 80%. This leaves headroom for unintended additional traffic or expensive requests without impacting customers. Load testing and capacity planning can be advanced topics, though, so it's simpler to recommend more conservative alert thresholds.
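Concretely, the 80% target is just arithmetic against the provisioned heap. A small sketch (the 4 GiB capacity here is a hypothetical figure, not a recommendation):

```java
public class HeadroomTarget {
    // Maximum steady-state utilization established by load testing.
    static final double TARGET_UTILIZATION = 0.80;

    // Bytes of heap a task should stay under at steady state.
    static long targetBytes(long heapCapacityBytes) {
        return (long) (heapCapacityBytes * TARGET_UTILIZATION);
    }

    public static void main(String[] args) {
        long capacity = 4L << 30; // hypothetical 4 GiB heap
        System.out.printf("keep steady-state usage under %d MiB%n",
                targetBytes(capacity) >> 20);
    }
}
```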

Recommended thresholds are a non-paging alert at 70%, which allows ample time (weeks or months) to add capacity before the situation becomes an emergency or customer-impacting, and a paging alert at 90%, since under normal operating conditions usage above that level is very likely to impact customers: running out of memory typically causes the server to also run out of CPU due to garbage collection thrashing. The value should stay above the threshold for between 10 and 20 minutes before the alert fires.
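The evaluation logic above can be sketched as follows. This is our own illustration, not the syntax of any particular monitoring system; the 15-minute sustain window is one choice within the 10-to-20-minute range:

```java
import java.time.Duration;

public class HeapAlertPolicy {
    enum Level { NONE, TICKET, PAGE }

    static final double TICKET_THRESHOLD = 0.70; // non-paging: plan capacity
    static final double PAGE_THRESHOLD = 0.90;   // paging: imminent impact
    // How long usage must remain above a threshold before firing.
    static final Duration FOR_DURATION = Duration.ofMinutes(15);

    /**
     * Classify a heap-usage ratio (used / max) that has already been
     * continuously above its level for {@code sustained}.
     */
    static Level evaluate(double usedRatio, Duration sustained) {
        if (sustained.compareTo(FOR_DURATION) < 0) {
            return Level.NONE; // not sustained long enough to fire
        }
        if (usedRatio >= PAGE_THRESHOLD) return Level.PAGE;
        if (usedRatio >= TICKET_THRESHOLD) return Level.TICKET;
        return Level.NONE;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(0.75, Duration.ofMinutes(20))); // TICKET
        System.out.println(evaluate(0.95, Duration.ofMinutes(20))); // PAGE
        System.out.println(evaluate(0.95, Duration.ofMinutes(5)));  // NONE
    }
}
```

The sustain window is what keeps a single large allocation burst or a brief GC-deferred spike from paging anyone.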