OOMKilled (Exit Code 137): Understanding the Kubernetes Memory Kill
When a Kubernetes pod is terminated with exit code 137, the container's process received SIGKILL (137 = 128 + signal 9). When the pod's status reason is OOMKilled, that signal came from the kernel because the container ran out of memory. Understanding OOMKilled errors is crucial for managing Kubernetes workloads effectively.
What is OOMKilled?
OOMKilled stands for "Out Of Memory Killed." Kubernetes enforces the memory limit set in resources.limits.memory through the container's cgroup; when the container's memory usage exceeds that limit, the Linux kernel's OOM killer terminates the process.
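As a minimal sketch (the pod name and image are placeholders), this is where that limit comes from in a pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo            # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      resources:
        requests:
          memory: "256Mi"      # what the scheduler reserves on a node
        limits:
          memory: "512Mi"      # hard cap; exceeding it triggers the OOM kill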
Common Causes
1. Memory Limits Too Low
Setting memory limits that are too low for your application's actual requirements will cause OOMKilled errors.
2. Memory Leaks
Applications with memory leaks will gradually consume more memory until they hit the limit.
3. Sudden Memory Spikes
Unexpected traffic or workloads can cause sudden memory spikes that exceed limits.
4. Incorrect Resource Requests
Setting memory requests well below what your application actually uses lets the scheduler pack more pods onto a node than the node can really hold. Under memory pressure this can end in evictions or OOM kills even for pods that are still under their own limits.
How to Diagnose OOMKilled
Check Pod Status
kubectl get pods
kubectl describe pod <pod-name>
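In the describe output, an OOM-killed container shows Last State: Terminated with Reason: OOMKilled and Exit Code: 137. The same reason can be read straight from the pod status (a sketch that assumes a single-container pod; adjust the index otherwise):

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'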
Check Container Logs
kubectl logs <pod-name> -c <container-name>
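Because the OOM-killed container has usually been restarted by the time you look, add --previous to read the logs of the terminated instance:

kubectl logs <pod-name> -c <container-name> --previous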
Check Events
kubectl get events --field-selector involvedObject.name=<pod-name>
Prevention Strategies
1. Set Appropriate Memory Limits
Analyze your application's memory usage and set limits accordingly. Consider:
- Baseline memory usage
- Peak memory usage
- Memory growth patterns
- Buffer for unexpected spikes
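As a rough sketch with made-up numbers: suppose profiling shows about 300Mi at baseline and 450Mi at peak. The request tracks the baseline and the limit covers the peak plus a buffer for spikes:

resources:
  requests:
    memory: "300Mi"    # baseline usage, used for scheduling decisions
  limits:
    memory: "600Mi"    # peak usage plus headroom for unexpected spikes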
2. Monitor Memory Usage
Use monitoring tools to track memory usage over time and identify trends.
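If the metrics-server is installed in the cluster, a quick point-in-time check is available from kubectl itself:

kubectl top pod <pod-name> --containers

For trends over time, rely on a full monitoring stack (for example Prometheus and Grafana) rather than spot checks.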
3. Implement Resource Requests
Set memory requests that match your application's baseline usage so the scheduler places pods on nodes that can actually accommodate them. If requests equal limits, the pod gets the Guaranteed QoS class and is the last to be evicted under node memory pressure.
4. Use Horizontal Pod Autoscaling
HPA can add replicas when average memory utilization rises, which relieves pressure on individual pods when memory use scales with request load that can be spread across replicas. It will not fix a per-pod memory leak.
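As a sketch using the autoscaling/v2 API (the Deployment name, replica range, and threshold are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app                # placeholder Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75 # scale out before pods approach their limits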
Best Practices
- Start Conservative: Set limits higher initially and tune down based on observed usage
- Monitor Continuously: Track memory usage patterns over time
- Test Under Load: Test your applications under expected load conditions
- Review Regularly: Regularly review and adjust memory limits based on usage patterns
Conclusion
OOMKilled errors can be disruptive, but with proper monitoring, resource management, and prevention strategies, you can minimize their occurrence and impact.