AWS, Microsoft Azure, and Google Cloud Platform are incredible cloud hosting solutions, offering scalability and flexibility for a wide range of applications. However, if not configured correctly, they can also become quite expensive. Today, we’ll go over some basic tuning strategies to help you optimize CPU consumption on Kubernetes clusters using the Horizontal Pod Autoscaler (HPA). Implementing these optimizations can lead to significant cost savings, potentially thousands of dollars per month for a mid-size cloud-based project: our department recently applied this technique and it noticeably reduced our AWS bill.
General Guidelines for Resource Allocation
Before diving into specific optimizations, let’s discuss some general (and admittedly debatable) principles to keep in mind.
First, when optimizing AWS EC2 utilization, there are two main resource types to focus on: CPU and memory allocations. The key differences between them are:
- For a typical web service, CPU is the more expensive resource in terms of cloud pricing.
- Memory shortages cause crashes, CPU shortages cause slowdowns – if an application runs out of memory, it fails with an “Out of Memory” (OOM) error. If it lacks CPU, it slows down but keeps running.
Note that Kubernetes enforces the configured limits: if a container’s memory usage exceeds its limit, it is OOM-killed and the pod is restarted; if its CPU usage hits the limit, it is throttled. In addition, when a node runs short of resources, pods using more than they requested are the first candidates for eviction. So requests and limits should be configured with this behaviour in mind.
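To make the distinction concrete, here is a minimal resources fragment (the numbers are purely illustrative) that sets both CPU and memory requests and limits; going over the memory limit gets the container OOM-killed, while going over the CPU limit only throttles it:

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "400m"      # exceeding this only throttles the container
    memory: "256Mi"  # exceeding this gets the container OOM-killed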
Of course, a typical web application doesn’t experience constant load; instead, traffic spikes with user activity. This means that while there are times when more CPU power is required, most of the time the demand is much lower. Ideally, we would like dynamic vertical resource allocation, but that feature is still in beta: Kubernetes is actively developing it, and you can learn more here.
So the solution is to combine two techniques that together mimic the behaviour of dynamic vertical resource allocation:
- Fractional CPU cores
- Horizontal Pod Autoscaler (HPA)
Fractional CPU cores explained
One not-so-obvious feature of k8s is that it allows the use of fractional CPU cores: instead of allocating an entire core per container, k8s measures allocations in millicores (m), where 1000m equals 1 full core, i.e. if a container requests 250m, it receives 25% of a CPU core. Under the hood this is achieved by time-sharing a single CPU between several containers.
In practice, the resource specifications might look something like this:
spec:
  containers:
  - name: example-container
    image: nginx
    resources:
      requests:
        cpu: "100m"
      limits:
        cpu: "400m"
For small web APIs that don’t handle heavy traffic, this adjustment alone can be highly effective: most likely your average usage does not exceed 100m, so compared to reserving a full core you cut the CPU portion of the cloud bill by roughly a factor of ten. To tune things further and handle other cases, you can also configure the HPA.
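A quick way to sanity-check that assumption (assuming the Kubernetes metrics server is installed in your cluster) is to look at per-container usage while the service handles normal traffic:

kubectl top pod --containers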
HPA explanations and examples
The Horizontal Pod Autoscaler (HPA) in Kubernetes automatically adjusts the number of pods in a deployment based on observed CPU utilization (or other custom metrics). When the CPU usage exceeds a specified threshold, HPA increases the number of pods to distribute the load. Conversely, if the utilization drops below the target, HPA scales down the number of pods to free up resources. This dynamic scaling ensures that your application has the necessary resources during high demand while minimizing costs during quieter periods.
A typical example of such a config looks like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
This example configures the application to scale between 2 and 10 pods depending on CPU demand, with the target average CPU utilization set to 50%. Note that setting the upper limit is quite crucial: sometimes metrics are not reported correctly, leading to “infinite” resource allocations that can cause trouble for the entire cluster.
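As a side note, for quick experiments roughly the same autoscaler can also be created imperatively (assuming your deployment is named example-deployment, as in the manifest above):

kubectl autoscale deployment example-deployment --cpu-percent=50 --min=2 --max=10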
To make sure the autoscaler works as expected, you can monitor Kubernetes events:
kubectl get events --sort-by='.lastTimestamp'
and look for events about pods being added and removed in response to the metrics.
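You can also inspect the HPA object itself (using the example-hpa name from the manifest above) to see the current metric values, the replica count, and the scaling events it has produced:

kubectl describe hpa example-hpa
kubectl get hpa example-hpa --watch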
Final thoughts
Of course, to set sensible baseline numbers you need to know how many resources your application actually consumes, and determining the right CPU and memory allocations requires proper monitoring tools. How do you do that?
Well, the most straightforward approach is to simulate user load on your backend using Postman, k6, Python scripts, etc., and manually check resource usage by running this command:
kubectl top pod
a few times.
This can already give you some quick insights. A slight improvement would be to automate it and collect the data a few times during the run. However, a more effective method is to work with your DevOps team to set up a dashboard that collects and shows the desired metrics. Then you can monitor it and iteratively tune your setup when needed.
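If a dashboard is not yet available, a rough sketch of that quick-and-dirty automation could look like the snippet below (the interval and duration are arbitrary, and kubectl top again requires the metrics server):

# Sample per-container usage every 15 seconds for roughly 10 minutes
# and append it to a file you can inspect after the load test.
for i in $(seq 1 40); do
  date >> usage.log
  kubectl top pod --containers >> usage.log
  sleep 15
done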
Finally, don’t forget to be environment-specific: your non-prod environments can most likely run on a much smaller setup than prod.