30 Oct, 2020
Problem Statement: The Kubernetes Cluster Autoscaler fails to stabilize the environment when an AWS Auto Scaling Group (ASG) has no spot pools left to launch instances from. If you run spot instances with Kubernetes on AWS, you know the pain of stabilizing the environment. The main culprit is that spot instances become unavailable in some AZs for part of the day. This not only blocks the launch of new instances but also knocks out existing ones, making the environment unstable.
Depending on your use case, some workarounds could be:
- Do not restrict the ASG to any specific AZ
- Configure multiple instance types with the same CPU and memory resources in one ASG
- Fall back to on-demand instances if spot instances fail to launch
- Reduce the cluster autoscaler's node provision time (keep it at 5 minutes or more, as a lower value may affect scenarios that currently work)
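As a rough illustration of the second and third workarounds, the sketch below builds the `MixedInstancesPolicy` structure that the AWS Auto Scaling API accepts when a group is created or updated. The helper name, the launch template name and the instance types are assumptions made for this example; note also that `OnDemandPercentageAboveBaseCapacity` sets a fixed on-demand share rather than a fallback that kicks in only when spot launches fail.

```python
def build_mixed_instances_policy(launch_template_name, instance_types,
                                 on_demand_percentage=0):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group.

    Hypothetical helper: the names passed in are placeholders, not values
    from the original setup.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type; pick types with the same
            # vCPU and memory so the cluster autoscaler's sizing holds.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            # 0 means 100% spot; a higher value keeps a fixed on-demand share.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Example: c5.xlarge and c5a.xlarge both offer 4 vCPUs and 8 GiB of memory.
policy = build_mixed_instances_policy("k8s-workers", ["c5.xlarge", "c5a.xlarge"])
```

The resulting dict could be passed as the `MixedInstancesPolicy` argument of boto3's `create_auto_scaling_group` or `update_auto_scaling_group`.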
An exceptional scenario where none of these workarounds is effective (one ASG per AZ):
As you can see in the image above, I was using 3 AZs (us-east-1a, us-east-1c and us-east-1d) out of the 6 AZs available in us-east-1. The instance type I was using was from the c5 series. So the easiest fix would have been to add the c5a series as the second priority in the ASG, which could have worked for ASG-1. Right when you think you have tackled the problem, you discover that the c5a series is available only in the us-east-1a and us-east-1b availability zones. This is where that plan of action fails.
When the Kubernetes cluster autoscaler tries to launch instances in AZs other than 1a and 1b, and the spot capacity in those AZs is exhausted, it fails to scale up your environment.
Services used in the initial setup
- AWS Spot instances
- Self-hosted Kubernetes worker nodes on AWS Auto Scaling Groups
- Kubernetes Cluster Autoscaler
Services integrated with initial setup to run the custom logic (Final solution):
- SSM (Systems Manager)
- SNS (Simple Notification Service)
- Multi-AZ Kubernetes cluster worker node setup
High level overview of the final solution (Spot Instance launch failure fallback strategy)
No spot pool left in ASG-1 > fallback to ASG-2
No spot pool left in ASG-2 > fallback to ASG-3
No spot pool left in ASG-3 > fallback to ASG-4
No spot pool left in ASG-4 > fallback to ASG-1
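The chain above is circular: each ASG falls back to the next one, and the last wraps around to the first. A minimal sketch of that ordering, assuming the ASGs are kept in a fixed list (the function name and the list are illustrative, not part of the original setup):

```python
def fallback_order(asg_names, failed_asg):
    """Return the other ASGs in the order they would be tried, wrapping around."""
    start = asg_names.index(failed_asg)
    # Everything after the failed ASG first, then everything before it.
    return asg_names[start + 1:] + asg_names[:start]


ASGS = ["ASG-1", "ASG-2", "ASG-3", "ASG-4"]
print(fallback_order(ASGS, "ASG-3"))  # ['ASG-4', 'ASG-1', 'ASG-2']
```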
Requirements for ASGs to implement this logic
- Configure every ASG with 100% spot instances and at least one instance type
- Add a unique tag (the same one on each) to all ASGs linked with the cluster autoscaler
- Attach the same SNS topic to every ASG
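The last two requirements could be applied with boto3 roughly as follows. This is a sketch under assumptions: the tag key `spot-fallback-group` and the function name are made up for illustration, and `autoscaling` is expected to be a client such as `boto3.client("autoscaling")`, passed in by the caller so the snippet has no hard dependency on AWS credentials.

```python
FALLBACK_TAG = "spot-fallback-group"  # hypothetical tag key
LAUNCH_ERROR = "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"


def prepare_asgs(autoscaling, asg_names, topic_arn):
    """Tag each ASG and subscribe the shared SNS topic to its launch errors."""
    for name in asg_names:
        # Same tag on every ASG so the fallback logic can find its peers.
        autoscaling.create_or_update_tags(Tags=[{
            "ResourceId": name,
            "ResourceType": "auto-scaling-group",
            "Key": FALLBACK_TAG,
            "Value": "true",
            "PropagateAtLaunch": False,
        }])
        # Publish only launch failures to the shared topic.
        autoscaling.put_notification_configuration(
            AutoScalingGroupName=name,
            TopicARN=topic_arn,
            NotificationTypes=[LAUNCH_ERROR],
        )
```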
Let’s assume there are no spot pools left in the AZs of ASG-1, ASG-2 and ASG-3.
- The cluster autoscaler requests ASG-1 to scale up by ‘n’ instances.
- The request fails because there is no capacity left in that AZ, and the ASG throws an error message in its events, which is published to the SNS topic.
- SNS triggers the custom logic through a Lambda function, which takes the following steps:
- (a) Fetch the other ASGs with the unique tag
- (b) Check which ASGs have the capacity to scale up by ‘n’ instances (let’s say it is ASG-2)
- (c) Check whether a key already exists in the SSM Parameter Store for this ASG (explained in step d); if yes, take no action; if no, continue with steps d to g
- (d) Create a new Parameter Store key in SSM that has the name of ASG-2 (this is done so that the logic is not triggered again by redundant error messages, as AWS throws multiple error messages for a single launch failure)
- (e) Increase the desired count of ASG-2 by ‘n’ instances
- (f) Decrease the desired count of ASG-1 by ‘n’ instances
- (g) Delete the SSM parameter created in step d
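The Lambda steps above might look roughly like the sketch below. It is not the original function. Several details are assumptions: the tag key and parameter path are invented, ‘n’ is taken to be the gap between the failed ASG's desired capacity and its in-service instances, "has capacity" is read as headroom below `MaxSize`, and the fallback ring is ordered by ASG name. The `autoscaling` and `ssm` arguments are expected to be boto3 clients.

```python
import json

FALLBACK_TAG = "spot-fallback-group"  # hypothetical tag key
LAUNCH_ERROR = "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"


def tagged_asgs(autoscaling, tag_key):
    """Return ASGs carrying the shared tag (first result page only, for brevity)."""
    groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
    return [g for g in groups
            if any(t["Key"] == tag_key for t in g.get("Tags", []))]


def handle_launch_failure(event, autoscaling, ssm):
    """Shift the failed ASG's unfilled capacity to the next ASG with headroom."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("Event") != LAUNCH_ERROR:
        return None
    failed_name = message["AutoScalingGroupName"]

    # Fetch the ASGs with the unique tag.
    groups = {g["AutoScalingGroupName"]: g
              for g in tagged_asgs(autoscaling, FALLBACK_TAG)}
    if failed_name not in groups:
        return None
    failed = groups[failed_name]

    # Assumption: 'n' is desired capacity minus instances already in service.
    in_service = sum(1 for i in failed["Instances"]
                     if i["LifecycleState"] == "InService")
    n = failed["DesiredCapacity"] - in_service
    if n <= 0:
        return None

    # Walk the ring after the failed ASG; take the first one with room for n.
    names = sorted(groups)
    start = names.index(failed_name)
    ring = names[start + 1:] + names[:start]
    target_name = next(
        (name for name in ring
         if groups[name]["DesiredCapacity"] + n <= groups[name]["MaxSize"]),
        None)
    if target_name is None:
        return None

    # Hypothetical key layout; assumes ASG names are valid in parameter names.
    lock_name = f"/spot-fallback/{target_name}"
    # Skip redundant error messages for a failure already being handled.
    if ssm.get_parameters(Names=[lock_name])["Parameters"]:
        return None
    ssm.put_parameter(Name=lock_name, Value=failed_name, Type="String")
    try:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=target_name,
            DesiredCapacity=groups[target_name]["DesiredCapacity"] + n)
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=failed_name,
            DesiredCapacity=failed["DesiredCapacity"] - n)
    finally:
        # Always release the lock, even if a capacity update fails.
        ssm.delete_parameter(Name=lock_name)
    return target_name
```

Inside Lambda, a handler would call `handle_launch_failure(event, boto3.client("autoscaling"), boto3.client("ssm"))`. The check-then-create lock is not atomic, so two simultaneous invocations could both pass the check; it only reduces duplicate handling, in the same spirit as the steps above.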
Similarly, ASG-2 (us-east-1c) will fall back to ASG-3 (us-east-1d), and ASG-3 (us-east-1d) will fall back to ASG-4 (us-east-1b), which has the capacity and scales up the environment. For the best results from this algorithm (without the last resort of falling back to on-demand instances), use all the AZs of the region you are working in: this increases the number of spot pools without increasing your compute cost.
To conclude: one run of this logic through Lambda consumes a maximum of 80 MB of the function's memory, with an execution time of 2 seconds.
About the Author:
Neha Agarwal is a cloud technology enthusiast who likes constructing resilient infrastructures and architectures for cloud-native applications. She specializes in Amazon Web Services and designs hyper-flexible infrastructures that are based on ‘Design for Failure’ and ‘Defense in Depth’ principles. She loves automating manual processes, strives for simplicity and firmly believes in making applications more consistent and reliable over time.