Azure Kubernetes Service (AKS) - Planning

When creating an AKS cluster it is easy to get started with the quick start templates available for ARM and Terraform. This is fine for a lab environment, but when it comes to building a production-ready AKS cluster there are some decisions you need to make at the beginning to save yourself unnecessary pain further down the line. In this article we will explore some of the key decision areas, which I hope will help you in deploying your AKS cluster.


Cluster Node SKUs

Every cluster must have a default node pool. You can add additional node pools later, and these can use different SKUs, but changing the SKU on the default node pool would cause the cluster to be rebuilt. Plan what SKU size is sensible for your workload. You can scale out easily but you cannot scale up easily, so it is worth getting this right. We cover adding additional node pools below, which can help you get around changing the default node pool.

Load Balancer SKU and Multiple Node Pools

When you come to deploy your AKS cluster you can choose a basic or standard load balancer SKU. The default when deploying via Terraform is the standard SKU. Again, switching this is a breaking change, so choose wisely. The biggest limitation of the basic SKU is that you cannot add multiple node pools. As this guide states:

"The AKS cluster must use the Standard SKU load balancer to use multiple node pools, the feature is not supported with Basic SKU load balancers."

Why is this important?

As we mentioned, you can have multiple node pools. This allows you to run Linux or Windows node pools, and each pool can be assigned a different SKU. If you are using the basic load balancer SKU you remove the ability to have multiple node pools. This is because AKS needs an Azure managed device to manage the routing to the node pools.

Consider the scenario where you provisioned the default node pool with a small VM SKU, started using your cluster and then hit resource issues. As we mentioned, changing the SKU is a breaking change. The alternative is to add a new pool with a larger VM SKU and move some or all services over to the new pool. You cannot remove the default pool but you can scale it down to 1 node.

If you hit the issue again in the future you can repeat the process: add node pool three, for example, and decommission node pool two once you have re-deployed your services.

Basic or Advanced Networking 

You have a choice when you come to create your cluster: Basic networking (Kubenet) or Advanced networking (Azure Container Networking Interface, CNI).

With Advanced networking, as the name suggests, there is more configuration required. Each pod gets an IP address from the subnet and can be accessed directly. Because these IPs come from your own network they must be unique, so you need to plan for this, which we will cover fully in the next section.

When configuring advanced networking you also define a Kubernetes service address range. This range is separate, is used by AKS, and should not overlap with your other IP ranges.

As you can imagine, you cannot change between these two modes, so review the options and make your decision on how to configure your network.
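To make the overlap rule concrete, here is a minimal Python sketch that checks a proposed service address range against your other ranges using the standard library's ipaddress module. The function name and every CIDR value below are hypothetical and chosen only for illustration; substitute the ranges from your own network plan.

```python
import ipaddress


def find_overlaps(service_cidr, other_cidrs):
    """Return the entries of other_cidrs that overlap service_cidr."""
    service = ipaddress.ip_network(service_cidr)
    return [
        cidr
        for cidr in other_cidrs
        if service.overlaps(ipaddress.ip_network(cidr))
    ]


if __name__ == "__main__":
    # Hypothetical ranges: an AKS subnet and a peered network.
    existing = ["10.0.4.0/22", "10.1.0.0/16"]
    print(find_overlaps("10.0.0.0/16", existing))    # clashes with the AKS subnet
    print(find_overlaps("172.16.0.0/16", existing))  # no clash
```

An empty list means the proposed service range is clear of the ranges you passed in. It says nothing about ranges you forgot to list, such as on-premises networks reached over VPN or ExpressRoute.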

Subnet Size

When we are using CNI we need to plan out our subnets. Review this document on how to plan your network. The important thing to note is that CNI pre-allocates the IP addresses for the pods. This is where the max pods per node setting comes into effect.

Max Pods per Node

For full details on planning your IP addressing and how max pods per node impacts it, review this document. In short, the default setting for max pods per node is 30. This means that for every node you deploy or add when scaling out your node pool, Azure will allocate 31 IPs: 1 for the node and 30 for the pods.
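The arithmetic above can be turned into a rough subnet size estimate. The Python sketch below is illustrative only. It assumes one spare node's worth of addresses for upgrades and five addresses reserved by Azure in the subnet; both figures are assumptions you should check against the current Azure documentation. The function names are hypothetical.

```python
AZURE_RESERVED_IPS = 5  # assumption: addresses Azure reserves in each subnet


def required_ips(max_nodes, max_pods_per_node=30, surge_nodes=1):
    """IPs consumed by the node pool: 1 per node plus 1 per possible pod.

    surge_nodes is an assumed allowance for the extra node(s) created
    during an upgrade.
    """
    return (max_nodes + surge_nodes) * (max_pods_per_node + 1)


def smallest_prefix(ip_count):
    """Smallest IPv4 prefix length whose subnet holds ip_count usable IPs."""
    total = ip_count + AZURE_RESERVED_IPS
    host_bits = (total - 1).bit_length()  # ceil(log2(total)) without floats
    return 32 - host_bits


if __name__ == "__main__":
    ips = required_ips(max_nodes=10)
    print(ips, "IPs needed, so at least a /%d subnet" % smallest_prefix(ips))
```

For a pool that may grow to 10 nodes with the default of 30 pods per node, this works out to 11 x 31 = 341 addresses, which no longer fits in a /24 once the reserved addresses are added.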

This takes some planning. Firstly, based on your SKU and the pods you will be deploying, what kind of resources will they realistically need? How many pods can fit on a node before you would need to scale out? Will you hit the node resource limit or the max pods limit first? Then take your scaling needs into consideration: what are the likely minimum and maximum numbers of nodes you will want to scale out to? Use these factors to work out the correct subnet size.
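One way to reason about which limit you will hit first is a quick back-of-the-envelope calculation. The Python sketch below is a simplification: it treats the whole node as available to your pods and ignores the CPU and memory that AKS reserves for system components, so treat the results as optimistic. The function names and all the figures are hypothetical.

```python
def pods_per_node(node_cpu_m, node_mem_mib, pod_cpu_m, pod_mem_mib, max_pods=30):
    """Return (pods that fit on one node, name of the limit hit first).

    CPU is in millicores and memory in MiB. System-reserved resources
    are ignored, which is a simplifying assumption.
    """
    by_resources = min(node_cpu_m // pod_cpu_m, node_mem_mib // pod_mem_mib)
    if by_resources < max_pods:
        return by_resources, "resources"
    return max_pods, "max pods"


def nodes_needed(total_pods, per_node):
    """Nodes required to host total_pods, rounded up."""
    if per_node <= 0:
        raise ValueError("a single pod does not fit on this node size")
    return -(-total_pods // per_node)  # ceiling division


if __name__ == "__main__":
    # Hypothetical 4 vCPU / 16 GiB node running pods of 250m CPU / 512 MiB.
    fit, limit = pods_per_node(4000, 16384, 250, 512)
    print(fit, "pods per node, limited by", limit)
    print(nodes_needed(100, fit), "nodes needed for 100 pods")
```

If the answer is "resources", a larger SKU lets each node carry more pods. If it is "max pods", raising the max pods setting helps, at the cost of more pre-allocated IPs per node.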

Ingress Controller

There are now a couple of choices for the device you use in conjunction with your ingress controller. In the default scenario you deploy an Azure Load Balancer and then install your ingress controller of choice. The ingress controller then works in unison with the load balancer.

Another option now available is the Application Gateway Ingress Controller (AGIC). This ingress controller allows you to use Azure Application Gateway, a Layer 7 device, to expose your services to the internet. As you add services to your cluster it automatically updates the Application Gateway. You also get to use some of the features of AppGW, such as rewrite rules.

You can deploy this using Helm or the add-on, which is a fully managed service. The add-on provides added benefits such as automatic updates and increased support. NOTE! AGIC deployed through Helm is not supported by AKS; AGIC deployed as an AKS add-on is supported by AKS.

This link provides the steps for both approaches. Currently (at the time of writing) the only way to deploy AKS using the add-on is via Azure CLI, so you may want to continue using the Helm approach for Terraform and ARM pipeline deployments until support is added (I know there are open issues for Terraform to add support for the add-on).

AD Integration 

To make it easier to manage and control access to your AKS cluster you will likely want to enable AKS-managed Azure Active Directory integration. This integration only works with RBAC-enabled clusters. This is a simple setting to configure when you deploy, but not something you can turn on and off afterwards, so get this one right at the beginning.

If you are using Terraform you will need a block like the one below to enable RBAC and the AD integration. The admin_group_object_ids value is the list of AD group IDs that should be allowed to manage the cluster.

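# Goes inside your azurerm_kubernetes_cluster resource
role_based_access_control {
    enabled = true    # RBAC must be enabled for the AD integration
    azure_active_directory {
        managed = true    # use AKS-managed Azure AD integration
        # Placeholder value: replace with your own AD group object IDs
        admin_group_object_ids = ["123456abc-a123-a123-a123-123456abc"]
    }
}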

Conclusion

These are some of the key configurations and gotchas I have found along my path deploying AKS clusters. As you will have seen, a lot of the changes are breaking changes that require the cluster to be rebuilt, so it really is worth planning your cluster to reduce the impact on your solution in the future. I hope this article has been useful and has given you some good tips for when you deploy your own AKS environments.

If you are currently working with AKS you may find some of my other articles useful:

Azure Kubernetes Service (AKS) and Managed Identities
