---
title: "Day 2 Kubernetes: Migrate EKS Node Groups with Zero Downtime"
h1: "Day 2 Kubernetes: Migrating EKS Node Groups with Zero Downtime"
authors: ["mike-metral"]
tags: ["Kubernetes", "eks"]
meta_desc: "Use Pulumi's for Day 2 Kubernetes. Spin up a new EKS cluster, add one more node groups, and migrate the workloads with zero downtime using code and kubectl."
date: "2019-07-23"
meta_image: "eks-migrate-nodegroups.png"
---
Managed Kubernetes offerings greatly reduce the overhead of administering
Kubernetes. However, the cluster itself is only one of the components under
management: application lifecycles remain self-driven tasks that vary by
workload.
In Kubernetes, node groups are a useful mechanism for creating pools of resources that can enforce
scheduling requirements. They also provide a utility for shifting
workloads around during cluster management and updates.
In this post, we'll see how to use Pulumi for Day 2 Kubernetes administration.
We'll spin up a new EKS cluster with two node groups and a given workload.
Then we'll add one more node group with an updated configuration, and migrate the workload
over to it with zero downtime using code and `kubectl`.
<!--more-->
[View the full tutorial and code.][eks-nodegroup-tutorial]
![EKS Migrate Node Groups](eks-migrate-nodegroups.png)
## Create an EKS Cluster and Deploy the Workload
For our initial update, we will configure and launch an EKS cluster using
`v1.13` of Kubernetes, with the cluster's infrastructure dependencies
(such as VPC and IAM) defined using [Crosswalk for AWS][crosswalk-aws].
Crosswalk lets us leverage Pulumi's libraries of common AWS infrastructure
to simplify cloud resource creation and management while getting
best practices as defaults.
We'll also create and attach the following node groups to the cluster:
* A standard `t2.medium` worker node group using the recent `v1.13.7` worker [AMI][eks-amis], for general-purpose workloads such as the [`EchoServer`][echoserver] &mdash; a simple app that echoes client request headers.
* A 2xlarge `t3.2xlarge` worker node group using an older `v1.12.7` worker [AMI][eks-amis], for use by larger, intensive workloads such as the [NGINX Ingress Controller][ingress-nginx].
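Condensed to its essentials, the Pulumi program for this step might look roughly like the sketch below. It's a trimmed illustration using the `@pulumi/eks` package: the full tutorial also provisions the VPC through Crosswalk for AWS and attaches per-node-group IAM instance profiles, and the resource names, instance counts, and AMI ID here are placeholders.

```typescript
import * as eks from "@pulumi/eks";

// Create an EKS cluster on Kubernetes v1.13, managing every node group explicitly.
const cluster = new eks.Cluster("eks-demo", {
    version: "1.13",
    skipDefaultNodeGroup: true,
});

// Standard node group for general-purpose workloads such as the echoserver.
const ngStandard = cluster.createNodeGroup("ng-standard", {
    instanceType: "t2.medium",
    desiredCapacity: 3,
    minSize: 3,
    maxSize: 3,
    labels: { workload: "general" },
    // NOTE: the full tutorial also passes a dedicated IAM instanceProfile here.
});

// 2xlarge node group, pinned to an older v1.12.7 worker AMI, for the
// NGINX Ingress Controller.
const ng2xlarge = cluster.createNodeGroup("ng-2xlarge", {
    instanceType: "t3.2xlarge",
    desiredCapacity: 3,
    minSize: 3,
    maxSize: 3,
    amiId: "ami-REPLACE-WITH-v1.12.7-WORKER-AMI", // placeholder EKS worker AMI ID
    labels: { workload: "ingress" },
});

export const kubeconfig = cluster.kubeconfig;
```
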
With the cluster successfully created and the node groups available, Pulumi
will deploy the workload: the [`EchoServer`][echoserver], and the [NGINX Ingress Controller][ingress-nginx] that will manage its ingress.
The `echoserver` will land on the standard node group, while NGINX is
configured to target the `2xlarge` node group specifically.
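Continuing the sketch above, deploying the workload from the same program could look roughly like the following. It's simplified relative to the tutorial, which also creates a Service, an Ingress, and the NGINX Ingress Controller itself; the `appLabels` name and image tag are illustrative.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Deploy into the cluster we just created by using its Kubernetes provider.
const appLabels = { app: "echoserver" };
const echoserver = new k8s.apps.v1.Deployment("echoserver", {
    spec: {
        replicas: 3,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{
                    name: "echoserver",
                    image: "gcr.io/google_containers/echoserver:1.4", // version is illustrative
                    ports: [{ containerPort: 8080 }],
                }],
            },
        },
    },
}, { provider: cluster.provider });
```
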
Once the workload is deployed, we can validate it is up and running by accessing
the `echoserver` behind the NGINX endpoint using `curl`:
![Kubernetes Cluster Workload](https://www.pulumi.com/uploads/content/blog/day-2-kubernetes-migrating-eks-nodegroups-with-zero-downtime/k8s-cluster-workload.gif)
> **Note**: You can open the <a href="https://www.pulumi.com/uploads/content/blog/day-2-kubernetes-migrating-eks-nodegroups-with-zero-downtime/k8s-cluster-workload.gif" target="_blank">GIF</a> for a maximized view.
## The Great Migration
After our initial deployment, let's say we decide to update the node group used by NGINX.
To do so, we'll move NGINX from the `2xlarge` node group over to a new, `4xlarge` worker
node group with an updated AMI, a different instance type, and a different desired instance count.
As we migrate NGINX over to the `4xlarge` and decommission the `2xlarge` node
group in the next steps, we'll also actively load test the endpoint of the
`echoserver` to ensure that we are not losing requests throughout the migration.
### Step 1: Create the new `4xlarge` node group.
Create the new, `4xlarge` node group by defining it and running an update.
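Continuing the earlier sketch, the addition might look like this (instance count and AMI ID are again placeholders; the real values come from the tutorial):

```typescript
// New 4xlarge node group: larger instances, a newer v1.13.7 worker AMI, and a
// different desired instance count.
const ng4xlarge = cluster.createNodeGroup("ng-4xlarge", {
    instanceType: "c5.4xlarge",
    desiredCapacity: 5,
    minSize: 5,
    maxSize: 5,
    amiId: "ami-REPLACE-WITH-v1.13.7-WORKER-AMI", // placeholder EKS worker AMI ID
    labels: { workload: "ingress" },
});
```
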
![4xlarge Node Group](new-4xlarge.png)
### Step 2: Migrate NGINX to the `4xlarge` node group.
With the `4xlarge` node group created, we'll migrate the NGINX service away
from the `2xlarge` node group over to the `4xlarge` node group by changing
its node selector scheduling terms.
This change updates the NGINX Deployment spec to require the use of
`c5.4xlarge` nodes during scheduling, and forces a rolling update over to the
`4xlarge` node group.
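Concretely, the scheduling term ends up looking something like the sketch below. The tutorial plumbs this value through its own NGINX deployment helper, so the surrounding code differs; `nginxAffinity` is just an illustrative name.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Require scheduling onto c5.4xlarge nodes; changing this value from
// t3.2xlarge triggers a rolling update onto the new node group.
const nginxAffinity: k8s.types.input.core.v1.Affinity = {
    nodeAffinity: {
        requiredDuringSchedulingIgnoredDuringExecution: {
            nodeSelectorTerms: [{
                matchExpressions: [{
                    key: "beta.kubernetes.io/instance-type",
                    operator: "In",
                    values: ["c5.4xlarge"], // previously ["t3.2xlarge"]
                }],
            }],
        },
    },
};
```
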
NGINX is able to successfully migrate across node groups because it is
configured with HA settings, [spread-type scheduling predicates][tutorial-ha-refs],
and can gracefully terminate within the Kubernetes [Pod lifecycle][pod-lifecycle].
![Migrate NGINX 4xlarge](migrate-nginx-4xlarge.png)
### Step 3: Decommission the `2xlarge` node group.
Once NGINX has been completely migrated to the `4xlarge` node group, we can
begin decommissioning the original `2xlarge` node group which is no longer in use.
Decommissioning the node group involves:
* Draining the Kubernetes nodes.
* Deleting the Kubernetes nodes from the API server.
* Scaling down the Auto Scaling Group to `0`.
* Deleting the node group.
Set up `kubectl` by using the `kubeconfig` from the stack output.
```bash
$ pulumi stack output kubeconfig > kubeconfig.json
$ export KUBECONFIG=$PWD/kubeconfig.json
```
To drain the `2xlarge` node group, we'll use `kubectl drain`:
```bash
for node in $(kubectl get nodes -l beta.kubernetes.io/instance-type=t3.2xlarge -o=name); do
    kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
```
After draining has completed, use `kubectl delete node` to delete the nodes from the API server.
```bash
for node in $(kubectl get nodes -l beta.kubernetes.io/instance-type=t3.2xlarge -o=name); do
    kubectl delete "$node";
done
```
Scale the node group's Auto Scaling Group down completely by setting
its `desiredCapacity` to `0` and running an update.
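In the Pulumi program, this is simply an edit to the existing node group definition from the earlier sketch, roughly:

```typescript
// The retired 2xlarge node group, scaled down to zero instances ahead of deletion.
const ng2xlarge = cluster.createNodeGroup("ng-2xlarge", {
    instanceType: "t3.2xlarge",
    desiredCapacity: 0, // was 3
    minSize: 0,
    maxSize: 3,
    amiId: "ami-REPLACE-WITH-v1.12.7-WORKER-AMI", // placeholder EKS worker AMI ID
    labels: { workload: "ingress" },
});
```
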
![Scale Down ng 2xlarge](scale-down-ng-2xlarge.png)
Once the Auto Scaling Group has scaled down, we can delete the node group from
AWS and the Pulumi program:
![Remove ng 2xlarge](remove-ng-2xlarge.png)
## Summary
In this post, we stood up an EKS cluster with two node groups, and deployed the
`echoserver` and NGINX workloads. We then created a new, updated node group and
migrated NGINX over to it.
We achieved this node group migration, and the decommissioning of the original
node group, with zero downtime to our apps, as verified by load testing throughout the process.
![EKS Migration](https://www.pulumi.com/uploads/content/blog/day-2-kubernetes-migrating-eks-nodegroups-with-zero-downtime/eks-migration.gif)
> **Note**: You can open the <a href="https://www.pulumi.com/uploads/content/blog/day-2-kubernetes-migrating-eks-nodegroups-with-zero-downtime/eks-migration.gif" target="_blank">GIF</a> for a maximized view.
## Learn More
If you'd like to learn about Pulumi and how to manage your
infrastructure and Kubernetes through code, [get started today](/docs/get-started/). Pulumi is open source and free to
use.
For further examples on how to use Pulumi to create Kubernetes
clusters, or deploy workloads to a cluster, check out the rest of the
[Kubernetes tutorials](/registry/packages/kubernetes/how-to-guides/).
As always, you can check out our code on
[GitHub](https://github.com/pulumi), follow us on
[Twitter](https://twitter.com/pulumicorp), subscribe to our [YouTube
channel](https://www.youtube.com/channel/UC2Dhyn4Ev52YSbcpfnfP0Mw), or
join our [Community Slack](https://slack.pulumi.com/) channel if you have
any questions, need support, or just want to say hello.
If you'd like to chat with our team, or get hands-on assistance with
migrating your existing configuration code to Pulumi, please don't hesitate to [drop us a line](/contact/).
[eks-amis]: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html
[ingress-nginx]: https://github.com/kubernetes/ingress-nginx
[echoserver]: https://github.com/kubernetes-retired/contrib/blob/master/ingress/echoheaders/echo-app.yaml
[pod-lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[eks-nodegroup-tutorial]: /registry/packages/kubernetes/how-to-guides/eks-migrate-nodegroups/
[crosswalk-aws]: /docs/clouds/aws/guides/
[tutorial-ha-refs]: /registry/packages/kubernetes/how-to-guides/eks-migrate-nodegroups/