Commit 40ed02f

fix: Default to server side apply and update MPI operator for NVIDIA … (#2042)
1 parent d5ddd10 commit 40ed02f

File tree: 8 files changed, +28 -40 lines
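The common thread in these diffs is switching plain `kubectl apply` to server-side apply. As a brief, hedged illustration (standard kubectl flags, not something added by this commit; `manifest.yaml` and `my-manager` are placeholders):

```sh
# Client-side apply: kubectl computes the merge locally and records it in the
# last-applied-configuration annotation on each object.
kubectl apply -f manifest.yaml

# Server-side apply: the API server performs the merge and tracks per-field
# ownership, avoiding the annotation size limit that very large CRD manifests
# can hit with client-side apply.
kubectl apply --server-side -f manifest.yaml

# If another field manager already owns a field, the apply fails with a
# conflict; ownership can be taken over deliberately when intended.
kubectl apply --server-side --force-conflicts --field-manager=my-manager -f manifest.yaml
```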

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/streetsidesoftware/cspell-cli
-    rev: v8.15.1
+    rev: v8.15.2
     hooks:
       - id: cspell
         args: [--exclude, 'ADOPTERS.md', --exclude, '.pre-commit-config.yaml', --exclude, '.gitignore', --exclude, '*.drawio', --exclude, 'mkdocs.yml', --exclude, '.helmignore', --exclude, '.github/workflows/*', --exclude, 'patterns/istio-multi-cluster/*', --exclude, 'patterns/blue-green-upgrade/*', --exclude, '/patterns/vpc-lattice/cross-cluster-pod-communication/*', --exclude, 'patterns/bottlerocket/*', --exclude, 'patterns/nvidia-gpu-efa/generate-efa-nccl-test.sh']
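For reference, a short sketch of exercising this hook locally, assuming `pre-commit` is installed (the hook id `cspell` comes from the config above):

```sh
# Install the git hooks defined in .pre-commit-config.yaml
pre-commit install

# Run only the cspell hook against the whole repository
pre-commit run cspell --all-files
```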

patterns/gitops/getting-started-argocd/README.md

Lines changed: 2 additions & 2 deletions
@@ -117,7 +117,7 @@ The output looks like the following:
 Bootstrap the addons using ArgoCD:
 
 ```shell
-kubectl apply -f bootstrap/addons.yaml
+kubectl apply --server-side -f bootstrap/addons.yaml
 ```
 
 ### Monitor GitOps Progress for Addons
@@ -188,7 +188,7 @@ echo "ArgoCD URL: https://$(kubectl get svc -n argocd argo-cd-argocd-server -o j
 Deploy a sample application located in [k8s/game-2048.yaml](k8s/game-2048.yaml) using ArgoCD:
 
 ```shell
-kubectl apply -f bootstrap/workloads.yaml
+kubectl apply --server-side -f bootstrap/workloads.yaml
 ```
 
 ### Monitor GitOps Progress for Workloads
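As a hedged follow-up to the two bootstrap commands above (plain kubectl against the ArgoCD `Application` resources; not part of this diff), sync progress can be watched with:

```sh
# List the ArgoCD Application resources created by the bootstrap manifests
kubectl get applications -n argocd

# Watch until everything reports Synced / Healthy
kubectl get applications -n argocd -w
```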

patterns/istio/README.md

Lines changed: 3 additions & 3 deletions
@@ -36,7 +36,7 @@ cluster with deployed Istio.
 for ADDON in kiali jaeger prometheus grafana
 do
 ADDON_URL="https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml"
-kubectl apply -f $ADDON_URL
+kubectl apply --server-side -f $ADDON_URL
 done
 ```
 
@@ -177,7 +177,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
 - containerPort: 5000
 EOF
 
-kubectl apply -f helloworld.yaml -n sample
+kubectl apply --server-side -f helloworld.yaml -n sample
 ```
 
 ```text
@@ -239,7 +239,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
 optional: true
 EOF
 
-kubectl apply -f sleep.yaml -n sample
+kubectl apply --server-side -f sleep.yaml -n sample
 ```
 
 ```text
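A hedged sketch for checking the addons applied in the loop above; they install into `istio-system`, which the README already references, while Kiali's service port 20001 is an assumption based on its defaults:

```sh
# Confirm the observability addons are running
kubectl get pods -n istio-system

# Same port-forward pattern the README shows for Jaeger, applied to Kiali
# (20001 is assumed to be Kiali's service port; adjust if your install differs)
kubectl port-forward svc/kiali 20001:20001 -n istio-system
```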

patterns/karpenter-mng/README.md

Lines changed: 2 additions & 2 deletions
@@ -54,13 +54,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:
 
 ```sh
-kubectl apply -f karpenter.yaml
+kubectl apply --server-side -f karpenter.yaml
 ```
 
 3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment which has the initial number replicas set to 0:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
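A hedged sketch of step 4; the deployment name `inflate` is an assumption to be taken from `example.yaml`, and `NodeClaim` assumes a v1beta1/v1 Karpenter API:

```sh
# Scale the example deployment to create pending pods
# ("inflate" is an assumed name; check example.yaml)
kubectl scale deployment/inflate --replicas=3

# Watch Karpenter create NodeClaims in response
kubectl get nodeclaims -w

# In a second terminal: see the new nodes register with the nodepool label
kubectl get nodes -l karpenter.sh/nodepool
```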

patterns/karpenter/README.md

Lines changed: 2 additions & 2 deletions
@@ -47,13 +47,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:
 
 ```sh
-kubectl apply -f karpenter.yaml
+kubectl apply --server-side -f karpenter.yaml
 ```
 
 3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment which has the initial number replicas set to 0:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
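Relatedly, a hedged way to follow the provisioning decisions is the controller log; the namespace and label below are assumptions based on common Karpenter chart defaults and may differ in this pattern (e.g. `kube-system`):

```sh
# Tail the Karpenter controller while scaling the example deployment
# (adjust -n if Karpenter runs in a different namespace in your cluster)
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
```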

patterns/ml-container-cache/README.md

Lines changed: 2 additions & 2 deletions
@@ -81,13 +81,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 4. Once the EKS cluster and node group have been provisioned, you can deploy the provided example pod that will use a cached image to verify the time it takes for the pod to reach a ready state.
 
 ```sh
-kubectl apply -f pod-cached.yaml
+kubectl apply --server-side -f pod-cached.yaml
 ```
 
 You can contrast this with the time it takes for a pod that is not cached on a node by using the provided `pod-uncached.yaml` file. This works by simply using a pod that doesn't have a toleration for nodes that contain NVIDIA GPUs, which is where the cached images are provided in this example.
 
 ```sh
-kubectl apply -f pod-uncached.yaml
+kubectl apply --server-side -f pod-uncached.yaml
 ```
 
 You can also do the same steps above but using the small, utility CLI [ktime](https://github.com/clowdhaus/ktime) which can either collect the pod events to measure the time duration to reach a ready state, or it can deploy a pod manifest and return the same:
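Without assuming `ktime`'s flags, readiness time can also be compared with plain kubectl; a hedged sketch, where the pod names are assumed to match the manifest file names:

```sh
# Time how long each pod takes to become Ready
# (pod names are assumptions; confirm them in the manifests)
time kubectl wait --for=condition=Ready pod/pod-cached --timeout=300s
time kubectl wait --for=condition=Ready pod/pod-uncached --timeout=300s

# The Pulling/Pulled/Started event timestamps show where the time went
kubectl describe pod pod-uncached
```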

patterns/nvidia-gpu-efa/README.md

Lines changed: 14 additions & 26 deletions
@@ -36,8 +36,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 ## Validate
 
 !!! note
-
-    Desired instance type can be specified in [eks.tf](eks.tf#L36).
+    Desired instance type can be specified in [eks.tf](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d5ddd10afef9b4fd3e0cbba865645f0f522992ac/patterns/nvidia-gpu-efa/eks.tf#L38).
     Values shown below will change based on the instance type selected (i.e. - `p5.48xlarge` has 8 GPUs and 32 EFA interfaces).
     A list of EFA-enabled instance types is available [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
     If you are using an on-demand capacity reservation (ODCR) for your instance type, please uncomment the `capacity_reservation_specification` block in `eks.tf`
@@ -66,36 +65,25 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 To deploy the MPI operator execute the following:
 
 ```sh
-kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
-```
-
-```text
-namespace/mpi-operator created
-customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
-serviceaccount/mpi-operator created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
-clusterrole.rbac.authorization.k8s.io/mpi-operator created
-clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
-deployment.apps/mpi-operator created
-```
-
-In addition to deploying the operator, please apply a patch to the mpi-operator clusterrole
-to allow the mpi-operator service account access to `leases` resources in the `coordination.k8s.io` apiGroup.
-
-```sh
-kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
+kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
 ```text
-clusterrole.rbac.authorization.k8s.io/mpi-operator configured
+namespace/mpi-operator serverside-applied
+customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org serverside-applied
+serviceaccount/mpi-operator serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view serverside-applied
+clusterrole.rbac.authorization.k8s.io/mpi-operator serverside-applied
+clusterrolebinding.rbac.authorization.k8s.io/mpi-operator serverside-applied
+deployment.apps/mpi-operator serverside-applied
 ```
 
 3. EFA info test
 
 This test prints a list of available EFA interfaces by using the `/opt/amazon/efa/bin/fi_info` utility.
-The script [generate-efa-info-test.sh](generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPU's per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
+The script [generate-efa-info-test.sh](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/patterns/nvidia-gpu-efa/generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPU's per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
 
 `NUM_WORKERS` - number of nodes you want to run the test on
 `GPU_PER_WORKER` - number of GPUs available on each node
@@ -108,7 +96,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 To start the test apply the generated manifest to the cluster:
 
 ```sh
-kubectl apply -f ./efa-info-test.yaml
+kubectl apply --server-side -f ./efa-info-test.yaml
 ```
 
 ```text
@@ -186,7 +174,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 This script creates a file named `efa-nccl-test.yaml`. Apply the manifest to start the EFA nccl test.
 
 ```sh
-kubectl apply -f ./efa-nccl-test.yaml
+kubectl apply --server-side -f ./efa-nccl-test.yaml
 
 ```text
 mpijob.kubeflow.org/efa-nccl-test created
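A hedged sketch for verifying the operator and the generated jobs after the server-side applies above; the `mpi-operator` namespace comes from the manifest output shown in the diff, while the label selector is an assumption based on common mpi-operator conventions:

```sh
# Confirm the MPI operator components are running
kubectl get pods -n mpi-operator

# MPIJob objects created from the generated manifests
kubectl get mpijobs

# Launcher/worker pods for the EFA info test (label selector is an assumption)
kubectl get pods -l training.kubeflow.org/job-name=efa-info-test
```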

patterns/wireguard-with-cilium/README.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 1. Deploy the example pods:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 ```text
@@ -100,7 +100,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 
 ```sh
 kubectl create ns cilium-test
-kubectl apply -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
+kubectl apply --server-side -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
 ```
 
 ```text
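A hedged sketch for watching the connectivity check applied above; the `cilium-test` namespace is taken directly from the commands in the diff:

```sh
# All connectivity-check deployments should eventually be Running with no restarts
kubectl get pods -n cilium-test

# Deployment-level view of the same checks
kubectl get deploy -n cilium-test
```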
