Commit 40ed02f

fix: Default to server side apply and update MPI operator for NVIDIA … (#2042)
1 parent d5ddd10 commit 40ed02f

File tree: 8 files changed, +28 -40 lines
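The common thread in these diffs is switching plain `kubectl apply` to server-side apply. As a brief, hedged illustration (standard kubectl flags, not something added by this commit; `manifest.yaml` and `my-manager` are placeholders):

```sh
# Client-side apply: kubectl computes the merge locally and records it in the
# last-applied-configuration annotation on each object.
kubectl apply -f manifest.yaml

# Server-side apply: the API server performs the merge and tracks per-field
# ownership, avoiding the annotation size limit that very large CRD manifests
# can hit with client-side apply.
kubectl apply --server-side -f manifest.yaml

# If another field manager already owns a field, the apply fails with a
# conflict; ownership can be taken over deliberately when intended.
kubectl apply --server-side --force-conflicts --field-manager=my-manager -f manifest.yaml
```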

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/streetsidesoftware/cspell-cli
-    rev: v8.15.1
+    rev: v8.15.2
     hooks:
       - id: cspell
         args: [--exclude, 'ADOPTERS.md', --exclude, '.pre-commit-config.yaml', --exclude, '.gitignore', --exclude, '*.drawio', --exclude, 'mkdocs.yml', --exclude, '.helmignore', --exclude, '.github/workflows/*', --exclude, 'patterns/istio-multi-cluster/*', --exclude, 'patterns/blue-green-upgrade/*', --exclude, '/patterns/vpc-lattice/cross-cluster-pod-communication/*', --exclude, 'patterns/bottlerocket/*', --exclude, 'patterns/nvidia-gpu-efa/generate-efa-nccl-test.sh']
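For reference, a short sketch of exercising this hook locally, assuming `pre-commit` is installed (the hook id `cspell` comes from the config above):

```sh
# Install the git hooks defined in .pre-commit-config.yaml
pre-commit install

# Run only the cspell hook against the whole repository
pre-commit run cspell --all-files
```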

patterns/gitops/getting-started-argocd/README.md

Lines changed: 2 additions & 2 deletions
@@ -117,7 +117,7 @@ The output looks like the following:
 Bootstrap the addons using ArgoCD:
 
 ```shell
-kubectl apply -f bootstrap/addons.yaml
+kubectl apply --server-side -f bootstrap/addons.yaml
 ```
 
 ### Monitor GitOps Progress for Addons
@@ -188,7 +188,7 @@ echo "ArgoCD URL: https://$(kubectl get svc -n argocd argo-cd-argocd-server -o j
 Deploy a sample application located in [k8s/game-2048.yaml](k8s/game-2048.yaml) using ArgoCD:
 
 ```shell
-kubectl apply -f bootstrap/workloads.yaml
+kubectl apply --server-side -f bootstrap/workloads.yaml
 ```
 
 ### Monitor GitOps Progress for Workloads
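As a hedged follow-up to the two bootstrap commands above (plain kubectl against the ArgoCD `Application` resources; not part of this diff), sync progress can be watched with:

```sh
# List the ArgoCD Application resources created by the bootstrap manifests
kubectl get applications -n argocd

# Watch until everything reports Synced / Healthy
kubectl get applications -n argocd -w
```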

patterns/istio/README.md

Lines changed: 3 additions & 3 deletions
@@ -36,7 +36,7 @@ cluster with deployed Istio.
 for ADDON in kiali jaeger prometheus grafana
 do
 ADDON_URL="https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml"
-kubectl apply -f $ADDON_URL
+kubectl apply --server-side -f $ADDON_URL
 done
 ```
 
@@ -177,7 +177,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
 - containerPort: 5000
 EOF
 
-kubectl apply -f helloworld.yaml -n sample
+kubectl apply --server-side -f helloworld.yaml -n sample
 ```
 
 ```text
@@ -239,7 +239,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
 optional: true
 EOF
 
-kubectl apply -f sleep.yaml -n sample
+kubectl apply --server-side -f sleep.yaml -n sample
 ```
 
 ```text
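A hedged sketch for checking the addons applied in the loop above; they install into `istio-system`, which the README already references, while Kiali's service port 20001 is an assumption based on its defaults:

```sh
# Confirm the observability addons are running
kubectl get pods -n istio-system

# Same port-forward pattern the README shows for Jaeger, applied to Kiali
# (20001 is assumed to be Kiali's service port; adjust if your install differs)
kubectl port-forward svc/kiali 20001:20001 -n istio-system
```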

patterns/karpenter-mng/README.md

Lines changed: 2 additions & 2 deletions
@@ -54,13 +54,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:
 
 ```sh
-kubectl apply -f karpenter.yaml
+kubectl apply --server-side -f karpenter.yaml
 ```
 
 3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment which has the initial number replicas set to 0:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
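A hedged sketch of step 4; the deployment name `inflate` is an assumption to be taken from `example.yaml`, and `NodeClaim` assumes a v1beta1/v1 Karpenter API:

```sh
# Scale the example deployment to create pending pods
# ("inflate" is an assumed name; check example.yaml)
kubectl scale deployment/inflate --replicas=3

# Watch Karpenter create NodeClaims in response
kubectl get nodeclaims -w

# In a second terminal: see the new nodes register with the nodepool label
kubectl get nodes -l karpenter.sh/nodepool
```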

patterns/karpenter/README.md

Lines changed: 2 additions & 2 deletions
@@ -47,13 +47,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:
 
 ```sh
-kubectl apply -f karpenter.yaml
+kubectl apply --server-side -f karpenter.yaml
 ```
 
 3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment which has the initial number replicas set to 0:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
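Relatedly, a hedged way to follow the provisioning decisions is the controller log; the namespace and label below are assumptions based on common Karpenter chart defaults and may differ in this pattern (e.g. `kube-system`):

```sh
# Tail the Karpenter controller while scaling the example deployment
# (adjust -n if Karpenter runs in a different namespace in your cluster)
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
```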

patterns/ml-container-cache/README.md

Lines changed: 2 additions & 2 deletions
@@ -81,13 +81,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 4. Once the EKS cluster and node group have been provisioned, you can deploy the provided example pod that will use a cached image to verify the time it takes for the pod to reach a ready state.
 
 ```sh
-kubectl apply -f pod-cached.yaml
+kubectl apply --server-side -f pod-cached.yaml
 ```
 
 You can contrast this with the time it takes for a pod that is not cached on a node by using the provided `pod-uncached.yaml` file. This works by simply using a pod that doesn't have a toleration for nodes that contain NVIDIA GPUs, which is where the cached images are provided in this example.
 
 ```sh
-kubectl apply -f pod-uncached.yaml
+kubectl apply --server-side -f pod-uncached.yaml
 ```
 
 You can also do the same steps above but using the small, utility CLI [ktime](https://github.com/clowdhaus/ktime) which can either collect the pod events to measure the time duration to reach a ready state, or it can deploy a pod manifest and return the same:
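Without assuming `ktime`'s flags, readiness time can also be compared with plain kubectl; a hedged sketch, where the pod names are assumed to match the manifest file names:

```sh
# Time how long each pod takes to become Ready
# (pod names are assumptions; confirm them in the manifests)
time kubectl wait --for=condition=Ready pod/pod-cached --timeout=300s
time kubectl wait --for=condition=Ready pod/pod-uncached --timeout=300s

# The Pulling/Pulled/Started event timestamps show where the time went
kubectl describe pod pod-uncached
```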

patterns/nvidia-gpu-efa/README.md

Lines changed: 14 additions & 26 deletions
@@ -36,8 +36,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 ## Validate
 
 !!! note
-
-    Desired instance type can be specified in [eks.tf](eks.tf#L36).
+    Desired instance type can be specified in [eks.tf](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d5ddd10afef9b4fd3e0cbba865645f0f522992ac/patterns/nvidia-gpu-efa/eks.tf#L38).
     Values shown below will change based on the instance type selected (i.e. - `p5.48xlarge` has 8 GPUs and 32 EFA interfaces).
     A list of EFA-enabled instance types is available [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
     If you are using an on-demand capacity reservation (ODCR) for your instance type, please uncomment the `capacity_reservation_specification` block in `eks.tf`
@@ -66,36 +65,25 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 To deploy the MPI operator execute the following:
 
 ```sh
-kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
-```
-
-```text
-namespace/mpi-operator created
-customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
-serviceaccount/mpi-operator created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
-clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
-clusterrole.rbac.authorization.k8s.io/mpi-operator created
-clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
-deployment.apps/mpi-operator created
-```
-
-In addition to deploying the operator, please apply a patch to the mpi-operator clusterrole
-to allow the mpi-operator service account access to `leases` resources in the `coordination.k8s.io` apiGroup.
-
-```sh
-kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
+kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
 ```text
-clusterrole.rbac.authorization.k8s.io/mpi-operator configured
+namespace/mpi-operator serverside-applied
+customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org serverside-applied
+serviceaccount/mpi-operator serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit serverside-applied
+clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view serverside-applied
+clusterrole.rbac.authorization.k8s.io/mpi-operator serverside-applied
+clusterrolebinding.rbac.authorization.k8s.io/mpi-operator serverside-applied
+deployment.apps/mpi-operator serverside-applied
 ```
 
 3. EFA info test
 
 This test prints a list of available EFA interfaces by using the `/opt/amazon/efa/bin/fi_info` utility.
-The script [generate-efa-info-test.sh](generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPU's per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
+The script [generate-efa-info-test.sh](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/patterns/nvidia-gpu-efa/generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPU's per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
 
 `NUM_WORKERS` - number of nodes you want to run the test on
 `GPU_PER_WORKER` - number of GPUs available on each node
@@ -108,7 +96,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 To start the test apply the generated manifest to the cluster:
 
 ```sh
-kubectl apply -f ./efa-info-test.yaml
+kubectl apply --server-side -f ./efa-info-test.yaml
 ```
 
 ```text
@@ -186,7 +174,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 This script creates a file named `efa-nccl-test.yaml`. Apply the manifest to start the EFA nccl test.
 
 ```sh
-kubectl apply -f ./efa-nccl-test.yaml
+kubectl apply --server-side -f ./efa-nccl-test.yaml
 
 ```text
 mpijob.kubeflow.org/efa-nccl-test created
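A hedged sketch for verifying the operator and the generated jobs after the server-side applies above; the `mpi-operator` namespace comes from the manifest output shown in the diff, while the label selector is an assumption based on common mpi-operator conventions:

```sh
# Confirm the MPI operator components are running
kubectl get pods -n mpi-operator

# MPIJob objects created from the generated manifests
kubectl get mpijobs

# Launcher/worker pods for the EFA info test (label selector is an assumption)
kubectl get pods -l training.kubeflow.org/job-name=efa-info-test
```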

patterns/wireguard-with-cilium/README.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 1. Deploy the example pods:
 
 ```sh
-kubectl apply -f example.yaml
+kubectl apply --server-side -f example.yaml
 ```
 
 ```text
@@ -100,7 +100,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
 
 ```sh
 kubectl create ns cilium-test
-kubectl apply -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
+kubectl apply --server-side -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
 ```
 
 ```text
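A hedged sketch for watching the connectivity check applied above; the `cilium-test` namespace is taken directly from the commands in the diff:

```sh
# All connectivity-check deployments should eventually be Running with no restarts
kubectl get pods -n cilium-test

# Deployment-level view of the same checks
kubectl get deploy -n cilium-test
```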
