Commit 0045484

Update all NeMo services to v25.06 and update Customizer config API
Signed-off-by: Shiva Krishna, Merla <[email protected]>
1 parent dd95ee9 commit 0045484

25 files changed, +1209 -218 lines changed

api/apps/v1alpha1/common_types.go

Lines changed: 11 additions & 0 deletions
@@ -218,6 +218,17 @@ type NGCSecret struct {
 	Key string `json:"key"`
 }
 
+// HFSecret represents the secret and key details for HuggingFace.
+type HFSecret struct {
+	// Name of the Kubernetes secret containing HF_TOKEN key
+	// +kubebuilder:validation:MinLength=1
+	Name string `json:"name"`
+
+	// Key in the key containing the actual token value
+	// +kubebuilder:default:="HF_TOKEN"
+	Key string `json:"key"`
+}
+
 // PersistentVolumeClaim defines the attributes of PVC.
 // +kubebuilder:validation:XValidation:rule="!has(self.create) || !self.create || (has(self.size) && self.size != \"\")", message="size is required for pvc creation"
 // +kubebuilder:validation:XValidation:rule="!has(self.create) || !self.create || (has(self.volumeAccessMode) && self.volumeAccessMode != \"\")", message="volumeAccessMode is required for pvc creation"
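The two kubebuilder markers on `HFSecret` (`MinLength=1` on `Name`, default `HF_TOKEN` on `Key`) are normally enforced by the API server from the generated CRD schema. The sketch below only illustrates the same semantics in plain Go. `withDefaults`, `validate`, and `defaultHFTokenKey` are hypothetical names introduced here and are not part of the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// HFSecret mirrors the type added in this commit (markers omitted).
type HFSecret struct {
	Name string `json:"name"`
	Key  string `json:"key"`
}

// defaultHFTokenKey matches the +kubebuilder:default marker on Key.
// Hypothetical constant, not defined in the commit.
const defaultHFTokenKey = "HF_TOKEN"

// withDefaults returns a copy of s with Key defaulted when it is empty.
// Hypothetical helper for illustration only.
func withDefaults(s HFSecret) HFSecret {
	if s.Key == "" {
		s.Key = defaultHFTokenKey
	}
	return s
}

// validate mirrors the MinLength=1 marker on Name.
// Hypothetical helper for illustration only.
func validate(s HFSecret) error {
	if len(s.Name) < 1 {
		return errors.New("hfSecret.name must not be empty")
	}
	return nil
}

func main() {
	s := withDefaults(HFSecret{Name: "hf-secret"})
	if err := validate(s); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("secret=%s key=%s\n", s.Name, s.Key)
}
```

Keeping defaulting and validation as separate steps follows the order in which a CRD schema is applied: defaults are filled in first, then the object is validated.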

api/apps/v1alpha1/nemo_customizer_types.go

Lines changed: 4 additions & 1 deletion
@@ -192,7 +192,10 @@ type ModelDownloadJobsConfig struct {
 	ImagePullPolicy string `json:"imagePullPolicy,omitempty"`
 
 	// NGCSecret is the secret containing the NGC API key
-	NGCSecret NGCSecret `json:"ngcAPISecret"`
+	NGCSecret NGCSecret `json:"ngcAPISecret,omitempty"`
+
+	// HFSecret is the secret containing the HF_TOKEN key
+	HFSecret HFSecret `json:"hfSecret,omitempty"`
 
 	// Optional security context for the job pods
 	SecurityContext *corev1.PodSecurityContext `json:"securityContext,omitempty"`
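Adding `omitempty` to `ngcAPISecret` goes together with the CRD change in this commit that drops `ngcAPISecret` from the `required` list. One detail to keep in mind: in Go's `encoding/json`, `omitempty` has no effect on non-pointer struct fields, so an unset `NGCSecret` value is still serialized as an object with empty strings. The sketch below demonstrates this with stand-in types. `valueConfig`, `pointerConfig`, and `marshal` are hypothetical names used only for the comparison.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NGCSecret mirrors the shape of the existing type in common_types.go.
type NGCSecret struct {
	Name string `json:"name"`
	Key  string `json:"key"`
}

// valueConfig embeds the secret by value, as the commit does.
// Hypothetical stand-in for ModelDownloadJobsConfig.
type valueConfig struct {
	NGCSecret NGCSecret `json:"ngcAPISecret,omitempty"`
}

// pointerConfig uses a pointer, shown only for comparison.
// This variant is not what the commit implements.
type pointerConfig struct {
	NGCSecret *NGCSecret `json:"ngcAPISecret,omitempty"`
}

// marshal is a small helper that returns the JSON encoding as a string.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// A zero-value struct field is still emitted despite omitempty.
	fmt.Println(marshal(valueConfig{})) // {"ngcAPISecret":{"name":"","key":""}}
	// A nil pointer field is omitted.
	fmt.Println(marshal(pointerConfig{})) // {}
}
```

Code that consumes `ModelDownloadJobsConfig` may therefore want to treat an `NGCSecret` or `HFSecret` with an empty `Name` as "not configured" rather than relying on the field being absent.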

api/apps/v1alpha1/nemo_evaluator_types.go

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ type NemoEvaluatorSpec struct {
 	EvaluationImages EvaluationImages `json:"evaluationImages"`
 }
 
-// EvaluationImages for different evaluation targets
+// EvaluationImages for different evaluation targets.
 type EvaluationImages struct {
 	BigcodeEvalHarness string `json:"bigcodeEvalHarness,omitempty"`
 	LmEvalHarness string `json:"lmEvalHarness,omitempty"`

api/apps/v1alpha1/zz_generated.deepcopy.go

Lines changed: 16 additions & 0 deletions
Some generated files are not rendered by default.

bundle/manifests/apps.nvidia.com_nemocustomizers.yaml

Lines changed: 16 additions & 1 deletion
@@ -419,6 +419,22 @@ spec:
               modelDownloadJobs:
                 description: Model download job configuration
                 properties:
+                  hfSecret:
+                    description: HFSecret is the secret containing the HF_TOKEN key
+                    properties:
+                      key:
+                        default: HF_TOKEN
+                        description: Key in the key containing the actual token value
+                        type: string
+                      name:
+                        description: Name of the Kubernetes secret containing HF_TOKEN
+                          key
+                        minLength: 1
+                        type: string
+                    required:
+                    - key
+                    - name
+                    type: object
                   image:
                     description: Docker image used for model download jobs
                     minLength: 1
@@ -664,7 +680,6 @@ spec:
                     type: integer
                 required:
                 - image
-                - ngcAPISecret
                 - pollIntervalSeconds
                 - ttlSecondsAfterFinished
                 type: object

config/crd/bases/apps.nvidia.com_nemocustomizers.yaml

Lines changed: 16 additions & 1 deletion
@@ -419,6 +419,22 @@ spec:
               modelDownloadJobs:
                 description: Model download job configuration
                 properties:
+                  hfSecret:
+                    description: HFSecret is the secret containing the HF_TOKEN key
+                    properties:
+                      key:
+                        default: HF_TOKEN
+                        description: Key in the key containing the actual token value
+                        type: string
+                      name:
+                        description: Name of the Kubernetes secret containing HF_TOKEN
+                          key
+                        minLength: 1
+                        type: string
+                    required:
+                    - key
+                    - name
+                    type: object
                   image:
                     description: Docker image used for model download jobs
                     minLength: 1
@@ -664,7 +680,6 @@ spec:
                     type: integer
                 required:
                 - image
-                - ngcAPISecret
                 - pollIntervalSeconds
                 - ttlSecondsAfterFinished
                 type: object

config/samples/nemo/25.04/README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# NeMo Custom Resources
+
+These CRs are designed to deploy NeMo microservices using the NIM Operator.
+
+## Compatible NIM Operator Version
+
+- **NIM Operator v2.0.0**
+
+> Using these CRs with any other version may lead to validation or runtime errors.
+
+## Notes
+
+- The CR schema and fields in this version match the capabilities of NIM Operator v2.0.0.
+
+## Upgrade Notes
+
+If upgrading from a previous NeMo service version (e.g., `25.04`) using the existing operator version:
+- Check for renamed or deprecated fields.
+- Review updated model config parameters.
+- Revalidate against the new CR using:
+
+```bash
+kubectl apply --dry-run=server -f apps_v1alpha1_nemodatastore.yaml \
+-f apps_v1alpha1_nemocustomizer.yaml \
+-f apps_v1alpha1_nemoentitystore.yaml \
+-f apps_v1alpha1_nemoguardrails.yaml \
+-f apps_v1alpha1_nemoevaluator.yaml
+```
+
+```text
+nemodatastore.apps.nvidia.com/nemodatastore-sample created (server dry run)
+nemocustomizer.apps.nvidia.com/nemocustomizer-sample created (server dry run)
+nemoentitystore.apps.nvidia.com/nemoentitystore-sample created (server dry run)
+nemoguardrail.apps.nvidia.com/nemoguardrails-sample configured (server dry run)
+nemoevaluator.apps.nvidia.com/nemoevaluator-sample created (server dry run)
+```
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NemoCustomizer
+metadata:
+  name: nemocustomizer-sample
+  namespace: nemo
+spec:
+  # Scheduler configuration for training jobs (volcano (default))
+  scheduler:
+    type: "volcano"
+  # Weights & Biases configuration for experiment tracking
+  wandb:
+    secretName: wandb-secret # Kubernetes secret that stores WANDB_API_KEY and optionally encryption key
+    apiKeyKey: apiKey # Key in the secret that holds the W&B API key
+    encryptionKey: encryptionKey # Key in the secret that holds optional encryption key
+  # OpenTelemetry tracing configuration
+  otel:
+    enabled: true
+    exporterOtlpEndpoint: http://customizer-otel-opentelemetry-collector.nemo.svc.cluster.local:4317
+  # PostgreSQL database connection configuration
+  databaseConfig:
+    credentials:
+      user: ncsuser # Database username
+      secretName: customizer-pg-existing-secret # Secret containing password
+      passwordKey: password # Key inside secret that contains the password
+    host: customizer-pg-postgresql.nemo.svc.cluster.local
+    port: 5432
+    databaseName: ncsdb
+  # Customizer API service exposure settings
+  expose:
+    service:
+      type: ClusterIP
+      port: 8000
+  # Global image pull settings used in various subcomponents
+  image:
+    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
+    tag: "25.04"
+    pullPolicy: IfNotPresent
+    pullSecrets:
+      - ngc-secret
+  # URL to the NeMo Entity Store microservice
+  entitystore:
+    endpoint: http://nemoentitystore-sample.nemo.svc.cluster.local:8000
+  # URL to the NeMo Data Store microservice
+  datastore:
+    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
+  # URL for MLflow tracking server
+  mlflow:
+    endpoint: http://mlflow-tracking.nemo.svc.cluster.local:80
+  # Configuration for the data store CLI tools
+  nemoDatastoreTools:
+    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04
+  # Configuration for model download jobs
+  modelDownloadJobs:
+    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.04"
+    ngcAPISecret:
+      # Secret that stores NGC API key
+      name: ngc-api-secret
+      # Key inside secret
+      key: "NGC_API_KEY"
+    securityContext:
+      fsGroup: 1000
+      runAsNonRoot: true
+      runAsUser: 1000
+      runAsGroup: 1000
+    # Time (in seconds) to retain job after completion
+    ttlSecondsAfterFinished: 600
+    # Polling frequency to check job status
+    pollIntervalSeconds: 15
+  # Name to the ConfigMap containing model definitions
+  modelConfig:
+    name: nemo-model-config
+  # Training configuration
+  trainingConfig:
+    configMap:
+      # Optional: Additional configuration to merge into training config
+      name: nemo-training-config
+    # PVC where model artifacts are cached or used during training
+    modelPVC:
+      create: true
+      name: finetuning-ms-models-pvc
+      # StorageClass for the PVC (can be empty to use default)
+      storageClass: ""
+      volumeAccessMode: ReadWriteOnce
+      size: 50Gi
+    # Workspace PVC automatically created per job
+    workspacePVC:
+      storageClass: "local-path"
+      volumeAccessMode: ReadWriteOnce
+      size: 10Gi
+      # Mount path for workspace inside container
+      mountPath: /pvc/workspace
+    image:
+      repository: nvcr.io/nvidia/nemo-microservices/customizer
+      tag: "25.04"
+    env:
+      - name: LOG_LEVEL
+        value: INFO
+    # Multi-node networking environment variables for training (CSPs)
+    networkConfig:
+      - name: NCCL_IB_SL
+        value: "0"
+      - name: NCCL_IB_TC
+        value: "41"
+      - name: NCCL_IB_QPS_PER_CONNECTION
+        value: "4"
+      - name: UCX_TLS
+        value: TCP
+      - name: UCX_NET_DEVICES
+        value: eth0
+      - name: HCOLL_ENABLE_MCAST_ALL
+        value: "0"
+      - name: NCCL_IB_GID_INDEX
+        value: "3"
+    # TTL for training job after it completes
+    ttlSecondsAfterFinished: 3600
+    # Timeout duration (in seconds) for training job
+    timeout: 3600
+    # Node tolerations
+    tolerations:
+      - key: "nvidia.com/gpu"
+        operator: "Exists"
+        effect: "NoSchedule"
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NemoDatastore
+metadata:
+  name: nemodatastore-sample
+  namespace: nemo
+spec:
+  secrets:
+    datastoreConfigSecret: "nemo-ms-nemo-datastore"
+    datastoreInitSecret: "nemo-ms-nemo-datastore-init"
+    datastoreInlineConfigSecret: "nemo-ms-nemo-datastore-inline-config"
+    giteaAdminSecret: "gitea-admin-credentials"
+    lfsJwtSecret: "nemo-ms-nemo-datastore--lfs-jwt"
+  databaseConfig:
+    credentials:
+      user: ndsuser
+      secretName: datastore-pg-existing-secret
+      passwordKey: password
+    host: datastore-pg-postgresql.nemo.svc.cluster.local
+    port: 5432
+    databaseName: ndsdb
+  pvc:
+    name: "pvc-shared-data"
+    create: true
+    storageClass: ""
+    volumeAccessMode: ReadWriteOnce
+    size: "10Gi"
+  expose:
+    service:
+      type: ClusterIP
+      port: 8000
+  image:
+    repository: nvcr.io/nvidia/nemo-microservices/datastore
+    tag: "25.04"
+    pullPolicy: IfNotPresent
+    pullSecrets:
+      - ngc-secret
+  replicas: 1
+  resources:
+    requests:
+      memory: "256Mi"
+      cpu: "500m"
+    limits:
+      memory: "512Mi"
+      cpu: "1"
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+---
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NemoEntitystore
+metadata:
+  name: nemoentitystore-sample
+  namespace: nemo
+spec:
+  image:
+    repository: nvcr.io/nvidia/nemo-microservices/entity-store
+    tag: "25.04"
+    pullPolicy: IfNotPresent
+    pullSecrets:
+      - ngc-secret
+  expose:
+    service:
+      type: ClusterIP
+      port: 8000
+  databaseConfig:
+    databaseName: nesdb
+    host: entity-store-pg-postgresql.nemo.svc.cluster.local
+    port: 5432
+    credentials:
+      user: nesuser
+      secretName: entity-store-pg-existing-secret
+      passwordKey: password
+  datastore:
+    endpoint: http://nemodatastore-sample.nemo.svc.cluster.local:8000
