Commit 943195f

Add benchmark automation tool
1 parent 12bcc9a commit 943195f


67 files changed: +4855 −28 lines
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
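
Since these patterns support negation, a later `!` rule can re-include a file matched by an earlier glob; a small illustrative sketch (these two patterns are not part of this commit):

```
# Hypothetical: exclude all markdown files except the chart README
*.md
!README.md
```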
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
apiVersion: v2
name: inferencemodel
description: A Helm chart for InferenceModel

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"
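
As the comments note, `version` tracks the chart itself under SemVer while `appVersion` tracks the deployed application; a hedged sketch of how the two fields might move on a later release (the numbers are illustrative, not part of this commit):

```yaml
version: 0.2.0        # chart templates changed, so bump per SemVer
appVersion: "1.17.0"  # the application image moved forward; quoted as recommended
```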
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
{{/*
Common labels
*/}}
{{- define "gateway-api-inference-extension.labels" -}}
app.kubernetes.io/name: {{ include "gateway-api-inference-extension.name" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
{{- end }}

{{/*
Inference extension name
*/}}
{{- define "gateway-api-inference-extension.name" -}}
{{- $base := .Values.inferencePool.name | default "default-pool" | lower | trim | trunc 40 -}}
{{ $base }}-epp
{{- end -}}

{{/*
Selector labels
*/}}
{{- define "gateway-api-inference-extension.selectorLabels" -}}
app: {{ include "gateway-api-inference-extension.name" . }}
{{- end -}}
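
These named templates are consumed elsewhere in the chart via Helm's `include`; a minimal sketch of a template fragment using them (the fragment is illustrative, not part of this commit):

```yaml
metadata:
  name: {{ include "gateway-api-inference-extension.name" . }}
  labels:
    {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 6 }}
```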
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b
  targetModels:
  - name: tweet-summary-1
    weight: 100

---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-base-model
spec:
  modelName: meta-llama/Llama-2-7b-hf
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b

---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-base-model-cpu
spec:
  modelName: Qwen/Qwen2.5-1.5B-Instruct
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b
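
In the first sample, all traffic for `tweet-summary` resolves to the single target with weight 100; the `targetModels` weights drive weighted traffic splitting within the pool. A hedged sketch of splitting requests across two versions (the second target name is hypothetical):

```yaml
  targetModels:
  - name: tweet-summary-1
    weight: 90
  - name: tweet-summary-2  # hypothetical second fine-tuned version
    weight: 10
```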
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Default values for inferencemodel.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

image:
  repository: nginx
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Automatically mount a ServiceAccount's API credentials?
  automount: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

podAnnotations: {}
podLabels: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  className: ""
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

livenessProbe:
  httpGet:
    path: /
    port: http
readinessProbe:
  httpGet:
    path: /
    port: http

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

# Additional volumes on the output Deployment definition.
volumes: []
# - name: foo
#   secret:
#     secretName: mysecret
#     optional: false

# Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# - name: foo
#   mountPath: "/etc/foo"
#   readOnly: true

nodeSelector: {}

tolerations: []

affinity: {}
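
Any of these defaults can be overridden at install time; a minimal sketch, assuming the chart is installed from a local `./inferencemodel` directory (the release name and chart path are assumptions, not part of this commit):

```bash
# Hypothetical install overriding scaffold defaults; inferencePool.name
# feeds the name helper in the chart's _helpers.tpl
helm install benchmark-epp ./inferencemodel \
  --set replicaCount=2 \
  --set inferencePool.name=vllm-llama2-7b
```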

config/manifests/benchmark/model-server-service.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

site-src/performance/benchmark/index.md

Lines changed: 17 additions & 15 deletions
@@ -5,30 +5,26 @@ inference extension, and a Kubernetes service as the load balancing strategy. Th
 benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
 tool to generate load and collect results.
 
-## Prerequisites
+## Run benchmarks manually
 
-### Deploy the inference extension and sample model server
+### Prerequisite: have an endpoint ready to serve inference traffic
 
-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+To serve via a Gateway using the inference extension, follow this [user guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/)
+to deploy the sample vLLM application and the inference extension.
 
-### [Optional] Scale the sample vLLM deployment
-
-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are enough replicas for it to make optimal routing decisions, so consider scaling the sample application up:
 
 ```bash
 kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
 ```
 
-### Expose the model server via a k8s service
-
-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To serve via a Kubernetes LoadBalancer service as a baseline comparison, expose the sample application:
 
 ```bash
 kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
 ```
 
-## Run benchmark
+### Run benchmark
 
 The LPG benchmark tool works by sending traffic to the specified target IP and port and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
 
@@ -60,18 +56,24 @@ to specify what this benchmark is for. For instance, `inference-extension` or `k
 the script below will watch for that log line and then start downloading results.
 
 ```bash
-benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
 ```
 
 1. After the script finishes, you should see benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder.
 
-### Tips
+#### Tips
 
-* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
+* You can specify the `run_id="runX"` environment variable when running the `download-benchmark-results.bash` script.
   This is useful when you run benchmarks multiple times to get more statistically meaningful results and group them accordingly.
 * Update the `request_rates` to best suit your benchmark environment.
 
-### Advanced Benchmark Configurations
+## Run benchmarks automatically
+
+The [benchmark automation tool](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark) enables defining benchmarks via a config file and running them
+automatically. It's currently experimental; to try it, refer to its [user guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark).
+
+
+## Advanced Benchmark Configurations
 
 Please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
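
The `run_id` tip above pairs naturally with `benchmark_id`; a hedged sketch of grouping two repeated runs of the same benchmark (the `run_id` values are illustrative):

```bash
# Hypothetical: repeated downloads grouped under separate run folders
benchmark_id='my-benchmark' run_id='run1' ./tools/benchmark/scripts/download-benchmark-results.bash
benchmark_id='my-benchmark' run_id='run2' ./tools/benchmark/scripts/download-benchmark-results.bash
```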

tools/benchmark/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
output/
