
Commit d26e7f7

phymbert authored and ngxson committed
server: continuous performance monitoring and PR comment (ggml-org#6283)
* server: bench: init
* server: bench: reduce list of GPU nodes
* server: bench: fix graph, fix output artifact
* ci: bench: add mermaid in case of image cannot be uploaded
* ci: bench: more resilient, more metrics
* ci: bench: trigger build
* ci: bench: fix duration
* ci: bench: fix typo
* ci: bench: fix mermaid values, markdown generated
* typo on the step name

Co-authored-by: Xuan Son Nguyen <[email protected]>

* ci: bench: trailing spaces
* ci: bench: move images in a details section
* ci: bench: reduce bullet point size

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
1 parent c91d021 commit d26e7f7

File tree

5 files changed (+603 / −9 lines)

.github/workflows/bench.yml (new file, +279 lines)
@@ -0,0 +1,279 @@
# Benchmark
name: Benchmark

on:
  workflow_dispatch:
    inputs:
      gpu-series:
        description: 'Azure GPU series to run with'
        required: true
        type: choice
        options:
          - Standard_NC4as_T4_v3
          - Standard_NC24ads_A100_v4
          - Standard_NC80adis_H100_v5
      sha:
        description: 'Commit SHA1 to build'
        required: false
        type: string
      duration:
        description: 'Duration of the bench'
        type: string
        default: 10m

  push:
    branches:
      - master
    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
  schedule:
    - cron: '04 2 * * *'

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  bench-server-baseline:
    runs-on: Standard_NC4as_T4_v3

    env:
      RUNNER_LABEL: Standard_NC4as_T4_v3 # FIXME: find a way to avoid duplicating this value
      N_USERS: 8
      DURATION: 10m

    if: ${{ github.event.inputs.gpu-series == 'Standard_NC4as_T4_v3' || github.event.schedule || github.event.pull_request || github.event.push.ref == 'refs/heads/master' }}

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
          ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}

      - name: Install python env
        id: pipenv
        run: |
          cd examples/server/bench
          python3 -m venv venv
          source venv/bin/activate
          pip install -r requirements.txt

      - name: Prometheus
        id: install_prometheus
        run: |
          wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
          tar xzf prometheus*.tar.gz --strip-components=1
          ./prometheus --config.file=examples/server/bench/prometheus.yml &
          while ! nc -z localhost 9090; do
            sleep 0.1
          done

      - name: Install k6
        id: k6_installation
        run: |
          cd examples/server/bench
          wget --quiet https://github.com/grafana/k6/releases/download/v0.49.0/k6-v0.49.0-linux-amd64.tar.gz
          tar xzf k6*.tar.gz --strip-components=1

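The Prometheus step backgrounds the server and then polls `nc -z` until port 9090 accepts connections. A minimal local sketch of that readiness pattern follows; the `wait_for_port` helper name and the one-second poll interval with a timeout guard are illustrative additions, not part of the workflow, which loops without a deadline:

```shell
#!/bin/sh
# Sketch: poll a TCP port until it accepts connections, as the Prometheus
# step does, but give up after a deadline instead of looping forever.
# wait_for_port HOST PORT TIMEOUT_SECONDS -> 0 if reachable, 1 on timeout
wait_for_port() {
    host=$1; port=$2; timeout=$3
    elapsed=0
    while ! nc -z "$host" "$port" 2>/dev/null; do
        sleep 1
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
    done
    return 0
}

# Demo on a port that is almost certainly closed: the guard fires.
wait_for_port 127.0.0.1 65321 1 || echo "gave up waiting"
```

In CI the unbounded loop is usually acceptable because the job itself has a timeout; a local script benefits from failing fast instead.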
      - name: Build
        id: cmake_build
        run: |
          set -eux
          mkdir build
          cd build
          cmake .. \
            -DLLAMA_NATIVE=OFF \
            -DLLAMA_BUILD_SERVER=ON \
            -DLLAMA_CURL=ON \
            -DLLAMA_CUBLAS=ON \
            -DCUDAToolkit_ROOT=/usr/local/cuda \
            -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
            -DCMAKE_CUDA_ARCHITECTURES=75 \
            -DLLAMA_FATAL_WARNINGS=OFF \
            -DLLAMA_ALL_WARNINGS=OFF \
            -DCMAKE_BUILD_TYPE=Release
          cmake --build . --config Release -j $(nproc) --target server

      - name: Download the dataset
        id: download_dataset
        run: |
          cd examples/server/bench
          wget --quiet https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

      - name: Server bench
        id: server_bench
        run: |
          set -eux

          cd examples/server/bench
          source venv/bin/activate
          BENCH_K6_BIN_PATH=./k6 python bench.py \
            --runner-label ${{ env.RUNNER_LABEL }} \
            --name ${{ github.job }} \
            --branch ${{ github.head_ref || github.ref_name }} \
            --commit ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha }} \
            --scenario script.js \
            --duration ${{ github.event.inputs.duration || env.DURATION }} \
            --hf-repo ggml-org/models \
            --hf-file phi-2/ggml-model-q4_0.gguf \
            --model-path-prefix /models \
            --parallel ${{ env.N_USERS }} \
            -ngl 33 \
            --batch-size 2048 \
            --ubatch-size 256 \
            --ctx-size 16384 \
            --n-prompts 1000 \
            --max-prompt-tokens 1024 \
            --max-tokens 2048

          cat results.github.env >> $GITHUB_ENV

          # Remove the dataset as we do not want it in the artifact
          rm ShareGPT_V3_unfiltered_cleaned_split.json

      - uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          compression-level: 9
          path: |
            examples/server/bench/*.jpg
            examples/server/bench/*.json
            examples/server/bench/*.log

      - name: Commit status
        uses: Sibz/github-status-action@v1
        with:
          authToken: ${{ secrets.GITHUB_TOKEN }}
          sha: ${{ inputs.sha || github.event.pull_request.head.sha || github.sha }}
          context: bench-server-baseline
          description: |
            ${{ env.BENCH_RESULTS }}
          state: 'success'

      - name: Upload benchmark images
        uses: devicons/[email protected]
        continue-on-error: true # Important, as the action looks unstable: 503 errors
        id: imgur_step
        with:
          client_id: ${{ secrets.IMGUR_CLIENT_ID }}
          path: |
            examples/server/bench/prompt_tokens_seconds.jpg
            examples/server/bench/predicted_tokens_seconds.jpg
            examples/server/bench/kv_cache_usage_ratio.jpg
            examples/server/bench/requests_processing.jpg

      - name: Extract mermaid
        id: set_mermaid
        run: |
          set -eux

          cd examples/server/bench
          PROMPT_TOKENS_SECONDS=$(cat prompt_tokens_seconds.mermaid)
          echo "PROMPT_TOKENS_SECONDS<<EOF" >> $GITHUB_ENV
          echo "$PROMPT_TOKENS_SECONDS" >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV

          PREDICTED_TOKENS_SECONDS=$(cat predicted_tokens_seconds.mermaid)
          echo "PREDICTED_TOKENS_SECONDS<<EOF" >> $GITHUB_ENV
          echo "$PREDICTED_TOKENS_SECONDS" >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV

          KV_CACHE_USAGE_RATIO=$(cat kv_cache_usage_ratio.mermaid)
          echo "KV_CACHE_USAGE_RATIO<<EOF" >> $GITHUB_ENV
          echo "$KV_CACHE_USAGE_RATIO" >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV

          REQUESTS_PROCESSING=$(cat requests_processing.mermaid)
          echo "REQUESTS_PROCESSING<<EOF" >> $GITHUB_ENV
          echo "$REQUESTS_PROCESSING" >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV

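The `NAME<<EOF ... EOF` lines written to `$GITHUB_ENV` are GitHub Actions' delimiter syntax for passing a multiline value (here, a mermaid graph) to later steps: `GITHUB_ENV` is just a file that the runner parses between steps. A minimal local sketch of that round trip, emulating the runner's parsing with `awk` (the temp file and sample value are illustrative):

```shell
#!/bin/sh
# Sketch: emulate how the runner turns "NAME<<EOF ... EOF" lines in
# GITHUB_ENV into a multiline environment variable for later steps.
GITHUB_ENV=$(mktemp)   # stand-in for the runner-provided file

VALUE="line one
line two"
echo "PROMPT_TOKENS_SECONDS<<EOF" >> "$GITHUB_ENV"
echo "$VALUE" >> "$GITHUB_ENV"
echo "EOF" >> "$GITHUB_ENV"

# Emulate the runner: collect everything between the two delimiters.
PARSED=$(awk '/^PROMPT_TOKENS_SECONDS<<EOF$/{f=1;next} /^EOF$/{f=0} f' "$GITHUB_ENV")
echo "$PARSED"
rm -f "$GITHUB_ENV"
```

A plain `echo "NAME=$VALUE" >> $GITHUB_ENV` would break here because the value contains newlines; the delimiter form is the documented way around that.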
      - name: Extract image url
        id: extract_image_url
        continue-on-error: true
        run: |
          set -eux

          echo "IMAGE_O=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[0] }}" >> $GITHUB_ENV
          echo "IMAGE_1=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[1] }}" >> $GITHUB_ENV
          echo "IMAGE_2=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[2] }}" >> $GITHUB_ENV
          echo "IMAGE_3=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[3] }}" >> $GITHUB_ENV

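The `fromJSON(steps.imgur_step.outputs.imgur_urls)[N]` expressions rely on the imgur action emitting its `imgur_urls` output as a JSON array string, which `fromJSON` parses so that `[N]` can index it. A local sketch of the same parsing, using `python3`'s `json` module as a stand-in for `fromJSON` (the URLs are illustrative placeholders, not real action output):

```shell
#!/bin/sh
# Sketch: what fromJSON(...)[0] does to the action's JSON-array output.
IMGUR_URLS='["https://i.imgur.com/aaa.jpg", "https://i.imgur.com/bbb.jpg"]'
# Parse the array and take element 0, as fromJSON(...)[0] would.
IMAGE_0=$(printf '%s' "$IMGUR_URLS" | python3 -c 'import json, sys; print(json.load(sys.stdin)[0])')
echo "$IMAGE_0"
```

The `continue-on-error: true` on this step matters: if the upload step failed, `imgur_urls` is empty and `fromJSON` would error, but the job still proceeds to post the PR comment with mermaid fallbacks.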
      - name: Comment PR
        uses: mshick/add-pr-comment@v2
        id: comment_pr
        if: ${{ github.event.pull_request != '' }}
        with:
          message-id: bench-${{ github.job }}-${{ env.RUNNER_LABEL }}
          message: |
            📈 **llama.cpp server** for _${{ github.job }}_ on _${{ env.RUNNER_LABEL }}_: **${{ env.BENCH_ITERATIONS }} iterations** 🚀

            - Concurrent users: ${{ env.N_USERS }}, duration: ${{ github.event.inputs.duration || env.DURATION }}
            - HTTP request: avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(90)=${{ env.HTTP_REQ_DURATION_P_90_ }}ms fails=${{ env.HTTP_REQ_FAILED_PASSES }}, finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }} truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}
            - Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_TOKENS_AVG }}tk/s p(90)=${{ env.LLAMACPP_PROMPT_TOKENS_P_90_ }}tk/s **total=${{ env.LLAMACPP_PROMPT_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
            - Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(90)=${{ env.LLAMACPP_TOKENS_SECOND_P_90_ }}tk/s **total=${{ env.LLAMACPP_COMPLETION_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
            - ${{ env.BENCH_GRAPH_XLABEL }}

            <details>

            <summary>Time series</summary>

            <p align="center">

            <img width="100%" height="100%" src="${{ env.IMAGE_O }}" alt="prompt_tokens_seconds" />

            <details>

            <summary>More</summary>

            ```mermaid
            ${{ env.PROMPT_TOKENS_SECONDS }}
            ```

            </details>

            <img width="100%" height="100%" src="${{ env.IMAGE_1 }}" alt="predicted_tokens_seconds" />

            <details>
            <summary>More</summary>

            ```mermaid
            ${{ env.PREDICTED_TOKENS_SECONDS }}
            ```

            </details>

            </p>

            <details>

            <summary>Details</summary>

            <p align="center">

            <img width="100%" height="100%" src="${{ env.IMAGE_2 }}" alt="kv_cache_usage_ratio" />

            <details>
            <summary>More</summary>

            ```mermaid
            ${{ env.KV_CACHE_USAGE_RATIO }}
            ```

            </details>

            <img width="100%" height="100%" src="${{ env.IMAGE_3 }}" alt="requests_processing" />

            <details>
            <summary>More</summary>

            ```mermaid
            ${{ env.REQUESTS_PROCESSING }}
            ```

            </details>

            </p>
            </details>
            </details>
