Commit f63c134
[Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
Enable GLM4.6 to run MTP speculative decoding in full-graph mode to improve performance.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve /weight/glm4.6_w8a8_with_float_mtp \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name glm \
--max-model-len 35000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 16 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--async-scheduling
```
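The two JSON arguments in the serve command are easy to break when they are edited or re-wrapped, so a quick syntax check before launching can save a failed startup. The following is a minimal sketch, assuming `python3` is on the `PATH`; `check_json` is a hypothetical helper introduced here for illustration and is not part of vLLM or this PR. It only verifies that the strings parse as JSON, not that vLLM accepts the keys.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Same JSON values as passed to --speculative-config and --compilation-config.
SPEC_CONFIG='{"num_speculative_tokens": 1, "model": "/weight/glm4.6_w8a8_with_float_mtp", "method": "mtp"}'
COMP_CONFIG='{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}'

# check_json (hypothetical helper): returns non-zero if its argument is not valid JSON.
check_json() {
  printf '%s' "$1" | python3 -m json.tool > /dev/null
}

check_json "$SPEC_CONFIG" && echo "speculative-config: ok"
check_json "$COMP_CONFIG" && echo "compilation-config: ok"
```

The checked variables can then be passed to the command as `--speculative-config "$SPEC_CONFIG"` and `--compilation-config "$COMP_CONFIG"`, which avoids quoting the JSON twice.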
Test case:
```bash
vllm bench serve \
--backend vllm \
--dataset-name prefix_repetition \
--prefix-repetition-prefix-len 22400 \
--prefix-repetition-suffix-len 9600 \
--prefix-repetition-output-len 1024 \
--num-prompts 1 \
--prefix-repetition-num-prefixes 1 \
--ignore-eos \
--model glm \
--tokenizer /weight/glm4.6_w8a8_with_float_mtp \
--seed 1000 \
--host 0.0.0.0 \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 1 \
--request-rate 1
```
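Before running the benchmark, a single request can confirm that the server answers on the same endpoint under the served model name `glm`. This is a sketch under assumptions: the server is reachable at `http://localhost:8000` (the benchmark above uses port 8000), and the prompt and `max_tokens` values are illustrative only.

```bash
#!/usr/bin/env bash

# Assumption: the server started by `vllm serve` is reachable here.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Illustrative request body; "glm" matches --served-model-name.
PAYLOAD='{"model": "glm", "prompt": "Hello", "max_tokens": 16}'

# --fail returns non-zero on HTTP errors; --max-time bounds the wait in seconds.
if ! curl --silent --show-error --fail --max-time 60 \
    "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"; then
  echo "request failed: is the server running at $BASE_URL?" >&2
fi
```

A JSON response containing a `choices` array indicates the model is loaded and serving; an error here usually means the server is still starting or the model name does not match.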
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89
Signed-off-by: 1092626063 <[email protected]>

1 parent 09682e0 commit f63c134
2 files changed, +18 -6 lines changed:
- tests/e2e/nightly/single_node/models
- vllm_ascend/quantization

Changed line ranges (diff bodies not captured):
- First diff: 1 line added at new line 32; old lines 68-72 replaced by new lines 69-71; 5 lines added at new lines 93-97.
- Second diff: old line 176 replaced by new lines 176-184.