benchmarks/cpp/README.md
If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
### 4. Launch C++ disaggServerBenchmark
Currently, TensorRT-LLM has limited support for disaggregated inference, where the context and generation phases of a request can run on different executors. `disaggServerBenchmark` is a tool to benchmark disaggregated inference.
#### Usage
For detailed usage, you can run the following:
```
cd cpp/build

# You can directly execute the binary for help information
./benchmarks/disaggServerBenchmark --help
```
`disaggServerBenchmark` only supports `decoder-only` models.
This command will launch m context engines and n generation engines. You need to ensure `proc` is equal to the sum of the number of processes required for each engine plus 1. Since `disaggServerBenchmark` uses orchestrator mode, one additional process is needed as the orchestrator. For example, if there are two context engines (one TP2_PP1, the other TP1_PP1) and two generation engines (one TP2_PP1, the other TP1_PP1), then the `proc` value should be set to 7.

```
# need 6 gpus and 7 processes to launch the benchmark.
```
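The `proc` arithmetic above can be sketched in shell. This is only an illustration of the rank count, not part of the benchmark: the engine lists are hypothetical shell variables, and each engine contributes TP×PP ranks while orchestrator mode adds one more process.

```shell
# Compute the proc value for the example above (illustrative helper,
# not a benchmark flag). Each engine needs tp * pp ranks.
context_engines="2x1 1x1"      # TP2_PP1 and TP1_PP1
generation_engines="2x1 1x1"   # TP2_PP1 and TP1_PP1

proc=1  # start at 1 for the orchestrator process
for e in $context_engines $generation_engines; do
  tp=${e%x*}
  pp=${e#*x}
  proc=$((proc + tp * pp))
done
echo "proc=$proc"   # prints proc=7: 6 engine ranks + 1 orchestrator
```

For this example the benchmark therefore needs 7 processes (and 6 GPUs, one per engine rank), matching the note above.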
#### Known Issues
##### 1. Error `All available sequence slots are used`
If the generation engine's `pp_size` is greater than 1, the error `All available sequence slots are used` may occur. Setting and adjusting the `--request_rate` parameter may help alleviate the problem.
##### 2. KVCache transfers default to PCIe on a single node
Currently, because of dependency libraries, KVCache transfers go over PCIe by default on a single node.
If you want to use NVLink, please check the UCX version in the container by running:
```
ucx_info -v
```
If the UCX version is less than or equal to 1.17, set `UCX_RNDV_FRAG_MEM_TYPE=cuda` to enable KVCache transfers using NVLink.

If the UCX version is 1.18, set `UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda` to enable KVCache transfers using NVLink.
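The two version cases can be folded into a small shell helper. This is a sketch under two assumptions: `pick_ucx_env` is a hypothetical helper name, and it only handles UCX 1.x versions; `ucx_info -v` is assumed to report a line like `# Library version: 1.17.0`.

```shell
# Choose the env var that enables NVLink KVCache transfers for a given
# UCX 1.x version string (hypothetical helper, assumptions noted above).
pick_ucx_env() {
  minor=$(echo "$1" | cut -d. -f2)   # e.g. 1.17.0 -> 17
  if [ "$minor" -le 17 ]; then
    echo "UCX_RNDV_FRAG_MEM_TYPE=cuda"
  else
    echo "UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda"
  fi
}

pick_ucx_env 1.17.0   # prints UCX_RNDV_FRAG_MEM_TYPE=cuda
pick_ucx_env 1.18.0   # prints UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda
```

In a container you could then export the result of feeding the helper the parsed `ucx_info -v` version before launching the benchmark.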