
Commit 3856265

kaiyux and Zars19 authored

Update TensorRT-LLM (#2502)

* Update TensorRT-LLM

Co-authored-by: 岑灿 <[email protected]>

1 parent 535c9cc · commit 3856265

File tree

487 files changed: +678595 −2461601 lines


README.md (+5 −3)
```diff
@@ -17,12 +17,14 @@ TensorRT-LLM
 <div align="left">
 
 ## Latest News
+* [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs
+[➡️ link](https://developer.nvidia.com/blog/llama-3-2-full-stack-optimizations-unlock-high-performance-on-nvidia-gpus/?ncid=so-link-721194)
+<div align="center">
+<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/11/three-llamas-holding-number-10-signs-1.jpg" width="50%">
+<div align="left">
 
 * [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot
 [➡️ link](https://developer.nvidia.com/blog/3x-faster-allreduce-with-nvswitch-and-tensorrt-llm-multishot/)
-<div align="center">
-<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/HGX-H200-tech-blog-1920x1080-1.jpg" width="50%">
-<div align="left">
 
 * [2024/11/09] ✨ NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌
 [➡️ link](https://blogs.nvidia.co.kr/blog/nvidia-lg-ai-research/)
```

benchmarks/cpp/CMakeLists.txt (+2 −1)
```diff
@@ -25,7 +25,7 @@ if(NOT TARGET cxxopts::cxxopts)
 endif()
 
 function(add_benchmark test_name test_src)
-  add_executable(${test_name} ${test_src})
+  add_executable(${test_name} ${test_src} utils/utils.cpp)
 
   target_link_libraries(
     ${test_name} PUBLIC ${SHARED_TARGET} nvinfer_plugin_tensorrt_llm
@@ -40,3 +40,4 @@ endfunction()
 add_benchmark(gptSessionBenchmark gptSessionBenchmark.cpp)
 add_benchmark(bertBenchmark bertBenchmark.cpp)
 add_benchmark(gptManagerBenchmark gptManagerBenchmark.cpp)
+add_benchmark(disaggServerBenchmark disaggServerBenchmark.cpp)
```
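Since `add_benchmark()` registers a regular executable target, the new binary can be built by name. A minimal sketch, assuming the `cpp/build` CMake build tree from the usage section below already exists:

```
cd cpp/build

# Rebuild just the benchmark target added by this commit;
# the target name comes from the add_benchmark() call above.
cmake --build . --target disaggServerBenchmark
```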

benchmarks/cpp/README.md (+43)
If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.

*Please note that the expected outputs in that document are only for reference; specific performance numbers depend on the GPU you're using.*
### 4. Launch C++ disaggServerBenchmark

Currently, TensorRT-LLM has limited support for disaggregated inference, where the context and generation phases of a request can run on different executors. `disaggServerBenchmark` is a tool to benchmark disaggregated inference.

#### Usage

For detailed usage, run the following:

```
cd cpp/build

# You can directly execute the binary for help information
./benchmarks/disaggServerBenchmark --help
```

`disaggServerBenchmark` only supports `decoder-only` models.

Here is the basic usage:

```
mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
--generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
```

This command launches m context engines and n generation engines. You need to ensure that `proc` equals the sum of the processes required by each engine, plus 1: since `disaggServerBenchmark` uses orchestrator mode, one additional process is needed for the orchestrator. For example, if there are two context engines (one TP2_PP1, the other TP1_PP1) and two generation engines (one TP2_PP1, the other TP1_PP1), then `proc` should be set to 7 (2 + 1 + 2 + 1 engine ranks, plus 1 for the orchestrator).

For example:

```
mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}

# needs 6 GPUs and 7 processes to launch the benchmark.
```
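The process count can be sanity-checked before launching. A minimal sketch of the arithmetic described above (one rank per TP×PP slice of each engine, plus one orchestrator process):

```
# proc = sum of (TP * PP) over all engines + 1 orchestrator process
# Two context engines (TP2_PP1, TP1_PP1) and two generation engines (TP1_PP1, TP2_PP1):
proc=$(( 2*1 + 1*1 + 1*1 + 2*1 + 1 ))
echo ${proc}  # 7
```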
#### Known Issues

##### 1. Error `All available sequence slots are used`

If a generation engine's `pp_size` is greater than 1, the error `All available sequence slots are used` may occur; setting and tuning the `--request_rate` parameter may help alleviate the problem.
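A sketch of the suggested mitigation; the `--request_rate` flag comes from the note above, while the engine paths and the rate value here are purely illustrative:

```
# Throttle request submission (the value 10 is illustrative; tune for your setup)
mpirun -n ${proc} benchmarks/disaggServerBenchmark \
  --context_engine_dirs ${context_engine_0} \
  --generation_engine_dirs ${generation_engine_0} \
  --dataset ${dataset_path} \
  --request_rate 10
```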
##### 2. KVCache transfers go over PCIe by default on a single node

Currently, because of the dependency libraries, KVCache transfers go over PCIe by default on a single node.

If you want to use NVLink, check the UCX version in the container by running:

```
ucx_info -v
```

If the UCX version is less than or equal to 1.17, set `UCX_RNDV_FRAG_MEM_TYPE=cuda` to enable KVCache transfers over NVLink.

If the UCX version is 1.18, set `UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda` to enable KVCache transfers over NVLink.
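A minimal sketch of applying the workaround at launch time, assuming Open MPI's `-x` flag to forward an environment variable to all ranks (the engine paths reuse the placeholders from the example above):

```
# Check the UCX version first
ucx_info -v

# UCX <= 1.17: enable KVCache transfers over NVLink
mpirun -x UCX_RNDV_FRAG_MEM_TYPE=cuda -n 7 benchmarks/disaggServerBenchmark \
  --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} \
  --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} \
  --dataset ${dataset_path}

# UCX 1.18: pass -x UCX_CUDA_COPY_ASYNC_MEM_TYPE=cuda instead
```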
