
Torch-TensorRT parallelism for distributed inference

The examples in this folder demonstrate distributed inference on multiple devices with the Torch-TensorRT backend.

  1. Data parallel distributed inference based on Accelerate

Using Accelerate, users can achieve data parallel distributed inference with the Torch-TensorRT backend. In this case, the entire model is loaded onto each GPU, and a different chunk of the input batch is processed on each device.

See the examples prefixed with data_parallel for more details.
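The snippet below is a minimal sketch of this pattern, not the shipped example: it assumes the Hugging Face gpt2 checkpoint as a stand-in model and uses torch.compile with the "torch_tensorrt" backend; the compile options shown are illustrative and version-dependent.

```python
import torch
import torch_tensorrt  # noqa: F401  # importing registers the "torch_tensorrt" torch.compile backend
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

# One process per GPU; PartialState wires up the distributed state.
distributed_state = PartialState()

# Stand-in model: every process loads the full model onto its own device.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(distributed_state.device)

# Compile the forward pass with the Torch-TensorRT backend (options are illustrative).
model.forward = torch.compile(
    model.forward,
    backend="torch_tensorrt",
    options={"enabled_precisions": {torch.float16}},
    dynamic=False,
)

prompts = ["What is parallel computing?", "What is a GPU?"]
# Each process receives a different chunk of the prompt batch.
with distributed_state.split_between_processes(prompts) as prompt:
    inputs = tokenizer(prompt, return_tensors="pt").to(distributed_state.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    print(distributed_state.process_index, tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

Launch it with, for example, accelerate launch --num_processes 2 your_script.py or torchrun --nproc_per_node=2 your_script.py.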

  2. Tensor parallel distributed inference

Here we use torch.distributed as an example, but compilation with tensor parallelism is agnostic to the implementation framework as long as the module is properly sharded.

torchrun --nproc_per_node=2 tensor_parallel_llama2.py
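The following is a minimal, illustrative sketch of the idea rather than the shipped example: it shards a toy two-layer MLP with torch.distributed tensor parallelism (the shipped script shards a Llama 2 model) and then compiles the sharded module with the Torch-TensorRT backend. The module names, parallelization plan, and compile options are assumptions for illustration.

```python
import os

import torch
import torch.nn as nn
import torch_tensorrt  # noqa: F401  # registers the "torch_tensorrt" backend
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    """Stand-in module; the shipped example shards a full Llama 2 model instead."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up_proj = nn.Linear(dim, 4 * dim)
        self.down_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))


# torchrun sets WORLD_SIZE / LOCAL_RANK; bind each process to its own GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))

model = ToyMLP().cuda().eval()
# Shard the first projection column-wise and the second row-wise across the mesh.
model = parallelize_module(
    model,
    mesh,
    {"up_proj": ColwiseParallel(), "down_proj": RowwiseParallel()},
)

x = torch.randn(8, 1024, device="cuda")
# Compile the already-sharded module with the Torch-TensorRT backend.
compiled = torch.compile(model, backend="torch_tensorrt", dynamic=False)
with torch.no_grad():
    out = compiled(x)
print(f"rank {local_rank}: output shape {tuple(out.shape)}")
```

Run it the same way as the shipped example, e.g. torchrun --nproc_per_node=2 your_script.py.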

  3. Tensor parallel distributed inference using the NCCL ops plugin

apt install libmpich-dev

apt install libopenmpi-dev

# For Python 3.10
pip install tensorrt-llm

For other Python versions, you need to load libnvinfer_plugin_tensorrt_llm.so yourself by setting the environment variable: export TRTLLM_PLUGINS_PATH={lib_path}. For example, the examples already set this variable in initialize_distributed_env(); you can replace that value with your own TRTLLM_PLUGINS_PATH or unset it there. A sketch of locating the library programmatically is shown after the run command below.

# Then pip install the tensorrt and torch versions compatible with the installed Torch-TensorRT

mpirun -n 2 --allow-run-as-root python tensor_parallel_simple_example.py
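If you need to set TRTLLM_PLUGINS_PATH manually, the hypothetical helper below sketches one way to find libnvinfer_plugin_tensorrt_llm.so inside an installed tensorrt_llm package and export the variable before initializing the distributed example. The function name and search strategy are illustrative assumptions, not part of the shipped examples.

```python
import os
import site
from pathlib import Path


def export_trtllm_plugins_path() -> str:
    """Hypothetical helper: locate libnvinfer_plugin_tensorrt_llm.so and export TRTLLM_PLUGINS_PATH."""
    # Honor an already-set, valid path first.
    existing = os.environ.get("TRTLLM_PLUGINS_PATH")
    if existing and Path(existing).exists():
        return existing

    # Otherwise search the site-packages directories of the current interpreter.
    search_roots = site.getsitepackages() + [site.getusersitepackages()]
    for root in search_roots:
        for candidate in Path(root).rglob("libnvinfer_plugin_tensorrt_llm.so"):
            os.environ["TRTLLM_PLUGINS_PATH"] = str(candidate)
            return str(candidate)

    raise FileNotFoundError(
        "libnvinfer_plugin_tensorrt_llm.so not found; install tensorrt-llm "
        "or set TRTLLM_PLUGINS_PATH to the library location manually"
    )


if __name__ == "__main__":
    print(export_trtllm_plugins_path())
```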


  4. Tensor parallel distributed Llama 3 inference using the NCCL ops plugin

apt install libmpich-dev

apt install libopenmpi-dev

# For Python 3.10
pip install tensorrt-llm

For other Python versions, you need to load libnvinfer_plugin_tensorrt_llm.so; see the TRTLLM_PLUGINS_PATH note in the previous section.

# Then pip install the tensorrt and torch versions compatible with the installed Torch-TensorRT

mpirun -n 2 --allow-run-as-root python tensor_parallel_llama3.py