Examples in this folder demonstrate distributed inference on multiple devices with the Torch-TensorRT backend.
Data Parallel Distributed Inference based on Accelerate
Using Accelerate, users can achieve data parallel distributed inference with the Torch-TensorRT backend. In this case, the entire model will be loaded onto each GPU, and different chunks of batch input are processed on each device.
See the examples in this folder for more details; a minimal sketch of the pattern follows below.
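The sketch below is not one of the shipped examples; the model name, prompts, and compile options are placeholders. It shows the general pattern: each process compiles the full model with the Torch-TensorRT backend, and Accelerate hands each process a different slice of the input batch. Launch it with accelerate launch or torchrun, one process per GPU::

    import torch
    import torch_tensorrt  # registers the "torch_tensorrt" torch.compile backend
    from accelerate import PartialState
    from transformers import AutoModelForCausalLM, AutoTokenizer

    distributed_state = PartialState()  # one process per GPU

    # Placeholder model; the data_parallel examples in this folder use their own models and options.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).eval().to(distributed_state.device)

    # The entire model is compiled on every device.
    model.forward = torch.compile(
        model.forward,
        backend="torch_tensorrt",
        options={"enabled_precisions": {torch.float16}},
        dynamic=False,
    )

    prompts = ["Hello, my dog is", "The weather today is", "TensorRT is", "Distributed inference"]

    # Accelerate gives each process a different chunk of the batch.
    with distributed_state.split_between_processes(prompts) as local_prompts:
        inputs = tokenizer(local_prompts, return_tensors="pt", padding=True).to(distributed_state.device)
        outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
        print(distributed_state.process_index, tokenizer.batch_decode(outputs, skip_special_tokens=True))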
Tensor Parallel Distributed Inference

Here, we use torch.distributed as an example, but compilation with tensor parallelism is agnostic to the implementation framework as long as the module is properly sharded.
torchrun --nproc_per_node=2 tensor_parallel_llama2.py
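As a rough illustration of what "properly sharded" means here, the sketch below shards a toy MLP (placeholder names and sizes, loosely following tensor_parallel_simple_example.py) column-/row-wise with torch.distributed's tensor parallel API and then compiles it with the Torch-TensorRT backend. Compiling the sharded module without graph breaks additionally requires the TensorRT-LLM plugin setup described next::

    import os

    import torch
    import torch.nn as nn
    import torch_tensorrt  # registers the "torch_tensorrt" torch.compile backend
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )


    class ToyMLP(nn.Module):
        """Placeholder two-layer MLP, used only to show the sharding plan."""

        def __init__(self, dim: int = 1024):
            super().__init__()
            self.in_proj = nn.Linear(dim, 4 * dim, bias=False)
            self.out_proj = nn.Linear(4 * dim, dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.out_proj(torch.relu(self.in_proj(x)))


    # torchrun sets these environment variables for every process.
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(local_rank)

    # Creating the mesh also initializes the NCCL process group.
    mesh = init_device_mesh("cuda", (world_size,))

    model = ToyMLP().to("cuda")

    # Column-parallel first linear, row-parallel second linear; the all_reduce that
    # recombines the shards is the op that maps to a TensorRT-LLM plugin at compile time.
    model = parallelize_module(
        model,
        mesh,
        {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()},
    )

    x = torch.randn(2, 1024, device="cuda")
    compiled = torch.compile(model, backend="torch_tensorrt", dynamic=False)
    print(local_rank, compiled(x).shape)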
We use torch.distributed to shard the model with tensor parallelism. The distributed operations (all_gather and all_reduce) are then expressed as TensorRT-LLM plugins to avoid graph breaks during Torch-TensorRT compilation. The converters for these operators are already available in Torch-TensorRT; the functional implementations of the ops are imported from the tensorrt_llm package (specifically, libnvinfer_plugin_tensorrt_llm.so is required).
We have two options:
Option 1: Follow the instructions to install TensorRT-LLM. Please note that before installing TensorRT-LLM, you need to install the MPI development packages:
- apt install libmpich-dev
- apt install libopenmpi-dev
If the default installation fails due to issues such as library version mismatches or Python incompatibilities, consider using Option 2. After a successful installation, verify that
import torch_tensorrt
still works without errors; the import can fail if tensorrt_llm overrides torch_tensorrt's dependencies. Option 2 is also preferable if you do not wish to install tensorrt_llm and its dependencies.
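A quick sanity check for Option 1 is to import both packages in the same interpreter (printing the versions is merely illustrative)::

    # Both packages must be importable side by side; if this fails because tensorrt_llm
    # pulled in conflicting dependencies, fall back to Option 2 below.
    import tensorrt_llm
    import torch_tensorrt

    print("tensorrt_llm:", tensorrt_llm.__version__)
    print("torch_tensorrt:", torch_tensorrt.__version__)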
Option 2: Alternatively, you can load libnvinfer_plugin_tensorrt_llm.so manually:
- Download the tensorrt_llm-0.16.0 wheel file from NVIDIA's Python index.
- Extract the wheel file to a directory and locate libnvinfer_plugin_tensorrt_llm.so under the tensorrt_llm/libs directory.
- Set the environment variable TRTLLM_PLUGINS_PATH to that path before the initialize_distributed_env() call, as sketched below.
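For example (the .so path below is a placeholder for wherever you extracted the wheel)::

    import ctypes
    import os

    so_path = "/opt/trtllm-libs/libnvinfer_plugin_tensorrt_llm.so"  # placeholder path

    # Optional: verify that the extracted library can actually be loaded on this machine.
    ctypes.CDLL(so_path)

    # Must be set before initialize_distributed_env() runs (that is where the plugin
    # library is picked up); exporting it in the shell before mpirun works equally well.
    os.environ["TRTLLM_PLUGINS_PATH"] = so_path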
After configuring TensorRT-LLM or the TensorRT-LLM plugin library path, run the following command, which illustrates tensor parallelism of a simple model and its compilation with Torch-TensorRT:
mpirun -n 2 --allow-run-as-root python tensor_parallel_simple_example.py
We also provide a tensor parallelism compilation example for a more advanced model, Llama-3. Run the following command:
mpirun -n 2 --allow-run-as-root python tensor_parallel_llama3.py
See also:

- :ref:`tensor_parallel_llama3`: Illustration of distributed inference on multiple devices with the Torch-TensorRT backend.