trt.IStreamReader (as implemented e.g. in polygraphy) requires higher peak CPU memory and more time than naive python implementation. #4327
Comments
polygraphy is just an inference prototyping and debugging toolkit, not aimed at pursuing performance. Here it wraps the trt.IStreamReader: https://github.com/NVIDIA/TensorRT/blob/release/10.7/tools/Polygraphy/polygraphy/backend/trt/file_reader.py
@lix19937 The above implementation is an exact copy of https://github.com/NVIDIA/TensorRT/blob/release/10.7/tools/Polygraphy/polygraphy/backend/trt/file_reader.py (which is currently the only OSS implementation of trt.IStreamReader). The issue happens with both the linked code and the code in this issue.
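For context, the linked file_reader.py boils down to a small subclass of trt.IStreamReader whose read(size) simply forwards to an open file. The sketch below is a simplified rendition of that idea, not the exact polygraphy code; class and attribute names are mine:

```python
import tensorrt as trt

class FileReader(trt.IStreamReader):
    """Minimal trt.IStreamReader that streams a serialized engine from disk."""

    def __init__(self, path):
        # The TensorRT interface base class must be initialized explicitly.
        trt.IStreamReader.__init__(self)
        self._file = open(path, "rb")

    def read(self, size):
        # TensorRT asks for `size` bytes at a time; return whatever we read.
        return self._file.read(size)

    def close(self):
        # Not part of the TensorRT interface, just a cleanup helper.
        self._file.close()
```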
There's an
@pranavm-nvidia As you can see above, StreamReaderV2 has the same issue; see the section on V2 above (Llama-8B-Fp16 engine).
Huh yeah, that is strange. Looking at the implementation, it does indeed request the entire engine in one go. I assume this API was only intended to be used with GPU Direct Storage so that you bypass host memory entirely. @jhalakpatel do you know?
The same engine is fine if I load it via the bytes API, e.g. leading to:
Description
I am trying to optimize the loading of a ~14.2GB tensorrt-llm engine on a node with 16GB of CPU RAM into 16GB of VRAM. Since the rest of my program takes around ~1GB of CPU RAM, there is little room for anything other than streaming the CudaEngine from disk to CUDA.
Upon trying out trt.IStreamReader, the class does not live up to its promise.
Environment
TensorRT Version: 10.7
NVIDIA GPU: H100
```
/baseten/engine-builder/tei_trt# nvidia-smi
Wed Jan 15 23:59:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
```
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10.2
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if so, version):
Relevant Files
Llama-7B engine created with TensorRT-LLM 0.16.0
Steps To Reproduce
Vanilla results
8.4s + peak memory 15524688kB
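The measurement script itself is not shown above. Below is a minimal sketch of how numbers in this shape (wall time plus ru_maxrss in kB) can be collected for the vanilla bytes-based path; the engine path is a placeholder, not the one from the report:

```python
import resource
import time

import tensorrt as trt

ENGINE_PATH = "/path/to/llama_engine.plan"  # placeholder, not the original path

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

start = time.perf_counter()
# Vanilla path: read the whole serialized engine into a Python bytes object,
# then hand it to deserialize_cuda_engine. Peak CPU memory is at least the
# engine size, plus whatever the deserializer needs internally.
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
elapsed = time.perf_counter() - start

# ru_maxrss is the peak resident set size of the process, in kB on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"{elapsed:.1f}s + peak memory {peak_kb}kB")
```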
IStreamReaderV1 loading:
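The stream-reader path can be timed the same way. A hedged sketch, again with a placeholder engine path and a reader of the same shape as the one sketched earlier in this issue:

```python
import resource
import time

import tensorrt as trt

ENGINE_PATH = "/path/to/llama_engine.plan"  # placeholder, not the original path

class FileReader(trt.IStreamReader):
    # Same minimal reader as sketched earlier in this issue.
    def __init__(self, path):
        trt.IStreamReader.__init__(self)
        self._file = open(path, "rb")

    def read(self, size):
        return self._file.read(size)

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

start = time.perf_counter()
# deserialize_cuda_engine also accepts a trt.IStreamReader in TensorRT 10.x,
# which is the code path polygraphy's FileReader goes through.
reader = FileReader(ENGINE_PATH)
engine = runtime.deserialize_cuda_engine(reader)
elapsed = time.perf_counter() - start

# Peak resident set size of the process, in kB on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"{elapsed:.1f}s + peak memory {peak_kb}kB")
```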
Analysis
The memory duplication is likely caused by passing the data from Python to C++, which makes a copy. If the API read the file in smaller chunks, this would not be as bad.
The .read(size) API is called twice with the StreamV1 class, requesting the initial 32 bytes and then the entire rest. A pdb breakpoint delivers no additional info.
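One way to confirm that call pattern without pdb is to log each request from read(); the reader below is illustrative instrumentation, not code from the original report:

```python
import tensorrt as trt

class LoggingFileReader(trt.IStreamReader):
    """Same shape as the reader above, but logs every request TensorRT makes."""

    def __init__(self, path):
        trt.IStreamReader.__init__(self)
        self._file = open(path, "rb")

    def read(self, size):
        data = self._file.read(size)
        # For a ~14.2 GB engine this prints two lines: a 32-byte header read,
        # followed by one request for the entire remainder of the file.
        print(f"read() requested {size} bytes, returned {len(data)} bytes")
        return data
```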
Analysis IStreamReaderV2
IStreamReaderV2 also reads out most of the file in one request. This load actually does fail.
Desired behavior:
Either: … (bytes interface)
Commands or scripts:
Have you tried the latest release?: YES
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): polygraphy / tensorrt_llm