`trt.IStreamReader` (as implemented e.g. in `polygraphy`) requires higher peak CPU memory and more time than a naive Python implementation.

## Description

I am trying to optimize the loading of a ~14.2GB tensorrt-llm engine on a node with 16GB of CPU RAM into 16GB of VRAM. As the rest of my program takes around 1GB of CPU RAM, there is not enough room to hold the whole engine in CPU memory, so the CudaEngine has to be streamed from disk to CUDA.

When I tried `trt.IStreamReader`, the class did not deliver on its promise:
- It is slower than reading the file in Python.
- It requires ~15GB of CPU RAM overhead, instead of ~1GB of CPU RAM with a naive implementation.

## Environment



**TensorRT Version**:

**NVIDIA GPU**: H100

```
/baseten/engine-builder/tei_trt# nvidia-smi
Wed Jan 15 23:59:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
```

**Operating System**: Ubuntu 22.04

**Python Version (if applicable)**: 3.10.2

**PyTorch Version (if applicable)**: 2.5.1

**Baremetal or Container (if so, version)**:


## Relevant Files

Llama-7B engine created with TensorRT-LLM 0.16.0


## Steps To Reproduce
```python
import time
import tensorrt as trt

from pathlib import Path

def FileReaderVanilla(filepath):
    if not Path(filepath).exists():
        raise ValueError(f"File at {filepath} does not exist!")
    with open(filepath, "rb") as f:
        return f.read()


class FileReaderV1(trt.IStreamReader):
    """
    Class that supplies data to TensorRT from a stream. This may help reduce memory usage during deserialization.
    Moves engine file directly to CUDA memory, without loading it into CPU memory first.
    https://github.com/NVIDIA/TensorRT/blob/97ff24489d0ea979c418c7a0847dfc14c8483846/tools/Polygraphy/polygraphy/backend/trt/file_reader.py#L28
    Args:
        filepath (str):
                The path to the serialized file.

    ```python
    # roughly equivalent to:
    if not self.serialize_path.exists():
        raise ValueError(
            f"missing engine at serialize_path={self.serialize_path}"
        )
    with open(self.serialize_path, "rb") as f:
        yield f.read() # stream equivalent
    ```
    """
    def __init__(self, filepath):
        # Must explicitly initialize parent for any trampoline class! Will mysteriously segfault without this.
        trt.IStreamReader.__init__(self)  # type: ignore

        self.filepath = filepath

        if not Path(self.filepath).exists():
            raise ValueError(f"File at {self.filepath} does not exist!")
        self.file = open(self.filepath, "rb")
        
    def read(self, size: int) -> bytes:
        print(f"Reading {size} bytes")
        return self.file.read(size)

    def free(self):
        if self.file:
            self.file.close()

    def __enter__(self):
        # The file is already opened in __init__, so there is nothing else to set up.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.free()

class FileReaderV2(trt.IStreamReaderV2):
    """
    Class that supplies data to TensorRT from a stream, without loading the whole file into memory.
    Moves engine file directly to CUDA memory, without first allocating it all in CPU memory.

    Args:
        file (Path):
            The path to the serialized engine file.
    """
    def __init__(self, file_path):
        trt.IStreamReaderV2.__init__(self)
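        # NOTE: read_bytes() loads the entire engine into CPU memory up front,
        # so this reproduction does not stream from disk despite the docstring.
        self.bytes = Path(file_path).read_bytes()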
        self.len = len(self.bytes)
        self.index = 0

    def read(self, size, cudaStreamPtr):
        
        assert self.index + size <= self.len
        data = self.bytes[self.index:self.index + size]
        self.index += size
        print(f"Reading {size} bytes, actual size: {len(data)}")
        return data

    def seek(self, offset, where):
        print(f" seek position: {offset} {where}")
        if where == trt.SeekPosition.SET:
            self.index = offset
        elif where == trt.SeekPosition.CUR:
            self.index += offset
        elif where == trt.SeekPosition.END:
            self.index = self.len - offset
        else:
            raise ValueError(f"Invalid seek position: {where}")

def init_runtime(reader):
    runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
    engine = runtime.deserialize_cuda_engine(reader)
    assert engine is not None
    return runtime, engine

def debug_max_memory_usage_filereaderv2():
    _ = init_runtime(FileReaderV2("/app/engines/rank0.engine"))
    time.sleep(1)

def debug_max_memory_usage_filereaderv1():
    _ = init_runtime(FileReaderV1("/app/engines/rank0.engine"))
    time.sleep(1)

def debug_max_memory_usage_filereader_vanilla():
    _ = init_runtime(FileReaderVanilla("/app/engines/rank0.engine"))
    time.sleep(1)

if __name__ == "__main__":
    # /usr/bin/time -v poetry run python ./tests/test_runtime_filereader.py
    debug_max_memory_usage_filereaderv2()
```

## Vanilla results
User time 8.40s, peak memory (maximum resident set size) 15524688kB.
```
/usr/bin/time -v poetry run python --vanilla
debug_max_memory_usage_filereader_vanilla()

  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
        Command being timed: "poetry run python ./tests/test_runtime_filereader.py"
        User time (seconds): 8.40
        System time (seconds): 17.13
        Percent of CPU this job got: 109%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.25
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 15524688
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 6318
        Minor (reclaiming a frame) page faults: 3824756
        Voluntary context switches: 53551
        Involuntary context switches: 537
        Swaps: 0
        File system inputs: 0
        File system outputs: 24
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```
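For cross-checking the `Maximum resident set size` line without wrapping the run in `/usr/bin/time -v`, the same counter can be read from inside the process. This is only a minimal sketch, assuming Linux (where `ru_maxrss` is reported in kilobytes); `peak_rss_kib` is a hypothetical helper name and the 64MiB allocation is just a stand-in for loading an engine.

```python
import resource
import sys


def peak_rss_kib() -> int:
    """Return this process's peak resident set size in KiB.

    Assumption: Linux reports ru_maxrss in kilobytes. macOS reports bytes,
    so the value is normalised there.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024
    return peak


if __name__ == "__main__":
    before = peak_rss_kib()
    # Stand-in for deserializing an engine: allocate and fill 64 MiB.
    payload = b"x" * (64 * 1024 * 1024)
    after = peak_rss_kib()
    print(f"peak RSS before: {before} KiB, after: {after} KiB")
```

Calling `peak_rss_kib()` right after `init_runtime(...)` in each of the three `debug_max_memory_usage_*` functions would give one figure per reader, provided each reader runs in a fresh process (the counter never decreases).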

`IStreamReader` (V1) loading:
- User time (seconds): 10.27 (worse than 8.40)
- Maximum resident set size (kbytes): 29217388 (almost double 15524688)
```
/usr/bin/time -v poetry run python --stream
debug_max_memory_usage_filereader()
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
        Command being timed: "poetry run python ./tests/test_runtime_filereader.py"
        User time (seconds): 10.27
        System time (seconds): 22.72
        Percent of CPU this job got: 111%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 29217388
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 6284
        Minor (reclaiming a frame) page faults: 7312826
        Voluntary context switches: 54294
        Involuntary context switches: 538
        Swaps: 0
        File system inputs: 0
        File system outputs: 24
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```


## Analysis
The memory duplication is likely caused by the hand-over of the returned `bytes` object from Python to C++, which makes a copy. If the API requested the data in smaller chunks, this would not be as bad.

With the `IStreamReader` (V1) class, `.read(size)` is called only twice: first for the initial 32 bytes, then for the entire rest of the file.

```
# successful read that needs 29217388kB
reading 32 bytes from /app/engines/rank0.engine
reading 14244750076 bytes from /app/engines/rank0.engine 
```
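A quick back-of-the-envelope check of the figures reported in this issue supports the "one extra copy" reading. The sketch only redoes arithmetic on the logged numbers. It assumes that the `kbytes` printed by `/usr/bin/time` are units of 1024 bytes.

```python
KIB = 1024
GIB = 1024 ** 3

# Figures copied from the logs in this issue.
vanilla_peak_kib = 15_524_688        # max RSS when passing plain bytes
stream_peak_kib = 29_217_388         # max RSS with the IStreamReader (V1) class
engine_bytes = 32 + 14_244_750_076   # sum of the two read() requests

# Extra peak memory of the stream reader compared with the vanilla path.
extra_kib = stream_peak_kib - vanilla_peak_kib
extra_ratio = extra_kib * KIB / engine_bytes

print(f"engine size:       {engine_bytes / GIB:.2f} GiB")
print(f"vanilla peak RSS:  {vanilla_peak_kib * KIB / GIB:.2f} GiB")
print(f"stream peak RSS:   {stream_peak_kib * KIB / GIB:.2f} GiB")
print(f"extra peak RSS:    {extra_kib * KIB / GIB:.2f} GiB")
print(f"extra / engine:    {extra_ratio:.2f}")
```

The extra peak memory comes out at roughly one engine size, which is what a single additional in-memory copy of the second `read()` result would cost.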

A pdb breakpoint inside `read()` delivers no additional information:
```
builder/tei_trt/tests/test_runtime_filereader.py(7)init_runtime()
      6     runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
----> 7     engine = runtime.deserialize_cuda_engine(reader)
      8     assert engine is not None
> /workspace/model-performance/michaelfeil/baseten/engine-builder/tei_trt/trt_tei_runtime/trt_model.py(137)read()
    136         ipdb.set_trace()
--> 137         print(f"reading {size} bytes from {self.filepath}")
    138         return self.file.read(size)
```

## Analysis IStreamReaderV2
`IStreamReaderV2` also requests most of the file in a single read. This variant actually fails with a segmentation fault.
```
 seek position: 0 SeekPosition.SET
 seek position: 0 SeekPosition.SET
Reading 32 bytes, acutal size: 32
Reading 48 bytes, acutal size: 48
 seek position: 80 SeekPosition.SET
Reading 6586564 bytes, acutal size: 6586564
 seek position: 6586648 SeekPosition.SET
Reading 13975421440 bytes, acutal size: 13975421440
Segmentation fault (core dumped)
```
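The `FileReaderV2` in the reproduction keeps the whole file in `self.bytes`. A file-backed variant of the same `read`/`seek` pair could look like the sketch below. It is deliberately a plain class that runs without TensorRT: `Whence` is a hypothetical stand-in for `trt.SeekPosition`, and `FileBackedReader` is an illustrative name. Note that `os.SEEK_END` places the cursor at `end + offset`, which differs from the `self.len - offset` used in the reproduction. Whether serving reads this way changes the segmentation fault has not been verified here, and a single 13975421440-byte `read()` would still return one huge `bytes` object.

```python
import os
import tempfile
from enum import Enum


class Whence(Enum):
    """Hypothetical stand-in for trt.SeekPosition so the sketch runs without TensorRT."""
    SET = os.SEEK_SET
    CUR = os.SEEK_CUR
    END = os.SEEK_END


class FileBackedReader:
    """Serve read/seek requests from an open file instead of an in-memory copy."""

    def __init__(self, file_path):
        self.file = open(file_path, "rb")

    def read(self, size, cuda_stream_ptr=None):
        # cuda_stream_ptr mirrors the second argument in the reproduction; unused here.
        return self.file.read(size)

    def seek(self, offset, where):
        # Delegate to the file object; END means "end + offset" (offset <= 0).
        self.file.seek(offset, where.value)

    def close(self):
        self.file.close()


if __name__ == "__main__":
    # Tiny stand-in file instead of a real engine.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(200)))
    reader = FileBackedReader(tmp.name)
    reader.seek(80, Whence.SET)
    print(reader.read(4))
    reader.close()
    os.remove(tmp.name)
```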


## Desired behavior:
Any of the following would help:
- The C++ side accepts a `read()` that returns fewer bytes than requested and moves the engine plan to the GPU in parts. I could then cap the returned size in Python at e.g. 1GB, and the C++ side would need to keep calling `read()` until it has everything.
- The C++ side exposes an API for setting a maximum read size. Python can set this optional value to control how much the C++ side requests per call.
- The pybind interface / garbage collection on the Python side seems to be unclean, which results in a duplicate in-memory copy. Ideally the value would be passed without a copy, as with the vanilla `bytes` interface.
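
To make the first option concrete, here is a minimal sketch of the contract it asks for, in plain Python and without TensorRT. `CappedReader`, `consume` and the `sink` callback are hypothetical names: `sink` stands in for the copy to GPU memory, and `consume` is the loop the caller (today the C++ side) would have to run. This describes the requested behaviour, not what TensorRT currently does.

```python
import io


class CappedReader:
    """Hypothetical reader whose read() never returns more than max_chunk bytes."""

    def __init__(self, fileobj, max_chunk):
        self.fileobj = fileobj
        self.max_chunk = max_chunk

    def read(self, size):
        # May return fewer bytes than requested (a "short read").
        return self.fileobj.read(min(size, self.max_chunk))


def consume(reader, size, sink):
    """Keep calling read() until `size` bytes have been handed to `sink`."""
    remaining = size
    while remaining > 0:
        chunk = reader.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes still missing")
        sink(chunk)  # stand-in for copying this chunk to the GPU
        remaining -= len(chunk)


if __name__ == "__main__":
    data = bytes(range(256)) * 40  # 10240 bytes as a stand-in for an engine
    chunk_sizes = []
    reader = CappedReader(io.BytesIO(data), max_chunk=4096)
    consume(reader, len(data), lambda chunk: chunk_sizes.append(len(chunk)))
    print(chunk_sizes)  # [4096, 4096, 2048]
```

With such a loop, peak CPU memory on the Python side is bounded by `max_chunk` rather than by the engine size.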

**Commands or scripts**:

**Have you tried [the latest release](https://developer.nvidia.com/tensorrt)?**: YES

**Can this model run on other frameworks?** For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): polygraphy / tensorrt_llm
