Our journey in deploying the report generation models reflects the above discussion. We started out serving our models by deploying the model code and its dependencies, along with the parameter checkpoints, in a custom Docker image exposing a gRPC service interface. However, we soon found it error-prone to replicate the exact code and environment the modeling team had used when estimating the parameters. Moreover, this approach prevented us from leveraging high-performance model serving frameworks like NVIDIA's Triton, which is written in C++ and requires self-contained models that can be executed without a Python interpreter.

At this stage, we faced a choice between exporting our PyTorch models to ONNX or to TorchScript. ONNX is an open specification for representing machine learning models that is seeing increasing adoption, backed by a high-performance runtime developed by Microsoft (ONNX Runtime). Working closely with the ONNX team at Microsoft, we discovered that some operations our models require were not yet supported in ONNX. Consequently, we turned our attention to TorchScript, the mechanism native to PyTorch. Through a combination of tracing and scripting, annotating our code where needed, we obtained self-contained TorchScript models that Triton could serve. This improved our deployment path considerably: we no longer had to worry about code dependencies, and we gained the option of using Triton for high-performance model serving on NVIDIA GPUs.
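
To illustrate, here is a minimal sketch of what such an export might look like, combining `torch.jit.trace` for a module with a static computation graph and `torch.jit.script` for one with data-dependent control flow. The module definitions, tensor shapes, and model-repository paths below are illustrative assumptions rather than our actual model code; only the `torch.jit` APIs and Triton's `<model>/<version>/model.pt` layout reflect their documented usage.

```python
from typing import List

import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Hypothetical CNN encoder with a static graph, so it is safe to trace."""

    def __init__(self, hidden_size: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(3, hidden_size, kernel_size=7, stride=32)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(images)).flatten(1)


class ReportDecoder(nn.Module):
    """Hypothetical decoder with a data-dependent loop, so it must be scripted."""

    def __init__(self, hidden_size: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, max_len: int = 20) -> torch.Tensor:
        # The loop length depends on an input argument; tracing would freeze it,
        # so we annotate types and rely on scripting instead.
        outputs: List[torch.Tensor] = []
        for _ in range(max_len):
            outputs.append(self.proj(features))
        return torch.stack(outputs, dim=1)


encoder = ImageEncoder().eval()
decoder = ReportDecoder().eval()

# Trace the encoder by recording the ops executed on an example input.
example_image = torch.randn(1, 3, 224, 224)
traced_encoder = torch.jit.trace(encoder, example_image)

# Script the decoder by compiling its Python source directly.
scripted_decoder = torch.jit.script(decoder)

# The saved archives are self-contained and can be loaded by Triton's
# libtorch backend without a Python interpreter or the original source tree.
traced_encoder.save("model_repository/encoder/1/model.pt")
scripted_decoder.save("model_repository/decoder/1/model.pt")
```

Tracing is only safe when control flow does not depend on input values, which is why the decoder's generation loop is scripted and type-annotated where TorchScript requires it; the resulting `.pt` archives bundle both code and weights, which is what removes the code-dependency problem described above.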