[TRT EP] Fix logic to reach cache encryption code. #17111

simonjub · 2023-08-11T05:12:22Z

Description

This is a follow-up to PR #15519, which was closed in favor of this one.

Motivation and Context

In the current TRT engine cache implementation, no code path can create an encrypted TRT engine cache when the flags engine_cache_enable and engine_decryption_enable are both true. This was originally raised in issue #12551.

Checking for the presence of the encrypted engine beforehand, through the decrypt function, should fix this.

chilo-ms · 2023-08-11T16:22:25Z

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

chilo-ms · 2023-08-11T16:22:35Z

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

chilo-ms · 2023-08-11T16:22:42Z

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

azure-pipelines · 2023-08-11T16:22:54Z

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines · 2023-08-11T16:22:57Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2023-08-11T16:23:06Z

Azure Pipelines successfully started running 9 pipeline(s).

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc

chilo-ms · 2023-08-14T16:41:05Z

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

chilo-ms · 2023-08-14T16:41:17Z

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

chilo-ms · 2023-08-14T16:41:28Z

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

azure-pipelines · 2023-08-14T16:41:39Z

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines · 2023-08-14T16:41:40Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2023-08-14T16:41:41Z

Azure Pipelines successfully started running 9 pipeline(s).

simonjub · 2023-08-15T17:55:43Z

@chilo-ms I tried to find out why some checks failed but can't figure it out. Please let me know if there's something for me to do. I'd really like this to make it into ORT 1.16 if possible.

chilo-ms · 2023-08-15T18:06:46Z

Sometimes the CIs have intermittent failures. I've restarted the CI to see whether it passes this time.
We will review and discuss the PR and reply soon, thanks.

jywu-msft · 2023-08-17T02:55:44Z

@simonjub thanks a lot for your contribution.
Overall, it looks good. One thought we had was to explicitly name the encrypted engine file with an additional suffix, e.g. *.engine.encrypted.
I think there are several benefits: 1) it's easy to confirm that the encryption code path succeeded, which aids debugging; 2) if you want to build and encrypt the engine offline, you know you're deploying the correct encrypted engine files.
This could enable the optional use case where engines aren't created at runtime (e.g. input shapes are static) and are only created and encrypted offline. (One would never call encrypt at runtime, and the deployed encryption library only needs to include the decrypt function.)
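
The naming suggestion above could be sketched roughly as follows. This is only an illustration and not the EP's code: the helper name `engine_cache_path`, its parameters, and the `.engine` base suffix are assumptions made for the example. Only the idea of an extra `.encrypted` suffix comes from the comment.

```python
from pathlib import Path


def engine_cache_path(cache_dir: str, cache_name: str, encrypted: bool) -> Path:
    """Return the engine cache path, adding a suffix for encrypted engines.

    Hypothetical helper: a plain engine is stored as <name>.engine and an
    encrypted one as <name>.engine.encrypted, so the two never collide.
    """
    path = Path(cache_dir) / f"{cache_name}.engine"
    if encrypted:
        # Append rather than replace the suffix so the file still reads as an engine.
        path = path.with_name(path.name + ".encrypted")
    return path
```

With distinct names, a plain cache left over from an earlier run cannot be mistaken for the encrypted one, and the file listing alone shows which code path produced a cache.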

it would also be nice to add some unit tests that can exercise these paths, so that we can ensure we don't break this new functionality.

simonjub · 2023-08-17T04:54:26Z

@jywu-msft I remember from our discussion in PR #15519 that your internal user for engine encryption already uses a different filename for the encrypted engine cache file and that only their encryption library knows it. If we implement the change that you propose (encrypted engine file explicitly named by the EP), that may break functionality for them, or at the very least force them to rename their existing encrypted engine files to take the new name into account.

You do bring up a good point that it's not desirable to have both the encrypt and decrypt functions deployed. It can't really be avoided with models using dynamic shape inputs, though. It feels like the preferred use case for encryption IS the one when input shapes are static and the engine can be created offline.

I am not familiar with onnxruntime's test suite, but I can look into adding unit tests and a basic encryption library.

jywu-msft · 2023-08-17T14:53:04Z

I think we should go ahead and implement an explicit, different filename for the encrypted engine.

1. The encrypt path has been broken since the original commit in the public onnxruntime repo, as you discovered.
2. Compatibility shouldn't be an issue. The old engine files created by internal users are based on an internal code base and are not compatible with the current version of ORT + TensorRT.

simonjub · 2023-08-17T17:37:27Z

Okay. I started to work on it. In the case where an expected encrypted cache is missing and the deployed library does not have the encrypt function implemented, I'll just completely skip any cache creation. This means every time the session is created, the engine will be rebuilt without saving it to disk until the encrypted engine file is put in by the user. Does that sound good?

chilo-ms · 2023-08-17T18:07:17Z

That sounds fine to me. Just add a warning message so the user knows why cache creation is skipped.

simonjub · 2023-08-22T05:22:17Z

I implemented the separate filename for encrypted engine to have ".encrypted" appended at the end.
I also added a check for encryption function presence in the library before calling it. This allows for the use case where it is not deployed along with decrypt function.
I am second guessing what behavior would be best if the encrypted engine cache file is not present and the encrypt function is not deployed either. Maybe the user would prefer that an exception is thrown to remind them that they need to deploy the encrypted engine. The alternative is a slow rebuild of the engine with each session created which may not be caught by user automated testing.

jywu-msft · 2023-08-22T16:19:14Z

great. thanks! we'll take a look at the new code.
re: behavior if encrypt function isn't deployed and encrypted engine cache isn't present, it does seem reasonable to throw exception to remind user. just to confirm, the following condition is what you're referring to:
decrypt_option enabled + encrypted_engine_path not found + encryption_decrypt_lib_path found + engine_encryption function not found.
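
The condition above can be expressed as a small decision function. This is a sketch under stated assumptions, not the code in the PR: `CacheAction` and `choose_cache_action` are hypothetical names, and the sketch assumes the decryption library was found and exports decrypt. It takes the "throw an exception" option discussed here; the alternative would be to warn and rebuild the engine on every session.

```python
from enum import Enum


class CacheAction(Enum):
    LOAD_ENCRYPTED = "decrypt and deserialize the cached engine"
    BUILD_AND_ENCRYPT = "build the engine, then encrypt and save it"


def choose_cache_action(encrypted_cache_exists: bool, has_encrypt_fn: bool) -> CacheAction:
    """Decide what to do when engine caching and decryption are both enabled.

    Hypothetical sketch: assumes the library exists and exports decrypt.
    """
    if encrypted_cache_exists:
        return CacheAction.LOAD_ENCRYPTED
    if has_encrypt_fn:
        return CacheAction.BUILD_AND_ENCRYPT
    # No encrypted engine on disk and no encrypt function in the library:
    # the cache could never be created, so fail loudly instead of silently
    # rebuilding the engine for every session.
    raise RuntimeError(
        "Encrypted engine cache not found and the library has no encrypt "
        "function; deploy the encrypted engine file."
    )
```

Raising here makes a missing encrypted engine visible at session creation rather than as an unexplained slowdown.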

azure-pipelines · 2023-08-23T16:24:31Z

Azure Pipelines successfully started running 9 pipeline(s).

azure-pipelines · 2023-08-23T16:24:36Z

Azure Pipelines successfully started running 5 pipeline(s).

chilo-ms · 2023-08-23T16:40:59Z

@simonjub thanks!

We discussed this internally and came up with the following table, which lists the cases that need to be handled differently.
Your PR handles most of them. Only the entries marked "options" in the table still need a decision on which behavior is preferred, and we are still waiting for our internal user to share their deployment settings. Once we get a reply, we can proceed to finish this PR.

The following table lists the different cases when the engine cache enable option and the engine decryption enable option are both on:

Note:
“true” means the object/function exists.
“false” means the object/function does not exist.

simonjub · 2023-08-23T17:10:36Z

Great analysis of all possible cases.

I wonder if the decryption_function (false) possibility is likely to happen? Why would someone not always provide this? I can see why for encryption_function.

chilo-ms · 2023-08-23T17:36:05Z

It's highly unlikely to happen.
But it's still possible that the user provides the wrong "engine_decryption_lib_path", or that there is no function called "decrypt" in the decryption library. In either case, the TRT EP needs to handle it properly.

simonjub · 2023-08-23T18:55:02Z

If we agree that it does not make sense to provide a library to the EP without the decrypt function, I can add a check for this at that location you just pointed out and throw an exception if engine_decryption_ is a nullptr. That would change the second and fourth row in your table as an exception would be thrown early whether there is an encrypted cache on disk or not.
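
A minimal sketch of that early check, with decrypt required and encrypt optional. The names here are assumptions for illustration: `resolve_crypto_functions` is hypothetical, the `exported` mapping stands in for the symbols of the library loaded from engine_decryption_lib_path, and the `bytes -> bytes` signature is a simplification rather than the real function signature.

```python
from typing import Callable, Mapping, Optional, Tuple

# Simplified stand-in signature; the real library functions are not modeled here.
CryptoFn = Callable[[bytes], bytes]


def resolve_crypto_functions(
    exported: Mapping[str, CryptoFn],
) -> Tuple[CryptoFn, Optional[CryptoFn]]:
    """Look up decrypt (required) and encrypt (optional) among exported symbols."""
    decrypt = exported.get("decrypt")
    if decrypt is None:
        # Fail early, whether or not an encrypted cache exists on disk.
        raise RuntimeError("decryption library does not export 'decrypt'")
    # encrypt may be deliberately left out of the deployed library.
    return decrypt, exported.get("encrypt")
```

Because the check runs before any cache lookup, a wrong library path or a library without decrypt is reported the same way in every case.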

simonjub · 2023-08-23T22:49:23Z

@jywu-msft I hope I didn't misinterpret your thumbs-up on my last message. I went ahead and added the exception if decrypt function is not implemented in the provided library. I verified that the exception is correctly thrown if the wrong library is provided.

chilo-ms · 2023-08-23T23:18:01Z

agree that it does not make sense to provide a library to the EP without the decrypt function and thanks for the change.

chilo-ms · 2023-08-23T23:19:59Z

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

chilo-ms · 2023-08-23T23:20:26Z

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

azure-pipelines · 2023-08-23T23:20:40Z

Azure Pipelines successfully started running 9 pipeline(s).

chilo-ms · 2023-08-23T23:20:42Z

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

azure-pipelines · 2023-08-23T23:20:51Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2023-08-23T23:20:53Z

Azure Pipelines successfully started running 2 pipeline(s).

chilo-ms · 2023-08-25T17:14:21Z

@simonjub
Please merge main to resolve the Web CI Pipeline failure.
This PR looks good to us and let's merge it first.
We still need to test thoroughly internally in order to decide whether to cherry pick it for ORT 1.16 release. Thanks again for the contribution.

jywu-msft · 2023-08-25T17:33:19Z

just to set expectations, this probably won't make it into the ort 1.16 release.
we did the right thing in taking our time to review this code thoroughly and consult with internal users. we're confident your final implementation is the right approach.
echo'ing @chilo-ms , thanks very much for taking the time to contribute this! we really appreciate it.

jywu-msft · 2023-08-25T23:04:24Z

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

jywu-msft · 2023-08-25T23:04:39Z

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

azure-pipelines · 2023-08-25T23:04:59Z

Azure Pipelines successfully started running 8 pipeline(s).

azure-pipelines · 2023-08-25T23:05:08Z

Azure Pipelines successfully started running 8 pipeline(s).

simonjub · 2023-08-25T23:05:11Z

@chilo-ms @jywu-msft Successfully merged my branch with main just now. Thanks to you as well for taking the time to look into this and accepting the changes. I agree it was worth taking our time to make sure it was done right and the final solution is better than the initial proposal in the previous PR.

It's fine if it doesn't make it into 1.16. Thanks again!

BengtGustafsson · 2024-10-18T07:30:27Z

As I understand this thread, there is now support for engine encryption, but as far as I can tell there is no documentation on how to write an encryption DLL: what its functions should be named, what parameters they take, and what their semantics are.

vadimkantorov · 2024-10-25T12:53:24Z

Also curious: does this support for a custom decryption library exist only for the TRT EP, or for the vanilla ORT EP as well?

simonjub added 2 commits August 11, 2023 00:53

Merge branch 'microsoft:main' into trt_cache_encryption

d39c075

simonjub mentioned this pull request Aug 11, 2023

Fix logic error with TensorRT engine cache decryption #15519

Closed

jywu-msft requested a review from chilo-ms August 11, 2023 16:27

chilo-ms reviewed Aug 11, 2023

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc (resolved)

simonjub added 2 commits August 12, 2023 00:08

Fix compile error.

431834f

Change variable name to better reflect what it is.

6a23f14

simonjub added 2 commits August 22, 2023 01:09

Use distinct engine cache file name when encryption is enabled.

49c69a2

Change cache encryption behavior

34d7057

The decryption library can be distributed without the encrypt function for security reasons. Check if it is present before calling it.

Add exception if decrypt is not present

879ca4f

When the TRT EP has engine encryption flag enabled, it must be provided a path to a library that exports the decrypt function. The EP will now throw an exception if decrypt is not found.

chilo-ms approved these changes Aug 25, 2023

Merge branch 'microsoft:main' into trt_cache_encryption

cd735c8

jywu-msft merged commit 4eedd3b into microsoft:main Aug 27, 2023

vadimkantorov mentioned this pull request Oct 28, 2024

Built-in support for (custom?) decryption of model weights triton-inference-server/onnxruntime_backend#279

Open

vadimkantorov mentioned this pull request Nov 12, 2024

[Feature Request] ONNX model file decryption/custom I/O hooks #22813

Open
