Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TRT EP] Fix logic to reach cache encryption code. #17111

Merged
merged 9 commits into from
Aug 27, 2023
Merged

[TRT EP] Fix logic to reach cache encryption code. #17111

merged 9 commits into from
Aug 27, 2023

Conversation

simonjub
Copy link
Contributor

Description

This is a followup to PR #15519 that is closed in favor of this one.

Motivation and Context

The current implementation of TRT cache has no code execution path possible so that an encrypted TRT engine cache could be created when flags engine_cache_enable and engine_decryption_enable are true. This was originally raised in issue #12551.

…s no code execution path possible so that an encrypted TRT engine cache could be created when flags engine_cache_enable and engine_decryption_enable are true. Checking for the encrypted engine presence beforehand through the decrypt function should fix this.
@chilo-ms
Copy link
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

@chilo-ms
Copy link
Contributor

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@chilo-ms
Copy link
Contributor

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@jywu-msft jywu-msft requested a review from chilo-ms August 11, 2023 16:27
@chilo-ms
Copy link
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

@chilo-ms
Copy link
Contributor

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@chilo-ms
Copy link
Contributor

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@simonjub
Copy link
Contributor Author

@chilo-ms I tried to find out why some checks failed but can't figure it out. Please let me know if there's something for me to do. I'd really like this to make it into ORT 1.16 if possible.

@chilo-ms
Copy link
Contributor

chilo-ms commented Aug 15, 2023

@chilo-ms I tried to find out why some checks failed but can't figure it out. Please let me know if there's something for me to do. I'd really like this to make it into ORT 1.16 if possible.

Sometimes the CIs might have intermittent failures. I restart the CI again to see.
We will be reviewing and discussing it, will reply back soon, thanks.

@jywu-msft
Copy link
Member

@simonjub thanks a lot for your contribution.
overall, it looks good. one thought we had was we can explicitly name the encrypted engine file with an additional suffix. e.g. *.engine.encrypted
i think there are several benefits. 1) it's easy to confirm the encryption code path is successful/aids in debugging 2) if you want to build/encrypt engine offline, you know you're deploying the correct encrypted engine files.
this could enable the optional use case where engines aren't created at runtime (e.g. input shapes are static), they are only created and encrypted offline. (one would never call encrypt at runtime and the deployed encryption library only needs to include the decrypt function)

it would also be nice to add some unit tests that can exercise these paths, so that we can ensure we don't break this new functionality.

@simonjub
Copy link
Contributor Author

@jywu-msft I remember from our discussion in PR #15519 that your internal user for engine encryption already uses a different filename for the encrypted engine cache file and that only their encryption library knows it. If we implement the change that you propose (encrypted engine file explicitely named by the EP), that may break functionality for them, or at the very least force them to rename their existing encrypted engine files to take the new name into account.

You do bring up a good point that it's not desirable to have both the encrypt and decrypt functions deployed. It can't really be avoided with models using dynamic shape inputs, though. It feels like the preferred use case for encryption IS the one when input shapes are static and the engine can be created offline.

I am not familiar with onnxruntime's test suite, but I can look into adding unit tests and a basic encryption library.

@jywu-msft
Copy link
Member

jywu-msft commented Aug 17, 2023

@jywu-msft I remember from our discussion in PR #15519 that your internal user for engine encryption already uses a different filename for the encrypted engine cache file and that only their encryption library knows it. If we implement the change that you propose (encrypted engine file explicitely named by the EP), that may break functionality for them, or at the very least force them to rename their existing encrypted engine files to take the new name into account.

You do bring up a good point that it's not desirable to have both the encrypt and decrypt functions deployed. It can't really be avoided with models using dynamic shape inputs, though. It feels like the preferred use case for encryption IS the one when input shapes are static and the engine can be created offline.

I am not familiar with onnxruntime's test suite, but I can look into adding unit tests and a basic encryption library.

I think we should go ahead and implement explicit different filename for encrypted engine.

  1. the encrypt path has been broken since the original commit in the public onnxruntime repo, as you discovered.
  2. compatibility shouldn't be an issue. the old engine files created by internal users are based on internal code base and are not compatible with current version of ORT + TensorRT.

@simonjub
Copy link
Contributor Author

I think we should go ahead and implement explicit different filename for encrypted engine.

1. the encrypt path has been broken since the original commit in the public onnxruntime repo, as you discovered.

2. compatibility shouldn't be an issue. the old engine files created by internal users are based on internal code base and are not compatible with current version of ORT + TensorRT.

Okay. I started to work on it. In the case where an expected encrypted cache is missing and the deployed library does not have the encrypt function implemented, I'll just completely skip any cache creation. This means every time the session is created, the engine will be rebuilt without saving it to disk until the encrypted engine file is put in by the user. Does that sound good?

@chilo-ms
Copy link
Contributor

I think we should go ahead and implement explicit different filename for encrypted engine.

1. the encrypt path has been broken since the original commit in the public onnxruntime repo, as you discovered.

2. compatibility shouldn't be an issue. the old engine files created by internal users are based on internal code base and are not compatible with current version of ORT + TensorRT.

Okay. I started to work on it. In the case where an expected encrypted cache is missing and the deployed library does not have the encrypt function implemented, I'll just completely skip any cache creation. This means every time the session is created, the engine will be rebuilt without saving it to disk until the encrypted engine file is put in by the user. Does that sound good?

I think it's okay for me. Just add a warning message to let user know why we skip cache creation.

The decryption library can be distributed without the encrypt function
for security reasons. Check if it is present before calling it.
@simonjub
Copy link
Contributor Author

I implemented the separate filename for encrypted engine to have ".encrypted" appended at the end.
I also added a check for encryption function presence in the library before calling it. This allows for the use case where it is not deployed along with decrypt function.
I am second guessing what behavior would be best if the encrypted engine cache file is not present and the encrypt function is not deployed either. Maybe the user would prefer that an exception is thrown to remind them that they need to deploy the encrypted engine. The alternative is a slow rebuild of the engine with each session created which may not be caught by user automated testing.

@jywu-msft
Copy link
Member

jywu-msft commented Aug 22, 2023

I implemented the separate filename for encrypted engine to have ".encrypted" appended at the end. I also added a check for encryption function presence in the library before calling it. This allows for the use case where it is not deployed along with decrypt function. I am second guessing what behavior would be best if the encrypted engine cache file is not present and the encrypt function is not deployed either. Maybe the user would prefer that an exception is thrown to remind them that they need to deploy the encrypted engine. The alternative is a slow rebuild of the engine with each session created which may not be caught by user automated testing.

great. thanks! we'll take a look at the new code.
re: behavior if encrypt function isn't deployed and encrypted engine cache isn't present, it does seem reasonable to throw exception to remind user. just to confirm, the following condition is what you're referring to:
decrypt_option enabled + encrypted_engine_path not found + encryption_decrypt_lib_path found + engine_encryption function not found.

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@chilo-ms
Copy link
Contributor

@simonjub thanks!

We discussed internally and came up with following table which lists different cases that need to be handled differently.
Your PR handles most of the cases and only the "options" in the table we need to decide which is preferred and we are still waiting for our internal user to share their deployment setting. Once we get reply we can proceed to finish this PR.

Following table lists the different cases under the situation where engine cache enable option and engine decryption enable option are both on:
image
Note:
“true” means the object/function is existed.
“false” means the object/function does not exist.

@simonjub
Copy link
Contributor Author

Great analysis of all possible cases.

I wonder if the decryption_function (false) possibility is likely to happen? Why would someone not always provide this? I can see why for encryption_function.

@chilo-ms
Copy link
Contributor

I wonder if the decryption_function (false) possibility is likely to happen? Why would someone not always provide this? I can see why for encryption_function.

It's highly unlikely to happen.
But still it's possible that user provides the wrong "engine_decryption_lib_path" or there is no function called "decrypt" in the decryption library. In either case, TRT EP needs to handle them properly.

@simonjub
Copy link
Contributor Author

It's highly unlikely to happen. But still it's possible that user provides the wrong "engine_decryption_lib_path" or there is no function called "decrypt" in the decryption library. In either case, TRT EP needs to handle them properly.

If we agree that it does not make sense to provide a library to the EP without the decrypt function, I can add a check for this at that location you just pointed out and throw an exception if engine_decryption_ is a nullptr. That would change the second and fourth row in your table as an exception would be thrown early whether there is an encrypted cache on disk or not.

When the TRT EP has engine encryption flag enabled, it must
be provided a path to a library that exports the decrypt function.
The EP will now throw an exception if decrypt is not found.
@simonjub
Copy link
Contributor Author

If we agree that it does not make sense to provide a library to the EP without the decrypt function, I can add a check for this at that location you just pointed out and throw an exception if engine_decryption_ is a nullptr. That would change the second and fourth row in your table as an exception would be thrown early whether there is an encrypted cache on disk or not.

@jywu-msft I hope I didn't misinterpret your thumbs-up on my last message. I went ahead and added the exception if decrypt function is not implemented in the provided library. I verified that the exception is correctly thrown if the wrong library is provided.

@chilo-ms
Copy link
Contributor

If we agree that it does not make sense to provide a library to the EP without the decrypt function, I can add a check for this at that location you just pointed out and throw an exception if engine_decryption_ is a nullptr. That would change the second and fourth row in your table as an exception would be thrown early whether there is an encrypted cache on disk or not.

@jywu-msft I hope I didn't misinterpret your thumbs-up on my last message. I went ahead and added the exception if decrypt function is not implemented in the provided library. I verified that the exception is correctly thrown if the wrong library is provided.

agree that it does not make sense to provide a library to the EP without the decrypt function and thanks for the change.

@chilo-ms
Copy link
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

@chilo-ms
Copy link
Contributor

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 9 pipeline(s).

@chilo-ms
Copy link
Contributor

/azp run Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@chilo-ms
Copy link
Contributor

chilo-ms commented Aug 25, 2023

@simonjub
Please merge main to resolve the Web CI Pipeline failure.
This PR looks good to us and let's merge it first.
We still need to test thoroughly internally in order to decide whether to cherry pick it for ORT 1.16 release. Thanks again for the contribution.

@jywu-msft
Copy link
Member

@simonjub Please merge main to resolve the Web CI Pipeline failure. This PR looks good to us and let's merge it first. We still need to test thoroughly internally in order to decide whether to cherry pick it for ORT 1.16 release. Thanks again for the contribution.

just to set expectations, this probably won't make it into the ort 1.16 release.
we did the right thing in taking our time to review this code thoroughly and consult with internal users. we're confident your final implementation is the right approach.
echo'ing @chilo-ms , thanks very much for taking the time to contribute this! we really appreciate it.

@jywu-msft
Copy link
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

@jywu-msft
Copy link
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@simonjub
Copy link
Contributor Author

@chilo-ms @jywu-msft Successfully merged my branch with main just now. Thanks to you as well for taking the time to look into this and accepting the changes. I agree it was worth taking our time to make sure it was done right and the final solution is better than the initial proposal in the previous PR.

It's fine if it doesn't make it into 1.16. Thanks again!

@jywu-msft jywu-msft merged commit 4eedd3b into microsoft:main Aug 27, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
### Description
This is a followup to PR microsoft#15519 that is closed in favor of this one.


### Motivation and Context
The current implementation of TRT cache has no code execution path
possible so that an encrypted TRT engine cache could be created when
flags engine_cache_enable and engine_decryption_enable are true. This
was originally raised in issue microsoft#12551.
@BengtGustafsson
Copy link
Contributor

As I understand this thread there is now support for engine encryption but as far as I can tell there is no documentation for how to write an encryption dll. What its function's names should be, what parameters they have and its semantics.

@vadimkantorov
Copy link

Also curious, if this support for a custom decryption library exists only for TRT EP or for vanilla ORT EP as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants