- Builds the sample, compiled against an ONNX Runtime build with support for the Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK).

- The sample uses the QNN EP to:

- a. run the float32 model on the QNN CPU backend.

- b. run the QDQ model on the HTP backend with qnn_context_cache_enable=1, and generate the ONNX model which has the QNN context binary embedded.

- c. run the QNN context binary model generated by ONNX Runtime (previous step) on the HTP backend, to improve model initialization time and reduce memory overhead.

- d. run the QNN context binary model generated by the QNN tool chain on the HTP backend, to support models generated from the native QNN tool chain.

- The sample downloads the mobilenetv2 model from the ONNX model zoo and uses mobilenetv2_helper.py to quantize the float32 model to a QDQ model, which is required for the HTP backend.
- a. Set qnn_context_cache_enable to 1 and run with the QDQ model.

- b. The first run generates the context binary model (the default file name is model_file_name.onnx_qnn_ctx.onnx if qnn_context_cache_path is not set).

- c. Use the generated context binary model (mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx) for inference going forward. (The QDQ model is no longer needed, and qnn_context_cache_enable does not need to be set.)

- Notes: The QNN context binary is embedded within the ONNX model by default. Alternatively, set the QNN EP session option qnn_context_embed_mode to 0 to generate the QNN context binary as a separate file and embed the file's relative path in the ONNX model. This is necessary if the QNN context binary size exceeds protobuf's 2GB limit. A minimal C++ sketch of configuring these session options is shown after the code block below.

- Offline preparation is also supported: generate the QNN context binary on an x64 machine and run it on a Qualcomm ARM64 device.
```
# Build qnn_ep_sample for x64.
MSBuild.exe .\qnn_ep_sample.sln /property:Configuration=Release /p:Platform="x64"
# Run qnn_ep_sample.exe on x64. It only creates the ONNX Runtime session with the QDQ model to generate the QNN context binary.
# There is no need to run the model.
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx
```
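For reference, here is a minimal C++ sketch (not part of the sample) of how the session options above can be passed to the QNN EP through SessionOptions::AppendExecutionProvider. The backend_path option and the Qnn*.dll library names are assumptions based on the QNN EP documentation; check them against the ONNX Runtime version you build with, and see main.cpp for the sample's actual session setup.

```
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_ep_sample");
  Ort::SessionOptions session_options;

  // QNN EP provider options. The qnn_context_* names are the ones used in this README;
  // backend_path (assumed name) selects the QNN backend library to load.
  std::unordered_map<std::string, std::string> qnn_options;
  qnn_options["backend_path"] = "QnnHtp.dll";     // assumed HTP backend library; "QnnCpu.dll" for the QNN CPU backend
  qnn_options["qnn_context_cache_enable"] = "1";  // first run dumps model_file_name.onnx_qnn_ctx.onnx
  // qnn_options["qnn_context_cache_path"] = "mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx";
  // qnn_options["qnn_context_embed_mode"] = "0"; // write the context binary as a separate file instead of embedding it

  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Creating the session compiles the model for the selected backend and,
  // with qnn_context_cache_enable=1, generates the context binary model.
  Ort::Session session(env, ORT_TSTR("mobilenetv2-12_quant_shape.onnx"), session_options);
  return 0;
}
```

Once the *_qnn_ctx.onnx model exists, create the session with that model instead (and without qnn_context_cache_enable) to load the prebuilt QNN context directly, which is what gives the faster initialization described above.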
- Option 2: Use a context binary generated by the native QNN tool chain:

- The sample also demonstrates how to create an ONNX model file from the QNN-generated context binary file libmobilenetv2-12.serialized.bin, to better support migrating customer applications from native QNN to the ONNX Runtime QNN EP. A script [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) is provided to generate an ONNX model from a QNN-generated context binary file. It requires the QNN-generated context binary file libmobilenetv2-12.serialized.bin and the pre-converted QNN mobilenetv2-12_net.json.

- python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json

- c. Create an ONNX Runtime session with the model generated in step b.

- d. Run the model with quantized input data. The output also needs to be dequantized, because QNN quantized models use quantized data types for model inputs and outputs. For more details, refer to QuantizedData & DequantizedData in [main.cpp](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/QNN_EP/mobilenetv2_classification/main.cpp). Also note that the input image is in NHWC layout for the QNN-converted model. A sketch of the quantize/dequantize math is shown after this list.

- Notes: Call gen_qnn_ctx_onnx_model.py with --disable_embed_mode to generate the ONNX model with the relative path to the QNN context binary file. This is necessary if the QNN context binary size exceeds protobuf's 2GB limit.
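For illustration, below is a minimal C++ sketch of the affine quantize/dequantize math involved in step d. The helper names QuantizeInput and DequantizeOutput are hypothetical; the sample's actual implementation is QuantizedData & DequantizedData in main.cpp, and the scale and zero-point values must match the ones stored in the QNN-converted model.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: affine quantization q = round(x / scale) + zero_point, clamped to uint8.
std::vector<uint8_t> QuantizeInput(const std::vector<float>& data, float scale, int32_t zero_point) {
  std::vector<uint8_t> out(data.size());
  for (size_t i = 0; i < data.size(); ++i) {
    int32_t q = static_cast<int32_t>(std::lround(data[i] / scale)) + zero_point;
    out[i] = static_cast<uint8_t>(std::clamp(q, 0, 255));
  }
  return out;
}

// Hypothetical helper: the inverse mapping x = (q - zero_point) * scale.
std::vector<float> DequantizeOutput(const std::vector<uint8_t>& data, float scale, int32_t zero_point) {
  std::vector<float> out(data.size());
  for (size_t i = 0; i < data.size(); ++i) {
    out[i] = (static_cast<int32_t>(data[i]) - zero_point) * scale;
  }
  return out;
}
```

For example, with scale=0.5 and zero_point=0, an input value of 1.0 quantizes to q=2 and dequantizes back to 1.0.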
- More info on QNN EP - https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html

qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw
Result:
position=281, classification=n02123045 tabby, tabby cat, probability=13.663178

REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw

Result:
position=281, classification=n02123045 tabby, tabby cat, probability=13.637316

REM load mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx which has the QNN context binary embedded
REM This does not have to be run on a real device with HTP; it can also be done on an x64 platform, since offline generation is supported
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

Onnx model with QNN context binary is generated.

REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw