
Commit 2511881

Update Qnn EP example code to reflect the QNN cache improvement in Ort (#309)
* Update Qnn EP example code to reflect the QNN cache improvement in Ort microsoft/onnxruntime#17757
1 parent 5c47d50 commit 2511881

File tree: 3 files changed (+95, -75 lines)

c_cxx/QNN_EP/mobilenetv2_classification/README.md

Lines changed: 28 additions & 12 deletions
@@ -1,17 +1,29 @@
 ## About
-- Builds the sample compiled against the ONNX Runtime built with support for Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK)
-- The sample uses the QNN EP, run with Qnn CPU banckend and HTP backend
+- Builds the sample compiled against the ONNX Runtime built with support for Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK).
+- The sample uses the QNN EP to:
+- a. run the float32 model on the Qnn CPU backend.
+- b. run the QDQ model on the HTP backend with qnn_context_cache_enable=1, which generates the Onnx model with the QNN context binary embedded.
+- c. run the QNN context binary model generated from ONNX Runtime (previous step) on the HTP backend, to improve the model initialization time and reduce memory overhead.
+- d. run the QNN context binary model generated from the QNN tool chain on the HTP backend, to support models generated from the native QNN tool chain.
 - The sample downloads the mobilenetv2 model from Onnx model zoo, and use mobilenetv2_helper.py to quantize the float32 model to QDQ model which is required for HTP backend
 - The sample is targeted to run on QC ARM64 device.
 - There are 2 ways to improve the session creation time by using of QNN context binary:
-- Option 1: Use contexty binary generated by OnnxRuntime QNN EP. OnnxRuntime QNN EP use QNN API to generate the QNN context binary, and also dumps some metadata (model name, version, graph meta id, etc) to identify the model. You can just simply set qnn_context_cache_enable to 1 and run with QDQ model. The first run will generate the context binary (Default file name is model_file_name.onnx.bin if qnn_context_cache_path is not set.). The model run afterwards will load from the generated context binary file.
+- Option 1: Use the context binary generated by OnnxRuntime QNN EP. OnnxRuntime QNN EP uses the QNN API to generate the QNN context binary, and also dumps some metadata (model name, version, graph meta id, etc.) to identify the model.
+- a. Set qnn_context_cache_enable to 1 and run with the QDQ model.
+- b. The first run will generate the context binary model (default file name is model_file_name.onnx_qnn_ctx.onnx if qnn_context_cache_path is not set).
+- c. Use the generated context binary model (mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx) for inference going forward. (The QDQ model is no longer needed, and there is no need to set qnn_context_cache_enable.)
+
 - Option 2: Use context binary generated by native QNN tool chain:
-- The sample also demonstrates the feature to load from QNN generated context binary file libmobilenetv2-12.serialized.bin to better support customer application migration from native QNN to OnnxRuntime QNN EP. Because QNN converted model use channel last layout, and the quantized model use INT8/UINT8 as model input & output. A script add_trans_cast.py is provided to update the orignial Onnx model by insert Cast and Transpose node to make the model input & output align with QNN converted model. It requires the Onnx float32 model and pre-converted QNN model_net.json.
-- The sample provides mobilenetv2-12_net.json and context binary file libmobilenetv2-12.serialized.bin as exmple (generated by QNN version 2.10). Please follow QNN document to gnereated QNN model_net.json and the context binary file.
+- The sample also demonstrates the feature to create an Onnx model file from the QNN generated context binary file libmobilenetv2-12.serialized.bin, to better support customer application migration from native QNN to OnnxRuntime QNN EP. A script [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) is provided to generate an Onnx model from the QNN generated context binary file. It requires the QNN generated context binary file libmobilenetv2-12.serialized.bin and the pre-converted QNN mobilenetv2-12_net.json.
+- a. Convert the model to a QNN model and generate the QNN context binary file. The sample provides mobilenetv2-12_net.json and the context binary file libmobilenetv2-12.serialized.bin as an example (generated by QNN version 2.10). Please follow the QNN documentation to generate the QNN model_net.json and the context binary file.
 - Example command used:
 - qnn-onnx-converter --input_list ./input.txt --input_network ./mobilenetv2-12.onnx --output_path ./mobilenetv2-12.cpp -b 1 --bias_bw 32
 - qnn-model-lib-generator -c ./mobilenetv2-12.cpp -b ./mobilenetv2-12.bin -o ./mobilenetv2_classification/qnn_lib
 - qnn-context-binary-generator --backend ${QNN_SDK_ROOT}/target/x86_64-linux-clang/lib/libQnnHtp.so --model ./mobilenetv2_classification/qnn_lib/x86_64-linux-clang/libmobilenetv2-12.so --binary_file libmobilenetv2-12.serialized
+- b. Create an Onnx model file from the QNN generated context binary file libmobilenetv2-12.serialized.bin. For more details, refer to run_qnn_ep_sample.bat & [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py)
+- python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json
+- c. Create an ONNX Runtime session with the model generated in step b.
+- d. Run the model with quantized input data. The output also needs to be dequantized, because the QNN quantized model uses quantized data types for model inputs & outputs. For more details, refer to QuantizedData & DequantizedData in [main.cpp](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/QNN_EP/mobilenetv2_classification/main.cpp). Also, the input image is in NHWC layout for the QNN converted model.

 - More info on QNN EP - https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html

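Option 1 above boils down to a couple of provider options on the session. Below is a minimal C API sketch, condensed from main.cpp in this commit; the function name is illustrative only, it assumes a Windows build with QnnHtp.dll available, and the CheckStatus error handling used by the real sample is omitted.

```cpp
// Sketch of Option 1: run the QDQ model once with the context cache enabled so the
// QNN EP dumps mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx (assumption: Windows build).
#include <onnxruntime_c_api.h>
#include <vector>

void option1_generate_ctx_model(OrtEnv* env) {
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  OrtSessionOptions* session_options = nullptr;
  g_ort->CreateSessionOptions(&session_options);  // status checks elided for brevity

  // backend_path selects the QNN backend; qnn_context_cache_enable=1 asks the EP to
  // generate the *_qnn_ctx.onnx model on the first run (names taken from this commit).
  std::vector<const char*> keys = {"backend_path", "qnn_context_cache_enable"};
  std::vector<const char*> values = {"QnnHtp.dll", "1"};
  g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN",
                                               keys.data(), values.data(), keys.size());

  OrtSession* session = nullptr;
  g_ort->CreateSession(env, L"mobilenetv2-12_quant_shape.onnx", session_options, &session);
  // Later runs can create the session directly from the generated *_qnn_ctx.onnx model,
  // with only backend_path set (step c above).

  g_ort->ReleaseSession(session);
  g_ort->ReleaseSessionOptions(session_options);
}
```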
@@ -38,22 +50,26 @@ Example (Src): run_qnn_ep_sample.bat C:\src\onnxruntime C:\src\onnxruntime\build
 ## Example run result
 ```
 ...
-REM run with QNN CPU backend
-qnn_ep_sample.exe --cpu kitten_input.raw
+REM run mobilenetv2-12_shape.onnx with QNN CPU backend
+qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw

 Result:
-position=281, classification=n02123045 tabby, tabby cat, probability=13.663173
+position=281, classification=n02123045 tabby, tabby cat, probability=13.663178

-REM run with QNN HTP backend
-qnn_ep_sample.exe --htp kitten_input.raw
+REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

 Result:
 position=281, classification=n02123045 tabby, tabby cat, probability=13.637316

-REM run with QNN HTP backend using the QNN generated context binary file
-qnn_ep_sample.exe --qnn kitten_input_nhwc.raw
+REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw

 Result:
+position=281, classification=n02123045 tabby, tabby cat, probability=13.637316
+
+REM run mobilenetv2-12_net_qnn_ctx.onnx (generated from native QNN) with QNN HTP backend
+qnn_ep_sample.exe --qnn mobilenetv2-12_net_qnn_ctx.onnx kitten_input_nhwc.raw
 position=281, classification=n02123045 tabby, tabby cat, probability=13.637315
 ...
 ```
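Step d of Option 2 in the README above feeds the QNN-generated model with quantized data and dequantizes its output. As a rough, hedged illustration, here is standard affine (de)quantization using the offset/scale values that main.cpp passes for this sample; the helper names below are made up, and the exact rounding/clamping behaviour of the sample's QuantizedData/DequantizedData helpers is an assumption, so treat this as illustrative rather than a drop-in copy.

```cpp
// Hedged sketch of the uint8 <-> float conversion around the QNN generated model.
// Values from main.cpp: input offset -116, scale 0.015875209; output offset -86, scale 0.08069417.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// float -> uint8: q = clamp(round(x / scale) - offset, 0, 255)   (assumed rounding/clamping)
std::vector<uint8_t> QuantizeToUint8(const std::vector<float>& in, int32_t offset, float scale) {
  std::vector<uint8_t> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) {
    float q = std::round(in[i] / scale) - static_cast<float>(offset);
    out[i] = static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
  }
  return out;
}

// uint8 -> float: x = scale * (q + offset)
std::vector<float> DequantizeToFloat(const std::vector<uint8_t>& in, int32_t offset, float scale) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) {
    out[i] = scale * (static_cast<float>(in[i]) + static_cast<float>(offset));
  }
  return out;
}
```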

c_cxx/QNN_EP/mobilenetv2_classification/main.cpp

Lines changed: 50 additions & 46 deletions
@@ -63,59 +63,39 @@ void DequantizedData(float* out, const T_QuantType* in, int32_t offset, float sc
   }
 }

-void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn_native_context_binary = "") {
-#ifdef _WIN32
-  const wchar_t* model_path = nullptr;
-  if (backend == "QnnCpu.dll") {
-    model_path = L"mobilenetv2-12_shape.onnx";
-  } else {
-    model_path = L"mobilenetv2-12_quant_shape.onnx";
-  }
-#else
-  char* model_path = nullptr;
-  if (backend == "QnnCpu.so") {
-    model_path = "mobilenetv2-12_shape.onnx";
-  } else {
-    model_path = "mobilenetv2-12_quant_shape.onnx";
-  }
-#endif
+void run_ort_qnn_ep(const std::string& backend, const std::string& model_path, const std::string& input_path,
+                    bool generated_from_native_qnn, bool generate_ctx) {
+  std::wstring model_path_wstr = std::wstring(model_path.begin(), model_path.end());

   const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
   OrtEnv* env;
-  CheckStatus(g_ort, g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env));  // Can set to ORT_LOGGING_LEVEL_INFO or ORT_LOGGING_LEVEL_VERBOSE for more info
+  // Can set to ORT_LOGGING_LEVEL_INFO or ORT_LOGGING_LEVEL_VERBOSE for more info
+  CheckStatus(g_ort, g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env));

   OrtSessionOptions* session_options;
   CheckStatus(g_ort, g_ort->CreateSessionOptions(&session_options));
   CheckStatus(g_ort, g_ort->SetIntraOpNumThreads(session_options, 1));
   CheckStatus(g_ort, g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_BASIC));

-  // You can also set qnn_context_cache_enable to 1 to improve session creation cost
-  // More option details refers tohttps://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html
+  // More option details refers to https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html
   std::vector<const char*> options_keys = {"backend_path"};
   std::vector<const char*> options_values = {backend.c_str()};

-  bool run_with_qnn_native_ctx_binary = false;
-  // Load from offline prepared QNN context binary file. This one is generated by native QNN tool chain
-  if (!qnn_native_context_binary.empty()) {
+  // If it runs from a QDQ model on HTP backend
+  // It will generate an Onnx model with Qnn context binary.
+  // The context binary can be embedded inside the model in EPContext->ep_cache_context (by default),
+  // or the context binary can be a separate .bin file, with relative path set in EPContext->ep_cache_context (qnn_context_embed_mode = 0)
+  if (generate_ctx) {
     options_keys.push_back("qnn_context_cache_enable");
     options_values.push_back("1");
-    // qnn_context_cache_path -- Specify the context binary. This one is generated by QNN toolchain
-    // If not specifid, OnnxRuntime QNN EP will generate the context binary file from QDQ model at the frst run and load from it next run
-    options_keys.push_back("qnn_context_cache_path");
-    options_values.push_back("libmobilenetv2-12.serialized.bin");
-    run_with_qnn_native_ctx_binary = true;
-
-#ifdef _WIN32
-    model_path = L"mobilenetv2-12_shape_add_trans.onnx";
-#else
-    model_path = "mobilenetv2-12_shape_add_trans.onnx";
-#endif
   }
+  // qnn_context_cache_path -- you can specify the path and file name as you want
+  // If not specified, OnnxRuntime QNN EP will generate it at [model_path]_qnn_ctx.onnx

   CheckStatus(g_ort, g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN", options_keys.data(),
                                                                   options_values.data(), options_keys.size()));
   OrtSession* session;
-  CheckStatus(g_ort, g_ort->CreateSession(env, model_path, session_options, &session));
+  CheckStatus(g_ort, g_ort->CreateSession(env, model_path_wstr.c_str(), session_options, &session));

   OrtAllocator* allocator;
   CheckStatus(g_ort, g_ort->GetAllocatorWithDefaultOptions(&allocator));
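The new comments in this hunk note that the generated context binary can either be embedded in the *_qnn_ctx.onnx model (the default) or kept as a separate .bin file referenced from EPContext->ep_cache_context. A hedged sketch of how those provider options might be combined, based only on the option names mentioned in this commit; verify the exact behaviour against the QNN EP documentation for your ONNX Runtime version.

```cpp
// Variation of the generate_ctx path above: keep the QNN context binary in a
// separate .bin file instead of embedding it in the generated Onnx model.
// Option names are taken from the comments added in this commit (assumption).
#include <onnxruntime_c_api.h>
#include <vector>

void append_qnn_ep_with_external_ctx_binary(const OrtApi* g_ort, OrtSessionOptions* session_options) {
  std::vector<const char*> keys = {"backend_path",
                                   "qnn_context_cache_enable",
                                   "qnn_context_cache_path",   // optional: output path for the generated model
                                   "qnn_context_embed_mode"};  // 0 = write the binary as a separate file
  std::vector<const char*> values = {"QnnHtp.dll", "1",
                                     "mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx", "0"};
  g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN",
                                               keys.data(), values.data(), keys.size());
}
```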
@@ -203,13 +183,14 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn
   input_raw_file.read(reinterpret_cast<char*>(&input_data[0]), num_elements * sizeof(float));

   CheckStatus(g_ort, g_ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &memory_info));
-  if (run_with_qnn_native_ctx_binary) {
+  // QNN native tool chain generated quantized model use quantized data as inputs & outputs
+  if (generated_from_native_qnn) {
     size_t input_data_length = input_data_size * sizeof(uint8_t);
     QuantizedData(quantized_input_data.data(), input_data.data(), -116, 0.015875209f, input_data_size);
     CheckStatus(g_ort, g_ort->CreateTensorWithDataAsOrtValue(
         memory_info, reinterpret_cast<void*>(quantized_input_data.data()), input_data_length,
         input_node_dims[0].data(), input_node_dims[0].size(), input_types[0], &input_tensors[0]));
-  } else {
+  } else {  // Ort generate QDQ model still use float32 data as inputs & outputs
     size_t input_data_length = input_data_size * sizeof(float);
     CheckStatus(g_ort, g_ort->CreateTensorWithDataAsOrtValue(
         memory_info, reinterpret_cast<void*>(input_data.data()), input_data_length,
@@ -226,7 +207,7 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn
   void* output_buffer;
   CheckStatus(g_ort, g_ort->GetTensorMutableData(output_tensors[0], &output_buffer));
   float* float_buffer = nullptr;
-  if (run_with_qnn_native_ctx_binary) {
+  if (generated_from_native_qnn) {
     uint8_t* buffer = reinterpret_cast<uint8_t*>(output_buffer);
     DequantizedData(output_data.data(), buffer, -86, 0.08069417f, output_data_size);
     float_buffer = output_data.data();
@@ -252,39 +233,62 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn

 void PrintHelp() {
   std::cout << "To run the sample, use the following command:" << std::endl;
-  std::cout << "Example: ./qnn_ep_sample --cpu <path_to_raw_input>" << std::endl;
-  std::cout << "To Run with QNN CPU backend. Example: ./qnn_ep_sample --cpu kitten_input.raw" << std::endl;
-  std::cout << "To Run with QNN HTP backend. Example: ./qnn_ep_sample --htp kitten_input.raw" << std::endl;
-  std::cout << "To Run with QNN native context binary on QNN HTP backend . Example: ./qnn_ep_sample --qnn kitten_input_nhwc.raw" << std::endl;
+  std::cout << "Example: ./qnn_ep_sample --cpu <model_path> <path_to_raw_input>" << std::endl;
+  std::cout << "To Run with QNN CPU backend. Example: ./qnn_ep_sample --cpu mobilenetv2-12_shape.onnx kitten_input.raw" << std::endl;
+  std::cout << "To Run with QNN HTP backend. Example: ./qnn_ep_sample --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw" << std::endl;
+  std::cout << "To Run with QNN HTP backend and generate Qnn context binary model. Example: ./qnn_ep_sample --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx" << std::endl;
+  std::cout << "To Run with QNN native context binary on QNN HTP backend . Example: ./qnn_ep_sample --qnn qnn_native_ctx_binary.onnx kitten_input_nhwc.raw" << std::endl;
 }

 constexpr const char* CPUBACKEDN = "--cpu";
 constexpr const char* HTPBACKEDN = "--htp";
 constexpr const char* QNNCTXBINARY = "--qnn";
+constexpr const char* GENERATE_CTX = "--gen_ctx";

 int main(int argc, char* argv[]) {

-  if (argc != 3) {
+  if (argc != 4 && argc != 5) {
     PrintHelp();
     return 1;
   }

+  bool generate_ctx = false;
+  if (argc == 5) {
+    if (strcmp(argv[4], GENERATE_CTX) == 0) {
+      generate_ctx = true;
+    } else {
+      std::cout << "The expected last parameter is --gen_ctx." << std::endl;
+      PrintHelp();
+      return 1;
+    }
+  }
+
   std::string backend = "";
-  std::string qnn_native_context_binary = "";
+  bool generated_from_native_qnn = false;
   if (strcmp(argv[1], CPUBACKEDN) == 0) {
     backend = "QnnCpu.dll";
+    if (generate_ctx) {
+      std::cout << "--gen_ctx won't work with CPU backend." << std::endl;
+      return 1;
+    }
   } else if (strcmp(argv[1], HTPBACKEDN) == 0) {
     backend = "QnnHtp.dll";
   } else if (strcmp(argv[1], QNNCTXBINARY) == 0) {
     backend = "QnnHtp.dll";
-    qnn_native_context_binary = "libmobilenetv2-12.serialized.bin";
+    generated_from_native_qnn = true;
+    if (generate_ctx) {
+      std::cout << "--gen_ctx won't work with --qnn." << std::endl;
+      return 1;
+    }
   } else {
     std::cout << "This sample only support option cpu, htp, qnn." << std::endl;
     PrintHelp();
     return 1;
   }
-  std::string input_path(argv[2]);

-  run_ort_qnn_ep(backend, input_path, qnn_native_context_binary);
+  std::string model_path(argv[2]);
+  std::string input_path(argv[3]);
+
+  run_ort_qnn_ep(backend, model_path, input_path, generated_from_native_qnn, generate_ctx);
   return 0;
 }

c_cxx/QNN_EP/mobilenetv2_classification/run_qnn_ep_sample.bat

Lines changed: 17 additions & 17 deletions
@@ -51,16 +51,14 @@ IF NOT EXIST mobilenetv2-12_shape.onnx (
 )

 REM Download add_trans_cast.py file
-set ADD_TRANS_CAST_SCRIPT_URL="https://raw.githubusercontent.com/microsoft/onnxruntime/main/onnxruntime/python/tools/qnn/add_trans_cast.py"
-set ADD_TRANS_CAST_SCRIPT="add_trans_cast.py"
-IF NOT EXIST %ADD_TRANS_CAST_SCRIPT% (
-powershell -Command "Invoke-WebRequest %ADD_TRANS_CAST_SCRIPT_URL% -Outfile %ADD_TRANS_CAST_SCRIPT%" )
+set QNN_CTX_ONNX_GEN_SCRIPT_URL="https://raw.githubusercontent.com/microsoft/onnxruntime/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py"
+set QNN_CTX_ONNX_GEN_SCRIPT="gen_qnn_ctx_onnx_model.py"
+IF NOT EXIST %QNN_CTX_ONNX_GEN_SCRIPT% (
+powershell -Command "Invoke-WebRequest %QNN_CTX_ONNX_GEN_SCRIPT_URL% -Outfile %QNN_CTX_ONNX_GEN_SCRIPT%" )

-REM QNN converted quantized model use channel last layout, and use uint8/int8 for input & output
-REM To support context binary file generated by QNN tool chain, need to do some modification
-REM Generate a new model mobilenetv2-12_shape_add_trans.onnx from mobilenetv2-12_shape.onnx
-REM based on the input & output information got from QNN converted model.cpp file
-python add_trans_cast.py -m mobilenetv2-12_shape.onnx -q mobilenetv2-12_net.json
+REM based on the input & output information got from QNN converted mobilenetv2-12_net.json file
+REM Generate mobilenetv2-12_net_qnn_ctx.onnx with content of libmobilenetv2-12.serialized.bin embeded
+python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json

 where /q cmake.exe
 IF ERRORLEVEL 1 (
@@ -92,22 +90,24 @@ copy /y %ORT_BIN%\QnnSystem.dll .
 copy /y %ORT_BIN%\libQnnHtpV68Skel.so .
 copy /y %ORT_BIN%\libQnnHtpV73Skel.so
 copy /y ..\..\mobilenetv2-12_shape.onnx .
-copy /y ..\..\mobilenetv2-12_shape_add_trans.onnx .
 copy /y ..\..\mobilenetv2-12_quant_shape.onnx .
-copy /y ..\..\libmobilenetv2-12.serialized.bin .
+copy /y ..\..\mobilenetv2-12_net_qnn_ctx.onnx .
 copy /y ..\..\kitten_input.raw .
 copy /y ..\..\kitten_input_nhwc.raw .
 copy /y ..\..\synset.txt .

 @ECHO ON
-REM run with QNN CPU backend
-qnn_ep_sample.exe --cpu kitten_input.raw
+REM run mobilenetv2-12_shape.onnx with QNN CPU backend
+qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw

-REM run with QNN HTP backend
-qnn_ep_sample.exe --htp kitten_input.raw
+REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

-REM run with QNN HTP backend using the QNN generated context binary file
-qnn_ep_sample.exe --qnn kitten_input_nhwc.raw
+REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw
+
+REM run mobilenetv2-12_net_qnn_ctx.onnx (generated from native QNN) with QNN HTP backend
+qnn_ep_sample.exe --qnn mobilenetv2-12_net_qnn_ctx.onnx kitten_input_nhwc.raw

 :EXIT
 popd

0 commit comments
