
Commit 2511881

Update Qnn EP example code to reflect the QNN cache improvement in Ort (#309)
* Update Qnn EP example code to reflect the QNN cache improvement in Ort microsoft/onnxruntime#17757
1 parent 5c47d50 commit 2511881

File tree: 3 files changed (+95, -75 lines)

c_cxx/QNN_EP/mobilenetv2_classification/README.md

Lines changed: 28 additions & 12 deletions
@@ -1,17 +1,29 @@
 ## About
-- Builds the sample compiled against the ONNX Runtime built with support for Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK)
-- The sample uses the QNN EP, run with Qnn CPU banckend and HTP backend
+- Builds the sample compiled against the ONNX Runtime built with support for Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK).
+- The sample uses the QNN EP to:
+- a. run the float32 model on the Qnn CPU backend.
+- b. run the QDQ model on the HTP backend with qnn_context_cache_enable=1, which generates the Onnx model with the QNN context binary embedded.
+- c. run the QNN context binary model generated from ONNX Runtime (previous step) on the HTP backend, to improve the model initialization time and reduce memory overhead.
+- d. run the QNN context binary model generated from the QNN tool chain on the HTP backend, to support models generated from the native QNN tool chain.
 - The sample downloads the mobilenetv2 model from Onnx model zoo, and use mobilenetv2_helper.py to quantize the float32 model to QDQ model which is required for HTP backend
 - The sample is targeted to run on QC ARM64 device.
 - There are 2 ways to improve the session creation time by using of QNN context binary:
-- Option 1: Use contexty binary generated by OnnxRuntime QNN EP. OnnxRuntime QNN EP use QNN API to generate the QNN context binary, and also dumps some metadata (model name, version, graph meta id, etc) to identify the model. You can just simply set qnn_context_cache_enable to 1 and run with QDQ model. The first run will generate the context binary (Default file name is model_file_name.onnx.bin if qnn_context_cache_path is not set.). The model run afterwards will load from the generated context binary file.
+- Option 1: Use the context binary generated by OnnxRuntime QNN EP. OnnxRuntime QNN EP uses the QNN API to generate the QNN context binary, and also dumps some metadata (model name, version, graph meta id, etc.) to identify the model.
+- a. Set qnn_context_cache_enable to 1 and run with the QDQ model.
+- b. The first run will generate the context binary model (default file name is model_file_name.onnx_qnn_ctx.onnx if qnn_context_cache_path is not set).
+- c. Use the generated context binary model (mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx) for inference going forward. (The QDQ model is no longer needed, and there is no need to set qnn_context_cache_enable.)
+
 - Option 2: Use context binary generated by native QNN tool chain:
-- The sample also demonstrates the feature to load from QNN generated context binary file libmobilenetv2-12.serialized.bin to better support customer application migration from native QNN to OnnxRuntime QNN EP. Because QNN converted model use channel last layout, and the quantized model use INT8/UINT8 as model input & output. A script add_trans_cast.py is provided to update the orignial Onnx model by insert Cast and Transpose node to make the model input & output align with QNN converted model. It requires the Onnx float32 model and pre-converted QNN model_net.json.
-- The sample provides mobilenetv2-12_net.json and context binary file libmobilenetv2-12.serialized.bin as exmple (generated by QNN version 2.10). Please follow QNN document to gnereated QNN model_net.json and the context binary file.
+- The sample also demonstrates the feature to create an Onnx model file from the QNN generated context binary file libmobilenetv2-12.serialized.bin, to better support customer application migration from native QNN to OnnxRuntime QNN EP. A script [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) is provided to generate an Onnx model from the QNN generated context binary file. It requires the QNN generated context binary file libmobilenetv2-12.serialized.bin and the pre-converted QNN mobilenetv2-12_net.json.
+- a. Convert the model to a QNN model and generate the QNN context binary file. The sample provides mobilenetv2-12_net.json and the context binary file libmobilenetv2-12.serialized.bin as an example (generated by QNN version 2.10). Please follow the QNN documentation to generate the QNN model_net.json and the context binary file.
 - Example command used:
 - qnn-onnx-converter --input_list ./input.txt --input_network ./mobilenetv2-12.onnx --output_path ./mobilenetv2-12.cpp -b 1 --bias_bw 32
 - qnn-model-lib-generator -c ./mobilenetv2-12.cpp -b ./mobilenetv2-12.bin -o ./mobilenetv2_classification/qnn_lib
 - qnn-context-binary-generator --backend ${QNN_SDK_ROOT}/target/x86_64-linux-clang/lib/libQnnHtp.so --model ./mobilenetv2_classification/qnn_lib/x86_64-linux-clang/libmobilenetv2-12.so --binary_file libmobilenetv2-12.serialized
+- b. Create an Onnx model file from the QNN generated context binary file libmobilenetv2-12.serialized.bin. For more details, refer to run_qnn_ep_sample.bat & [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py)
+- python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json
+- c. Create an ONNX Runtime session with the model generated in step b.
+- d. Run the model with quantized input data. The output also needs to be dequantized, because the QNN quantized model uses quantized data types for model inputs & outputs. For more details, refer to QuantizedData & DequantizedData in [main.cpp](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/QNN_EP/mobilenetv2_classification/main.cpp). Also, the input image is in NHWC layout for the QNN converted model.

 - More info on QNN EP - https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html

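Option 1 above boils down to a couple of provider options on the session. Below is a minimal C API sketch, condensed from main.cpp in this commit; the function name is illustrative only, it assumes a Windows build with QnnHtp.dll available, and the CheckStatus error handling used by the real sample is omitted.

```cpp
// Sketch of Option 1: run the QDQ model once with the context cache enabled so the
// QNN EP dumps mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx (assumption: Windows build).
#include <onnxruntime_c_api.h>
#include <vector>

void option1_generate_ctx_model(OrtEnv* env) {
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  OrtSessionOptions* session_options = nullptr;
  g_ort->CreateSessionOptions(&session_options);  // status checks elided for brevity

  // backend_path selects the QNN backend; qnn_context_cache_enable=1 asks the EP to
  // generate the *_qnn_ctx.onnx model on the first run (names taken from this commit).
  std::vector<const char*> keys = {"backend_path", "qnn_context_cache_enable"};
  std::vector<const char*> values = {"QnnHtp.dll", "1"};
  g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN",
                                               keys.data(), values.data(), keys.size());

  OrtSession* session = nullptr;
  g_ort->CreateSession(env, L"mobilenetv2-12_quant_shape.onnx", session_options, &session);
  // Later runs can create the session directly from the generated *_qnn_ctx.onnx model,
  // with only backend_path set (step c above).

  g_ort->ReleaseSession(session);
  g_ort->ReleaseSessionOptions(session_options);
}
```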
@@ -38,22 +50,26 @@ Example (Src): run_qnn_ep_sample.bat C:\src\onnxruntime C:\src\onnxruntime\build
 ## Example run result
 ```
 ...
-REM run with QNN CPU backend
-qnn_ep_sample.exe --cpu kitten_input.raw
+REM run mobilenetv2-12_shape.onnx with QNN CPU backend
+qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw

 Result:
-position=281, classification=n02123045 tabby, tabby cat, probability=13.663173
+position=281, classification=n02123045 tabby, tabby cat, probability=13.663178

-REM run with QNN HTP backend
-qnn_ep_sample.exe --htp kitten_input.raw
+REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

 Result:
 position=281, classification=n02123045 tabby, tabby cat, probability=13.637316

-REM run with QNN HTP backend using the QNN generated context binary file
-qnn_ep_sample.exe --qnn kitten_input_nhwc.raw
+REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw

 Result:
+position=281, classification=n02123045 tabby, tabby cat, probability=13.637316
+
+REM run mobilenetv2-12_net_qnn_ctx.onnx (generated from native QNN) with QNN HTP backend
+qnn_ep_sample.exe --qnn mobilenetv2-12_net_qnn_ctx.onnx kitten_input_nhwc.raw
 position=281, classification=n02123045 tabby, tabby cat, probability=13.637315
 ...
 ```
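Step d of Option 2 in the README above feeds the QNN-generated model with quantized data and dequantizes its output. As a rough, hedged illustration, here is standard affine (de)quantization using the offset/scale values that main.cpp passes for this sample; the helper names below are made up, and the exact rounding/clamping behaviour of the sample's QuantizedData/DequantizedData helpers is an assumption, so treat this as illustrative rather than a drop-in copy.

```cpp
// Hedged sketch of the uint8 <-> float conversion around the QNN generated model.
// Values from main.cpp: input offset -116, scale 0.015875209; output offset -86, scale 0.08069417.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// float -> uint8: q = clamp(round(x / scale) - offset, 0, 255)   (assumed rounding/clamping)
std::vector<uint8_t> QuantizeToUint8(const std::vector<float>& in, int32_t offset, float scale) {
  std::vector<uint8_t> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) {
    float q = std::round(in[i] / scale) - static_cast<float>(offset);
    out[i] = static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
  }
  return out;
}

// uint8 -> float: x = scale * (q + offset)
std::vector<float> DequantizeToFloat(const std::vector<uint8_t>& in, int32_t offset, float scale) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) {
    out[i] = scale * (static_cast<float>(in[i]) + static_cast<float>(offset));
  }
  return out;
}
```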

c_cxx/QNN_EP/mobilenetv2_classification/main.cpp

Lines changed: 50 additions & 46 deletions
@@ -63,59 +63,39 @@ void DequantizedData(float* out, const T_QuantType* in, int32_t offset, float sc
   }
 }

-void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn_native_context_binary = "") {
-#ifdef _WIN32
-  const wchar_t* model_path = nullptr;
-  if (backend == "QnnCpu.dll") {
-    model_path = L"mobilenetv2-12_shape.onnx";
-  } else {
-    model_path = L"mobilenetv2-12_quant_shape.onnx";
-  }
-#else
-  char* model_path = nullptr;
-  if (backend == "QnnCpu.so") {
-    model_path = "mobilenetv2-12_shape.onnx";
-  } else {
-    model_path = "mobilenetv2-12_quant_shape.onnx";
-  }
-#endif
+void run_ort_qnn_ep(const std::string& backend, const std::string& model_path, const std::string& input_path,
+                    bool generated_from_native_qnn, bool generate_ctx) {
+  std::wstring model_path_wstr = std::wstring(model_path.begin(), model_path.end());

   const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
   OrtEnv* env;
-  CheckStatus(g_ort, g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env));  // Can set to ORT_LOGGING_LEVEL_INFO or ORT_LOGGING_LEVEL_VERBOSE for more info
+  // Can set to ORT_LOGGING_LEVEL_INFO or ORT_LOGGING_LEVEL_VERBOSE for more info
+  CheckStatus(g_ort, g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env));

   OrtSessionOptions* session_options;
   CheckStatus(g_ort, g_ort->CreateSessionOptions(&session_options));
   CheckStatus(g_ort, g_ort->SetIntraOpNumThreads(session_options, 1));
   CheckStatus(g_ort, g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_BASIC));

-  // You can also set qnn_context_cache_enable to 1 to improve session creation cost
-  // More option details refers tohttps://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html
+  // More option details refers to https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html
   std::vector<const char*> options_keys = {"backend_path"};
   std::vector<const char*> options_values = {backend.c_str()};

-  bool run_with_qnn_native_ctx_binary = false;
-  // Load from offline prepared QNN context binary file. This one is generated by native QNN tool chain
-  if (!qnn_native_context_binary.empty()) {
+  // If it runs from a QDQ model on HTP backend
+  // It will generate an Onnx model with Qnn context binary.
+  // The context binary can be embedded inside the model in EPContext->ep_cache_context (by default),
+  // or the context binary can be a separate .bin file, with relative path set in EPContext->ep_cache_context (qnn_context_embed_mode = 0)
+  if (generate_ctx) {
     options_keys.push_back("qnn_context_cache_enable");
     options_values.push_back("1");
-    // qnn_context_cache_path -- Specify the context binary. This one is generated by QNN toolchain
-    // If not specifid, OnnxRuntime QNN EP will generate the context binary file from QDQ model at the frst run and load from it next run
-    options_keys.push_back("qnn_context_cache_path");
-    options_values.push_back("libmobilenetv2-12.serialized.bin");
-    run_with_qnn_native_ctx_binary = true;
-
-#ifdef _WIN32
-    model_path = L"mobilenetv2-12_shape_add_trans.onnx";
-#else
-    model_path = "mobilenetv2-12_shape_add_trans.onnx";
-#endif
   }
+  // qnn_context_cache_path -- you can specify the path and file name as you want
+  // If not specified, OnnxRuntime QNN EP will generate it at [model_path]_qnn_ctx.onnx

   CheckStatus(g_ort, g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN", options_keys.data(),
                                                                   options_values.data(), options_keys.size()));
   OrtSession* session;
-  CheckStatus(g_ort, g_ort->CreateSession(env, model_path, session_options, &session));
+  CheckStatus(g_ort, g_ort->CreateSession(env, model_path_wstr.c_str(), session_options, &session));

   OrtAllocator* allocator;
   CheckStatus(g_ort, g_ort->GetAllocatorWithDefaultOptions(&allocator));
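The new comments in this hunk note that the generated context binary can either be embedded in the *_qnn_ctx.onnx model (the default) or kept as a separate .bin file referenced from EPContext->ep_cache_context. A hedged sketch of how those provider options might be combined, based only on the option names mentioned in this commit; verify the exact behaviour against the QNN EP documentation for your ONNX Runtime version.

```cpp
// Variation of the generate_ctx path above: keep the QNN context binary in a
// separate .bin file instead of embedding it in the generated Onnx model.
// Option names are taken from the comments added in this commit (assumption).
#include <onnxruntime_c_api.h>
#include <vector>

void append_qnn_ep_with_external_ctx_binary(const OrtApi* g_ort, OrtSessionOptions* session_options) {
  std::vector<const char*> keys = {"backend_path",
                                   "qnn_context_cache_enable",
                                   "qnn_context_cache_path",   // optional: output path for the generated model
                                   "qnn_context_embed_mode"};  // 0 = write the binary as a separate file
  std::vector<const char*> values = {"QnnHtp.dll", "1",
                                     "mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx", "0"};
  g_ort->SessionOptionsAppendExecutionProvider(session_options, "QNN",
                                               keys.data(), values.data(), keys.size());
}
```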
@@ -203,13 +183,14 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn
   input_raw_file.read(reinterpret_cast<char*>(&input_data[0]), num_elements * sizeof(float));

   CheckStatus(g_ort, g_ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &memory_info));
-  if (run_with_qnn_native_ctx_binary) {
+  // QNN native tool chain generated quantized model use quantized data as inputs & outputs
+  if (generated_from_native_qnn) {
     size_t input_data_length = input_data_size * sizeof(uint8_t);
     QuantizedData(quantized_input_data.data(), input_data.data(), -116, 0.015875209f, input_data_size);
     CheckStatus(g_ort, g_ort->CreateTensorWithDataAsOrtValue(
         memory_info, reinterpret_cast<void*>(quantized_input_data.data()), input_data_length,
         input_node_dims[0].data(), input_node_dims[0].size(), input_types[0], &input_tensors[0]));
-  } else {
+  } else {  // Ort generate QDQ model still use float32 data as inputs & outputs
     size_t input_data_length = input_data_size * sizeof(float);
     CheckStatus(g_ort, g_ort->CreateTensorWithDataAsOrtValue(
         memory_info, reinterpret_cast<void*>(input_data.data()), input_data_length,
@@ -226,7 +207,7 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn
   void* output_buffer;
   CheckStatus(g_ort, g_ort->GetTensorMutableData(output_tensors[0], &output_buffer));
   float* float_buffer = nullptr;
-  if (run_with_qnn_native_ctx_binary) {
+  if (generated_from_native_qnn) {
     uint8_t* buffer = reinterpret_cast<uint8_t*>(output_buffer);
     DequantizedData(output_data.data(), buffer, -86, 0.08069417f, output_data_size);
     float_buffer = output_data.data();
@@ -252,39 +233,62 @@ void run_ort_qnn_ep(std::string backend, std::string input_path, std::string qnn

 void PrintHelp() {
   std::cout << "To run the sample, use the following command:" << std::endl;
-  std::cout << "Example: ./qnn_ep_sample --cpu <path_to_raw_input>" << std::endl;
-  std::cout << "To Run with QNN CPU backend. Example: ./qnn_ep_sample --cpu kitten_input.raw" << std::endl;
-  std::cout << "To Run with QNN HTP backend. Example: ./qnn_ep_sample --htp kitten_input.raw" << std::endl;
-  std::cout << "To Run with QNN native context binary on QNN HTP backend . Example: ./qnn_ep_sample --qnn kitten_input_nhwc.raw" << std::endl;
+  std::cout << "Example: ./qnn_ep_sample --cpu <model_path> <path_to_raw_input>" << std::endl;
+  std::cout << "To Run with QNN CPU backend. Example: ./qnn_ep_sample --cpu mobilenetv2-12_shape.onnx kitten_input.raw" << std::endl;
+  std::cout << "To Run with QNN HTP backend. Example: ./qnn_ep_sample --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw" << std::endl;
+  std::cout << "To Run with QNN HTP backend and generate Qnn context binary model. Example: ./qnn_ep_sample --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx" << std::endl;
+  std::cout << "To Run with QNN native context binary on QNN HTP backend . Example: ./qnn_ep_sample --qnn qnn_native_ctx_binary.onnx kitten_input_nhwc.raw" << std::endl;
 }

 constexpr const char* CPUBACKEDN = "--cpu";
 constexpr const char* HTPBACKEDN = "--htp";
 constexpr const char* QNNCTXBINARY = "--qnn";
+constexpr const char* GENERATE_CTX = "--gen_ctx";

 int main(int argc, char* argv[]) {

-  if (argc != 3) {
+  if (argc != 4 && argc != 5) {
     PrintHelp();
     return 1;
   }

+  bool generate_ctx = false;
+  if (argc == 5) {
+    if (strcmp(argv[4], GENERATE_CTX) == 0) {
+      generate_ctx = true;
+    } else {
+      std::cout << "The expected last parameter is --gen_ctx." << std::endl;
+      PrintHelp();
+      return 1;
+    }
+  }
+
   std::string backend = "";
-  std::string qnn_native_context_binary = "";
+  bool generated_from_native_qnn = false;
   if (strcmp(argv[1], CPUBACKEDN) == 0) {
     backend = "QnnCpu.dll";
+    if (generate_ctx) {
+      std::cout << "--gen_ctx won't work with CPU backend." << std::endl;
+      return 1;
+    }
   } else if (strcmp(argv[1], HTPBACKEDN) == 0) {
     backend = "QnnHtp.dll";
   } else if (strcmp(argv[1], QNNCTXBINARY) == 0) {
     backend = "QnnHtp.dll";
-    qnn_native_context_binary = "libmobilenetv2-12.serialized.bin";
+    generated_from_native_qnn = true;
+    if (generate_ctx) {
+      std::cout << "--gen_ctx won't work with --qnn." << std::endl;
+      return 1;
+    }
   } else {
     std::cout << "This sample only support option cpu, htp, qnn." << std::endl;
     PrintHelp();
     return 1;
   }
-  std::string input_path(argv[2]);

-  run_ort_qnn_ep(backend, input_path, qnn_native_context_binary);
+  std::string model_path(argv[2]);
+  std::string input_path(argv[3]);
+
+  run_ort_qnn_ep(backend, model_path, input_path, generated_from_native_qnn, generate_ctx);
   return 0;
 }

c_cxx/QNN_EP/mobilenetv2_classification/run_qnn_ep_sample.bat

Lines changed: 17 additions & 17 deletions
@@ -51,16 +51,14 @@ IF NOT EXIST mobilenetv2-12_shape.onnx (
 )

 REM Download add_trans_cast.py file
-set ADD_TRANS_CAST_SCRIPT_URL="https://raw.githubusercontent.com/microsoft/onnxruntime/main/onnxruntime/python/tools/qnn/add_trans_cast.py"
-set ADD_TRANS_CAST_SCRIPT="add_trans_cast.py"
-IF NOT EXIST %ADD_TRANS_CAST_SCRIPT% (
-powershell -Command "Invoke-WebRequest %ADD_TRANS_CAST_SCRIPT_URL% -Outfile %ADD_TRANS_CAST_SCRIPT%" )
+set QNN_CTX_ONNX_GEN_SCRIPT_URL="https://raw.githubusercontent.com/microsoft/onnxruntime/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py"
+set QNN_CTX_ONNX_GEN_SCRIPT="gen_qnn_ctx_onnx_model.py"
+IF NOT EXIST %QNN_CTX_ONNX_GEN_SCRIPT% (
+powershell -Command "Invoke-WebRequest %QNN_CTX_ONNX_GEN_SCRIPT_URL% -Outfile %QNN_CTX_ONNX_GEN_SCRIPT%" )

-REM QNN converted quantized model use channel last layout, and use uint8/int8 for input & output
-REM To support context binary file generated by QNN tool chain, need to do some modification
-REM Generate a new model mobilenetv2-12_shape_add_trans.onnx from mobilenetv2-12_shape.onnx
-REM based on the input & output information got from QNN converted model.cpp file
-python add_trans_cast.py -m mobilenetv2-12_shape.onnx -q mobilenetv2-12_net.json
+REM based on the input & output information got from QNN converted mobilenetv2-12_net.json file
+REM Generate mobilenetv2-12_net_qnn_ctx.onnx with content of libmobilenetv2-12.serialized.bin embeded
+python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json

 where /q cmake.exe
 IF ERRORLEVEL 1 (
@@ -92,22 +90,24 @@ copy /y %ORT_BIN%\QnnSystem.dll .
 copy /y %ORT_BIN%\libQnnHtpV68Skel.so .
 copy /y %ORT_BIN%\libQnnHtpV73Skel.so
 copy /y ..\..\mobilenetv2-12_shape.onnx .
-copy /y ..\..\mobilenetv2-12_shape_add_trans.onnx .
 copy /y ..\..\mobilenetv2-12_quant_shape.onnx .
-copy /y ..\..\libmobilenetv2-12.serialized.bin .
+copy /y ..\..\mobilenetv2-12_net_qnn_ctx.onnx .
 copy /y ..\..\kitten_input.raw .
 copy /y ..\..\kitten_input_nhwc.raw .
 copy /y ..\..\synset.txt .

 @ECHO ON
-REM run with QNN CPU backend
-qnn_ep_sample.exe --cpu kitten_input.raw
+REM run mobilenetv2-12_shape.onnx with QNN CPU backend
+qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw

-REM run with QNN HTP backend
-qnn_ep_sample.exe --htp kitten_input.raw
+REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

-REM run with QNN HTP backend using the QNN generated context binary file
-qnn_ep_sample.exe --qnn kitten_input_nhwc.raw
+REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
+qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw
+
+REM run mobilenetv2-12_net_qnn_ctx.onnx (generated from native QNN) with QNN HTP backend
+qnn_ep_sample.exe --qnn mobilenetv2-12_net_qnn_ctx.onnx kitten_input_nhwc.raw

 :EXIT
 popd

0 commit comments
