- Builds the sample, compiled against an ONNX Runtime build with support for the Qualcomm AI Engine Direct SDK (Qualcomm Neural Network (QNN) SDK).

- The sample uses the QNN EP to:

- a. run the float32 model on the QNN CPU backend.

- b. run the QDQ model on the HTP backend with qnn_context_cache_enable=1, and generate the ONNX model which has the QNN context binary embedded.

- c. run the QNN context binary model generated by ONNX Runtime (previous step) on the HTP backend, to improve model initialization time and reduce memory overhead.

- d. run the QNN context binary model generated by the QNN tool chain on the HTP backend, to support models generated from the native QNN tool chain.

- The sample downloads the mobilenetv2 model from the ONNX model zoo and uses mobilenetv2_helper.py to quantize the float32 model to a QDQ model, which is required for the HTP backend.
- a. Set qnn_context_cache_enable to 1 and run with the QDQ model.

- b. The first run generates the context binary model (the default file name is model_file_name.onnx_qnn_ctx.onnx if qnn_context_cache_path is not set).

- c. Use the generated context binary model (mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx) for inference going forward. (The QDQ model is no longer needed, and qnn_context_cache_enable does not need to be set.)

- Notes: The QNN context binary is embedded within the ONNX model by default. Alternatively, set the QNN EP session option qnn_context_embed_mode to 0 to generate the QNN context binary as a separate file and embed the file's relative path in the ONNX model. This is necessary if the QNN context binary size exceeds protobuf's 2GB limit. A minimal C++ sketch of configuring these session options is shown after the code block below.

- Offline preparation is also supported: generate the QNN context binary on an x64 machine and run it on a Qualcomm ARM64 device.
```
# Build qnn_ep_sample for x64.
MSBuild.exe .\qnn_ep_sample.sln /property:Configuration=Release /p:Platform="x64"
# Run qnn_ep_sample.exe on x64. It only creates the ONNX Runtime session with the QDQ model to generate the QNN context binary.
# There is no need to run the model.
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx
```
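For reference, here is a minimal C++ sketch (not part of the sample) of how the session options above can be passed to the QNN EP through SessionOptions::AppendExecutionProvider. The backend_path option and the Qnn*.dll library names are assumptions based on the QNN EP documentation; check them against the ONNX Runtime version you build with, and see main.cpp for the sample's actual session setup.

```
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_ep_sample");
  Ort::SessionOptions session_options;

  // QNN EP provider options. The qnn_context_* names are the ones used in this README;
  // backend_path (assumed name) selects the QNN backend library to load.
  std::unordered_map<std::string, std::string> qnn_options;
  qnn_options["backend_path"] = "QnnHtp.dll";     // assumed HTP backend library; "QnnCpu.dll" for the QNN CPU backend
  qnn_options["qnn_context_cache_enable"] = "1";  // first run dumps model_file_name.onnx_qnn_ctx.onnx
  // qnn_options["qnn_context_cache_path"] = "mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx";
  // qnn_options["qnn_context_embed_mode"] = "0"; // write the context binary as a separate file instead of embedding it

  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Creating the session compiles the model for the selected backend and,
  // with qnn_context_cache_enable=1, generates the context binary model.
  Ort::Session session(env, ORT_TSTR("mobilenetv2-12_quant_shape.onnx"), session_options);
  return 0;
}
```

Once the *_qnn_ctx.onnx model exists, create the session with that model instead (and without qnn_context_cache_enable) to load the prebuilt QNN context directly, which is what gives the faster initialization described above.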
- Option 2: Use a context binary generated by the native QNN tool chain:

- The sample also demonstrates how to create an ONNX model file from the QNN-generated context binary file libmobilenetv2-12.serialized.bin, to better support migrating customer applications from native QNN to the ONNX Runtime QNN EP. A script [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) is provided to generate an ONNX model from a QNN-generated context binary file. It requires the QNN-generated context binary file libmobilenetv2-12.serialized.bin and the pre-converted QNN mobilenetv2-12_net.json.

- python gen_qnn_ctx_onnx_model.py -b libmobilenetv2-12.serialized.bin -q mobilenetv2-12_net.json

- c. Create an ONNX Runtime session with the model generated in step b.

- d. Run the model with quantized input data. The output also needs to be dequantized, because QNN quantized models use quantized data types for model inputs and outputs. For more details, refer to QuantizedData & DequantizedData in [main.cpp](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/QNN_EP/mobilenetv2_classification/main.cpp). Also note that the input image is in NHWC layout for the QNN-converted model. A sketch of the quantize/dequantize math is shown after this list.

- Notes: Call gen_qnn_ctx_onnx_model.py with --disable_embed_mode to generate the ONNX model with the relative path to the QNN context binary file. This is necessary if the QNN context binary size exceeds protobuf's 2GB limit.
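For illustration, below is a minimal C++ sketch of the affine quantize/dequantize math involved in step d. The helper names QuantizeInput and DequantizeOutput are hypothetical; the sample's actual implementation is QuantizedData & DequantizedData in main.cpp, and the scale and zero-point values must match the ones stored in the QNN-converted model.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: affine quantization q = round(x / scale) + zero_point, clamped to uint8.
std::vector<uint8_t> QuantizeInput(const std::vector<float>& data, float scale, int32_t zero_point) {
  std::vector<uint8_t> out(data.size());
  for (size_t i = 0; i < data.size(); ++i) {
    int32_t q = static_cast<int32_t>(std::lround(data[i] / scale)) + zero_point;
    out[i] = static_cast<uint8_t>(std::clamp(q, 0, 255));
  }
  return out;
}

// Hypothetical helper: the inverse mapping x = (q - zero_point) * scale.
std::vector<float> DequantizeOutput(const std::vector<uint8_t>& data, float scale, int32_t zero_point) {
  std::vector<float> out(data.size());
  for (size_t i = 0; i < data.size(); ++i) {
    out[i] = (static_cast<int32_t>(data[i]) - zero_point) * scale;
  }
  return out;
}
```

For example, with scale=0.5 and zero_point=0, an input value of 1.0 quantizes to q=2 and dequantizes back to 1.0.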
- More info on QNN EP - https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html

qnn_ep_sample.exe --cpu mobilenetv2-12_shape.onnx kitten_input.raw
Result:
position=281, classification=n02123045 tabby, tabby cat, probability=13.663178

REM run mobilenetv2-12_quant_shape.onnx with QNN HTP backend
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw

Result:
position=281, classification=n02123045 tabby, tabby cat, probability=13.637316

REM load mobilenetv2-12_quant_shape.onnx with QNN HTP backend, generate mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx which has the QNN context binary embedded
REM This does not have to be run on a real device with HTP; it can also be done on an x64 platform, since offline generation is supported
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx kitten_input.raw --gen_ctx

Onnx model with QNN context binary is generated.

REM run mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx with QNN HTP backend
qnn_ep_sample.exe --htp mobilenetv2-12_quant_shape.onnx_qnn_ctx.onnx kitten_input.raw