Why does the weight data type of the Linear layer become FP32 at runtime when loading fp16.pte (fp16 Llama 3.2-1B model)? #9639
Labels
module: llm
Issues related to LLM examples and apps, and to the extensions/llm/ code
module: xnnpack
Issues related to xnnpack delegation and the code under backends/xnnpack/
When the Llama 3.2-1B model is converted to fp16.pte using the -d fp16 parameter, why does the weight data type of the Linear layer become FP32 at runtime?
Convert command:
python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint "/model_convert/Llama-3.2-1B/original/consolidated_00.pth" --params "/Llama-3.2-1B/original/params.json" --use_sdpa_with_kv_cache -X --xnnpack-extended-ops --output_name "llama3_2_fp16_direct_convert_runtime.pte" -kv -d fp16 --max_seq_length 256
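For reference, here is a minimal sketch (not part of the export flow above) for checking what dtype the Linear weights have in the source checkpoint before the -d fp16 override is applied; it assumes the checkpoint is a flat state dict readable with torch.load and reuses the path passed to --checkpoint. This helps separate what the checkpoint stores from what the dtype override and the XNNPACK lowering do to the weights afterwards.

import torch

# Load the consolidated Meta checkpoint (the same file passed to --checkpoint above).
state_dict = torch.load(
    "/model_convert/Llama-3.2-1B/original/consolidated_00.pth",
    map_location="cpu",
    weights_only=True,
)

# Print shape and stored dtype of the attention / feed-forward projection weights,
# i.e. the tensors that end up as XNNPACK fully-connected kernels.
for name, tensor in state_dict.items():
    if any(key in name for key in ("wq", "wk", "wv", "wo", "w1", "w2", "w3")):
        print(name, tuple(tensor.shape), tensor.dtype)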
Runtime Linear weight dtype log:
@@@@@ kernel_value->datatype FP32, input_value->datatype FP16, output_value->datatype FP16
We print the Linear weight dtype in executorch/backends/xnnpack/third-party/XNNPACK/src/subgraph/fully-connected.c:1039.
cc @digantdesai @mcr229 @cbilgin @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng