Why does the weight data type of the Linear layer become FP32 at runtime when loading fp16.pte (fp16 Llama 3.2-1B model)? #9639
Labels
module: llm
Issues related to LLM examples and apps, and to the extensions/llm/ code
module: xnnpack
Issues related to xnnpack delegation and the code under backends/xnnpack/
When the Llama 3.2-1B model is converted to fp16.pte using the -d fp16 parameter, why does the weight data type of the Linear layer become FP32 at runtime?
Convert command:
python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint "/model_convert/Llama-3.2-1B/original/consolidated_00.pth" --params "/Llama-3.2-1B/original/params.json" --use_sdpa_with_kv_cache -X --xnnpack-extended-ops --output_name "llama3_2_fp16_direct_convert_runtime.pte" -kv -d fp16 --max_seq_length 256
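For reference, here is a minimal sketch (not part of the export flow above) for checking what dtype the Linear weights have in the source checkpoint before the -d fp16 override is applied; it assumes the checkpoint is a flat state dict readable with torch.load and reuses the path passed to --checkpoint. This helps separate what the checkpoint stores from what the dtype override and the XNNPACK lowering do to the weights afterwards.

import torch

# Load the consolidated Meta checkpoint (the same file passed to --checkpoint above).
state_dict = torch.load(
    "/model_convert/Llama-3.2-1B/original/consolidated_00.pth",
    map_location="cpu",
    weights_only=True,
)

# Print shape and stored dtype of the attention / feed-forward projection weights,
# i.e. the tensors that end up as XNNPACK fully-connected kernels.
for name, tensor in state_dict.items():
    if any(key in name for key in ("wq", "wk", "wv", "wo", "w1", "w2", "w3")):
        print(name, tuple(tensor.shape), tensor.dtype)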
Runtime Linear weight dtype log:
@@@@@ kernel_value->datatype FP32, input_value->datatype FP16, output_value->datatype FP16
We print the Linear weight dtype in executorch/backends/xnnpack/third-party/XNNPACK/src/subgraph/fully-connected.c:1039.
cc @digantdesai @mcr229 @cbilgin @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng