
ggml-hexagon: HAP_power_set_HMX uses &ctx instead of ctx in htp_iface_open(), causing large HMX GEMM slowdown #1452

@happyyzy


What happened

In ggml-hexagon, htp_iface_open() powers up HMX with the wrong client/context pointer.

File:

  • src/ggml-hexagon/htp/main.c

Current code on commit 35ae589fa189a3682a1fe25b7803122680c401b4:

request.type         = HAP_power_set_HMX;
request.hmx.power_up = TRUE;
err = HAP_power_set((void *) &ctx, &request);

That passes &ctx (address of the local pointer variable) instead of ctx (the actual struct htp_context *).

The rest of the power votes in the same function use ctx correctly:

HAP_power_set((void *) ctx, &request)

The one-line fix is:

err = HAP_power_set((void *) ctx, &request);
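To make the difference concrete, here is a minimal standalone sketch. It does not use the Hexagon SDK: `struct demo_context` and the helper functions are hypothetical stand-ins. It only shows that `(void *) &ctx` and `(void *) ctx` are two different handle values, so (assuming the first argument of HAP_power_set acts as an opaque client identifier) the HMX vote ends up registered under a different handle than the other votes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for struct htp_context. */
struct demo_context {
    int id;
};

/* Buggy style: the handle is the address of the local pointer variable.
 * Returns 1 if that handle equals the context pointer, 0 otherwise. */
int buggy_handle_matches(struct demo_context *ctx) {
    void *handle = (void *) &ctx;
    return handle == (void *) ctx;
}

/* Fixed style: the handle is the context pointer itself. */
int fixed_handle_matches(struct demo_context *ctx) {
    void *handle = (void *) ctx;
    return handle == (void *) ctx;
}

/* Small driver that checks both styles against one context object. */
void demo_compare_handles(void) {
    struct demo_context c = { 1 };
    assert(!buggy_handle_matches(&c)); /* &ctx is a different address */
    assert(fixed_handle_matches(&c));  /* ctx is the expected handle  */
    printf("buggy matches: %d, fixed matches: %d\n",
           buggy_handle_matches(&c), fixed_handle_matches(&c));
}
```

Because both expressions are cast to `void *`, the compiler accepts either one without a diagnostic, which is probably why the mistake went unnoticed.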

Why this matters

This is not just a cosmetic bug. On device, this causes a large HMX performance regression in ggml-hexagon HMX GEMM.

After fixing only this pointer bug, the HMX core segment drops immediately from ~65 ms to ~22 ms on the same workload.
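As a rough sanity check on those rounded figures, the sketch below computes the speedup and the relative time reduction. The function names are hypothetical, and the inputs are only the approximate ~65 ms and ~22 ms values quoted above, not exact measurements.

```c
#include <assert.h>

/* Ratio of the old time to the new time (higher means faster after the fix). */
double speedup(double before_ms, double after_ms) {
    assert(before_ms > 0.0 && after_ms > 0.0);
    return before_ms / after_ms;
}

/* Percentage of the original time that was saved. */
double reduction_percent(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}
```

With the rounded values, `speedup(65.0, 22.0)` comes out just under 3x and `reduction_percent(65.0, 22.0)` at roughly 66%.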

Reproduction

Repo / commit:

  • ggml-org/ggml
  • 35ae589fa189a3682a1fe25b7803122680c401b4

Command used:

GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
  -p "type_a=q8_0,type_b=f32,m=4096,n=12288,k=4096"

and:

GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
  -p "type_a=q4_0,type_b=f32,m=4096,n=12288,k=4096"

Measured before / after

q8_0, shape 4096 x 4096 x 12288 (m x k x n)

Before fix:

  • dequant ~= 777xx us
  • core ~= 648xx us

After changing only HAP_power_set((void *)&ctx, ...) -> HAP_power_set((void *)ctx, ...):

  • dequant ~= 776xx us
  • core ~= 2205x us

q4_0, shape 4096 x 4096 x 12288 (m x k x n)

Before fix:

  • dequant ~= 643xx us
  • core ~= 653xx us

After fix:

  • dequant ~= 595xx us
  • core ~= 2218x us

Suggested fix

Change this line in src/ggml-hexagon/htp/main.c:

err = HAP_power_set((void *) &ctx, &request);

to:

err = HAP_power_set((void *) ctx, &request);
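One optional way to guard against this class of mistake is to route the votes through a small typed wrapper, so the `(void *)` cast happens in exactly one place. The sketch below is only an illustration under assumptions: `demo_context`, `demo_request`, `demo_power_set`, `demo_vote` and `demo_last_client` are hypothetical names, and `demo_power_set` is a mock that records its client handle rather than the real HAP_power_set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct htp_context and the power request. */
struct demo_context { int id; };
struct demo_request { int type; int power_up; };

/* Mock of the power API: records the opaque client handle it received. */
static void *demo_seen_client = NULL;

int demo_power_set(void *client, struct demo_request *req) {
    (void) req;
    demo_seen_client = client;
    return 0;
}

/* Returns the handle passed to the most recent demo_power_set() call. */
void *demo_last_client(void) {
    return demo_seen_client;
}

/* Typed wrapper: callers pass a struct demo_context *, and the cast to
 * void * is done only here. */
int demo_vote(struct demo_context *ctx, struct demo_request *req) {
    assert(ctx != NULL && req != NULL);
    return demo_power_set((void *) ctx, req);
}
```

With a wrapper like this, a call site that has `struct demo_context *ctx` and writes `demo_vote(&ctx, &request)` passes a `struct demo_context **`, which compilers diagnose as an incompatible pointer type instead of silently accepting it.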

Notes

I ruled out a few unrelated explanations before isolating this:

  • HMX lock mode (lock vs lock2(shared)) was not the cause.
  • Chunk/layout changes were not the cause.
  • The same workload still produced numerically correct results; the bug manifested only as a large performance drop.

This issue is specifically about the wrong context pointer passed to HAP_power_set() for the HAP_power_set_HMX request.
