
ggml_rms_norm_back: backend-sched compute produces different (wrong) result than legacy compute_with_ctx #1491

@OriPekelman


Summary

ggml_rms_norm_back produces different values depending on whether
the graph is computed via the legacy ggml_graph_compute_with_ctx
(which gives the correct, formula-matching value) or via the backend
scheduler ggml_backend_sched_graph_compute on the CPU backend (which
returns a wildly different value with the wrong sign and magnitude).

Both paths use the same op (ggml_rms_norm_back), the same CPU
backend, and the same input tensors. Forward ggml_rms_norm works
correctly on both paths; the discrepancy is specific to the backward op.

Repro on ggml master 5725fee (the commit in my vendored checkout),
Linux aarch64 (NVIDIA GB10), gcc 13.3, CPU backend only.

Repro 1 — legacy ggml_graph_compute_with_ctx (correct)

#include "ggml.h"
#include "ggml-cpu.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 256 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);

    float a[4] = {1.0f, 0.0f, 0.0f, 0.0f};  // gradients (src0 per impl source)
    float b[4] = {1.0f, 0.0f, 0.0f, 0.0f};  // x (src1 per impl source)
    memcpy(t_a->data, a, sizeof a);
    memcpy(t_b->data, b, sizeof b);

    struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t_out);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    float *out = (float *) t_out->data;
    printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
    return 0;
}

Output: out = [0.000799 0 0 0]

Matches the formula in ggml-cpu/ops.cpp::ggml_compute_forward_rms_norm_back_f32:

n        = 4, eps = 1e-4, x = dz = [1 0 0 0]
sum_xx   = Σ x[i]²          = 1
sum_xdz  = Σ x[i]*dz[i]     = 1
sum_eps  = sum_xx + eps*n   = 1.0004
mean_eps = sum_xx/n + eps   = 0.2501
rrms     = 1/sqrt(mean_eps) ≈ 1.9996
dx[0]    = (dz[0] - x[0]*sum_xdz/sum_eps) * rrms
         = (1 - 1.0 * 1/1.0004) * 1.9996
         ≈ 0.0008
dx[1..3] = 0

Repro 2 — backend scheduler ggml_backend_sched_graph_compute (wrong)

#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <stdint.h>  // uint8_t
#include <stdio.h>
#include <stdlib.h>  // calloc

int main(void) {
    ggml_backend_load_all();
    ggml_backend_t backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
    ggml_backend_sched_t sched = ggml_backend_sched_new(&backend, NULL, 1,
                                                         GGML_DEFAULT_GRAPH_SIZE, false, true);

    size_t ctx_buf_size = ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
    uint8_t *ctx_buf = (uint8_t *) calloc(1, ctx_buf_size);
    struct ggml_init_params params = {
        .mem_size   = ctx_buf_size,
        .mem_buffer = ctx_buf,
        .no_alloc   = true,
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);

    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t_out);

    ggml_backend_sched_reset(sched);
    if (!ggml_backend_sched_alloc_graph(sched, gf)) {
        fprintf(stderr, "alloc_graph failed\n"); return 1;
    }

    float a[4] = {1.0f, 0.0f, 0.0f, 0.0f};
    float b[4] = {1.0f, 0.0f, 0.0f, 0.0f};
    ggml_backend_tensor_set(t_a, a, 0, sizeof a);
    ggml_backend_tensor_set(t_b, b, 0, sizeof b);

    ggml_backend_sched_graph_compute(sched, gf);

    float out[4];
    ggml_backend_tensor_get(t_out, out, 0, sizeof out);
    printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
    return 0;
}

Output: out = [-3.9976 0 0 0]

-3.9976 is suspicious: it is roughly -2 * rrms = -2 / sqrt(0.2501) ≈ -3.9992, and more precisely -2 * rrms * sum_xdz / sum_eps = -2 * 1.9996 / 1.0004 ≈ -3.9976. It does not match any formula I can recover from the implementation source.

What's different

Tried during debugging:

  • Calling ggml_set_input on the two input tensors before
    ggml_backend_sched_alloc_graph — no change.
  • Downloading the input tensors immediately after
    ggml_backend_tensor_set and before ggml_backend_sched_graph_compute
    — values are correct ([1 0 0 0] for both), so the uploads are landing.
  • Forward ggml_rms_norm via the same backend-sched setup — works
    correctly (parity verified at multiple shapes against a hand-rolled
    reference in our TinyNN.rms_norm smoke test).
  • softmax_back via ggml_soft_max_ext_back through the same
    backend-sched setup — works correctly (parity verified against
    hand-rolled reference). So backward ops are not broken via sched in
    general; the problem is specific to rms_norm_back.

Where I'm using this

Building a Ruby-FFI bridge from a Spinel-AOT-compiled toy transformer
to ggml — repo at
OriPekelman/toy_ruby_neural_network,
the bridge code in tinynn/. The shim around tnn_session_new uses
backend-sched (so we get the same code path for CPU and CUDA), which
is why this matters for us. The bug reproduces in pure C using just
ggml, though, so the issue is upstream.

TinyNN.rms_norm_back is currently bound, but its parity smoke test is
disabled until this is resolved. Happy to test any candidate fix
against our toy transformer's backward pass.
