ggml_rms_norm_back: backend-sched compute produces different (wrong) result than legacy compute_with_ctx

## Summary

`ggml_rms_norm_back` produces different values depending on whether
the graph is computed via the legacy `ggml_graph_compute_with_ctx`
(which gives the correct, formula-matching value) or via the backend
scheduler `ggml_backend_sched_graph_compute` on the CPU backend (which
returns a wildly different value with the wrong sign and magnitude).

Both paths use the same op (`ggml_rms_norm_back`), the same CPU
backend, and the same input tensors. Forward `ggml_rms_norm` works
correctly on both paths; the discrepancy is specific to the backward op.

Reproduced on `ggml` master `5725fee` (the commit in my vendored checkout),
Linux aarch64 (NVIDIA GB10), gcc 13.3, CPU backend only.

## Repro 1 — legacy `ggml_graph_compute_with_ctx` (correct)

```c
#include "ggml.h"
#include "ggml-cpu.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 256 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);

    float a[4] = {1.0f, 0.0f, 0.0f, 0.0f};  // gradients (src0 per impl source)
    float b[4] = {1.0f, 0.0f, 0.0f, 0.0f};  // x (src1 per impl source)
    memcpy(t_a->data, a, sizeof a);
    memcpy(t_b->data, b, sizeof b);

    struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t_out);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    float *out = (float *) t_out->data;
    printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

**Output**: `out = [0.000799 0 0 0]`

Matches the formula in `ggml-cpu/ops.cpp::ggml_compute_forward_rms_norm_back_f32`:

```
sum_xx  = Σ x[i]²       = 1
sum_eps = sum_xx + eps*n = 1.0004
sum_xdz = Σ x[i]*dz[i]   = 1
mean_eps = sum_xx/n + eps = 0.2501
rrms   = 1/sqrt(mean_eps) ≈ 1.9996
dx[0]  = (dz[0] - x[0]*sum_xdz/sum_eps) * rrms
       = (1 - 1.0 * 1/1.0004) * 1.9996
       ≈ 0.0008
dx[1..3] = 0
```

## Repro 2 — backend scheduler `ggml_backend_sched_graph_compute` (wrong)

```c
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <stdint.h>  // uint8_t
#include <stdio.h>
#include <stdlib.h>  // calloc

int main(void) {
    ggml_backend_load_all();
    ggml_backend_t backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
    ggml_backend_sched_t sched = ggml_backend_sched_new(&backend, NULL, 1,
                                                         GGML_DEFAULT_GRAPH_SIZE, false, true);

    size_t ctx_buf_size = ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
    uint8_t *ctx_buf = (uint8_t *) calloc(1, ctx_buf_size);
    struct ggml_init_params params = {
        .mem_size   = ctx_buf_size,
        .mem_buffer = ctx_buf,
        .no_alloc   = true,
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
    struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);

    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t_out);

    ggml_backend_sched_reset(sched);
    if (!ggml_backend_sched_alloc_graph(sched, gf)) {
        fprintf(stderr, "alloc_graph failed\n"); return 1;
    }

    float a[4] = {1.0f, 0.0f, 0.0f, 0.0f};
    float b[4] = {1.0f, 0.0f, 0.0f, 0.0f};
    ggml_backend_tensor_set(t_a, a, 0, sizeof a);
    ggml_backend_tensor_set(t_b, b, 0, sizeof b);

    ggml_backend_sched_graph_compute(sched, gf);

    float out[4];
    ggml_backend_tensor_get(t_out, out, 0, sizeof out);
    printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

**Output**: `out = [-3.9976 0 0 0]`

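`-3.9976` is suspicious: it equals `-2 * (sum_xdz / sum_eps) * rrms` ≈ `-2 * 0.9996 * 1.9996`, i.e. the subtracted term of the formula above doubled, with the `dz[0]` term missing. It is close to, but not exactly, `-2 * rrms` (≈ `-3.9992`). It doesn't match any documented formula I can recover from the impl source.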

## What's different

Tried during debugging:
- Calling `ggml_set_input` on the two input tensors before
  `ggml_backend_sched_alloc_graph` — no change.
- Downloading the input tensors immediately after
  `ggml_backend_tensor_set` and before `ggml_backend_sched_graph_compute`
  — values are correct (`[1 0 0 0]` for both), so the uploads are landing.
- Forward `ggml_rms_norm` via the same backend-sched setup — works
  correctly (parity verified at multiple shapes against a hand-rolled
  reference in our `TinyNN.rms_norm` smoke test).
- `softmax_back` via `ggml_soft_max_ext_back` through the same
  backend-sched setup: works correctly (parity verified against a
  hand-rolled reference). So backward ops in general are not broken
  via sched; the problem is specific to `rms_norm_back`.

## Where I'm using this

Building a Ruby-FFI bridge from a Spinel-AOT-compiled toy transformer
to ggml — repo at
[`OriPekelman/toy_ruby_neural_network`](https://github.com/OriPekelman/toy_ruby_neural_network),
the bridge code in `tinynn/`. The shim around `tnn_session_new` uses
backend-sched (so we get the same code path for CPU and CUDA), which
is why this matters for us. The bug reproduces in pure C using
just `ggml`, though, so the issue is upstream.

`TinyNN.rms_norm_back` is currently bound, but its parity smoke test is
disabled until this is resolved. Happy to test any candidate fix
against our toy transformer's backward pass.

