Summary
ggml_rms_norm_back produces different values depending on whether
the graph is computed via the legacy ggml_graph_compute_with_ctx
(which gives the correct, formula-matching value) or via the backend
scheduler ggml_backend_sched_graph_compute on the CPU backend (which
returns a wildly different value with the wrong sign and magnitude).
Both paths use the same op (ggml_rms_norm_back), the same CPU
backend, and the same input tensors. Forward ggml_rms_norm works
correctly on both paths; the discrepancy is specific to the backward op.
Repro on ggml master 5725fee (the commit in my vendored checkout),
Linux aarch64 (NVIDIA GB10), gcc 13.3, CPU backend only.
Repro 1 — legacy ggml_graph_compute_with_ctx (correct)
    #include "ggml.h"
    #include "ggml-cpu.h"
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        struct ggml_init_params params = {
            .mem_size   = 256 * 1024,
            .mem_buffer = NULL,
            .no_alloc   = false,
        };
        struct ggml_context *ctx = ggml_init(params);

        struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
        struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
        float a[4] = {1.0f, 0.0f, 0.0f, 0.0f}; // gradients (src0 per impl source)
        float b[4] = {1.0f, 0.0f, 0.0f, 0.0f}; // x (src1 per impl source)
        memcpy(t_a->data, a, sizeof a);
        memcpy(t_b->data, b, sizeof b);

        struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);
        struct ggml_cgraph *gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, t_out);
        ggml_graph_compute_with_ctx(ctx, gf, 1);

        float *out = (float *) t_out->data;
        printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
        return 0;
    }
Output: out = [0.000799 0 0 0]
Matches the formula in ggml-cpu/ops.cpp::ggml_compute_forward_rms_norm_back_f32:
With n = 4, eps = 1e-4, dz = a = [1 0 0 0] and x = b = [1 0 0 0]:

    sum_xx   = Σ x[i]²          = 1
    sum_eps  = sum_xx + eps*n   = 1.0004
    sum_xdz  = Σ x[i]*dz[i]     = 1
    mean_eps = sum_xx/n + eps   = 0.2501
    rrms     = 1/sqrt(mean_eps) ≈ 1.9996
    dx[0]    = (dz[0] - x[0]*sum_xdz/sum_eps) * rrms
             = (1 - 1.0 * 1/1.0004) * 1.9996
             ≈ 0.0008
    dx[1..3] = 0
Repro 2 — backend scheduler ggml_backend_sched_graph_compute (wrong)
    #include "ggml.h"
    #include "ggml-backend.h"
    #include "ggml-cpu.h"
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        ggml_backend_load_all();
        ggml_backend_t backend = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
        ggml_backend_sched_t sched = ggml_backend_sched_new(&backend, NULL, 1,
                GGML_DEFAULT_GRAPH_SIZE, false, true);

        // Metadata-only context: tensor data is allocated by the scheduler.
        size_t ctx_buf_size = ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
        uint8_t *ctx_buf = (uint8_t *) calloc(1, ctx_buf_size);
        struct ggml_init_params params = {
            .mem_size   = ctx_buf_size,
            .mem_buffer = ctx_buf,
            .no_alloc   = true,
        };
        struct ggml_context *ctx = ggml_init(params);

        struct ggml_tensor *t_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
        struct ggml_tensor *t_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1);
        struct ggml_tensor *t_out = ggml_rms_norm_back(ctx, t_a, t_b, 1e-4f);
        struct ggml_cgraph *gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, t_out);

        ggml_backend_sched_reset(sched);
        if (!ggml_backend_sched_alloc_graph(sched, gf)) {
            fprintf(stderr, "alloc_graph failed\n");
            return 1;
        }

        // Upload the inputs only after the scheduler has allocated them.
        float a[4] = {1.0f, 0.0f, 0.0f, 0.0f}; // gradients (src0)
        float b[4] = {1.0f, 0.0f, 0.0f, 0.0f}; // x (src1)
        ggml_backend_tensor_set(t_a, a, 0, sizeof a);
        ggml_backend_tensor_set(t_b, b, 0, sizeof b);

        ggml_backend_sched_graph_compute(sched, gf);

        float out[4];
        ggml_backend_tensor_get(t_out, out, 0, sizeof out);
        printf("out = [%g %g %g %g]\n", out[0], out[1], out[2], out[3]);
        return 0;
    }
Output: out = [-3.9976 0 0 0]
-3.9976 is suspicious: it is roughly -2 * (1/sqrt(0.2501)) = -2 * rrms ≈ -3.9992. It doesn't match any documented formula I can recover from the impl source.
What's different
Tried during debugging:
- Calling ggml_set_input on the two input tensors before
  ggml_backend_sched_alloc_graph: no change.
- Downloading the input tensors immediately after
  ggml_backend_tensor_set and before ggml_backend_sched_graph_compute:
  values are correct ([1 0 0 0] for both), so the uploads are landing.
- Forward ggml_rms_norm via the same backend-sched setup: works
  correctly (parity verified at multiple shapes against a hand-rolled
  reference in our TinyNN.rms_norm smoke test).
- softmax_back via ggml_soft_max_ext_back through the same
  backend-sched setup: works correctly (parity verified against a
  hand-rolled reference).

So this is not a general "backward ops are broken via sched" issue; it is
specific to rms_norm_back.
Where I'm using this
I'm building a Ruby-FFI bridge from a Spinel-AOT-compiled toy transformer
to ggml. The repo is OriPekelman/toy_ruby_neural_network, with the bridge
code in tinynn/. The shim around tnn_session_new uses backend-sched (so we
get the same code path for CPU and CUDA), which is why this matters for us.
The repro is pure C using just ggml, though, so the issue is upstream.

TinyNN.rms_norm_back is currently bound, but its parity smoke test is
disabled until this is resolved. Happy to test any candidate fix
against our toy transformer's backward pass.