[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize)#20582

Open
JulianCloudNTH wants to merge 2 commits into gh/JulianCloudNTH/74/base from gh/JulianCloudNTH/74/head

@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Contributor

Stack from ghstack (oldest at bottom):

End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.

Problem: the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

Solution: a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.

  • Cases A-D: dynamic + static rms_norm (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
  • Cases F-H: rms(rms(x)) cascade, rms(x)+x (rms->add cascade), rms(x)*x (mul).
  • Cases I-L: dynamic linear_q4gsw (GEMM at several M), sdpa_with_kv_cache (GQA prefill at several S), embedding_q4gsw (int64 ids), apply_rotary_emb (two outputs).
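The property these cases exercise can be shown in a minimal, backend-free sketch. `MaxAllocTensor` and its members are hypothetical stand-ins for illustration, not ExecuTorch or WebGPU backend types: storage is sized once for the upper-bound shape, and `resize` only updates the live shape, so the storage pointer (the analogue of a GPU buffer referenced by a bind group) never moves.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a backend tensor; not an ExecuTorch type.
struct MaxAllocTensor {
  std::vector<int64_t> max_sizes;   // upper-bound shape used at build time
  std::vector<int64_t> live_sizes;  // shape of the current run
  std::vector<float> storage;       // allocated once, sized for max_sizes

  explicit MaxAllocTensor(const std::vector<int64_t>& max_shape)
      : max_sizes(max_shape),
        live_sizes(max_shape),
        storage(numel_of(max_shape)) {}

  // Product of all dimensions.
  static std::size_t numel_of(const std::vector<int64_t>& sizes) {
    std::size_t n = 1;
    for (int64_t d : sizes) n *= static_cast<std::size_t>(d);
    return n;
  }

  // Accept a live shape only if it has the same rank and fits inside the
  // max shape. Storage is never touched, so handles bound to it stay valid.
  bool resize(const std::vector<int64_t>& new_sizes) {
    if (new_sizes.size() != max_sizes.size()) return false;
    for (std::size_t i = 0; i < new_sizes.size(); ++i) {
      if (new_sizes[i] < 0 || new_sizes[i] > max_sizes[i]) return false;
    }
    live_sizes = new_sizes;
    return true;
  }

  std::size_t live_numel() const { return numel_of(live_sizes); }
};
```

Under this model, a run at S<MAX changes only `live_sizes`; a request above the max shape is rejected rather than triggering a reallocation.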

Implementation:

  • test/ops/dynamic_shape/test_dynamic_shape_export.py exports each toy model through VulkanPartitioner with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
  • test/native/test_dynamic_shape.cpp loads each .pte, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3).
  • Multi-output ops select their output by full shape, never numel.

Constraints: numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if sym_size.int/copy_ op coverage is incomplete (does not fail the suite).
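As a rough illustration of the dispatch-cap constraint, the sketch below assumes the cap is WebGPU's default `maxComputeWorkgroupsPerDimension` limit (65535) and that each workgroup covers `workgroup_size` elements of a flat 1D dispatch. The helper names and the workgroup size are illustrative assumptions, not taken from the backend.

```cpp
#include <cstdint>

// WebGPU's default limit on workgroups per dispatch dimension.
constexpr uint64_t kMaxWorkgroupsPerDim = 65535;

// Workgroups needed to cover numel elements (ceiling division).
constexpr uint64_t workgroups_1d(uint64_t numel, uint64_t workgroup_size) {
  return (numel + workgroup_size - 1) / workgroup_size;
}

// True when a flat 1D dispatch over numel elements stays within the cap.
constexpr bool fits_1d_dispatch(uint64_t numel, uint64_t workgroup_size) {
  return workgroups_1d(numel, workgroup_size) <= kMaxWorkgroupsPerDim;
}
```

With a hypothetical hidden size of 64 and a workgroup size of 64, a graph built at S=128 would need `workgroups_1d(128 * 64, 64)` = 128 workgroups, while a live run at S=16 would need only 16, which is the sense in which a resize shrinks the dispatch.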

Co-authored-with: Claude Code.

Differential Revision: D109906090

[ghstack-poisoned]
pytorch-bot Bot commented Jun 28, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20582

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
@JulianCloudNTH

Contributor Author

@claude review and check for any areas or opportunities for modularization

claude Bot commented Jun 29, 2026


Claude finished @JulianCloudNTH's task in 1m 28s


Review: dynamic-shape integration test

  • Read changed files
  • Review for correctness & best practices
  • Identify modularization opportunities

Solid end-to-end test — building one graph at S=MAXS and replaying smaller live shapes against a torch golden is exactly the right oracle for the allocate-at-max + per-op-resize engine, and selecting multi-output tensors by full shape (not numel) is the correct call. The graceful SKIP paths (no WebGPU device, SDPA op-coverage PENDING) keep the suite green where appropriate. Below are the modularization opportunities you asked about, plus a few minor notes.


Modularization — Python (test_dynamic_shape_export.py)

1. The lower → assert-delegated → write-buffer block is copy-pasted 5×. _export (lines 97-108) already does this, but _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope each re-implement the identical to_edge_transform_and_lower(...).to_executorch() + any(d.id == "VulkanBackend" ...) + open().write(et.buffer) sequence (e.g. lines 210-218, 253-261, 292-299, 335-343). Since the only thing that differs between callers is how the ExportedProgram is built, split _export so everyone shares the tail:

def _lower_and_write(ep, path: str) -> None:
    et = to_edge_transform_and_lower(ep, partitioner=[VulkanPartitioner()]).to_executorch()
    assert any(
        d.id == "VulkanBackend"
        for plan in et.executorch_program.execution_plan
        for d in plan.delegates
    ), f"Expected VulkanBackend delegate in {path}"
    with open(path, "wb") as f:
        f.write(et.buffer)
    print(f"Exported {path}")

_export then becomes torch.export.export(...) + _lower_and_write(ep, path), and the four specialized exporters drop ~6 duplicated lines each.

2. The tensor → little-endian f32 → file incantation appears ~10×. t.detach().cpu().numpy().astype("<f4").tofile(os.path.join(out_dir, ...)) is repeated in _write_goldens, _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope. A one-liner _write_f32(t, path) (and a sibling _write_i64 for the embedding idx at line 307) would centralize the dtype contract that the native side depends on, so a future dtype change is one edit instead of ten.

3. SDPA and RoPE golden loops are structurally identical (lines 266-270, 347-351): iterate [(name, tensor), ...] and dump each to {prefix}.S{s}.{name}.bin. Once _write_f32 exists, both collapse to a shared _write_named(prefix, s, pairs, out_dir).


Modularization — C++ (test_dynamic_shape.cpp)

4. The four op-specific checkers share a fixed skeleton. check_linear, check_sdpa, check_embedding, and check_rope each repeat: construct Module, load_forward() with FAIL/ok=false handling, read input bins, make_tensor_ptr, forward, result .ok()/isTensor guards, slice to numel, max_err, and the printf(... PASS/FAIL); ok = ok && pass tail. The per-op delta is really just (a) input tensor construction, (b) output selection, and (c) tolerance. Worth extracting the invariant pieces:

  • forward_and_check(got, golden, label, s, tol, ok) — the max_err + PASS/FAIL print + ok &= pass tail (currently duplicated in check_s, check_linear, check_embedding, check_rope, check_sdpa).
  • find_output_by_shape(result, predicate) — the shape-matching output-selection loop is written twice (SDPA lines 207-217, RoPE lines 309-321) with the same structure.
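A possible shape for the output-selection helper is sketched below. To stay self-contained it operates on a plain list of output shapes instead of the actual forward() result, so the `Shape` alias, the `numel` helper, and the exact signature are assumptions, not the test's real types.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Shape = std::vector<int64_t>;

// Total element count of a shape.
int64_t numel(const Shape& s) {
  int64_t n = 1;
  for (int64_t d : s) n *= d;
  return n;
}

// Returns the index of the first output whose shape satisfies pred,
// or -1 when no output matches.
int find_output_by_shape(const std::vector<Shape>& output_shapes,
                         const std::function<bool(const Shape&)>& pred) {
  for (std::size_t i = 0; i < output_shapes.size(); ++i) {
    if (pred(output_shapes[i])) return static_cast<int>(i);
  }
  return -1;
}
```

Passing a predicate such as `[&](const Shape& s) { return s == want; }` keeps the full-shape rule in one place. A numel-based predicate would be ambiguous whenever two outputs share an element count, for example {1, 4, 8, 16} and {1, 4, 16, 8}.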

5. The main()-level "load fresh module per S and check_s" loop is repeated 4× for cases A/B, F, G, H (lines 360-369, 398-407, 410-419, 422-431) — identical except the .pte name and the S list. A helper:

void run_dyn_cases(const std::string& dir, const std::string& prefix,
                   std::initializer_list<int> ss, bool& ok) {
  for (int s : ss) {
    Module m(dir + "/" + prefix + ".pte");
    if (m.load_forward() != Error::Ok) {
      printf("  FAIL load %s.pte\n", prefix.c_str());
      ok = false;
      return;
    }
    check_s(m, dir, prefix, s, ok);
  }
}

turns each case into a single call.

6. Templatize the binary reader. read_bin only handles float, so check_embedding (lines 246-255) hand-rolls the int64 read. A read_bin<T>(path) would remove that special case and keep one file-reading code path.
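One way the templated reader could look, as a sketch only: the existing read_bin signature is not shown in this thread, so the error convention here (an empty vector on any failure) is an assumption. It also assumes the host is little-endian, matching the "<f4" files the export script writes.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <ios>
#include <string>
#include <type_traits>
#include <vector>

// Reads a raw binary file as a flat array of T. Returns an empty vector if
// the file is missing, empty, or not a whole number of T elements.
template <typename T>
std::vector<T> read_bin(const std::string& path) {
  static_assert(std::is_trivially_copyable<T>::value,
                "read_bin requires a trivially copyable element type");
  std::ifstream f(path, std::ios::binary | std::ios::ate);
  if (!f) return {};
  const std::streamoff bytes = static_cast<std::streamoff>(f.tellg());
  if (bytes <= 0) return {};
  if (static_cast<std::size_t>(bytes) % sizeof(T) != 0) return {};
  std::vector<T> out(static_cast<std::size_t>(bytes) / sizeof(T));
  f.seekg(0, std::ios::beg);
  f.read(reinterpret_cast<char*>(out.data()),
         static_cast<std::streamsize>(bytes));
  if (!f) return {};
  return out;
}
```

With this in place, the embedding check could call `read_bin<int64_t>(...)` for the ids and `read_bin<float>(...)` for the golden, sharing one file-reading path.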


Minor notes (non-blocking)

  • Magic tolerances. 1e-3f, 5e-3f, 2e-3f are inlined at each call site (lines 106, 155, 230, 270, 331). Named constants (kTolRms, kTolQuant, kTolSdpa) would document intent and pair naturally with the tol parameter in suggestion #4.
  • Header comment lists A–F but cases extend to L (lines 9-21). The block comment stops at case F while the suite runs through L — worth completing or trimming so it doesn't drift.
  • max_err returns 1e30f as a size-mismatch sentinel (line 57). It works because every comparison is < tol, but an explicit got.empty() || size mismatch → FAIL at the call site reads more clearly than relying on a magic large float comparing false.
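The first and third notes could be combined roughly as follows. The constant values come from the tolerances listed in the PR description; `max_abs_err` and `within_tol` are hypothetical names, and treating a NaN difference as a failure is an extra assumption not discussed above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-op tolerances, named once instead of inlined at each call site.
constexpr float kTolRms = 1e-3f;
constexpr float kTolQuant = 5e-3f;
constexpr float kTolSdpa = 2e-3f;

// Writes the max absolute error to *out_err. Returns false, leaving
// *out_err untouched, when the inputs are empty, differ in size, or
// produce a NaN difference, so no sentinel value is needed.
bool max_abs_err(const std::vector<float>& got,
                 const std::vector<float>& golden,
                 float* out_err) {
  if (got.empty() || got.size() != golden.size()) return false;
  float worst = 0.0f;
  for (std::size_t i = 0; i < got.size(); ++i) {
    const float d = std::fabs(got[i] - golden[i]);
    if (std::isnan(d)) return false;
    if (d > worst) worst = d;
  }
  *out_err = worst;
  return true;
}

// True only when the comparison is valid and the error is below tol.
bool within_tol(const std::vector<float>& got,
                const std::vector<float>& golden,
                float tol) {
  float err = 0.0f;
  return max_abs_err(got, golden, &err) && err < tol;
}
```

A call site would then read `within_tol(got, golden, kTolRms)`, with a size mismatch reported as an explicit failure instead of a large float that happens to compare false.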

None of these are correctness blockers — the test logic itself looks right. The Python _lower_and_write extraction (#1) is the highest-value change; the C++ helpers (#4/#5) are next.

[ghstack-poisoned]
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported
