[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize)#20582
JulianCloudNTH wants to merge 2 commits into
@claude review and check for any areas or opportunities for modularization

Claude finished @JulianCloudNTH's task in 1m 28s

Review: dynamic-shape integration test
Solid end-to-end test — building one graph at Modularization — Python (
Stack from ghstack (oldest at bottom):
End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.
Problem: the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).
Solution: a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.
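The invariant this test checks can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the ExecuTorch or WebGPU code: `MaxBuffer`, `ToyGraph`, `S_MAX`, `HIDDEN`, and `check_all_live_shapes` are hypothetical names, the rms formula is a generic one, and the golden here is recomputed by the same helper, whereas the PR takes its goldens from torch.

```python
import math

S_MAX = 8    # assumed upper-bound sequence length the toy graph is built at
HIDDEN = 4   # assumed hidden size of the toy model
EPS = 1e-6   # assumed epsilon for the rms denominator


class MaxBuffer:
    """Storage allocated once at S_MAX; only the logical row count changes."""

    def __init__(self, rows_max, cols):
        self.rows_max = rows_max
        self.cols = cols
        self.rows_live = rows_max
        self.data = [0.0] * (rows_max * cols)

    def resize(self, rows):
        # Change the logical shape only; the allocation is never replaced.
        if not 0 < rows <= self.rows_max:
            raise ValueError(f"live rows {rows} outside 1..{self.rows_max}")
        self.rows_live = rows

    @property
    def shape(self):
        return (self.rows_live, self.cols)


def rms_norm_rows(rows):
    """Generic row-wise rms normalisation used as the toy op."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + EPS)
        out.append([v / rms for v in row])
    return out


class ToyGraph:
    """Hypothetical stand-in for a graph built once at S_MAX."""

    def __init__(self):
        self.inp = MaxBuffer(S_MAX, HIDDEN)
        self.out = MaxBuffer(S_MAX, HIDDEN)

    def run(self, x_rows):
        s = len(x_rows)
        self.inp.resize(s)
        self.out.resize(s)  # output follows the live shape
        for r, row in enumerate(x_rows):
            self.inp.data[r * HIDDEN:(r + 1) * HIDDEN] = row
        # The "dispatch" covers only the live rows.
        live = [self.inp.data[r * HIDDEN:(r + 1) * HIDDEN] for r in range(s)]
        for r, row in enumerate(rms_norm_rows(live)):
            self.out.data[r * HIDDEN:(r + 1) * HIDDEN] = row
        return [self.out.data[r * HIDDEN:(r + 1) * HIDDEN] for r in range(s)]


def check_all_live_shapes(live_sizes=(1, 3, 8), tol=1e-3):
    graph = ToyGraph()  # built once, reused for every live S
    storage = (id(graph.inp.data), id(graph.out.data))
    for s in live_sizes:
        x = [[float(r + c + 1) for c in range(HIDDEN)] for r in range(s)]
        golden = rms_norm_rows(x)  # stand-in; the PR uses torch goldens
        got = graph.run(x)
        assert graph.out.shape == (s, HIDDEN)
        assert all(abs(a - b) <= tol
                   for ra, rb in zip(got, golden) for a, b in zip(ra, rb))
        # Same list objects as before: the buffers never moved.
        assert (id(graph.inp.data), id(graph.out.data)) == storage
    return True
```

Calling `check_all_live_shapes()` runs the single toy graph at S=1, 3, and 8 and checks the three properties named above: output matches the golden, output shape equals the live shape, and the backing storage is the same object throughout.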
- `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).

Implementation:

- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3).

Constraints: numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if `sym_size.int`/`copy_` op coverage is incomplete (does not fail the suite).

Co-authored-with: Claude Code.
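For readers who want the tolerance and dispatch-cap constraints in executable form, here is a minimal Python sketch. It assumes the tolerances are maximum absolute errors and that the 65535 cap applies to the workgroup count of a 1D dispatch; the description states neither explicitly. The workgroup size of 64 and all function names are hypothetical.

```python
# Per-op tolerances quoted in the description (assumed to be max absolute error).
TOLERANCES = {"rms": 1e-3, "quant": 5e-3, "sdpa": 2e-3}

# 1D-dispatch cap quoted in the description.
MAX_1D_DISPATCH = 65535


def max_abs_err(got, golden):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(got) != len(golden):
        raise ValueError("output and golden differ in length")
    return max((abs(a - b) for a, b in zip(got, golden)), default=0.0)


def matches_golden(op_family, got, golden):
    """True if the output is within the tolerance for its op family."""
    return max_abs_err(got, golden) <= TOLERANCES[op_family]


def workgroups_1d(num_elements, workgroup_size=64):
    """Workgroups needed for a 1D dispatch (workgroup_size is an assumption)."""
    return (num_elements + workgroup_size - 1) // workgroup_size


def fits_1d_dispatch(num_elements, workgroup_size=64):
    """True if the 1D dispatch stays within the cap."""
    return workgroups_1d(num_elements, workgroup_size) <= MAX_1D_DISPATCH
```

Under these assumptions, a toy model with `num_elements` outputs stays inside the cap as long as `fits_1d_dispatch(num_elements)` holds, which is the condition the toy-model sizes are chosen to satisfy.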
Differential Revision: D109906090