
[ExecuTorch][WebGPU] Dynamic tensor-shape resize engine core#20574

Open
JulianCloudNTH wants to merge 1 commit into gh/JulianCloudNTH/66/base from gh/JulianCloudNTH/66/head


@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Contributor

Stack from ghstack (oldest at bottom):

The WebGPU backend baked static tensor shapes at build time, so a dynamic .pte needed a separate graph for each shape (prefill vs. decode). This adds a tensor-shape resize engine mirroring Vulkan: tensors carry live cur_dims ≤ max, inputs resize per call, and a bounded fixpoint loop propagates tensor-level resize hooks.

Key changes:

  • WebGPUTensor: add cur_dims/cur_nbytes (live sizes ≤ max allocation), initialized to max at build
  • WebGPUGraph: resize_input/set_cur_dims validate live dims fit max, propagate_resize runs tensor hooks for dirty shapes
  • update_symints_from_inputs reads live cur_dims; adds sym_size.int dim source path
  • copy_inputs uploads only live bytes; WebGPUBackend::execute shrinks inputs and resizes outputs to live shapes

Static graphs stay byte-identical: cur == max forever, no hooks fire, no reallocations.
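
The live-versus-max bookkeeping described above can be illustrated with a minimal sketch. It is not the PR's code: `SketchTensor`, `make_tensor`, `product_of`, and the `elem_size` field are hypothetical stand-ins, and only the names `dims`, `cur_dims`, `cur_nbytes`, and `set_cur_dims` come from the description. The sketch assumes that a resize must keep the rank, must fit inside the max allocation, and marks the tensor dirty only when the shape actually changes.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for WebGPUTensor. Field names follow the PR text;
// everything else is assumed for illustration.
struct SketchTensor {
  std::vector<int64_t> dims;      // max shape, fixed at build time
  std::vector<int64_t> cur_dims;  // live shape, each entry <= dims
  size_t elem_size = 4;           // bytes per element (assumed float32)
  size_t cur_nbytes = 0;          // live byte count, <= max allocation
};

// Product of all dimensions; an empty shape counts as a scalar (1 element).
inline uint64_t product_of(const std::vector<int64_t>& d) {
  uint64_t n = 1;
  for (int64_t v : d) n *= static_cast<uint64_t>(v);
  return n;
}

// Build-time state: live shape starts equal to the max shape.
inline SketchTensor make_tensor(std::vector<int64_t> max_dims,
                                size_t elem_size = 4) {
  SketchTensor t;
  t.dims = std::move(max_dims);
  t.cur_dims = t.dims;
  t.elem_size = elem_size;
  t.cur_nbytes = static_cast<size_t>(product_of(t.dims)) * elem_size;
  return t;
}

// Returns true if the live shape changed (the tensor would be marked dirty).
// Throws if the new shape has a different rank or exceeds the max shape.
inline bool set_cur_dims(SketchTensor& t, const std::vector<int64_t>& nd) {
  if (nd.size() != t.dims.size()) {
    throw std::invalid_argument("rank mismatch");
  }
  for (size_t i = 0; i < nd.size(); ++i) {
    if (nd[i] < 0 || nd[i] > t.dims[i]) {
      throw std::out_of_range("live dim exceeds max allocation");
    }
  }
  if (nd == t.cur_dims) return false;  // static graphs always land here
  t.cur_dims = nd;
  t.cur_nbytes = static_cast<size_t>(product_of(nd)) * t.elem_size;
  return true;
}
```

Under these assumptions a static graph never sees `set_cur_dims` return true, so nothing downstream is re-run, which is consistent with the byte-identical claim.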

Differential Revision: D109906091


pytorch-bot Bot commented Jun 28, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20574

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit aa08145 with merge base 55a71e6:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

Contributor Author

@claude review and check for any areas or opportunities for modularization


claude Bot commented Jun 29, 2026


Claude finished @JulianCloudNTH's task in 1m 58s


Code review: WebGPU dynamic resize engine core

  • Read PR diff and changed files
  • Analyze correctness & modularization opportunities
  • Post review

Overall this is a clean, well-commented port of Vulkan's resize model. The "static graph stays byte-identical" invariant holds: cur_dims == dims at build, set_cur_dims only dirties on an actual change, and copy_inputs/propagate_resize short-circuit when nothing is dirty. Below are correctness notes and the modularization opportunities you asked about.


Modularization opportunities

1. utils::numel_of is added but never called — and the duplicate copies it should replace are left in place. This is the biggest one.

WebGPUUtils.h:28 adds:

inline uint64_t numel_of(const std::vector<int64_t>& dims) { ... }

But there are zero callers of utils::numel_of, and three places still hand-roll the same loop:

  • ops/embedding_q4gsw/EmbeddingQ4gsw.cpp:39 — its own local numel_of
  • ops/rope/RotaryEmbedding.cpp:37 — its own local numel_of
  • WebGPUGraph.cpp:107-110 (select_as_symint) and :165-171 (set_cur_dims) — inline product loops

Either wire the new util into those sites or drop it. As-is it is dead code, and the hand-rolled copies it was meant to replace survive. Replacing the per-op locals with utils::numel_of is the intended consolidation.

2. Extract maybe_resize_inputs / maybe_resize_outputs from WebGPUBackend::execute. The two new loops at WebGPUBackend.cpp:118-122 and :128-138 are exactly the Vulkan maybe_resize_input/maybe_resize_output pair the comments reference, inlined into execute. Pulling them into two private methods (or free helpers) would match the Vulkan structure the PR is mirroring, keep execute readable, and put the std::vector conversions behind a named boundary. Given the CLAUDE.md "no trivial single-use helper" rule these are borderline, but each is ~5–10 lines with a clear name and a 1:1 Vulkan analog, so I'd lean toward extracting.

3. A normalize_dim(dim, rank) + bounds-check helper. This pattern now appears three times verbatim:

int dim = s.dim < 0 ? s.dim + static_cast<int>(d.size()) : s.dim;
if (dim < 0 || dim >= static_cast<int>(d.size())) { throw ...; }

at WebGPUGraph.cpp:96-98, at :137-140, and in the negative-index logic next to set_cur_dims. A single small helper would remove the repetition and the easy-to-miss sign/cast bugs.
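
One possible shape for that helper is sketched below. The signature and the exception type are assumptions chosen for illustration; the PR may prefer a different error path or an `int` return.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: maps a possibly negative dim index onto [0, rank)
// and throws if the result is out of range. All arithmetic is done in
// int64_t so there is a single signed/unsigned conversion at the end.
inline size_t normalize_dim(int64_t dim, size_t rank) {
  const int64_t r = static_cast<int64_t>(rank);
  const int64_t d = dim < 0 ? dim + r : dim;
  if (d < 0 || d >= r) {
    throw std::out_of_range(
        "dim " + std::to_string(dim) + " out of range for rank " +
        std::to_string(rank));
  }
  return static_cast<size_t>(d);
}
```

A call site would then read `const size_t dim = normalize_dim(s.dim, d.size());`, which keeps the casts in one place.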


Correctness / robustness

4. select_as_symint empty-input guard is now reachable via cur_dims. At WebGPUGraph.cpp:107-112, numel is computed from live cur_dims and rejected if <= 0. That's correct, but note that a dynamic input legitimately resized so that one source dim is 0 would now throw where the static path never could. Worth confirming that dynamic symint sources can never have a zero live dim; otherwise the guard may need to tolerate it.

5. Per-call heap allocations on the hot path. execute allocates a std::vector<int64_t> per input (:120) and a std::vector<SizesType> per output (:130) every invocation. For decode loops this runs once per token. Minor, but a reusable scratch buffer on the graph (or a SmallVector/stack array for typical low ranks) would avoid it. Not blocking.
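
A reusable scratch buffer could look roughly like the sketch below. `DimsScratch` and `fill` are hypothetical names, and the source element type is left as a template parameter because the exact definition of SizesType is not shown in this thread. The idea relies on `std::vector::clear()` leaving capacity unchanged, so after the first call a refill of equal or lower rank does not touch the heap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scratch holder that a graph object could own. The vector is
// cleared and refilled on each call, so its heap block is reused across
// invocations instead of being allocated once per input or output.
struct DimsScratch {
  std::vector<int64_t> buf;

  // Copies `rank` sizes into the scratch vector and returns a reference
  // that stays valid until the next fill() call.
  template <typename T>
  const std::vector<int64_t>& fill(const T* sizes, size_t rank) {
    buf.clear();  // drops elements, keeps capacity
    for (size_t i = 0; i < rank; ++i) {
      buf.push_back(static_cast<int64_t>(sizes[i]));
    }
    return buf;
  }
};
```

The trade-off is that the returned reference is invalidated by the next `fill`, so a caller must finish using one shape before converting the next.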

6. propagate_resize fixpoint bound looks right. pass <= tensor_resize_hooks_.size() with set_cur_dims only re-dirtying on change gives convergence in ≤ depth passes for a forward DAG, and processing.swap(dirty_tensors_) cleanly separates this pass's set from cascades. No issue — just flagging that the loop will silently stop after size()+1 passes if a cycle ever existed; an assert/log on !dirty_tensors_.empty() after the loop would catch a malformed graph instead of executing on stale shapes.
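
The bounded loop plus the suggested post-loop check can be sketched in miniature as follows. This is an assumption-laden toy, not the PR's implementation: `SketchGraph`, `HookEntry`, and the integer tensor ids are hypothetical, and only `set_cur_dims`, `propagate_resize`, the `pass <= hooks.size()` bound, and the swap of the dirty set mirror what the review describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical miniature of the resize engine.
struct SketchGraph {
  using Hook = std::function<void(SketchGraph&)>;
  struct HookEntry {
    int src;  // tensor id whose shape change triggers the hook
    Hook fn;  // recomputes downstream shapes via set_cur_dims
  };

  std::vector<std::vector<int64_t>> cur_dims;  // live shape per tensor id
  std::unordered_set<int> dirty;               // ids changed since last pass
  std::vector<HookEntry> hooks;

  // Marks a tensor dirty only when its shape actually changes.
  void set_cur_dims(int id, std::vector<int64_t> d) {
    if (cur_dims[id] == d) return;
    cur_dims[id] = std::move(d);
    dirty.insert(id);
  }

  // Runs at most hooks.size() + 1 passes and returns how many were run.
  // Throws if shapes are still changing afterwards, which in this sketch
  // signals a cyclic (malformed) graph rather than executing on stale shapes.
  size_t propagate_resize() {
    size_t passes = 0;
    for (size_t pass = 0; pass <= hooks.size() && !dirty.empty(); ++pass) {
      std::unordered_set<int> processing;
      processing.swap(dirty);  // cascades from this pass land in `dirty`
      for (const auto& h : hooks) {
        if (processing.count(h.src) != 0) h.fn(*this);
      }
      ++passes;
    }
    if (!dirty.empty()) {
      throw std::runtime_error("resize propagation did not converge");
    }
    return passes;
  }
};
```

With a forward chain of hooks the loop drains the dirty set within the bound; with two hooks that keep re-dirtying each other, the final check fires instead of returning silently.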

7. No op registers a tensor resize hook yet. add_tensor_resize_hook and the add_dispatch index return are added but unused in this PR — expected for an "engine core" change, just confirming the dispatch-rewrite consumers land in a later stack entry.


CI note

The reported failure is test-arm-backend-no-driver — an Arm backend job unrelated to these WebGPU-only files; almost certainly not caused by this PR.
