use Metal's Indirect Command Buffers for true GPU-side multi draw indirect #9640

Open
matthargett wants to merge 19 commits into gfx-rs:trunk from rebeckerspecialties:metal-icb-multi-draw-indirect

Conversation

@matthargett matthargett commented Jun 4, 2026

Contributor

Connections

Reference workload: this was motivated by BabylonNative WebGPU rendering work where CPU-side multi-draw looping became visible in large batched scenes and AR rendering. The relevant application cases were the AR Portal (camera + rendering) and Hill Valley-style (Back to the Future) glTF scenes with baked/static geometry batches that included PBR materials and a WGSL SSAO workload.

AR portal demo reference:
https://doc.babylonjs.com/features/featuresDeepDive/webXR/webXRDemos#ar-demo

Description

This ports Metal fixed-count multi-draw indirect to a real indirect-command-buffer path when the render pass can be safely suspended and resumed, instead of always looping over draws on the CPU.

The ICB fast path is intentionally conservative. It currently applies to larger static/material-baked multi-draw batches whose render state is pipeline plus vertex/index buffers only. Passes that bind textures, uniforms, or storage through render bind groups, or use immediates, stay on the CPU-loop fallback. This keeps common material-heavy passes safe while still reducing bridge/encoder overhead for large baked-geometry batches.

Changes:

  • Detect Metal render/compute indirect command buffer capabilities through feature sets plus runtime family checks.
  • Keep the macOS ICB path to Apple Silicon Macs (unified-memory Metal devices) with Mac2/Metal3-family evidence; Intel/AMD Macs stay on the CPU fallback until direct validation.
  • Keep the mobile/watch/vision ICB path to Apple5/A12+ or Metal3-class hardware on the validated modern OS generation (iOS/iPadOS 18, tvOS 18, visionOS 2, watchOS 11), plus Apple3/A10X on tvOS 18+ after first-generation Apple TV 4K validation. Apple3 on iOS/iPadOS remains on the CPU fallback.
  • Mark render pipelines as supporting ICBs and generate render ICB commands from WebGPU indirect argument buffers with a compute pass.
  • Support non-indexed and indexed fixed-count MDI, including GPU-generated indirect argument buffers.
  • Use an explicit primitive-type mapping for the ICB generation shader instead of relying on MTLPrimitiveType integer values matching MSL primitive_type values.
  • Generate ICB records with uniform dispatchThreadgroups, using the compute pipeline's threadExecutionWidth() and an in-shader cmdIndex >= drawCount guard instead of relying on non-uniform threadgroups from dispatchThreads.
  • Cache Metal argument encoders per command encoder and only use the ICB path for batches above the current setup-cost threshold.
  • Preserve the CPU fallback for unsupported render-pass states such as active bind groups, immediates, timestamp/occlusion queries, resolves, memoryless attachments, non-store attachments, virtual adapters, Intel/AMD macOS adapters, Apple3 on iOS/iPadOS, and mobile/watch/vision runtimes older than the validated modern OS generation.
  • Add wgpu-gpu conformance coverage for GPU-generated non-indexed and indexed multi-draw indirect arguments.
  • Add extra hardening coverage for 16-bit indexed ICB records, positive and negative base_vertex, non-zero first_vertex/first_instance, draw counts crossing the ICB generation workgroup width, mixed single/multi indirect sequencing in one render pass, and bind-group forced fallback.
  • Disable the ICB path on Apple's paravirtual Metal adapter after CI showed it advertises the relevant feature sets/families but aborts when executing render ICBs.
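
The uniform-dispatch sizing described in the list above can be sketched on the CPU side as follows. This is a minimal illustration, not the code in the diff: `threadgroup_count` and `guarded_command_indices` are hypothetical names, and `threads_per_group` stands in for whatever `threadExecutionWidth()` reports on the device.

```rust
/// Threadgroups needed to cover `draw_count` commands with uniform groups of
/// `threads_per_group` threads (rounds up; zero draws need zero groups).
pub fn threadgroup_count(draw_count: u32, threads_per_group: u32) -> u32 {
    assert!(threads_per_group > 0, "threads_per_group must be non-zero");
    draw_count.div_ceil(threads_per_group)
}

/// CPU-side model of the in-shader guard: of all launched threads, only those
/// with `cmd_index < draw_count` would write an ICB record.
pub fn guarded_command_indices(draw_count: u32, threads_per_group: u32) -> Vec<u32> {
    // Widen to u64 so the rounded-up thread total cannot overflow.
    let launched = u64::from(threadgroup_count(draw_count, threads_per_group))
        * u64::from(threads_per_group);
    (0..launched)
        .filter(|&cmd_index| cmd_index < u64::from(draw_count))
        .map(|cmd_index| cmd_index as u32)
        .collect()
}
```

With a width of 32 and 33 draws, two threadgroups (64 threads) are launched and the guard leaves 31 of them idle, which is the case the "draw counts crossing the ICB generation workgroup width" tests are meant to exercise.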

Hardware and driver quirks found:

  • M4-class Apple Silicon Macs execute GPU-generated non-indexed and indexed MDI through Metal ICBs. Intel/AMD macOS GPUs are intentionally out of scope for this PR and remain on the CPU fallback because the macOS gate now requires unified-memory Apple Silicon evidence.
  • A12/iPhone XS on iOS 18.7 successfully executes GPU-generated non-indexed and indexed MDI through Metal ICBs. This matches our empirical Apple5/A12 finding even though older public Apple capability tables are conservative about ICB support.
  • A10X/iPad Pro (iPad7,3) on iPadOS 17.7 advertises the legacy iOS GPUFamily3 ICB feature set, but throws a foreign NSException from the draw_indirect path before ICB generation while ending/suspending the active render encoder for the ICB splice. This PR therefore keeps Apple3/A10X on the iOS/iPadOS CPU-loop fallback even if a later runtime reports related feature sets.
  • A10X Apple TV 4K (AppleTV6,2) on tvOS 18.3 passed the native tvOS MdiIcbProbe workload on this PR branch without the previous experimental Apple3 override. The probe forced the ICB path, hit multi_draw_indirect and multi_draw_indexed_indirect Metal ICB diagnostics, and validated GPU-generated indirect arguments plus the render target by CPU readback.
  • Apple Watch GPUs are minimal enough that the ICB generation compute pass should not depend on non-uniform threadgroup dispatch. This PR now follows the Dawn-style shape: uniform dispatchThreadgroups sized by threadExecutionWidth(), with rounded-up extra threads neutralized by an in-shader draw-count guard.
  • Apple Watch SE2 (Watch6,10) passed the post-fix rerun in the external MetalSpeed probe. The probe used a Metal shared-buffer readback for CPU-side pixel verification, while the ICB path itself remains GPU-side command generation plus indirect draw execution.
  • On A12, MTLIndirectCommandBufferDescriptor inheritance defaults are usable, but calling optional generated objc2 setters such as setInheritDepthStencilState: can crash because the selector is not present on that runtime. The implementation relies on default inheritance instead of calling those optional setters.
  • On A12, generated ICB records rendered black until maxVertexBufferBindCount was set from the currently bound vertex-buffer count. The fragment bind count is currently set to 0 because this fast path rejects active render bind groups and immediates and does not restore fragment buffer bindings into ICB state.
  • On the GitHub-hosted Apple Paravirtual device, Metal reports ICB-related feature sets/families and the debug dump showed indirect_command_buffers_rendering: true and indirect_command_buffers_compute: true, but all MDI cases SIGABRTed when the ICB path executed. This PR treats virtual adapters like the existing mesh-shader virtual-device quirk and keeps them on the CPU fallback until we see that simulator/virtualized driver path gets fixed by Apple.
  • The render-pass suspend/resume path is intentionally conservative and could be broadened. It only uses ICBs when the pass can be safely ended and restarted with store/load semantics and when current state is simple enough to restore.
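
As a rough sketch of the conservative per-draw decision described above (not the actual implementation), the choice between the ICB path and the CPU loop can be modeled as a pure function of the pass state. `PassState`, `MdiPath`, and `choose_path` are hypothetical names; the threshold value 8 is taken from the `ICB_MIN_DRAW_COUNT` constant in the diff, and treating it as an inclusive minimum is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MdiPath {
    Icb,
    CpuLoop,
}

/// Hypothetical summary of the render-pass state that matters for the gate.
#[derive(Debug, Clone, Copy, Default)]
pub struct PassState {
    pub has_bind_groups: bool,
    pub uses_immediates: bool,
    pub has_active_queries: bool, // timestamp or occlusion
    pub has_resolve_targets: bool,
    pub has_memoryless_attachments: bool,
    pub all_attachments_store: bool,
}

/// Value from the diff; assumed here to be an inclusive minimum.
pub const ICB_MIN_DRAW_COUNT: u32 = 8;

pub fn choose_path(device_supports_icb: bool, state: &PassState, draw_count: u32) -> MdiPath {
    // The pass must be safe to end and restart with store/load semantics.
    let can_suspend = state.all_attachments_store
        && !state.has_resolve_targets
        && !state.has_memoryless_attachments
        && !state.has_active_queries;
    // Only pipeline plus vertex/index buffers are restored into ICB state.
    let simple_state = !state.has_bind_groups && !state.uses_immediates;
    if device_supports_icb && can_suspend && simple_state && draw_count >= ICB_MIN_DRAW_COUNT {
        MdiPath::Icb
    } else {
        MdiPath::CpuLoop
    }
}
```

Any single disqualifying condition sends the batch to the CPU loop, so a pass that binds a material bind group behaves exactly as it did before this PR.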

Gaps:

  • No A11 device can be updated beyond iOS 16, and based on the iPadOS 17 testing I expect those devices to hit similar driver/runtime limits, but direct validation would still be useful.
  • It is possible that Intel / AMD GPU Mac devices can be made to work (or already work), but I just don't have those devices available to me to validate/debug. I did note that Dawn disabled MDI on those GPUs, affirming the conservative gating I had already done. It would be nice to give the same uplift across more devices, especially hand-me-down devices.
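
For reviewers who want the device gating in one place, the rules stated in this description can be summarized as a small decision function. This is only a sketch under simplifying assumptions: `Os`, `GpuClass`, `Adapter`, and `icb_mdi_allowed` are hypothetical names, iPadOS is folded into `Os::Ios`, and the real code works from Metal feature sets and family checks rather than a pre-classified enum.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Os {
    MacOs,
    Ios, // also covers iPadOS in this sketch
    TvOs,
    VisionOs,
    WatchOs,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuClass {
    Apple3,          // e.g. A10X
    Apple5OrMetal3,  // A12+ or Metal3-class mobile/watch/vision hardware
    AppleSiliconMac, // unified-memory Mac
    IntelOrAmdMac,
}

#[derive(Debug, Clone, Copy)]
pub struct Adapter {
    pub os: Os,
    pub os_major: u32,
    pub gpu: GpuClass,
    pub is_virtual: bool,
}

/// Minimum validated OS major version per platform, as listed in the description.
fn validated_min_major(os: Os) -> Option<u32> {
    match os {
        Os::Ios | Os::TvOs => Some(18),
        Os::VisionOs => Some(2),
        Os::WatchOs => Some(11),
        Os::MacOs => None,
    }
}

pub fn icb_mdi_allowed(adapter: &Adapter) -> bool {
    if adapter.is_virtual {
        return false; // paravirtual adapters abort when executing render ICBs
    }
    match (adapter.os, adapter.gpu) {
        (Os::MacOs, GpuClass::AppleSiliconMac) => true,
        (Os::TvOs, GpuClass::Apple3) => adapter.os_major >= 18,
        (os, GpuClass::Apple5OrMetal3) => {
            validated_min_major(os).is_some_and(|min| adapter.os_major >= min)
        }
        // Intel/AMD Macs, Apple3 outside tvOS, and anything unrecognized.
        _ => false,
    }
}
```

Everything not explicitly validated falls through to `false`, which matches the intent that unvalidated hardware stays on the CPU fallback.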

Testing

Local validation:

  • M4 Max / Metal: cargo check -p wgpu-hal --features metal --no-default-features.
  • M4 Max / Metal: cargo check -p wgpu-test --test wgpu-gpu.
  • M4 Max / Metal: WGPU_BACKEND=metal cargo test -p wgpu-test --test wgpu-gpu multi_draw -- --test-threads=1 --nocapture, 13 passed.
  • M4 Max / Metal: WGPU_BACKEND=metal cargo test -p wgpu-test --test wgpu-gpu draw_indirect -- --test-threads=1 --nocapture, 33 passed.
  • Changelog audit: cargo xtask changelog rebecker/trunk.

Device validation:

  • iPhone XS Max / Apple A12 GPU / iOS 18.7.9: temporary aarch64-apple-ios probe app passed GPU-generated non-indexed and indexed MDI after the OS-version-aware mobile ICB gate.
  • A14 / iPhone 12 and M2 / Apple Vision Pro: validated successfully in the target app stack.
  • Apple Watch SE2 / Watch6,10 / watchOS 11.x validated successfully with an MDI + readback accuracy check.
  • Apple TV 4K / Apple A10X GPU / tvOS 18.3: native tvOS MdiIcbProbe app passed GPU-generated non-indexed and indexed MDI through the Metal ICB path on this PR branch; WGPU_METAL_REQUIRE_ICB_MDI=1 was set and the old experimental Apple3 env override was not set.
  • iPad Pro 10.5-inch / Apple A10X GPU / iPadOS 17.7.11: temporary native iPad-only probe app (UIDeviceFamily = 2, MinimumOSVersion = 17.0) passed GPU-generated non-indexed and indexed MDI through the CPU fallback after the OS-version-aware mobile ICB gate.
  • ICB-path proof: temporary fallback trap for draw_count > 1 was enabled during one M4 run and one iPhone XS run; both passed, proving these generated-argument cases did not use CPU-side per-draw fallback. The trap was removed before this PR.

CI notes:

  • The first CI run exposed Apple Paravirtual device SIGABRTs in multi_draw_indirect, multi_draw_indexed_indirect, and both GPU-generated-argument variants. I gated ICB capability exposure on virtual Metal adapters so that CI (and simulators) exercise the existing CPU fallback there, while non-virtual Apple GPUs continue to use ICBs.

Squash or Rebase?

Squash before merge. The branch currently keeps the development and hardening iterations visible for review, but the final upstream landing would be cleaner as one feature commit or a maintainer-curated split.

Checklist

  • I self-reviewed and fully understand this PR.
  • WebGPU implementations built with wgpu may be affected behaviorally (if they enable MDI feature).
  • Validation and feature gates are in place to confine behavioral changes.
  • Tests demonstrate the validation and altered logic works.
  • CHANGELOG.md entries for the user-facing effects of this change are present.
  • The PR is minimal, and doesn't make sense to land as multiple PRs.
  • Commits are logically scoped and individually reviewable.
  • The PR description has enough context to understand the motivation and solution implemented.

@inner-daemons

Collaborator

Holy shit this is large. I will take a look but it might be a while.

In the meantime,

  • Does this address multi_draw_indirect_count and friends?
  • Is this implemented and tested for mesh shaders?
  • Does it handle the draw_index builtin?
  • Is there extensive testing otherwise?

Also, I must ask, to what extent is the code written by LLMs? What about the PR description?

@inner-daemons inner-daemons self-assigned this Jun 5, 2026
@inner-daemons inner-daemons self-requested a review June 5, 2026 06:13
@matthargett

Contributor Author

Holy shit this is large. I will take a look but it might be a while.

Sorry, I did try to keep it to the thinnest meaningful slice that demonstrated e2e uplift in my AR portal on real devices. If you see a natural seam I can split things out on, or if I should disabuse myself of the constraint I imposed on myself that a single PR should bring measurable uplift, just let me know! :D

* Does this address multi_draw_indirect_count and friends?

I didn't deal with multi_draw_indirect_count / multi_draw_indexed_indirect_count yet, as one of the ways to keep the PR's size down. It is intentionally scoped to "just" fixed-count multi_draw_indirect and multi_draw_indexed_indirect. The count-buffer variants have additional semantics and could/should be a separate follow-up.

* Is this implemented and tested for mesh shaders?

Mesh shaders are supported on my local M4 MacBook device, and I verified that existing fixed-count mesh MDI tests pass, but this PR doesn't add a Metal ICB mesh-command generation path. I did try to control scope, so count-buffer MDI and draw_index would need to be follow-ons unless it needs to be one atomic PR/commit from your perspective. (side note: I found mesh shaders to be pretty finicky on 26.0 operating systems when using MSL directly, so I'd anticipate even more physical device testing and iteration)

* Does it handle the draw_index builtin?

nope. In my notebook when I reviewed wgpu, I wrote that Naga’s MSL backend still rejects that builtin. (if that's changed or I was mistaken, do tell me!) would it be useful for me to add a test/fallback assertion so the ICB path cannot give the wrong impression on what it supports?

* Is there extensive testing otherwise?

the hardest part of this stuff is all the on-device testing (especially with ICBs, which can halt/panic a device), and I try to capture all the quirks I ran into so others can avoid some of that tedium on legacy devices that no longer receive active updates.
for testing, I:

  1. made a small MdiIcbTest program that I ran on all the Apple devices I have that verified correct values upon CPU readback
  2. I also integration tested it in the AR portal playground app that runs on my fork of BabylonNative that uses wgpu-native (instead of the GLES-based bgfx, which they don't want to switch from)
  3. I've been building a WASM interpreter-based fantasy console that uses wgpu-native, and I used both indirect draws and indirect compute generated on the GPU to drive some very neat 3D graphics demos and also to push MOD/S3M and SNES SPC sample/song decoding onto the GPU. On the iPhone XS in particular, I had to push as much of the S3M music player's tracker and sample/effects math onto the GPU as possible and use MDI to get decent quality and low battery draw by avoiding the A12 CPU cores.
  4. I tested that on iPhone XS/12, iPad Pro (gen 2), M4 MacBook (not AR), Apple TV 4K 3rd gen (not AR), and Apple Vision Pro, and iterated until I got the performance I was hoping for from MDI. Let me know if you want links to supporting repos, or if reviewers would like to do a video call (or in-person meetup in San Francisco).

Also, I must ask, to what extent is the code written by LLMs? What about the PR description?

I came up with the high-level design for the WASM fantasy console in my notebooks, and MDI was added into that requirements list when I first learned about it at the Khronos meetup at GDC a few years ago. Then I used codex and iterated with it over the course of a few months to push the boundaries of older/weaker Apple GPUs (namely, iPhone XS and Apple Watch SE2/6).

I built up quite a local patch stack (which I reviewed with my own eyes at each step along the way) while creating all these demonstrations (WebAudio WASI using WebGPU in WASM worklets), and then directed my attention back to BabylonJS/BabylonNative which my startup's main product was written in. I had tried several times last year to fix bgfx's swapchains so we could ship a native BabylonJS app to the AVP app store, but then realized I just needed to rip off the bandaid and do the work to integrate wgpu-native. I used Codex to drive that work, but going through all their screenshot tests manually and spot checking accuracy was not left to AI.

Similarly, with the AR portal integration test, I had to physically pick up iPhones and iPads to scan the room after each build (so ARKit could detect the floor), walk into and out of the portal, note problems, and iterate with Codex. Then, I directed Codex to measure e2e framerate/CPU stats, alongside Xcode CPU profiler data (L1/prefetch/branch-prediction misses), and guided it to the optimization result of a 60 fps camera feed + 3D AR projection on iPhone 12. (Some of those optimizations were on the Babylon.js side, and several of those submitted PRs have been merged/released by the nice folks at Microsoft.)

wrt the PR text: I directed Codex to summarize the important hard-won knowledge not represented in the diff, and gave it the bullet outline that it elaborated, but I also edited the PR title and description in the staging PR in my wgpu fork. I then had Claude Code review the fork PR description, commits, etc to look for problems but also ways it could better conform to gfx-rs project PR/issue norms/templates. If you look at my open source contributions across the last ~30 years, I can be very verbose especially when it comes to performance, exploits, and optimization.

So yes, I used multiple AI coding tools, but I personally reviewed and guided each step of the way, and did not post a PR in this project until it looked good (and passed CI) in my fork. Sorry for the wall of text, but I wanted to give a nuanced response so you (and whoever else) can understand this wasn't a one-shot AI prompt on a whim: I have things built with wgpu that I'd like to ship, and getting better FPS per milliwatt of power draw means MDI is key.

@inner-daemons

Collaborator

Ok thank you for the explanations!

@inner-daemons

Collaborator

Looking at this on my phone so pardon any misunderstandings:

  • Mesh shader draw calls are very similar to normal and indexed draw calls, I'm of the opinion that this PR should probably add the same features for all 3 draw calls "families"
  • Do you know how difficult it would be to add support for the draw index built in on top of this? If you don't know that's fine, we can worry about it later
  • Similar for multi draw indirect count & friends
  • AI usage is fine, I just like to know what I'm working with, especially for long descriptions, since AI has a tendency to write summaries of its actions to retroactively justify mistakes and try to portray them as sensibly as possible

@inner-daemons

Collaborator

Also, you should add testing to wgpu's own test suite for every new piece of api surface, if you haven't already.

I would also like to see benchmarks that these indirect command buffers do actually speed things up, though I don't doubt it. This is not a requirement though

Add Metal ICB generation for mesh multi-draw indirect and opt-in count-buffer variants.

Wire the Chromium experimental multi-draw indirect API surface for CTS, add wgpu-owned readback tests for normal/indexed count-buffer draws and mesh MDI, and validate draw-count-buffer offset alignment.
@matthargett

Contributor Author

Looking at this on my phone so pardon any misunderstandings:

* Mesh shader draw calls are very similar to normal and indexed draw calls, I'm of the opinion that this PR should probably add the same features for all 3 draw calls "families"

okay, based on our discord discussion and running more of the Dawn CTS, I expanded the test coverage and added some fallback code for when the feature isn't supported. Now ppl won't be surprised on Metal, and the accurate exception did traverse to my crash reporter (Sentry) in my integration native app

* Do you know how difficult it would be to add support for the draw index built in on top of this? If you don't know that's fine, we can worry about it later

It should be sort of straightforward, modulo discovering silicon/driver quirks across the test devices.

* Similar for multi draw indirect count & friends

same as above, should be straightforward. I can submit these PRs in parallel (each based on the previous PR's branch), if you want to see more of the e2e all at once without the first PR being giant.

* AI usage is fine, I just like to know what I'm working with, especially for long descriptions, since AI has a tendency to write summaries of its actions to retroactively justify mistakes and try to portray them as sensibly as possible

yea, I see that in other repos as well. anyone I've worked with will tell you my biggest weakness is being verbose in my comms. I review everything in my fork before I ever submit work to upstream repos.

@inner-daemons

Collaborator

For anyone else curious, I talked privately with Matt Hargett and he seems like a real and very experienced person.

same as above, should be straightforward. I can submit these PRs in parallel (based on eachother's PR branch), if you want to see more of the e2e all at once without the first PR being giant.

Not your obligation. I was just not sure if this approach would have to be redone for either of those to be implemented in the future, but you don't need to implement those right now.

@inner-daemons inner-daemons left a comment

Collaborator

The final comment here is my main concern. I am happy to have this but that is a serious problem that needs to be addressed.

Also, I think that you were a little too willing to use environment variables. We usually don't accept environment variables as the only way to control behavior, especially in hal. And these cases are more hacky from what I understand.

Overall, I look forward to having this PR in eventually, and I'm grateful to you for putting in the effort to move towards that.

Collaborator

Thank you! I always worry about my precious mesh shaders being stepped on :p

Collaborator

So so glad we have more in depth testing of this

Comment thread deno_webgpu/adapter.rs
Comment on lines +99 to +100
wgpu_types::Features::MULTI_DRAW_INDIRECT_COUNT,
features.contains(wgpu_types::Features::MULTI_DRAW_INDIRECT_COUNT),

Collaborator

Maybe note that this is exposed as a chromium feature

Comment on lines +927 to +928
#[error("Indirect draw count buffer offset {0:?} is not a multiple of 4")]
UnalignedIndirectCountBufferOffset(BufferAddress),

Collaborator

IMO there should be a limit for the indirect args alignment, rather than a hardcoded value.
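
One way to read this suggestion, sketched with hypothetical names: the alignment comes from a limits struct instead of a literal 4, and the error reports which alignment was violated. `IndirectLimits`, its field, and the extra `alignment` field on the error are assumptions for illustration and are not part of the diff; `BufferAddress` is modeled as `u64`, as in wgpu.

```rust
pub type BufferAddress = u64;

/// Hypothetical limit; this PR currently hardcodes 4.
#[derive(Debug, Clone, Copy)]
pub struct IndirectLimits {
    pub min_indirect_count_offset_alignment: BufferAddress,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DrawError {
    UnalignedIndirectCountBufferOffset {
        offset: BufferAddress,
        alignment: BufferAddress,
    },
}

pub fn validate_count_buffer_offset(
    offset: BufferAddress,
    limits: &IndirectLimits,
) -> Result<(), DrawError> {
    let alignment = limits.min_indirect_count_offset_alignment;
    assert!(alignment > 0, "alignment limit must be non-zero");
    if offset % alignment != 0 {
        return Err(DrawError::UnalignedIndirectCountBufferOffset { offset, alignment });
    }
    Ok(())
}
```

With the limit set to 4 this behaves like the hardcoded check, while a backend with stricter requirements could raise the limit without touching the validation code.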

Comment on lines +640 to +650
// ICB support has been empirically validated on A12/iOS 18+ and
// S8/watchOS 11+ hardware even where older public tables lag behind.
// Keep this narrower than `family_check` so watchOS does not inherit
// unrelated family-based feature exposure.
let icb_family_check = available!(
macos = 10.15,
ios = 13.0,
tvos = 13.0,
visionos = 1.0,
watchos = 11.0
);

Collaborator

This kind of feature check makes me uncomfortable, since in the past it has led to tons of ridiculous issues with features being incorrectly exposed.

Comment on lines +700 to +703
let force_icb_mdi_on_macos = std::env::var("WGPU_METAL_FORCE_ICB_MDI")
.is_ok_and(|value| value == "1")
&& os_type == super::OsType::Macos
&& !is_virtual;

Collaborator

Is this needed? Also makes me feel uncomfortable
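
If the override is kept at all, one possible shape (purely a sketch, not something this PR or wgpu-hal provides) is an explicit option that takes precedence, with the environment variable consulted only when the option is left on its default. `IcbMdiOverride` and both function names are hypothetical; the `== "1"`, macOS-only, and non-virtual conditions mirror the snippet quoted above.

```rust
/// Hypothetical explicit control, e.g. carried in backend options.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum IcbMdiOverride {
    #[default]
    Auto,
    ForceOn,
    ForceOff,
}

/// Matches the quoted check: only the exact value "1" enables the flag.
pub fn env_flag_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

pub fn force_icb_mdi_on_macos(
    option: IcbMdiOverride,
    env_value: Option<&str>,
    is_macos: bool,
    is_virtual: bool,
) -> bool {
    let requested = match option {
        IcbMdiOverride::ForceOn => true,
        IcbMdiOverride::ForceOff => false,
        // Fall back to the environment only when nothing explicit was set.
        IcbMdiOverride::Auto => env_flag_enabled(env_value),
    };
    requested && is_macos && !is_virtual
}
```

A caller could pass `std::env::var("WGPU_METAL_FORCE_ICB_MDI").ok().as_deref()` as `env_value`, which keeps the environment lookup out of the decision logic and makes it testable.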

Comment on lines +1338 to +1343
features.set(
F::MULTI_DRAW_INDIRECT_COUNT,
std::env::var("WGPU_METAL_ENABLE_ICB_MDI_COUNT").is_ok_and(|value| value == "1")
&& self.indirect_command_buffers_rendering
&& self.indirect_command_buffers_compute,
);

Collaborator

I don't think this belongs here.

Comment on lines 34 to +43
const WORD_SIZE: usize = 4;
const ICB_MIN_DRAW_COUNT: u32 = 8;
const ICB_DIAGNOSTIC_REQUIRE_ENV: &str = "WGPU_METAL_REQUIRE_ICB_MDI";
const ICB_PRIMITIVE_POINT: u32 = 0;
const ICB_PRIMITIVE_LINE: u32 = 1;
const ICB_PRIMITIVE_LINE_STRIP: u32 = 2;
const ICB_PRIMITIVE_TRIANGLE: u32 = 3;
const ICB_PRIMITIVE_TRIANGLE_STRIP: u32 = 4;

const ICB_GENERATION_SHADER: &str = r#"

Collaborator

Each of these probably warrants a comment
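
A sketch of what the commented primitive constants and the explicit mapping might look like. The constant values are the ones in the diff above; the local `PrimitiveTopology` enum and `icb_primitive_code` are stand-ins for illustration, and the comments describe the intent stated in the PR description rather than verified Metal enum values.

```rust
/// Local stand-in for the WebGPU primitive topology in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveTopology {
    PointList,
    LineList,
    LineStrip,
    TriangleList,
    TriangleStrip,
}

// Private codes shared between the Rust side and the ICB generation shader.
// They are deliberately independent of the integer values of MTLPrimitiveType
// and MSL primitive_type, so neither is assumed to match the other.
pub const ICB_PRIMITIVE_POINT: u32 = 0;
pub const ICB_PRIMITIVE_LINE: u32 = 1;
pub const ICB_PRIMITIVE_LINE_STRIP: u32 = 2;
pub const ICB_PRIMITIVE_TRIANGLE: u32 = 3;
pub const ICB_PRIMITIVE_TRIANGLE_STRIP: u32 = 4;

/// Explicit mapping; adding a topology variant forces this match to be updated.
pub fn icb_primitive_code(topology: PrimitiveTopology) -> u32 {
    match topology {
        PrimitiveTopology::PointList => ICB_PRIMITIVE_POINT,
        PrimitiveTopology::LineList => ICB_PRIMITIVE_LINE,
        PrimitiveTopology::LineStrip => ICB_PRIMITIVE_LINE_STRIP,
        PrimitiveTopology::TriangleList => ICB_PRIMITIVE_TRIANGLE,
        PrimitiveTopology::TriangleStrip => ICB_PRIMITIVE_TRIANGLE_STRIP,
    }
}
```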

Comment on lines +44 to +46
#include <metal_stdlib>
#include <metal_command_buffer>
using namespace metal;

Collaborator

These shaders should be in their own files and include!ed

Collaborator

This is a huge refactor so I didn't read all of the code because of a more pressing issue:

From what I understand, your current method is to stop the render pass whenever an indirect command buffer is needed and then switch back. This de-parallelizes a lot of render passes and wastes a ton of time on re-binding crap, not to mention switching between pipelines. In its current state I bet this PR would actually slow down most workloads for this reason.

I think that a better method would be to record all indirect command buffers in the same compute pass before any render pass is started. That would probably make this a lot faster and also cut down on the size of this PR. What do you think?

Contributor Author

I agree, but that felt like a bigger change in wgpu architecture. If you all can discuss amongst yourselves and determine if your idea is preferred, I can handle it once I'm back from vacation :)

Collaborator

I've added it to the meeting discussion list. I won't be there for the next 2 weeks to bring it up myself but others may decide to (or may skip it as sometimes happens).

Member

cc @cwfitzgerald @teoxoy
They seem likely to have opinions.

Member

We already have infrastructure for doing similar things in wgpu-core as we need to do indirect draw call validation, so this can likely use the same infrastructure. I agree that this should ideally happen once (or once per-indirect-buffer) per renderpass.
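
To make the "once per indirect buffer per render pass" idea concrete, here is a minimal planning sketch. It does not use wgpu-core's existing indirect-validation infrastructure and every type in it is hypothetical: the point is only that the multi-draw calls recorded in a pass can be grouped by indirect buffer up front, so ICB generation can run in one compute pass before the render pass begins instead of suspending the render pass at each draw.

```rust
use std::collections::BTreeMap;

/// Hypothetical handle for an indirect argument buffer.
pub type BufferId = u32;

/// One multi-draw indirect call recorded in a render pass.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndirectDraw {
    pub buffer: BufferId,
    pub offset: u64,
    pub draw_count: u32,
    pub indexed: bool,
}

/// All generation work for one indirect buffer, to be encoded together.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GenerationJob {
    pub buffer: BufferId,
    pub draws: Vec<IndirectDraw>,
}

/// Groups a pass's indirect draws by buffer, preserving recording order
/// within each buffer. The result is what a single pre-pass would encode.
pub fn plan_generation(draws: &[IndirectDraw]) -> Vec<GenerationJob> {
    let mut by_buffer: BTreeMap<BufferId, Vec<IndirectDraw>> = BTreeMap::new();
    for draw in draws {
        by_buffer.entry(draw.buffer).or_default().push(draw.clone());
    }
    by_buffer
        .into_iter()
        .map(|(buffer, draws)| GenerationJob { buffer, draws })
        .collect()
}
```

Under this scheme the render pass itself would only execute already-generated ICBs, so it would never need to be ended and restarted mid-pass.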

4 participants