use Metal's Indirect Command Buffers for true GPU-side multi draw indirect#9640
use Metal's Indirect Command Buffers for true GPU-side multi draw indirect#9640matthargett wants to merge 19 commits into
Conversation
|
Holy shit this is large. I will take a look but it might be a while. In the meantime,
Also, I must ask, to what extent is the code written by LLMs? What about the PR description? |
Sorry, I did try to keep it to the thinnest meaningful slice that demonstrated e2e uplift in my AR portal on real devices. if you see a natural seam I can split things out on, or I should disabuse myself of the constraint I imposed on myelf that a single PR bringing measurable uplift, just let me know! :D
I didn't deal with multi_draw_indirect_count / multi_draw_indexed_indirect_count yet, as one of the ways to keep the PR's size down. It is intentionally scoped to "just" fixed-count multi_draw_indirect and multi_draw_indexed_indirect. The count-buffer variants have additional semantics and could/should be a separate follow-up.
Mesh shaders are supported on my local M4 MacBook device, and I verified that existing fixed-count mesh MDI tests pass, but this PR doesn't add a Metal ICB mesh-command generation path. I did try to control scope, so count-buffer MDI and draw_index would need to be follow-ons unless it needs to be one atomic PR/commit from your perspective. (side note: I found mesh shaders to be pretty finicky on 26.0 operating systems when using MSL directly, so I'd anticipate even more physical device testing and iteration)
nope. In my notebook when I reviewed wgpu, I wrote that Naga’s MSL backend still rejects that builtin. (if that's changed or I was mistaken, do tell me!) would it be useful for me to add a test/fallback assertion so the ICB path cannot give the wrong impression on what it supports?
the hardest part of this stuff is all the on-device testing (especially with ICBs which can halt/panic a device), and I try to capture all the quirks I ran into so others can avoid the tediousness of some of it on legacy devices that don't receive active update.
I came up with the high-level design for the WASM fantasy console in my notebooks, and MDI was added into that requirements list when I first learned about it at the Khronos meetup at GDC a few years ago. Then I used codex and iterated with it over the course of a few months to push the boundaries of older/weaker Apple GPUs (namely, iPhone XS and Apple Watch SE2/6). I built up quite a local patch stack (which I reviewed with my own eyes at each step along the way) while creating all these demonstrations (WebAudio WASI using WebGPU in WASM worklets), and then directed my attention back to BabylonJS/BabylonNative which my startup's main product was written in. I had tried several times last year to fix bgfx's swapchains so we could ship a native BabylonJS app to the AVP app store, but then realized I just needed to rip off the bandaid and do the work to integrate wgpu-native. I used Codex to drive that work, but going through all their screenshot tests manually and spot checking accuracy was not left to AI. Similarly, with the AR portal integration test, I had to physically pick up iPhones and iPads to scan the room after each build (so ARkit could detect the floor), walk into an out of the portal, note problems, and iterate with Codex. Then, I directed Codex to measure e2e framerate/CPU stats, alongside XCode CPU profiler data (L1/prefetch/branch-prediction misses), and guided it to the optimization result of 60 fps camera feed+3D AR projection on iPhone 12. (Some of those optimizations were on the Babylon.js side, and several of those submitted PRs have been merged/released by the nice folks at Microsoft.) wrt the PR text: I directed Codex to summarize the important hard-won knowledge not represented in the diff, and gave it the bullet outline that it elaborated, but I also edited the PR title and description in the staging PR in my wgpu fork. I then had Claude Code review the fork PR description, commits, etc to look for problems but also ways it could better conform to gfx-rs project PR/issue norms/templates. If you look at my open source contributions across the last ~30 years, I can be very verbose especially when it comes to performance, exploits, and optimization. So yes, I used multiple AI coding tools, but I personally reviewed and guided each step of the way, and did not post a PR in this project until it looked good (and passed CI) in my fork. Sorry for the wall of text, but I wanted to give a nuanced response so you (and whomever else) can understand this wasn't a one-shot AI prompt on a whim: I have things built with wgpu that I'd like to ship, and getting better FPS per milliwatt of power draw means MDI is key. |
|
Ok thank you for the explanations! |
|
Looking at this on my phone so pardon any misunderstandings:
|
|
Also, you should add testing to wgpu's own test suite for every new piece of api surface, if you haven't already. I would also like to see benchmarks that these indirect command buffers do actually speed things up, though I don't doubt it. This is not a requirement though |
Add Metal ICB generation for mesh multi-draw indirect and opt-in count-buffer variants. Wire the Chromium experimental multi-draw indirect API surface for CTS, add wgpu-owned readback tests for normal/indexed count-buffer draws and mesh MDI, and validate draw-count-buffer offset alignment.
okay, based on our discord discussion and running more of the Dawn CTS, I expanded the test coverage and added some fallback code for when the feature isn't supported. Now ppl won't be surprised on Metal, and the accurate exception did traverse to my crash reporter (Sentry) in my integration native app
It should be sort of straightforward, modulo discovering silicon/driver quirks across the test devices.
same as above, should be straightforward. I can submit these PRs in parallel (based on eachother's PR branch), if you want to see more of the e2e all at once without the first PR being giant.
yea, I see that in other repos as well. anyone I've worked with will tell you my biggest weakness is being verbose in my comms. I review everything in my fork before I ever submit work to upstream repos. |
|
For anyone else curious, I talked privately with Matt Hargett and he seems like a real and very experienced person.
Not your obligation. I was just not sure if this approach would have to be redone for either of those to be implemented in the future, but you don't need to implement those right now. |
inner-daemons
left a comment
There was a problem hiding this comment.
The final comment here is my main concern. I am happy to have this but that is a serious problem that needs to be addressed.
Also, I think that you were a little too willing to use environment variables. We usually don't accept environment variables as the only way to control behavior, especially in hal. And these cases are more hacky from what I understand.
Overall, I look forward to having this PR in eventually, and I'm grateful to you for putting in the effort to move towards that.
There was a problem hiding this comment.
Thank you! I always worry about my precious mesh shaders being stepped on :p
There was a problem hiding this comment.
So so glad we have more in depth testing of this
| wgpu_types::Features::MULTI_DRAW_INDIRECT_COUNT, | ||
| features.contains(wgpu_types::Features::MULTI_DRAW_INDIRECT_COUNT), |
There was a problem hiding this comment.
Maybe note that this is exposed as a chromium feature
| #[error("Indirect draw count buffer offset {0:?} is not a multiple of 4")] | ||
| UnalignedIndirectCountBufferOffset(BufferAddress), |
There was a problem hiding this comment.
IMO there should be a limit for the indirect args alignment, rather than a hardcoded value.
| // ICB support has been empirically validated on A12/iOS 18+ and | ||
| // S8/watchOS 11+ hardware even where older public tables lag behind. | ||
| // Keep this narrower than `family_check` so watchOS does not inherit | ||
| // unrelated family-based feature exposure. | ||
| let icb_family_check = available!( | ||
| macos = 10.15, | ||
| ios = 13.0, | ||
| tvos = 13.0, | ||
| visionos = 1.0, | ||
| watchos = 11.0 | ||
| ); |
There was a problem hiding this comment.
This kind of feature check makes me uncomfortable, since in the past it has led to tons of ridiculous issues with features being incorrectly exposed.
| let force_icb_mdi_on_macos = std::env::var("WGPU_METAL_FORCE_ICB_MDI") | ||
| .is_ok_and(|value| value == "1") | ||
| && os_type == super::OsType::Macos | ||
| && !is_virtual; |
There was a problem hiding this comment.
Is this needed? Also makes me feel uncomfortable
| features.set( | ||
| F::MULTI_DRAW_INDIRECT_COUNT, | ||
| std::env::var("WGPU_METAL_ENABLE_ICB_MDI_COUNT").is_ok_and(|value| value == "1") | ||
| && self.indirect_command_buffers_rendering | ||
| && self.indirect_command_buffers_compute, | ||
| ); |
There was a problem hiding this comment.
I don't think this belongs here.
| const WORD_SIZE: usize = 4; | ||
| const ICB_MIN_DRAW_COUNT: u32 = 8; | ||
| const ICB_DIAGNOSTIC_REQUIRE_ENV: &str = "WGPU_METAL_REQUIRE_ICB_MDI"; | ||
| const ICB_PRIMITIVE_POINT: u32 = 0; | ||
| const ICB_PRIMITIVE_LINE: u32 = 1; | ||
| const ICB_PRIMITIVE_LINE_STRIP: u32 = 2; | ||
| const ICB_PRIMITIVE_TRIANGLE: u32 = 3; | ||
| const ICB_PRIMITIVE_TRIANGLE_STRIP: u32 = 4; | ||
|
|
||
| const ICB_GENERATION_SHADER: &str = r#" |
There was a problem hiding this comment.
Each of these probably warrants a comment
| #include <metal_stdlib> | ||
| #include <metal_command_buffer> | ||
| using namespace metal; |
There was a problem hiding this comment.
These shaders should be in their own files and include!ed
There was a problem hiding this comment.
This is a huge refactor so I didn't read all of the code because of a more pressing issue:
From what I understand, your current method is to stop the render pass whenever an indirect command buffer is needed and then switch back. This results in de-paralellizing a lot of render passes, and a ton of wasted on time on re-binding crap, not to mention switching between pipelines. In its current state I bet this PR would actually slow down most workloads for this reason.
I think that a better method would be to record all indirect command buffers in the same compute pass before any render pass is started. That would probably make this a lot faster and also cut down on the size of this PR. What do you think?
There was a problem hiding this comment.
I agree, but that felt like a bigger change in wgpu architecture. If you all can discuss amongst yourselves and determine if your idea is preferred, I can handle it once I'm back from vacation :)
There was a problem hiding this comment.
Ive added it to the meeting discussion list. I won't be there for the next 2 weeks to bring it up myself but others may decide to (or may skip it as sometimes happens).
There was a problem hiding this comment.
cc @cwfitzgerald @teoxoy
They seem likely to have opinions.
There was a problem hiding this comment.
We already have infrastructure for doing similar things in wgpu-core as we need to do indirect draw call validation, so this can likely use the same infrastructure. I agree that this should ideally happen once (or once per-indirect-buffer) per renderpass.
Connections
Reference workload: this was motivated by BabylonNative WebGPU rendering work where CPU-side multi-draw looping became visible in large batched scenes and AR rendering. The relevant application cases were AR Portal camera + rendering and (Back to the Future) Hill Valley-style GLTF's baked/static geometry batches that included PBR materials and a WGSL SSAO workload.
AR portal demo reference:
https://doc.babylonjs.com/features/featuresDeepDive/webXR/webXRDemos#ar-demo
Description
This ports Metal fixed-count multi-draw indirect to a real indirect-command-buffer path when the render pass can be safely suspended and resumed, instead of always looping over draws on the CPU.
The ICB fast path is intentionally conservative. It currently applies to larger static/material-baked multi-draw batches whose render state is pipeline plus vertex/index buffers only. Passes that bind textures, uniforms, or storage through render bind groups, or use immediates, stay on the CPU-loop fallback. This keeps common material-heavy passes safe while still reducing bridge/encoder overhead for large baked-geometry batches.
Changes:
iOS/iPadOS 18,tvOS 18,visionOS 2,watchOS 11), plus Apple3/A10X on tvOS 18+ after first-generation Apple TV 4K validation. Apple3 on iOS/iPadOS remains on the CPU fallback.MTLPrimitiveTypeinteger values matching MSLprimitive_typevalues.dispatchThreadgroups, using the compute pipeline'sthreadExecutionWidth()and an in-shadercmdIndex >= drawCountguard instead of relying on non-uniform threadgroups fromdispatchThreads.wgpu-gpuconformance coverage for GPU-generated non-indexed and indexed multi-draw indirect arguments.base_vertex, non-zerofirst_vertex/first_instance, draw counts crossing the ICB generation workgroup width, mixed single/multi indirect sequencing in one render pass, and bind-group forced fallback.Hardware and driver quirks found:
iPad7,3) on iPadOS 17.7 advertises the legacy iOS GPUFamily3 ICB feature set, but throws a foreignNSExceptionfrom thedraw_indirectpath before ICB generation while ending/suspending the active render encoder for the ICB splice. This PR therefore keeps Apple3/A10X on the iOS/iPadOS CPU-loop fallback even if a later runtime reports related feature sets.AppleTV6,2) on tvOS 18.3 passed the native tvOSMdiIcbProbeworkload on this PR branch without the previous experimental Apple3 override. The probe forced the ICB path, hitmulti_draw_indirectandmulti_draw_indexed_indirectMetal ICB diagnostics, and validated GPU-generated indirect arguments plus the render target by CPU readback.dispatchThreadgroupssized bythreadExecutionWidth(), with rounded-up extra threads neutralized by an in-shader draw-count guard.MetalSpeedprobe. The probe used a Metal shared-buffer readback for CPU-side pixel verification, while the ICB path itself remains GPU-side command generation plus indirect draw execution.MTLIndirectCommandBufferDescriptorinheritance defaults are usable, but calling optional generated objc2 setters such assetInheritDepthStencilState:can crash because the selector is not present on that runtime. The implementation relies on default inheritance instead of calling those optional setters.maxVertexBufferBindCountwas set from the currently bound vertex-buffer count. The fragment bind count is currently set to0because this fast path rejects active render bind groups and immediates and does not restore fragment buffer bindings into ICB state.Apple Paravirtual device, Metal reports ICB-related feature sets/families and the debug dump showedindirect_command_buffers_rendering: trueandindirect_command_buffers_compute: true, but all MDI cases SIGABRTed when the ICB path executed. This PR treats virtual adapters like the existing mesh-shader virtual-device quirk and keeps them on the CPU fallback until we see that simulator/virtualized driver path gets fixed by Apple.Gaps:
Testing
Local validation:
cargo check -p wgpu-hal --features metal --no-default-features.cargo check -p wgpu-test --test wgpu-gpu.WGPU_BACKEND=metal cargo test -p wgpu-test --test wgpu-gpu multi_draw -- --test-threads=1 --nocapture, 13 passed.WGPU_BACKEND=metal cargo test -p wgpu-test --test wgpu-gpu draw_indirect -- --test-threads=1 --nocapture, 33 passed.cargo xtask changelog rebecker/trunk.Device validation:
aarch64-apple-iosprobe app passed GPU-generated non-indexed and indexed MDI after the OS-version-aware mobile ICB gate.MdiIcbProbeapp passed GPU-generated non-indexed and indexed MDI through the Metal ICB path on this PR branch;WGPU_METAL_REQUIRE_ICB_MDI=1was set and the old experimental Apple3 env override was not set.UIDeviceFamily = 2,MinimumOSVersion = 17.0) passed GPU-generated non-indexed and indexed MDI through the CPU fallback after the OS-version-aware mobile ICB gate.draw_count > 1was enabled during one M4 run and one iPhone XS run; both passed, proving these generated-argument cases did not use CPU-side per-draw fallback. The trap was removed before this PR.CI notes:
Apple Paravirtual deviceSIGABRTs inmulti_draw_indirect,multi_draw_indexed_indirect, and both GPU-generated-argument variants. I gated ICB capability exposure on virtual Metal adapters so CI exercises (and simulators) the existing CPU fallback there while unfettered Apple GPUs continue to use ICBs.Squash or Rebase?
Squash before merge. The branch currently keeps the development and hardening iterations visible for review, but the final upstream landing would be cleaner as one feature commit or a maintainer-curated split.
Checklist
wgpumay be affected behaviorally (if they enable MDI feature).CHANGELOG.mdentries for the user-facing effects of this change are present.