Conversation

@swoods-nv
Contributor

No description provided.

@swoods-nv
Contributor Author

I need to add a few sentences about performance delta -- I'm seeing ~10 iterations per second improvement with wave balloting vs. the original diffsplat on my Windows 11/RTX 2070 machine.

Contributor

@ArielG-NV ArielG-NV left a comment

I like the blog post and I find it easy to digest (despite my limited background in Gaussian Splatting).

Given the goal of the blog is not capabilities I think the introduction of the feature is well done.

```
return pixelState.value;
}
```
The first thing you’ll likely notice is that this function carries additional annotations compared to the functions in the original diff-splatting example. The `[require (subgroup_ballot)]` and `[require (subgroup_vote)]` annotations use Slang’s **capability system** to indicate that this function requires these optional capabilities to be supported. The Slang compiler is able to identify whether the target it is currently compiling for supports these capabilities, and if not, it will provide a warning. For example, a shader targeting HLSL Shader Model 5 with these capability requirements would result in:
Contributor

For most of the code written by our users, we don't expect them to use explicit [require(capability)] decorations, so we may want to omit this part in the blog.

For ordinary users, the only place they need to put the [require] attribute is on the entrypoints, and that is only when they want the compiler to enforce that the entrypoint isn't using more capabilities than it is intended for.

If you define an entrypoint without [require], the compiler will automatically upgrade its requirement to whatever it uses.

Contributor Author

@swoods-nv swoods-nv May 22, 2025

I've updated the blog post to just cover the fact that the Slang compiler will detect and issue a warning if you try to compile this for a profile that doesn't include wave intrinsic support.

The first thing you’ll likely notice is that this function carries additional annotations compared to the functions in the original diff-splatting example. The `[require (subgroup_ballot)]` and `[require (subgroup_vote)]` annotations use Slang’s **capability system** to indicate that this function requires these optional capabilities to be supported. The Slang compiler is able to identify whether the target it is currently compiling for supports these capabilities, and if not, it will provide a warning. For example, a shader targeting HLSL Shader Model 5 with these capability requirements would result in:

```
myshader.slang(9): warning 41012: entry point 'computeMain' uses additional capabilities that are not part of the specified profile 'sm_5_0'. The profile setting is automatically updated to include these capabilities: 'sm_6_0'
```
Contributor

This warning will appear without marking the above functions with [require], because [require] was already marked on WaveActive** functions and it just propagates all the way through to the entrypoint.

So how does this shader use wave intrinsics?

Instead of a multi-pass approach (first identifying intersecting blobs for the current tile, sorting them, and then calculating colors from the shorter list of blobs), we’re now using a single pass through the set of Gaussians, processing them in workgroup-sized chunks. Within each chunk, each lane (a thread within the wave) is assigned a single Gaussian and tests whether it intersects the current tile bounds. The crucial improvement here is the `WaveActiveBallot(intersects).x` call. This takes the boolean intersection result from each active lane in the wave and creates a bitmask. All of the lanes in the wave can access the bitmask, so every lane knows which Gaussians in the chunk being processed are relevant. The code then iterates through the set bits of this mask, which we’ve called `intersectionMask`. For each intersecting Gaussian, its contribution is evaluated and immediately alpha-blended. We still store the indices of the intersecting blobs, because we will need them during the custom backward pass.
One benefit of this approach is that we no longer need to do an explicit workgroup-wide sort. Because we keep the blobs in order during processing, we maintain the order needed for alpha blending. Additionally, we no longer need an atomic counter, with the contention it can introduce, when we increment the number of intersecting blobs and write the index to the blob list. This might look problematic at first glance, because all of the lanes are writing to the same `intersectingBlobList` in shared memory. But we don’t need to worry about data collisions here because of how this data is produced. Each lane has its own copy of `numIntersectingBlobs`, so that variable does not need to be atomically incremented. Each lane also operates on the same value of `intersectionMask`, calculated using `WaveActiveBallot`. As a result, all lanes store the same indices in the same order into `intersectingBlobList`, so while this is technically a data race, it’s a benign one.
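The ballot-and-iterate step described above can be modeled on the CPU. The following is a minimal Python sketch, not the blog post's Slang code: `wave_active_ballot`, `set_bit_indices`, and `blobs_seen_by_lane` are hypothetical helper names, the wave is assumed to have 8 lanes for readability, and a single Python integer stands in for the `.x` component of the real ballot result.

```python
def wave_active_ballot(predicates):
    """CPU model of a wave ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def set_bit_indices(mask):
    """Yield the indices of the set bits in mask, lowest bit first."""
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit
        yield lowest.bit_length() - 1  # convert that bit to a lane index
        mask &= mask - 1               # clear the lowest set bit


def blobs_seen_by_lane(chunk_start, mask):
    """Blob indices one lane would append for this chunk, in blob order."""
    return [chunk_start + lane for lane in set_bit_indices(mask)]


# Hypothetical chunk starting at blob 32: lanes 1, 4 and 5 intersect the tile.
hits = [False, True, False, False, True, True, False, False]
intersection_mask = wave_active_ballot(hits)

# Every lane walks the same mask, so every lane derives the same list.
per_lane_lists = [blobs_seen_by_lane(32, intersection_mask) for _ in hits]
```

Because each simulated lane derives an identical list from the shared mask, concurrent writes of these lists would store the same values to the same slots, which is the sense in which the race is described as benign.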
Contributor

I don't quite understand why we need to introduce any data races at all.

We should instead do

   int idx = WavePrefixSum(intersects?1:0);

and then each lane just writes to intersectingBlobList[idx]. There will be no data races if done this way.
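For readers unfamiliar with prefix-sum compaction, here is a minimal CPU sketch of the suggestion in Python. It is an illustration under assumptions, not code from the PR: `wave_prefix_sum` models an exclusive prefix sum (each lane receives the sum over lower-indexed lanes), `compact_chunk` is a hypothetical helper, only intersecting lanes are assumed to write, and a preallocated list stands in for `intersectingBlobList`.

```python
def wave_prefix_sum(values):
    """Exclusive prefix sum: lane i receives the sum of values from lanes 0..i-1."""
    result, running = [], 0
    for v in values:
        result.append(running)
        running += v
    return result


def compact_chunk(chunk_start, intersects, blob_list, blob_count):
    """Each intersecting lane writes its blob index into its own slot.

    Returns the updated blob count (old count plus the number of hits).
    """
    flags = [1 if hit else 0 for hit in intersects]
    slots = wave_prefix_sum(flags)
    for lane, hit in enumerate(intersects):
        if hit:
            # Distinct lanes get distinct slots, so no two lanes share a write.
            blob_list[blob_count + slots[lane]] = chunk_start + lane
    return blob_count + sum(flags)


# Hypothetical data: two 8-lane chunks written into a fixed-size list.
hits = [False, True, False, False, True, True, False, False]
blob_list = [-1] * 16
count = compact_chunk(32, hits, blob_list, 0)
count = compact_chunk(40, [True] + [False] * 7, blob_list, count)
```

In this model the running count is the previous count plus the number of set flags in the chunk, and blob order is preserved because slots increase with lane index.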

Contributor Author

I still need to increment the overall number of intersecting blobs (intersectingBlobCount), which is used in the backwards pass. With all of the lanes calculating the same value for numIntersectingBlobs, it can be done non-atomically -- but still a technical data race. If I'm not calculating numIntersectingBlobs at all, then I would need to make intersectingBlobCount an atomic, which then complicates its use elsewhere.

@swoods-nv
Contributor Author

After discussion with Yong: the wave intrinsic example only works in cases where there's only one workgroup per dispatch. We should wait to post this until the dispatch shape is controllable from SlangPy, so this is blocked on shader-slang/slangpy#72

@swoods-nv
Contributor Author

Blog post updated with call group shape info -- @csyonghe could you re-review?

Contributor

@csyonghe csyonghe left a comment

Looks good to me.

@swoods-nv swoods-nv merged commit ba1a9c3 into shader-slang:main Jul 17, 2025
@swoods-nv swoods-nv deleted the wave-intrinsic-blog-post branch July 17, 2025 16:00
Development

Successfully merging this pull request may close these issues.

Create blog post for improving performance of 2D splatting using wave intrinsics

3 participants