Add wave intrinsic blog post #106
I need to add a few sentences about the performance delta -- I'm seeing an improvement of ~10 iterations per second with wave balloting vs. the original diffsplat on my Windows 11/RTX 2070 machine.
ArielG-NV
left a comment
I like the blog post and I find it easy to digest (despite my limited background in Gaussian Splatting).
Given that the goal of the blog is not capabilities, I think the introduction of the feature is well done.
| return pixelState.value; | ||
| } | ||
| ``` | ||
| The first thing you’ll likely notice is that this function carries additional annotations compared to the functions in the original diff-splatting example. The `[require (subgroup_ballot)]` and `[require (subgroup_vote)]` annotations use slang’s **capability system** to indicate that this function requires this optional capability to be supported. The Slang compiler is able to identify whether the target it is currently compiling for supports these capabilities, and if not, it will provide a warning. For example, a shader targeting HLSL Shader Model 5 with these capability requirements would result in: |
For most of the code written by our users, we don't expect them to use explicit [require(capability)] decorations, so we may want to omit this part in the blog.
For ordinary users, the only place they need to put a [require] attribute is on the entrypoints, and that is only when they want the compiler to enforce that the entrypoint isn't using more capabilities than it is intended for.
If you define an entrypoint without [require], the compiler will automatically upgrade its requirement to whatever it uses.
I've updated the blog post to just cover the fact that the Slang compiler will detect and issue a warning if you try to compile this for a profile that doesn't include wave intrinsic support.
| ``` | ||
| myshader.slang(9): warning 41012: entry point 'computeMain' uses additional capabilities that are not part of the specified profile 'sm_5_0'. The profile setting is automatically updated to include these capabilities: 'sm_6_0' |
This warning will appear even without marking the above functions with [require], because [require] is already marked on the WaveActive* functions and it propagates all the way through to the entrypoint.
| So how does this shader use wave intrinsics? | ||
|
|
||
| Instead of a multi-pass approach– first identifying intersecting blobs for the current tile, sorting them, and then calculating colors from the shorter list of blobs, we’re now using a single pass through the set of Gaussians to process them all, in workgroup-sized chunks. Within each chunk, each lane (a thread within the wave) is assigned a single Gaussian, and tests whether it intersects the current tile bounds. The crucial improvement here is the `WaveActiveBallot(intersects).x` call. This takes the boolean intersection result from each active lane in the wave, and creates a bitmask. All of the lanes in the wave can access the bitmask, and can therefore understand which Gaussians in the chunk being processed are relevant. The code then iterates through the set bits of this mask, which we’ve called `intersectionMask`. For each intersection Gaussian, its contribution is evaluated, and immediately alpha-blended. We still store the indices for the intersecting blobs, because we will still need them during the custom backward pass. | ||
| One benefit of this approach is that we no longer need to do an explicit workgroup-wide sort. Because we keep the blobs in order during processing, we maintain the needed order for alpha blending. Additionally, we no longer need to use an atomic counter– and thereby introduce the possibility of contention– when we increment the number of intersecting blobs and write the index to the blob list. This might look problematic at first glance, because all of the lanes are writing to the same `intersectingBlobList` in shared memory. But we don’t need to worry about data collisions here because of how we’re coming up with this data. Each lane has its own copy of numIntersectingBlobs, so that variable does not need to be atomically incremented. And each lane also will be operating on the same value in `intersectionMask`, calculated using `WaveActiveBallot`. For this reason, all lanes are storing the same indices in the same order into `intersectingBlobList`, so while technically this is a data race, it’s a benign one. |
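To make the ballot step described above concrete, here is a minimal CPU-side sketch in Python rather than shader code. The helpers `wave_active_ballot`, `set_bits_in_order` and `process_chunk`, the 32-lane wave width, and the sample values are all hypothetical stand-ins for illustration; only `WaveActiveBallot` and `intersectionMask` come from the post itself.

```python
WAVE_SIZE = 32  # assumed wave width for this sketch; real hardware varies


def wave_active_ballot(flags):
    """CPU stand-in for WaveActiveBallot(...).x: bit i is set when lane i voted true."""
    mask = 0
    for lane, flag in enumerate(flags):
        if flag:
            mask |= 1 << lane
    return mask


def set_bits_in_order(mask):
    """Yield the positions of the set bits of mask, lowest lane first."""
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit
        yield lowest.bit_length() - 1  # its position is the lane index
        mask &= mask - 1               # clear that bit and continue


def process_chunk(chunk_start, intersects):
    """Return the Gaussian indices one lane would visit for this chunk."""
    intersection_mask = wave_active_ballot(intersects)
    return [chunk_start + lane for lane in set_bits_in_order(intersection_mask)]


# Hypothetical chunk starting at Gaussian 64, where lanes 1, 4, 5 and 30 intersect the tile.
flags = [lane in (1, 4, 5, 30) for lane in range(WAVE_SIZE)]

# Every lane sees the same mask, so every lane derives the same ordered list.
per_lane_lists = [process_chunk(64, flags) for _ in range(WAVE_SIZE)]
```

Because each simulated lane walks the same mask from the lowest bit upward, all of them produce identical indices in identical order, which is the property the paragraph relies on when it describes the shared-memory writes as storing the same values.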
I don't quite understand why we need to introduce any data races at all.
We should instead do
int idx = WavePrefixSum(intersects ? 1 : 0);
and then each lane just writes to intersectingBlobList[idx]. There will be no data races if done this way.
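The suggestion above can be sketched on the CPU as follows. This is a hedged Python model, not shader code: `wave_prefix_sum` mimics the exclusive behaviour of `WavePrefixSum` (each lane receives the sum over lower-numbered lanes), while `compact_intersecting`, `base_count`, and the sample chunks are hypothetical names and values introduced only for illustration. The sketch also assumes that only lanes whose `intersects` flag is true perform the write.

```python
def wave_prefix_sum(values):
    """CPU stand-in for WavePrefixSum: lane i gets the sum of values in lanes 0..i-1."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact_intersecting(chunk_start, intersects, blob_list, base_count):
    """Write each intersecting lane's Gaussian index into its own slot of blob_list.

    Returns the updated running count (hypothetical bookkeeping for this sketch).
    """
    flags = [1 if hit else 0 for hit in intersects]
    idx = wave_prefix_sum(flags)
    # On a GPU every lane would run this body in parallel; the loop models that.
    for lane, hit in enumerate(intersects):
        if hit:
            # Slots are distinct per lane, so no two lanes write the same element.
            blob_list[base_count + idx[lane]] = chunk_start + lane
    return base_count + sum(flags)


blob_list = [None] * 8
first_chunk = [lane in (1, 4, 5, 30) for lane in range(32)]
second_chunk = [lane in (0, 2) for lane in range(32)]

count = compact_intersecting(64, first_chunk, blob_list, 0)
count = compact_intersecting(96, second_chunk, blob_list, count)
```

Since the prefix sum is exclusive, intersecting lanes receive consecutive slots in lane order, so the list stays sorted by Gaussian index and each element is written by exactly one lane.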
I still need to increment the overall number of intersecting blobs (intersectingBlobCount), which is used in the backwards pass. With all of the lanes calculating the same value for numIntersectingBlobs, it can be done non-atomically -- but still a technical data race. If I'm not calculating numIntersectingBlobs at all, then I would need to make intersectingBlobCount an atomic, which then complicates its use elsewhere.
After discussion with Yong: the wave intrinsic example only works in cases where there's only one workgroup per dispatch. We should wait to post this until the dispatch shape is controllable from SlangPy, so this is blocked on shader-slang/slangpy#72

Blog post updated with call group shape info -- @csyonghe could you re-review?
csyonghe
left a comment
Looks good to me.