-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Extract args directly from kernel if we can #18387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
This is failing with:
I've never added anything to |
The added function is private, so it's not part of the API, but since it's defined in the library source code it becomes part of the ABI. I wouldn't say that's a problem (e.g. it allows applications to benefit from some changes to that function without recompilation), but you can move the definition to the header if you'd like to keep it out of the library ABI boundary. |
sycl/source/detail/handler_impl.hpp
Outdated
@@ -198,6 +198,11 @@ class handler_impl { | |||
|
|||
// Allocation ptr to be freed asynchronously. | |||
void *MFreePtr = nullptr; | |||
|
|||
// A pointer to a blob of direct kernel arguments, alternative to MArgs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer we use an std::variant
to represent that.
sycl/include/sycl/handler.hpp
Outdated
prepareForDirectArgumentCopy((const void *)&KernelFunc, | ||
detail::getKernelNumParams<KernelName>(), | ||
&detail::getKernelParamDesc<KernelName>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong in its current version, I think. Line 768 might have left KernelFunc
in moved-out state. How exactly is this beneficial? We've already created a copy in MHostKernel
, what extra overhead are you trying to eliminate?
Edit: is #18413 doing exactly that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, this is why I started looking at #18413.
I'm going to work on merging these two PRs today. The end result should be that:
- We defer extracting arguments from the kernel until we reach
handler::finalize()
- We skip extracting arguments from the kernel if it's a simple case
What I'm trying to eliminate is the overhead of:
- Creating an
MArgs
vector - Checking that
MArgs
is sorted - Copying the arguments into
MArgs
before they're passed to UR
}; | ||
|
||
applyFuncOnFilteredArgs(EliminatedArgMask, Args, setFunc); | ||
if (DirectArgs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delay args extraction (by passing a callback from the user's code maybe) until here, so that non-special arguments would naturally be copied directly by the nature of the modified implementation?
Rather than extracting arguments from the lambda when the kernel is enqueued, store a pointer to the lambda alongside relevant information from the integration header or compiler builtins. Storing this information will allow us to defer the extraction of arguments until we reach handler::finalize(), at which point it may be possible to set the kernel arguments directly without populating MArgs. Signed-off-by: John Pennycook <[email protected]>
hasSpecialCaptures requires getParamDesc to be constexpr. Several tests were previously incompatible with this requirement, but it was only discovered when trying to call hasSpecialCaptures for each kernel. Signed-off-by: John Pennycook <[email protected]>
We are currently only able to skip argument extraction in the case where a lambda has no special captures. We can only detect this while we have the kernel type name, and must carry it through until we call handler::finalize(). Signed-off-by: John Pennycook <[email protected]>
8a6269b
to
864bed9
Compare
Signed-off-by: John Pennycook <[email protected]>
864bed9
to
c2813f0
Compare
Signed-off-by: John Pennycook <[email protected]>
This shouldn't be necessary, but in my experiments, the compiler does not optimize the function call away unless it is used in a constexpr if. Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Assuming the alternative could lead to skipping MArgs inconsistently. Signed-off-by: John Pennycook <[email protected]>
Kernels without special captures might still have an elimination mask. Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
In some cases, all values that need to be passed as kernel arguments are stored within the kernel function object, and their offsets can be calculated using the integration header or equivalent built-ins. In such cases, we can therefore set kernel arguments directly without staging via
MArgs
.This first attempt is limited to the simplest cases where all kernel arguments are either standard layout types or pointers. It may be possible to extend this approach to cover other cases, but only if some classes are redesigned.
The implementation currently stores some information (e.g., the number of kernel arguments) inside of the handler, because there is no way to access the kernel type within
handler::finalize()
.Some notes for reviewers:
This depends on the new
hasSpecialCaptures
functionality introduced in [SYCL] Add hasSpecialCaptures() constexpr function #18386, which returnstrue
for kernels that only capture standard layout classes and pointers.There are some seemingly unrelated changes in kernel_desc.hpp and to some of the unit tests. These changes were necessary because
hasSpecialCaptures
requiresgetParamDesc
to beconstexpr
. I think this wasn't picked up during [SYCL] Add hasSpecialCaptures() constexpr function #18386 becausehasSpecialCaptures
wasn't previously being run for every kernel.I'm not really satisfied by the solution of adding a lot more member variables, but it was the best way I could think of to limit the scope of the changes required. Long-term, it would be better to try and move everything (including the complicated cases) to extract everything directly from the lambda, to design an abstraction that unifies the
MArgs
andMKernelFuncPtr
paths, or to find a way to access the required values without them being stored in the handler (e.g., using something like [SYCL] Optimize kernel name based cache lookup #18081).