-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL] Extract args directly from kernel if we can #18387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
172d504
d046b0a
a92ff30
c2813f0
bebc2cf
3caf6b0
47472d4
015a5a2
a58c502
b07c7d1
54b4621
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2385,7 +2385,10 @@ static ur_result_t SetKernelParamsAndLaunch( | |
const std::function<void *(Requirement *Req)> &getMemAllocationFunc, | ||
bool IsCooperative, bool KernelUsesClusterLaunch, | ||
uint32_t WorkGroupMemorySize, const RTDeviceBinaryImage *BinImage, | ||
KernelNameStrRefT KernelName) { | ||
KernelNameStrRefT KernelName, void *KernelFuncPtr = nullptr, | ||
int KernelNumArgs = 0, | ||
detail::kernel_param_desc_t (*KernelParamDescGetter)(int) = nullptr, | ||
bool KernelHasSpecialCaptures = true) { | ||
assert(Queue && "Kernel submissions should have an associated queue"); | ||
const AdapterPtr &Adapter = Queue->getAdapter(); | ||
|
||
|
@@ -2397,13 +2400,38 @@ static ur_result_t SetKernelParamsAndLaunch( | |
: Empty); | ||
} | ||
|
||
auto setFunc = [&Adapter, Kernel, &DeviceImageImpl, &getMemAllocationFunc, | ||
&Queue](detail::ArgDesc &Arg, size_t NextTrueIndex) { | ||
SetArgBasedOnType(Adapter, Kernel, DeviceImageImpl, getMemAllocationFunc, | ||
Queue->getContextImplPtr(), Arg, NextTrueIndex); | ||
}; | ||
|
||
applyFuncOnFilteredArgs(EliminatedArgMask, Args, setFunc); | ||
if (KernelFuncPtr && !KernelHasSpecialCaptures) { | ||
auto setFunc = [&Adapter, Kernel, | ||
KernelFuncPtr](const detail::kernel_param_desc_t &ParamDesc, | ||
size_t NextTrueIndex) { | ||
const void *ArgPtr = (const char *)KernelFuncPtr + ParamDesc.offset; | ||
switch (ParamDesc.kind) { | ||
case kernel_param_kind_t::kind_std_layout: { | ||
int Size = ParamDesc.info; | ||
Adapter->call<UrApiKind::urKernelSetArgValue>(Kernel, NextTrueIndex, | ||
Size, nullptr, ArgPtr); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The if (Arg.MPtr) {
Adapter->call<UrApiKind::urKernelSetArgValue>(
Kernel, NextTrueIndex, Arg.MSize, nullptr, Arg.MPtr);
} else {
Adapter->call<UrApiKind::urKernelSetArgLocal>(Kernel, NextTrueIndex,
Arg.MSize, nullptr);
} Is there a reason we don't do that here? Is the else-case falling under "special captures"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I'm honest, I don't really understand how the old On the fast path, I'm extracting a standard layout argument directly from the function object. Since it's a standard layout object and not a pointer to one, it can never be null, so I removed the branch. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe this is some special case for some local memory accessors. I wonder if these changes can handle that correctly. Let's hope testing is good enough! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It could be some overloading of the meaning of the kernel parameter kind. The vector contains the arguments after decomposition, which means that there may indeed be some special fields that need handling. The array I'm working with contains a description of the original arguments, and so anything that's captured as a "standard layout" class really needs to be one -- a local accessor is identified in the array as an accessor, so it will be counted as a special capture. |
||
break; | ||
} | ||
case kernel_param_kind_t::kind_pointer: { | ||
const void *Ptr = *static_cast<const void *const *>(ArgPtr); | ||
Adapter->call<UrApiKind::urKernelSetArgPointer>(Kernel, NextTrueIndex, | ||
nullptr, Ptr); | ||
break; | ||
} | ||
default: | ||
throw std::runtime_error("Direct kernel argument copy failed."); | ||
} | ||
}; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could it make sense to add a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not with the current design. This is what I was alluding to when I said it would be a good idea to try and unify the The main problem is that we have two different ways to represent what an argument is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see what you mean. Yeah, I am not sure it's worth trying to repack the arguments. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if I'm replying to the correct thread, but why can't you sink There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that could be possible if we had more special-casing in the submission paths, but I didn't want to embark on a redesign that big. I might have misunderstood the current code, but I convinced myself that the I wrote this primarily to demonstrate that extracting things directly from the kernel function object was 1) possible; and 2) faster. I'd love to make that the default path for everything, but I don't understand the runtime well enough! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you describe how exactly you're testing performance of this? Either PR's description or maybe a comment near the fast path code. That would help whoever will try to generalize your approach for all submission paths to ensure your "fast path" doesn't regress. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea. I've added a paragraph to the end of the PR description. |
||
applyFuncOnFilteredArgs(EliminatedArgMask, KernelNumArgs, | ||
KernelParamDescGetter, setFunc); | ||
} else { | ||
auto setFunc = [&Adapter, Kernel, &DeviceImageImpl, &getMemAllocationFunc, | ||
&Queue](detail::ArgDesc &Arg, size_t NextTrueIndex) { | ||
SetArgBasedOnType(Adapter, Kernel, DeviceImageImpl, getMemAllocationFunc, | ||
Queue->getContextImplPtr(), Arg, NextTrueIndex); | ||
}; | ||
applyFuncOnFilteredArgs(EliminatedArgMask, Args, setFunc); | ||
} | ||
|
||
std::optional<int> ImplicitLocalArg = | ||
ProgramManager::getInstance().kernelImplicitLocalArgPos(KernelName); | ||
|
@@ -2655,7 +2683,9 @@ void enqueueImpKernel( | |
const std::function<void *(Requirement *Req)> &getMemAllocationFunc, | ||
ur_kernel_cache_config_t KernelCacheConfig, const bool KernelIsCooperative, | ||
const bool KernelUsesClusterLaunch, const size_t WorkGroupMemorySize, | ||
const RTDeviceBinaryImage *BinImage) { | ||
const RTDeviceBinaryImage *BinImage, void *KernelFuncPtr, int KernelNumArgs, | ||
detail::kernel_param_desc_t (*KernelParamDescGetter)(int), | ||
bool KernelHasSpecialCaptures) { | ||
assert(Queue && "Kernel submissions should have an associated queue"); | ||
// Run OpenCL kernel | ||
auto &ContextImpl = Queue->getContextImplPtr(); | ||
|
@@ -2739,7 +2769,8 @@ void enqueueImpKernel( | |
Queue, Args, DeviceImageImpl, Kernel, NDRDesc, EventsWaitList, | ||
OutEventImpl, EliminatedArgMask, getMemAllocationFunc, | ||
KernelIsCooperative, KernelUsesClusterLaunch, WorkGroupMemorySize, | ||
BinImage, KernelName); | ||
BinImage, KernelName, KernelFuncPtr, KernelNumArgs, | ||
KernelParamDescGetter, KernelHasSpecialCaptures); | ||
|
||
const AdapterPtr &Adapter = Queue->getAdapter(); | ||
if (!SyclKernelImpl && !MSyclKernel) { | ||
|
Uh oh!
There was an error while loading. Please reload this page.