-
Notifications
You must be signed in to change notification settings - Fork 125
[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate #1127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate #1127
Conversation
@@ -248,17 +245,19 @@ urEventWait(uint32_t numEvents, const ur_event_handle_t *phEventWaitList) { | |||
UR_APIEXPORT ur_result_t UR_APICALL urEventRetain(ur_event_handle_t hEvent) { | |||
const auto RefCount = hEvent->incrementReferenceCount(); | |||
|
|||
detail::ur::assertion(RefCount != 0, | |||
"Reference count overflow detected in urEventRetain."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This removes a specific error message with a generic one that does not relate to the specific error message.
If you want to keep specific errors then use setErrorMessage etc as e.g here: intel/llvm#10626
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quite a few examples of this in this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This removes a specific error message with a generic one that does not relate to the specific error message.
Agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Jack for these example I'll restore the original error message but use a standard assert(message)
from the cassert
header.
return UR_RESULT_ERROR_UNSUPPORTED_FEATURE; | ||
} | ||
|
||
UR_APIEXPORT ur_result_t UR_APICALL | ||
urUsmP2PDisablePeerAccessExp([[maybe_unused]] ur_device_handle_t commandDevice, | ||
[[maybe_unused]] ur_device_handle_t peerDevice) { | ||
|
||
cl_adapter::die( | ||
"Experimental P2P feature is not implemented for OpenCL adapter."); | ||
return UR_RESULT_ERROR_UNSUPPORTED_FEATURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that in dpc++ the error will simply be PI_ERROR_INVALID_VALUE which is useless to the user and is probably going to lead to a lot of confusion.
In the long run when it really could return the semantics of UR_RESULT_ERROR_UNSUPPORTED_FEATURE
it is probably worth considering whether this is perhaps also a degradation wrt end-user usefulness wrt the more specific "Experimental P2P feature is not implemented for OpenCL adapter."
There are many examples of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE
thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors similar to a recent change made here intel/llvm@3756725 would this be acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realise there wasn't a DPC++ equivalent error code for
UR_RESULT_ERROR_UNSUPPORTED_FEATURE
thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors similar to a recent change made here intel/llvm@3756725 would this be acceptable?
I guess you could do that as things stand. Although from what I understand it would require that all backends that can call this API won't be able to return any other errors that map to PI_ERROR_INVALID_OPERATION
, and this will have to be guaranteed to remain true into the future. But these are technical implementation questions, which I don't think are the key issue to consider.
Since also presumably all PI_ERROR_*
will be removed at some point when PI is removed, maybe the above solution would only be very temporary, and I can see one argument that in the future it might be decided that UR_RESULT_ERROR_UNSUPPORTED_FEATURE
will be returned for all such cases where a feature isn't supported. Then the runtime can pass a more specific message if it receives this error.
The above paragraph is more of a specification question which in my point of view is the right type of question to be asking at this stage, since there doesn't appear (tell me if I am wrong) a clear specification in UR for dealing with errors. For such a specification to be stable into the future I can think of some requirements it would need to satisfy (in a rough order of importance):
- Support both adapter agnostic and adapter specific errors.
- Be compatible with the error requirements of existing supported language runtime specifications (assuming that the language runtime specifications have error handling fit for purpose, and aren't themselves looking into changes in future versions (SYCL is one), in such a case the error requirements to take account of will be the future versions not the existing ones).
- Be future proof for error requirements of adapters that might be supported in the future.
For sure the above requirements would take some amount of thought to satisfy, but I don't see (particularly the first two) as being impossible. And before solving this problem is properly undertaken I see it as likely that we will just largely go in circles, or at best zig-zag to a stable solution.
At the moment the process of e.g. this PR is:
step 1. make a change to move in a particular direction due to a single requirement: e.g. get rid of dies within UR
step 2: run into some constraints like e.g. the one in this particular thread.
step 3: solve the constraints and merge pr
step 4: realize that we didn't consider some other constraints and have to start the process over.
Of course it is just my opinion, and it doesn't block this PR, but I'd recommend going straight to the finish line and first plan an error handling spec that is fit for purpose.
The other issue is that without a clear spec people deal with errors in an inconsistent way. If once someone has ideally fully considered the requirements they add docs to https://oneapi-src.github.io/unified-runtime/core/INTRO.html#error-handling for contributors that would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @JackAKirk I think it's maybe best to continue the discussion for this in a new issue - #1161 as it would potentially make this a very large PR! I think we are going to go ahead with looking at handling the unsupported feature in the sycl runtime for now. Cheers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks.
I think there are quite a few degradations introduced in this PR wrt DPC++ usage. It would be good to consider the effect on the end-user for each category of changes made here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the effort to get rid of asserting in favour of providing error message. I do think however, that greater care is needed to make sure that the error codes are meaningful. Admittedly, this is non-trivial and daunting task, but if we let it slip now, it's likely that it will never get fixed.
Similarly, like Jack pointed out, it would be nice to preserve messages (maybe with exception of those unimplemented API entry).
@@ -510,7 +557,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice, | |||
UR_CHECK_ERROR(cuDeviceGetAttribute( | |||
&ECCEnabled, CU_DEVICE_ATTRIBUTE_ECC_ENABLED, hDevice->get())); | |||
|
|||
detail::ur::assertion((ECCEnabled == 0) | (ECCEnabled == 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you are at it, should it be rewritten to use logical or operator? I'm not sure if bitwise was intended or an error in the first place? (in a couple of places in this file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was likely a typo in the original diff. I've changed it over now, thanks.
source/adapters/cuda/enqueue.cpp
Outdated
default: | ||
detail::ur::die("Invalid image format."); | ||
return 0; | ||
return UR_RESULT_ERROR_INVALID_IMAGE_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR
fit better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to this, thanks.
source/adapters/cuda/memory.cpp
Outdated
@@ -306,8 +306,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urMemImageCreate( | |||
PixelTypeSizeBytes = 4; | |||
break; | |||
default: | |||
detail::ur::die( | |||
"urMemImageCreate given unsupported image_channel_data_type"); | |||
return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the switch is over ur_image_format_t
maybe return: UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to this, thanks.
switch (Format) { | ||
case HIP_AD_FORMAT_UNSIGNED_INT8: | ||
case HIP_AD_FORMAT_SIGNED_INT8: | ||
*Size = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above switch
/break
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, thanks.
switch (ArrayDesc.Format) { | ||
case CU_AD_FORMAT_UNSIGNED_INT8: | ||
case CU_AD_FORMAT_SIGNED_INT8: | ||
return 1; | ||
*Size = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need break
?
https://godbolt.org/z/P766Pvaf1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I just spotted that too. The control will be transferred to the next statement and so on until it reaches a break
or end of the switch
. Maybe there are either no tests or not enabled tests to catch this now but it will be giving incorrect element byte sizes for when 8-bit and 16-bit formats are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, thanks.
@@ -843,23 +843,25 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueMemBufferFill( | |||
} | |||
} | |||
|
|||
static size_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc) { | |||
static ur_result_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc, | |||
int *Size) { | |||
switch (ArrayDesc.Format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need break
?
https://godbolt.org/z/P766Pvaf1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, thanks.
source/adapters/hip/common.hpp
Outdated
@@ -15,6 +15,16 @@ | |||
#include <hip/hip_runtime.h> | |||
#include <ur/ur.hpp> | |||
|
|||
/** | |||
* Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically | |
* Call a UR API and, if the result is not UR_RESULT_SUCCESS, automatically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, thanks.
return ReturnValue(static_cast<uint64_t>(LocalMemSize)); | ||
} | ||
case UR_DEVICE_INFO_ERROR_CORRECTION_SUPPORT: { | ||
int EccEnabled = 0; | ||
UR_CHECK_ERROR(hipDeviceGetAttribute( | ||
&EccEnabled, hipDeviceAttributeEccEnabled, hDevice->get())); | ||
|
||
detail::ur::assertion((EccEnabled == 0) | (EccEnabled == 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in CUDA re: binary/logical or.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, thanks.
source/adapters/hip/device.cpp
Outdated
@@ -493,8 +546,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice, | |||
// name instead, this is also what AMD OpenCL devices return. | |||
if (strlen(Name) == 0) { | |||
hipDeviceProp_t Props; | |||
detail::ur::assertion(hipGetDeviceProperties(&Props, hDevice->get()) == | |||
hipSuccess); | |||
if (hipGetDeviceProperties(&Props, hDevice->get()) != hipSuccess) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make some effort to try and translate native error returns to UR specific? There is a bunch of native calls (here and in CUDA: cuMemGetInfo
, cuDeviceTotalMem
etc.) that just hard code UR_RESULT_ERROR_INVALID_OPERATION
/UR_RESULT_ERROR_INVALID_SIZE
on failure. Previously they were wrapped up in assertion, so the value didn't really matter, but now, they will be served up all the way to the user.
7669e65
to
8b7081e
Compare
I have updated the target branch of this PR from the |
ad23ddd
to
b603874
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes to command buffers LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bindless Images LGTM
b603874
to
714adc2
Compare
… codes or replacing original error message to use an assert from cassert header and some other minor fixes.
…checks with an assert instead of returning a generic UR error.
…rror. Format code.
714adc2
to
432bfcc
Compare
ur_event_handle_t_ *Event = ur_cast<ur_event_handle_t_ *>(e); | ||
if (!Event->hasExternalRefs()) | ||
die("urEventsWait must not be called for an internal event"); | ||
ur_event_handle_t_ *Event = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why drop the use of the variable e? then it will just be unused in the internals.
Unified Runtime -> intel/llvm Repo Move NoticeInformationThe source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory, The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime. The contribution guide has been updated with new instructions for contributing to Unified Runtime. PR MigrationAll open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days. Should you wish to continue with your PR you will need to migrate it to intel/llvm. This is an automated comment. |
Unified Runtime -> intel/llvm Repo Move NoticeFollowing on from the previous notice, we have now enabled workflows to automatically label and close PRs because the Unified Runtime source code has moved to intel/llvm. This PR has now been marked with the Please review the previous notice for more information, including assistance with migrating your PR to intel/llvm. Should there be a reason for this PR to remain open, manually remove the This is an automated comment. |
Automatic PR Closure NoticeInformationThis PR has been closed automatically. It was marked with the All Unified Runtime development should be done in intel/llvm, details can be found in the updated contribution guide. Next StepsShould you wish to re-open this PR it must be moved to intel/llvm. We have provided a script to help automate this process, otherwise no actions are required. This is an automated comment. |
For #1078.
This PR attempts to replace all die/terminate calls with simply returning an appropriate error code. There may be a better code, or some kind of assert/abort may be more appropriate in some cases so any suggestions are appreciated.
intel-llvm run intel/llvm#12076