[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate #1127
martygrant wants to merge 4 commits into oneapi-src:main
Conversation
| const auto RefCount = hEvent->incrementReferenceCount(); | ||
|
|
||
| detail::ur::assertion(RefCount != 0, | ||
| "Reference count overflow detected in urEventRetain."); |
There was a problem hiding this comment.
This replaces a specific error message with a generic one that does not relate to the specific error.
If you want to keep specific errors then use setErrorMessage etc., as e.g. here: intel/llvm#10626
There was a problem hiding this comment.
There are quite a few examples of this in this pr.
There was a problem hiding this comment.
> This replaces a specific error message with a generic one that does not relate to the specific error.

Agree.
There was a problem hiding this comment.
Hi Jack, for these examples I'll restore the original error message but use a standard assert(message) from the cassert header.
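To make the two options in this thread concrete, here is a minimal sketch of both: keeping the specific message by recording it alongside an adapter-specific result code, and keeping it in a standard assert. The `Event` struct, the thread-local `LastErrorMessage`, and this `setErrorMessage` helper are hypothetical stand-ins written for illustration; they are not the adapters' actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical subset of the UR result codes, for illustration only.
enum ur_result_t {
  UR_RESULT_SUCCESS,
  UR_RESULT_ERROR_ADAPTER_SPECIFIC,
};

// Hypothetical per-thread storage for the last adapter-specific message.
thread_local std::string LastErrorMessage;

// Hypothetical helper: record a message and hand back the given code.
inline ur_result_t setErrorMessage(const char *Message, ur_result_t Code) {
  LastErrorMessage = Message;
  return Code;
}

// Hypothetical event whose reference count can wrap around to zero.
struct Event {
  std::uint32_t RefCount = 1;
  std::uint32_t incrementReferenceCount() { return ++RefCount; }
};

// Option A: keep the specific message and report it through the error path.
inline ur_result_t eventRetainWithMessage(Event &E) {
  if (E.incrementReferenceCount() == 0)
    return setErrorMessage(
        "Reference count overflow detected in urEventRetain.",
        UR_RESULT_ERROR_ADAPTER_SPECIFIC);
  return UR_RESULT_SUCCESS;
}

// Option B: treat the overflow as an internal invariant and assert on it.
inline ur_result_t eventRetainWithAssert(Event &E) {
  const std::uint32_t Count = E.incrementReferenceCount();
  assert(Count != 0 && "Reference count overflow detected in urEventRetain.");
  (void)Count; // silence the unused-variable warning when NDEBUG is defined
  return UR_RESULT_SUCCESS;
}
```

Option A keeps the message visible in release builds, while Option B compiles away under NDEBUG, which may or may not be what is wanted for this check.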
|
|
||
| cl_adapter::die( | ||
| "Experimental P2P feature is not implemented for OpenCL adapter."); | ||
| return UR_RESULT_ERROR_UNSUPPORTED_FEATURE; |
There was a problem hiding this comment.
This means that in DPC++ the error will simply be PI_ERROR_INVALID_VALUE, which is useless to the user and is probably going to lead to a lot of confusion.
In the long run, when it really could return the semantics of UR_RESULT_ERROR_UNSUPPORTED_FEATURE, it is probably worth considering whether this is also a degradation in end-user usefulness compared with the more specific "Experimental P2P feature is not implemented for OpenCL adapter."
There are many examples of this.
There was a problem hiding this comment.
I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE, thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors, similar to a recent change made here: intel/llvm@3756725. Would this be acceptable?
There was a problem hiding this comment.
> I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE, thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors, similar to a recent change made here: intel/llvm@3756725. Would this be acceptable?

I guess you could do that as things stand. Although from what I understand it would require that all backends that can call this API won't be able to return any other errors that map to PI_ERROR_INVALID_OPERATION, and this will have to be guaranteed to remain true into the future. But these are technical implementation questions, which I don't think are the key issue to consider.
Since also presumably all PI_ERROR_* will be removed at some point when PI is removed, maybe the above solution would only be very temporary, and I can see one argument that in the future it might be decided that UR_RESULT_ERROR_UNSUPPORTED_FEATURE will be returned for all such cases where a feature isn't supported. Then the runtime can pass a more specific message if it receives this error.
The above paragraph is more of a specification question, which from my point of view is the right type of question to be asking at this stage, since there doesn't appear to be (tell me if I am wrong) a clear specification in UR for dealing with errors. For such a specification to be stable into the future I can think of some requirements it would need to satisfy (in a rough order of importance):
- Support both adapter agnostic and adapter specific errors.
- Be compatible with the error requirements of existing supported language runtime specifications. This assumes the language runtime specifications have error handling that is fit for purpose and aren't themselves looking into changes in future versions (SYCL is one); in such a case the requirements to take into account will be those of the future versions, not the existing ones.
- Be future proof for error requirements of adapters that might be supported in the future.
For sure the above requirements would take some amount of thought to satisfy, but I don't see them (particularly the first two) as being impossible. And until solving this problem is properly undertaken, I see it as likely that we will largely go in circles, or at best zig-zag to a stable solution.
At the moment the process of e.g. this PR is:
Step 1: make a change to move in a particular direction due to a single requirement, e.g. get rid of dies within UR.
Step 2: run into some constraints, e.g. the one in this particular thread.
Step 3: solve the constraints and merge the PR.
Step 4: realize that we didn't consider some other constraints and have to start the process over.
Of course it is just my opinion, and it doesn't block this PR, but I'd recommend going straight to the finish line and first planning an error handling spec that is fit for purpose.
The other issue is that without a clear spec people deal with errors in an inconsistent way. Ideally, once someone has fully considered the requirements, they would add docs for contributors to https://oneapi-src.github.io/unified-runtime/core/INTRO.html#error-handling, which would be useful.
There was a problem hiding this comment.
Hi @JackAKirk, I think it's maybe best to continue the discussion for this in a new issue, #1161, as it would potentially make this a very large PR! I think we are going to go ahead with looking at handling the unsupported feature in the SYCL runtime for now. Cheers
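As a rough illustration of the idea discussed in this thread (the adapter returns UR_RESULT_ERROR_UNSUPPORTED_FEATURE and the runtime attaches a more specific message), here is a minimal sketch. The `feature_not_supported` exception type and the `checkResult` helper are hypothetical names invented for this example and do not correspond to actual SYCL runtime code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical subset of the UR result codes, for illustration only.
enum ur_result_t {
  UR_RESULT_SUCCESS,
  UR_RESULT_ERROR_INVALID_VALUE,
  UR_RESULT_ERROR_UNSUPPORTED_FEATURE,
};

// Hypothetical exception type standing in for whatever the runtime throws.
struct feature_not_supported : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical runtime-side check: the adapter only reports the generic
// "unsupported feature" code, and the caller supplies the context needed
// to build a message that is useful to the end user.
inline void checkResult(ur_result_t Result, const std::string &Feature,
                        const std::string &Backend) {
  switch (Result) {
  case UR_RESULT_SUCCESS:
    return;
  case UR_RESULT_ERROR_UNSUPPORTED_FEATURE:
    throw feature_not_supported(Feature + " is not implemented for " +
                                Backend + " adapter.");
  default:
    throw std::runtime_error("Unified Runtime call failed with code " +
                             std::to_string(static_cast<int>(Result)));
  }
}
```

With this shape, the specific wording lives in one place in the runtime rather than in each adapter, at the cost of the runtime needing to know which feature and backend were involved.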
I think there are quite a few degradations introduced in this PR wrt DPC++ usage. It would be good to consider the effect on the end-user for each category of changes made here.
jchlanda left a comment
There was a problem hiding this comment.
I like the effort to get rid of asserting in favour of providing an error message. I do think, however, that greater care is needed to make sure that the error codes are meaningful. Admittedly, this is a non-trivial and daunting task, but if we let it slip now, it's likely that it will never get fixed.
Similarly, as Jack pointed out, it would be nice to preserve messages (maybe with the exception of those for unimplemented API entry points).
| UR_CHECK_ERROR(cuDeviceGetAttribute( | ||
| &ECCEnabled, CU_DEVICE_ATTRIBUTE_ECC_ENABLED, hDevice->get())); | ||
|
|
||
| detail::ur::assertion((ECCEnabled == 0) | (ECCEnabled == 1)); |
There was a problem hiding this comment.
While you are at it, should it be rewritten to use the logical or operator? I'm not sure whether bitwise was intended or was an error in the first place (this occurs in a couple of places in this file).
There was a problem hiding this comment.
Looks like it was likely a typo in the original diff. I've changed it over now, thanks.
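For reference, a small sketch of why the two spellings behave the same for this particular check. The function names are hypothetical and exist only for this example. With two side-effect-free comparisons, `|` and `||` produce the same truth value; the difference is that `||` short-circuits and states the intent more clearly.

```cpp
// Hypothetical helpers checking that a flag is exactly 0 or 1.

// Bitwise form: both comparisons are always evaluated, each bool is
// promoted to int, and the two ints are OR'd together.
constexpr bool isBooleanFlagBitwise(int Value) {
  return (Value == 0) | (Value == 1);
}

// Logical form: the second comparison is skipped when the first is true.
constexpr bool isBooleanFlagLogical(int Value) {
  return (Value == 0) || (Value == 1);
}

// Both forms agree for valid and invalid inputs.
static_assert(isBooleanFlagBitwise(0) == isBooleanFlagLogical(0), "agree on 0");
static_assert(isBooleanFlagBitwise(1) == isBooleanFlagLogical(1), "agree on 1");
static_assert(isBooleanFlagBitwise(2) == isBooleanFlagLogical(2), "agree on 2");
```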
source/adapters/cuda/enqueue.cpp
Outdated
| default: | ||
| detail::ur::die("Invalid image format."); | ||
| return 0; | ||
| return UR_RESULT_ERROR_INVALID_IMAGE_SIZE; |
There was a problem hiding this comment.
Would UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR fit better here?
There was a problem hiding this comment.
Updated to this, thanks.
source/adapters/cuda/memory.cpp
Outdated
| default: | ||
| detail::ur::die( | ||
| "urMemImageCreate given unsupported image_channel_data_type"); | ||
| return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION; |
There was a problem hiding this comment.
Given that the switch is over ur_image_format_t maybe return: UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR?
There was a problem hiding this comment.
Updated to this, thanks.
| switch (Format) { | ||
| case HIP_AD_FORMAT_UNSIGNED_INT8: | ||
| case HIP_AD_FORMAT_SIGNED_INT8: | ||
| *Size = 1; |
There was a problem hiding this comment.
Same as above switch/break.
There was a problem hiding this comment.
Fixed this, thanks.
| case CU_AD_FORMAT_UNSIGNED_INT8: | ||
| case CU_AD_FORMAT_SIGNED_INT8: | ||
| return 1; | ||
| *Size = 1; |
There was a problem hiding this comment.
Do you need break?
https://godbolt.org/z/P766Pvaf1
There was a problem hiding this comment.
Oh yeah, I just spotted that too. Control will be transferred to the next statement, and so on, until it reaches a break or the end of the switch. Maybe there are either no tests or no enabled tests to catch this at the moment, but it will give incorrect element byte sizes when 8-bit and 16-bit formats are used.
There was a problem hiding this comment.
Fixed this, thanks.
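A minimal sketch of the fall-through problem raised above, assuming a switch of roughly this shape once the function was changed from returning the size to writing it through an out-parameter. The `ArrayFormat` enum and both function names are hypothetical stand-ins, not the adapter code.

```cpp
// Hypothetical subset of the UR result codes, for illustration only.
enum ur_result_t {
  UR_RESULT_SUCCESS,
  UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR,
};

// Hypothetical stand-in for the native array format enumeration.
enum class ArrayFormat { UInt8, SInt8, UInt16, SInt16, Half, UInt32, Float, Unknown };

// Buggy shape: replacing "return 1;" with "*Size = 1;" removes the implicit
// exit from the switch, so every narrower format falls through to 4.
inline ur_result_t elementByteSizeBuggy(ArrayFormat Format, int *Size) {
  switch (Format) {
  case ArrayFormat::UInt8:
  case ArrayFormat::SInt8:
    *Size = 1; // falls through
  case ArrayFormat::UInt16:
  case ArrayFormat::SInt16:
  case ArrayFormat::Half:
    *Size = 2; // falls through
  case ArrayFormat::UInt32:
  case ArrayFormat::Float:
    *Size = 4;
    break;
  default:
    return UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR;
  }
  return UR_RESULT_SUCCESS;
}

// Fixed shape: each group of cases ends with a break.
inline ur_result_t elementByteSizeFixed(ArrayFormat Format, int *Size) {
  switch (Format) {
  case ArrayFormat::UInt8:
  case ArrayFormat::SInt8:
    *Size = 1;
    break;
  case ArrayFormat::UInt16:
  case ArrayFormat::SInt16:
  case ArrayFormat::Half:
    *Size = 2;
    break;
  case ArrayFormat::UInt32:
  case ArrayFormat::Float:
    *Size = 4;
    break;
  default:
    return UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR;
  }
  return UR_RESULT_SUCCESS;
}
```

Compiling with a fall-through warning enabled (for example `-Wimplicit-fallthrough` on GCC or Clang) is one way to catch this class of mistake when no test exercises the 8-bit and 16-bit paths.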
| static size_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc) { | ||
| static ur_result_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc, | ||
| int *Size) { | ||
| switch (ArrayDesc.Format) { |
There was a problem hiding this comment.
Do you need break?
https://godbolt.org/z/P766Pvaf1
There was a problem hiding this comment.
Fixed this, thanks.
source/adapters/hip/common.hpp
Outdated
| #include <ur/ur.hpp> | ||
|
|
||
| /** | ||
| * Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically |
There was a problem hiding this comment.
| * Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically | |
| * Call a UR API and, if the result is not UR_RESULT_SUCCESS, automatically |
There was a problem hiding this comment.
Fixed this, thanks.
| UR_CHECK_ERROR(hipDeviceGetAttribute( | ||
| &EccEnabled, hipDeviceAttributeEccEnabled, hDevice->get())); | ||
|
|
||
| detail::ur::assertion((EccEnabled == 0) | (EccEnabled == 1)); |
There was a problem hiding this comment.
Same as in CUDA re: bitwise/logical or.
There was a problem hiding this comment.
Fixed this, thanks.
source/adapters/hip/device.cpp
Outdated
| hipDeviceProp_t Props; | ||
| detail::ur::assertion(hipGetDeviceProperties(&Props, hDevice->get()) == | ||
| hipSuccess); | ||
| if (hipGetDeviceProperties(&Props, hDevice->get()) != hipSuccess) { |
There was a problem hiding this comment.
Should we make some effort to try to translate native error returns to UR-specific ones? There are a number of native calls (here and in CUDA: cuMemGetInfo, cuDeviceTotalMem etc.) that just hard-code UR_RESULT_ERROR_INVALID_OPERATION/UR_RESULT_ERROR_INVALID_SIZE on failure. Previously they were wrapped up in an assertion, so the value didn't really matter, but now they will be served up all the way to the user.
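One possible shape for such a translation, as a minimal sketch. The `NativeResult` enum, `fakeGetDeviceProperties`, and `mapNativeError` are hypothetical placeholders for a real driver API and are not taken from the CUDA or HIP adapters; the UR codes listed are an illustrative subset.

```cpp
// Hypothetical native driver result codes, standing in for a real driver API.
enum NativeResult {
  NATIVE_SUCCESS,
  NATIVE_ERROR_INVALID_VALUE,
  NATIVE_ERROR_OUT_OF_MEMORY,
  NATIVE_ERROR_INVALID_DEVICE,
  NATIVE_ERROR_NOT_SUPPORTED,
  NATIVE_ERROR_OTHER,
};

// Hypothetical subset of the UR result codes, for illustration only.
enum ur_result_t {
  UR_RESULT_SUCCESS,
  UR_RESULT_ERROR_INVALID_VALUE,
  UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY,
  UR_RESULT_ERROR_INVALID_DEVICE,
  UR_RESULT_ERROR_UNSUPPORTED_FEATURE,
  UR_RESULT_ERROR_UNKNOWN,
};

// Translate a native failure into the closest UR code instead of hard-coding
// one UR error at every call site.
inline ur_result_t mapNativeError(NativeResult Result) {
  switch (Result) {
  case NATIVE_SUCCESS:
    return UR_RESULT_SUCCESS;
  case NATIVE_ERROR_INVALID_VALUE:
    return UR_RESULT_ERROR_INVALID_VALUE;
  case NATIVE_ERROR_OUT_OF_MEMORY:
    return UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;
  case NATIVE_ERROR_INVALID_DEVICE:
    return UR_RESULT_ERROR_INVALID_DEVICE;
  case NATIVE_ERROR_NOT_SUPPORTED:
    return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;
  default:
    return UR_RESULT_ERROR_UNKNOWN;
  }
}

// Hypothetical native query that fails for a negative device index.
inline NativeResult fakeGetDeviceProperties(int Device, int *TotalMemMiB) {
  if (Device < 0)
    return NATIVE_ERROR_INVALID_DEVICE;
  *TotalMemMiB = 1024;
  return NATIVE_SUCCESS;
}

// Call site: the native failure is propagated through the mapping.
inline ur_result_t getTotalMemory(int Device, int *TotalMemMiB) {
  const NativeResult Result = fakeGetDeviceProperties(Device, TotalMemMiB);
  if (Result != NATIVE_SUCCESS)
    return mapNativeError(Result);
  return UR_RESULT_SUCCESS;
}
```

Keeping the mapping in one function means a change in how a native error should surface to the user only needs to be made once.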
7669e65 to 8b7081e
|
I have updated the target branch of this PR from the |
ad23ddd to b603874
Bensuo left a comment
There was a problem hiding this comment.
Changes to command buffers LGTM
b603874 to 714adc2
… codes or replacing original error message to use an assert from cassert header and some other minor fixes.
…checks with an assert instead of returning a generic UR error.
…rror. Format code.
714adc2 to 432bfcc
| ur_event_handle_t_ *Event = ur_cast<ur_event_handle_t_ *>(e); | ||
| if (!Event->hasExternalRefs()) | ||
| die("urEventsWait must not be called for an internal event"); | ||
| ur_event_handle_t_ *Event = |
There was a problem hiding this comment.
Why drop the use of the variable e? Then it will just be unused in the internals.
Unified Runtime -> intel/llvm Repo Move Notice
Information
The source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory. The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime. The contribution guide has been updated with new instructions for contributing to Unified Runtime.
PR Migration
All open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days. Should you wish to continue with your PR you will need to migrate it to intel/llvm.
This is an automated comment.
Unified Runtime -> intel/llvm Repo Move Notice
Following on from the previous notice, we have now enabled workflows to automatically label and close PRs because the Unified Runtime source code has moved to intel/llvm. This PR has now been marked with the Please review the previous notice for more information, including assistance with migrating your PR to intel/llvm. Should there be a reason for this PR to remain open, manually remove the
This is an automated comment.
Automatic PR Closure Notice
Information
This PR has been closed automatically. It was marked with the All Unified Runtime development should be done in intel/llvm, details can be found in the updated contribution guide.
Next Steps
Should you wish to re-open this PR it must be moved to intel/llvm. We have provided a script to help automate this process, otherwise no actions are required.
This is an automated comment.
For #1078.
This PR attempts to replace all die/terminate calls with simply returning an appropriate error code. There may be a better code, or some kind of assert/abort may be more appropriate in some cases so any suggestions are appreciated.
intel-llvm run intel/llvm#12076