[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate #1127

martygrant · 2023-11-27T15:30:21Z

This PR attempts to replace all die/terminate calls with returning an appropriate error code. There may be better error codes, or some kind of assert/abort may be more appropriate in some cases, so any suggestions are appreciated.

intel-llvm run intel/llvm#12076

JackAKirk · 2023-11-27T15:36:23Z

source/adapters/cuda/event.cpp

@@ -248,17 +245,19 @@ urEventWait(uint32_t numEvents, const ur_event_handle_t *phEventWaitList) {
 UR_APIEXPORT ur_result_t UR_APICALL urEventRetain(ur_event_handle_t hEvent) {
  const auto RefCount = hEvent->incrementReferenceCount();

-  detail::ur::assertion(RefCount != 0,
-                        "Reference count overflow detected in urEventRetain.");


This replaces a specific error message with a generic one that does not relate to the specific error.
If you want to keep specific errors, use setErrorMessage etc., as e.g. here: intel/llvm#10626

There are quite a few examples of this in this PR.

Agree.

Hi Jack, for these examples I'll restore the original error message but use a standard assert(message) from the cassert header.
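As an illustration of the concern in this thread, here is a minimal sketch of keeping the specific wording available while still returning a code instead of aborting. It is not the adapters' actual code: `Result`, `LastErrorMessage`, `setLastErrorMessage`, `Event` and `retainEvent` are hypothetical names invented for this example, and only the message text is taken from the diff above.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical result codes for illustration only; not the real ur_result_t.
enum class Result { Success, AdapterSpecific };

// Hypothetical per-thread slot holding the most recent detailed message.
static thread_local std::string LastErrorMessage;

// Record a specific message and hand back the code the caller should return.
static Result setLastErrorMessage(const std::string &Message, Result Code) {
  LastErrorMessage = Message;
  return Code;
}

// Toy reference-counted object standing in for an event handle.
struct Event {
  std::uint32_t RefCount = 1;
};

// Retain without aborting: on overflow the original wording is stored so the
// caller can still retrieve it, rather than seeing only a generic code.
static Result retainEvent(Event &E) {
  if (E.RefCount == std::numeric_limits<std::uint32_t>::max()) {
    return setLastErrorMessage(
        "Reference count overflow detected in urEventRetain.",
        Result::AdapterSpecific);
  }
  ++E.RefCount;
  return Result::Success;
}
```

A caller that sees `Result::AdapterSpecific` can then read `LastErrorMessage` to report the specific failure. Whether this or a plain assert is the better fit depends on whether the condition is a programming error or something a user can trigger.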

JackAKirk · 2023-11-27T15:47:59Z

source/adapters/opencl/usm_p2p.cpp

  return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;
 }

 UR_APIEXPORT ur_result_t UR_APICALL
 urUsmP2PDisablePeerAccessExp([[maybe_unused]] ur_device_handle_t commandDevice,
                             [[maybe_unused]] ur_device_handle_t peerDevice) {
-
-  cl_adapter::die(
-      "Experimental P2P feature is not implemented for OpenCL adapter.");
  return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;


See here: https://github.com/intel/llvm/blob/88f1d0a2b64e2c61d208d4dcebf9b512f49f4860/sycl/plugins/unified_runtime/pi2ur.hpp#L108

This means that in DPC++ the error will simply be PI_ERROR_INVALID_VALUE, which is useless to the user and will probably lead to a lot of confusion.

In the long run, even once the runtime can surface the semantics of UR_RESULT_ERROR_UNSUPPORTED_FEATURE, it is worth considering whether this is still a degradation in end-user usefulness compared with the more specific "Experimental P2P feature is not implemented for OpenCL adapter."

There are many examples of this.

I didn't realise there wasn't a DPC++ equivalent error code for UR_RESULT_ERROR_UNSUPPORTED_FEATURE, thanks for pointing this out. We would like to make a change to the SYCL runtime to handle these errors, similar to a recent change made in intel/llvm@3756725. Would this be acceptable?

I guess you could do that as things stand. From what I understand, though, it would require that no backend that can call this API returns any other error that maps to PI_ERROR_INVALID_OPERATION, and this would have to be guaranteed to remain true into the future. But these are technical implementation questions, which I don't think are the key issue to consider.

Also, since presumably all PI_ERROR_* codes will be removed at some point when PI is removed, the above solution might only be very temporary. I can see an argument that in the future it might be decided that UR_RESULT_ERROR_UNSUPPORTED_FEATURE will be returned in all cases where a feature isn't supported. Then the runtime can pass a more specific message if it receives this error.

The above paragraph is more of a specification question, which in my view is the right type of question to be asking at this stage, since there doesn't appear to be (tell me if I am wrong) a clear specification in UR for dealing with errors. For such a specification to be stable into the future, I can think of some requirements it would need to satisfy (in a rough order of importance):

1. Support both adapter-agnostic and adapter-specific errors.

2. Be compatible with the error requirements of existing supported language runtime specifications. This assumes that the language runtime specifications have error handling that is fit for purpose and aren't themselves looking into changes in future versions (SYCL is one); in such a case, the error requirements to take into account are those of the future versions, not the existing ones.

3. Be future-proof for the error requirements of adapters that might be supported in the future.

For sure, satisfying the above requirements would take some thought, but I don't see them (particularly the first two) as being impossible. Until this problem is properly addressed, I think it is likely that we will largely go in circles, or at best zig-zag to a stable solution.

At the moment the process of e.g. this PR is:

Step 1: Make a change to move in a particular direction due to a single requirement, e.g. get rid of dies within UR.
Step 2: Run into some constraints, e.g. the one in this particular thread.
Step 3: Solve the constraints and merge the PR.
Step 4: Realize that we didn't consider some other constraints and have to start the process over.

Of course it is just my opinion, and it doesn't block this PR, but I'd recommend going straight to the finish line and first planning an error handling spec that is fit for purpose.
The other issue is that, without a clear spec, people deal with errors in an inconsistent way. Once someone has fully considered the requirements, it would be useful if they added docs to https://oneapi-src.github.io/unified-runtime/core/INTRO.html#error-handling for contributors.

Hi @JackAKirk, I think it's best to continue this discussion in a new issue, #1161, as it would potentially make this a very large PR! For now we are going to go ahead with handling the unsupported feature in the SYCL runtime. Cheers

Sounds good, thanks.
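The idea raised in this thread, that a runtime could attach a more specific message when it receives an unsupported-feature result, could look roughly like the sketch below. This is an assumption-laden illustration, not the SYCL runtime's implementation: `Result`, `checkResult` and `disablePeerAccessSketch` are hypothetical names, and the real mapping between UR and runtime errors is not shown in this PR.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical result codes; only the "unsupported feature" idea comes from
// the discussion above.
enum class Result { Success, UnsupportedFeature, InvalidValue };

// Hypothetical runtime-side check: translate an adapter result into an
// exception whose text names the API and the adapter involved.
static void checkResult(Result R, const std::string &ApiName,
                        const std::string &AdapterName) {
  switch (R) {
  case Result::Success:
    return;
  case Result::UnsupportedFeature:
    throw std::runtime_error(ApiName + " is not supported by the " +
                             AdapterName + " adapter.");
  default:
    throw std::runtime_error(ApiName + " failed.");
  }
}

// Stand-in for an adapter entry point that is not implemented.
static Result disablePeerAccessSketch() { return Result::UnsupportedFeature; }
```

With this shape the adapter only returns a code, and the wording the end user sees is produced in one place by the runtime.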

JackAKirk · 2023-11-27T15:48:55Z

I think there are quite a few degradations introduced in this PR wrt DPC++ usage. It would be good to consider the effect on the end-user for each category of changes made here.

jchlanda

I like the effort to get rid of asserting in favour of providing an error message. I do think, however, that greater care is needed to make sure that the error codes are meaningful. Admittedly, this is a non-trivial and daunting task, but if we let it slip now, it's likely that it will never get fixed.

Similarly, as Jack pointed out, it would be nice to preserve the messages (maybe with the exception of those for unimplemented API entry points).

jchlanda · 2023-11-28T08:31:31Z

source/adapters/cuda/device.cpp

@@ -510,7 +557,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice,
    UR_CHECK_ERROR(cuDeviceGetAttribute(
        &ECCEnabled, CU_DEVICE_ATTRIBUTE_ECC_ENABLED, hDevice->get()));

-    detail::ur::assertion((ECCEnabled == 0) | (ECCEnabled == 1));


While you are at it, should it be rewritten to use the logical or operator? I'm not sure whether bitwise was intended or was an error in the first place (it appears in a couple of places in this file).

Looks like it was likely a typo in the original diff. I've changed it over now, thanks.
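For context on the bitwise versus logical question, a small sketch: because each comparison already yields a bool, both operators give the same truth value here, and the practical difference is that `||` short-circuits while `|` evaluates both operands. The function names are hypothetical and exist only for this example.

```cpp
// Both checks accept exactly the values 0 and 1.
static bool isZeroOrOneBitwise(int V) { return (V == 0) | (V == 1); }
static bool isZeroOrOneLogical(int V) { return (V == 0) || (V == 1); }

// Counter used to observe how many operands actually get evaluated.
static int Evaluations = 0;
static bool observed(bool B) {
  ++Evaluations;
  return B;
}

// Bitwise or evaluates both operands, so this returns 2.
static int evaluationsBitwise() {
  Evaluations = 0;
  bool R = observed(true) | observed(false);
  (void)R;
  return Evaluations;
}

// Logical or stops after the first true operand, so this returns 1.
static int evaluationsLogical() {
  Evaluations = 0;
  bool R = observed(true) || observed(false);
  (void)R;
  return Evaluations;
}
```

So the original assertion was not incorrect in its result, but the logical form states the intent more directly and avoids compiler warnings about bitwise operations on boolean operands.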

jchlanda · 2023-11-28T08:42:35Z

source/adapters/cuda/enqueue.cpp

  default:
-    detail::ur::die("Invalid image format.");
-    return 0;
+    return UR_RESULT_ERROR_INVALID_IMAGE_SIZE;


Would UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR fit better here?

Updated to this, thanks.

jchlanda · 2023-11-28T08:47:37Z

source/adapters/cuda/memory.cpp

@@ -306,8 +306,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urMemImageCreate(
    PixelTypeSizeBytes = 4;
    break;
  default:
-    detail::ur::die(
-        "urMemImageCreate given unsupported image_channel_data_type");
+    return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION;


Given that the switch is over ur_image_format_t maybe return: UR_RESULT_ERROR_INVALID_IMAGE_FORMAT_DESCRIPTOR?

Updated to this, thanks.

jchlanda · 2023-11-28T08:53:04Z

source/adapters/hip/common.cpp

+  switch (Format) {
+  case HIP_AD_FORMAT_UNSIGNED_INT8:
+  case HIP_AD_FORMAT_SIGNED_INT8:
+    *Size = 1;


Same as above switch/break.

Fixed this, thanks.

jchlanda · 2023-11-28T08:57:01Z

source/adapters/cuda/enqueue.cpp

  switch (ArrayDesc.Format) {
  case CU_AD_FORMAT_UNSIGNED_INT8:
  case CU_AD_FORMAT_SIGNED_INT8:
-    return 1;
+    *Size = 1;


Do you need break?
https://godbolt.org/z/P766Pvaf1

Oh yeah, I just spotted that too. Control will fall through to the next statement, and so on, until it reaches a break or the end of the switch. Maybe there are no tests, or no enabled tests, that catch this now, but it will give incorrect element byte sizes when 8-bit and 16-bit formats are used.

Fixed this, thanks.
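The fall-through problem described above can be reproduced in isolation. The sketch below is a simplified stand-in for the function under review, not a copy of it: `Format`, `Result` and both function names are hypothetical, and the enum is reduced to three widths to keep it short.

```cpp
// Hypothetical, reduced format and result enums for illustration.
enum class Format { Int8, Int16, Int32, Unknown };
enum class Result { Success, InvalidImageFormatDescriptor };

// Buggy shape: converting "return N;" to "*Size = N;" without adding a break
// lets control fall through, so the last assignment always wins.
static Result elementSizeBuggy(Format F, int *Size) {
  switch (F) {
  case Format::Int8:
    *Size = 1; // falls through
  case Format::Int16:
    *Size = 2; // falls through
  case Format::Int32:
    *Size = 4;
    break;
  default:
    return Result::InvalidImageFormatDescriptor;
  }
  return Result::Success;
}

// Fixed shape: every case ends with a break, so each format keeps its size.
static Result elementSizeFixed(Format F, int *Size) {
  switch (F) {
  case Format::Int8:
    *Size = 1;
    break;
  case Format::Int16:
    *Size = 2;
    break;
  case Format::Int32:
    *Size = 4;
    break;
  default:
    return Result::InvalidImageFormatDescriptor;
  }
  return Result::Success;
}
```

In the buggy version an 8-bit format reports a size of 4 while still returning success, which is why such a bug can go unnoticed unless a test exercises the narrower formats.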

jchlanda · 2023-11-28T09:01:29Z

source/adapters/cuda/enqueue.cpp

@@ -843,23 +843,25 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueMemBufferFill(
  }
 }

-static size_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc) {
+static ur_result_t imageElementByteSize(CUDA_ARRAY_DESCRIPTOR ArrayDesc,
+                                        int *Size) {
  switch (ArrayDesc.Format) {


Do you need break?
https://godbolt.org/z/P766Pvaf1

Fixed this, thanks.

jchlanda · 2023-11-28T09:05:28Z

source/adapters/hip/common.hpp

@@ -15,6 +15,16 @@
 #include <hip/hip_runtime.h>
 #include <ur/ur.hpp>

+/**
+ * Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically


Suggested change

- * Call an UR API and, if the result is not UR_RESULT_SUCCESS, automatically
+ * Call a UR API and, if the result is not UR_RESULT_SUCCESS, automatically

Fixed this, thanks.

jchlanda · 2023-11-28T09:10:20Z

source/adapters/hip/device.cpp

    return ReturnValue(static_cast<uint64_t>(LocalMemSize));
  }
  case UR_DEVICE_INFO_ERROR_CORRECTION_SUPPORT: {
    int EccEnabled = 0;
    UR_CHECK_ERROR(hipDeviceGetAttribute(
        &EccEnabled, hipDeviceAttributeEccEnabled, hDevice->get()));

-    detail::ur::assertion((EccEnabled == 0) | (EccEnabled == 1));


Same as in CUDA re: bitwise/logical or.

Fixed this, thanks.

jchlanda · 2023-11-28T09:19:33Z

source/adapters/hip/device.cpp

@@ -493,8 +546,9 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice,
    // name instead, this is also what AMD OpenCL devices return.
    if (strlen(Name) == 0) {
      hipDeviceProp_t Props;
-      detail::ur::assertion(hipGetDeviceProperties(&Props, hDevice->get()) ==
-                            hipSuccess);
+      if (hipGetDeviceProperties(&Props, hDevice->get()) != hipSuccess) {


Should we make some effort to translate native error returns into UR-specific ones? There are a number of native calls (here and in CUDA: cuMemGetInfo, cuDeviceTotalMem etc.) that just hard-code UR_RESULT_ERROR_INVALID_OPERATION/UR_RESULT_ERROR_INVALID_SIZE on failure. Previously they were wrapped in an assertion, so the value didn't really matter, but now they will be served up all the way to the user.
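One way to picture the translation suggested here is a single mapping function that every native call site goes through, instead of each site hard-coding a result. The sketch below is a rough illustration under stated assumptions: `NativeError`, `Result`, `mapNativeError`, `fakeNativeQuery` and `queryTotalMemory` are all hypothetical and do not correspond to the real HIP, CUDA or UR enumerations.

```cpp
#include <cstddef>

// Hypothetical native error codes standing in for a vendor API's status type.
enum class NativeError { Success, InvalidValue, OutOfMemory, InvalidDevice };

// Hypothetical adapter-level result codes.
enum class Result {
  Success,
  InvalidValue,
  OutOfHostMemory,
  InvalidDevice,
  Unknown
};

// Central translation point: each native failure keeps its meaning rather
// than collapsing into one hard-coded result.
static Result mapNativeError(NativeError E) {
  switch (E) {
  case NativeError::Success:
    return Result::Success;
  case NativeError::InvalidValue:
    return Result::InvalidValue;
  case NativeError::OutOfMemory:
    return Result::OutOfHostMemory;
  case NativeError::InvalidDevice:
    return Result::InvalidDevice;
  }
  return Result::Unknown;
}

// Stand-in for a native query; fails for negative device ids.
static NativeError fakeNativeQuery(int DeviceId, std::size_t *Bytes) {
  if (DeviceId < 0)
    return NativeError::InvalidDevice;
  *Bytes = 1024;
  return NativeError::Success;
}

// Call site: forward the translated result instead of a fixed code.
static Result queryTotalMemory(int DeviceId, std::size_t *Bytes) {
  return mapNativeError(fakeNativeQuery(DeviceId, Bytes));
}
```

The trade-off is that the mapping table has to be maintained as native APIs add codes, but callers then see a result that reflects the actual failure.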

fabiomestre · 2023-12-05T16:45:58Z

I have updated the target branch of this PR from the adapters branch to the main branch.
Development in UR is moving back to main. The adapters branch will soon be deleted.

Bensuo

Changes to command buffers LGTM

isaacault

Bindless Images LGTM

nrspruit · 2024-05-24T15:46:04Z

source/adapters/level_zero/event.cpp

-      ur_event_handle_t_ *Event = ur_cast<ur_event_handle_t_ *>(e);
-      if (!Event->hasExternalRefs())
-        die("urEventsWait must not be called for an internal event");
+      ur_event_handle_t_ *Event =


Why drop the use of the variable e? It will then just be unused in the internals.

martygrant · 2025-02-20T11:58:57Z

Unified Runtime -> intel/llvm Repo Move Notice

Information

The source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory,
all future development will now be carried out there. This was done in intel/llvm#17043.

The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime.

The contribution guide has been updated with new instructions for contributing to Unified Runtime.

PR Migration

All open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days.
To allow for some breathing space, this automation will not be enabled until next week (27/02/2025).

Should you wish to continue with your PR you will need to migrate it to intel/llvm.
We have provided a script to help automate this process.

This is an automated comment.

martygrant · 2025-02-28T10:47:04Z

Unified Runtime -> intel/llvm Repo Move Notice

Following on from the previous notice, we have now enabled workflows to automatically label and close PRs because the Unified Runtime source code has moved to intel/llvm.

This PR has now been marked with the auto-close label and will be automatically closed after 30 days.

Please review the previous notice for more information, including assistance with migrating your PR to intel/llvm.

Should there be a reason for this PR to remain open, manually remove the auto-close label.

This is an automated comment.

github-actions · 2025-03-31T00:32:49Z

Automatic PR Closure Notice

Information

This PR has been closed automatically. It was marked with the auto-close label 30 days ago as part of the Unified Runtime source code migration to the intel/llvm repository - intel/llvm#17043.

All Unified Runtime development should be done in intel/llvm, details can be found in the updated contribution guide.
This repository will continue to exist as a mirror and will host the specification documentation.

Next Steps

Should you wish to re-open this PR it must be moved to intel/llvm. We have provided a script to help automate this process, otherwise no actions are required.

This is an automated comment.

martygrant added the enhancement New feature or request label Nov 27, 2023

martygrant requested review from a team as code owners November 27, 2023 15:30

JackAKirk reviewed Nov 27, 2023

GeorgeWeb reviewed Nov 27, 2023

jchlanda reviewed Nov 28, 2023

martygrant force-pushed the martin/removeDieTermination branch from 7669e65 to 8b7081e Compare December 1, 2023 12:16

This was referenced Dec 5, 2023

[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate throughout adapters intel/llvm#12076

Closed

Error handling for unsupported adapter features #1161

Closed

fabiomestre changed the base branch from adapters to main December 5, 2023 16:45

martygrant force-pushed the martin/removeDieTermination branch from ad23ddd to b603874 Compare December 6, 2023 17:10

fabiomestre mentioned this pull request Jan 4, 2024

[CUDA] Implement urMemImageGetInfo #1213

Merged

Bensuo approved these changes Jan 8, 2024

isaacault approved these changes Jan 8, 2024

This was referenced Jan 11, 2024

[SYCL] Throw exception for unsupported UR features intel/llvm#12361

Closed

[NATIVECPU] Empty implementation for command buffer fill #1247

Merged

martygrant force-pushed the martin/removeDieTermination branch from b603874 to 714adc2 Compare January 12, 2024 15:35

martygrant added 4 commits April 8, 2024 14:07

[CUDA][HIP][L0][NATIVECPU][OpenCL] Remove usage of die/terminate thro…

c39ee6a

…ughout adapters.

Address feedback - missing breaks in switches, more appropriate error…

272d04e

… codes or replacing original error message to use an assert from cassert header and some other minor fixes.

Wrap native adapter calls inside UR_CHECK_ERROR macro and wrap value …

1b1279d

…checks with an assert instead of returning a generic UR error.

Surround UR_CHECK_ERROR calls with a try/catch as they may throw an e…

432bfcc

…rror. Format code.

martygrant force-pushed the martin/removeDieTermination branch from 714adc2 to 432bfcc Compare April 8, 2024 13:08

nrspruit reviewed May 24, 2024

martygrant added the auto-close label Feb 28, 2025

github-actions bot closed this Mar 31, 2025
