Skip to content

[UR][layer] Add exception sanitizer layer #2216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

hdelan
Copy link
Contributor

@hdelan hdelan commented Oct 17, 2024

Add basic exception sanitizer layer that will call std::abort if an exception is thrown. This layer can be used to make sure an adapter isn't throwing, which is prohibited for a C interface. The layer can be enabled by building with the option -DUR_ENABLE_EXCEPTION_SANITIZER=ON

@hdelan hdelan changed the title Add exception sanitizer layer [UR][layer] Add exception sanitizer layer Oct 17, 2024
@github-actions github-actions bot added the loader Loader related feature/bug label Oct 17, 2024
Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this was never tested :D So might be good to enable this in CI for CTS.

@github-actions github-actions bot added the ci/cd Continuous integration/devliery label Oct 17, 2024
@hdelan hdelan force-pushed the except-san-layer branch 2 times, most recently from f26923e to 375ab46 Compare October 21, 2024 15:27
@hdelan
Copy link
Contributor Author

hdelan commented Oct 22, 2024

I suspect this was never tested :D So might be good to enable this in CI for CTS.

Thanks have done so. I'm not sure what the failures are related to.

@github-actions github-actions bot added cuda CUDA adapter specific issues hip HIP adapter specific issues command-buffer Command Buffer feature addition/changes/specification labels Oct 22, 2024
@hdelan hdelan marked this pull request as ready for review October 22, 2024 09:18
@hdelan hdelan requested review from a team as code owners October 22, 2024 09:18
Add basic exception sanitizer layer that will call std::abort if an
exception is thrown.
And initialize UR_ENABLE_EXCEPTION_SANITIZER to OFF.
Add compile time macro and also fix typos in ur_lib.hpp.
Don't allow UR_CHECK_ERROR to be called outside of a try block.
Copy link
Contributor

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for improving our code quality

Copy link
Contributor

@kbenzie kbenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the missing part here is actually enabling the layer in the CTS environment setup, something along the lines of this:

diff --git a/test/conformance/source/environment.cpp b/test/conformance/source/environment.cpp
index 006ea09b..737d26d2 100644
--- a/test/conformance/source/environment.cpp
+++ b/test/conformance/source/environment.cpp
@@ -90,6 +90,14 @@ uur::PlatformEnvironment::PlatformEnvironment(int argc, char **argv)
             error = "Failed to enable validation layer";
             return;
         }
+#ifdef UR_ENABLE_EXCEPTION_SANITIZER
+        if (urLoaderConfigEnableLayer(config, "UR_LAYER_EXCEPTION_SANITIZER") !=
+            UR_RESULT_SUCCESS) {
+            urLoaderConfigRelease(config);
+            error = "Failed to enable exception sanitizer layer";
+            return;
+        }
+#endif
     } else {
         error = "Failed to create loader config handle";
         return;

CMakeLists.txt Outdated
@@ -43,6 +43,7 @@ option(UR_USE_TSAN "enable ThreadSanitizer" OFF)
option(UR_ENABLE_TRACING "enable api tracing through xpti" OFF)
option(UR_ENABLE_SANITIZER "enable device sanitizer" ON)
option(UR_ENABLE_SYMBOLIZER "enable symoblizer for sanitizer" OFF)
option(UR_ENABLE_EXCEPTION_SANITIZER "enable exception sanitizer" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need a separate option, we could always enable this as part of UR_DEVELOPER_MODE since that's enabled in CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, have removed.

Co-authored-by: Kenneth Benzie (Benie) <[email protected]>
@hdelan
Copy link
Contributor Author

hdelan commented Oct 24, 2024

So the missing part here is actually enabling the layer in the CTS environment setup, something along the lines of this:

diff --git a/test/conformance/source/environment.cpp b/test/conformance/source/environment.cpp
index 006ea09b..737d26d2 100644
--- a/test/conformance/source/environment.cpp
+++ b/test/conformance/source/environment.cpp
@@ -90,6 +90,14 @@ uur::PlatformEnvironment::PlatformEnvironment(int argc, char **argv)
             error = "Failed to enable validation layer";
             return;
         }
+#ifdef UR_ENABLE_EXCEPTION_SANITIZER
+        if (urLoaderConfigEnableLayer(config, "UR_LAYER_EXCEPTION_SANITIZER") !=
+            UR_RESULT_SUCCESS) {
+            urLoaderConfigRelease(config);
+            error = "Failed to enable exception sanitizer layer";
+            return;
+        }
+#endif
     } else {
         error = "Failed to create loader config handle";
         return;

Interesting, how was the layer working here without this call?

@pbalcer
Copy link
Contributor

pbalcer commented Oct 24, 2024

Interesting, how was the layer working here without this call?

The init function of all layers gets called unconditionally, and the layer themselves need to check whether they are on the enabled list.
This is so that a layer can force-enable itself if it's needed as a dependency of something else (e.g., validation layers).

@hdelan
Copy link
Contributor Author

hdelan commented Oct 24, 2024

The init function of all layers gets called unconditionally, and the layer themselves need to check whether they are on the enabled list. This is so that a layer can force-enable itself if it's needed as a dependency of something else (e.g., validation layers).

Aha thanks @pbalcer that explains it

@hdelan
Copy link
Contributor Author

hdelan commented Oct 25, 2024

Have updated. Should be good to go.

@github-actions github-actions bot added the conformance Conformance test suite issues. label Oct 25, 2024
@igchor
Copy link
Member

igchor commented Oct 25, 2024

I think we should actually have a separate option to enable/disable this layer. Right now, as I understand, there is no way to disable this layer for L0 and L0 v2 adapter (other than disabling developer mode).

In L0 and L0 v2, we use exceptions to propagate errors, expecting that the loader will translate them into error codes, like here: https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/v2/kernel.cpp#L77 .There should be no abort for L0

@kbenzie
Copy link
Contributor

kbenzie commented Oct 28, 2024

In L0 and L0 v2, we use exceptions to propagate errors, expecting that the loader will translate them into error codes, like here: https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/v2/kernel.cpp#L77 .There should be no abort for L0

My personal view of this is it's a bad idea and adapters should never allow exceptions to escape the UR entry points because it is a C API. The reason this layer was created is due to exceptions crossing that ABI boundry causing crashes.

@igchor
Copy link
Member

igchor commented Oct 28, 2024

In L0 and L0 v2, we use exceptions to propagate errors, expecting that the loader will translate them into error codes, like here: https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/v2/kernel.cpp#L77 .There should be no abort for L0

My personal view of this is it's a bad idea and adapters should never allow exceptions to escape the UR entry points because it is a C API. The reason this layer was created is due to exceptions crossing that ABI boundry causing crashes.

If so, I think we should then enforce each function in an adapter to be wrapped by try/catch (that is, move those auto-generated checks from loader: https://github.com/oneapi-src/unified-runtime/blob/main/source/loader/ur_libapi.cpp#L2422 to each adapter function). We could also consider making all functions (and ddi pointers) noexcept. This might allow compiler/coverity to complain if there is an uncaught exception.

@kbenzie
Copy link
Contributor

kbenzie commented Oct 28, 2024

If so, I think we should then enforce each function in an adapter to be wrapped by try/catch (that is, move those auto-generated checks from loader: https://github.com/oneapi-src/unified-runtime/blob/main/source/loader/ur_libapi.cpp#L2422 to each adapter function).

Since the adapters are not generated from the yaml source, that could get quite manual and annoying.

We could also consider making all functions (and ddi pointers) noexcept. This might allow compiler/coverity to complain if there is an uncaught exception.

This is an interesting approach and should be much less of a burden to add to adapters. I doubt we could do that as part of the ur_api.h header though. If we can make compilers emit diagnostics, that would be ideal.

@igchor
Copy link
Member

igchor commented Oct 28, 2024

If so, I think we should then enforce each function in an adapter to be wrapped by try/catch (that is, move those auto-generated checks from loader: https://github.com/oneapi-src/unified-runtime/blob/main/source/loader/ur_libapi.cpp#L2422 to each adapter function).

Since the adapters are not generated from the yaml source, that could get quite manual and annoying.

Well, we could also add this try/catch to urGet*ProcAddTable when creating a ddi tables in adapters (which are auto-generated). Something like this:

urGetEnqueueProcAddrTable(...) {
    pDdiTable->pfnEventsWait = [](auto... args) { 
      try {
          return urEnqueueEventsWait(args...);
      } catch(...) { return exceptionToResult(std::current_exception()); 
    };
  
    ...    
}

I belive the compiler will optimize it so there should be no additional overhead but we can verify that.

@kbenzie
Copy link
Contributor

kbenzie commented Oct 28, 2024

For the static linking use case there won't be any DDI tables, this needs to happen in the adapters not in some wrapping layer.

@igchor
Copy link
Member

igchor commented Oct 28, 2024

Okay, I see; then it's probably case by case. I will add the necessary try/catch blocks to the v2 adapter.

@martygrant
Copy link
Contributor

Unified Runtime -> intel/llvm Repo Move Notice

Information

The source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory,
all future development will now be carried out there. This was done in intel/llvm#17043.

The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime.

The contribution guide has been updated with new instructions for contributing to Unified Runtime.

PR Migration

All open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days.
To allow for some breathing space, this automation will not be enabled until next week (27/02/2025).

Should you wish to continue with your PR you will need to migrate it to intel/llvm.
We have provided a script to help automate this process.


This is an automated comment.

@martygrant
Copy link
Contributor

Unified Runtime -> intel/llvm Repo Move Notice

Following on from the previous notice, we have now enabled workflows to automatically label and close PRs because the Unified Runtime source code has moved to intel/llvm.

This PR has now been marked with the auto-close label and will be automatically closed after 30 days.

Please review the previous notice for more information, including assistance with migrating your PR to intel/llvm.

Should there be a reason for this PR to remain open, manually remove the auto-close label.


This is an automated comment.

Copy link
Contributor

Automatic PR Closure Notice

Information

This PR has been closed automatically. It was marked with the auto-close label 30 days ago as part of the Unified Runtime source code migration to the intel/llvm repository - intel/llvm#17043.

All Unified Runtime development should be done in intel/llvm, details can be found in the updated contribution guide.
This repository will continue to exist as a mirror and will host the specification documentation.

Next Steps

Should you wish to re-open this PR it must be moved to intel/llvm. We have provided a script to help automate this process, otherwise no actions are required.


This is an automated comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-close ci/cd Continuous integration/devliery command-buffer Command Buffer feature addition/changes/specification conformance Conformance test suite issues. cuda CUDA adapter specific issues hip HIP adapter specific issues loader Loader related feature/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants