Issues with 'HandleLLVMOptions' and 'LLVM_DEFINITIONS' #125779
Comments
Clarification on the issue after some investigation: What is recommended by the docs at
The However, what MLIR projects are doing is (as recommended in MLIR Standalone example)
The So I think either If building against installed/pre-packaged LLVM
If embedding LLVM as a sub-project
add_definitions(${LLVM_DEFINITIONS})
be retired?

This change attempts to resolve issues with the use of `HandleLLVMOptions` and `LLVM_DEFINITIONS`; see llvm/llvm-project#125779.

Note that this is a breaking change because it could cause build breakage for downstream users. As noted in the comments added to the CMakeLists.txt file, there may not be one perfect CMake incantation for setting Stablehlo's options that works for all users. Since it's easier to *add* compiler options at a specific scope than it is to alter or remove options that Stablehlo itself is setting, this change hoists responsibility to the user for setting any compiler options previously provided by the `HandleLLVMOptions` call when building in embedded mode.

This means that if a user was using `FetchContent|add_subdirectory|CPMAddPackage` to build Stablehlo in their project, they should invoke

```
find_package(LLVM CONFIG REQUIRED)
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND ${LLVM_DEFINITIONS})
add_definitions(${LLVM_DEFINITIONS_LIST})
include(HandleLLVMOptions)
```

in their project at the appropriate scope, or set desired flags in some other manner.
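To illustrate why the `separate_arguments` step matters, here is a minimal sketch that can be run in script mode with `cmake -P`. The value assigned to `LLVM_DEFINITIONS` is an assumed example, not taken from any particular LLVM distribution; it only mimics the space-separated form that causes trouble.

```cmake
# demo.cmake: run with `cmake -P demo.cmake` (needs CMake 3.9+ for NATIVE_COMMAND).
cmake_minimum_required(VERSION 3.9)

# Assumed example value: one string with spaces, not a CMake list.
set(LLVM_DEFINITIONS "-D_GNU_SOURCE -D__STDC_LIMIT_MACROS")

# CMake lists are semicolon-separated, so this string is a single element
# and would reach add_definitions() as one argument.
list(LENGTH LLVM_DEFINITIONS raw_count)
message(STATUS "before split: ${raw_count} element(s)")   # 1

# Split the string the way a native shell would, yielding a proper list.
# The argument is quoted so the whole string is passed as one value.
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND "${LLVM_DEFINITIONS}")
list(LENGTH LLVM_DEFINITIONS_LIST split_count)
message(STATUS "after split: ${split_count} element(s)")  # 2
```

`add_definitions` itself is not available in script mode, so the sketch only shows the list handling; in a project, the split list is what would be passed on.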
Changes are listed oldest-to-newest:

0751887ed36b20cad7d47568ecf35745368acfa9 by Christopher Bate:

[cmake] Fix macro redefinition issues

When building with certain LLVM distributions, the `HandleLLVMOptions` CMake script can set `LLVM_DEFINITIONS` to a space-separated list of `-Ditem=value` definitions. When passed to `add_definitions`, this does not have the desired effect.

It's actually not clear where `add_definitions(${LLVM_DEFINITIONS})` really needs to be invoked. I filed a GitHub issue upstream to figure it out: llvm/llvm-project#125779

In our case, it only has the effect of duplicating options and therefore shouldn't be called.

--
9bebb5a25c5cd1b80bcdec931e990ca81d0c4e41 by Chris Bate:

[cmake] Retire some build system logic that is no longer necessary

We used to try to do the "right" thing and make targets carry compilation definitions, but we instead use CMake directory level definitions now. Also, remove the `MLIR_TRT_USE_LINKER` flag since we just use `LLVM_USE_LINKER`.

--
88d92a49c52dac9f4d044f09b24e0c750e5cd064 by Zixin Huang:

[python/CAPI] Add Python API to get device name like cuda:0

1. Added `mtrtDeviceGetIndex` to Runtime.cpp and Runtime.h.
2. Added `Device.get_name()` to RuntimePyBind.cpp.

Runtime.cpp returns only the index rather than a string because passing string pointers in and out can be complicated (it may require allocating the string buffer and estimating its size beforehand).

A test is added:

```
devices = client.get_devices()
print(devices[0].get_name())
```

--
7f106e7d501a7e0e93a54c4ec43afa0b4ad2d054 by Yuan Yao:

[executor][python]: Fix runtime API for creating float/integer scalars.

Previously, when using the `RuntimeClient.create_scalar` API, float numbers were erroneously cast to an integer. Also, an `i32` scalar was always created as an `i64` scalar. This commit fixes these bugs.
--
44ccb0370f3435b3f3366dfc534e7d17a0f2dc4e by Samurdhi Karunaratne:

[tensorrt/lib/Utils] Add splat dense attribute support for TRT plugins

Although dense attributes were supported, splat dense attributes were not supported so far. When a splat attr is detected, some space is allocated inside `PluginParams` so that the PluginField sees a buffer correctly filled with the splat value.

Also fixes a bug in tensorrt/test/lib/Target/PluginUtils.h when printing integer plugin fields.

Signed-off-by: Samurdhi Karunaratne

--
e6c291638dbe993360f0f10b4e3649862ac9333e by Chris Bate:

[cmake] Extensive updates to dependency management

Makes a big update to how dependencies are declared in the CMake code:

Previously, we used a set of ad-hoc functions to wrap CPMAddPackage. Each dependency had its own function defined `add_[depname]` in `Depependencies.cmake`. This was unwieldy and disorganized. Furthermore, other CMake dependencies like Stablehlo also need to declare their dependencies in some manner, and they do so via any number of means, and the top-level project typically doesn't have any control over that besides manually patching the upstream project's code.

To clean this all up into one coherent interface, we can declare a CMake [Dependency Provider](https://cmake.org/cmake/help/latest/command/cmake_language.html#dependency-providers), available in CMake 3.24+. The dependency provider is a top-level CMake script which can only be declared by the top-level project. It allows us to intercept any and every `find_package` or `FetchContent_MakeAvailable` call in the whole build, including for dependencies.

Using this mechanism, in our core CMakeLists.txt, and in all the dependencies' CMakeLists, we just need to use `find_package(<name> ...)` to declare dependencies. The Dependency Provider then intercepts those `find_package` calls and satisfies the dependency however we like.
To clean things up further, the default provider (build_tools/cmake/DependencyProvider.cmake) now *declares* dependencies at the top of the file (rather than forcing them to be downloaded immediately), and they are only actually downloaded and incorporated into the build when the `find_package` call is made. In our case, `find_package` will cause `CPMAddPackage` to be invoked.

The `nv_register_package` function is the declaration function; it accepts the `CPMAddPackage` arguments as well as a special "POST_ADD_HOOK" field that specifies an inline script to be executed. The contents of `nv_register_package` are simply saved and forwarded to `CPMAddPackage` (less POST_ADD_HOOK) when `find_package` is invoked. Finally, after `CPMAddPackage` returns, the `POST_ADD_HOOK` is executed.

--
c2fcfa3f2db54c4ab0737f4df504776b4b07f19c by Christopher Bate:

[cmake] Fix library search directory for TensorRT dialect LIT tests

Fixes the search directory for dynamic libraries, resolving build failure when BUILD_SHARED_LIBS=ON.

--
9fe64b7d5075d615c520232cf79f60982d382ec1 by Christopher Bate:

[tensorrt] NFC: Factor out 'isTensorRTInt8Type' to fix circular build issue

Recently, the addition of the new plugin support for TRT 10.8 caused a circular dependency between the TensorRT Dialect library and the plugin support library. This only became apparent when building with BUILD_SHARED_LIBS=ON. This change factors out a "base" library for the TensorRT dialect that resolves the circular dependency issue.

--
638abb11a731d48599ac2af349673e8324237931 by Sagar Shelke:

[compiler/lib/Conversion] Add support for `stablehlo.custom_call` conversion

This MR adds a converter for `stablehlo.custom_call` op conversion to TensorRT. Currently, the only valid conversion is when the call target is `mhlo.topk`; more possible conversions can be added later.

MLIR tests are added.
--
669d488b6b13521df2b54246f4fcd301ab6a8ace by Christopher Bate:

NFC: Fix incorrect header in NvInferPluginUtils

Point the include to TensorRTBase.h; otherwise we may be including generated headers on which NvInferPluginUtils does not have a dependency (and which might not be generated yet in the build).

--
340a8a5e0888a8c881d469758f4c871f7233e976 by Christopher Bate:

[executor] Update 'allocs-to-globals' pass to use DataLayout for type size calculation

Fixes an issue where 'allocs-to-globals' can crash if the memref element type does not have a canonical bitwidth, e.g. `complex<...>`. Use the module's data layout to get the type byte size instead.

--
6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 by Chris Bate:

[compiler] Cleanup pipeline declarations in Plan Dialect Transforms

This change removes unused declarations from the Plan Dialect 'Passes.h' header file and removes unused pipeline registrations.

We consolidate the three bufferization stages (bufferization, optimizations, deallocation) into a single pipeline declaration that is used by the different top-level pipelines (e.g. linalt-to-executable, stablehlo-to-executable, and so on).

--
464616b4291de0dc80a6b6ce40e7ecd8a06c99fd by Christopher Bate:

[compiler] NFC: add some additional 'plan-bufferize-pipeline' regression tests

Adds some additional regression tests for the end-to-end bufferization and deallocation pipeline. This set of tests captures cases where we are producing sub-optimal code related to host constants and shape values.
--
283bf821e4bc6f8a6b5a4f17509d87fcf72a0023 by Christopher Bate:

NFC: fix missing FileCheck command in test added in 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2

--
bf050a70cdc1c714b5f1d782a043f889abf3272a by Christopher Bate:

[compiler] Use bufferization analysis in 'plan-alloc-tensors'

This change reworks the parts of the 'plan-alloc-tensors' pass that tried to establish DPS connectivity in loop regions and at function boundaries. We now use existing routines in the bufferization analysis to support the transformation, which produces better results and simpler code.

Additionally, a bug is closed where the transform could potentially try modifying functions without fixing the callers if inlining was not performed. We now explicitly check to make sure the functions we are modifying do not have any callers and that we do not traverse nested symbol tables. A future change can update the pass to allow modifying functions with callers.

--
9f03c958cdee998d4c6ea059aa9125e395b68688 by Christopher Bate:

[compiler] Add new module bufferization pass

This change forks the upstream 'one-shot-module-bufferization' logic into a 'module-bufferization' pass under the 'mlir-tensorrt/compiler' project.

The upstream 'OneShotModuleBufferization' transformation is an extension of the core one-shot-bufferization infrastructure that includes some additional analysis state which is attached to functions. Functions are analyzed to identify which arguments are read/written and additionally which arguments are equivalent to which result values.

We require forking the upstream pass because the upstream transformation does not currently allow for supporting functions and call-like operations outside of `func.func` and `func.call` (support for other operations using the appropriate interfaces has had trouble landing upstream).
In addition, the upstream pass does not take into account the potential nesting of modules/symbol-tables where nested modules may require different bufferization options than the parent module and functions may be called from a parent module (if using call-like operations other than `func.call`).

To allow dynamic querying of bufferization options from modules, we create a new interface, `BufferizationScopeOpInterface`, that lets us attach methods for producing OneShotBufferizationOptions to our custom module-like operations. This new interface, together with the forked pass logic, allows us to jointly bufferize multiple nested modules, whereas previously the OneShotModuleBufferization limitations forced a phase-ordering constraint: nested modules had to be bufferized prior to outer modules.

--
6fbed19b52415fc21ccc13807b1112886f405ef8 by Christopher Bate:

[executor] Fix Executor flatbuffer C++ generation cmake function

This change simply ensures that if the flatbuffer dependency version changes, the targets which generate C++ from the flatbuffer schema file will also go out-of-date, forcing re-generation of the C++ flatbuffer files.

--
cd8c672be49aaafc5d480d24f900a413f1effd59 by Christopher Bate:

[executor] Bump Flatbuffers to v25.2.10

Bumps the Flatbuffers version and migrates it to the new dependency handling mechanism.

--
99f60bfe28d1247c060cc0be232ec73546785bb9 by Christopher Bate:

[executor] Set max Flatbuffer verification size to the 64bit limit

Since we use Flatbuffer's 64bit features, we need to explicitly enable verification to allow 64bit offsets.

--
523da05c5720f395ce9c5815d2a44e82f51e924d by Christopher Bate:

[compiler] Change 'allow-return-allocs-from-loops' to 'true' by default

This allows us to handle loops that don't bufferize in-place during 'plan-module-bufferization'. This was the original behavior of `plan-bufferize`.
Certain integration tests for generation models won't work without it currently.

--
a31e690e2a863aaf52528b8de1c66a2f1b1ea081 by Christopher Bate:

[tensorrt] Add a size limit to constant folding in 'tensorrt-transpose-elimination'

Constant folding of transpose operations in the `tensorrt-transpose-elimination` pass can be extremely costly for large weights (e.g. embedding matrix). We need to impose an upper bound on what constant folding we will allow using naive transposition of DenseElementsAttrs.

--
3b7017e2356022db4ed21369422b2ead639b930b by Christopher Bate:

[tensorrt] NFC: add a statistic to the 'tensorrt-broadcast-elimination' pass

Add a simple statistic (only set when built with assertions enabled) that counts the number of eliminated broadcast operations.

--
6feeec4d85ab976ed16676795aa426a87f4a7f38 by Christopher Bate:

[tensorrt] Place an upper bound on constant folding computation

Places an upper bound of `1<<17` on the number of elements allowed for constant folding of `DenseElementsAttrs` in the `fold` method of TensorRT operations.

--
e8690a68101085e606746361db41533597ef3c1d by Sagar Shelke:

[Conversion/StablehloToTensorRT] Ease restrictions of `stablehlo.reduce` conversion to TensorRT

Previously, `stablehlo.reduce` conversion to TensorRT had the following restrictions.

- If reduction axes are not contiguous, conversion was unsupported.
- In case of multiple reduction axes, reduction happened through reshape and reduction on single axis.

It turns out that this is too strict of a restriction. TensorRT supports reduction on multiple as well as non-contiguous axes.

Stablehlo to TensorRT conversion tests are updated and a tensorrt test is added. Stablehlo to TensorRT engine conversion path is tested for reduction specific cases with both TensorRT 8.6 and 10.7.
--
af0c30a2d3943a8f8b5a80be49a7c0fc945ae07d by Christopher Bate:

[compiler] Fix failing `test_tensorrt_add.py` after importing OSS changes

Fixes a test that was not failing after importing GitHub MLIR-TensorRT changes.

--
5599299c5ce64aa0cafad411d90d7fe3bbda7f26 by Christopher Bate:

NFC: [executor] cleanup TensorRTModule output allocator logic

Updates TensorRT output allocator logic to remove "reallocation" features which were not usable. As we currently use it, the output allocator lazily allocates new results for each invocation. In order to improve allocator performance in the future, we should:

a) Pre-allocate when we know the static dimensions of the output
b) Represent the allocator in the IR so that it can be hoisted out of loops, allowing the outputs of TRT engines whose results don't escape the function to be re-used without an additional allocation.

Co-authored-by: Sagar Shelke
Co-authored-by: Zixin Huang
Co-authored-by: Yuan Yao
Co-authored-by: Samurdhi Karunaratne
GitOrigin-RevId: 4c4fdae1e9ae089df09e61abf06be28adc7217ca
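The declare-then-provide flow described in the dependency management commit (e6c291638dbe) could look roughly like the sketch below. This is not the project's `DependencyProvider.cmake`: `demo_register_package`, `demo_provide_dependency`, the `DEMO_PKG_*` property names, and the `demo_dep` package are all hypothetical, and the sketch only logs the saved arguments where the source says `CPMAddPackage` is invoked. Per the CMake documentation, a provider can only be set from a file listed in `CMAKE_PROJECT_TOP_LEVEL_INCLUDES`.

```cmake
# provider.cmake: hypothetical sketch; requires CMake 3.24 or newer.
# Load it with: cmake -DCMAKE_PROJECT_TOP_LEVEL_INCLUDES=provider.cmake ...

# Hypothetical stand-in for `nv_register_package`: only records the request.
function(demo_register_package name)
  # Split off the optional POST_ADD_HOOK; everything else is forwarded later.
  cmake_parse_arguments(PARSE_ARGV 1 ARG "" "POST_ADD_HOOK" "")
  set_property(GLOBAL PROPERTY DEMO_PKG_REGISTERED_${name} TRUE)
  set_property(GLOBAL PROPERTY DEMO_PKG_ARGS_${name} "${ARG_UNPARSED_ARGUMENTS}")
  set_property(GLOBAL PROPERTY DEMO_PKG_HOOK_${name} "${ARG_POST_ADD_HOOK}")
endfunction()

# Provider command: CMake calls it for every find_package() in the build.
macro(demo_provide_dependency method package_name)
  get_property(_demo_registered GLOBAL PROPERTY DEMO_PKG_REGISTERED_${package_name})
  if(_demo_registered)
    get_property(_demo_args GLOBAL PROPERTY DEMO_PKG_ARGS_${package_name})
    get_property(_demo_hook GLOBAL PROPERTY DEMO_PKG_HOOK_${package_name})
    # A real provider would add the package here (the source forwards the
    # saved arguments to CPMAddPackage). This sketch only logs them.
    message(STATUS "Providing ${package_name} with: ${_demo_args}")
    if(NOT "${_demo_hook}" STREQUAL "")
      cmake_language(EVAL CODE "${_demo_hook}")  # run the post-add hook
    endif()
    # Report success to the intercepted find_package() call.
    set(${package_name}_FOUND TRUE)
  endif()
  # Unregistered packages fall through to the built-in find_package().
endmacro()

# Declaration only: nothing is fetched until find_package(demo_dep) runs.
demo_register_package(demo_dep
  NAME demo_dep VERSION 1.0
  POST_ADD_HOOK "message(STATUS \"demo_dep post-add hook ran\")"
)

cmake_language(
  SET_DEPENDENCY_PROVIDER demo_provide_dependency
  SUPPORTED_METHODS FIND_PACKAGE
)
```

The provider is written as a macro so that setting `<name>_FOUND` happens in the scope of the intercepted `find_package` call. Because the sketch never actually adds a package, a real provider would also need to make the package's targets available before reporting success.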
Changes are listed oldest-to-newest: 0751887ed36b20cad7d47568ecf35745368acfa9 by Christopher Bate <[email protected]>: [cmake] Fix macro redefinition issues When building with with certain LLVM distributions, the `HandleLLVMOptions` CMake script can set `LLVM_DEFINITIONS` to a space-separated list of `-Ditem=value` definitions. When passed to `add_definitions`, this does not have the desired effect. It's actually not clear where `add_definitions(${LLVM_DEFINITIONS})` really needs to be invoked. I filed a GitHub issue upstream to figure it out: llvm/llvm-project#125779 In our case, it only has the effect of duplicating options and therefore shouldn't be called. -- 9bebb5a25c5cd1b80bcdec931e990ca81d0c4e41 by Chris Bate <[email protected]>: [cmake] Retire some build system logic that is no longer necessary We used to try to do the "right" thing and make targets carry compilation definitions, but we instead use CMake directory level definitions now. Also, remove the `MLIR_TRT_USE_LINKER` flag since we just use `LLVM_USE_LINKER`. -- 88d92a49c52dac9f4d044f09b24e0c750e5cd064 by Zixin Huang <[email protected]>: [python/CAPI] Add Python API to get device name like cuda:0 1. added mtrtDeviceGetIndex into Runtime.cpp Runtime.h. 2. added Device.get_name() into RuntimePyBind.cpp The reason to only get index but not string in Runtime.cpp because passing in/out string ptrs can be complicated (which may require allocating the string buffer and estimating its size beforehand). Test is added: ``` devices = client.get_devices() print(devices[0].get_name()) ``` -- 7f106e7d501a7e0e93a54c4ec43afa0b4ad2d054 by Yuan Yao <[email protected]>: [executor][python]: Fix runtime API for creating float/integer scalars. Previously, when using the `RuntimeClient.create_scalar` API, float numbers were erroneously cast to an integer. Also an `i32` scalar was always created as an `i64` scalar. This commit fixes these bugs. 
-- 44ccb0370f3435b3f3366dfc534e7d17a0f2dc4e by Samurdhi Karunaratne <[email protected]>: [tensorrt/lib/Utils] Add splat dense attribute support for TRT plugins Although dense attributes were supported, splat dense attributes were not supported so far. When a splat attr is detected some space is allocated inside `PluginParams` so that the PluginField sees a buffer correctly filled with the splat value. Also fixes a bug in tensorrt/test/lib/Target/PluginUtils.h when printing integer plugin fields. Signed-off-by: Samurdhi Karunaratne <[email protected]> -- e6c291638dbe993360f0f10b4e3649862ac9333e by Chris Bate <[email protected]>: [cmake] Extensive updates to dependency management Makes a big update to how dependencies are declared in the CMake code: Previously, we used a set of ad-hoc functions to wrap CPMAddPackage. Each dependency had its own function defined `add_[depname]` in `Depependencies.cmake`. This was unwieldy and disorganized. Furthermore, other CMake dependencies like Stablehlo also need to manage to declare their dependencies in some manner, and they do so via any number of means, and the top-level project typically doesn't have any control over that besides manually patching the upstream project's code. To clean this all up into one coherent interface, we can declare a CMake [Dependency Provider](https://cmake.org/cmake/help/latest/command/cmake_language.html#dependency-providers), available in CMake 3.24+. The dependency provider is a top-level CMake script which can only be declared by the top-level project. It allows us intercept any and every `find_package` or `FetchContent_MakeAvailable` call in the whole build, including for dependencies. Using this mechanism, in our core CMakeLists.txt, and in all the dependencies' CMakeLists, we just need to use `find_package(<name> ...)` to declare dependencies. The Dependency Provider then intercepts those `find_package` calls and satisfies the depdency however we like. 
To clean things up further, the default provider (build_tools/cmake/DependencyProvider.cmake) now *declares* dependencies at the top of the file (rather than forcing them to be downloaded immediately), and they are only actually downloaded and incorporated into the build when the `find_package` call is made. In our case, `find_package` will cause `CPMAddPackage` to be invoked. The `nv_register_package` is the declaration function, and it accepts to CPMAddPackage arguments as well as a special "POST_ADD_HOOK" field to indicate inline a script to be executed. The contents of `nv_register_package` are simply saved and forwarded to `CPMAddPackage` (less POST_ADD_HOOK) when `find_package` is invoked. Finally, after `CPMAddPackage` returns, the `POST_ADD_HOOK` is executed. -- c2fcfa3f2db54c4ab0737f4df504776b4b07f19c by Christopher Bate <[email protected]>: [cmake] Fix library search directory for TensorRT dialect LIT tests Fixes the search directory for dynamic libraries, resolving build failure when BUILD_SHARED_LIBS=ON. -- 9fe64b7d5075d615c520232cf79f60982d382ec1 by Christopher Bate <[email protected]>: [tensorrt] NFC: Factor out 'isTensorRTInt8Type' to fix circular build issue Recently, the addition of the new plugin support for TRT 10.8 caused a circular dependency between the TensorRT Dialect library and the plugin support library. This only became apparent when building with BUILD_SHARED_LIBS=ON. This change factors out a "base" library for the TensorRT dialect that resolves the circular dependency issue. -- 638abb11a731d48599ac2af349673e8324237931 by Sagar Shelke <[email protected]>: [compiler/lib/Conversion] Add support for `stablehlo.custom_call` conversion This MR adds a converter for `stablehlo.custom_call` op conversion to TensorRT. Currently only valid conversion is when call target is `mhlo.topk` and more possible conversions can be added later. MLIR tests are added. 
-- 669d488b6b13521df2b54246f4fcd301ab6a8ace by Christopher Bate <[email protected]>: NFC: Fix incorrect header in NvInferPluginUtils Point the include to TensorRTBase.h; otherwise we may be including generated headers on which NvInferPluginUtils does not have a dependency (and might not be generated yet in the build). -- 340a8a5e0888a8c881d469758f4c871f7233e976 by Christopher Bate <[email protected]>: [executor] Update 'allocs-to-globals' pass to use DataLayout for type size calculation Fixes an issue where 'allocs-to-globals' can crash if the the memref element type does not have a canonical bitwidth, e.g. `complex<...>`. Use the module's data layout to get the type byte size instead. -- 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 by Chris Bate <[email protected]>: [compiler] Cleanup pipeline declarations in Plan Dialect Transforms This change removes unused declarations from the Plan Dialect 'Passes.h' header file and removes unused pipeline registrations. We consolidate the three bufferization stages (bufferization, optimizations, deallocation) into a single pipeline declaration that is used by the different top-level pipelines (e.g. linalt-to-executable, stablehlo-to-executable, and so on). -- 464616b4291de0dc80a6b6ce40e7ecd8a06c99fd by Christopher Bate <[email protected]>: [compiler] NFC: add some additional 'plan-bufferize-pipeline' regression tests Adds some additional regression tests for the end-to-end bufferization and deallocation pipeline. This set of tests capture cases where we are producing sub-optimal code related to host constants and shape values. 
-- 283bf821e4bc6f8a6b5a4f17509d87fcf72a0023 by Christopher Bate <[email protected]>: NFC: fix missing FileCheck command in test added in 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 -- bf050a70cdc1c714b5f1d782a043f889abf3272a by Christopher Bate <[email protected]>: [compiler] Use bufferization analysis in 'plan-alloc-tensors' This change reworks the parts of the 'plan-alloc-tensors' pass that tried to establish DPS connectivity in loop regions and at function boundaries. We now use existing routines in the bufferization analysis to support the transformation, which results in better results and more simplified code. Additionally, a bug is closed where the transform could potentially try modifying functions without fixing the callers if inlining was not performed. We now explicitly check to make sure the functions we are modifying do not have any callers and that we do not traverse nested symbol tables. A future change can update the pass to allow modifying functions with callers. -- 9f03c958cdee998d4c6ea059aa9125e395b68688 by Christopher Bate <[email protected]>: [compiler] Add new module bufferization pass This change forks the upstream 'one-shot-module-bufferization' logic into a 'module-bufferization' pass under the 'mlir-tensorrt/compiler' project. The upstream 'OneShotModuleBufferization' transformation is an extension of the core one-shot-bufferization infrastructure that includes some additional analysis state which is attached to functions. Functions are analyzed to identify which arguments are read/written and additionally which arguments are equivalent to which result values. We require forking the upstream pass because the upstream transformation does not currently allow for supporting functions and call-like operations outside of `func.func` and `func.call` (support for other operations using the appropriate interfaces has had trouble landing upstream). 
In addition, the upstream pass does not take into account the potential nesting of modules/symbol-tables where nested modules may require different bufferization options than the parent module and functions may be called from a parent module (if using call-like operations other than `func.call`). To allow dynamic querying of bufferization options from modules, we create a new interface, `BufferizationScopeOpInterface` that lets us attach methods for producing OneShotBufferizationOptions to our custom module-like operations. This new interface, together with the forked pass logic, allows us to jointly bufferize multiple nested modules, whereas previously the OneShotModuleBufferization limitations forced a phase-ordering constraint -- nested modules had to be bufferized prior to outer modules. -- 6fbed19b52415fc21ccc13807b1112886f405ef8 by Christopher Bate <[email protected]>: [executor] Fix Executor flatbuffer C++ generation cmake function This change simply ensures that if the flatbuffer dependency version changes that the targets which generate C++ from the flatbuffer schema file will also go out-of-date, forcing re-generation of the C++ flatbuffer files. -- cd8c672be49aaafc5d480d24f900a413f1effd59 by Christopher Bate <[email protected]>: [executor] Bump Flatbuffers to v25.2.10 Bumps the Flatbuffers version and migrates it to the new dependency handling mechanism. -- 99f60bfe28d1247c060cc0be232ec73546785bb9 by Christopher Bate <[email protected]>: [executor] Set max Flatbuffer verification size to the 64bit limit Since we use Flatbuffer's 64bit features, we need to explicitly enable verification to allow 64bit offsets. -- 523da05c5720f395ce9c5815d2a44e82f51e924d by Christopher Bate <[email protected]>: [compiler] Change 'allow-return-allocs-from-loops' to 'true' by default This allows us to handle loops that don't bufferize in-place during 'plan-module-bufferization'. This was the original behavior of `plan-bufferize`. 
Certain integration tests for generation models won't work without it currently. -- a31e690e2a863aaf52528b8de1c66a2f1b1ea081 by Christopher Bate <[email protected]>: [tensorrt] Add a size limit to constant folding in 'tensorrt-transpose-elimination' Constant folding of transpose operations in the `tensorrt-transpose-elimination` dialect can be extremely costly for large weights (e.g. embedding matrix). We need to impose an upper bound on what constant folding we will allow using naive transposition of DenseElementsAttrs. -- 3b7017e2356022db4ed21369422b2ead639b930b by Christopher Bate <[email protected]>: [tensorrt] NFC: add a statistic to the 'tensorrt-broadcast-elimination' pass Add a simple statistic (only set when built with assertions enabled) that counts the number of eliminated broadcast operations. -- 6feeec4d85ab976ed16676795aa426a87f4a7f38 by Christopher Bate <[email protected]>: [tensorrt] Place an upper bound on constant folding computation Places an upper bound of `1<<17` on the number of elements allowed for constant folding of `DenseElementsAttrs` in the `fold` method of TensorRT operations. -- e8690a68101085e606746361db41533597ef3c1d by Sagar Shelke <[email protected]>: [Conversion/StablehloToTensorRT] Ease restrictions of `stablehlo.reduce` conversion to TensorRT Previously, `stablehlo.reduce` conversion to TensorRT had the following restrictions. - If reduction axes are not contiguous, conversion was unsupported. - In case of multiple reduction axes, reduction happened through reshape and reduction on single axis. It turns out that this is too strict of a restriction. TensorRT supports reduction on multiple as well as non-contiguous axis. Stablehlo to TensorRT conversion tests are updated and a tensorrt test is added. Stablehlo to TensorRT engine conversion path is tested for reduction specific cases with both TensorRT 8.6 and 10.7. 
-- af0c30a2d3943a8f8b5a80be49a7c0fc945ae07d by Christopher Bate <[email protected]>: [compiler] Fix failing `test_tensorrt_add.py` after importing OSS changes Fixes a test that was not failing after importing GitHub MLIR-TensorRT changes. -- 5599299c5ce64aa0cafad411d90d7fe3bbda7f26 by Christopher Bate <[email protected]>: NFC: [executor] cleanup TensorRTModule output allocator logic Updates TensorRT output allocator logic to remove "reallocation" features which were not usable. As we currently use it, the output allocator lazily allocates new results for each invocation. In order to improve allocator performance in the future, we should: a) Pre-allocate when we know the static dimensions of the output b) Represent the allocator in the IR so that it can be hoisted out of loops, allowing the outputs of TRT engines whose results don't escape the function to be re-used without an additional allocation. Co-authored-by: Sagar Shelke <[email protected]> Co-authored-by: Zixin Huang <[email protected]> Co-authored-by: Yuan Yao <[email protected]> Co-authored-by: Samurdhi Karunaratne <[email protected]> GitOrigin-RevId: 4c4fdae1e9ae089df09e61abf06be28adc7217ca
Changes are listed oldest-to-newest: 0751887ed36b20cad7d47568ecf35745368acfa9 by Christopher Bate <[email protected]>: [cmake] Fix macro redefinition issues When building with with certain LLVM distributions, the `HandleLLVMOptions` CMake script can set `LLVM_DEFINITIONS` to a space-separated list of `-Ditem=value` definitions. When passed to `add_definitions`, this does not have the desired effect. It's actually not clear where `add_definitions(${LLVM_DEFINITIONS})` really needs to be invoked. I filed a GitHub issue upstream to figure it out: llvm/llvm-project#125779 In our case, it only has the effect of duplicating options and therefore shouldn't be called. -- 9bebb5a25c5cd1b80bcdec931e990ca81d0c4e41 by Chris Bate <[email protected]>: [cmake] Retire some build system logic that is no longer necessary We used to try to do the "right" thing and make targets carry compilation definitions, but we instead use CMake directory level definitions now. Also, remove the `MLIR_TRT_USE_LINKER` flag since we just use `LLVM_USE_LINKER`. -- 88d92a49c52dac9f4d044f09b24e0c750e5cd064 by Zixin Huang <[email protected]>: [python/CAPI] Add Python API to get device name like cuda:0 1. added mtrtDeviceGetIndex into Runtime.cpp Runtime.h. 2. added Device.get_name() into RuntimePyBind.cpp The reason to only get index but not string in Runtime.cpp because passing in/out string ptrs can be complicated (which may require allocating the string buffer and estimating its size beforehand). Test is added: ``` devices = client.get_devices() print(devices[0].get_name()) ``` -- 7f106e7d501a7e0e93a54c4ec43afa0b4ad2d054 by Yuan Yao <[email protected]>: [executor][python]: Fix runtime API for creating float/integer scalars. Previously, when using the `RuntimeClient.create_scalar` API, float numbers were erroneously cast to an integer. Also an `i32` scalar was always created as an `i64` scalar. This commit fixes these bugs. 
-- 44ccb0370f3435b3f3366dfc534e7d17a0f2dc4e by Samurdhi Karunaratne <[email protected]>:

[tensorrt/lib/Utils] Add splat dense attribute support for TRT plugins

Dense attributes were already supported, but splat dense attributes were not. When a splat attribute is detected, some space is allocated inside `PluginParams` so that the PluginField sees a buffer correctly filled with the splat value. Also fixes a bug in tensorrt/test/lib/Target/PluginUtils.h when printing integer plugin fields.

Signed-off-by: Samurdhi Karunaratne <[email protected]>

-- e6c291638dbe993360f0f10b4e3649862ac9333e by Chris Bate <[email protected]>:

[cmake] Extensive updates to dependency management

Makes a big update to how dependencies are declared in the CMake code.

Previously, we used a set of ad-hoc functions to wrap CPMAddPackage. Each dependency had its own function, `add_[depname]`, defined in `Depependencies.cmake`. This was unwieldy and disorganized. Furthermore, other CMake dependencies like Stablehlo also need to declare their own dependencies in some manner, and they do so by any number of means; the top-level project typically has no control over that short of manually patching the upstream project's code.

To clean this all up into one coherent interface, we can declare a CMake [Dependency Provider](https://cmake.org/cmake/help/latest/command/cmake_language.html#dependency-providers), available in CMake 3.24+. The dependency provider is a top-level CMake script which can only be declared by the top-level project. It allows us to intercept any and every `find_package` or `FetchContent_MakeAvailable` call in the whole build, including those made by dependencies. Using this mechanism, in our core CMakeLists.txt and in all the dependencies' CMakeLists, we just need to use `find_package(<name> ...)` to declare dependencies. The Dependency Provider then intercepts those `find_package` calls and satisfies the dependency however we like.
To clean things up further, the default provider (build_tools/cmake/DependencyProvider.cmake) now *declares* dependencies at the top of the file (rather than forcing them to be downloaded immediately), and they are only actually downloaded and incorporated into the build when the `find_package` call is made. In our case, `find_package` will cause `CPMAddPackage` to be invoked. `nv_register_package` is the declaration function; it accepts the `CPMAddPackage` arguments as well as a special "POST_ADD_HOOK" field that specifies an inline script to be executed. The arguments passed to `nv_register_package` are simply saved and forwarded to `CPMAddPackage` (less POST_ADD_HOOK) when `find_package` is invoked. Finally, after `CPMAddPackage` returns, the `POST_ADD_HOOK` is executed.

-- c2fcfa3f2db54c4ab0737f4df504776b4b07f19c by Christopher Bate <[email protected]>:

[cmake] Fix library search directory for TensorRT dialect LIT tests

Fixes the search directory for dynamic libraries, resolving a build failure when BUILD_SHARED_LIBS=ON.

-- 9fe64b7d5075d615c520232cf79f60982d382ec1 by Christopher Bate <[email protected]>:

[tensorrt] NFC: Factor out 'isTensorRTInt8Type' to fix circular build issue

Recently, the addition of the new plugin support for TRT 10.8 caused a circular dependency between the TensorRT Dialect library and the plugin support library. This only became apparent when building with BUILD_SHARED_LIBS=ON. This change factors out a "base" library for the TensorRT dialect that resolves the circular dependency issue.

-- 638abb11a731d48599ac2af349673e8324237931 by Sagar Shelke <[email protected]>:

[compiler/lib/Conversion] Add support for `stablehlo.custom_call` conversion

This MR adds a converter from the `stablehlo.custom_call` op to TensorRT. Currently the only valid conversion is when the call target is `mhlo.topk`; more conversions can be added later. MLIR tests are added.
-- 669d488b6b13521df2b54246f4fcd301ab6a8ace by Christopher Bate <[email protected]>:

NFC: Fix incorrect header in NvInferPluginUtils

Point the include to TensorRTBase.h; otherwise we may be including generated headers on which NvInferPluginUtils does not have a dependency (and which might not have been generated yet in the build).

-- 340a8a5e0888a8c881d469758f4c871f7233e976 by Christopher Bate <[email protected]>:

[executor] Update 'allocs-to-globals' pass to use DataLayout for type size calculation

Fixes an issue where 'allocs-to-globals' can crash if the memref element type does not have a canonical bitwidth, e.g. `complex<...>`. Use the module's data layout to get the type byte size instead.

-- 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 by Chris Bate <[email protected]>:

[compiler] Cleanup pipeline declarations in Plan Dialect Transforms

This change removes unused declarations from the Plan Dialect 'Passes.h' header file and removes unused pipeline registrations. We consolidate the three bufferization stages (bufferization, optimizations, deallocation) into a single pipeline declaration that is used by the different top-level pipelines (e.g. linalt-to-executable, stablehlo-to-executable, and so on).

-- 464616b4291de0dc80a6b6ce40e7ecd8a06c99fd by Christopher Bate <[email protected]>:

[compiler] NFC: add some additional 'plan-bufferize-pipeline' regression tests

Adds some additional regression tests for the end-to-end bufferization and deallocation pipeline. This set of tests captures cases where we are producing sub-optimal code related to host constants and shape values.
-- 283bf821e4bc6f8a6b5a4f17509d87fcf72a0023 by Christopher Bate <[email protected]>:

NFC: fix missing FileCheck command in test added in 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2

-- bf050a70cdc1c714b5f1d782a043f889abf3272a by Christopher Bate <[email protected]>:

[compiler] Use bufferization analysis in 'plan-alloc-tensors'

This change reworks the parts of the 'plan-alloc-tensors' pass that tried to establish DPS connectivity in loop regions and at function boundaries. We now use existing routines in the bufferization analysis to support the transformation, which yields better results and simpler code.

Additionally, this fixes a bug where the transform could try to modify functions without fixing the callers if inlining was not performed. We now explicitly check that the functions we are modifying do not have any callers and that we do not traverse nested symbol tables. A future change can update the pass to allow modifying functions with callers.

-- 9f03c958cdee998d4c6ea059aa9125e395b68688 by Christopher Bate <[email protected]>:

[compiler] Add new module bufferization pass

This change forks the upstream 'one-shot-module-bufferization' logic into a 'module-bufferization' pass under the 'mlir-tensorrt/compiler' project. The upstream 'OneShotModuleBufferization' transformation is an extension of the core one-shot-bufferization infrastructure that includes some additional analysis state which is attached to functions. Functions are analyzed to identify which arguments are read/written and which arguments are equivalent to which result values.

We require forking the upstream pass because the upstream transformation does not currently support functions and call-like operations other than `func.func` and `func.call` (support for other operations using the appropriate interfaces has had trouble landing upstream).
In addition, the upstream pass does not take into account the potential nesting of modules/symbol-tables, where nested modules may require different bufferization options than the parent module and functions may be called from a parent module (if using call-like operations other than `func.call`). To allow dynamic querying of bufferization options from modules, we create a new interface, `BufferizationScopeOpInterface`, that lets us attach methods for producing OneShotBufferizationOptions to our custom module-like operations. This new interface, together with the forked pass logic, allows us to jointly bufferize multiple nested modules, whereas previously the OneShotModuleBufferization limitations forced a phase-ordering constraint: nested modules had to be bufferized prior to outer modules.

-- 6fbed19b52415fc21ccc13807b1112886f405ef8 by Christopher Bate <[email protected]>:

[executor] Fix Executor flatbuffer C++ generation cmake function

This change ensures that if the flatbuffer dependency version changes, the targets which generate C++ from the flatbuffer schema file also go out of date, forcing re-generation of the C++ flatbuffer files.

-- cd8c672be49aaafc5d480d24f900a413f1effd59 by Christopher Bate <[email protected]>:

[executor] Bump Flatbuffers to v25.2.10

Bumps the Flatbuffers version and migrates it to the new dependency handling mechanism.

-- 99f60bfe28d1247c060cc0be232ec73546785bb9 by Christopher Bate <[email protected]>:

[executor] Set max Flatbuffer verification size to the 64bit limit

Since we use Flatbuffers' 64-bit features, we need to explicitly enable verification to allow 64-bit offsets.

-- 523da05c5720f395ce9c5815d2a44e82f51e924d by Christopher Bate <[email protected]>:

[compiler] Change 'allow-return-allocs-from-loops' to 'true' by default

This allows us to handle loops that don't bufferize in-place during 'plan-module-bufferization'. This was the original behavior of `plan-bufferize`.
Certain integration tests for generation models currently won't work without it.

-- a31e690e2a863aaf52528b8de1c66a2f1b1ea081 by Christopher Bate <[email protected]>:

[tensorrt] Add a size limit to constant folding in 'tensorrt-transpose-elimination'

Constant folding of transpose operations in the `tensorrt-transpose-elimination` pass can be extremely costly for large weights (e.g. an embedding matrix). We need to impose an upper bound on how much constant folding we allow using naive transposition of DenseElementsAttrs.

-- 3b7017e2356022db4ed21369422b2ead639b930b by Christopher Bate <[email protected]>:

[tensorrt] NFC: add a statistic to the 'tensorrt-broadcast-elimination' pass

Add a simple statistic (only set when built with assertions enabled) that counts the number of eliminated broadcast operations.

-- 6feeec4d85ab976ed16676795aa426a87f4a7f38 by Christopher Bate <[email protected]>:

[tensorrt] Place an upper bound on constant folding computation

Places an upper bound of `1<<17` on the number of elements allowed for constant folding of `DenseElementsAttrs` in the `fold` method of TensorRT operations.

-- e8690a68101085e606746361db41533597ef3c1d by Sagar Shelke <[email protected]>:

[Conversion/StablehloToTensorRT] Ease restrictions of `stablehlo.reduce` conversion to TensorRT

Previously, `stablehlo.reduce` conversion to TensorRT had the following restrictions:
- If the reduction axes were not contiguous, conversion was unsupported.
- In the case of multiple reduction axes, reduction happened through a reshape followed by a reduction on a single axis.

It turns out that this is too strict. TensorRT supports reduction on multiple axes as well as on non-contiguous axes. Stablehlo to TensorRT conversion tests are updated and a tensorrt test is added. The Stablehlo to TensorRT engine conversion path is tested for reduction-specific cases with both TensorRT 8.6 and 10.7.
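The element-count bound mentioned in the constant-folding commits above can be sketched as a simple guard. This is a Python illustration of the idea only, not the C++ `fold` implementation. The value `1 << 17` comes from the commit message; the function names and the choice to treat the bound as inclusive are assumptions.

```python
from math import prod
from typing import Sequence

# 1 << 17 is taken from the commit message; treating it as an
# inclusive bound is an assumption made for this sketch.
MAX_FOLD_ELEMENTS = 1 << 17


def num_elements(shape: Sequence[int]) -> int:
    """Number of elements in a statically shaped tensor (1 for a scalar)."""
    return prod(shape)


def should_constant_fold(shape: Sequence[int],
                         limit: int = MAX_FOLD_ELEMENTS) -> bool:
    """Return True if a constant of this shape is small enough to fold."""
    return num_elements(shape) <= limit


print(should_constant_fold((128, 1024)))    # exactly 1 << 17 elements
print(should_constant_fold((50000, 4096)))  # embedding-sized weight
```

With such a guard, a small constant is folded eagerly while an embedding-sized weight is left for a cheaper strategy.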
-- af0c30a2d3943a8f8b5a80be49a7c0fc945ae07d by Christopher Bate <[email protected]>:

[compiler] Fix failing `test_tensorrt_add.py` after importing OSS changes

Fixes a test that was failing after importing GitHub MLIR-TensorRT changes.

-- 5599299c5ce64aa0cafad411d90d7fe3bbda7f26 by Christopher Bate <[email protected]>:

NFC: [executor] cleanup TensorRTModule output allocator logic

Updates the TensorRT output allocator logic to remove "reallocation" features which were not usable. As we currently use it, the output allocator lazily allocates new results for each invocation. To improve allocator performance in the future, we should:
a) Pre-allocate when we know the static dimensions of the output.
b) Represent the allocator in the IR so that it can be hoisted out of loops, allowing the outputs of TRT engines whose results don't escape the function to be re-used without an additional allocation.

Co-authored-by: Sagar Shelke <[email protected]>
Co-authored-by: Zixin Huang <[email protected]>
Co-authored-by: Yuan Yao <[email protected]>
Co-authored-by: Samurdhi Karunaratne <[email protected]>
GitOrigin-RevId: 4c4fdae1e9ae089df09e61abf06be28adc7217ca
Changes are listed oldest-to-newest: 0751887ed36b20cad7d47568ecf35745368acfa9 by Christopher Bate <[email protected]>: [cmake] Fix macro redefinition issues When building with with certain LLVM distributions, the `HandleLLVMOptions` CMake script can set `LLVM_DEFINITIONS` to a space-separated list of `-Ditem=value` definitions. When passed to `add_definitions`, this does not have the desired effect. It's actually not clear where `add_definitions(${LLVM_DEFINITIONS})` really needs to be invoked. I filed a GitHub issue upstream to figure it out: llvm/llvm-project#125779 In our case, it only has the effect of duplicating options and therefore shouldn't be called. -- 9bebb5a25c5cd1b80bcdec931e990ca81d0c4e41 by Chris Bate <[email protected]>: [cmake] Retire some build system logic that is no longer necessary We used to try to do the "right" thing and make targets carry compilation definitions, but we instead use CMake directory level definitions now. Also, remove the `MLIR_TRT_USE_LINKER` flag since we just use `LLVM_USE_LINKER`. -- 88d92a49c52dac9f4d044f09b24e0c750e5cd064 by Zixin Huang <[email protected]>: [python/CAPI] Add Python API to get device name like cuda:0 1. added mtrtDeviceGetIndex into Runtime.cpp Runtime.h. 2. added Device.get_name() into RuntimePyBind.cpp The reason to only get index but not string in Runtime.cpp because passing in/out string ptrs can be complicated (which may require allocating the string buffer and estimating its size beforehand). Test is added: ``` devices = client.get_devices() print(devices[0].get_name()) ``` -- 7f106e7d501a7e0e93a54c4ec43afa0b4ad2d054 by Yuan Yao <[email protected]>: [executor][python]: Fix runtime API for creating float/integer scalars. Previously, when using the `RuntimeClient.create_scalar` API, float numbers were erroneously cast to an integer. Also an `i32` scalar was always created as an `i64` scalar. This commit fixes these bugs. 
-- 44ccb0370f3435b3f3366dfc534e7d17a0f2dc4e by Samurdhi Karunaratne <[email protected]>: [tensorrt/lib/Utils] Add splat dense attribute support for TRT plugins Although dense attributes were supported, splat dense attributes were not supported so far. When a splat attr is detected some space is allocated inside `PluginParams` so that the PluginField sees a buffer correctly filled with the splat value. Also fixes a bug in tensorrt/test/lib/Target/PluginUtils.h when printing integer plugin fields. Signed-off-by: Samurdhi Karunaratne <[email protected]> -- e6c291638dbe993360f0f10b4e3649862ac9333e by Chris Bate <[email protected]>: [cmake] Extensive updates to dependency management Makes a big update to how dependencies are declared in the CMake code: Previously, we used a set of ad-hoc functions to wrap CPMAddPackage. Each dependency had its own function defined `add_[depname]` in `Depependencies.cmake`. This was unwieldy and disorganized. Furthermore, other CMake dependencies like Stablehlo also need to manage to declare their dependencies in some manner, and they do so via any number of means, and the top-level project typically doesn't have any control over that besides manually patching the upstream project's code. To clean this all up into one coherent interface, we can declare a CMake [Dependency Provider](https://cmake.org/cmake/help/latest/command/cmake_language.html#dependency-providers), available in CMake 3.24+. The dependency provider is a top-level CMake script which can only be declared by the top-level project. It allows us intercept any and every `find_package` or `FetchContent_MakeAvailable` call in the whole build, including for dependencies. Using this mechanism, in our core CMakeLists.txt, and in all the dependencies' CMakeLists, we just need to use `find_package(<name> ...)` to declare dependencies. The Dependency Provider then intercepts those `find_package` calls and satisfies the depdency however we like. 
To clean things up further, the default provider (build_tools/cmake/DependencyProvider.cmake) now *declares* dependencies at the top of the file (rather than forcing them to be downloaded immediately), and they are only actually downloaded and incorporated into the build when the `find_package` call is made. In our case, `find_package` will cause `CPMAddPackage` to be invoked. The `nv_register_package` is the declaration function, and it accepts to CPMAddPackage arguments as well as a special "POST_ADD_HOOK" field to indicate inline a script to be executed. The contents of `nv_register_package` are simply saved and forwarded to `CPMAddPackage` (less POST_ADD_HOOK) when `find_package` is invoked. Finally, after `CPMAddPackage` returns, the `POST_ADD_HOOK` is executed. -- c2fcfa3f2db54c4ab0737f4df504776b4b07f19c by Christopher Bate <[email protected]>: [cmake] Fix library search directory for TensorRT dialect LIT tests Fixes the search directory for dynamic libraries, resolving build failure when BUILD_SHARED_LIBS=ON. -- 9fe64b7d5075d615c520232cf79f60982d382ec1 by Christopher Bate <[email protected]>: [tensorrt] NFC: Factor out 'isTensorRTInt8Type' to fix circular build issue Recently, the addition of the new plugin support for TRT 10.8 caused a circular dependency between the TensorRT Dialect library and the plugin support library. This only became apparent when building with BUILD_SHARED_LIBS=ON. This change factors out a "base" library for the TensorRT dialect that resolves the circular dependency issue. -- 638abb11a731d48599ac2af349673e8324237931 by Sagar Shelke <[email protected]>: [compiler/lib/Conversion] Add support for `stablehlo.custom_call` conversion This MR adds a converter for `stablehlo.custom_call` op conversion to TensorRT. Currently only valid conversion is when call target is `mhlo.topk` and more possible conversions can be added later. MLIR tests are added. 
-- 669d488b6b13521df2b54246f4fcd301ab6a8ace by Christopher Bate <[email protected]>: NFC: Fix incorrect header in NvInferPluginUtils Point the include to TensorRTBase.h; otherwise we may be including generated headers on which NvInferPluginUtils does not have a dependency (and might not be generated yet in the build). -- 340a8a5e0888a8c881d469758f4c871f7233e976 by Christopher Bate <[email protected]>: [executor] Update 'allocs-to-globals' pass to use DataLayout for type size calculation Fixes an issue where 'allocs-to-globals' can crash if the the memref element type does not have a canonical bitwidth, e.g. `complex<...>`. Use the module's data layout to get the type byte size instead. -- 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 by Chris Bate <[email protected]>: [compiler] Cleanup pipeline declarations in Plan Dialect Transforms This change removes unused declarations from the Plan Dialect 'Passes.h' header file and removes unused pipeline registrations. We consolidate the three bufferization stages (bufferization, optimizations, deallocation) into a single pipeline declaration that is used by the different top-level pipelines (e.g. linalt-to-executable, stablehlo-to-executable, and so on). -- 464616b4291de0dc80a6b6ce40e7ecd8a06c99fd by Christopher Bate <[email protected]>: [compiler] NFC: add some additional 'plan-bufferize-pipeline' regression tests Adds some additional regression tests for the end-to-end bufferization and deallocation pipeline. This set of tests capture cases where we are producing sub-optimal code related to host constants and shape values. 
-- 283bf821e4bc6f8a6b5a4f17509d87fcf72a0023 by Christopher Bate <[email protected]>: NFC: fix missing FileCheck command in test added in 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 -- bf050a70cdc1c714b5f1d782a043f889abf3272a by Christopher Bate <[email protected]>: [compiler] Use bufferization analysis in 'plan-alloc-tensors' This change reworks the parts of the 'plan-alloc-tensors' pass that tried to establish DPS connectivity in loop regions and at function boundaries. We now use existing routines in the bufferization analysis to support the transformation, which results in better results and more simplified code. Additionally, a bug is closed where the transform could potentially try modifying functions without fixing the callers if inlining was not performed. We now explicitly check to make sure the functions we are modifying do not have any callers and that we do not traverse nested symbol tables. A future change can update the pass to allow modifying functions with callers. -- 9f03c958cdee998d4c6ea059aa9125e395b68688 by Christopher Bate <[email protected]>: [compiler] Add new module bufferization pass This change forks the upstream 'one-shot-module-bufferization' logic into a 'module-bufferization' pass under the 'mlir-tensorrt/compiler' project. The upstream 'OneShotModuleBufferization' transformation is an extension of the core one-shot-bufferization infrastructure that includes some additional analysis state which is attached to functions. Functions are analyzed to identify which arguments are read/written and additionally which arguments are equivalent to which result values. We require forking the upstream pass because the upstream transformation does not currently allow for supporting functions and call-like operations outside of `func.func` and `func.call` (support for other operations using the appropriate interfaces has had trouble landing upstream). 
In addition, the upstream pass does not take into account the potential nesting of modules/symbol-tables where nested modules may require different bufferization options than the parent module and functions may be called from a parent module (if using call-like operations other than `func.call`). To allow dynamic querying of bufferization options from modules, we create a new interface, `BufferizationScopeOpInterface` that lets us attach methods for producing OneShotBufferizationOptions to our custom module-like operations. This new interface, together with the forked pass logic, allows us to jointly bufferize multiple nested modules, whereas previously the OneShotModuleBufferization limitations forced a phase-ordering constraint -- nested modules had to be bufferized prior to outer modules. -- 6fbed19b52415fc21ccc13807b1112886f405ef8 by Christopher Bate <[email protected]>: [executor] Fix Executor flatbuffer C++ generation cmake function This change simply ensures that if the flatbuffer dependency version changes that the targets which generate C++ from the flatbuffer schema file will also go out-of-date, forcing re-generation of the C++ flatbuffer files. -- cd8c672be49aaafc5d480d24f900a413f1effd59 by Christopher Bate <[email protected]>: [executor] Bump Flatbuffers to v25.2.10 Bumps the Flatbuffers version and migrates it to the new dependency handling mechanism. -- 99f60bfe28d1247c060cc0be232ec73546785bb9 by Christopher Bate <[email protected]>: [executor] Set max Flatbuffer verification size to the 64bit limit Since we use Flatbuffer's 64bit features, we need to explicitly enable verification to allow 64bit offsets. -- 523da05c5720f395ce9c5815d2a44e82f51e924d by Christopher Bate <[email protected]>: [compiler] Change 'allow-return-allocs-from-loops' to 'true' by default This allows us to handle loops that don't bufferize in-place during 'plan-module-bufferization'. This was the original behavior of `plan-bufferize`. 
Certain integration tests for generation models won't work without it currently. -- a31e690e2a863aaf52528b8de1c66a2f1b1ea081 by Christopher Bate <[email protected]>: [tensorrt] Add a size limit to constant folding in 'tensorrt-transpose-elimination' Constant folding of transpose operations in the `tensorrt-transpose-elimination` dialect can be extremely costly for large weights (e.g. embedding matrix). We need to impose an upper bound on what constant folding we will allow using naive transposition of DenseElementsAttrs. -- 3b7017e2356022db4ed21369422b2ead639b930b by Christopher Bate <[email protected]>: [tensorrt] NFC: add a statistic to the 'tensorrt-broadcast-elimination' pass Add a simple statistic (only set when built with assertions enabled) that counts the number of eliminated broadcast operations. -- 6feeec4d85ab976ed16676795aa426a87f4a7f38 by Christopher Bate <[email protected]>: [tensorrt] Place an upper bound on constant folding computation Places an upper bound of `1<<17` on the number of elements allowed for constant folding of `DenseElementsAttrs` in the `fold` method of TensorRT operations. -- e8690a68101085e606746361db41533597ef3c1d by Sagar Shelke <[email protected]>: [Conversion/StablehloToTensorRT] Ease restrictions of `stablehlo.reduce` conversion to TensorRT Previously, `stablehlo.reduce` conversion to TensorRT had the following restrictions. - If reduction axes are not contiguous, conversion was unsupported. - In case of multiple reduction axes, reduction happened through reshape and reduction on single axis. It turns out that this is too strict of a restriction. TensorRT supports reduction on multiple as well as non-contiguous axis. Stablehlo to TensorRT conversion tests are updated and a tensorrt test is added. Stablehlo to TensorRT engine conversion path is tested for reduction specific cases with both TensorRT 8.6 and 10.7. 
-- af0c30a2d3943a8f8b5a80be49a7c0fc945ae07d by Christopher Bate <[email protected]>: [compiler] Fix failing `test_tensorrt_add.py` after importing OSS changes Fixes a test that was not failing after importing GitHub MLIR-TensorRT changes. -- 5599299c5ce64aa0cafad411d90d7fe3bbda7f26 by Christopher Bate <[email protected]>: NFC: [executor] cleanup TensorRTModule output allocator logic Updates TensorRT output allocator logic to remove "reallocation" features which were not usable. As we currently use it, the output allocator lazily allocates new results for each invocation. In order to improve allocator performance in the future, we should: a) Pre-allocate when we know the static dimensions of the output b) Represent the allocator in the IR so that it can be hoisted out of loops, allowing the outputs of TRT engines whose results don't escape the function to be re-used without an additional allocation. Co-authored-by: Sagar Shelke <[email protected]> Co-authored-by: Zixin Huang <[email protected]> Co-authored-by: Yuan Yao <[email protected]> Co-authored-by: Samurdhi Karunaratne <[email protected]> GitOrigin-RevId: 4c4fdae1e9ae089df09e61abf06be28adc7217ca
Changes are listed oldest-to-newest: 0751887ed36b20cad7d47568ecf35745368acfa9 by Christopher Bate <[email protected]>: [cmake] Fix macro redefinition issues When building with with certain LLVM distributions, the `HandleLLVMOptions` CMake script can set `LLVM_DEFINITIONS` to a space-separated list of `-Ditem=value` definitions. When passed to `add_definitions`, this does not have the desired effect. It's actually not clear where `add_definitions(${LLVM_DEFINITIONS})` really needs to be invoked. I filed a GitHub issue upstream to figure it out: llvm/llvm-project#125779 In our case, it only has the effect of duplicating options and therefore shouldn't be called. -- 9bebb5a25c5cd1b80bcdec931e990ca81d0c4e41 by Chris Bate <[email protected]>: [cmake] Retire some build system logic that is no longer necessary We used to try to do the "right" thing and make targets carry compilation definitions, but we instead use CMake directory level definitions now. Also, remove the `MLIR_TRT_USE_LINKER` flag since we just use `LLVM_USE_LINKER`. -- 88d92a49c52dac9f4d044f09b24e0c750e5cd064 by Zixin Huang <[email protected]>: [python/CAPI] Add Python API to get device name like cuda:0 1. added mtrtDeviceGetIndex into Runtime.cpp Runtime.h. 2. added Device.get_name() into RuntimePyBind.cpp The reason to only get index but not string in Runtime.cpp because passing in/out string ptrs can be complicated (which may require allocating the string buffer and estimating its size beforehand). Test is added: ``` devices = client.get_devices() print(devices[0].get_name()) ``` -- 7f106e7d501a7e0e93a54c4ec43afa0b4ad2d054 by Yuan Yao <[email protected]>: [executor][python]: Fix runtime API for creating float/integer scalars. Previously, when using the `RuntimeClient.create_scalar` API, float numbers were erroneously cast to an integer. Also an `i32` scalar was always created as an `i64` scalar. This commit fixes these bugs. 
-- 44ccb0370f3435b3f3366dfc534e7d17a0f2dc4e by Samurdhi Karunaratne <[email protected]>: [tensorrt/lib/Utils] Add splat dense attribute support for TRT plugins Although dense attributes were supported, splat dense attributes were not supported so far. When a splat attr is detected some space is allocated inside `PluginParams` so that the PluginField sees a buffer correctly filled with the splat value. Also fixes a bug in tensorrt/test/lib/Target/PluginUtils.h when printing integer plugin fields. Signed-off-by: Samurdhi Karunaratne <[email protected]> -- e6c291638dbe993360f0f10b4e3649862ac9333e by Chris Bate <[email protected]>: [cmake] Extensive updates to dependency management Makes a big update to how dependencies are declared in the CMake code: Previously, we used a set of ad-hoc functions to wrap CPMAddPackage. Each dependency had its own function defined `add_[depname]` in `Depependencies.cmake`. This was unwieldy and disorganized. Furthermore, other CMake dependencies like Stablehlo also need to manage to declare their dependencies in some manner, and they do so via any number of means, and the top-level project typically doesn't have any control over that besides manually patching the upstream project's code. To clean this all up into one coherent interface, we can declare a CMake [Dependency Provider](https://cmake.org/cmake/help/latest/command/cmake_language.html#dependency-providers), available in CMake 3.24+. The dependency provider is a top-level CMake script which can only be declared by the top-level project. It allows us intercept any and every `find_package` or `FetchContent_MakeAvailable` call in the whole build, including for dependencies. Using this mechanism, in our core CMakeLists.txt, and in all the dependencies' CMakeLists, we just need to use `find_package(<name> ...)` to declare dependencies. The Dependency Provider then intercepts those `find_package` calls and satisfies the depdency however we like. 
To clean things up further, the default provider (build_tools/cmake/DependencyProvider.cmake) now *declares* dependencies at the top of the file (rather than forcing them to be downloaded immediately), and they are only actually downloaded and incorporated into the build when the `find_package` call is made. In our case, `find_package` will cause `CPMAddPackage` to be invoked. The `nv_register_package` is the declaration function, and it accepts to CPMAddPackage arguments as well as a special "POST_ADD_HOOK" field to indicate inline a script to be executed. The contents of `nv_register_package` are simply saved and forwarded to `CPMAddPackage` (less POST_ADD_HOOK) when `find_package` is invoked. Finally, after `CPMAddPackage` returns, the `POST_ADD_HOOK` is executed. -- c2fcfa3f2db54c4ab0737f4df504776b4b07f19c by Christopher Bate <[email protected]>: [cmake] Fix library search directory for TensorRT dialect LIT tests Fixes the search directory for dynamic libraries, resolving build failure when BUILD_SHARED_LIBS=ON. -- 9fe64b7d5075d615c520232cf79f60982d382ec1 by Christopher Bate <[email protected]>: [tensorrt] NFC: Factor out 'isTensorRTInt8Type' to fix circular build issue Recently, the addition of the new plugin support for TRT 10.8 caused a circular dependency between the TensorRT Dialect library and the plugin support library. This only became apparent when building with BUILD_SHARED_LIBS=ON. This change factors out a "base" library for the TensorRT dialect that resolves the circular dependency issue. -- 638abb11a731d48599ac2af349673e8324237931 by Sagar Shelke <[email protected]>: [compiler/lib/Conversion] Add support for `stablehlo.custom_call` conversion This MR adds a converter for `stablehlo.custom_call` op conversion to TensorRT. Currently only valid conversion is when call target is `mhlo.topk` and more possible conversions can be added later. MLIR tests are added. 
-- 669d488b6b13521df2b54246f4fcd301ab6a8ace by Christopher Bate <[email protected]>: NFC: Fix incorrect header in NvInferPluginUtils Point the include to TensorRTBase.h; otherwise we may be including generated headers on which NvInferPluginUtils does not have a dependency (and might not be generated yet in the build). -- 340a8a5e0888a8c881d469758f4c871f7233e976 by Christopher Bate <[email protected]>: [executor] Update 'allocs-to-globals' pass to use DataLayout for type size calculation Fixes an issue where 'allocs-to-globals' can crash if the memref element type does not have a canonical bitwidth, e.g. `complex<...>`. Use the module's data layout to get the type byte size instead. -- 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 by Chris Bate <[email protected]>: [compiler] Cleanup pipeline declarations in Plan Dialect Transforms This change removes unused declarations from the Plan Dialect 'Passes.h' header file and removes unused pipeline registrations. We consolidate the three bufferization stages (bufferization, optimizations, deallocation) into a single pipeline declaration that is used by the different top-level pipelines (e.g. linalt-to-executable, stablehlo-to-executable, and so on). -- 464616b4291de0dc80a6b6ce40e7ecd8a06c99fd by Christopher Bate <[email protected]>: [compiler] NFC: add some additional 'plan-bufferize-pipeline' regression tests Adds some additional regression tests for the end-to-end bufferization and deallocation pipeline. This set of tests captures cases where we are producing sub-optimal code related to host constants and shape values. 
-- 283bf821e4bc6f8a6b5a4f17509d87fcf72a0023 by Christopher Bate <[email protected]>: NFC: fix missing FileCheck command in test added in 6ca9b5da0d7bd672aaef7d2bc78a917d1ed159a2 -- bf050a70cdc1c714b5f1d782a043f889abf3272a by Christopher Bate <[email protected]>: [compiler] Use bufferization analysis in 'plan-alloc-tensors' This change reworks the parts of the 'plan-alloc-tensors' pass that tried to establish DPS connectivity in loop regions and at function boundaries. We now use existing routines in the bufferization analysis to support the transformation, which yields better results and simpler code. Additionally, a bug is closed where the transform could potentially try modifying functions without fixing the callers if inlining was not performed. We now explicitly check to make sure the functions we are modifying do not have any callers and that we do not traverse nested symbol tables. A future change can update the pass to allow modifying functions with callers. -- 9f03c958cdee998d4c6ea059aa9125e395b68688 by Christopher Bate <[email protected]>: [compiler] Add new module bufferization pass This change forks the upstream 'one-shot-module-bufferization' logic into a 'module-bufferization' pass under the 'mlir-tensorrt/compiler' project. The upstream 'OneShotModuleBufferization' transformation is an extension of the core one-shot-bufferization infrastructure that includes some additional analysis state which is attached to functions. Functions are analyzed to identify which arguments are read/written and additionally which arguments are equivalent to which result values. We require forking the upstream pass because the upstream transformation does not currently allow for supporting functions and call-like operations outside of `func.func` and `func.call` (support for other operations using the appropriate interfaces has had trouble landing upstream). 
In addition, the upstream pass does not take into account the potential nesting of modules/symbol-tables where nested modules may require different bufferization options than the parent module and functions may be called from a parent module (if using call-like operations other than `func.call`). To allow dynamic querying of bufferization options from modules, we create a new interface, `BufferizationScopeOpInterface` that lets us attach methods for producing OneShotBufferizationOptions to our custom module-like operations. This new interface, together with the forked pass logic, allows us to jointly bufferize multiple nested modules, whereas previously the OneShotModuleBufferization limitations forced a phase-ordering constraint -- nested modules had to be bufferized prior to outer modules. -- 6fbed19b52415fc21ccc13807b1112886f405ef8 by Christopher Bate <[email protected]>: [executor] Fix Executor flatbuffer C++ generation cmake function This change simply ensures that if the flatbuffer dependency version changes that the targets which generate C++ from the flatbuffer schema file will also go out-of-date, forcing re-generation of the C++ flatbuffer files. -- cd8c672be49aaafc5d480d24f900a413f1effd59 by Christopher Bate <[email protected]>: [executor] Bump Flatbuffers to v25.2.10 Bumps the Flatbuffers version and migrates it to the new dependency handling mechanism. -- 99f60bfe28d1247c060cc0be232ec73546785bb9 by Christopher Bate <[email protected]>: [executor] Set max Flatbuffer verification size to the 64bit limit Since we use Flatbuffer's 64bit features, we need to explicitly enable verification to allow 64bit offsets. -- 523da05c5720f395ce9c5815d2a44e82f51e924d by Christopher Bate <[email protected]>: [compiler] Change 'allow-return-allocs-from-loops' to 'true' by default This allows us to handle loops that don't bufferize in-place during 'plan-module-bufferization'. This was the original behavior of `plan-bufferize`. 
Certain integration tests for generation models won't work without it currently. -- a31e690e2a863aaf52528b8de1c66a2f1b1ea081 by Christopher Bate <[email protected]>: [tensorrt] Add a size limit to constant folding in 'tensorrt-transpose-elimination' Constant folding of transpose operations in the `tensorrt-transpose-elimination` dialect can be extremely costly for large weights (e.g. embedding matrix). We need to impose an upper bound on what constant folding we will allow using naive transposition of DenseElementsAttrs. -- 3b7017e2356022db4ed21369422b2ead639b930b by Christopher Bate <[email protected]>: [tensorrt] NFC: add a statistic to the 'tensorrt-broadcast-elimination' pass Add a simple statistic (only set when built with assertions enabled) that counts the number of eliminated broadcast operations. -- 6feeec4d85ab976ed16676795aa426a87f4a7f38 by Christopher Bate <[email protected]>: [tensorrt] Place an upper bound on constant folding computation Places an upper bound of `1<<17` on the number of elements allowed for constant folding of `DenseElementsAttrs` in the `fold` method of TensorRT operations. -- e8690a68101085e606746361db41533597ef3c1d by Sagar Shelke <[email protected]>: [Conversion/StablehloToTensorRT] Ease restrictions of `stablehlo.reduce` conversion to TensorRT Previously, `stablehlo.reduce` conversion to TensorRT had the following restrictions. - If reduction axes are not contiguous, conversion was unsupported. - In case of multiple reduction axes, reduction happened through reshape and reduction on single axis. It turns out that this is too strict a restriction. TensorRT supports reduction on multiple as well as non-contiguous axes. Stablehlo to TensorRT conversion tests are updated and a tensorrt test is added. Stablehlo to TensorRT engine conversion path is tested for reduction specific cases with both TensorRT 8.6 and 10.7. 
-- af0c30a2d3943a8f8b5a80be49a7c0fc945ae07d by Christopher Bate <[email protected]>: [compiler] Fix failing `test_tensorrt_add.py` after importing OSS changes Fixes a test that was not failing after importing GitHub MLIR-TensorRT changes. -- 5599299c5ce64aa0cafad411d90d7fe3bbda7f26 by Christopher Bate <[email protected]>: NFC: [executor] cleanup TensorRTModule output allocator logic Updates TensorRT output allocator logic to remove "reallocation" features which were not usable. As we currently use it, the output allocator lazily allocates new results for each invocation. In order to improve allocator performance in the future, we should: a) Pre-allocate when we know the static dimensions of the output b) Represent the allocator in the IR so that it can be hoisted out of loops, allowing the outputs of TRT engines whose results don't escape the function to be re-used without an additional allocation. Co-authored-by: Sagar Shelke <[email protected]> Co-authored-by: Zixin Huang <[email protected]> Co-authored-by: Yuan Yao <[email protected]> Co-authored-by: Samurdhi Karunaratne <[email protected]> GitOrigin-RevId: 4c4fdae1e9ae089df09e61abf06be28adc7217ca
TLDR:
Projects use `find_package(LLVM) ; include(HandleLLVMOptions)` to set directory-scoped CMake compilation options based on CMake variables provided by `LLVMConfig.cmake`. This includes compiler options, linker options, and compiler definitions. `HandleLLVMOptions` will then also set `LLVM_DEFINITIONS` to the set of definitions in the top-level CMake scope.
Projects also call `add_definitions(${LLVM_DEFINITIONS})`, but they are unaware that a) `LLVM_DEFINITIONS` is a space-separated list and needs to be pre-processed, b) the call has different effects depending on its ordering with `HandleLLVMOptions`, and c) it can cause duplication of compiler definitions.
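For reference, the pre-processing mentioned in point a) can be sketched as follows. This is a minimal sketch, not a recommendation from the LLVM maintainers: it follows the sequence quoted in the Stablehlo change linked from this issue, and the project name `llvm_definitions_demo` is a placeholder. It assumes an installed LLVM whose `LLVMConfig.cmake` provides `LLVM_DEFINITIONS` and `LLVM_CMAKE_DIR`; whether `add_definitions` should come before or after `include(HandleLLVMOptions)`, or be called at all, is exactly the open question here.

```cmake
cmake_minimum_required(VERSION 3.20)
# Placeholder project name, used only for this illustration.
project(llvm_definitions_demo LANGUAGES C CXX)

# Locate an installed/pre-packaged LLVM through its CMake config file.
find_package(LLVM REQUIRED CONFIG)

# LLVM_DEFINITIONS may be a single space-separated string, e.g. "-DA=1 -DB".
# Split it into a proper CMake list before passing it to add_definitions().
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND ${LLVM_DEFINITIONS})
add_definitions(${LLVM_DEFINITIONS_LIST})

# Make LLVM's CMake modules findable, then apply its directory-scoped
# compiler options, linker options, and definitions.
list(APPEND CMAKE_MODULE_PATH "${LLVM_CMAKE_DIR}")
include(HandleLLVMOptions)
```

Because both calls act at directory scope, running both can still add the same definition twice (point c), so a project may prefer to keep only one of them.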