From 29922d77e62ee28df7c90f06cddaf87eb46ca4b4 Mon Sep 17 00:00:00 2001 From: Jake Awe Date: Thu, 13 Feb 2025 09:44:35 -0600 Subject: [PATCH 1/5] Update Changelog [skip ci] --- CHANGELOG.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 9d7a5e150b..ce70024f06 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,36 @@ +# kvikio 25.02.00 (13 Feb 2025) + +## 🚨 Breaking Changes + +- Continue to make KvikIO a shared library by moving code from hpp to cpp ([#581](https://github.com/rapidsai/kvikio/pull/581)) [@kingcrimsontianyu](https://github.com/kingcrimsontianyu) + +## 🐛 Bug Fixes + +- Set `POSITION_INDEPENDENT_CODE` on libcurl ([#625](https://github.com/rapidsai/kvikio/pull/625)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA) +- allow deselecting nvcomp wheels ([#589](https://github.com/rapidsai/kvikio/pull/589)) [@jameslamb](https://github.com/jameslamb) +- Fix the misplaced libcurl in CMake ([#587](https://github.com/rapidsai/kvikio/pull/587)) [@kingcrimsontianyu](https://github.com/kingcrimsontianyu) + +## 🚀 New Features + +- Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 ([#564](https://github.com/rapidsai/kvikio/pull/564)) [@bdice](https://github.com/bdice) + +## 🛠️ Improvements + +- Use `rapids-pip-retry` in CI jobs that might need retries ([#619](https://github.com/rapidsai/kvikio/pull/619)) [@gforsyth](https://github.com/gforsyth) +- Use std::string ([#617](https://github.com/rapidsai/kvikio/pull/617)) [@madsbk](https://github.com/madsbk) +- Revert CUDA 12.8 shared workflow branch changes ([#609](https://github.com/rapidsai/kvikio/pull/609)) [@vyasr](https://github.com/vyasr) +- Build and test with CUDA 12.8.0 ([#600](https://github.com/rapidsai/kvikio/pull/600)) [@bdice](https://github.com/bdice) +- Separate more interface and definition. Add comments on std::future. Mark noexcept to compat mode-related functions ([#588](https://github.com/rapidsai/kvikio/pull/588)) [@kingcrimsontianyu](https://github.com/kingcrimsontianyu) +- Use GCC 13 in CUDA 12 conda builds. 
([#586](https://github.com/rapidsai/kvikio/pull/586)) [@bdice](https://github.com/bdice) +- Zarr: pin to version 2.* ([#583](https://github.com/rapidsai/kvikio/pull/583)) [@madsbk](https://github.com/madsbk) +- Continue to make KvikIO a shared library by moving code from hpp to cpp ([#581](https://github.com/rapidsai/kvikio/pull/581)) [@kingcrimsontianyu](https://github.com/kingcrimsontianyu) +- Check if nightlies have succeeded recently enough ([#579](https://github.com/rapidsai/kvikio/pull/579)) [@vyasr](https://github.com/vyasr) +- Update cuda-python lower bounds to 12.6.2 / 11.8.5 ([#573](https://github.com/rapidsai/kvikio/pull/573)) [@bdice](https://github.com/bdice) +- Improve compatibility mode consistency ([#567](https://github.com/rapidsai/kvikio/pull/567)) [@kingcrimsontianyu](https://github.com/kingcrimsontianyu) +- Require approval to run CI on draft PRs ([#563](https://github.com/rapidsai/kvikio/pull/563)) [@bdice](https://github.com/bdice) +- Add breaking change workflow trigger ([#536](https://github.com/rapidsai/kvikio/pull/536)) [@AyodeAwe](https://github.com/AyodeAwe) +- Add Initial Java Support for GDS to KvikIO ([#396](https://github.com/rapidsai/kvikio/pull/396)) [@aslobodaNV](https://github.com/aslobodaNV) + # kvikio 24.12.00 (11 Dec 2024) ## 🚨 Breaking Changes From ecb81c2c3c17252fcd5a902af9789debb234426b Mon Sep 17 00:00:00 2001 From: Kyle Edwards Date: Fri, 14 Feb 2025 15:49:54 -0500 Subject: [PATCH 2/5] Consolidate more Conda solves in CI (#636) Issue: https://github.com/rapidsai/build-planning/issues/22 Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: https://github.com/rapidsai/kvikio/pull/636 --- ci/test_java.sh | 16 +++++++--------- dependencies.yaml | 1 + 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/ci/test_java.sh b/ci/test_java.sh index e473dd6915..05c4e3346d 100755 --- a/ci/test_java.sh +++ b/ci/test_java.sh @@ -1,15 +1,20 @@ #!/bin/bash -# Copyright (c) 2024, NVIDIA CORPORATION. +# Copyright (c) 2024-2025, NVIDIA CORPORATION. set -euo pipefail . /opt/conda/etc/profile.d/conda.sh +rapids-logger "Downloading artifacts from previous jobs" +CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp) + rapids-logger "Generate java testing dependencies" rapids-dependency-file-generator \ --output conda \ --file-key test_java \ - --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch)" | tee env.yaml + --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch)" \ + --prepend-channel "${CPP_CHANNEL}" \ + | tee env.yaml rapids-mamba-retry env create --yes -f env.yaml -n test @@ -18,15 +23,8 @@ set +u conda activate test set -u -rapids-logger "Downloading artifacts from previous jobs" -CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp) - rapids-print-env -rapids-mamba-retry install \ - --channel "${CPP_CHANNEL}" \ - libkvikio - rapids-logger "Check GPU usage" nvidia-smi diff --git a/dependencies.yaml b/dependencies.yaml index be30907dff..f855576305 100644 --- a/dependencies.yaml +++ b/dependencies.yaml @@ -104,6 +104,7 @@ files: - cuda_version - cuda - test_java + - depends_on_libkvikio channels: - rapidsai - rapidsai-nightly From 887679b3ae92c68334735ff1a97b49e4e5e7ac6b Mon Sep 17 00:00:00 2001 From: Tianyu Liu Date: Mon, 17 Feb 2025 12:27:32 -0500 Subject: [PATCH 3/5] Adjust the way of determining FileHandle's compatibility mode for sync and async I/O to improve code readability (#608) This PR improves the readability of compatibility mode handling. 
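
For orientation, this is roughly what call sites look like after the change — a minimal sketch, not code from this PR, using only the names introduced in the diff below (`FileHandle::get_compat_mode_manager()` and the new `CompatModeManager` queries):

```cpp
// Sketch only: inspecting a FileHandle's compatibility state after this refactor.
#include <kvikio/file_handle.hpp>

bool uses_cufile_for_sync_io(kvikio::FileHandle const& f)
{
  auto const& mgr = f.get_compat_mode_manager();

  // The originally requested mode (ON/OFF/AUTO) is now queryable directly.
  kvikio::CompatMode requested = mgr.compat_mode_requested();

  // Pre-computed answers instead of on-the-fly derivation:
  bool posix_sync  = mgr.is_compat_mode_preferred();            // true -> POSIX fallback for sync I/O
  bool posix_async = mgr.is_compat_mode_preferred_for_async();  // true -> POSIX fallback for async I/O

  (void)requested;
  (void)posix_async;
  return !posix_sync;  // cuFile handles synchronous I/O only when compat mode is not preferred
}
```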
The current way of determining FileHandle's compatibility mode is somewhat complicated and unintuitive. The data member `_compat_mode` accompanied by some utility functions more or less combines 3 different things into one: - The initially requested compat mode (`ON`/`OFF`/`AUTO`) - The capability of performing synchronous cuFile I/O (bool) - The capability of performing asynchronous cufile I/O (bool) The disadvantages include: - `FileHandle::is_compat_mode_preferred()` always derives the preferred compat mode on the fly as opposed to getting an already determined value. - `FileHandle::is_compat_mode_preferred_for_async(CompatMode)` is potentially throwing, which is asymmetric to `is_compat_mode_preferred()`. Also when the compat mode is `OFF`, it has to invoke `is_stream_api_available()` and `config_path()` on each pass instead of getting an already determined value. - There is no way to retrieve what the original requested compat mode is. These add to cognitive burden when rereading the source to introduce new features to FileHandle. This PR attempts to improve the logic by making it concise and crystal clear. This PR also fixes a line number bug in error handling. This PR is breaking in that the rarely used public functions to query the compat mode data in the `FileHandle` are removed. These data are instead queryable via the new `CompatModeManager` class. Authors: - Tianyu Liu (https://github.com/kingcrimsontianyu) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Lawrence Mitchell (https://github.com/wence-) URL: https://github.com/rapidsai/kvikio/pull/608 --- cpp/CMakeLists.txt | 1 + cpp/doxygen/main_page.md | 2 +- cpp/include/kvikio/compat_mode.hpp | 150 +++++++++++++++++++++++++++++ cpp/include/kvikio/defaults.hpp | 33 +------ cpp/include/kvikio/error.hpp | 12 +-- cpp/include/kvikio/file_handle.hpp | 44 +++------ cpp/src/batch.cpp | 2 +- cpp/src/compat_mode.cpp | 143 +++++++++++++++++++++++++++ cpp/src/defaults.cpp | 25 +---- cpp/src/file_handle.cpp | 127 +++++++----------------- docs/source/runtime_settings.rst | 2 +- 11 files changed, 353 insertions(+), 188 deletions(-) create mode 100644 cpp/include/kvikio/compat_mode.hpp create mode 100644 cpp/src/compat_mode.cpp diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 4c1daa58c6..0fd44e4e88 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -138,6 +138,7 @@ set(SOURCES "src/batch.cpp" "src/bounce_buffer.cpp" "src/buffer.cpp" + "src/compat_mode.cpp" "src/cufile/config.cpp" "src/cufile/driver.cpp" "src/defaults.cpp" diff --git a/cpp/doxygen/main_page.md b/cpp/doxygen/main_page.md index 5f21d6f224..a5e9e9162d 100644 --- a/cpp/doxygen/main_page.md +++ b/cpp/doxygen/main_page.md @@ -80,7 +80,7 @@ When KvikIO is running in compatibility mode, it doesn't load `libcufile.so`. In The environment variable `KVIKIO_COMPAT_MODE` has three options (case-insensitive): - `ON` (aliases: `TRUE`, `YES`, `1`): Enable the compatibility mode. - - `OFF` (aliases: `FALSE`, `NO`, `0`): Disable the compatibility mode, and enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met and cuFile is properly configured. However, if the system is not suited for cuFile, I/O operations under the `OFF` option may error out, crash or hang. + - `OFF` (aliases: `FALSE`, `NO`, `0`): Disable the compatibility mode, and enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met and cuFile is properly configured. 
However, if the system is not suited for cuFile, I/O operations under the `OFF` option may error out. - `AUTO`: Try cuFile I/O first, and fall back to POSIX I/O if the system requirements for cuFile are not met. Under `AUTO`, KvikIO falls back to the compatibility mode: diff --git a/cpp/include/kvikio/compat_mode.hpp b/cpp/include/kvikio/compat_mode.hpp new file mode 100644 index 0000000000..03ed443fe0 --- /dev/null +++ b/cpp/include/kvikio/compat_mode.hpp @@ -0,0 +1,150 @@ +/* + * Copyright (c) 2025, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include + +#include + +namespace kvikio { +/** + * @brief I/O compatibility mode. + */ +enum class CompatMode : uint8_t { + OFF, ///< Enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met + ///< and cuFile is properly configured. However, if the system is not suited for cuFile, I/O + ///< operations under the OFF option may error out. + ON, ///< Enforce POSIX I/O. + AUTO, ///< Try cuFile I/O first, and fall back to POSIX I/O if the system requirements for cuFile + ///< are not met. +}; + +namespace detail { +/** + * @brief Parse a string into a CompatMode enum. + * + * @param compat_mode_str Compatibility mode in string format (case-insensitive). Valid values + * are: + * - `ON` (alias: `TRUE`, `YES`, `1`) + * - `OFF` (alias: `FALSE`, `NO`, `0`) + * - `AUTO` + * @return A CompatMode enum. + */ +CompatMode parse_compat_mode_str(std::string_view compat_mode_str); + +} // namespace detail + +// Forward declaration. +class FileHandle; + +/** + * @brief Store and manage the compatibility mode data associated with a FileHandle. + */ +class CompatModeManager { + private: + CompatMode _compat_mode_requested{CompatMode::AUTO}; + bool _is_compat_mode_preferred{true}; + bool _is_compat_mode_preferred_for_async{true}; + + public: + /** + * @brief Construct an empty compatibility mode manager. + */ + CompatModeManager() noexcept = default; + + /** + * @brief Construct a compatibility mode manager associated with a FileHandle. + * + * According to the file path, requested compatibility mode, and the system configuration, the + * compatibility manager: + * - Infers the final compatibility modes for synchronous and asynchronous I/O paths, + * respectively. + * - Initializes the file wrappers and cuFile handle associated with a FileHandle. + * + * @param file_path Refer to + * FileHandle::FileHandle(std::string const&, std::string const&, mode_t, CompatMode). + * @param flags Same as above. + * @param mode Same as above. + * @param compat_mode_requested Same as above. + * @param file_handle Pointer to the FileHandle object that owns this compatibility mode manager. 
+ */ + CompatModeManager(std::string const& file_path, + std::string const& flags, + mode_t mode, + CompatMode compat_mode_requested, + FileHandle* file_handle); + + ~CompatModeManager() noexcept = default; + CompatModeManager(const CompatModeManager&) = default; + CompatModeManager& operator=(const CompatModeManager&) = default; + CompatModeManager(CompatModeManager&&) noexcept = default; + CompatModeManager& operator=(CompatModeManager&&) noexcept = default; + + /** + * @brief Functionally identical to defaults::infer_compat_mode_if_auto(CompatMode). + * + * @param compat_mode Compatibility mode. + * @return If the given compatibility mode is CompatMode::AUTO, infer the final compatibility + * mode. + */ + CompatMode infer_compat_mode_if_auto(CompatMode compat_mode) noexcept; + + /** + * @brief Functionally identical to defaults::is_compat_mode_preferred(CompatMode). + * + * @param compat_mode Compatibility mode. + * @return Boolean answer. + */ + bool is_compat_mode_preferred(CompatMode compat_mode) noexcept; + + /** + * @brief Check if the compatibility mode for synchronous I/O of the associated FileHandle is + * expected to be CompatMode::ON. + * + * @return Boolean answer. + */ + bool is_compat_mode_preferred() const noexcept; + + /** + * @brief Check if the compatibility mode for asynchronous I/O of the associated FileHandle is + * expected to be CompatMode::ON. + * + * @return Boolean answer. + */ + bool is_compat_mode_preferred_for_async() const noexcept; + + /** + * @brief Retrieve the original compatibility mode requested. + * + * @return The original compatibility mode requested. + */ + CompatMode compat_mode_requested() const noexcept; + + /** + * @brief Determine if asynchronous I/O can be performed or not (throw exceptions) + * according to the existing compatibility mode data in the manager. + * + * Asynchronous I/O cannot be performed, for instance, when compat_mode_requested() is + * CompatMode::OFF, is_compat_mode_preferred() is CompatMode::OFF, but + * is_compat_mode_preferred_for_async() is CompatMode::ON (due to missing cuFile stream API or + * cuFile configuration file). + */ + void validate_compat_mode_for_async() const; +}; + +} // namespace kvikio diff --git a/cpp/include/kvikio/defaults.hpp b/cpp/include/kvikio/defaults.hpp index 501c71981a..563cf09456 100644 --- a/cpp/include/kvikio/defaults.hpp +++ b/cpp/include/kvikio/defaults.hpp @@ -14,11 +14,6 @@ * limitations under the License. */ -// Enable documentation of the enum. -/** - * @file - */ - #pragma once #include @@ -29,35 +24,13 @@ #include +#include #include -namespace kvikio { /** - * @brief I/O compatibility mode. + * @brief KvikIO namespace. */ -enum class CompatMode : uint8_t { - OFF, ///< Enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met - ///< and cuFile is properly configured. However, if the system is not suited for cuFile, I/O - ///< operations under the OFF option may error out, crash or hang. - ON, ///< Enforce POSIX I/O. - AUTO, ///< Try cuFile I/O first, and fall back to POSIX I/O if the system requirements for cuFile - ///< are not met. -}; - -namespace detail { -/** - * @brief Parse a string into a CompatMode enum. - * - * @param compat_mode_str Compatibility mode in string format(case-insensitive). Valid values - * include: - * - `ON` (alias: `TRUE`, `YES`, `1`) - * - `OFF` (alias: `FALSE`, `NO`, `0`) - * - `AUTO` - * @return A CompatMode enum. 
- */ -CompatMode parse_compat_mode_str(std::string_view compat_mode_str); - -} // namespace detail +namespace kvikio { template T getenv_or(std::string_view env_var_name, T default_val) diff --git a/cpp/include/kvikio/error.hpp b/cpp/include/kvikio/error.hpp index e38d05054f..f0253fe70e 100644 --- a/cpp/include/kvikio/error.hpp +++ b/cpp/include/kvikio/error.hpp @@ -17,6 +17,7 @@ #include #include +#include #include #include @@ -71,7 +72,7 @@ void cuda_driver_try_2(CUresult error, int line_number, char const* filename) { if (error == CUDA_ERROR_STUB_LIBRARY) { throw Exception{std::string{"CUDA error at: "} + std::string(filename) + ":" + - KVIKIO_STRINGIFY(line_number) + + std::to_string(line_number) + ": CUDA_ERROR_STUB_LIBRARY(" "The CUDA driver loaded is a stub library)"}; } @@ -82,9 +83,8 @@ void cuda_driver_try_2(CUresult error, int line_number, char const* filename) CUresult err_str_status = cudaAPI::instance().GetErrorString(error, &err_str); if (err_name_status == CUDA_ERROR_INVALID_VALUE) { err_name = "unknown"; } if (err_str_status == CUDA_ERROR_INVALID_VALUE) { err_str = "unknown"; } - throw Exception{std::string{"CUDA error at: "} + filename + ":" + - KVIKIO_STRINGIFY(line_number) + ": " + std::string(err_name) + "(" + - std::string(err_str) + ")"}; + throw Exception{std::string{"CUDA error at: "} + filename + ":" + std::to_string(line_number) + + ": " + std::string(err_name) + "(" + std::string(err_str) + ")"}; } } @@ -97,7 +97,7 @@ void cufile_try_2(CUfileError_t error, int line_number, char const* filename) CUDA_DRIVER_TRY(cuda_error); } throw Exception{std::string{"cuFile error at: "} + filename + ":" + - KVIKIO_STRINGIFY(line_number) + ": " + + std::to_string(line_number) + ": " + cufileop_status_error((CUfileOpError)std::abs(error.err))}; } } @@ -111,7 +111,7 @@ void cufile_check_bytes_done_2(ssize_t nbytes_done, int line_number, char const* ? std::string(cufileop_status_error((CUfileOpError)err)) : std::string(std::strerror(err)); throw Exception{std::string{"cuFile error at: "} + filename + ":" + - KVIKIO_STRINGIFY(line_number) + ": " + msg}; + std::to_string(line_number) + ": " + msg}; } } diff --git a/cpp/include/kvikio/file_handle.hpp b/cpp/include/kvikio/file_handle.hpp index ba44992b53..3d31b96dc1 100644 --- a/cpp/include/kvikio/file_handle.hpp +++ b/cpp/include/kvikio/file_handle.hpp @@ -25,6 +25,7 @@ #include #include +#include #include #include #include @@ -32,6 +33,7 @@ #include #include #include +#include #include #include @@ -45,23 +47,13 @@ namespace kvikio { class FileHandle { private: // We use two file descriptors, one opened with the O_DIRECT flag and one without. - FileWrapper _fd_direct_on{}; - FileWrapper _fd_direct_off{}; + FileWrapper _file_direct_on{}; + FileWrapper _file_direct_off{}; bool _initialized{false}; - CompatMode _compat_mode{CompatMode::AUTO}; mutable std::size_t _nbytes{0}; // The size of the underlying file, zero means unknown. CUFileHandleWrapper _cufile_handle{}; - - /** - * @brief Given a requested compatibility mode, whether it is expected to reduce to `ON` for - * asynchronous I/O. - * - * @param requested_compat_mode Requested compatibility mode. - * @return True if POSIX I/O fallback will be used; false for cuFile I/O. - * @exception std::runtime_error When the requested compatibility mode is `OFF`, but cuFile - * batch/stream library symbol is missing, or cuFile configuration file is missing. 
- */ - bool is_compat_mode_preferred_for_async(CompatMode requested_compat_mode); + CompatModeManager _compat_mode_manager; + friend class CompatModeManager; public: static constexpr mode_t m644 = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH; @@ -444,27 +436,13 @@ class FileHandle { CUstream stream = nullptr); /** - * @brief Returns `true` if the compatibility mode is expected to be `ON` for this file. - * - * Compatibility mode can be explicitly enabled in object creation. The mode is also enabled - * automatically, if file cannot be opened with the `O_DIRECT` flag, or if the system does not - * meet the requirements for the cuFile library under the `AUTO` compatibility mode. - * - * @return Boolean answer. - */ - [[nodiscard]] bool is_compat_mode_preferred() const noexcept; - - /** - * @brief Returns `true` if the compatibility mode is expected to be `ON` for the asynchronous I/O - * on this file. - * - * For asynchronous I/O, the compatibility mode can be automatically enabled if the cuFile batch - * and stream symbols are missing, or if the cuFile configuration file is missing, or if - * `is_compat_mode_preferred()` returns true. + * @brief Get the associated compatibility mode manager, which can be used to query the original + * requested compatibility mode or the expected compatibility modes for synchronous and + * asynchronous I/O. * - * @return Boolean answer. + * @return The associated compatibility mode manager. */ - [[nodiscard]] bool is_compat_mode_preferred_for_async() const noexcept; + const CompatModeManager& get_compat_mode_manager() const noexcept; }; } // namespace kvikio diff --git a/cpp/src/batch.cpp b/cpp/src/batch.cpp index 128c2d5953..ee3148d7cc 100644 --- a/cpp/src/batch.cpp +++ b/cpp/src/batch.cpp @@ -60,7 +60,7 @@ void BatchHandle::submit(std::vector const& operations) std::vector io_batch_params; io_batch_params.reserve(operations.size()); for (auto const& op : operations) { - if (op.file_handle.is_compat_mode_preferred()) { + if (op.file_handle.get_compat_mode_manager().is_compat_mode_preferred()) { throw CUfileException("Cannot submit a FileHandle opened in compatibility mode"); } diff --git a/cpp/src/compat_mode.cpp b/cpp/src/compat_mode.cpp new file mode 100644 index 0000000000..5944f91e05 --- /dev/null +++ b/cpp/src/compat_mode.cpp @@ -0,0 +1,143 @@ +/* + * Copyright (c) 2025, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include +#include +#include + +#include +#include +#include +#include +#include + +namespace kvikio { + +namespace detail { +CompatMode parse_compat_mode_str(std::string_view compat_mode_str) +{ + // Convert to lowercase + std::string tmp{compat_mode_str}; + std::transform( + tmp.begin(), tmp.end(), tmp.begin(), [](unsigned char c) { return std::tolower(c); }); + + if (tmp == "on" || tmp == "true" || tmp == "yes" || tmp == "1") { + return CompatMode::ON; + } else if (tmp == "off" || tmp == "false" || tmp == "no" || tmp == "0") { + return CompatMode::OFF; + } else if (tmp == "auto") { + return CompatMode::AUTO; + } else { + throw std::invalid_argument("Unknown compatibility mode: " + std::string{tmp}); + } +} + +} // namespace detail + +CompatMode CompatModeManager::infer_compat_mode_if_auto(CompatMode compat_mode) noexcept +{ + if (compat_mode == CompatMode::AUTO) { + return is_cufile_available() ? CompatMode::OFF : CompatMode::ON; + } + return compat_mode; +} + +bool CompatModeManager::is_compat_mode_preferred(CompatMode compat_mode) noexcept +{ + return compat_mode == CompatMode::ON || + (compat_mode == CompatMode::AUTO && + infer_compat_mode_if_auto(compat_mode) == CompatMode::ON); +} + +bool CompatModeManager::is_compat_mode_preferred() const noexcept +{ + return _is_compat_mode_preferred; +} + +bool CompatModeManager::is_compat_mode_preferred_for_async() const noexcept +{ + return _is_compat_mode_preferred_for_async; +} + +CompatMode CompatModeManager::compat_mode_requested() const noexcept +{ + return _compat_mode_requested; +} + +CompatModeManager::CompatModeManager(std::string const& file_path, + std::string const& flags, + mode_t mode, + CompatMode compat_mode_requested_v, + FileHandle* file_handle) +{ + if (file_handle == nullptr) { + throw std::invalid_argument( + "The compatibility mode manager does not have a proper owning file handle."); + } + + file_handle->_file_direct_off.open(file_path, flags, false, mode); + _is_compat_mode_preferred = is_compat_mode_preferred(compat_mode_requested_v); + + // Nothing to do in compatibility mode + if (_is_compat_mode_preferred) { return; } + + try { + file_handle->_file_direct_on.open(file_path, flags, true, mode); + } catch (...) { + // Try to open the file with the O_DIRECT flag. Fall back to compatibility mode, if it fails. + if (compat_mode_requested_v == CompatMode::AUTO) { + _is_compat_mode_preferred = true; + } else { // CompatMode::OFF + throw; + } + } + + if (_is_compat_mode_preferred) { return; } + + auto error_code = file_handle->_cufile_handle.register_handle(file_handle->_file_direct_on.fd()); + assert(error_code.has_value()); + + // For the AUTO mode, if the first cuFile API call fails, fall back to the compatibility + // mode. 
+ if (compat_mode_requested_v == CompatMode::AUTO && error_code.value().err != CU_FILE_SUCCESS) { + _is_compat_mode_preferred = true; + } else { + CUFILE_TRY(error_code.value()); + } + + // Check cuFile async API + static bool const is_extra_symbol_available = is_stream_api_available(); + static bool const is_config_path_empty = config_path().empty(); + _is_compat_mode_preferred_for_async = + _is_compat_mode_preferred || !is_extra_symbol_available || is_config_path_empty; +} + +void CompatModeManager::validate_compat_mode_for_async() const +{ + if (!_is_compat_mode_preferred && _is_compat_mode_preferred_for_async && + _compat_mode_requested == CompatMode::OFF) { + std::string err_msg; + if (!is_stream_api_available()) { err_msg += "Missing the cuFile stream api."; } + + // When checking for availability, we also check if cuFile's config file exists. This is + // because even when the stream API is available, it doesn't work if no config file exists. + if (config_path().empty()) { err_msg += " Missing cuFile configuration file."; } + + throw std::runtime_error(err_msg); + } +} + +} // namespace kvikio diff --git a/cpp/src/defaults.cpp b/cpp/src/defaults.cpp index f005e86d0b..d71f51fa68 100644 --- a/cpp/src/defaults.cpp +++ b/cpp/src/defaults.cpp @@ -14,7 +14,6 @@ * limitations under the License. */ -#include #include #include #include @@ -23,34 +22,12 @@ #include +#include #include #include namespace kvikio { -namespace detail { -CompatMode parse_compat_mode_str(std::string_view compat_mode_str) -{ - // Convert to lowercase - std::string tmp{compat_mode_str}; - std::transform( - tmp.begin(), tmp.end(), tmp.begin(), [](unsigned char c) { return std::tolower(c); }); - - CompatMode res{}; - if (tmp == "on" || tmp == "true" || tmp == "yes" || tmp == "1") { - res = CompatMode::ON; - } else if (tmp == "off" || tmp == "false" || tmp == "no" || tmp == "0") { - res = CompatMode::OFF; - } else if (tmp == "auto") { - res = CompatMode::AUTO; - } else { - throw std::invalid_argument("Unknown compatibility mode: " + std::string{tmp}); - } - return res; -} - -} // namespace detail - template <> bool getenv_or(std::string_view env_var_name, bool default_val) { diff --git a/cpp/src/file_handle.cpp b/cpp/src/file_handle.cpp index 0e65afb7fd..4e8376a285 100644 --- a/cpp/src/file_handle.cpp +++ b/cpp/src/file_handle.cpp @@ -21,8 +21,9 @@ #include #include #include -#include +#include +#include #include #include #include @@ -33,62 +34,28 @@ FileHandle::FileHandle(std::string const& file_path, std::string const& flags, mode_t mode, CompatMode compat_mode) - : _fd_direct_off{file_path, flags, false, mode}, _initialized{true}, _compat_mode{compat_mode} + : _initialized{true}, _compat_mode_manager{file_path, flags, mode, compat_mode, this} { - if (is_compat_mode_preferred()) { - return; // Nothing to do in compatibility mode - } - - // Try to open the file with the O_DIRECT flag. Fall back to compatibility mode, if it fails. 
- auto handle_o_direct_except = [this] { - if (_compat_mode == CompatMode::AUTO) { - _compat_mode = CompatMode::ON; - } else { // CompatMode::OFF - throw; - } - }; - - try { - _fd_direct_on.open(file_path, flags, true, mode); - } catch (std::system_error const&) { - handle_o_direct_except(); - } catch (std::invalid_argument const&) { - handle_o_direct_except(); - } - - if (_compat_mode == CompatMode::ON) { return; } - - CUFileHandleWrapper handle; - auto error_code = handle.register_handle(_fd_direct_on.fd()); - assert(error_code.has_value()); - - // For the AUTO mode, if the first cuFile API call fails, fall back to the compatibility - // mode. - if (_compat_mode == CompatMode::AUTO && error_code.value().err != CU_FILE_SUCCESS) { - _compat_mode = CompatMode::ON; - } else { - CUFILE_TRY(error_code.value()); - } } FileHandle::FileHandle(FileHandle&& o) noexcept - : _fd_direct_on{std::exchange(o._fd_direct_on, {})}, - _fd_direct_off{std::exchange(o._fd_direct_off, {})}, + : _file_direct_on{std::exchange(o._file_direct_on, {})}, + _file_direct_off{std::exchange(o._file_direct_off, {})}, _initialized{std::exchange(o._initialized, false)}, - _compat_mode{std::exchange(o._compat_mode, CompatMode::AUTO)}, _nbytes{std::exchange(o._nbytes, 0)}, - _cufile_handle{std::exchange(o._cufile_handle, {})} + _cufile_handle{std::exchange(o._cufile_handle, {})}, + _compat_mode_manager{std::move(o._compat_mode_manager)} { } FileHandle& FileHandle::operator=(FileHandle&& o) noexcept { - _fd_direct_on = std::exchange(o._fd_direct_on, {}); - _fd_direct_off = std::exchange(o._fd_direct_off, {}); - _initialized = std::exchange(o._initialized, false); - _compat_mode = std::exchange(o._compat_mode, CompatMode::AUTO); - _nbytes = std::exchange(o._nbytes, 0); - _cufile_handle = std::exchange(o._cufile_handle, {}); + _file_direct_on = std::exchange(o._file_direct_on, {}); + _file_direct_off = std::exchange(o._file_direct_off, {}); + _initialized = std::exchange(o._initialized, false); + _nbytes = std::exchange(o._nbytes, 0); + _cufile_handle = std::exchange(o._cufile_handle, {}); + _compat_mode_manager = std::move(o._compat_mode_manager); return *this; } @@ -100,11 +67,10 @@ void FileHandle::close() noexcept { try { if (closed()) { return; } - _cufile_handle.unregister_handle(); - _compat_mode = CompatMode::AUTO; - _fd_direct_off.close(); - _fd_direct_on.close(); + _file_direct_off.close(); + _file_direct_on.close(); + _nbytes = 0; _initialized = false; } catch (...) { } @@ -113,7 +79,7 @@ void FileHandle::close() noexcept CUfileHandle_t FileHandle::handle() { if (closed()) { throw CUfileException("File handle is closed"); } - if (is_compat_mode_preferred()) { + if (get_compat_mode_manager().is_compat_mode_preferred()) { throw CUfileException("The underlying cuFile handle isn't available in compatibility mode"); } return _cufile_handle.handle(); @@ -121,7 +87,7 @@ CUfileHandle_t FileHandle::handle() int FileHandle::fd(bool o_direct) const noexcept { - return o_direct ? _fd_direct_on.fd() : _fd_direct_off.fd(); + return o_direct ? 
_file_direct_on.fd() : _file_direct_off.fd(); } int FileHandle::fd_open_flags(bool o_direct) const { return open_flags(fd(o_direct)); } @@ -129,7 +95,7 @@ int FileHandle::fd_open_flags(bool o_direct) const { return open_flags(fd(o_dire std::size_t FileHandle::nbytes() const { if (closed()) { return 0; } - if (_nbytes == 0) { _nbytes = get_file_size(_fd_direct_off.fd()); } + if (_nbytes == 0) { _nbytes = get_file_size(_file_direct_off.fd()); } return _nbytes; } @@ -139,9 +105,9 @@ std::size_t FileHandle::read(void* devPtr_base, std::size_t devPtr_offset, bool sync_default_stream) { - if (is_compat_mode_preferred()) { + if (get_compat_mode_manager().is_compat_mode_preferred()) { return detail::posix_device_read( - _fd_direct_off.fd(), devPtr_base, size, file_offset, devPtr_offset); + _file_direct_off.fd(), devPtr_base, size, file_offset, devPtr_offset); } if (sync_default_stream) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } @@ -163,9 +129,9 @@ std::size_t FileHandle::write(void const* devPtr_base, { _nbytes = 0; // Invalidate the computed file size - if (is_compat_mode_preferred()) { + if (get_compat_mode_manager().is_compat_mode_preferred()) { return detail::posix_device_write( - _fd_direct_off.fd(), devPtr_base, size, file_offset, devPtr_offset); + _file_direct_off.fd(), devPtr_base, size, file_offset, devPtr_offset); } if (sync_default_stream) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } @@ -200,7 +166,7 @@ std::future FileHandle::pread(void* buf, std::size_t hostPtr_offset) -> std::size_t { char* buf = static_cast(hostPtr_base) + hostPtr_offset; return detail::posix_host_read( - _fd_direct_off.fd(), buf, size, file_offset); + _file_direct_off.fd(), buf, size, file_offset); }; return parallel_io(op, buf, size, file_offset, task_size, 0); @@ -212,13 +178,13 @@ std::future FileHandle::pread(void* buf, if (size < gds_threshold) { auto task = [this, ctx, buf, size, file_offset]() -> std::size_t { PushAndPopContext c(ctx); - return detail::posix_device_read(_fd_direct_off.fd(), buf, size, file_offset, 0); + return detail::posix_device_read(_file_direct_off.fd(), buf, size, file_offset, 0); }; return std::async(std::launch::deferred, task); } // Let's synchronize once instead of in each task. - if (sync_default_stream && !is_compat_mode_preferred()) { + if (sync_default_stream && !get_compat_mode_manager().is_compat_mode_preferred()) { PushAndPopContext c(ctx); CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } @@ -250,7 +216,7 @@ std::future FileHandle::pwrite(void const* buf, std::size_t hostPtr_offset) -> std::size_t { char const* buf = static_cast(hostPtr_base) + hostPtr_offset; return detail::posix_host_write( - _fd_direct_off.fd(), buf, size, file_offset); + _file_direct_off.fd(), buf, size, file_offset); }; return parallel_io(op, buf, size, file_offset, task_size, 0); @@ -262,13 +228,13 @@ std::future FileHandle::pwrite(void const* buf, if (size < gds_threshold) { auto task = [this, ctx, buf, size, file_offset]() -> std::size_t { PushAndPopContext c(ctx); - return detail::posix_device_write(_fd_direct_off.fd(), buf, size, file_offset, 0); + return detail::posix_device_write(_file_direct_off.fd(), buf, size, file_offset, 0); }; return std::async(std::launch::deferred, task); } // Let's synchronize once instead of in each task. 
- if (sync_default_stream && !is_compat_mode_preferred()) { + if (sync_default_stream && !get_compat_mode_manager().is_compat_mode_preferred()) { PushAndPopContext c(ctx); CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } @@ -292,7 +258,8 @@ void FileHandle::read_async(void* devPtr_base, ssize_t* bytes_read_p, CUstream stream) { - if (is_compat_mode_preferred_for_async(_compat_mode)) { + get_compat_mode_manager().validate_compat_mode_for_async(); + if (get_compat_mode_manager().is_compat_mode_preferred_for_async()) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(stream)); *bytes_read_p = static_cast(read(devPtr_base, *size_p, *file_offset_p, *devPtr_offset_p)); @@ -324,7 +291,8 @@ void FileHandle::write_async(void* devPtr_base, ssize_t* bytes_written_p, CUstream stream) { - if (is_compat_mode_preferred_for_async(_compat_mode)) { + get_compat_mode_manager().validate_compat_mode_for_async(); + if (get_compat_mode_manager().is_compat_mode_preferred_for_async()) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(stream)); *bytes_written_p = static_cast(write(devPtr_base, *size_p, *file_offset_p, *devPtr_offset_p)); @@ -349,34 +317,9 @@ StreamFuture FileHandle::write_async( return ret; } -bool FileHandle::is_compat_mode_preferred() const noexcept -{ - return defaults::is_compat_mode_preferred(_compat_mode); -} - -bool FileHandle::is_compat_mode_preferred_for_async() const noexcept +const CompatModeManager& FileHandle::get_compat_mode_manager() const noexcept { - static bool is_extra_symbol_available = is_stream_api_available(); - static bool is_config_path_empty = config_path().empty(); - return is_compat_mode_preferred() || !is_extra_symbol_available || is_config_path_empty; -} - -bool FileHandle::is_compat_mode_preferred_for_async(CompatMode requested_compat_mode) -{ - if (defaults::is_compat_mode_preferred(requested_compat_mode)) { return true; } - - if (!is_stream_api_available()) { - if (requested_compat_mode == CompatMode::AUTO) { return true; } - throw std::runtime_error("Missing the cuFile stream api."); - } - - // When checking for availability, we also check if cuFile's config file exists. This is - // because even when the stream API is available, it doesn't work if no config file exists. - if (config_path().empty()) { - if (requested_compat_mode == CompatMode::AUTO) { return true; } - throw std::runtime_error("Missing cuFile configuration file."); - } - return false; + return _compat_mode_manager; } } // namespace kvikio diff --git a/docs/source/runtime_settings.rst b/docs/source/runtime_settings.rst index be5508741e..0ce1ab7972 100644 --- a/docs/source/runtime_settings.rst +++ b/docs/source/runtime_settings.rst @@ -8,7 +8,7 @@ When KvikIO is running in compatibility mode, it doesn't load ``libcufile.so``. The environment variable ``KVIKIO_COMPAT_MODE`` has three options (case-insensitive): * ``ON`` (aliases: ``TRUE``, ``YES``, ``1``): Enable the compatibility mode. - * ``OFF`` (aliases: ``FALSE``, ``NO``, ``0``): Disable the compatibility mode, and enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met and cuFile is properly configured. However, if the system is not suited for cuFile, I/O operations under the ``OFF`` option may error out, crash or hang. + * ``OFF`` (aliases: ``FALSE``, ``NO``, ``0``): Disable the compatibility mode, and enforce cuFile I/O. GDS will be activated if the system requirements for cuFile are met and cuFile is properly configured. 
However, if the system is not suited for cuFile, I/O operations under the ``OFF`` option may error out. * ``AUTO``: Try cuFile I/O first, and fall back to POSIX I/O if the system requirements for cuFile are not met. Under ``AUTO``, KvikIO falls back to the compatibility mode: From 096ac0ffa73db4b09b8c2e2089c4c83713fe1622 Mon Sep 17 00:00:00 2001 From: Tianyu Liu Date: Tue, 18 Feb 2025 03:57:45 -0500 Subject: [PATCH 4/5] Enable tracing of thread pool tasks using NVTX (#630) This PR implements the basic feature outlined in #631. The two good-to-haves are currently blocked. Authors: - Tianyu Liu (https://github.com/kingcrimsontianyu) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Lawrence Mitchell (https://github.com/wence-) URL: https://github.com/rapidsai/kvikio/pull/630 --- cpp/CMakeLists.txt | 1 + cpp/include/kvikio/nvtx.hpp | 200 ++++++++++++++++++++++ cpp/include/kvikio/parallel_operation.hpp | 58 ++++++- cpp/include/kvikio/posix_io.hpp | 1 + cpp/include/kvikio/utils.hpp | 118 +------------ cpp/src/file_handle.cpp | 21 ++- cpp/src/nvtx.cpp | 94 ++++++++++ cpp/src/posix_io.cpp | 1 + cpp/src/remote_handle.cpp | 4 +- cpp/src/utils.cpp | 5 - 10 files changed, 368 insertions(+), 135 deletions(-) create mode 100644 cpp/include/kvikio/nvtx.hpp create mode 100644 cpp/src/nvtx.cpp diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 0fd44e4e88..cc0436fd43 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -145,6 +145,7 @@ set(SOURCES "src/error.cpp" "src/file_handle.cpp" "src/file_utils.cpp" + "src/nvtx.cpp" "src/posix_io.cpp" "src/shim/cuda.cpp" "src/shim/cufile.cpp" diff --git a/cpp/include/kvikio/nvtx.hpp b/cpp/include/kvikio/nvtx.hpp new file mode 100644 index 0000000000..fc401fd38a --- /dev/null +++ b/cpp/include/kvikio/nvtx.hpp @@ -0,0 +1,200 @@ +/* + * Copyright (c) 2025, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include + +#ifdef KVIKIO_CUDA_FOUND +#include +#endif + +#include +#include + +namespace kvikio { + +#ifdef KVIKIO_CUDA_FOUND +/** + * @brief Tag type for libkvikio's NVTX domain. + */ +struct libkvikio_domain { + static constexpr char const* name{"libkvikio"}; +}; + +using nvtx_scoped_range_type = nvtx3::scoped_range_in; +using nvtx_registered_string_type = nvtx3::registered_string_in; + +// Macro to concatenate two tokens x and y. +#define KVIKIO_CONCAT_HELPER(x, y) x##y +#define KVIKIO_CONCAT(x, y) KVIKIO_CONCAT_HELPER(x, y) + +// Macro to create a static, registered string that will not have a name conflict with any +// registered string defined in the same scope. +#define KVIKIO_REGISTER_STRING(message) \ + [](const char* a_message) -> auto& { \ + static kvikio::nvtx_registered_string_type a_reg_str{a_message}; \ + return a_reg_str; \ + }(message) + +// Implementation of KVIKIO_NVTX_FUNC_RANGE() +#define KVIKIO_NVTX_FUNC_RANGE_IMPL() NVTX3_FUNC_RANGE_IN(kvikio::libkvikio_domain) + +// Implementation of KVIKIO_NVTX_SCOPED_RANGE(...) 
+#define KVIKIO_NVTX_SCOPED_RANGE_IMPL_3(message, payload_v, color) \ + kvikio::nvtx_scoped_range_type KVIKIO_CONCAT(_kvikio_nvtx_range, __LINE__) \ + { \ + nvtx3::event_attributes \ + { \ + KVIKIO_REGISTER_STRING(message), nvtx3::payload{kvikio::convert_to_64bit(payload_v)}, color \ + } \ + } +#define KVIKIO_NVTX_SCOPED_RANGE_IMPL_2(message, payload) \ + KVIKIO_NVTX_SCOPED_RANGE_IMPL_3(message, payload, kvikio::nvtx_manager::default_color()) +#define KVIKIO_NVTX_SCOPED_RANGE_SELECTOR(_1, _2, _3, NAME, ...) NAME +#define KVIKIO_NVTX_SCOPED_RANGE_IMPL(...) \ + KVIKIO_NVTX_SCOPED_RANGE_SELECTOR( \ + __VA_ARGS__, KVIKIO_NVTX_SCOPED_RANGE_IMPL_3, KVIKIO_NVTX_SCOPED_RANGE_IMPL_2) \ + (__VA_ARGS__) + +// Implementation of KVIKIO_NVTX_MARKER(message, payload) +#define KVIKIO_NVTX_MARKER_IMPL(message, payload_v) \ + nvtx3::mark_in(nvtx3::event_attributes{ \ + KVIKIO_REGISTER_STRING(message), nvtx3::payload{kvikio::convert_to_64bit(payload_v)}}) + +#endif + +#ifdef KVIKIO_CUDA_FOUND +using nvtx_color_type = nvtx3::color; +#else +using nvtx_color_type = int; +#endif + +/** + * @brief Utility singleton class for NVTX annotation. + */ +class nvtx_manager { + public: + static nvtx_manager& instance() noexcept; + + /** + * @brief Return the default color. + * + * @return Default color. + */ + static const nvtx_color_type& default_color() noexcept; + + /** + * @brief Return the color at the given index from the internal color palette whose size n is a + * power of 2. The index may exceed the size of the color palette, in which case it wraps around, + * i.e. (idx mod n). + * + * @param idx The index value. + * @return The color picked from the internal color palette. + */ + static const nvtx_color_type& get_color_by_index(std::uint64_t idx) noexcept; + + /** + * @brief Rename the current thread under the KvikIO NVTX domain. + * + * @note This NVTX feature is currently not supported by the Nsight System profiler. As a result, + * the OS thread will not be renamed in the nsys-ui. + */ + static void rename_current_thread(std::string_view new_name) noexcept; + + nvtx_manager(nvtx_manager const&) = delete; + nvtx_manager& operator=(nvtx_manager const&) = delete; + nvtx_manager(nvtx_manager&&) = delete; + nvtx_manager& operator=(nvtx_manager&&) = delete; + + private: + nvtx_manager() = default; +}; + +/** + * @brief Convenience macro for generating an NVTX range in the `libkvikio` domain + * from the lifetime of a function. + * + * Takes no argument. The name of the immediately enclosing function returned by `__func__` is used + * as the message. + * + * Example: + * ``` + * void some_function(){ + * KVIKIO_NVTX_FUNC_RANGE(); // The name `some_function` is used as the message + * ... + * } + * ``` + */ +#ifdef KVIKIO_CUDA_FOUND +#define KVIKIO_NVTX_FUNC_RANGE() KVIKIO_NVTX_FUNC_RANGE_IMPL() +#else +#define KVIKIO_NVTX_FUNC_RANGE(...) \ + do { \ + } while (0) +#endif + +/** + * @brief Convenience macro for generating an NVTX scoped range in the `libkvikio` domain to + * annotate a time duration. + * + * @param message String literal for NVTX annotation. To improve profile-time performance, the + * string literal is registered in NVTX. + * @param payload NVTX payload. + * @param color (Optional) NVTX color. If unspecified, a default NVTX color is used. + * + * Example: + * ``` + * void some_function(){ + * KVIKIO_NVTX_SCOPED_RANGE("my function", 42); + * ... + * } + * ``` + */ +#ifdef KVIKIO_CUDA_FOUND +#define KVIKIO_NVTX_SCOPED_RANGE(...) 
KVIKIO_NVTX_SCOPED_RANGE_IMPL(__VA_ARGS__) +#else +#define KVIKIO_NVTX_SCOPED_RANGE(message, payload, ...) \ + do { \ + } while (0) +#endif + +/** + * @brief Convenience macro for generating an NVTX marker in the `libkvikio` domain to annotate a + * certain time point. + * + * @param message String literal for NVTX annotation. To improve profile-time performance, the + * string literal is registered in NVTX. + * @param payload NVTX payload. + * + * Example: + * ``` + * std::future some_function(){ + * size_t io_size{2077}; + * KVIKIO_NVTX_MARKER("I/O operation", io_size); + * perform_async_io_operation(io_size); + * ... + * } + * ``` + */ +#ifdef KVIKIO_CUDA_FOUND +#define KVIKIO_NVTX_MARKER(message, payload) KVIKIO_NVTX_MARKER_IMPL(message, payload) +#else +#define KVIKIO_NVTX_MARKER(message, payload) \ + do { \ + } while (0) +#endif + +} // namespace kvikio diff --git a/cpp/include/kvikio/parallel_operation.hpp b/cpp/include/kvikio/parallel_operation.hpp index f345333c4f..2172d1616d 100644 --- a/cpp/include/kvikio/parallel_operation.hpp +++ b/cpp/include/kvikio/parallel_operation.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021-2024, NVIDIA CORPORATION. + * Copyright (c) 2021-2025, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -15,6 +15,7 @@ */ #pragma once +#include #include #include #include @@ -24,18 +25,51 @@ #include #include +#include #include namespace kvikio { namespace detail { +/** + * @brief Determine the NVTX color and call index. They are used to identify tasks from different + * pread/pwrite calls. Tasks from the same pread/pwrite call are given the same color and call + * index. The call index is atomically incremented on each pread/pwrite call, and will wrap around + * once it reaches the maximum value the integer type `std::uint64_t` can hold (this overflow + * behavior is well-defined in C++). The color is picked from an internal color palette according to + * the call index value. + * + * @return A pair of NVTX color and call index. + */ +inline const std::pair get_next_color_and_call_idx() noexcept +{ + static std::atomic_uint64_t call_counter{1ull}; + auto call_idx = call_counter.fetch_add(1ull, std::memory_order_relaxed); + auto& nvtx_color = nvtx_manager::get_color_by_index(call_idx); + return {nvtx_color, call_idx}; +} + template -std::future submit_task( - F op, T buf, std::size_t size, std::size_t file_offset, std::size_t devPtr_offset) +std::future submit_task(F op, + T buf, + std::size_t size, + std::size_t file_offset, + std::size_t devPtr_offset, + std::uint64_t nvtx_payload = 0ull, + nvtx_color_type nvtx_color = nvtx_manager::default_color()) { - return defaults::thread_pool().submit_task( - [=] { return op(buf, size, file_offset, devPtr_offset); }); + return defaults::thread_pool().submit_task([=] { + KVIKIO_NVTX_SCOPED_RANGE("task", nvtx_payload, nvtx_color); + + // Rename the worker thread in the thread pool to improve clarity from nsys-ui. + // Note: This NVTX feature is currently not supported by nsys-ui. 
+ thread_local std::once_flag call_once_per_thread; + std::call_once(call_once_per_thread, + [] { nvtx_manager::rename_current_thread("thread pool"); }); + + return op(buf, size, file_offset, devPtr_offset); + }); } } // namespace detail @@ -58,13 +92,15 @@ std::future parallel_io(F op, std::size_t size, std::size_t file_offset, std::size_t task_size, - std::size_t devPtr_offset) + std::size_t devPtr_offset, + std::uint64_t call_idx = 0, + nvtx_color_type nvtx_color = nvtx_manager::default_color()) { if (task_size == 0) { throw std::invalid_argument("`task_size` cannot be zero"); } // Single-task guard if (task_size >= size || page_size >= size) { - return detail::submit_task(op, buf, size, file_offset, devPtr_offset); + return detail::submit_task(op, buf, size, file_offset, devPtr_offset, call_idx, nvtx_color); } // We know an upper bound of the total number of tasks @@ -73,14 +109,18 @@ std::future parallel_io(F op, // 1) Submit `task_size` sized tasks while (size >= task_size) { - tasks.push_back(detail::submit_task(op, buf, task_size, file_offset, devPtr_offset)); + tasks.push_back( + detail::submit_task(op, buf, task_size, file_offset, devPtr_offset, call_idx, nvtx_color)); file_offset += task_size; devPtr_offset += task_size; size -= task_size; } // 2) Submit a task for the remainder - if (size > 0) { tasks.push_back(detail::submit_task(op, buf, size, file_offset, devPtr_offset)); } + if (size > 0) { + tasks.push_back( + detail::submit_task(op, buf, size, file_offset, devPtr_offset, call_idx, nvtx_color)); + } // Finally, we sum the result of all tasks. auto gather_tasks = [](std::vector>&& tasks) -> std::size_t { diff --git a/cpp/include/kvikio/posix_io.hpp b/cpp/include/kvikio/posix_io.hpp index 99964315b3..955262066e 100644 --- a/cpp/include/kvikio/posix_io.hpp +++ b/cpp/include/kvikio/posix_io.hpp @@ -23,6 +23,7 @@ #include #include +#include #include #include diff --git a/cpp/include/kvikio/utils.hpp b/cpp/include/kvikio/utils.hpp index b10e54c482..4b0c8aa75a 100644 --- a/cpp/include/kvikio/utils.hpp +++ b/cpp/include/kvikio/utils.hpp @@ -23,10 +23,6 @@ #include #include -#ifdef KVIKIO_CUDA_FOUND -#include -#endif - #include namespace kvikio { @@ -54,6 +50,12 @@ template >* = nullptr> return std::int64_t(value); } +/** + * @brief Helper function to allow NVTX payload of type std::uint64_t to pass through without doing + * anything. + */ +[[nodiscard]] inline std::uint64_t convert_to_64bit(std::uint64_t value) { return value; } + /** * @brief Help function to convert value to 64 bit float */ @@ -155,112 +157,4 @@ bool is_future_done(T const& future) return future.wait_for(std::chrono::seconds(0)) != std::future_status::timeout; } -#ifdef KVIKIO_CUDA_FOUND -/** - * @brief Tag type for libkvikio's NVTX domain. - */ -struct libkvikio_domain { - static constexpr char const* name{"libkvikio"}; -}; - -// Macro to concatenate two tokens x and y. -#define KVIKIO_CONCAT_HELPER(x, y) x##y -#define KVIKIO_CONCAT(x, y) KVIKIO_CONCAT_HELPER(x, y) - -// Macro to create a static, registered string that will not have a name conflict with any -// registered string defined in the same scope. 
-#define KVIKIO_REGISTER_STRING(msg) \ - [](const char* a_msg) -> auto& { \ - static nvtx3::registered_string_in a_reg_str{a_msg}; \ - return a_reg_str; \ - }(msg) - -// Macro overloads of KVIKIO_NVTX_FUNC_RANGE -#define KVIKIO_NVTX_FUNC_RANGE_IMPL() NVTX3_FUNC_RANGE_IN(libkvikio_domain) - -#define KVIKIO_NVTX_SCOPED_RANGE_IMPL(msg, val) \ - nvtx3::scoped_range_in KVIKIO_CONCAT(_kvikio_nvtx_range, __LINE__) \ - { \ - nvtx3::event_attributes \ - { \ - KVIKIO_REGISTER_STRING(msg), nvtx3::payload { convert_to_64bit(val) } \ - } \ - } - -#define KVIKIO_NVTX_MARKER_IMPL(msg, val) \ - nvtx3::mark_in( \ - nvtx3::event_attributes{KVIKIO_REGISTER_STRING(msg), nvtx3::payload{convert_to_64bit(val)}}) - -#endif - -/** - * @brief Convenience macro for generating an NVTX range in the `libkvikio` domain - * from the lifetime of a function. - * - * Takes no argument. The name of the immediately enclosing function returned by `__func__` is used - * as the message. - * - * Example: - * ``` - * void some_function(){ - * KVIKIO_NVTX_FUNC_RANGE(); // The name `some_function` is used as the message - * ... - * } - * ``` - */ -#ifdef KVIKIO_CUDA_FOUND -#define KVIKIO_NVTX_FUNC_RANGE() KVIKIO_NVTX_FUNC_RANGE_IMPL() -#else -#define KVIKIO_NVTX_FUNC_RANGE(...) \ - do { \ - } while (0) -#endif - -/** - * @brief Convenience macro for generating an NVTX scoped range in the `libkvikio` domain to - * annotate a time duration. - * - * Takes two arguments (message, payload). - * - * Example: - * ``` - * void some_function(){ - * KVIKIO_NVTX_SCOPED_RANGE("my function", 42); - * ... - * } - * ``` - */ -#ifdef KVIKIO_CUDA_FOUND -#define KVIKIO_NVTX_SCOPED_RANGE(msg, val) KVIKIO_NVTX_SCOPED_RANGE_IMPL(msg, val) -#else -#define KVIKIO_NVTX_SCOPED_RANGE(msg, val) \ - do { \ - } while (0) -#endif - -/** - * @brief Convenience macro for generating an NVTX marker in the `libkvikio` domain to annotate a - * certain time point. - * - * Takes two arguments (message, payload). Use this macro to annotate asynchronous I/O operations, - * where the payload refers to the I/O size. - * - * Example: - * ``` - * std::future some_function(){ - * size_t io_size{2077}; - * KVIKIO_NVTX_MARKER("I/O operation", io_size); - * perform_async_io_operation(io_size); - * ... 
- * } - * ``` - */ -#ifdef KVIKIO_CUDA_FOUND -#define KVIKIO_NVTX_MARKER(message, payload) KVIKIO_NVTX_MARKER_IMPL(message, payload) -#else -#define KVIKIO_NVTX_MARKER(message, payload) \ - do { \ - } while (0) -#endif - } // namespace kvikio diff --git a/cpp/src/file_handle.cpp b/cpp/src/file_handle.cpp index 4e8376a285..7c960fb167 100644 --- a/cpp/src/file_handle.cpp +++ b/cpp/src/file_handle.cpp @@ -27,6 +27,7 @@ #include #include #include +#include namespace kvikio { @@ -105,13 +106,13 @@ std::size_t FileHandle::read(void* devPtr_base, std::size_t devPtr_offset, bool sync_default_stream) { + KVIKIO_NVTX_SCOPED_RANGE("FileHandle::read()", size); if (get_compat_mode_manager().is_compat_mode_preferred()) { return detail::posix_device_read( _file_direct_off.fd(), devPtr_base, size, file_offset, devPtr_offset); } if (sync_default_stream) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } - KVIKIO_NVTX_SCOPED_RANGE("cufileRead()", size); ssize_t ret = cuFileAPI::instance().Read(_cufile_handle.handle(), devPtr_base, size, @@ -127,6 +128,7 @@ std::size_t FileHandle::write(void const* devPtr_base, std::size_t devPtr_offset, bool sync_default_stream) { + KVIKIO_NVTX_SCOPED_RANGE("FileHandle::write()", size); _nbytes = 0; // Invalidate the computed file size if (get_compat_mode_manager().is_compat_mode_preferred()) { @@ -135,7 +137,6 @@ std::size_t FileHandle::write(void const* devPtr_base, } if (sync_default_stream) { CUDA_DRIVER_TRY(cudaAPI::instance().StreamSynchronize(nullptr)); } - KVIKIO_NVTX_SCOPED_RANGE("cufileWrite()", size); ssize_t ret = cuFileAPI::instance().Write(_cufile_handle.handle(), devPtr_base, size, @@ -158,7 +159,8 @@ std::future FileHandle::pread(void* buf, std::size_t gds_threshold, bool sync_default_stream) { - KVIKIO_NVTX_MARKER("FileHandle::pread()", size); + auto& [nvtx_color, call_idx] = detail::get_next_color_and_call_idx(); + KVIKIO_NVTX_SCOPED_RANGE("FileHandle::pread()", size, nvtx_color); if (is_host_memory(buf)) { auto op = [this](void* hostPtr_base, std::size_t size, @@ -169,7 +171,7 @@ std::future FileHandle::pread(void* buf, _file_direct_off.fd(), buf, size, file_offset); }; - return parallel_io(op, buf, size, file_offset, task_size, 0); + return parallel_io(op, buf, size, file_offset, task_size, 0, call_idx, nvtx_color); } CUcontext ctx = get_context_from_pointer(buf); @@ -198,7 +200,8 @@ std::future FileHandle::pread(void* buf, return read(devPtr_base, size, file_offset, devPtr_offset, /* sync_default_stream = */ false); }; auto [devPtr_base, base_size, devPtr_offset] = get_alloc_info(buf, &ctx); - return parallel_io(task, devPtr_base, size, file_offset, task_size, devPtr_offset); + return parallel_io( + task, devPtr_base, size, file_offset, task_size, devPtr_offset, call_idx, nvtx_color); } std::future FileHandle::pwrite(void const* buf, @@ -208,7 +211,8 @@ std::future FileHandle::pwrite(void const* buf, std::size_t gds_threshold, bool sync_default_stream) { - KVIKIO_NVTX_MARKER("FileHandle::pwrite()", size); + auto& [nvtx_color, call_idx] = detail::get_next_color_and_call_idx(); + KVIKIO_NVTX_SCOPED_RANGE("FileHandle::pwrite()", size, nvtx_color); if (is_host_memory(buf)) { auto op = [this](void const* hostPtr_base, std::size_t size, @@ -219,7 +223,7 @@ std::future FileHandle::pwrite(void const* buf, _file_direct_off.fd(), buf, size, file_offset); }; - return parallel_io(op, buf, size, file_offset, task_size, 0); + return parallel_io(op, buf, size, file_offset, task_size, 0, call_idx, nvtx_color); } CUcontext ctx = 
get_context_from_pointer(buf); @@ -248,7 +252,8 @@ std::future FileHandle::pwrite(void const* buf, return write(devPtr_base, size, file_offset, devPtr_offset, /* sync_default_stream = */ false); }; auto [devPtr_base, base_size, devPtr_offset] = get_alloc_info(buf, &ctx); - return parallel_io(op, devPtr_base, size, file_offset, task_size, devPtr_offset); + return parallel_io( + op, devPtr_base, size, file_offset, task_size, devPtr_offset, call_idx, nvtx_color); } void FileHandle::read_async(void* devPtr_base, diff --git a/cpp/src/nvtx.cpp b/cpp/src/nvtx.cpp new file mode 100644 index 0000000000..8611533a2f --- /dev/null +++ b/cpp/src/nvtx.cpp @@ -0,0 +1,94 @@ +/* + * Copyright (c) 2025, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include + +#ifdef KVIKIO_CUDA_FOUND +#include +#endif + +#include + +namespace kvikio { + +nvtx_manager& nvtx_manager::instance() noexcept +{ + static nvtx_manager _instance; + return _instance; +} + +const nvtx_color_type& nvtx_manager::default_color() noexcept +{ +#ifdef KVIKIO_CUDA_FOUND + static nvtx_color_type default_color{nvtx3::argb{0, 255, 255, 255}}; + return default_color; +#else + static nvtx_color_type dummy{}; + return dummy; +#endif +} + +const nvtx_color_type& nvtx_manager::get_color_by_index(std::uint64_t idx) noexcept +{ +#ifdef KVIKIO_CUDA_FOUND + constexpr std::size_t num_color{16}; + static_assert((num_color & (num_color - 1)) == 0); // Is power of 2 + static std::array color_palette = {nvtx3::rgb{106, 192, 67}, + nvtx3::rgb{191, 73, 203}, + nvtx3::rgb{93, 151, 76}, + nvtx3::rgb{96, 72, 194}, + nvtx3::rgb{179, 170, 71}, + nvtx3::rgb{92, 58, 113}, + nvtx3::rgb{212, 136, 57}, + nvtx3::rgb{96, 144, 194}, + nvtx3::rgb{211, 69, 56}, + nvtx3::rgb{97, 179, 155}, + nvtx3::rgb{203, 69, 131}, + nvtx3::rgb{57, 89, 48}, + nvtx3::rgb{184, 133, 199}, + nvtx3::rgb{128, 102, 51}, + nvtx3::rgb{211, 138, 130}, + nvtx3::rgb{122, 50, 49}}; + auto safe_idx = idx & (num_color - 1); // idx % num_color + return color_palette[safe_idx]; +#else + static nvtx_color_type dummy{}; + return dummy; +#endif +} + +void nvtx_manager::rename_current_thread(std::string_view new_name) noexcept +{ +#ifdef KVIKIO_CUDA_FOUND + auto tid = syscall(SYS_gettid); + std::stringstream ss; + ss << new_name << " (" << tid << ")"; + + nvtxResourceAttributes_t attribs = {0}; + attribs.version = NVTX_VERSION; + attribs.size = NVTX_RESOURCE_ATTRIB_STRUCT_SIZE; + attribs.identifierType = NVTX_RESOURCE_TYPE_GENERIC_THREAD_NATIVE; + attribs.identifier.ullValue = tid; + attribs.messageType = NVTX_MESSAGE_TYPE_ASCII; + attribs.message.ascii = ss.str().c_str(); + nvtxResourceHandle_t handle = + nvtxDomainResourceCreate(nvtx3::domain::get(), &attribs); +#endif +} + +} // namespace kvikio diff --git a/cpp/src/posix_io.cpp b/cpp/src/posix_io.cpp index 9576f284dc..ed149f5d43 100644 --- a/cpp/src/posix_io.cpp +++ b/cpp/src/posix_io.cpp @@ -22,6 +22,7 @@ #include #include +#include #include #include #include diff --git a/cpp/src/remote_handle.cpp 
b/cpp/src/remote_handle.cpp index 8ca04f94ed..1096cfac5a 100644 --- a/cpp/src/remote_handle.cpp +++ b/cpp/src/remote_handle.cpp @@ -25,6 +25,7 @@ #include #include +#include #include #include #include @@ -392,6 +393,7 @@ std::future RemoteHandle::pread(void* buf, std::size_t file_offset, std::size_t task_size) { + auto& [nvtx_color, call_idx] = detail::get_next_color_and_call_idx(); KVIKIO_NVTX_SCOPED_RANGE("RemoteHandle::pread()", size); auto task = [this](void* devPtr_base, std::size_t size, @@ -399,7 +401,7 @@ std::future RemoteHandle::pread(void* buf, std::size_t devPtr_offset) -> std::size_t { return read(static_cast(devPtr_base) + devPtr_offset, size, file_offset); }; - return parallel_io(task, buf, size, file_offset, task_size, 0); + return parallel_io(task, buf, size, file_offset, task_size, 0, call_idx, nvtx_color); } } // namespace kvikio diff --git a/cpp/src/utils.cpp b/cpp/src/utils.cpp index bed2cbafbc..fdc174f791 100644 --- a/cpp/src/utils.cpp +++ b/cpp/src/utils.cpp @@ -15,7 +15,6 @@ */ #include -#include #include #include #include @@ -23,10 +22,6 @@ #include #include -#ifdef KVIKIO_CUDA_FOUND -#include -#endif - #include #include #include From 77da58718ae6a829596bcf1e5cec33f208419514 Mon Sep 17 00:00:00 2001 From: Yiheng Wang <68361391+yiheng-wang-nv@users.noreply.github.com> Date: Tue, 18 Feb 2025 21:47:49 +0800 Subject: [PATCH 5/5] Add tutorials to showcase GDS for nifti and dicom images (#597) This PR addresses the tutorial contribution mentioned in #580 Authors: - Yiheng Wang (https://github.com/yiheng-wang-nv) - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: https://github.com/rapidsai/kvikio/pull/597 --- .../medical_dicom_image_loading_example.ipynb | 424 ++++++++++++++++++ .../medical_nifti_image_loading_example.ipynb | 377 ++++++++++++++++ 2 files changed, 801 insertions(+) create mode 100644 notebooks/medical_dicom_image_loading_example.ipynb create mode 100644 notebooks/medical_nifti_image_loading_example.ipynb diff --git a/notebooks/medical_dicom_image_loading_example.ipynb b/notebooks/medical_dicom_image_loading_example.ipynb new file mode 100644 index 0000000000..3e47693ee5 --- /dev/null +++ b/notebooks/medical_dicom_image_loading_example.ipynb @@ -0,0 +1,424 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "99b73d7d", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "This tutorial showcases how to use Kvikio to accelerate the loading of DICOM images. We will also utilize the `pydicom` library to handle these medical image formats.\n", + "\n", + "### Common Medical Image Formats\n", + "\n", + "Medical images are complex due to the extensive metadata they contain, which includes patient information, imaging parameters, and more. DICOM (Digital Imaging and Communications in Medicine) is one of the most common formats:\n", + "\n", + "- **Description**: The most widely used standard for storing and transmitting medical images. It includes metadata about the patient, imaging parameters, and more.\n", + "- **Usage**: Commonly used in hospitals and clinics for storing images from modalities like MRI, CT, X-ray, and ultrasound.\n", + "\n", + "### Extra Library Used\n", + "\n", + "#### pydicom\n", + "- **Description**: A Python library for working with DICOM files. 
It allows for reading, modifying, and writing DICOM data.\n", + "- **Usage**: Widely used in clinical and research settings to handle DICOM files.\n", + "\n", + "### GPU Acceleration with Kvikio\n", + "\n", + "Kvikio is a powerful tool that leverages GPU acceleration to significantly speed up the loading and processing of medical images. In this tutorial, we will demonstrate how to use Kvikio to efficiently handle DICOM images, providing a performance comparison between CPU and GPU processing.\n", + "\n", + "By the end of this tutorial, you will understand:\n", + "- How to load DICOM images using `pydicom`.\n", + "- How to accelerate the loading and processing of these images using Kvikio.\n", + "- The performance benefits of using GPU acceleration for medical image processing." + ] + }, + { + "cell_type": "markdown", + "id": "c45b7e8b", + "metadata": {}, + "source": [ + "### Setup Environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d2e043c2", + "metadata": {}, + "outputs": [], + "source": [ + "# Check if pydicom is installed, if not, install it\n", + "!python -c \"import pydicom\" || pip install -q pydicom" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47d412b4", + "metadata": {}, + "outputs": [], + "source": [ + "import kvikio\n", + "import kvikio.defaults\n", + "import cupy as cp\n", + "import tempfile\n", + "import pydicom\n", + "from pydicom.dataset import Dataset, FileDataset\n", + "import numpy as np\n", + "import os\n", + "import datetime\n", + "import requests\n", + "import tarfile\n", + "import gzip\n", + "import shutil\n", + "import io\n", + "from timeit import default_timer as timer" + ] + }, + { + "cell_type": "markdown", + "id": "b35ea6d3", + "metadata": {}, + "source": [ + "### Warmup Kvikio" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "cc03538d", + "metadata": {}, + "outputs": [], + "source": [ + "def warmup_kvikio():\n", + " \"\"\"\n", + " Warm up the Kvikio library to initialize the internal buffers, cuFile, GDS, etc.\n", + " \"\"\"\n", + " # warmup cuFile\n", + " a = cp.arange(100)\n", + " with tempfile.NamedTemporaryFile() as tmp_file:\n", + " tmp_file_name = tmp_file.name\n", + " f = kvikio.CuFile(tmp_file_name, \"w\")\n", + " f.write(a)\n", + " f.close()\n", + "\n", + " b = cp.empty_like(a)\n", + " f = kvikio.CuFile(tmp_file_name, \"r\")\n", + " f.read(b)\n", + "\n", + " # warmup cupy\n", + " c = cp.random.rand(100, 100, 3)\n", + " d = cp.mean(c)\n", + "\n", + "warmup_kvikio()" + ] + }, + { + "cell_type": "markdown", + "id": "df2d2e9d", + "metadata": {}, + "source": [ + "### Set Kvikio Threads\n", + "\n", + "KvikIO can automatically use multiple threads for I/O operations. Setting the environment variable `KVIKIO_NTHREADS` to the desired number of threads may improve performance. In this tutorial, 4 threads are used. For more details, refer to the [official documentation](https://docs.rapids.ai/api/kvikio/nightly/runtime_settings/#thread-pool-kvikio-nthreads)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "1ad596a4", + "metadata": {}, + "outputs": [], + "source": [ + "kvikio.defaults.num_threads_reset(nthreads=4)" + ] + }, + { + "cell_type": "markdown", + "id": "52790010", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "### DICOM Data Preparation\n", + "\n", + "A fake DICOM file will be prepared to test the performance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7b8bfbe1", + "metadata": {}, + "outputs": [], + "source": [ + "temp_working_dir = tempfile.mkdtemp()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "45f893a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Multi-frame DICOM file created at: /tmp/tmplw76c4_0/example.dcm\n" + ] + } + ], + "source": [ + "def create_multiframe_dicom(file_path, num_slices=128, pixel_array_shape=(1024, 1024), pixel_value_range=(0, 4095)):\n", + " # Create a new DICOM dataset\n", + " file_meta = pydicom.dataset.FileMetaDataset()\n", + " file_meta.MediaStorageSOPClassUID = pydicom.uid.generate_uid()\n", + " file_meta.MediaStorageSOPInstanceUID = pydicom.uid.generate_uid()\n", + " file_meta.ImplementationClassUID = pydicom.uid.PYDICOM_IMPLEMENTATION_UID\n", + " file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian\n", + "\n", + " # Create the FileDataset instance (inherits from Dataset)\n", + " ds = FileDataset(file_path, {}, file_meta=file_meta, preamble=b\"\\0\" * 128)\n", + "\n", + " # Set some basic DICOM attributes\n", + " ds.PatientName = \"Test^Patient\"\n", + " ds.PatientID = \"123456\"\n", + " ds.Modality = \"CT\"\n", + " ds.StudyInstanceUID = pydicom.uid.generate_uid()\n", + " ds.SeriesInstanceUID = pydicom.uid.generate_uid()\n", + " ds.SOPInstanceUID = file_meta.MediaStorageSOPInstanceUID\n", + " ds.StudyDate = datetime.date.today().strftime('%Y%m%d')\n", + " ds.ContentDate = datetime.date.today().strftime('%Y%m%d')\n", + " ds.StudyTime = datetime.datetime.now().strftime('%H%M%S')\n", + " ds.ContentTime = datetime.datetime.now().strftime('%H%M%S')\n", + "\n", + " # Set the pixel data with random integers\n", + " pixel_array = np.random.randint(\n", + " pixel_value_range[0],\n", + " pixel_value_range[1],\n", + " (num_slices, *pixel_array_shape),\n", + " dtype=np.uint16,\n", + " )\n", + " ds.Rows, ds.Columns = pixel_array_shape\n", + " ds.NumberOfFrames = num_slices\n", + " ds.PixelData = pixel_array.tobytes()\n", + " ds.SamplesPerPixel = 1\n", + " ds.PhotometricInterpretation = \"MONOCHROME2\"\n", + " ds.BitsAllocated = 16\n", + " ds.BitsStored = 16\n", + " ds.HighBit = 15\n", + " ds.PixelRepresentation = 0\n", + "\n", + " # Set multi-frame specific attributes\n", + " ds.PerFrameFunctionalGroupsSequence = []\n", + " for slice_index in range(num_slices):\n", + " frame = Dataset()\n", + " plane_position = Dataset()\n", + " plane_position.ImagePositionPatient = [0, 0, slice_index]\n", + " plane_orientation = Dataset()\n", + " plane_orientation.ImageOrientationPatient = [1, 0, 0, 0, 1, 0]\n", + " pixel_measures = Dataset()\n", + " pixel_measures.SliceThickness = 1\n", + "\n", + " frame.PlanePositionSequence = [plane_position]\n", + " frame.PlaneOrientationSequence = [plane_orientation]\n", + " frame.PixelMeasuresSequence = [pixel_measures]\n", + " ds.PerFrameFunctionalGroupsSequence.append(frame)\n", + "\n", + " ds.is_little_endian = True\n", + " ds.is_implicit_VR = True\n", + "\n", + " ds.save_as(file_path)\n", + " print(f\"Multi-frame DICOM file created at: {file_path}\")\n", + "\n", + "# Example usage\n", + "example_dcm_path = os.path.join(temp_working_dir, \"example.dcm\")\n", + "\n", + "create_multiframe_dicom(example_dcm_path)" + ] + }, + { + "cell_type": "markdown", + "id": "d0374c01", + "metadata": {}, + "source": [ + "### Test DICOM Data Loading" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "67ea8d22", + "metadata": {}, + 
"outputs": [], + "source": [ + "def load_image_cpu(file_path):\n", + " ds = pydicom.dcmread(file_path)\n", + " pixel_array = ds.pixel_array\n", + " return pixel_array" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "4eabb76a", + "metadata": {}, + "outputs": [], + "source": [ + "def load_image_gpu(file_path):\n", + " # set defer_size to prevent reading the entire file\n", + " dcm_read_data = pydicom.dcmread(file_path, defer_size=\"100 KB\")\n", + "\n", + " # Extract relevant attributes\n", + " rows = dcm_read_data.Rows\n", + " columns = dcm_read_data.Columns\n", + " bits_allocated = dcm_read_data.BitsAllocated\n", + " samples_per_pixel = dcm_read_data.SamplesPerPixel\n", + " number_of_frames = getattr(dcm_read_data, 'NumberOfFrames', 1)\n", + " pixel_representation = dcm_read_data.PixelRepresentation\n", + "\n", + " if bits_allocated == 8:\n", + " dtype = cp.int8 if pixel_representation == 1 else cp.uint8\n", + " elif bits_allocated == 16:\n", + " dtype = cp.int16 if pixel_representation == 1 else cp.uint16\n", + " elif bits_allocated == 32:\n", + " dtype = cp.int32 if pixel_representation == 1 else cp.uint32\n", + " else:\n", + " raise ValueError(\"Unsupported BitsAllocated value\")\n", + "\n", + " bytes_per_pixel = bits_allocated // 8\n", + " total_pixels = rows * columns * samples_per_pixel * number_of_frames\n", + " expected_pixel_data_length = total_pixels * bytes_per_pixel\n", + "\n", + " offset = dcm_read_data.get_item(0x7FE00010, keep_deferred=True).value_tell\n", + "\n", + " with kvikio.CuFile(file_path, \"r\") as f:\n", + " buffer = cp.empty(expected_pixel_data_length, dtype=cp.int8)\n", + " f.read(buffer, expected_pixel_data_length, offset)\n", + "\n", + " cupy_data_array = buffer.view(dtype).reshape((number_of_frames, rows, columns))\n", + "\n", + " return cupy_data_array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "006bcd54", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(128, 1024, 1024) 2046.878812596202\n", + "Kvikio GPU loading time: 0.0975 seconds\n" + ] + } + ], + "source": [ + "# Measure Kvikio GPU loading time\n", + "# the saved outputs are run with a Tesla V100-PCIE-16GB GPU\n", + "start_gpu = timer()\n", + "img_gpu = load_image_gpu(example_dcm_path)\n", + "print(img_gpu.shape, img_gpu.mean())\n", + "end_gpu = timer()\n", + "gpu_time = end_gpu - start_gpu\n", + "print(f\"Kvikio GPU loading time: {gpu_time:.4f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d048fb3e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(128, 1024, 1024) 2046.878812596202\n", + "Normal CPU loading time: 0.3950 seconds\n" + ] + } + ], + "source": [ + "# Measure CPU loading time\n", + "start_cpu = timer()\n", + "img_cpu = load_image_cpu(example_dcm_path)\n", + "print(img_cpu.shape, img_cpu.mean())\n", + "end_cpu = timer()\n", + "cpu_time = end_cpu - start_cpu\n", + "print(f\"Normal CPU loading time: {cpu_time:.4f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "997a7116", + "metadata": {}, + "source": [ + "### validate cpu and gpu data are close" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "15006dd8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "print(np.allclose(img_gpu, img_cpu))" + ] + }, + { + "cell_type": "markdown", + "id": "b801dc5b", + "metadata": {}, + "source": [ + "### 
Cleanup tmp Directory" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "0ba89f01", + "metadata": {}, + "outputs": [], + "source": [ + "shutil.rmtree(temp_working_dir)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/medical_nifti_image_loading_example.ipynb b/notebooks/medical_nifti_image_loading_example.ipynb new file mode 100644 index 0000000000..bc0ec2f837 --- /dev/null +++ b/notebooks/medical_nifti_image_loading_example.ipynb @@ -0,0 +1,377 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "99b73d7d", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "This tutorial showcases how to use Kvikio to accelerate the loading of NIFTI images. We will also utilize the `nibabel` library to handle this medical image format.\n", + "\n", + "### Common Medical Image Formats\n", + "\n", + "Medical images are complex due to the extensive metadata they contain, which includes patient information, imaging parameters, and more.\n", + "\n", + "NIfTI (Neuroimaging Informatics Technology Initiative) is one of the most common formats:\n", + "\n", + "- **Description**: A popular format for storing brain imaging data, particularly in research settings. It is designed to store volumetric data and is often used in neuroimaging.\n", + "- **Usage**: Widely used in neuroscience research and supported by many neuroimaging software packages.\n", + "\n", + "### Extra Library Used\n", + "\n", + "#### NiBabel\n", + "- **Description**: A Python library for reading and writing medical image formats, particularly NIfTI and Analyze.\n", + "- **Usage**: Commonly used in neuroimaging research for handling NIfTI files.\n", + "\n", + "### GPU Acceleration with Kvikio\n", + "\n", + "Kvikio is a powerful tool that leverages GPU acceleration to significantly speed up the loading and processing of medical images. In this tutorial, we will demonstrate how to use Kvikio to efficiently handle NIFTI images, providing a performance comparison between CPU and GPU processing.\n", + "\n", + "By the end of this tutorial, you will understand:\n", + "- How to load NIFTI images using `nibabel`.\n", + "- How to accelerate the loading and processing of these images using Kvikio.\n", + "- The performance benefits of using GPU acceleration for medical image processing." 
+ ] + }, + { + "cell_type": "markdown", + "id": "c45b7e8b", + "metadata": {}, + "source": [ + "### Setup Environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d2e043c2", + "metadata": {}, + "outputs": [], + "source": [ + "# Check if nibabel is installed, if not, install it\n", + "!python -c \"import nibabel\" || pip install -q nibabel" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47d412b4", + "metadata": {}, + "outputs": [], + "source": [ + "import kvikio\n", + "import kvikio.defaults\n", + "import cupy as cp\n", + "import numpy as np\n", + "import tempfile\n", + "import nibabel as nib\n", + "import os\n", + "import requests\n", + "import tarfile\n", + "import gzip\n", + "import shutil\n", + "import io\n", + "from timeit import default_timer as timer" + ] + }, + { + "cell_type": "markdown", + "id": "b35ea6d3", + "metadata": {}, + "source": [ + "### Warmup Kvikio" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "cc03538d", + "metadata": {}, + "outputs": [], + "source": [ + "def warmup_kvikio():\n", + " \"\"\"\n", + " Warm up the Kvikio library to initialize the internal buffers, cuFile, GDS, etc.\n", + " \"\"\"\n", + " # warmup cuFile\n", + " a = cp.arange(100)\n", + " with tempfile.NamedTemporaryFile() as tmp_file:\n", + " tmp_file_name = tmp_file.name\n", + " f = kvikio.CuFile(tmp_file_name, \"w\")\n", + " f.write(a)\n", + " f.close()\n", + "\n", + " b = cp.empty_like(a)\n", + " f = kvikio.CuFile(tmp_file_name, \"r\")\n", + " f.read(b)\n", + "\n", + " # warmup cupy\n", + " c = cp.random.rand(100, 100, 3)\n", + " d = cp.mean(c)\n", + "\n", + "warmup_kvikio()" + ] + }, + { + "cell_type": "markdown", + "id": "b1efcccc", + "metadata": {}, + "source": [ + "### Set Kvikio Threads\n", + "\n", + "KvikIO can automatically use multiple threads for I/O operations. Setting the environment variable `KVIKIO_NTHREADS` to the desired number of threads may improve performance. In this tutorial, 4 threads are used. For more details, refer to the [official documentation](https://docs.rapids.ai/api/kvikio/nightly/runtime_settings/#thread-pool-kvikio-nthreads)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "547213d4", + "metadata": {}, + "outputs": [], + "source": [ + "kvikio.defaults.num_threads_reset(nthreads=4)" + ] + }, + { + "cell_type": "markdown", + "id": "52790010", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "### NIFTI Data Preparation\n", + "\n", + "For NIFTI images, we will use the [MSD Spleen dataset](https://msd-for-monai.s3-us-west-2.amazonaws.com/Task09_Spleen.tar) from the [Medical Segmentation Decathlon](http://medicaldecathlon.com/dataaws/). This dataset is commonly used for training and evaluating medical image segmentation algorithms and provides a good example of volumetric medical imaging data.\n", + "\n", + "Larger datasets typically demonstrate more significant acceleration benefits when using GPU processing. If you are interested in comparing performance with a larger dataset, it is recommended to use images from the [MSD Liver dataset](https://msd-for-monai.s3-us-west-2.amazonaws.com/Task03_Liver.tar) for the following experiments. The MSD Liver dataset contains more extensive volumetric data, which can better showcase the advantages of GPU acceleration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "45f893a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extraction completed! Files are saved in: /tmp/tmpcjecciqo\n" + ] + } + ], + "source": [ + "temp_working_dir = tempfile.mkdtemp()\n", + "\n", + "nifti_output_path = os.path.join(temp_working_dir, \"Task09_Spleen.tar\")\n", + "url = \"https://msd-for-monai.s3-us-west-2.amazonaws.com/Task09_Spleen.tar\"\n", + "response = requests.get(url, stream=True)\n", + "with open(nifti_output_path, \"wb\") as file:\n", + " for chunk in response.iter_content(chunk_size=8192):\n", + " file.write(chunk)\n", + "\n", + "# Extract the contents\n", + "with tarfile.open(nifti_output_path, \"r\") as tar:\n", + " tar.extractall(path=temp_working_dir)\n", + "\n", + "print(f\"Extraction completed! Files are saved in: {temp_working_dir}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "028021e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "a decompressed nifti file is saved at: /tmp/tmpcjecciqo/Task09_Spleen/imagesTr/spleen_53.nii\n" + ] + } + ], + "source": [ + "# decompress the nii.gz file\n", + "example_nifti_path = os.path.join(temp_working_dir, \"Task09_Spleen\", \"imagesTr\", \"spleen_53.nii\")\n", + "with gzip.open(example_nifti_path+\".gz\", \"rb\") as f_in:\n", + " with open(example_nifti_path, \"wb\") as f_out:\n", + " shutil.copyfileobj(f_in, f_out)\n", + "print(\"a decompressed nifti file is saved at: \", example_nifti_path)" + ] + }, + { + "cell_type": "markdown", + "id": "d0374c01", + "metadata": {}, + "source": [ + "### Test NIFTI Data Loading" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "fac454d7", + "metadata": {}, + "outputs": [], + "source": [ + "def nifti_gpu_load(filename):\n", + " file_size = os.path.getsize(filename)\n", + " image = cp.empty(file_size, dtype=cp.uint8)\n", + "\n", + " with kvikio.CuFile(filename, \"r\") as f:\n", + " f.read(image)\n", + "\n", + " header_bytes = cp.asnumpy(image[:348])\n", + " header = nib.Nifti1Header.from_fileobj(io.BytesIO(header_bytes))\n", + " data_offset = header.get_data_offset()\n", + " data_shape = header.get_data_shape()\n", + " data_dtype = header.get_data_dtype()\n", + " affine = header.get_best_affine()\n", + " meta = dict(header)\n", + " meta[\"affine\"] = affine\n", + " return image[data_offset:].view(data_dtype).reshape(data_shape, order=\"F\"), meta" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "006bcd54", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(512, 512, 156) -474.2267\n", + "Kvikio GPU loading time: 0.0505 seconds\n" + ] + } + ], + "source": [ + "# Measure Kvikio GPU loading time\n", + "# the saved outputs are run with a Tesla V100-PCIE-16GB GPU\n", + "start_gpu = timer()\n", + "img_gpu, meta_gpu = nifti_gpu_load(example_nifti_path)\n", + "print(img_gpu.shape, img_gpu.mean())\n", + "end_gpu = timer()\n", + "gpu_time = end_gpu - start_gpu\n", + "print(f\"Kvikio GPU loading time: {gpu_time:.4f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d048fb3e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(512, 512, 156) -474.22673315879626\n", + "Normal CPU loading time: 0.1699 seconds\n" + ] + } + ], + "source": [ + "# Measure CPU loading time\n", + "start_cpu = timer()\n", + "img_cpu = 
nib.load(example_nifti_path)\n", + "img_cpu_array = img_cpu.get_fdata()\n", + "print(img_cpu_array.shape, img_cpu_array.mean())\n", + "end_cpu = timer()\n", + "cpu_time = end_cpu - start_cpu\n", + "print(f\"Normal CPU loading time: {cpu_time:.4f} seconds\")" + ] + }, + { + "cell_type": "markdown", + "id": "997a7116", + "metadata": {}, + "source": [ + "### validate cpu and gpu data are close" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "72475fa5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "# validate affine\n", + "print(np.all(img_cpu.affine == meta_gpu[\"affine\"]))" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "784ee851", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "# validate array\n", + "print(np.allclose(img_cpu_array, img_gpu))" + ] + }, + { + "cell_type": "markdown", + "id": "7422f5a6", + "metadata": {}, + "source": [ + "### Cleanup tmp Directory" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "bc85278e", + "metadata": {}, + "outputs": [], + "source": [ + "shutil.rmtree(temp_working_dir)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}