CuWeaver is a CUDA concurrency library designed to simplify parallel programming by automating concurrency flow management. It provides C++-style wrappers for selected CUDA Runtime APIs and helps reduce the complexity of managing concurrency in multi-GPU environments.
- Concurrency Automation: Automatically manages memory streams, execution streams, and event pools for each GPU, ensuring isolation of memory and computation operations.
- Multi-GPU Simplification: Optimizes memory management and kernel invocation in multi-GPU environments, reducing the complexity of cross-GPU development.
- Event-Driven Dependency Management: Ensures correct data access order by maintaining dependencies between operations, preventing data races.
- Modern C++ Wrappers: Provides C++-style wrappers for CUDA's native C API, leveraging RAII and move semantics to simplify resource management.
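As an illustration of that last point, the RAII-plus-move idiom over a raw CUDA handle looks roughly like the sketch below. This is a generic example of the pattern, not CuWeaver's actual class; the name `ScopedStream` is hypothetical.

```cpp
#include <cuda_runtime.h>
#include <utility>

// Generic sketch of the RAII + move-semantics idiom applied to a CUDA handle.
// The class name and layout are illustrative, not CuWeaver's real API.
class ScopedStream {
public:
    ScopedStream() { cudaStreamCreate(&stream_); }                // acquire on construction
    ~ScopedStream() { if (stream_) cudaStreamDestroy(stream_); }  // release on destruction

    ScopedStream(const ScopedStream&) = delete;                   // ownership is unique,
    ScopedStream& operator=(const ScopedStream&) = delete;        // so copying is forbidden

    ScopedStream(ScopedStream&& other) noexcept
        : stream_(std::exchange(other.stream_, nullptr)) {}       // moves transfer ownership
    ScopedStream& operator=(ScopedStream&& other) noexcept {
        if (this != &other) {
            if (stream_) cudaStreamDestroy(stream_);
            stream_ = std::exchange(other.stream_, nullptr);
        }
        return *this;
    }

    cudaStream_t get() const { return stream_; }                  // access the raw handle

private:
    cudaStream_t stream_{nullptr};
};
```

The destructor guarantees the stream is released even on early returns or exceptions, which is the core benefit of this idiom over the raw C API.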
To build CuWeaver, you will need:
- CMake: version 3.18 or higher
- C++ compiler: with support for C++17 or later
- CUDA driver
- CUDA Runtime: version 10.1 or higher
This project uses CMake for building and installation. To compile and install the library, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/CuWeaver.git
  cd CuWeaver
  ```

- Create a build directory:

  ```bash
  mkdir build
  cd build
  ```

- Generate the build files:

  ```bash
  cmake ..
  ```

- Build the project:

  ```bash
  cmake --build .
  ```

- Install the library:

  ```bash
  sudo cmake --install .
  ```
The library will be installed as a static library.
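Since the build is CMake-based, a consuming project would typically link the installed library through CMake as well. The package and target names below (`cuweaver`, `cuweaver::cuweaver`) are assumptions for illustration; check the repository's install rules for the exact names it exports.

```cmake
# Hypothetical consumer CMakeLists.txt; package/target names are assumed.
cmake_minimum_required(VERSION 3.18)
project(my_app LANGUAGES CXX CUDA)

find_package(cuweaver REQUIRED)              # assumes an exported cuweaver package
add_executable(my_app main.cu)
target_compile_features(my_app PRIVATE cxx_std_17)
target_link_libraries(my_app PRIVATE cuweaver::cuweaver)  # assumed imported target
```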
Once the library is installed, you can use it to manage concurrency in your CUDA applications. CuWeaver provides simple, intuitive C++ interfaces that wrap the CUDA Runtime API and manage memory and event flows. The demo below fills two device arrays and adds them, leaving stream selection and dependency tracking to the StreamManager:
```cpp
// CuWeaver - Automatic CUDA Stream Management Demo
#include <cuweaver/StreamManager.cuh>

#include <cuda_runtime.h>
#include <iostream>

__global__ void fillKernel(int* data, int value, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) data[idx] = value;
}

__global__ void addKernel(int* c, const int* a, const int* b, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) c[idx] = a[idx] + b[idx];
}

int main() {
    using namespace cuweaver;
    auto manager = &StreamManager::getInstance();
    manager->initialize(50, 8, 4); // Event pool size, execution streams, resource streams

    int *a, *b, *c;
    constexpr size_t size = 1 << 20;

    // GPU memory allocation
    manager->malloc(&a, size * sizeof(int), 0);
    manager->malloc(&b, size * sizeof(int), 0);
    manager->malloc(&c, size * sizeof(int), 0);

    dim3 grid((size + 511) / 512), block(512);

    // Concurrent operations - automatic stream management and dependency tracking.
    // Note: all StreamManager operations are non-blocking "submissions" to CUDA;
    // they return immediately without blocking the host thread.
    manager->launchKernel(fillKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(a, 0), 1, size);
    manager->launchKernel(fillKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(b, 0), 2, size);

    // Automatically waits for a and b to be ready before execution.
    // This submission also returns immediately - the dependency is handled by CUDA streams.
    manager->launchKernel(addKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(c, 0), makeRead(a, 0), makeRead(b, 0), size);

    // Automatically waits for the computation to complete before copying.
    // This memcpy submission is also non-blocking to the host thread.
    auto h_result = new int[size];
    manager->memcpy(h_result, Host, c, 0, size * sizeof(int), memcpyFlags::DeviceToHost);

    cudaDeviceSynchronize(); // Explicit synchronization to wait for all GPU operations to complete

    // Verify results: every element should be 3 (1 + 2)
    std::cout << "Result: ";
    for (int i = 0; i < 5; ++i) std::cout << h_result[i] << " ";
    std::cout << std::endl;

    // Clean up resources
    delete[] h_result;
    manager->free(a, 0);
    manager->free(b, 0);
    manager->free(c, 0);
    return 0;
}
```

- Automatic Flow Management: CuWeaver automatically separates memory operations (e.g., allocation, copying) from computation (e.g., kernel launches) by dispatching them to distinct memory and execution streams.
- Multi-GPU Optimization: CuWeaver simplifies GPU-to-GPU communication by automatically handling memory transfers across GPUs when required.
- Event-Driven Dependency Maintenance: Dependencies between operations are managed using events to ensure that operations are executed in the correct order without data races.
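To make the last point concrete, here is the dependency pattern written by hand with plain CUDA Runtime calls. This sketch shows the general technique CuWeaver automates; it is not CuWeaver's internal implementation.

```cpp
#include <cuda_runtime.h>

// Hand-written event-driven stream chaining: record an event on the producer
// stream, then make the consumer stream wait on it. Work submitted to
// `consumer` after the wait cannot start until `producer` reaches the event.
void chainStreams(cudaStream_t producer, cudaStream_t consumer) {
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);  // no timing needed
    cudaEventRecord(ready, producer);         // marks "producer's work so far is done"
    cudaStreamWaitEvent(consumer, ready, 0);  // consumer stalls until the event fires
    cudaEventDestroy(ready);                  // safe: the wait is already enqueued
}
```

Doing this by hand for every producer/consumer pair, and recycling events to avoid allocation churn, is exactly the bookkeeping that the event pool and automatic dependency tracking remove.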
- C++-style CUDA Runtime API Wrappers (Completed): Wrapping core CUDA Runtime functions (such as cudaMalloc) with C++-style abstractions using RAII and move semantics.
- Automatic Concurrency Management (Completed): Automating flow control for memory and computation streams, including event-driven dependency management between operations.
- Simplified Multi-GPU Management (In Progress): Streamlining memory management and kernel invocation for multi-GPU systems, with automatic memory transfers and optimizations for cross-GPU communication (see the sketch after this list).
- Automated, Portable CUDA Memory Management (Planned): Developing a more streamlined, portable approach to memory management across different CUDA devices, including automatic memory allocation, transfers, and synchronization.
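For context on the multi-GPU items above, the kind of boilerplate they aim to absorb looks roughly like the hand-rolled peer-to-peer copy below, written with plain CUDA Runtime calls. It is a sketch: the function name is illustrative and error handling is omitted.

```cpp
#include <cuda_runtime.h>

// Hand-rolled cross-GPU transfer. cudaMemcpyPeerAsync stages the copy through
// host memory automatically when direct peer access is unavailable, but
// enabling peer access first lets it go directly over NVLink/PCIe.
void copyAcrossGpus(int* dst, int dstDev, const int* src, int srcDev,
                    size_t count, cudaStream_t stream) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dstDev, srcDev);
    if (canAccess) {
        cudaSetDevice(dstDev);
        // May return cudaErrorPeerAccessAlreadyEnabled on repeat calls; that is benign.
        cudaDeviceEnablePeerAccess(srcDev, 0);
    }
    cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, count * sizeof(int), stream);
}
```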
We welcome contributions from the community! Please see the Contributing Guide for detailed guidelines on how to get involved.
For instructions on building and running the test suite, please refer to the Testing Guide.
CuWeaver is licensed under the MIT License.