CuWeaver is a CUDA concurrency library designed to simplify parallel programming by automating concurrency flow management. It provides C++-style wrappers for selected CUDA Runtime APIs and helps reduce the complexity of managing concurrency in multi-GPU environments.
- Concurrency Automation: Automatically manages memory streams, execution streams, and event pools for each GPU, ensuring isolation of memory and computation operations.
- Multi-GPU Simplification: Optimizes memory management and kernel invocation in multi-GPU environments, reducing the complexity of cross-GPU development.
- Event-Driven Dependency Management: Ensures correct data access order by maintaining dependencies between operations, preventing data races.
- Modern C++ Wrappers: Provides C++-style wrappers for CUDA's native C API, leveraging RAII and move semantics to simplify resource management.
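As an illustration of that last point, the RAII-plus-move idiom over a raw CUDA handle looks roughly like the sketch below. This is a generic example of the pattern, not CuWeaver's actual class; the name `ScopedStream` is hypothetical.

```cpp
#include <cuda_runtime.h>
#include <utility>

// Generic sketch of the RAII + move-semantics idiom applied to a CUDA handle.
// The class name and layout are illustrative, not CuWeaver's real API.
class ScopedStream {
public:
    ScopedStream() { cudaStreamCreate(&stream_); }                // acquire on construction
    ~ScopedStream() { if (stream_) cudaStreamDestroy(stream_); }  // release on destruction

    ScopedStream(const ScopedStream&) = delete;                   // ownership is unique,
    ScopedStream& operator=(const ScopedStream&) = delete;        // so copying is forbidden

    ScopedStream(ScopedStream&& other) noexcept
        : stream_(std::exchange(other.stream_, nullptr)) {}       // moves transfer ownership
    ScopedStream& operator=(ScopedStream&& other) noexcept {
        if (this != &other) {
            if (stream_) cudaStreamDestroy(stream_);
            stream_ = std::exchange(other.stream_, nullptr);
        }
        return *this;
    }

    cudaStream_t get() const { return stream_; }                  // access the raw handle

private:
    cudaStream_t stream_{nullptr};
};
```

The destructor guarantees the stream is released even on early returns or exceptions, which is the core benefit of this idiom over the raw C API.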
To build CuWeaver, you will need:
- CMake: version 3.18 or higher
- C++ compiler: with support for C++17 or later
- CUDA driver
- CUDA Runtime: version 10.1 or higher
This project uses CMake for building and installation. To compile and install the library, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/CuWeaver.git
  cd CuWeaver
  ```

- Create a build directory:

  ```bash
  mkdir build
  cd build
  ```

- Generate the build files:

  ```bash
  cmake ..
  ```

- Build the project:

  ```bash
  cmake --build .
  ```

- Install the library:

  ```bash
  sudo cmake --install .
  ```
The library will be installed as a static library.
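Since the build is CMake-based, a consuming project would typically link the installed library through CMake as well. The package and target names below (`cuweaver`, `cuweaver::cuweaver`) are assumptions for illustration; check the repository's install rules for the exact names it exports.

```cmake
# Hypothetical consumer CMakeLists.txt; package/target names are assumed.
cmake_minimum_required(VERSION 3.18)
project(my_app LANGUAGES CXX CUDA)

find_package(cuweaver REQUIRED)              # assumes an exported cuweaver package
add_executable(my_app main.cu)
target_compile_features(my_app PRIVATE cxx_std_17)
target_link_libraries(my_app PRIVATE cuweaver::cuweaver)  # assumed imported target
```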
Once the library is installed, you can use it to manage concurrency in your CUDA applications. CuWeaver provides simple, intuitive C++ interfaces that wrap the CUDA Runtime API and manage memory and event flows. The demo below fills two device arrays and adds them, leaving stream selection and dependency tracking to the StreamManager:
```cpp
// CuWeaver - Automatic CUDA Stream Management Demo
#include <cuweaver/StreamManager.cuh>

#include <cuda_runtime.h>
#include <iostream>

__global__ void fillKernel(int* data, int value, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) data[idx] = value;
}

__global__ void addKernel(int* c, const int* a, const int* b, size_t size) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) c[idx] = a[idx] + b[idx];
}

int main() {
    using namespace cuweaver;
    auto manager = &StreamManager::getInstance();
    manager->initialize(50, 8, 4); // Event pool size, execution streams, resource streams

    int *a, *b, *c;
    constexpr size_t size = 1 << 20;

    // GPU memory allocation
    manager->malloc(&a, size * sizeof(int), 0);
    manager->malloc(&b, size * sizeof(int), 0);
    manager->malloc(&c, size * sizeof(int), 0);

    dim3 grid((size + 511) / 512), block(512);

    // Concurrent operations - automatic stream management and dependency tracking.
    // Note: all StreamManager operations are non-blocking "submissions" to CUDA;
    // they return immediately without blocking the host thread.
    manager->launchKernel(fillKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(a, 0), 1, size);
    manager->launchKernel(fillKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(b, 0), 2, size);

    // Automatically waits for a and b to be ready before execution.
    // This submission also returns immediately - the dependency is handled by CUDA streams.
    manager->launchKernel(addKernel, grid, block, 0, deviceFlags::Auto,
                          makeWrite(c, 0), makeRead(a, 0), makeRead(b, 0), size);

    // Automatically waits for the computation to complete before copying.
    // This memcpy submission is also non-blocking to the host thread.
    auto h_result = new int[size];
    manager->memcpy(h_result, Host, c, 0, size * sizeof(int), memcpyFlags::DeviceToHost);

    cudaDeviceSynchronize(); // Explicit synchronization to wait for all GPU operations to complete

    // Verify results: every element should be 3 (1 + 2)
    std::cout << "Result: ";
    for (int i = 0; i < 5; ++i) std::cout << h_result[i] << " ";
    std::cout << std::endl;

    // Clean up resources
    delete[] h_result;
    manager->free(a, 0);
    manager->free(b, 0);
    manager->free(c, 0);
    return 0;
}
```

- Automatic Flow Management: CuWeaver automatically separates memory operations (e.g., allocation, copying) from computation (e.g., kernel launches) by dispatching them to distinct memory and execution streams.
- Multi-GPU Optimization: CuWeaver simplifies GPU-to-GPU communication by automatically handling memory transfers across GPUs when required.
- Event-Driven Dependency Maintenance: Dependencies between operations are managed using events to ensure that operations are executed in the correct order without data races.
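To make the last point concrete, here is the dependency pattern written by hand with plain CUDA Runtime calls. This sketch shows the general technique CuWeaver automates; it is not CuWeaver's internal implementation.

```cpp
#include <cuda_runtime.h>

// Hand-written event-driven stream chaining: record an event on the producer
// stream, then make the consumer stream wait on it. Work submitted to
// `consumer` after the wait cannot start until `producer` reaches the event.
void chainStreams(cudaStream_t producer, cudaStream_t consumer) {
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);  // no timing needed
    cudaEventRecord(ready, producer);         // marks "producer's work so far is done"
    cudaStreamWaitEvent(consumer, ready, 0);  // consumer stalls until the event fires
    cudaEventDestroy(ready);                  // safe: the wait is already enqueued
}
```

Doing this by hand for every producer/consumer pair, and recycling events to avoid allocation churn, is exactly the bookkeeping that the event pool and automatic dependency tracking remove.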
- C++-style CUDA Runtime API Wrappers (Completed): Wrapping core CUDA Runtime functions (such as cudaMalloc) with C++-style abstractions using RAII and move semantics.
- Automatic Concurrency Management (Completed): Automating flow control for memory and computation streams, including event-driven dependency management between operations.
- Simplified Multi-GPU Management (In Progress): Streamlining memory management and kernel invocation for multi-GPU systems, with automatic memory transfers and optimizations for cross-GPU communication (see the sketch after this list).
- Automated, Portable CUDA Memory Management (Planned): Developing a more streamlined, portable approach to memory management across different CUDA devices, including automatic memory allocation, transfers, and synchronization.
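For context on the multi-GPU items above, the kind of boilerplate they aim to absorb looks roughly like the hand-rolled peer-to-peer copy below, written with plain CUDA Runtime calls. It is a sketch: the function name is illustrative and error handling is omitted.

```cpp
#include <cuda_runtime.h>

// Hand-rolled cross-GPU transfer. cudaMemcpyPeerAsync stages the copy through
// host memory automatically when direct peer access is unavailable, but
// enabling peer access first lets it go directly over NVLink/PCIe.
void copyAcrossGpus(int* dst, int dstDev, const int* src, int srcDev,
                    size_t count, cudaStream_t stream) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dstDev, srcDev);
    if (canAccess) {
        cudaSetDevice(dstDev);
        // May return cudaErrorPeerAccessAlreadyEnabled on repeat calls; that is benign.
        cudaDeviceEnablePeerAccess(srcDev, 0);
    }
    cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, count * sizeof(int), stream);
}
```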
We welcome contributions from the community! Please see the Contributing Guide for detailed guidelines on how to get involved.
For instructions on building and running the test suite, please refer to the Testing Guide.
CuWeaver is licensed under the MIT License.