
int overflow when using cuda backend #287

Open · amiguel256 opened this issue Mar 26, 2025 · 4 comments

amiguel256 commented Mar 26, 2025

Hello,

I'm currently using AMGCL with the CUDA backend to solve Poisson's equation (which results in a 7-point stencil, so approximately N^3 * 7 nonzeros). I have tested the problem with "small" matrix sizes (about N=100) and it works perfectly. However, when I get to sizes of about N=800 (nnz > 3 billion) the solver breaks because of int overflow. Creating the sparse matrix with the std::tuple method works just fine, since col and ptr are of type ptrdiff_t. However, the initial setup step (Solver solve(A, prm, bprm);) breaks because of an indexing error. I have noticed that the CUDA backend is set to CUSPARSE_INDEX_32I, but I know cuSPARSE also supports CUSPARSE_INDEX_64I.
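
For reference, the overflow is easy to confirm with a standalone check (a sketch, independent of the solver code): for a 7-point stencil on an N^3 grid, nnz crosses INT_MAX somewhere between N=100 and N=800.

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    for (std::int64_t N : {100, 800}) {
        std::int64_t nnz = N * N * N * 7;
        std::printf("N = %3lld: nnz = %10lld (INT_MAX = %d)\n",
                (long long)N, (long long)nnz, INT_MAX);
        // N = 100: nnz =     7,000,000 -- fits in a 32-bit int
        // N = 800: nnz = 3,584,000,000 -- overflows a 32-bit int
    }
}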

This is my current setup:

// Initializes and gets name of GPU hardware...
int device;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);
std::cout<<"\nUsing "<<prop.name<<" to solve..."<<std::endl;

// Initializing the profiler (helps to get information about solver steps)...
amgcl::profiler<> prof;

// Initializing AMGCL solver (AMG preconditioner and BiCGStab solver)...
typedef amgcl::backend::cuda<double> Backend;
typedef amgcl::make_solver<
  amgcl::amg<
      Backend,
      amgcl::coarsening::smoothed_aggregation,
      amgcl::relaxation::spai0
      >,
  amgcl::solver::bicgstab<Backend>
  > Solver;

// Initializing the cuSPARSE library and passing the handle to AMGCL via the backend parameters (currently using default values)...
Backend::params bprm;
cusparseCreate(&bprm.cusparse_handle);

// Initializing solver parameters (to control tolerance, iters, etc.)... (currently using default values)   
Solver::params prm;

// Initializing sparse matrix (CSR format)...
std::vector<std::ptrdiff_t> ptr, col;
std::vector<double> val, rhs, x;
int iters;
double error;

ptr.clear(); ptr.reserve(M + 1); ptr.push_back(0);
col.clear(); col.reserve(M * 7); // We get a 7-point stencil from "del(D*del(C))=0" using FVM,
val.clear(); val.reserve(M * 7); // so the matrix will have at most (Mx*My*Mz) * 7 nonzero elements.

rhs.resize(M);
x.resize(M);

and then I solve the system as:

prof.tic("setup");
auto A = std::tie(M, ptr, col, val);
Solver solve(A, prm, bprm);
prof.toc("setup");
std::cout << solve << std::endl;

thrust::device_vector<double> f(rhs);
thrust::device_vector<double> xSol(x);	

prof.tic("solve");
std::tie(iters, error) = solve(f, xSol);
prof.toc("solve");

Another thing I'm worried about is that the problem size may be too large for the GPU to handle (the GPU has 80GB of global memory). Is there a better way to handle these types of problems (maybe through VexCL)? Any help is greatly appreciated. Apologies in advance, as this is my first time posting on GitHub.
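
For scale, a rough back-of-the-envelope estimate of the CSR storage alone (a sketch assuming 64-bit indices; the AMG hierarchy adds its own operators on top of this):

#include <cstdint>
#include <cstdio>

int main() {
    std::int64_t N = 800, M = N * N * N, nnz = M * 7;
    double gib = 1024.0 * 1024.0 * 1024.0;
    double bytes = nnz * sizeof(double)            // values
                 + nnz * sizeof(std::int64_t)      // 64-bit column indices
                 + (M + 1) * sizeof(std::int64_t); // row pointers
    std::printf("CSR matrix alone: %.1f GiB\n", bytes / gib); // ~57 GiB
}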

ddemidov (Owner) commented:

At first glance, it seems like support for 64-bit indices could be added with some small changes here:

and possibly some other places I missed. I guess these changes could be made dependent on the backend template parameters (similar to the builtin backend).

But the real question is, as you noted, whether the matrix would fit into the GPU memory. If you use 64-bit indices, the memory requirements will grow, and the supported problem size will be further reduced. AMGCL will only work if the whole problem fits into the GPU memory. If you want to solve a bigger system, you will need multiple GPUs and MPI (or maybe a VexCL backend with multiple GPUs in the context).
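
A minimal sketch of that last option, a VexCL backend with multiple GPUs in one context (illustrative only; it assumes the rest of the setup from the original post, and the bprm.q / vex::Filter usage follows AMGCL's and VexCL's documented examples):

#include <amgcl/backend/vexcl.hpp>

// Gather every double-precision-capable GPU in the system into one context.
vex::Context ctx(vex::Filter::GPU && vex::Filter::DoublePrecision);

typedef amgcl::backend::vexcl<double> Backend;

Backend::params bprm;
bprm.q = ctx; // hand all device queues to AMGCL

// The Solver typedef and construction then stay the same:
// Solver solve(A, prm, bprm);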

amiguel256 commented Mar 26, 2025

Thank you so much for the prompt response. I actually made these changes in cuda.hpp and it works! But yes, at this point the problem is memory, reaching greater than 80GB for the problems I need to work with. The interesting part is that the solver takes about 26GB but the preconditioner takes >100GB. I might play around with the preconditioner parameters or other preconditioners; however, spai0 was the one giving the best performance, if I recall correctly.
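
One knob that might reduce the setup memory (a sketch, untested for this problem; LeanSolver is just an illustrative name): plain aggregation keeps the prolongation operators sparser than smoothed aggregation, at the likely cost of more iterations.

#include <amgcl/coarsening/aggregation.hpp>

// Same Backend as in the original post; only the coarsening is swapped.
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::aggregation, // instead of smoothed_aggregation
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<Backend>
    > LeanSolver;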

ddemidov (Owner) commented:

If your changes are backward compatible (they won't break the older code using the cuda backend), then a PR is welcome. Otherwise, if you care to share the patch, I'll try to find some time and see if it is possible to make it backward compatible.

amiguel256 commented:

I don't think they would break the older code. The only changes I made were changing the types of ptr and col to int64_t, and CUSPARSE_INDEX_32I to CUSPARSE_INDEX_64I. I'm still getting familiarized with GitHub, so I'm not sure yet how to share this with you.
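
For concreteness, a sketch of the kind of change described (illustrative only, not the actual amgcl/backend/cuda.hpp patch; make_csr_64 is a hypothetical helper): the cuSPARSE generic-API descriptor is created with 64-bit index types.

#include <cstdint>
#include <cusparse.h>

// Hypothetical helper: build a CSR descriptor with 64-bit indices.
// ptr, col, val are device pointers.
cusparseSpMatDescr_t make_csr_64(
        std::int64_t rows, std::int64_t cols, std::int64_t nnz,
        std::int64_t *ptr, std::int64_t *col, double *val)
{
    cusparseSpMatDescr_t A;
    cusparseCreateCsr(&A, rows, cols, nnz, ptr, col, val,
            CUSPARSE_INDEX_64I,       // row-offset type (was CUSPARSE_INDEX_32I)
            CUSPARSE_INDEX_64I,       // column-index type (was CUSPARSE_INDEX_32I)
            CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    return A;
}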
