
int overflow when using cuda backend #287

Open · amiguel256 opened this issue Mar 26, 2025 · 4 comments

amiguel256 commented Mar 26, 2025

Hello,

I'm currently using AMGCL with the CUDA backend to solve Poisson's equation (which results in a 7-point stencil, so approximately N^3 * 7 nonzeros). I have tested the problem with "small" matrix sizes (about N=100) and it works perfectly. However, when I get to sizes of about N=800 (nnz > 3 billion) the solver breaks because of int overflow. Creating the sparse matrix with the std::tuple method works just fine, since col and ptr are of type ptrdiff_t. However, the initial setup step (Solver solve(A, prm, bprm);) breaks because of an indexing error. I have noticed that the CUDA backend is set to CUSPARSE_INDEX_32I, but I know cuSPARSE also supports CUSPARSE_INDEX_64I.
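
For reference, the overflow is easy to confirm with a standalone check (a sketch, independent of the solver code): for a 7-point stencil on an N^3 grid, nnz crosses INT_MAX somewhere between N=100 and N=800.

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    for (std::int64_t N : {100, 800}) {
        std::int64_t nnz = N * N * N * 7;
        std::printf("N = %3lld: nnz = %10lld (INT_MAX = %d)\n",
                (long long)N, (long long)nnz, INT_MAX);
        // N = 100: nnz =     7,000,000 -- fits in a 32-bit int
        // N = 800: nnz = 3,584,000,000 -- overflows a 32-bit int
    }
}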

This is my current setup:

// Initializes and gets name of GPU hardware...
int device;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);
std::cout<<"\nUsing "<<prop.name<<" to solve..."<<std::endl;

// Initializing the profiler (helps to get information about solver steps)...
amgcl::profiler<> prof;

// Initializing AMGCL solver (AMG preconditioner and BiCGStab solver)...
typedef amgcl::backend::cuda<double> Backend;
typedef amgcl::make_solver<
  amgcl::amg<
      Backend,
      amgcl::coarsening::smoothed_aggregation,
      amgcl::relaxation::spai0
      >,
  amgcl::solver::bicgstab<Backend>
  > Solver;

// Initializing the cuSPARSE library and passing the handle to AMGCL via the backend parameters (currently using default values)...
Backend::params bprm;
cusparseCreate(&bprm.cusparse_handle);

// Initializing solver parameters (to control tolerance, iters, etc.)... (currently using default values)   
Solver::params prm;

// Initializing sparse matrix (CSR format)...
std::vector<std::ptrdiff_t> ptr, col;
std::vector<double> val, rhs, x;
int iters;
double error;

ptr.clear(); ptr.reserve(M + 1); ptr.push_back(0);
col.clear(); col.reserve(M * 7); // We get a 7-point stencil from "del(D*del(C))=0" using FVM,
val.clear(); val.reserve(M * 7); // so the matrix will have at most (Mx*My*Mz) * 7 nonzero elements.

rhs.resize(M);
x.resize(M);

and then I solve the system as:

prof.tic("setup");
auto A = std::tie(M, ptr, col, val);
Solver solve(A, prm, bprm);
prof.toc("setup");
std::cout << solve << std::endl;

thrust::device_vector<double> f(rhs);
thrust::device_vector<double> xSol(x);	

prof.tic("solve");
std::tie(iters, error) = solve(f, xSol);
prof.toc("solve");

Another thing I'm worried about is that the problem size may be too large for the GPU to handle (the GPU has 80GB of global memory). Is there a better way to handle these types of problems (maybe through VexCL)? Any help is greatly appreciated. Apologies in advance, as this is my first time posting on GitHub.
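
For scale, a rough back-of-the-envelope estimate of the CSR storage alone (a sketch assuming 64-bit indices; the AMG hierarchy adds its own operators on top of this):

#include <cstdint>
#include <cstdio>

int main() {
    std::int64_t N = 800, M = N * N * N, nnz = M * 7;
    double gib = 1024.0 * 1024.0 * 1024.0;
    double bytes = nnz * sizeof(double)            // values
                 + nnz * sizeof(std::int64_t)      // 64-bit column indices
                 + (M + 1) * sizeof(std::int64_t); // row pointers
    std::printf("CSR matrix alone: %.1f GiB\n", bytes / gib); // ~57 GiB
}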

ddemidov (Owner) commented:

At first glance, it seems like support for 64-bit indices could be added with some small changes here:

and possibly some other places I missed. I guess these changes could be made dependent on the backend template parameters (similar to the builtin backend).

But the real question is, as you noted, whether the matrix would fit into the GPU memory. If you use 64-bit indices, the memory requirements will grow, and the supported problem size will be further reduced. AMGCL will only work if the whole problem fits into the GPU memory. If you want to solve a bigger system, you will need multiple GPUs and MPI (or maybe a VexCL backend with multiple GPUs in the context).
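
A minimal sketch of that last option, a VexCL backend with multiple GPUs in one context (illustrative only; it assumes the rest of the setup from the original post, and the bprm.q / vex::Filter usage follows AMGCL's and VexCL's documented examples):

#include <amgcl/backend/vexcl.hpp>

// Gather every double-precision-capable GPU in the system into one context.
vex::Context ctx(vex::Filter::GPU && vex::Filter::DoublePrecision);

typedef amgcl::backend::vexcl<double> Backend;

Backend::params bprm;
bprm.q = ctx; // hand all device queues to AMGCL

// The Solver typedef and construction then stay the same:
// Solver solve(A, prm, bprm);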

amiguel256 commented Mar 26, 2025

Thank you so much for the prompt response. I actually made these changes in cuda.hpp and it works! But yes, at this point the problem is memory, reaching greater than 80GB for the problems I need to work with. The interesting part is that the solver takes about 26GB but the preconditioner takes >100GB. I might play around with the preconditioner parameters or other preconditioners; however, spai0 was the one giving the best performance, if I recall correctly.
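
One knob that might reduce the setup memory (a sketch, untested for this problem; LeanSolver is just an illustrative name): plain aggregation keeps the prolongation operators sparser than smoothed aggregation, at the likely cost of more iterations.

#include <amgcl/coarsening/aggregation.hpp>

// Same Backend as in the original post; only the coarsening is swapped.
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::aggregation, // instead of smoothed_aggregation
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<Backend>
    > LeanSolver;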

ddemidov (Owner) commented:

If your changes are backward compatible (they won't break the older code using the cuda backend), then a PR is welcome. Otherwise, if you care to share the patch, I'll try to find some time and see if it is possible to make it backward compatible.

amiguel256 commented:

I don't think they would break the older code. The only changes I made were changing the types of ptr and col to int64_t, and CUSPARSE_INDEX_32I to CUSPARSE_INDEX_64I. I'm still getting familiarized with GitHub, so I'm not sure yet how to share this with you.
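
For concreteness, a sketch of the kind of change described (illustrative only, not the actual amgcl/backend/cuda.hpp patch; make_csr_64 is a hypothetical helper): the cuSPARSE generic-API descriptor is created with 64-bit index types.

#include <cstdint>
#include <cusparse.h>

// Hypothetical helper: build a CSR descriptor with 64-bit indices.
// ptr, col, val are device pointers.
cusparseSpMatDescr_t make_csr_64(
        std::int64_t rows, std::int64_t cols, std::int64_t nnz,
        std::int64_t *ptr, std::int64_t *col, double *val)
{
    cusparseSpMatDescr_t A;
    cusparseCreateCsr(&A, rows, cols, nnz, ptr, col, val,
            CUSPARSE_INDEX_64I,       // row-offset type (was CUSPARSE_INDEX_32I)
            CUSPARSE_INDEX_64I,       // column-index type (was CUSPARSE_INDEX_32I)
            CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    return A;
}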
