Colvars on GPU #816

HanatoK · 2025-07-07T18:22:11Z

HanatoK
Jul 7, 2025
Collaborator

This is a draft plan to continue #652 and #655.

`colvarproxy`

MD engines

Investigate how Colvars interoperates with GROMACS, ~~LAMMPS~~ and Tinker-HP in GPU-resident mode.

LAMMPS support

LAMMPS uses GPUs primarily in two ways (see https://docs.lammps.org/Speed_packages.html):

the GPU package, which supports offload;
the KOKKOS package, which is "GPU-resident" but also uses more abstract syntax; KOKKOS should be interoperable with the underlying languages (CUDA, HIP, SYCL, ...) but probably not with all of their specialized features.

GPU buffers

Move atoms_masses, atoms_charge, atoms_positions, atoms_total_forces and atoms_new_colvar_forces to the subclasses of colvarproxy, and allocate device memory if a subclass of colvarproxy supports the GPU-resident mode.

Stream/Queue management

Implement a colvarproxy_gpu class to create, synchronize and delete the streams (CUDA and HIP) or queues (SYCL).

`colvarmodule`

Add support for smp gpu.

`cvm::atom_group`

GPU buffers

Implement atoms_pos, atoms_charge, atoms_vel, atoms_mass, atoms_grad, atoms_total_force and atoms_weight on device memory;
Implement read_positions, read_velocities and read_total_forces on GPU.

GPU kernels of atom-group calculations

Basically we need to implement everything in calc_required_properties with GPU kernels:

Implement calc_center_of_mass on GPU;
Implement calc_center_of_geometry on GPU;
Implement calc_apply_roto_translation on GPU;
- Implement all features of colvarmodule::rotation on GPU;
- Implement calc_optimal_rotation_soa on GPU.
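As a point of reference for the kernels listed above, the center of mass is a mass-weighted reduction over the coordinates. The following is a minimal CPU sketch, not the Colvars implementation: the function name `center_of_mass_soa` is hypothetical, and the structure-of-arrays layout (all x, then all y, then all z) is an assumption. A GPU version would replace the loop with a parallel reduction over the same four sums.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; not the Colvars implementation.
// Assumed layout: pos = [x_0..x_{n-1}, y_0..y_{n-1}, z_0..z_{n-1}].
std::array<double, 3> center_of_mass_soa(const std::vector<double> &pos,
                                         const std::vector<double> &mass) {
  const std::size_t n = mass.size();
  std::array<double, 3> com = {0.0, 0.0, 0.0};
  double total_mass = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    com[0] += mass[i] * pos[i];          // x block
    com[1] += mass[i] * pos[n + i];      // y block
    com[2] += mass[i] * pos[2 * n + i];  // z block
    total_mass += mass[i];
  }
  if (total_mass > 0.0) {
    for (double &c : com) c /= total_mass;
  }
  return com;
}
```

A reference like this can also serve to check GPU results in the tests.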

Question: should we have a separate cvm::atom_group_base for the CPU and GPU implementations?

`colvar::cvc`

GPU kernels for CVCs

Implement calc_value_gpu and calc_gradients_gpu for all CVCs;
If smp gpu is used, then calc_value_gpu and calc_gradients_gpu will be called.
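One way the dispatch described above could look is sketched here. All names (`smp_mode`, `cvc_sketch`, `distance_sketch`, `compute`) are hypothetical stand-ins rather than the actual Colvars API. The fallback from `calc_value_gpu` to `calc_value` is an assumption of this sketch, added so that CVCs without a kernel keep working during the transition.

```cpp
#include <string>

// Hypothetical stand-ins; the names below are not the actual Colvars API.
enum class smp_mode { none, cvcs, gpu };

struct cvc_sketch {
  std::string last_call;  // records which path ran, for illustration only
  virtual ~cvc_sketch() = default;
  virtual void calc_value() { last_call = "calc_value"; }
  // Assumed default: fall back to the CPU code until a kernel exists.
  virtual void calc_value_gpu() { calc_value(); }
};

// A CVC that already has a GPU implementation overrides the GPU entry point.
struct distance_sketch : cvc_sketch {
  void calc_value_gpu() override { last_call = "calc_value_gpu"; }
};

// The GPU entry point is selected only when "smp gpu" is active.
inline void compute(cvc_sketch &c, smp_mode mode) {
  if (mode == smp_mode::gpu) {
    c.calc_value_gpu();
  } else {
    c.calc_value();
  }
}
```

The same pattern would apply to `calc_gradients` and `calc_gradients_gpu`.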

Tests

Implement run_colvars_test.cpp on GPU;
Implement colvarproxy_stub_gpu on GPU;
Test the basic functionalities on GPU.

HanatoK · 2025-07-08T14:55:17Z

HanatoK
Jul 8, 2025
Collaborator Author

It looks like the IForceProvider in GROMACS does not support GPU for the time being:
https://gitlab.com/gromacs/gromacs/-/issues/5039

5 replies

giacomofiorin Jul 10, 2025
Maintainer

As noted by @jhenin offline, GROMACS should be doing the host-device copy internally, so it is possible to use a GPU-resident scheme while contributing forces via IForceProvider. I don't know how it compares to NAMD's GlobalMaster or CUDAGlobalMaster.

HanatoK Jul 10, 2025
Collaborator Author

CUDAGlobalMaster itself only does the device-device copy. The device-to-host copy happens in the colvarproxy derived class:

colvars/namd/cudaglobalmaster/colvarproxy_cudaglobalmaster.C

Lines 710 to 729 in e06f5e6

    
    if (numAtoms > 0) {
      // Transform the arrays for Colvars
      auto &colvars_pos = *(modify_atom_positions());
      copy_DtoH(d_trans_mPositions, colvars_pos.data(), numAtoms, mStream);
      if (mClient->requestUpdateAtomTotalForces()) {
        auto &colvars_total_force = *(modify_atom_total_forces());
        copy_DtoH(d_trans_mTotalForces, colvars_total_force.data(), numAtoms, mStream);
      }
      if (mClient->requestUpdateMasses()) {
        auto &colvars_mass = *(modify_atom_masses());
        copy_DtoH(d_trans_mMass, colvars_mass.data(), numAtoms, mStream);
      }
      if (mClient->requestUpdateCharges()) {
        auto &colvars_charge = *(modify_atom_charges());
        copy_DtoH(d_trans_mCharges, colvars_charge.data(), numAtoms, mStream);
      }
      if (mClient->requestUpdateLattice()) {
        copy_DtoH(d_mLattice, h_mLattice, 3*4, mStream);
      }
    }

Is the host-device copy in IForceProvider mandatory in GROMACS?

giacomofiorin Jul 10, 2025
Maintainer

> CUDAGlobalMaster itself only does the device-device copy.

That's right, thanks for the correction.

> Is the host-device copy in IForceProvider mandatory in GROMACS?

Probably mandatory if you are using at least one such feature (Colvars, PLUMED, QM/MM, density fitting, etc.). Otherwise I would assume that no copy at all is done for a conventional MD run. (No idea how it works for older features like the pull code.)

giacomofiorin Jul 10, 2025
Maintainer

@HubLot This particular thread may benefit from your knowledge.

jhenin Jul 11, 2025
Maintainer

@HubLot and I have discussed this. In GROMACS, a device-to-host copy is always triggered by (among others) the haveCpuLocalForceWork flag (see sim_util.cpp), which covers special forces such as IForceProvider. The copy concerns all atoms.
The whole IForceProvider framework seems designed for CPU-only implementations, as far as we can see, but Lukas Müllender might know better.

giacomofiorin · 2025-07-10T20:21:03Z

giacomofiorin
Jul 10, 2025
Maintainer

@HanatoK I made a quick edit about LAMMPS to your message at the top (which we could probably keep editing to keep the information organized, since you have a great starting point there?)

Regarding the point on the atom groups: the main idea behind #655 (see also your comment here) was to allow sharing the atomic coordinate buffers between different colvar::cvc objects. There are many use cases where the same group is copied over and over: eliminating those extra copies would bring its own benefit, to be compounded with the conversion to SOA (which is now done in #788).
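To illustrate the idea of sharing one coordinate buffer between CVCs defined on the same atoms, here is a minimal sketch. The types `shared_group` and `group_registry` are hypothetical and not part of Colvars. The registry hands out the same object whenever the same atom list is requested, so the positions are stored and updated once.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Hypothetical sketch; these types are not part of Colvars.
struct shared_group {
  std::vector<int> atom_ids;
  std::vector<double> positions;  // filled once per step, read by all users
};

class group_registry {
public:
  // Return the existing group for this atom list, or create it on first use.
  std::shared_ptr<shared_group> get(const std::vector<int> &atom_ids) {
    auto it = groups_.find(atom_ids);
    if (it != groups_.end()) return it->second;
    auto g = std::make_shared<shared_group>();
    g->atom_ids = atom_ids;
    g->positions.resize(3 * atom_ids.size());
    groups_.emplace(atom_ids, g);
    return g;
  }

  std::size_t size() const { return groups_.size(); }

private:
  std::map<std::vector<int>, std::shared_ptr<shared_group>> groups_;
};
```

Two CVCs that request the atoms `{1, 2, 3}` would then hold pointers to the same buffer instead of two copies.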

Beyond that, the longer-term "plan" was less about improving the data structure of the atom groups, and more about trying to have the CVCs be more agnostic to the details of that data structure. This was the goal of the second point of #655, a good chunk of which you have also implemented in #788.

Ideally, some of the member functions of the CVCs could become templates that are instantiated differently in each scenario (sequential, shared memory, domain decomposition). I had originally thought that major refactoring would be required to run all CVCs efficiently, but #783 shows that this is probably not needed for every feature.
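A rough sketch of what "templates instantiated differently in each scenario" could mean is given below. The policy tags `sequential` and `domain_decomposed`, the `reducer` trait and `mean_x` are all hypothetical. The `remote_partial` member is only a stand-in for a real cross-rank reduction provided by the MD engine.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical policy tags; not part of Colvars.
struct sequential {};
struct domain_decomposed {
  double remote_partial;  // stand-in for the contribution of other ranks
};

// Step shared by all scenarios: sum of the locally available values.
inline double local_sum(const std::vector<double> &x) {
  return std::accumulate(x.begin(), x.end(), 0.0);
}

template <typename Policy> struct reducer;

template <> struct reducer<sequential> {
  static double sum(const sequential &, const std::vector<double> &x) {
    return local_sum(x);
  }
};

template <> struct reducer<domain_decomposed> {
  static double sum(const domain_decomposed &p, const std::vector<double> &x) {
    // A real implementation would call the engine's reduction here.
    return local_sum(x) + p.remote_partial;
  }
};

// CVC-side code is written once and does not know which scenario it runs in.
template <typename Policy>
double mean_x(const Policy &p, const std::vector<double> &x,
              std::size_t n_total) {
  return reducer<Policy>::sum(p, x) / static_cast<double>(n_total);
}
```

The CVC code calls `mean_x` in all cases, and only the policy passed to it changes between scenarios.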

At this point, it would definitely make sense to have separate implementations of atom_group for CPU and GPU, with a shared base class. Is that what you were thinking of?

1 reply

HanatoK Jul 10, 2025
Collaborator Author

> At this point, it would definitely make sense to have separate implementations of atom_group for CPU and GPU, with a shared base class. Is that what you were thinking of?

Yes.

HanatoK · 2025-07-23T16:32:32Z

HanatoK
Jul 23, 2025
Collaborator Author

After some explorations and micro-benchmarks, I think it would be better to:

Start from CUDA graphs, which can save the kernel launch time for many small kernels, and CUDA graphs are supported by Kokkos (see https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Graph.html);
Use a hand-written CUDA kernel for 4x4 eigendecomposition instead of cuSOLVER, as cuSOLVER is too slow for such a small matrix and it is not even compatible with CUDA graphs. My code is in https://github.com/HanatoK/RMSD_CUDA/blob/731bd2be2f0ba79e533046d8fc5a6c0364439022/rmsd_cuda_kernel.cu#L89;
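For readers unfamiliar with small-matrix eigensolvers, here is a minimal host-side sketch of a cyclic Jacobi iteration for a symmetric 4x4 matrix. It only illustrates the kind of routine that fits in a single thread or a small kernel. The linked kernel may use a different scheme, and the name `jacobi4`, the `mat4` alias and the tolerance are assumptions of this sketch.

```cpp
#include <array>
#include <cmath>

using mat4 = std::array<std::array<double, 4>, 4>;

// Cyclic Jacobi sweeps for a symmetric 4x4 matrix.
// On return, the eigenvalues are on the diagonal of `a` and the columns of
// `v` are the corresponding eigenvectors. Returns the number of sweeps used.
int jacobi4(mat4 &a, mat4 &v, int max_sweeps = 50, double tol = 1e-24) {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) v[i][j] = (i == j) ? 1.0 : 0.0;

  for (int sweep = 0; sweep < max_sweeps; ++sweep) {
    // Converged when the squared off-diagonal norm is below the tolerance.
    double off = 0.0;
    for (int p = 0; p < 4; ++p)
      for (int q = p + 1; q < 4; ++q) off += a[p][q] * a[p][q];
    if (off < tol) return sweep;

    for (int p = 0; p < 3; ++p) {
      for (int q = p + 1; q < 4; ++q) {
        if (a[p][q] == 0.0) continue;
        // Rotation (c, s) chosen so that the new a[p][q] vanishes.
        const double theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
        const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                         (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
        const double c = 1.0 / std::sqrt(t * t + 1.0);
        const double s = t * c;
        // A <- J^T A J, applied as a column update followed by a row update.
        for (int k = 0; k < 4; ++k) {
          const double akp = a[k][p], akq = a[k][q];
          a[k][p] = c * akp - s * akq;
          a[k][q] = s * akp + c * akq;
        }
        for (int k = 0; k < 4; ++k) {
          const double apk = a[p][k], aqk = a[q][k];
          a[p][k] = c * apk - s * aqk;
          a[q][k] = s * apk + c * aqk;
        }
        // Accumulate the eigenvectors: V <- V J.
        for (int k = 0; k < 4; ++k) {
          const double vkp = v[k][p], vkq = v[k][q];
          v[k][p] = c * vkp - s * vkq;
          v[k][q] = s * vkp + c * vkq;
        }
      }
    }
  }
  return max_sweeps;
}
```

The loops have fixed bounds and need no dynamic memory, which is what makes a hand-written device version practical for a matrix of this size.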

In addition, for the time being, it is difficult to make a base class for different implementations of atom_group, so I will try directly inserting GPU buffers in atom_group (guarded by the COLVARS_CUDA macro), and reorganizing atom_group after having a working GPU code base.
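A minimal sketch of what "GPU buffers guarded by the COLVARS_CUDA macro" could look like follows. The class and member names are illustrative and do not reflect the actual `atom_group` layout. The device pointer exists only in CUDA-enabled builds, so CPU-only builds are unaffected.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch; member names are illustrative, not the Colvars layout.
class atom_group_sketch {
public:
  explicit atom_group_sketch(std::size_t n) : atoms_pos(3 * n, 0.0) {}

  std::size_t size() const { return atoms_pos.size() / 3; }

  // True only when the build enables the GPU code path.
  static constexpr bool has_gpu_buffers() {
#if defined(COLVARS_CUDA)
    return true;
#else
    return false;
#endif
  }

private:
  std::vector<double> atoms_pos;  // host buffer, always present
#if defined(COLVARS_CUDA)
  double *d_atoms_pos = nullptr;  // device buffer, allocated elsewhere
#endif
};
```

Keeping both buffers in one class avoids the base-class question for now, at the cost of `#if` blocks that would need to be reorganized later.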

4 replies

giacomofiorin Jul 23, 2025
Maintainer

> After some explorations and micro-benchmarks, I think it would be better to:
>
> * Start from CUDA graphs, which can save the kernel launch time for many small kernels, and CUDA graphs are supported by Kokkos (see https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Graph.html);

This is awesome! I didn't know that Kokkos supported them last time I looked.

> * Use a hand-written CUDA kernel for 4x4 eigendecomposition instead of cuSOLVER, as cuSOLVER is too slow for such a small matrix and it is not even compatible with CUDA graphs. My code is in https://github.com/HanatoK/RMSD_CUDA/blob/731bd2be2f0ba79e533046d8fc5a6c0364439022/rmsd_cuda_kernel.cu#L89;

Yeah, that hardly feels like it's worth launching a new kernel for that diagonalization.

> In addition, for the time being, it is difficult to make a base class for different implementations of atom_group, so I will try directly inserting GPU buffers in atom_group (guarded by the COLVARS_CUDA macro), and reorganizing atom_group after having a working GPU code base.

Totally makes sense as a starting point. Longer term, does it make sense to share these GPU buffers (at least the read-only ones like coordinates) between different kernels?

HanatoK Jul 23, 2025
Collaborator Author

It should be possible to share the buffer among CUDA, HIP and SYCL.

giacomofiorin Jul 23, 2025
Maintainer

Sorry, I meant sharing between the kernels of different CVs that are defined on the same atoms.

HanatoK Jul 23, 2025
Collaborator Author

That should be possible, but if you mean two CVs sharing the same atom group, then the accumulation of forces from different CVs has to use atomicAdd.
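The reason is that two kernels writing to the same force entry would otherwise race on a read-modify-write. A host-side analogue of the pattern is sketched below, using a compare-and-swap loop on `std::atomic<double>` in place of CUDA's `atomicAdd`. The helper names `atomic_add` and `apply_cv_forces` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Add `value` to `target` with a compare-and-swap loop; on the GPU,
// atomicAdd provides the equivalent operation.
inline void atomic_add(std::atomic<double> &target, double value) {
  double expected = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
    // On failure `expected` now holds the current value; retry with it.
  }
}

// Hypothetical helper: one CV scatters its per-atom forces into the buffer
// that it shares with other CVs defined on the same atoms.
inline void apply_cv_forces(std::vector<std::atomic<double>> &shared_forces,
                            const std::vector<double> &cv_forces) {
  for (std::size_t i = 0; i < cv_forces.size(); ++i) {
    atomic_add(shared_forces[i], cv_forces[i]);
  }
}
```

Calling `apply_cv_forces` once per CV leaves the sum of both contributions in the shared buffer, regardless of the order in which the updates land.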
