Hi Alby, Anders, and co.!
I've been playing around with Allegro and LAMMPS. I can get away with a relatively small simulation (~500 atoms), but I'd like to run it for long time scales, so I want to optimize the performance per timestep. I'm using a fairly small Allegro model (1 layer, SO(3) symmetry, a small number of tensor features, etc.). Looking at the CPU and GPU utilization with 1 MPI rank and 1 V100 GPU, I'm seeing 100% CPU usage and 75% GPU usage. Moving to 2 MPI ranks and 2 GPUs (1 node), I'm seeing 100% CPU usage per rank and 66% GPU usage per GPU. It seems the run is currently bottlenecked by something on the CPU.
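For reference, the launches look roughly like this (a sketch: the binary and input file names are placeholders, and the Kokkos flags follow the usual pair_allegro recommendation):

```sh
# 1 MPI rank, 1 V100
mpirun -np 1 lmp -in in.allegro \
    -sf kk -k on g 1 -pk kokkos newton on neigh full

# 2 MPI ranks, 2 GPUs on one node
mpirun -np 2 lmp -in in.allegro \
    -sf kk -k on g 2 -pk kokkos newton on neigh full

# Utilization was read from nvidia-smi (GPU) and top (CPU) while the run was active:
nvidia-smi dmon -s u
```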
I did a bit of profiling, and it looks like a significant chunk of total runtime (~30%) is spent on `LAMMPS_NS::CommKokkos::borders()` here, with about 20% of total runtime spent on `LAMMPS_NS::CommKokkos::borders_device<Kokkos::Cuda>()`. It looks like this function transfers neighbor data between procs, but the odd thing is, I'm running this with only 1 MPI rank. Maybe this is just sending the neighbor list to the GPU? I can provide the `gprof` output if you'd like a closer look. Have you encountered something similar before?
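In case it helps, this is roughly the workflow behind those numbers (a sketch, assuming LAMMPS rebuilt with `-pg` in the compile and link flags; paths are placeholders):

```sh
# run the same single-rank case with profiling enabled
mpirun -np 1 lmp -in in.allegro \
    -sf kk -k on g 1 -pk kokkos newton on neigh full

# flat profile from the resulting gmon.out; CommKokkos::borders() shows up near the top
gprof ./lmp gmon.out > profile.txt
head -n 30 profile.txt
```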