debug mode for alchemical trajectory analyses #231

jmichel80 · 2024-09-13T10:13:40Z

Is your feature request related to a problem? Please describe.
A recurring issue with alchemical free energy calculations in SOMD2 is that trajectories occasionally terminate early because a 'NaN' is generated after an integration step. We have also seen trajectories with transient spikes in non-bonded energies that we would expect to cause a numerical integration error.
Because the issue is stochastic and rare, it is difficult to isolate the source of the error.

Describe the solution you'd like
A 'debug' mode that enables buffering of coordinates and energies for the past N integration time-steps would be helpful. The code could be updated to write this information in molecular file formats to allow visualisation of the trajectory in the few steps immediately before a crash occurs.

Describe alternatives you've considered
In principle, this could be implemented at the Python API level by adding extra logic to save/overwrite snapshots after every MD time-step. However, this would likely be very slow and would make it difficult to reproduce NaN crashes in a timely manner.

We could, however, buffer coordinates and forces internally and write them to disk only when a crash has been triggered. There is already low-level logic in the code that attempts to deal with NaN errors by performing energy minimisation. Some compromise on speed (a few fold) would be acceptable for troubleshooting purposes.
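As a rough illustration of the "buffer internally, write only on a crash" idea, here is a minimal pure-Python sketch. It does not use the SOMD2 or Sire API: `run_with_crash_buffer`, `step_fn` and `toy_step` are hypothetical names, and the toy step function simply stands in for one integration step that returns coordinates and an energy.

```python
import math
from collections import deque


def run_with_crash_buffer(step_fn, n_steps, buffer_size):
    """Run step_fn for up to n_steps, keeping only the last buffer_size frames.

    step_fn is a hypothetical callable taking the step index and returning
    (coords, energy). Returns (crash_step or None, buffered frames), where
    each frame is a (step, coords, energy) tuple.
    """
    buffer = deque(maxlen=buffer_size)  # oldest frame is discarded automatically
    for step in range(n_steps):
        coords, energy = step_fn(step)
        buffer.append((step, coords, energy))
        # Stop as soon as a NaN appears in the energy or the coordinates.
        if math.isnan(energy) or any(math.isnan(x) for x in coords):
            return step, list(buffer)
    return None, list(buffer)


def toy_step(step):
    # Toy stand-in for an integration step: the energy becomes NaN at step 7.
    energy = float("nan") if step == 7 else -100.0 + step
    return [0.1 * step, 0.2 * step, 0.3 * step], energy


crashed_at, frames = run_with_crash_buffer(toy_step, n_steps=20, buffer_size=3)
buffered_steps = [frame[0] for frame in frames]
```

In a real implementation the buffered frames would be written out in a molecular file format at the point where the function returns a crash step; that part is deliberately left out here because it depends on the actual I/O layer.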

chryswoods · 2025-02-17T07:01:33Z

This is doable, but would be extremely slow. Buffering the coordinates and energies for every integration timestep would require calculating the energy and transferring the coordinates from GPU to CPU memory at every timestep. The buffering itself, once the data has been calculated and transferred, is easy. The first step would be to test how slow this would be by setting the trajectory and energy frequency to 1 timestep. This would simulate an infinite buffer. If the speed is acceptable, then the code change would be to add something to the trajectory object to tell it to act in a first in, first out cache mode. This is straightforward, as the trajectory object already holds each individual frame in memory, so it would just have to drop the oldest frame once the buffer size is reached.
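The "drop the oldest frame once the buffer size is reached" behaviour could look something like the sketch below. This is not the actual trajectory object: `BoundedTrajectory` and its `max_frames` parameter are hypothetical, and frames are represented as plain dictionaries purely for illustration.

```python
from collections import deque


class BoundedTrajectory:
    """Hypothetical in-memory frame store, not the real trajectory object.

    With max_frames=None it keeps every frame (the current behaviour as
    described above); with an integer it keeps only the newest max_frames.
    """

    def __init__(self, max_frames=None):
        # A deque with maxlen silently evicts the oldest item when full.
        self._frames = deque(maxlen=max_frames)

    def append(self, frame):
        self._frames.append(frame)

    def frames(self):
        # Return a copy, ordered from oldest to newest.
        return list(self._frames)

    def __len__(self):
        return len(self._frames)


traj = BoundedTrajectory(max_frames=4)
for i in range(10):
    traj.append({"step": i})

kept_steps = [frame["step"] for frame in traj.frames()]

unbounded = BoundedTrajectory()
for i in range(10):
    unbounded.append({"step": i})
```

Using a bounded deque keeps the append cost constant and avoids any explicit bookkeeping of which frame to discard.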

A similar thing could be done with the energy trajectory, but this is less necessary because it won't consume that much memory.

jmichel80 added the enhancement New feature or request label Sep 13, 2024
