Commit 7cb57eb

Robadob and ptheywood authored
Distributed Ensembles (MPI) (#156)
* Distributed Ensembles (MPI)

  Also clarify CUDAEnsemble::getLogs() now returns std::map

* Address cleanup() and cuda-aware MPI

* Add examples of how to call diff MPI implementations with 1 process per node

---------

Co-authored-by: Peter Heywood <[email protected]>
1 parent 991fcdf commit 7cb57eb

File tree

  • src/guide/running-multiple-simulations

1 file changed: +75 −9 lines changed

src/guide/running-multiple-simulations/index.rst

Lines changed: 75 additions & 9 deletions
@@ -144,7 +144,9 @@ Next you need to decide which data will be collected, as it is not possible to e

A short example is shown below, however you should refer to the :ref:`previous chapter<Configuring Data to be Logged>` for the comprehensive guide.

-One benefit of using :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` to carry out experiments, is that the specific :class:`RunPlan<flamegpu::RunPlan>` data is included in each log file, allowing them to be automatically processed and used for reproducible research. However, this does not identify the particular version or build of your model.
+One benefit of using :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` to carry out experiments, is that the specific :class:`RunPlan<flamegpu::RunPlan>` data is included in each log file, allowing them to be automatically processed and used for reproducible research. However, this does not identify the particular version or build of your model.
+
+If you wish to post-process the logs programmatically, then :func:`CUDAEnsemble::getLogs()<flamegpu::CUDAEnsemble::getLogs>` can be used to fetch a map of :class:`RunLog<flamegpu::RunLog>` where keys correspond to the index of successful runs within the input :class:`RunPlanVector<flamegpu::RunPlanVector>` (if a simulation run failed it will not have a log within the map).

Agent data is logged according to agent state, so agents with multiple states must have the config specified for each state required to be logged.
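A minimal sketch of consuming the map described above, in Python: the ``ensemble`` and ``runs`` names follow the surrounding examples, and indexing the :class:`RunPlanVector<flamegpu::RunPlanVector>` by position is assumed of the Python bindings.

.. code-block:: python

   # Sketch: keys of getLogs() are indices into the input RunPlanVector,
   # so each RunLog can be paired with the RunPlan that produced it.
   logs = ensemble.getLogs()
   for plan_id, log in logs.items():
       plan = runs[plan_id]  # failed runs simply have no entry in the map
       # ... post-process `log` alongside the parameters held by `plan` ...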

@@ -167,8 +169,8 @@ One benefit of using :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` to carry out
exit_log_cfg.logEnvironment("lerp_float");

// Pass the logging configs to the CUDAEnsemble
-cuda_ensemble.setStepLog(step_log_cfg);
-cuda_ensemble.setExitLog(exit_log_cfg);
+ensemble.setStepLog(step_log_cfg);
+ensemble.setExitLog(exit_log_cfg);

.. code-tab:: py Python

@@ -187,8 +189,8 @@ One benefit of using :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` to carry out
exit_log_cfg.logEnvironment("lerp_float")

# Pass the logging configs to the CUDAEnsemble
-cuda_ensemble.setStepLog(step_log_cfg)
-cuda_ensemble.setExitLog(exit_log_cfg)
+ensemble.setStepLog(step_log_cfg)
+ensemble.setExitLog(exit_log_cfg)

Configuring & Running the Ensemble
----------------------------------
@@ -239,11 +241,21 @@ You may also wish to specify your own defaults, by setting the values prior to c
ensemble.initialise(argc, argv);

// Pass the logging configs to the CUDAEnsemble
-cuda_ensemble.setStepLog(step_log_cfg);
-cuda_ensemble.setExitLog(exit_log_cfg);
+ensemble.setStepLog(step_log_cfg);
+ensemble.setExitLog(exit_log_cfg);

// Execute the ensemble using the specified RunPlans
const unsigned int errs = ensemble.simulate(runs);
+
+// Fetch the RunLogs of successful runs
+const std::map<unsigned int, flamegpu::RunLog> &logs = ensemble.getLogs();
+for (const auto &[plan_id, log] : logs) {
+    // Post-process the logs
+    ...
+}
+
+// Ensure profiling / memcheck work correctly (and trigger MPI_Finalize())
+flamegpu::util::cleanup();

.. code-tab:: py Python

@@ -266,12 +278,21 @@ You may also wish to specify your own defaults, by setting the values prior to c
ensemble.initialise(sys.argv)

# Pass the logging configs to the CUDAEnsemble
-cuda_ensemble.setStepLog(step_log_cfg)
-cuda_ensemble.setExitLog(exit_log_cfg)
+ensemble.setStepLog(step_log_cfg)
+ensemble.setExitLog(exit_log_cfg)

# Execute the ensemble using the specified RunPlans
errs = ensemble.simulate(runs)

+# Fetch the RunLogs of successful runs
+logs = ensemble.getLogs()
+for plan_id, log in logs.items():
+    # Post-process the logs
+    ...
+
+# Ensure profiling / memcheck work correctly (and trigger MPI_Finalize())
+pyflamegpu.cleanup()

Error Handling Within Ensembles
-------------------------------

@@ -289,6 +310,51 @@ The default error level is "Slow" (1), which will cause an exception to be raise

Alternatively, calls to :func:`simulate()<flamegpu::CUDAEnsemble::simulate>` return the number of errors, when the error level is set to "Off" (0). Therefore, failed runs can be probed manually via checking that the return value of :func:`simulate()<flamegpu::CUDAEnsemble::simulate>` does not equal zero.
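A minimal sketch of this pattern in Python, assuming the ``error_level`` member of :class:`CUDAEnsemble::EnsembleConfig<flamegpu::CUDAEnsemble::EnsembleConfig>` and its ``Off`` value are exposed to the Python bindings under the names shown; ``ensemble`` and ``runs`` follow the surrounding examples.

.. code-block:: python

   # Sketch: with the error level set to "Off", failures are reported only
   # via the return value of simulate().
   ensemble.Config().error_level = pyflamegpu.CUDAEnsembleConfig.Off  # assumed names
   errs = ensemble.simulate(runs)
   if errs != 0:
       print(f"{errs} runs failed")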

+Distributed Ensembles via MPI
+-----------------------------
+
+For particularly expensive batch runs you may wish to distribute the workload across multiple nodes within an HPC cluster. This can be achieved via Message Passing Interface (MPI) support. This feature is supported by both the C++ and Python interfaces to FLAMEGPU; however, it is not available in pre-built binaries/packages/wheels and must be compiled from source as required.
+
+To enable MPI support, FLAMEGPU should be configured with the CMake flag ``FLAMEGPU_ENABLE_MPI`` enabled. When compiled with this flag, :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` will use MPI. The ``mpi`` member of the :class:`CUDAEnsemble::EnsembleConfig<flamegpu::CUDAEnsemble::EnsembleConfig>` will be set ``true`` if MPI support was enabled at compile time.
+
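A minimal sketch of checking at runtime whether the current build was compiled with MPI support, assuming the ``mpi`` member is exposed to the Python bindings as shown and that ``ensemble`` is a constructed :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>`.

.. code-block:: python

   # Sketch: `mpi` reflects whether FLAMEGPU_ENABLE_MPI was set at compile time.
   if ensemble.Config().mpi:
       print("This build will distribute the ensemble via MPI")
   else:
       print("MPI support was not compiled in; only local GPUs will be used")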
+It is not necessary to use a CUDA-aware MPI library, as :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` will make use of all available GPUs by default using its existing multi-GPU support (as opposed to GPU-direct MPI communication).
+Hence it is only necessary to launch 1 process per node, although requesting multiple CPU cores in an HPC environment is still recommended (e.g. a minimum of ``N+1``, where ``N`` is the number of GPUs in the node).
+
+If more than one MPI process is launched per node, the available GPUs will be load-balanced across the MPI ranks.
+If more MPI processes are launched per node than there are GPUs available, a warning will be issued, as the additional MPI ranks are unnecessary and will not participate in execution of the ensemble.
+
+.. note::
+
+   MPI implementations differ in how to request 1 process per node when calling MPI. A few examples are provided below:
+
+   * `Open MPI`_: ``mpirun --pernode`` or ``mpirun --npernode 1``
+   * `MVAPICH2`_: ``mpirun_rsh -ppn 1``
+   * `Bede`_: ``bede-mpirun --bede-par 1ppn``
+
+.. _Open MPI: https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php
+.. _MVAPICH2: https://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-userguide.html#x1-320005.2.1
+.. _Bede: https://bede-documentation.readthedocs.io/en/latest/usage/index.html?#multiple-nodes-mpi
+
+When executing with MPI, :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` will execute the input :class:`RunPlanVector<flamegpu::RunPlanVector>` across all available GPUs and concurrent runs, automatically assigning jobs when a runner becomes free. This should achieve better load balancing than manually dividing work across nodes, but may lead to increased HPC queue times as the nodes must be available concurrently.
+
+The call to :func:`CUDAEnsemble::simulate()<flamegpu::CUDAEnsemble::simulate>` will initialise the MPI state if necessary; in order to exit cleanly, :func:`flamegpu::util::cleanup()<flamegpu::util::cleanup>` must be called before the program exits. Hence, you may call :func:`CUDAEnsemble::simulate()<flamegpu::CUDAEnsemble::simulate>` multiple times to execute multiple ensembles via MPI in a single execution, or probe the MPI world state prior to launching the ensemble, but :func:`flamegpu::util::cleanup()<flamegpu::util::cleanup>` must only be called once.
+
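A minimal sketch of the call pattern described above, in Python; the ``model``, ``first_runs`` and ``second_runs`` names are assumed for illustration.

.. code-block:: python

   # Sketch: several ensembles may be executed via MPI within a single process,
   # but cleanup() must be called exactly once, before the program exits.
   ensemble = pyflamegpu.CUDAEnsemble(model)
   ensemble.simulate(first_runs)    # MPI state is initialised on first use if required
   ensemble.simulate(second_runs)   # later ensembles reuse the same MPI state
   pyflamegpu.cleanup()             # exactly once per process; triggers MPI_Finalize()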
+All three error levels are supported and behave similarly. In all cases the rank 0 process will be the only process to raise an exception after the MPI group exits cleanly.
+
+If programmatically accessing run logs when using MPI, via :func:`CUDAEnsemble::getLogs()<flamegpu::CUDAEnsemble::getLogs>`, each MPI process will return the logs for the runs it personally completed. This enables further post-processing to remain distributed.
+
+For more guidance around using MPI, such as how to launch MPI jobs, you should refer to the documentation for the HPC system you will be using.
+
+.. warning::
+
+   :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` MPI support distributes GPUs within a shared memory system (node) across the MPI ranks assigned to that node, to avoid overallocation of resources and unnecessary model failures. It is only necessary to launch 1 MPI process per node, as :class:`CUDAEnsemble<flamegpu::CUDAEnsemble>` is natively able to utilise multiple GPUs within a single node, and a warning will be emitted if more MPI ranks are assigned to a node than there are visible GPUs.
+
+.. warning::
+
+   :func:`flamegpu::util::cleanup()<flamegpu::util::cleanup>` must be called before the program returns when using MPI, as this triggers ``MPI_Finalize()``. It must only be called once per process.
+
+FLAMEGPU has a dedicated MPI test suite, which can be built and run via the ``tests_mpi`` CMake target. It is configured to automatically divide GPUs between MPI processes when executed with MPI on a single node (e.g. ``mpirun -n 2 ./tests_mpi``) and to scale across any multi-node configuration. Some tests will not run if only a single GPU (and therefore MPI rank) is available. Due to limitations with GoogleTest, each runner will execute tests and print to stdout/stderr; crashes during a test may cause the suite to deadlock.
+

Related Links
-------------
