
XIOS Hanging (GNU-MPICH)

In the initialisation phase of the model run, the simulation would hang when large core counts (>3000) were used. The hang occurred in iom.F90 at the call to xios_close_context_definition(). It was first thought that the problem might be linked to a bug in the MPI libraries (concerning MPI_Iprobe). With this in mind, other MPI implementations were explored (MPICH4, OMPI), and tests with the CRAY compiler were also examined. However, a different kind of issue was encountered with these: the model fell over in the first few time steps with Salt > 1e308 or Salt = NaN. In some instances switching all code optimisation off allowed the model to run, although this appeared to be hit and miss (these issues will be examined at a later date).

Revisiting GNU-MPICH

In desperation, GNU-MPICH was revisited, beginning with the simplest setup known to work and adding in additional components to see where things break:

  1. ZPS, 1500 NEMO cores, 8 XIOS, -O0 everywhere, only 1ts output of ssh, sst and sss ✔️
  2. As 1. but MES as ZPS (i.e. same namelist) ✔️
  3. As 2. but with all MES additions in the namelist ✔️
  4. As 3. but with 16 XIOS ✔️

Note: the simulations are running with multiple_file in the file_def* files in place of one_file, to simplify things. Moving from 8 XIOS to 16 XIOS requires a fudge to the domain_cfg file: single wet points have to be inserted into Antarctica, otherwise the first XIOS server will be waiting for MPI communications that will never happen (see the iodef subsection below).
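
For reference, the multiple_file/one_file choice is set on the file_definition element of the file_def* XML. A minimal sketch, assuming typical NEMO placeholder values (the name and sync_freq here are illustrative, not the actual SE-NEMO settings):

 <file_definition type="multiple_file" name="@expname@_@freq@" sync_freq="10d" min_digits="4">
    <!-- file_group and file entries as usual; switching type to "one_file" asks
         the XIOS servers to write a single combined file instead -->
    ...
 </file_definition>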

Now scaling up the core counts with a working example:

  5. as 4. with 101 nodes, 8435 NEMO cores, 16 XIOS and -g3 (NEMO core spacing gap every 3 cores) ✔️
  6. as 4. with 68 nodes, 8435 NEMO cores, 16 XIOS, and -g9000 (i.e. no gaps) ✔️
  7. as 4. with 70 nodes, 8435 NEMO cores, 32 XIOS, and -g9000 (i.e. no gaps) ✔️
  8. as 4. with 68 nodes, 8090 NEMO cores, 32 XIOS, and -g9000 (i.e. no gaps) ✔️

This last option is optimised to remove all land cores from the simulation. For reference, the tile size at 8090 cores is 16x11.

Now the large core count simulations are running, attention turns to output:

  9. as 8. with all standard ocean output switched on ❌
  10. as 9. with enabled=.FALSE. on all output in the XML file ❌

The last test is a little confusing, as one would expect this simulation to behave as if there is no required output (to be investigated).
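
For context, output can be switched off at the group, file, or field level via the enabled attribute. A minimal sketch (the ids and field_ref below are illustrative, not the actual SE-NEMO entries):

 <file_group id="5d" output_freq="5d" enabled=".FALSE."> <!-- disables every file in the group -->
   <file id="file_grid_T" name_suffix="_grid_T">
     <field field_ref="toce" enabled=".FALSE."/> <!-- or disable an individual field -->
   </file>
 </file_group>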

  11. as 8. but with only 5 day mean output ✔️
  12. as 11. but with the addition of grid_T monthly means, and grid_U monthly means ✔️
  13. as 12. with the addition of grid_V monthly means ❌
  14. as 12. but switch out grid_U for grid_V monthly means ✔️

So test 14 indicates that the failure is not variable dependent.

  15. as 9. but remove optimal_buffer_size from the iodef.xml file ✔️
  16. as 15. with XIOS compiled with -O2 ✔️
  17. as 16. with the addition of ice output ✔️
  18. as 17. with NEMO compiled with -O1 ✔️ and compiled with -O2 ✔️

Further sensitivity tests revealed that adding optimal_buffer_size back in (set to favour memory) and increasing buffer_size_factor to 4.0 also worked.

Note on iodef.xml

There are several levers you can pull with regard to XIOS management. One option is to increase the number of XIOS servers employed. Note that in NEMO versions < 4.2 you must make sure that no XIOS server occupies a land-only region of the configuration. For example, if 8 XIOS servers are chosen then SE-NEMO is divided into 8 zonal xios_server.exe strips, each ~150 points in height (i.e. 1207 j-points divided by 8). This decomposition works for the SE-NEMO configuration, as the first XIOS server will have a small region of the Weddell Sea. If 16 XIOS servers are requested this becomes an issue, so single grid cell wet points are added to the domain_cfg file in Antarctica to overcome it.

Other options when running into difficulties with XIOS can be adjusted in the iodef.xml file. The buffer size is optimised for performance by default. It can instead be optimised for memory, scaled up, or both, by adding the following code:

 <variable_definition>
    <variable_group id="buffer">
       <variable id="optimal_buffer_size" type="string">memory</variable>
       <variable id="buffer_size_factor" type="double">4.0</variable>
    </variable_group>
    ...
 </variable_definition>

Note on XML files

Memory issues may also arise when outputting a large number of variables over a long run. By using split_freq in the output XML file, the output will be written out in more memory-friendly chunks (and the buffers cleared). For example:

 <file_group id="1m" output_freq="1mo" output_level="10" split_freq="1mo" enabled=".TRUE."> <!-- real monthly files -->

will write out one file per monthly mean. If split_freq is omitted then the output will be written to a single file spanning the length of the job submission.
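
To make this concrete, a file_group entry like the one above typically wraps one or more file elements whose fields are then split monthly. A minimal sketch (the file id, name_suffix and field_ref are illustrative, not the SE-NEMO definitions):

 <file_group id="1m" output_freq="1mo" output_level="10" split_freq="1mo" enabled=".TRUE."> <!-- real monthly files, one per month -->
   <file id="file_grid_T_1m" name_suffix="_grid_T" description="ocean T grid variables">
     <field field_ref="toce" name="thetao"/>
   </file>
 </file_group>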