Analysis scripts

Details of the analysis scripts under N1280/analyse_data

We use the CPMIP definitions of performance metrics.

archer_archer2_costs.py

Overview

Calculates the costs for the Archer and Archer2 runs, including and excluding failed cycles. For these runs, failures are due to machine issues and model instabilities. Comparing the costs with and without failures shows the inefficiency inherent in running an unstable configuration.

We could potentially work out which failures were due to machine or other issues and which were due to model instabilities, but this might need to be done at least semi-manually.

Inputs

../files/suite_info_N1280.csv : Summary information on the runs to process
../files/processed/ : CSV file with information on each of the UM atmos tasks run by the suite.

Outputs

../files/suite_perf_N1280_Archer_Archer2.csv : Cost metrics, including and excluding failed tasks, for each run.
- Failed jobs : The total number of failed jobs for the whole run.
- CHSY : Core Hours per Simulated Year, for the whole run including failures.
- CHSY exc fail : As above, excluding failed tasks.
- NHSY : Node Hours per Simulated Year, including failures. This is CHSY divided by the number of cores per node.
- NHSY exc fail : As above, excluding failed tasks.
- MAU/SY : Cost in Archer allocation units (MAUs) per simulated year, including failures.
- MAU/SY exc fail : As above, excluding failed tasks.
- CU/SY : Cost in Archer2 allocation units (CUs) per simulated year, including failures.
- CU/SY exc fail : As above, excluding failed tasks.

Procedure

We work out which tasks succeeded based on the method described here, and then calculate the metrics using either the full set of tasks or just the successful ones. We have a full set of logs for the Archer and Archer2 runs, plus all the raw cylc files, so these costs can be considered accurate.

The costs in terms of Archer and Archer2 allocation units are calculated using the published rates. For Archer suites, we calculate the MAU cost first, then convert to CUs; and for Archer2 suites we do the converse and calculate CUs then convert to MAUs.

For Archer this is 0.36 kAU per node hour.
For Archer2 it is 1 CU per node hour.
The conversion is 1.5156 kAU to 1 CU.
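
The conversion between the two allocation units can be sketched as below. This is a minimal illustration using only the rates quoted above; the function and constant names are hypothetical and are not taken from archer_archer2_costs.py. The sketch works in kAU, the unit the rate is quoted in, and leaves out any further scaling to MAU.

```python
# Rates as quoted in the text above. All names here are illustrative.
KAU_PER_ARCHER_NODE_HOUR = 0.36
CU_PER_ARCHER2_NODE_HOUR = 1.0
KAU_PER_CU = 1.5156


def archer_costs(node_hours):
    """Return (kAU, CU) for Archer node hours: kAU first, then convert."""
    kau = node_hours * KAU_PER_ARCHER_NODE_HOUR
    return kau, kau / KAU_PER_CU


def archer2_costs(node_hours):
    """Return (kAU, CU) for Archer2 node hours: CU first, then convert."""
    cu = node_hours * CU_PER_ARCHER2_NODE_HOUR
    return cu * KAU_PER_CU, cu


# Hypothetical example: 100 node hours on each machine.
print(archer_costs(100.0))
print(archer2_costs(100.0))
```

Passing the node hours per simulated year (NHSY) gives the cost per simulated year directly.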

performance_stats.py

Overview

Generates performance statistics for all the N1280 suites we have data for.

Inputs

../files/suite_info_N1280.csv : Summary information on the runs to process
../files/processed/ : CSV file with information on each of the UM atmos tasks run by the suite. This includes submission time, start time, run time etc.

Outputs

../files/suite_perf_N1280.csv : Performance metrics for each suite, plus some of the run information from the suite info file.
- Mean run time (h) : averaged over all cycles
- Mean queue time (h)
- Run time (days) / SY
- Queue time (days) / SY
- Run + queue time (days) / SY : An ideal time per SY, discounting any slow down due to machine, model or workflow.
- SYPD : Simulated Years Per Day, a key metric for model speed.
- ASYPD inc down : Actual Simulated Years Per Day, a measure of workflow speed, i.e. how fast we are actually getting model output per SY. Measured from the time the first task is submitted until the last output file is archived. This is the CPMIP definition.
- ASYPD inc wait : An ideal workflow speed. How fast the simulation could run based only on the model speed and HPC queue time, assuming all other tasks run in parallel and there is no downtime.
- Cores used : The number of parallel tasks used by the model, assuming 1 core per task, i.e. the number of MPI processes times the number of OpenMP threads.
- Cores allocated : The number of cores charged to the model. On the HPC systems used here, we pay for exclusive use of a full node, so this is the number of nodes times the cores per node. If we are running underpopulated, this will differ from the cores used. We use this number to calculate the cost metrics, because we are charged for a full node.
- CHSY : Core Hours per Simulated Year
- NHSY : Node Hours per Simulated Year
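
As a rough illustration of how the speed and cost metrics in this list relate to each other, the sketch below computes SYPD, CHSY and NHSY for a single cycle. The function names and the example numbers are hypothetical; performance_stats.py may organise the calculation differently.

```python
def sypd(simulated_years, run_time_hours):
    """Simulated Years Per Day, from model run time alone."""
    return simulated_years / (run_time_hours / 24.0)


def chsy(cores_allocated, run_time_hours, simulated_years):
    """Core Hours per Simulated Year, charged on allocated cores."""
    return cores_allocated * run_time_hours / simulated_years


def nhsy(chsy_value, cores_per_node):
    """Node Hours per Simulated Year: CHSY divided by cores per node."""
    return chsy_value / cores_per_node


# Hypothetical cycle: 3 simulated months in 6 hours on 100 nodes of 128 cores.
cycle_chsy = chsy(100 * 128, 6.0, 0.25)
print(sypd(0.25, 6.0), cycle_chsy, nhsy(cycle_chsy, 128))
```

Using allocated rather than used cores in CHSY reflects the point above that a full node is charged even when it is underpopulated.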

Procedure

The metrics are calculated by averaging over all tasks listed in the CSV log files. In some cases we do not have data for every completed cycle. For run time statistics, we exclude any failed tasks, since these may not have run a full cycle. For queue time, we do include failed tasks. We identify failed tasks by treating every job except the last one for each cycle as a failure. For one of the Archer2 suites (u-cf432_1), we have some jobs that go past the last cycle listed in the suite information file, so these are also filtered out for run time. This suite changed the number of nodes midway through, which is why we split it up.
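
The "last job per cycle" rule can be sketched as follows. This is a minimal example, assuming each task record carries a cycle label and a submit number; the field names `cycle`, `submit_num` and `run_hours` are hypothetical and may not match the columns in the processed CSV files.

```python
def split_last_jobs(tasks):
    """Split task records into (succeeded, failed).

    The job with the highest submit number in each cycle is treated as
    the successful one; every other job in that cycle counts as failed.
    """
    last = {}
    for task in tasks:
        cycle = task["cycle"]
        if cycle not in last or task["submit_num"] > last[cycle]["submit_num"]:
            last[cycle] = task
    succeeded = [t for t in tasks if last[t["cycle"]] is t]
    failed = [t for t in tasks if last[t["cycle"]] is not t]
    return succeeded, failed


# Hypothetical records: the first cycle needed a resubmission.
tasks = [
    {"cycle": "19800901", "submit_num": 1, "run_hours": 1.2},
    {"cycle": "19800901", "submit_num": 2, "run_hours": 5.9},
    {"cycle": "19801201", "submit_num": 1, "run_hours": 6.1},
]
succeeded, failed = split_last_jobs(tasks)
mean_run_hours = sum(t["run_hours"] for t in succeeded) / len(succeeded)
```

Run time statistics would then use `succeeded` only, while queue time statistics would use the full `tasks` list.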

Ignoring failed jobs, however, does not give a true account of the cost of the simulation. For the XCS suites, the failed jobs were already filtered out in the initial processing script and we no longer have the raw cylc files. For Archer and Archer2 suites we look at the costs with and without failed jobs in archer_archer2_costs.py.

The code also reads in additional data from the Archer SAFE logs, and XCS PBS epilogues. This includes information like memory and energy usage, and file I/O, but we don't use this data here. There is probably a way to get this data from Archer2 slurm logs too.

performance_stats_extra.py

Overview

Generates some additional metrics - memory, energy and IO rates.

Inputs

../files/suite_perf_N1280.csv : Performance data generated by performance_stats.py.
../files/processed/ : For XCS suites, CSV files of data from PBS job epilogues. For Archer suites, CSV files of cylc logs, and job reports downloaded from SAFE.

Outputs

../files/suite_perf_N1280_extra.csv : Some extra performance metrics for each suite.
- Memory usage (TB) : Total memory used over all nodes
- Energy usage (GJ/SY)
- Data written (TB/SY)
- Data write rate (MB/s)

Procedure

These metrics are generated from additional job information produced in PBS job logs for XCS suites, and downloaded from SAFE for Archer jobs. They may not be strictly equivalent between machines. The data write rate is just derived from the volume of data written and the runtime.
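
The derived write rate amounts to a unit conversion, sketched below. The function name is hypothetical, and decimal units (1 TB = 1e6 MB) are an assumption; the script may use a different convention.

```python
def write_rate_mb_per_s(data_written_tb, run_time_hours):
    """Mean write rate in MB/s, assuming decimal units (1 TB = 1e6 MB)."""
    return data_written_tb * 1.0e6 / (run_time_hours * 3600.0)


# Hypothetical example: 3.6 TB written over a 10 hour run.
print(write_rate_mb_per_s(3.6, 10.0))
```

Because this is an average over the whole run time, it says nothing about peak I/O rates during output steps.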

Note that we only have a few entries for u-bx090. The Archer job data is also now held in a different format in SAFE, so the code would need to be amended. We have energy and memory data in the _safe_node files, but have had difficulty extracting IO data.

queue_variability.py

Overview

Plots the queue time for each cycle in hours.

Inputs

../files/processed/ : CSV file with information on each of the UM atmos tasks run by the suite.

Outputs

../plots/ :
- archer_archer2_queue_variability.png : Queue time per cycle, plotted against the submission date.
- archer_archer2_queue_variability_small.png : Same as above, plotted in a smaller size.

Procedure

We only consider Archer and Archer2 suites here. The queue time for all tasks is plotted against the real time at which the job was submitted. This shows periods where the machine is busy versus quiet.
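
A minimal sketch of the queue time calculation, assuming queue time is the gap between a job's submission time and its start time. The timestamp format and function name are assumptions, not taken from queue_variability.py.

```python
from datetime import datetime

# Assumed timestamp layout; the processed CSV files may use another format.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def queue_hours(submit_time, start_time):
    """Hours a job spent queued between submission and start."""
    submitted = datetime.strptime(submit_time, TIME_FORMAT)
    started = datetime.strptime(start_time, TIME_FORMAT)
    return (started - submitted).total_seconds() / 3600.0


# Hypothetical job: submitted at 09:00 and started at 12:30 the same day.
print(queue_hours("2021-03-01T09:00:00", "2021-03-01T12:30:00"))
```

Each value would be plotted against its parsed submission time on the x-axis.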

Potentially we could look at, e.g., weekend versus weekday submissions. We could also annotate the plots with the nodes and hours requested in each case. In any case, this can be combined with the mean queue time / SY to show, for example, whether running longer cycles is more efficient even if the queue time is longer.

runtime_variability.py

Overview

Plots the run time for each cycle, in terms of time in hours, and SYPD.

Inputs

../files/suite_info_N1280.csv : Summary information on the runs to process
../files/processed/ : CSV file with information on each of the UM atmos tasks run by the suite.

Outputs

../plots/ :
- xcs_sypd_variability.png : SYPD per cycle for XCS suites
- archer_sypd_variability.png : SYPD per cycle for Archer suites
- archer2_sypd_variability.png : SYPD per cycle for Archer2 suites
- archer_archer2_sypd_variability.png : SYPD per cycle for Archer and Archer2 suites together.
- archer_runtime_variability.png : Runtime in hours per cycle for Archer suites.
- archer2_runtime_variability.png : Runtime in hours per cycle for Archer2 suites.

Procedure

We plot the model speed (in SYPD) and run time for all successful tasks (see here), against the model cycle. We generate plots for all the suites for each machine, and additionally plot the Archer and Archer2 suites together. Since these have different model dates, we re-index the data against the month of the simulation.
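
Re-indexing by simulation month can be sketched as counting whole months since each suite's first cycle, so that suites with different model dates share a common x-axis. The function name and example dates below are hypothetical.

```python
from datetime import date


def simulation_month(cycle_date, first_cycle_date):
    """Whole months elapsed since the suite's first cycle (0-based)."""
    return (cycle_date.year - first_cycle_date.year) * 12 + (
        cycle_date.month - first_cycle_date.month
    )


# Two hypothetical suites with different start dates land on the same index.
print(simulation_month(date(1981, 3, 1), date(1980, 9, 1)))
print(simulation_month(date(2005, 7, 1), date(2005, 1, 1)))
```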

transfer_rates.py

Overview

Plots the transfer rate per cycle for two recent runs.

Inputs

../files/processed/ : CSV file with data from transfer tasks, in particular transfer speed, and data size per cycle.

Outputs

../plots/ :
- u-bx090_transfer.png : Transfer rate per cycle in MB/s for this run, annotated with the mean data volume/cycle.
- u-cd936_transfer.png : As above, for this run.

Procedure

We only plot the two suites that we have the right transfer data for. These runs use rsync and the logs report the transfer rate and data sizes. For gridftp transfers, we need to make sure the logs report the data size.
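
As an illustration of the kind of information rsync reports, the sketch below pulls the bytes sent and the transfer rate out of rsync's default closing summary line. It assumes the transfer logs contain that line unmodified (no human-readable `-h` formatting), which may not hold for these suites; the function name and returned keys are hypothetical.

```python
import re

# Matches rsync's default summary, with or without thousands separators, e.g.
# "sent 2,000,000,000 bytes  received 1,035 bytes  50,000,000.00 bytes/sec"
SUMMARY = re.compile(
    r"sent ([\d,]+) bytes\s+received ([\d,]+) bytes\s+([\d,.]+) bytes/sec"
)


def parse_rsync_summary(log_text):
    """Return sent volume (MB) and rate (MB/s), or None if no summary found."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    sent, _received, rate = (float(g.replace(",", "")) for g in match.groups())
    return {"sent_mb": sent / 1e6, "rate_mb_per_s": rate / 1e6}


log = (
    "sent 2,000,000,000 bytes  received 1,035 bytes  50,000,000.00 bytes/sec\n"
    "total size is 2,000,000,000  speedup is 1.00\n"
)
print(parse_rsync_summary(log))
```

Returning None when the summary is missing makes it easy to skip cycles whose logs do not report a data size.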
