Migrate SLURM accounting data when upgrading to a newer ParallelCluster version


Introduction

This guide describes how to migrate your SLURM accounting data when upgrading to a newer version of AWS ParallelCluster. The migration process has two main steps: first, creating a new cluster with the newer version, then initializing its accounting database using data from your previous cluster. After migration, the two clusters will each point to their own independent database. Note that this process requires downtime on your current cluster to ensure data integrity.

This guide focuses on ParallelCluster-specific constraints and recommendations. For additional details and recommendations specific to SLURM version upgrades, please refer to the official SLURM documentation.

Procedure Overview

The high-level steps to migrate the SLURM accounting database are:

  1. Stop the current cluster and back up the accounting data
  2. Create a new database and import the accounting data
  3. Set up the new ParallelCluster pointing to the new accounting database
  4. Verify the migration was successful

Requirements

  1. You have an existing cluster with SLURM accounting enabled
  2. You can stop the compute fleet and SLURM daemons on the current cluster.
  3. You can install the MySQL client commands (mysql, mysqldump) on a host; this can be any EC2 instance able to communicate with the database (see the example after this list).
  4. You have access to a database user with read/write permissions on the database cluster. This is required to create the new database and to import and export database dumps.
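
For example, assuming an Amazon Linux 2 instance, the MariaDB client package provides the mysql and mysqldump commands; the package name may differ on other distributions:

sudo yum install -y mariadb
mysql --version        # Verify the client commands are available
mysqldump --version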

Best Practices

  1. Always create a backup of your data before migration
  2. Test the migration process in a non-production environment first
  3. Verify data integrity after migration
  4. Keep the old database as a backup for a reasonable period

Step 1: Stop the fleet and SLURM daemons on the current cluster

In this step we stop using the accounting database and retrieve relevant information that we will later use to set up the new cluster.

Stop the compute fleet:

pcluster update-compute-fleet \
    --region <region> \
    --cluster-name <clusterA_Name> \
    --status STOP_REQUESTED
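
Before proceeding, you can wait for the compute fleet to be fully stopped, for example by polling its status:

pcluster describe-compute-fleet \
    --region <region> \
    --cluster-name <clusterA_Name>
# Wait until the reported status is STOPPED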

Take note of the SLURM configuration ClusterName in the current cluster:

scontrol show config | grep ClusterName
ClusterName             = <clusterA_Slurm_ClusterName>

Take note of the SLURM configuration StorageLoc in the current cluster:

sacctmgr show configuration | grep StorageLoc
StorageLoc             = <clusterA_DatabaseName>

Take note of the ID of the last job submitted to the current cluster:

sacct --format=jobid -X | tail -n 1
<clusterA_LastJobId>

Then stop all SLURM daemons:

systemctl stop slurmrestd   # Only if present
systemctl stop slurmctld 
systemctl stop slurmdbd
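
You can verify that the daemons are no longer running before proceeding, for example:

systemctl status slurmctld
systemctl status slurmdbd
# Both should be reported as inactive (dead)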

Step 2: Backup data on the current cluster

In this step we will back up the cluster state and accounting data. The dump of accounting data will then be imported into the new database.

This is an example procedure that relies on mysqldump to perform a local snapshot of the database. We recommend that users follow their preferred database backup procedure for this step, which may involve different tools or techniques depending on the specific database setup. In this example, since the backup is created locally on an instance, we recommend checking that there is enough space for the backup file and saving it to a persistent storage location for future use.
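
For example, you can check the available disk space before creating the dump and, once the dump file has been created with the mysqldump command shown below, copy it to a persistent storage location such as Amazon S3; the bucket name here is a placeholder:

df -h .
aws s3 cp slurm_accounting_backup.sql s3://<your_backup_bucket>/slurm_accounting_backup.sql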

Backup the state of the current cluster:

mkdir -p pcluster-backup/var/spool
cp -R /var/spool/slurm.state pcluster-backup/var/spool/

Backup data from the accounting database:

mysqldump <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
--routines --triggers --events --set-gtid-purged=OFF > slurm_accounting_backup.sql
# Type the password when prompted

Important flags explained:

  • --routines --triggers --events: Includes stored procedures, triggers, and scheduled events
  • --set-gtid-purged=OFF: Prevents replication-related issues

Step 3: Copy data to the new database

In this step we will import the accounting data that we saved in the previous step into the new database that will be used by the new cluster. We recommend creating a new database for the new cluster as a safe approach for the migration, as it decouples the two clusters.

This is an example procedure that relies on the mysql client to import the local snapshot of the old database, taken with mysqldump, into the new one. We recommend that users follow their preferred database restore procedure for this step, which may involve different tools or techniques depending on the specific database setup.

Create a new database:

mysql -h <databaseHostname> \
-u <databaseAdminUsername> \
-p -e "CREATE DATABASE <clusterB_DatabaseName>;"
# Type the password when prompted

Import data to the new database:

mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
< slurm_accounting_backup.sql
# Type the password when prompted

Verify that the new database contains the expected data.

In particular, check that the tables in the two databases match:

TABLES_A=$(
mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SHOW TABLES;')

TABLES_B=$(
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SHOW TABLES;')

diff <(echo "$TABLES_A") <(echo "$TABLES_B")

and check that the number of recorded jobs matches in the two databases:

COUNT_A=$(
mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT COUNT(*) FROM `<clusterA_Slurm_ClusterName>_job_table`;')

COUNT_B=$(
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT COUNT(*) FROM `<clusterA_Slurm_ClusterName>_job_table`;')

diff <(echo "$COUNT_A") <(echo "$COUNT_B")

Please note that this validation is not comprehensive; however, it can give you reasonable confidence in the outcome of the copy.
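
As an additional sanity check, assuming the default slurmdbd schema where the job table has an id_job column, you can also compare the highest recorded job ID in the two databases:

mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT MAX(id_job) FROM `<clusterA_Slurm_ClusterName>_job_table`;'

mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT MAX(id_job) FROM `<clusterA_Slurm_ClusterName>_job_table`;'

# The two queries should return the same value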

Step 4: Set up the new cluster, without SLURM accounting

Create the new cluster without SLURM accounting, but with the custom SLURM settings below so that the new cluster can work with the old data. It is important to create the new cluster without SLURM accounting at first, because otherwise slurmctld would fail to start due to a conflict in the cluster name.

Create the new cluster with the following custom SLURM settings:

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:      
      - ClusterName: <clusterA_Slurm_ClusterName>
      - FirstJobId: <clusterA_LastJobId + 100>
  • Setting ClusterName to the old cluster name is required so that SLURM stores new accounting data in the same tables where the old accounting data is stored.
  • Setting FirstJobId to a number greater than the ID of the last job submitted in the old cluster is required so that the new cluster does not overwrite the old accounting data. Note that 100 here is an arbitrary offset.
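
For example, assuming <clusterB_Config> is the path to a cluster configuration file that already includes the settings above, the new cluster can be created with:

pcluster create-cluster \
    --region <region> \
    --cluster-name <clusterB_Name> \
    --cluster-configuration <clusterB_Config>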

Once the cluster is created, stop the compute fleet so that the cluster can be updated in the next step to enable accounting:

pcluster update-compute-fleet \
    --region <region> \
    --cluster-name <clusterB_Name> \
    --status STOP_REQUESTED

Step 5: Enable SLURM accounting in the new cluster

Enable SLURM accounting in the new cluster by updating the cluster configuration:

HeadNode:
  Networking:
    AdditionalSecurityGroups:
      - <database_ClientSecurityGroup>
Scheduling:
  SlurmSettings:    
    Database:
      Uri: <databaseHostname>:<databasePort>
      UserName: <databaseAdminUsername>
      PasswordSecretArn: <databasePasswordSecret>
      DatabaseName: <clusterB_DatabaseName>
    CustomSlurmSettings:      
      - ClusterName: <clusterA_Slurm_ClusterName>
      - FirstJobId: <clusterA_LastJobId + 100>

Apply the new configuration through a cluster update:

pcluster update-cluster \
    --region <region> \
    --cluster-name <clusterB_Name> \
    --cluster-configuration <clusterB_Config>
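
You can monitor the update progress by polling the cluster status, for example:

pcluster describe-cluster \
    --region <region> \
    --cluster-name <clusterB_Name>
# Wait until clusterStatus is UPDATE_COMPLETE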

Once the update completes, restart the compute fleet:

pcluster update-compute-fleet \
    --region <region> \
    --cluster-name <clusterB_Name> \
    --status START_REQUESTED

Step 6: Verify old data is visible from the new cluster

Log into the new cluster head node and verify that the old accounting data is visible.

sacct --format=jobid,jobname,partition,alloccpus,account,user,uid,state,exitcode --starttime now-30days
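
You can also confirm that the new cluster is registered in the migrated accounting database under the old SLURM cluster name:

sacctmgr show cluster
# The output should list <clusterA_Slurm_ClusterName>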

Step 7: Verify new data can be written from the new cluster

In the new cluster, submit a dummy job:

sbatch --wrap 'sleep 60'

Take note of the job ID and check that the accounting data for that job is visible:

sacct --format=jobid,jobname,partition,alloccpus,account,user,uid,state,exitcode --starttime now-30days
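
Alternatively, you can restrict the query to the job ID noted above:

sacct -j <jobId> --format=jobid,jobname,partition,alloccpus,account,user,uid,state,exitcode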