Compute Canada
John Giorgi edited this page Mar 18, 2020
This page serves as internal documentation for setting up the project on one of Compute Canada's clusters.

The following bash script can be saved as `setup.sh` and run on a Compute Canada cluster to set the project up. Once the script finishes, the project and all of its dependencies will be installed under `$WORK`, and the virtual environment created at `$ENV` will be active in your current shell (provided you ran the script with `source setup.sh` rather than in a subshell).
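For reference, the paths the scripts below rely on expand roughly as follows. This is only a sketch: on Compute Canada clusters `$SCRATCH` is predefined per user, and the fallback value below is purely illustrative.

```shell
# Illustrative only: on a real cluster $SCRATCH is already set by the system.
SCRATCH="${SCRATCH:-/scratch/$USER}"
PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"   # virtual environment location
WORK="$SCRATCH/t2t"         # project + dependencies location
echo "env:  $ENV"
echo "work: $WORK"
```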
Note: for the time being, you will need to comment out the `transformers` dependency in the `setup.py` file of AllenNLP before calling `pip install --editable .`
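The edit above can also be scripted. A minimal sketch using `sed`, demonstrated here on a throwaway copy rather than the real file (the actual target is `setup.py` in the cloned `allennlp` directory; check the exact spelling of the requirement there before relying on the pattern):

```shell
# Demo on a throwaway file; the real target is allennlp's setup.py.
printf '%s\n' 'install_requires = [' '    "transformers>=2.3.0",' ']' > /tmp/setup_demo.py
# Comment out any line mentioning transformers
sed -i '/transformers/ s/^/# /' /tmp/setup_demo.py
cat /tmp/setup_demo.py
```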
setup.sh

```bash
#!/bin/bash

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
WORK="$SCRATCH/t2t"

mkdir -p "$WORK"

module load python/3.7 cuda/10.1

# Create and activate a virtual environment
virtualenv --no-download "$ENV"
source "$ENV/bin/activate"
pip install --no-index --upgrade pip

# (TEMP) Install Transformers manually
pip install transformers==2.3.0

# Install AllenNLP from source
cd "$WORK"
git clone https://github.com/allenai/allennlp.git
cd allennlp
# *YOU NEED TO EDIT setup.py AND COMMENT OUT THE "transformers" DEPENDENCY*
pip install --editable .
cd ..

# Install the project
git clone https://github.com/JohnGiorgi/t2t.git
cd t2t
pip install --editable .
```
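Because the script activates a virtual environment, run it with `source setup.sh` rather than `bash setup.sh`; otherwise the activation happens only inside a subshell and is lost when the script exits. A quick demonstration of the difference, using a throwaway variable rather than the real script:

```shell
# A variable set by `bash script.sh` dies with the subshell;
# one set by `source script.sh` persists in the calling shell.
echo 'DEMO_VAR=active' > /tmp/demo_env.sh
bash /tmp/demo_env.sh
echo "after bash:   ${DEMO_VAR:-unset}"    # -> unset
source /tmp/demo_env.sh
echo "after source: ${DEMO_VAR:-unset}"    # -> active
```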
Once `setup.sh` has run successfully, the script `train.sh` can be submitted with `sbatch train.sh` to schedule a job. A few things of note:

- All hyperparameters are selected in the JSON file at `$CONFIG_FILEPATH`. For quick changes at the command line, use `--overrides`, e.g. `--overrides '{"data_loader.batch_size": 16}'` to modify the config in place.
- All output is saved to `$OUTPUT`.
- TensorBoard logs are written to `$OUTPUT/log`, so you can call `tensorboard --logdir $OUTPUT/log` to view them. Note, however, that the compute nodes are air-gapped, so you will need to copy `$OUTPUT/log` to a login node, or to your local computer, before running `tensorboard`.
- `train.sh` contains an example call with `salloc`, which you can use to request the job interactively (for things like debugging). These jobs should be short.
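The `--overrides` string must be valid JSON and should be single-quoted so the shell passes the braces and double quotes through untouched. One quick way to check that an overrides string parses before submitting a job (this uses Python's `json` module; any JSON validator works):

```shell
OVERRIDES='{"data_loader.batch_size": 16}'
# Fails loudly if the string is not valid JSON
python3 -c 'import json, sys; json.loads(sys.argv[1])' "$OVERRIDES" && echo "valid JSON"
```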
train.sh

```bash
#!/bin/bash

# Requested resources
#SBATCH --mem=32G
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:1
# Wall time and job details
#SBATCH --time=24:00:00
#SBATCH --job-name=t2t-train
# Email me if the job fails
#SBATCH [email protected]
#SBATCH --mail-type=FAIL

# Use this command to run the same job interactively
# salloc --mem=32G --cpus-per-task=10 --gres=gpu:1 --time=3:00:00

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
OUTPUT="$SCRATCH/$PROJECT_NAME"
WORK="$SCRATCH/$PROJECT_NAME/$PROJECT_NAME"
# Path to the AllenNLP config
CONFIG_FILEPATH="$WORK/configs/contrastive.jsonnet"
# Directory to save model, vocabulary and training logs
SERIALIZED_DIR="$OUTPUT/tmp"

# Load the required modules and activate the environment
module load python/3.7 cuda/10.1
source "$ENV/bin/activate"
cd "$WORK"

# Run the job
allennlp train "$CONFIG_FILEPATH" \
    --serialization-dir "$SERIALIZED_DIR" \
    --include-package t2t
```