# Quick start

This page is a quick start guide on how to use the IEETA cluster (Pleiades).

## 1. Access the IEETA cluster (Pleiades)

Access the cluster via SSH using the credentials provided to you by email. If you do not have access yet, please refer to the `how_to_access.md` page.

```bash
# Replace <username> and <cluster-address> with the values provided in your access email
$ ssh <username>@<cluster-address>
```

By default, upon logging in, you will land on our **login** node in your home directory, which is located at `/data/home`. This is a network storage partition visible to all cluster nodes.
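
As a quick sanity check right after logging in, you can confirm where you are. The per-user path shown below is an assumption based on the `/data/home` mount point and may differ slightly for your account.

```bash
# Print the current working directory right after login
# (the per-user path is an assumption; yours may differ)
$ pwd
/data/home/<your-username>
```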

The **login** node is where you should prepare your code in order to submit jobs to run on the **worker** nodes of the cluster. The worker nodes are equipped with powerful resources. Currently, we have:

- **CPU nodes**: Nodes with a large amount of RAM and faster CPUs. *Not yet added to the cluster.*
- **GPU nodes**: Nodes equipped with GPUs and more modest CPU/RAM configurations.

For more information about each node, check the [nodes page](detail_material/nodes.md).
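
You can also inspect the nodes directly from the login node. Assuming a standard SLURM setup, `sinfo` in node-oriented mode lists each node together with its partition and state.

```bash
# List every node with its partition, state and basic details (node-oriented, long format)
$ sinfo -N -l
```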

## 2. Prepare your software environment

The next step is to prepare your environment to run/build your application. We recommend using a virtual environment so that you can install any package locally. First, load the Python module.

```bash
$ module load python
```
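
If you want to check which modules (and Python versions) are available before loading one, the usual Environment Modules commands should work; the exact module names may differ on Pleiades.

```bash
# List all modules available on the cluster
$ module avail

# Show the modules currently loaded in your session
$ module list
```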

Then create and activate your virtual environment.

```bash
$ python -m venv virtual-venv
$ source virtual-venv/bin/activate
```

You can then install your package dependencies with pip.

```bash
(virtual-venv)$ pip install --upgrade pip
(virtual-venv)$ pip install torch transformers
```
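
To make the environment easy to rebuild later, you can record the installed packages in a `requirements.txt` file; the file name is just the usual convention, not something the cluster requires.

```bash
# Save the exact package versions currently installed in the venv
(virtual-venv)$ pip freeze > requirements.txt

# Recreate the same environment later from that file
(virtual-venv)$ pip install -r requirements.txt
```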

## 3. Create your SLURM job script

After setting up your runtime environment, you should create a SLURM job script to submit your job. For example:

```bash
#!/bin/bash
#SBATCH --job-name=trainer          # create a short name for your job
#SBATCH --output="trainer-%j.out"   # %j will be replaced by the SLURM job ID
#SBATCH --nodes=1                   # node count
#SBATCH --ntasks=1                  # total number of tasks across all nodes
#SBATCH --cpus-per-task=2           # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:1                # number of gpus per node
#SBATCH --mem=4G                    # total amount of RAM requested

# Activate the virtual environment created in your home directory.
# If the venv is already active when you submit the job, you do not need to activate/deactivate it here.
source ~/virtual-venv/bin/activate

python your_trainer_script.py

deactivate
```

The script is made of two parts:

1. Specification of the resources needed and some job information;
2. Commands that will be executed on the destination node.

As an example, in the first part of the script, we define the job name, the output file and the requested resources (1 GPU, 2 CPUs and 4 GB of RAM). Then, in the second part, we define the tasks of the job.
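
Before launching a long run, it can be useful to confirm that the job actually sees the requested GPU. A minimal sketch, assuming PyTorch is installed in your venv, is to add a quick check to the command section of the script:

```bash
# Show the GPU(s) allocated to this job by the NVIDIA driver
nvidia-smi

# Quick PyTorch check: prints whether CUDA is available and how many GPUs are visible
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```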

By default, since no partition was specified, the job will run in the default partition, which in this cluster is the `gpu` partition. You can check which partitions and nodes are available with:

```bash
$ sinfo
```

## 4. Submit the job

To submit the job, you should run the following command:

```bash
$ sbatch script_trainer.sh
Submitted batch job 144
```

You can check the job status using the following command:

```bash
$ squeue
```
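
A few other commands are handy for following up on a submitted job; the job ID and output file name below correspond to the example above.

```bash
# Show only your own jobs in the queue
$ squeue -u $USER

# Follow the output file of the example job (job ID 144) as it is written
$ tail -f trainer-144.out

# Cancel the job if needed
$ scancel 144
```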