This repository contains code to reproduce the experiments from our work on Improved Intra-Operator Parallelism for Distributed LLM Training. The associated class paper can be found above.
To run this project, you need to install the required packages. Follow the steps below to install the dependencies from the requirements.txt file.
- Clone the repository:
```bash
git clone https://github.com/livingshade/Metis.git
```
- Navigate to the project directory:
```bash
cd Metis
```
- Install dependencies using the requirements.txt file:
```bash
# If using conda
conda create -n HTDR python=3.11 pip -y
conda activate HTDR
pip install -r requirements.txt
```
- Once all dependencies are installed, you are ready to start running experiments. First, generate synthetic profiling data for a 10-layer model with a global batch size of 128 (a quick sanity check on the output is sketched below).
```bash
python3 gen_synth_data.py 10 128
```
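If you want to confirm that the data was actually written, one path-agnostic option (we don't document the script's output location here) is to list recently modified files:

```bash
# List files modified within the last minute -- a generic way to spot
# the freshly generated synthetic profiling data without assuming
# where gen_synth_data.py writes its output.
find . -type f -mmin -1
```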
- Create your hostfile. To make things easier, we provide a script that quickly generates a hostfile for a cluster made up of a 50/50 split of A100 and V100 nodes, each with 4 GPU cards (an illustrative inspection of the result follows the command):
```bash
python3 gen_hostfile.py 16
```
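A hostfile typically lists one node per line with its GPU count and type. The filename and field layout below are assumptions for illustration only; check gen_hostfile.py for the actual output path and format:

```bash
# Inspect the generated hostfile (filename assumed to be ./hostfile).
# For 16 nodes we would expect 8 A100 entries and 8 V100 entries with
# 4 GPUs each, e.g. lines shaped roughly like:
#   192.168.0.1 slots=4 type=A100
cat hostfile
```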
- Run the simulation experiment with naive Metis. Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value is 612, then this step was done correctly (a log-checking sketch follows the commands).
```bash
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=False

# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=False
```
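To verify the count, you can search the log for it; this assumes the log literally prints a line containing the word "count":

```bash
# Look for the search-step count in the log
# (assumes the log prints a line containing the word "count").
grep -i "count" logs/GPT_1.5B.log
```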
- Run the simulation experiment with our method (same commands as before, but change `USE_STRAT` to `True`). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value is 562, then this step was done correctly.
```bash
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=True

# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=True
```
Count is the number of search steps; average time is the average wall-clock time over `TRIALS` trials. The second value on the 4th line of `logs/GPT_1.5B.log` is the cost of the best plan.
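Given that layout, a one-liner can pull out the best-plan cost, assuming the fields on that line are whitespace-separated:

```bash
# Print the 2nd whitespace-separated field of the 4th line of the log,
# i.e., the cost of the best plan per the description above.
awk 'NR==4 {print $2}' logs/GPT_1.5B.log
```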
NOTE: Wall-clock time varies due to many factors, such as hardware and the current load on your host device. We mitigate this variance by averaging over 10 trials (set by the `TRIALS=10` argument). The cost of our method's best plan may occasionally be worse than Metis's, but on average the costs of our plans are no worse than Metis's. The step counts are fixed, and our count should ALWAYS be lower than Metis's.
- Same commands as the previous step, but change the 10 to a 20 in the first command, 16 to 32 in the second, and 10 to 20 for `NUM_LAYERS` when running either shell script (a loop that runs both variants back to back is sketched after the commands). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. This may take a while. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 58078 and ours is 41208, then this step was done correctly.
```bash
python3 gen_synth_data.py 20 128
python3 gen_hostfile.py 32

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
```
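Since the two runs differ only in `USE_STRAT`, a small loop can run them back to back (Linux variant shown; substitute `sh ./scripts/mac_cost_het_cluster.sh` on MacOS). Note that each run may overwrite `logs/GPT_1.5B.log`, so the sketch saves a copy between runs (the copy filenames are arbitrary):

```bash
# Run Metis (USE_STRAT=False) and then our method (USE_STRAT=True).
for STRAT in False True; do
  source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' \
    MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 \
    MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 \
    TRIALS=5 USE_STRAT=$STRAT
  # Save the log before the next run clobbers it.
  cp logs/GPT_1.5B.log "logs/GPT_1.5B_strat_${STRAT}.log"
done
```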
- Same commands as the previous step, but change 32 to 64 in the second command. Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 3734 and ours is 3482, then this step was done correctly.
```bash
python3 gen_synth_data.py 20 128
python3 gen_hostfile.py 64

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
```
- This one takes a very long time (about 30 minutes per trial), so we only use 1 trial. Same commands as the previous step, but change the 20 to a 40 in the first command, 64 to 128 in the second, and 20 to 40 for `NUM_LAYERS` when running either shell script (a detached-run sketch follows the commands). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 1714619 and ours is 1553492, then this step was done correctly.
```bash
python3 gen_synth_data.py 40 128
python3 gen_hostfile.py 128

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=True
```
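Because this configuration runs on the order of 30 minutes per trial, it can help to launch it detached so the run survives a closed terminal. This is a generic nohup sketch (MacOS variant shown; the output filename is arbitrary):

```bash
# Launch detached; console output is redirected to run_40layer.out.
nohup sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' \
  MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 \
  MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 \
  TRIALS=1 USE_STRAT=True > run_40layer.out 2>&1 &
```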
- Python 3.11
The two graphs found in the paper were generated in `graphs.ipynb`. They are saved in the `plots` directory.
Hardware: 2 c240g5 CloudLab nodes, each fitted with one P100 GPU. Simulation was done on a 2021 Apple M1 MacBook Pro.