This repository contains code to reproduce the experiments from our work on Improved Intra-Operator Parallelism for Distributed LLM Training. The associated class paper can be found above.
To run this project, you need to install the required packages. Follow the steps below to install the dependencies from the requirements.txt file.
- Clone the repository:
```bash
git clone https://github.com/livingshade/Metis.git
```
- Navigate to the project directory:
```bash
cd Metis
```
- Install dependencies using the requirements.txt file:
```bash
# If using conda
conda create -n HTDR python=3.11 pip -y
conda activate HTDR
pip install -r requirements.txt
```
- Once all dependencies are installed, you are ready to start running experiments. First, generate synthetic profiling data for a 10-layer model with a global batch size of 128 (a quick sanity check on the output is sketched below).
```bash
python3 gen_synth_data.py 10 128
```
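If you want to confirm that the data was actually written, one path-agnostic option (we don't document the script's output location here) is to list recently modified files:

```bash
# List files modified within the last minute -- a generic way to spot
# the freshly generated synthetic profiling data without assuming
# where gen_synth_data.py writes its output.
find . -type f -mmin -1
```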
- Create your hostfile. To make things easier, we provide a script that quickly generates a hostfile for a cluster made up of a 50/50 split of A100 and V100 nodes, each with 4 GPU cards (an illustrative inspection of the result follows the command):
```bash
python3 gen_hostfile.py 16
```
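A hostfile typically lists one node per line with its GPU count and type. The filename and field layout below are assumptions for illustration only; check gen_hostfile.py for the actual output path and format:

```bash
# Inspect the generated hostfile (filename assumed to be ./hostfile).
# For 16 nodes we would expect 8 A100 entries and 8 V100 entries with
# 4 GPUs each, e.g. lines shaped roughly like:
#   192.168.0.1 slots=4 type=A100
cat hostfile
```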
- Run the simulation experiment with naive Metis. Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value is 612, then this step was done correctly (a log-checking sketch follows the commands).
```bash
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=False

# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=False
```
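To verify the count, you can search the log for it; this assumes the log literally prints a line containing the word "count":

```bash
# Look for the search-step count in the log
# (assumes the log prints a line containing the word "count").
grep -i "count" logs/GPT_1.5B.log
```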
- Run the simulation experiment with our method (same commands as before, but change `USE_STRAT` to `True`). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value is 562, then this step was done correctly.
```bash
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=True

# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=10 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=10 USE_STRAT=True
```
Count is the number of search steps; average time is the average wall-clock time over `TRIALS` trials. The second value on the 4th line of `logs/GPT_1.5B.log` is the cost of the best plan.
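Given that layout, a one-liner can pull out the best-plan cost, assuming the fields on that line are whitespace-separated:

```bash
# Print the 2nd whitespace-separated field of the 4th line of the log,
# i.e., the cost of the best plan per the description above.
awk 'NR==4 {print $2}' logs/GPT_1.5B.log
```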
NOTE: Wall-clock time varies due to many factors, such as hardware and the current load on your host device. We mitigate this variance by averaging over 10 trials (set by the `TRIALS=10` argument). The cost of our method's best plan may occasionally be worse than Metis's, but on average the costs of our plans are no worse than Metis's. The step counts are fixed, and our count should ALWAYS be lower than Metis's.
- Same commands as the previous step, but change the 10 to a 20 in the first command, 16 to 32 in the second, and 10 to 20 for `NUM_LAYERS` when running either shell script (a loop that runs both variants back to back is sketched after the commands). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. This may take a while. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 58078 and ours is 41208, then this step was done correctly.
```bash
python3 gen_synth_data.py 20 128
python3 gen_hostfile.py 32

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
```
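Since the two runs differ only in `USE_STRAT`, a small loop can run them back to back (Linux variant shown; substitute `sh ./scripts/mac_cost_het_cluster.sh` on MacOS). Note that each run may overwrite `logs/GPT_1.5B.log`, so the sketch saves a copy between runs (the copy filenames are arbitrary):

```bash
# Run Metis (USE_STRAT=False) and then our method (USE_STRAT=True).
for STRAT in False True; do
  source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' \
    MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 \
    MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 \
    TRIALS=5 USE_STRAT=$STRAT
  # Save the log before the next run clobbers it.
  cp logs/GPT_1.5B.log "logs/GPT_1.5B_strat_${STRAT}.log"
done
```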
- Same commands as the previous step, but change 32 to 64 in the second command. Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 3734 and ours is 3482, then this step was done correctly.
```bash
python3 gen_synth_data.py 20 128
python3 gen_hostfile.py 64

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=20 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=5 USE_STRAT=True
```
- This one takes a very long time (about 30 minutes per trial), so we only use 1 trial. Same commands as the previous step, but change the 20 to a 40 in the first command, 64 to 128 in the second, and 20 to 40 for `NUM_LAYERS` when running either shell script (a detached-run sketch follows the commands). Please make sure to pass the absolute file path to Metis (the local repository) on your device as `HOME_DIR`. The results should be found in `logs/GPT_1.5B.log`. If the count value of Metis is 1714619 and ours is 1553492, then this step was done correctly.
```bash
python3 gen_synth_data.py 40 128
python3 gen_hostfile.py 128

# Use Metis
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=False
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=False

# Use ours
# MacOS
sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=True
# Linux
source ./scripts/cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 TRIALS=1 USE_STRAT=True
```
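Because this configuration runs on the order of 30 minutes per trial, it can help to launch it detached so the run survives a closed terminal. This is a generic nohup sketch (MacOS variant shown; the output filename is arbitrary):

```bash
# Launch detached; console output is redirected to run_40layer.out.
nohup sh ./scripts/mac_cost_het_cluster.sh HOME_DIR='ABSOLUTE_FILE_PATH_TO_METIS' \
  MODEL_NAME=GPT MODEL_SIZE=1.5B NUM_LAYERS=40 GBS=128 MAX_PROFILED_TP=128 \
  MAX_PROFILED_BATCH_SIZE=128 SCALE_VARIANCE=1 MAX_PERMUTE_LEN=128 \
  TRIALS=1 USE_STRAT=True > run_40layer.out 2>&1 &
```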
- Python 3.11
The two graphs found in the paper were generated in `graphs.ipynb`. They are saved in the `plots` directory.
Hardware: 2 c240g5 CloudLab nodes, each fitted with one P100 GPU. Simulation was done on a 2021 Apple M1 MacBook Pro.