This repository explores and optimizes global contrastive loss functions, focusing on enhancing image-text representation learning. The project was developed as part of the CSCE 636 Deep Learning course (Fall 24) at Texas A&M University.
| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|---|---|---|---|---|
| CLIP (Benchmark) | 12.00 | 9.32 | 21.35 | 14.22 |
| SogCLR (Provided codebase) | 14.38 (+19.8%) | 10.73 (+15.1%) | 24.54 (+15.0%) | 16.55 (+16.4%) |
| iSogCLR_New (Ours) | 14.86 (+23.8%) | 10.52 (+12.8%) | 29.37 (+37.6%) | 18.25 (+28.3%) |
The final model checkpoint can be downloaded at this link: https://drive.google.com/file/d/1cFobjg78IdLlY0Ftk24roIRpeda0IVQO/view?usp=sharing
This solution won the class competition (over 100 students) with a test objective score of 19.1594. See the CERTIFICATE OF RECOGNITION.
There are three metrics evaluated in this final project:
- zero-shot classification on the ImageNet test set
- image retrieval and text retrieval on the MSCOCO test set
For zero-shot classification, we prepare a list of templates that convert each class name (label) into a set of sentences. For each image, we compute the similarity between its image feature (produced by the model's image encoder) and the features of all sentences generated from the class names (produced by the model's text encoder), and then compute the top-1 accuracy from the similarity scores.
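The sketch below illustrates this template-based zero-shot evaluation; `encode_image`, `encode_text`, and the template list are illustrative assumptions, not the exact interfaces of this repository.

```python
# Hypothetical sketch of template-based zero-shot classification.
import torch
import torch.nn.functional as F

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

@torch.no_grad()
def zero_shot_accuracy(model, image_loader, class_names, device="cuda"):
    # Build one text embedding per class by averaging over all templates.
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = model.encode_text(prompts)              # (num_templates, dim); assumed API
        emb = F.normalize(emb, dim=-1).mean(dim=0)
        class_embs.append(F.normalize(emb, dim=-1))
    class_embs = torch.stack(class_embs).to(device)   # (num_classes, dim)

    correct = total = 0
    for images, labels in image_loader:
        img_emb = F.normalize(model.encode_image(images.to(device)), dim=-1)
        sims = img_emb @ class_embs.T                  # cosine similarities
        preds = sims.argmax(dim=-1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total                             # top-1 accuracy
```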
For the retrieval tasks, we first use the model to generate features for all images and all captions, then compute their pairwise similarity scores (as a matrix).

For image-to-text retrieval, we compute recall@1 for each image based on its similarity scores with all captions; for text-to-image retrieval, we compute recall@1 for each caption based on its similarity scores with all images.
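As a concrete illustration, the sketch below computes recall@1 in both directions from precomputed features; the L2-normalized feature tensors and the `txt2img` ground-truth mapping are assumed inputs.

```python
# Minimal sketch of recall@1 for both retrieval directions, assuming precomputed,
# L2-normalized image and text features and a mapping txt2img that gives the index
# of the matching image for each caption (names are illustrative).
import torch

def recall_at_1(image_feats, text_feats, txt2img):
    sims = image_feats @ text_feats.T          # (num_images, num_captions)

    # Image-to-text: is each image's top-ranked caption one of its correct matches?
    img2txt = [[j for j, i in enumerate(txt2img) if i == img]
               for img in range(len(image_feats))]
    i2t_hits = sum(int(sims[img].argmax().item() in img2txt[img])
                   for img in range(len(image_feats)))

    # Text-to-image: is each caption's top-ranked image the correct one?
    t2i_hits = sum(int(sims[:, cap].argmax().item() == txt2img[cap])
                   for cap in range(len(text_feats)))

    return i2t_hits / len(image_feats), t2i_hits / len(text_feats)
```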
Regarding the (ImageNet and MSCOCO) test set construction, we randomly sample the same number of samples from each class of the training set and combine them into a test set (e.g., we sample 50 images from each of the 1000 ImageNet classes to construct the 50k test set). Since the ImageNet and MSCOCO training sets were not provided for training, your models did not 'see' any samples in our test set.
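For illustration, a balanced test set of this kind could be assembled roughly as follows; the `class_to_paths` mapping is hypothetical.

```python
# Illustrative sketch of balanced per-class sampling (e.g. 50 images from each
# of the 1000 ImageNet classes); class_to_paths maps label -> list of image paths.
import random

def build_balanced_testset(class_to_paths, per_class=50, seed=0):
    rng = random.Random(seed)
    testset = []
    for label, paths in class_to_paths.items():
        # Assumes each class has at least `per_class` images available.
        for path in rng.sample(paths, per_class):
            testset.append((path, label))
    rng.shuffle(testset)
    return testset   # 1000 classes x 50 images = 50k samples for ImageNet
```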
We used the following data in this project:
- https://drive.google.com/file/d/142zQjlOw0Xw4tKzXMrQjYE6NtGRTeasT/view?usp=drive_link
- https://drive.google.com/file/d/142xxRoMaHxX3BIfCw_1b_G_dgu-02Yq3/view?usp=drive_link
- https://drive.google.com/file/d/1NXhfhwFy-nhdABACkodgYqm9pomDKE39/view?usp=sharing
Repository structure:
- `models/`: Image and text encoders (ResNet-50 and DistilBERT) with modular loss functions.
- `notebooks/`: Jupyter notebooks for result analysis and experimentation.
- `optim/`: Custom optimizers including AdamW, RAdam, and NovoGrad.
- `scheduler/`: Learning rate schedulers for warmup, cooldown, and decay.
- `zeroshot_transfer/`: Evaluation scripts for zero-shot classification.
- `documentation/`: Project report detailing the methodology, experiments, and results.
This repository extends the original provided codebase with several enhancements:
- AWS SageMaker Integration: Enables seamless training of models in distributed environments using SageMaker.
- Modularized Code: Refactored for easy integration of new loss functions, optimizers, and datasets.
- Advanced Hyperparameter Tuning: Incorporates Bayesian optimization for tuning critical parameters such as learning rates, temperature, and regularization (see the sketch after this list).
- Robust Evaluation Pipeline: Enhanced evaluation metrics and dataset handling for retrieval and classification tasks.
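Below is a minimal sketch of how such a Bayesian tuning job could be configured with the SageMaker Python SDK; the role placeholder, hyperparameter ranges, metric regex, and framework versions are illustrative assumptions and do not mirror `tuning.py` exactly.

```python
# Illustrative sketch of a SageMaker Bayesian hyperparameter tuning setup.
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

estimator = PyTorch(
    entry_point="main.py",
    source_dir=".",
    role="<Your SageMaker role>",      # e.g. taken from config.py
    instance_count=1,
    instance_type="ml.g5.4xlarge",
    framework_version="2.0.1",
    py_version="py310",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="BestObjectiveValue",
    hyperparameter_ranges={
        # Ranges below are assumptions for illustration only.
        "tau_init": ContinuousParameter(0.005, 0.05),
        "sogclr_gamma": ContinuousParameter(0.5, 0.95),
        "eta_init": ContinuousParameter(0.001, 0.1),
    },
    metric_definitions=[{"Name": "BestObjectiveValue",
                         "Regex": "BestObjectiveValue=([0-9\\.]+)"}],
    strategy="Bayesian",
    objective_type="Maximize",
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit()
```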
- Prepare datasets: Ensure the dataset folder structure matches the layout below.
  - `datasets/`: Organized datasets for training and validation:
    - `cc3m_subset_100k/`: Training subset of Conceptual Captions 3M.
    - `clip_train/`: Metadata for training and validation datasets.
    - `mscoco_val/`: MSCOCO validation data (image-text retrieval).
    - `imagenet/`: ImageNet validation data (zero-shot classification).
- Install dependencies:
pip install -r requirements.txt
- [Optional] If you want to use SageMaker training and tuning, create config.py with the following:
  role = "<Your SageMaker role>"
  region = "<Your AWS region>"
  aws_access_key_id = "<Your access key id>"
  aws_secret_access_key = "<Your secret access key>"
- Test run the main script:
python main.py \
--data_path "./datasets" \
--ann_path "datasets/clip_train" \
--zs_datafolder "datasets/imagenet/val" \
--train_file cc3m_train_subset.json \
--train_image_root cc3m_subset_100k \
--output_dir "./test_output" \
--loss_function isogclr_new \
--optimizer fusedadam \
--tau_init 0.01 \
--sogclr_gamma 0.8 \
--eta_init 0.03 --sched cosine \
--device cuda \
--val_frequency 5 \
--epochs 1
- Test create a SageMaker training job:
python train_sagemaker.py \
  --entry_point main.py \
  --source_dir . \
  --instance_type ml.g5.4xlarge \
  --use_spot \
  --max_wait 36000 \
  --config_file ./config.json \
  --job_name "Test-improve-clip"
- Test create a SageMaker tuning job:
python tuning.py \
  --entry_point main.py \
  --source_dir . \
  --instance_type ml.g5.4xlarge \
  --use_spot \
  --max_wait 36000 \
  --config_file ./config_phase3.json \
  --job_name "improved-clip-phase3"
- [Optional] Continue tuning from an existing finished tuning job:
python warm_start_tuning.py \
  --job_name improved-clip-phase3-extended \
  --entry_point main.py \
  --instance_type ml.g5.4xlarge \
  --source_dir . \
  --config_file phase3_extended.json \
  --max_wait 36000 \
  --previous_job_name improved-clip-phase3-241127-1727 \
  --objective_metric_name BestObjectiveValue \
  --use_spot
Known issues:
- `isogclr_new_v1` did not work with the existing code.
- Setting `isogclr_temp_net=1` breaks the training.
If you use this work, please cite it as follows:
@misc{omarkhater2024improvedclip,
  author = {Khater, Omar and Norman, Michael},
  title = {Improving CLIP Training with Bayesian Optimization},
  year = {2024},
  url = {https://github.com/omarkhater-school/improved-clip}
}