This repository explores and optimizes global contrastive loss functions, focusing on enhancing image-text representation learning. The project was developed as part of the CSCE 636 Deep Learning course (Fall 24) at Texas A&M University.
| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|---|---|---|---|---|
| CLIP (Benchmark) | 12.00 | 9.32 | 21.35 | 14.22 |
| SogCLR (Provided codebase) | 14.38 (+19.8%) | 10.73 (+15.1%) | 24.54 (+15.0%) | 16.55 (+16.4%) |
| iSogCLR_New (Ours) | 14.86 (+23.8%) | 10.52 (+12.8%) | 29.37 (+37.6%) | 18.25 (+28.3%) |
The final model checkpoint can be downloaded at this link: https://drive.google.com/file/d/1cFobjg78IdLlY0Ftk24roIRpeda0IVQO/view?usp=sharing
This solution won the class competition (over 100 students) with a test objective score of 19.1594. See the CERTIFICATE OF RECOGNITION.
There are three metrics evaluated in this final project:
- zero-shot classification on the ImageNet test set
- image retrieval and text retrieval on the MSCOCO test set
For zero-shot classification, we prepare a list of templates that convert each class name (label) into a set of sentences. For each image, we compute the similarity between its image feature (produced by the model's image encoder) and the features of all sentences generated from the class names (produced by the model's text encoder), and then compute the top-1 accuracy from the similarity scores.
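The sketch below illustrates this template-based zero-shot evaluation; `encode_image`, `encode_text`, and the template list are illustrative assumptions, not the exact interfaces of this repository.

```python
# Hypothetical sketch of template-based zero-shot classification.
import torch
import torch.nn.functional as F

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

@torch.no_grad()
def zero_shot_accuracy(model, image_loader, class_names, device="cuda"):
    # Build one text embedding per class by averaging over all templates.
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = model.encode_text(prompts)              # (num_templates, dim); assumed API
        emb = F.normalize(emb, dim=-1).mean(dim=0)
        class_embs.append(F.normalize(emb, dim=-1))
    class_embs = torch.stack(class_embs).to(device)   # (num_classes, dim)

    correct = total = 0
    for images, labels in image_loader:
        img_emb = F.normalize(model.encode_image(images.to(device)), dim=-1)
        sims = img_emb @ class_embs.T                  # cosine similarities
        preds = sims.argmax(dim=-1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total                             # top-1 accuracy
```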
For the retrieval tasks, we first use the model to generate features for all images and all captions, then compute their pairwise similarity scores (as a matrix).

For image-to-text retrieval, we compute recall@1 for each image based on its similarity scores with all captions; for text-to-image retrieval, we compute recall@1 for each caption based on its similarity scores with all images.
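As a concrete illustration, the sketch below computes recall@1 in both directions from precomputed features; the L2-normalized feature tensors and the `txt2img` ground-truth mapping are assumed inputs.

```python
# Minimal sketch of recall@1 for both retrieval directions, assuming precomputed,
# L2-normalized image and text features and a mapping txt2img that gives the index
# of the matching image for each caption (names are illustrative).
import torch

def recall_at_1(image_feats, text_feats, txt2img):
    sims = image_feats @ text_feats.T          # (num_images, num_captions)

    # Image-to-text: is each image's top-ranked caption one of its correct matches?
    img2txt = [[j for j, i in enumerate(txt2img) if i == img]
               for img in range(len(image_feats))]
    i2t_hits = sum(int(sims[img].argmax().item() in img2txt[img])
                   for img in range(len(image_feats)))

    # Text-to-image: is each caption's top-ranked image the correct one?
    t2i_hits = sum(int(sims[:, cap].argmax().item() == txt2img[cap])
                   for cap in range(len(text_feats)))

    return i2t_hits / len(image_feats), t2i_hits / len(text_feats)
```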
Regarding the (ImageNet and MSCOCO) test set construction, we randomly sample the same number of samples from each class of the training set and combine them into a test set (e.g., we sample 50 images from each of the 1000 ImageNet classes to construct the 50k test set). Since the ImageNet and MSCOCO training sets were not provided for training, your models did not 'see' any samples in our test set.
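For illustration, a balanced test set of this kind could be assembled roughly as follows; the `class_to_paths` mapping is hypothetical.

```python
# Illustrative sketch of balanced per-class sampling (e.g. 50 images from each
# of the 1000 ImageNet classes); class_to_paths maps label -> list of image paths.
import random

def build_balanced_testset(class_to_paths, per_class=50, seed=0):
    rng = random.Random(seed)
    testset = []
    for label, paths in class_to_paths.items():
        # Assumes each class has at least `per_class` images available.
        for path in rng.sample(paths, per_class):
            testset.append((path, label))
    rng.shuffle(testset)
    return testset   # 1000 classes x 50 images = 50k samples for ImageNet
```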
We used the following data in this project:
- https://drive.google.com/file/d/142zQjlOw0Xw4tKzXMrQjYE6NtGRTeasT/view?usp=drive_link
- https://drive.google.com/file/d/142xxRoMaHxX3BIfCw_1b_G_dgu-02Yq3/view?usp=drive_link
- https://drive.google.com/file/d/1NXhfhwFy-nhdABACkodgYqm9pomDKE39/view?usp=sharing
Repository structure:
- `models/`: Image and text encoders (ResNet-50 and DistilBERT) with modular loss functions.
- `notebooks/`: Jupyter notebooks for result analysis and experimentation.
- `optim/`: Custom optimizers including AdamW, RAdam, and NovoGrad.
- `scheduler/`: Learning rate schedulers for warmup, cooldown, and decay.
- `zeroshot_transfer/`: Evaluation scripts for zero-shot classification.
- `documentation/`: Project report detailing the methodology, experiments, and results.
This repository extends the original provided codebase with several enhancements:
- AWS SageMaker Integration: Enables seamless training of models in distributed environments using SageMaker.
- Modularized Code: Refactored for easy integration of new loss functions, optimizers, and datasets.
- Advanced Hyperparameter Tuning: Incorporates Bayesian optimization for tuning critical parameters such as learning rates, temperature, and regularization (see the sketch after this list).
- Robust Evaluation Pipeline: Enhanced evaluation metrics and dataset handling for retrieval and classification tasks.
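Below is a minimal sketch of how such a Bayesian tuning job could be configured with the SageMaker Python SDK; the role placeholder, hyperparameter ranges, metric regex, and framework versions are illustrative assumptions and do not mirror `tuning.py` exactly.

```python
# Illustrative sketch of a SageMaker Bayesian hyperparameter tuning setup.
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

estimator = PyTorch(
    entry_point="main.py",
    source_dir=".",
    role="<Your SageMaker role>",      # e.g. taken from config.py
    instance_count=1,
    instance_type="ml.g5.4xlarge",
    framework_version="2.0.1",
    py_version="py310",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="BestObjectiveValue",
    hyperparameter_ranges={
        # Ranges below are assumptions for illustration only.
        "tau_init": ContinuousParameter(0.005, 0.05),
        "sogclr_gamma": ContinuousParameter(0.5, 0.95),
        "eta_init": ContinuousParameter(0.001, 0.1),
    },
    metric_definitions=[{"Name": "BestObjectiveValue",
                         "Regex": "BestObjectiveValue=([0-9\\.]+)"}],
    strategy="Bayesian",
    objective_type="Maximize",
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit()
```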
- Prepare datasets: Ensure the dataset folder structure matches the layout below.
  - `datasets/`: Organized datasets for training and validation:
    - `cc3m_subset_100k/`: Training subset of Conceptual Captions 3M.
    - `clip_train/`: Metadata for training and validation datasets.
    - `mscoco_val/`: MSCOCO validation data (image-text retrieval).
    - `imagenet/`: ImageNet validation data (zero-shot classification).
- Install dependencies:
pip install -r requirements.txt
- [Optional] If you want to use SageMaker training and tuning, create config.py with the following:
  role = "<Your SageMaker role>"
  region = "<Your AWS region>"
  aws_access_key_id = "<Your access key id>"
  aws_secret_access_key = "<Your secret access key>"
- Test run the main script:
python main.py \
--data_path "./datasets" \
--ann_path "datasets/clip_train" \
--zs_datafolder "datasets/imagenet/val" \
--train_file cc3m_train_subset.json \
--train_image_root cc3m_subset_100k \
--output_dir "./test_output" \
--loss_function isogclr_new \
--optimizer fusedadam \
--tau_init 0.01 \
--sogclr_gamma 0.8 \
--eta_init 0.03 --sched cosine \
--device cuda \
--val_frequency 5 \
--epochs 1
- Test create a SageMaker training job:
python train_sagemaker.py \
  --entry_point main.py \
  --source_dir . \
  --instance_type ml.g5.4xlarge \
  --use_spot \
  --max_wait 36000 \
  --config_file ./config.json \
  --job_name "Test-improve-clip"
- Test create a SageMaker tuning job:
python tuning.py \
  --entry_point main.py \
  --source_dir . \
  --instance_type ml.g5.4xlarge \
  --use_spot \
  --max_wait 36000 \
  --config_file ./config_phase3.json \
  --job_name "improved-clip-phase3"
- [Optional] Continue tuning from an existing finished tuning job:
python warm_start_tuning.py \
  --job_name improved-clip-phase3-extended \
  --entry_point main.py \
  --instance_type ml.g5.4xlarge \
  --source_dir . \
  --config_file phase3_extended.json \
  --max_wait 36000 \
  --previous_job_name improved-clip-phase3-241127-1727 \
  --objective_metric_name BestObjectiveValue \
  --use_spot
Known issues:
- `isogclr_new_v1` did not work with the existing code.
- Setting `isogclr_temp_net=1` breaks the training.
If you use this work, please cite it as follows:
@misc{omarkhater2024improvedclip,
  author = {Khater, Omar and Norman, Michael},
  title = {Improving CLIP Training with Bayesian Optimization},
  year = {2024},
  url = {https://github.com/omarkhater-school/improved-clip}
}