
Improving CLIP Training with Bayesian Optimization

This repository explores and optimizes global contrastive loss functions to improve image-text representation learning. The project was developed as part of the CSCE 636 Deep Learning course (Fall 2024) at Texas A&M University.


Benchmark Comparison

| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
| --- | --- | --- | --- | --- |
| CLIP (Benchmark) | 12.00 | 9.32 | 21.35 | 14.22 |
| SogCLR (Provided codebase) | 14.38 (+19.8%) | 10.73 (+15.1%) | 24.54 (+15.0%) | 16.55 (+16.4%) |
| iSogCLR_New (Ours) | 14.86 (+23.8%) | 10.52 (+12.8%) | 29.37 (+37.6%) | 18.25 (+28.3%) |

Final Model Benchmark Download

The final model checkpoint can be downloaded at this link: https://drive.google.com/file/d/1cFobjg78IdLlY0Ftk24roIRpeda0IVQO/view?usp=sharing


Team Leaderboard

This solution won the class competition (over 100 students) with a test objective score of 19.1594. See the CERTIFICATE OF RECOGNITION.

Evaluation criteria

  • Three metrics are evaluated in this final project:

    • zero-shot classification on the ImageNet test set
    • image retrieval and text retrieval on the MSCOCO test set
  • For zero-shot classification, we prepare a list of templates to convert each class name (label) into a series of sentences. For each image, we compute the similarity between the image feature (produced by the model's image encoder) and the features of all sentences converted from the class names (produced by the model's text encoder), and finally compute top-1 accuracy based on the similarity scores.

  • For the retrieval tasks, we first use the model to generate features for all images and all captions, then compute their similarity scores (as a matrix).

  • For image-to-text retrieval, we compute recall@1 for each image based on its similarity scores with all captions; for text-to-image retrieval, we compute recall@1 for each caption based on its similarity scores with all images. (A minimal sketch of both evaluations is shown after this list.)

  • Regarding construction of the (ImageNet and MSCOCO) test sets, we randomly sample the same number of samples from each class of the training set and combine them into a test set (e.g., we sample 50 images from each of the 1,000 ImageNet classes to construct a 50k-image test set). Since the ImageNet and MSCOCO training sets were not provided for training, your models did not 'see' any samples in our test set.
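
The snippet below is a minimal sketch of the evaluation described above, assuming a PyTorch-style setup with an image encoder and a text encoder that return embedding tensors; the names (image_encoder, text_encoder, templates, img2txt) are illustrative assumptions, not this repository's actual API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top1(image_encoder, text_encoder, images, labels, class_names, templates):
    # Build one text embedding per class by averaging over the prompt templates.
    class_feats = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]              # e.g. "a photo of a {}."
        feats = F.normalize(text_encoder(prompts), dim=-1)         # (num_templates, dim)
        class_feats.append(F.normalize(feats.mean(dim=0), dim=-1))
    class_feats = torch.stack(class_feats)                         # (num_classes, dim)

    img_feats = F.normalize(image_encoder(images), dim=-1)         # (num_images, dim)
    preds = (img_feats @ class_feats.t()).argmax(dim=-1)           # most similar class per image
    return (preds == labels).float().mean().item()                 # top-1 accuracy

@torch.no_grad()
def image_to_text_recall_at_1(img_feats, txt_feats, img2txt):
    # img2txt[i] is the set of ground-truth caption indices for image i;
    # text-to-image recall@1 is computed symmetrically on the transposed matrix.
    sims = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).t()
    top1 = sims.argmax(dim=-1)                                     # best caption per image
    hits = [int(top1[i].item() in img2txt[i]) for i in range(len(img2txt))]
    return sum(hits) / len(hits)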

Data


We used the following data in this project:

  • https://drive.google.com/file/d/142zQjlOw0Xw4tKzXMrQjYE6NtGRTeasT/view?usp=drive_link
  • https://drive.google.com/file/d/142xxRoMaHxX3BIfCw_1b_G_dgu-02Yq3/view?usp=drive_link
  • https://drive.google.com/file/d/1NXhfhwFy-nhdABACkodgYqm9pomDKE39/view?usp=sharing

Repository Structure

Key Folders

  • models/: Image and text encoders (ResNet-50 and DistilBERT) with modular loss functions.
  • notebooks/: Jupyter notebooks for result analysis and experimentation.
  • optim/: Custom optimizers including AdamW, RAdam, and NovoGrad.
  • scheduler/: Learning rate schedulers for warmup, cooldown, and decay.
  • zeroshot_transfer/: Evaluation scripts for zero-shot classification.
  • documentation/: Contains the project report detailing the methodology, experiments, and results.

Key Improvements

This repository extends the originally provided codebase with several enhancements:

  1. AWS SageMaker Integration: Enables seamless training of models in distributed environments using SageMaker.
  2. Modularized Code: Refactored for easy integration of new loss functions, optimizers, and datasets.
  3. Advanced Hyperparameter Tuning: Incorporates Bayesian optimization for tuning critical parameters such as learning rates, temperature, and regularization (see the sketch after this list).
  4. Robust Evaluation Pipeline: Enhanced evaluation metrics and dataset handling for retrieval and classification tasks.
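
As an illustration of how such a Bayesian tuning job can be launched with the SageMaker Python SDK, the sketch below shows the general pattern; the estimator settings, hyperparameter ranges, and metric regex are assumptions for illustration, not the exact values used by tuning.py.

from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Estimator that runs main.py on a GPU instance (framework/Python versions are assumptions).
estimator = PyTorch(
    entry_point="main.py",
    source_dir=".",
    role="<Your SageMaker role>",
    instance_type="ml.g5.4xlarge",
    instance_count=1,
    framework_version="2.0",
    py_version="py310",
)

# Bayesian search over a few critical hyperparameters (ranges are placeholders).
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="BestObjectiveValue",
    objective_type="Maximize",
    strategy="Bayesian",
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-3),
        "tau_init": ContinuousParameter(0.005, 0.05),
        "sogclr_gamma": ContinuousParameter(0.6, 0.95),
    },
    metric_definitions=[{"Name": "BestObjectiveValue",
                         "Regex": "BestObjectiveValue=([0-9\\.]+)"}],
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"training": "s3://<your-bucket>/datasets"})  # placeholder S3 input channel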

Getting Started

  1. Prepare datasets: Ensure the dataset folder structure matches:
  • datasets/: Organized datasets for training and validation:
    • cc3m_subset_100k/: Training subset of Conceptual Captions 3M.
    • clip_train/: Metadata for training and validation datasets.
    • mscoco_val/: MSCOCO validation data (image-text retrieval).
    • imagenet/: ImageNet validation data (zero-shot classification).
  2. Install dependencies:
  pip install -r requirements.txt
  3. [Optional] To use SageMaker training and tuning, create config.py with the following:
role = "<Your SageMaker role>"
region = "<Your AWS region>"
aws_access_key_id = "<Your access key id>"
aws_secret_access_key = "<Your secret access key>"
  4. Test run the main script:
python main.py \
  --data_path "./datasets" \
  --ann_path "datasets/clip_train" \
  --zs_datafolder "datasets/imagenet/val" \
  --train_file cc3m_train_subset.json \
  --train_image_root cc3m_subset_100k \
  --output_dir "./test_output" \
  --loss_function isogclr_new \
  --optimizer fusedadam \
  --tau_init 0.01 \
  --sogclr_gamma 0.8 \
  --eta_init 0.03 --sched cosine \
  --device cuda \
  --val_frequency 5 \
  --epochs 1
  5. Test creating a SageMaker training job:

    python train_sagemaker.py \
     --entry_point main.py \
     --source_dir . \
     --instance_type ml.g5.4xlarge \
     --use_spot \
     --max_wait 36000 \
     --config_file ./config.json \
     --job_name "Test-improve-clip"
  6. Test creating a SageMaker tuning job:

    python tuning.py \
     --entry_point main.py \
     --source_dir . \
     --instance_type ml.g5.4xlarge \
     --use_spot \
     --max_wait 36000 \
     --config_file ./config_phase3.json \
     --job_name "improved-clip-phase3"
  7. [Optional] Continue tuning from a finished tuning job:

    python warm_start_tuning.py \
     --job_name improved-clip-phase3-extended \
     --entry_point main.py \
     --instance_type ml.g5.4xlarge \
     --source_dir . \
     --config_file phase3_extended.json \
     --max_wait 36000 \
     --previous_job_name improved-clip-phase3-241127-1727 \
     --objective_metric_name BestObjectiveValue \
     --use_spot
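
For reference, the warm-start mechanism that warm_start_tuning.py presumably builds on is the SageMaker SDK's WarmStartConfig; the sketch below is an assumption-based illustration of that API, not the script's actual contents.

from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             WarmStartConfig, WarmStartTypes)

estimator = PyTorch(entry_point="main.py", source_dir=".",
                    role="<Your SageMaker role>", instance_type="ml.g5.4xlarge",
                    instance_count=1, framework_version="2.0", py_version="py310")

# Reuse the trials of a finished tuning job as the starting point for a new search.
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"improved-clip-phase3-241127-1727"},   # previous tuning job name
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="BestObjectiveValue",
    hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-3)},  # placeholder range
    metric_definitions=[{"Name": "BestObjectiveValue",
                         "Regex": "BestObjectiveValue=([0-9\\.]+)"}],
    strategy="Bayesian",
    warm_start_config=warm_start,
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"training": "s3://<your-bucket>/datasets"})  # placeholder S3 input channel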

Known Bugs


  • isogclr_new_v1 does not work with the existing code.
  • isogclr_temp_net=1 breaks training.

Citation


If you use this work, please cite it as follows:

@misc{omarkhater2024improvedclip,
  author = {Omar Khater and Michael Norman},
  title  = {Improving CLIP Training with Bayesian Optimization},
  year   = {2024},
  url    = {https://github.com/omarkhater-school/improved-clip}
}
