|
# Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)

**Table of Contents**

- [Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)](#contrastive-language-image-pre-training-with-eva-eva-clip)
  - [Model Card](#model-card)
  - [Performance of EVA-CLIP Vision Encoder on ImageNet-1K](#performance-of-eva-clip-vision-encoder-on-imagenet-1k)
  - [EVA-CLIP Zero-shot Evaluation Results](#eva-clip-zero-shot-evaluation-results)
    - [**All 35 Benchmark Results in Details**](#all-35-benchmark-results-in-details)
    - [Zero-shot Image Classification Evaluation](#zero-shot-image-classification-evaluation)
    - [Zero-shot Video Action Recognition Evaluation](#zero-shot-video-action-recognition-evaluation)
    - [Zero-shot Retrieval Evaluation](#zero-shot-retrieval-evaluation)
  - [Usage](#usage)
  - [Acknowledgement](#acknowledgement)


## Model Card

<div align="center">

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `eva_clip_psz14` | 1.1B | `fp16` | [LAION-400M](https://laion.ai/laion-400-open-dataset/) | 41K | 78.5 | [🤗 HF link](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt) (`2GB`) |

</div>

We choose to train a 1.1B CLIP model, not because it is easy, but because it is hard. Please refer to [this note](https://docs.google.com/document/d/1FXosAZ3wMrzThgnWR6KSkXIz4IMItq3umDGos38pJps/edit) for a glance at the challenges in training very large CLIP models.

To our knowledge, EVA-CLIP is **the largest performant open-sourced CLIP model** evaluated via zero-shot classification performance, especially on mainstream classification benchmarks such as ImageNet and its variants.
For more details about EVA-CLIP, please refer to Section 2.3.5 of [our paper](https://arxiv.org/pdf/2211.07636.pdf).

We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, *etc.*, and we hope our solution for scaling up CLIP models can provide insight for practitioners studying large foundation models.
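
If you want to fetch the checkpoint listed in the model card programmatically, one possible way (an optional convenience on our side, not a required workflow: it assumes the `huggingface_hub` package is installed) is:

```python
# Optional: download the EVA-CLIP weight from the HF link in the model card.
# Requires `pip install huggingface_hub`; downloading by hand works just as well.
from huggingface_hub import hf_hub_download

eva_clip_path = hf_hub_download(repo_id="BAAI/EVA", filename="eva_clip_psz14.pt")
print(eva_clip_path)  # local cache path, usable as `pretrained=` in the Usage section below
```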


## Performance of EVA-CLIP Vision Encoder on ImageNet-1K

<div align="center">

| model | zero-shot @ 224px | linear probing @ 224px | linear probing @ 336px | fine-tuning @ 224px | fine-tuning @ 336px |
|:-----:|:------:|:------:|:------:|:------:|:------:|
| EVA-CLIP | **78.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt) \| [log](https://wandb.ai/baaivision/eva-clip/reports/ViT-g-14--VmlldzoyOTkwMDYy)) | **86.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz224_lincls_86p5.pth) \| [log](../logs/cls/linear_eva_clip_vision_enc_1k_cls_sz224_86p5.txt)) | **86.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz336_lincls_86p5.pth) \| [log](../logs/cls/linear_eva_clip_vision_enc_1k_cls_sz336_86p5.txt)) | **89.1** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz224_ftcls_89p1.pt) \| [log](../logs/cls/ft_eva_clip_vision_enc_1k_cls_sz224_89p1.txt)) | **89.4** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz336_ftcls_89p4.pt) \| [log](../logs/cls/ft_eva_clip_vision_enc_1k_cls_sz336_89p4.txt)) |

</div>

EVA-CLIP achieves state-of-the-art top-1 accuracy on ImageNet-1K among all self-supervised learning approaches.
We will provide instructions for reproducing these results soon.


## EVA-CLIP Zero-shot Evaluation Results


<div align="center">

### [**All 35 Benchmark Results in Details**](./benchmark.md)

</div>


### Zero-shot Image Classification Evaluation

Top-1 accuracy on ImageNet-1K, its variants (IN-V2, IN-Adv., IN-Ren., IN-Ske.), and ObjectNet.

<div align="center">

[Papers with Code: self-supervised image classification leaderboard](https://paperswithcode.com/sota/self-supervised-image-classification-with?p=eva-exploring-the-limits-of-masked-visual)

| model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
|-------|:-----:|:-----:|:-------:|:-------:|:-------:|:---------:|
| OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
| Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
| Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
| EVA CLIP-g | **78.53** | **71.52** | **73.59** | **92.5** | **67.31** | **72.33** |

</div>

### Zero-shot Video Action Recognition Evaluation

Zero-shot performance on video action recognition benchmarks.

<div align="center">

| model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
|-------|:-------:|:------------:|:------------:|:------------:|
| OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
| Open CLIP-H | **78.16** | 63.06 | 63.58 | 56.09 |
| Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
| EVA CLIP-g | 76.05 | **65.23** | **64.38** | **58.4** |

</div>

> For video action recognition, we sample only a single center frame from each video, turning it into an image classification task.
> Following the conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.

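
To make the protocol above concrete, here is a minimal, illustrative sketch of the metric computation. It assumes `similarity` is an `N_videos x N_classes` matrix of EVA-CLIP image-text scores computed from one center frame per video (the tensors below are random placeholders); it is not the evaluation script used for the table.

```python
import torch

def topk_accuracy(similarity: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of videos whose ground-truth class is among the top-k scores.

    `similarity` is assumed to be an (N_videos, N_classes) matrix of
    image-text scores for one center frame per video.
    """
    topk = similarity.topk(k, dim=-1).indices              # (N, k) predicted class ids
    correct = (topk == labels.unsqueeze(-1)).any(dim=-1)   # True if label is among top-k
    return correct.float().mean().item()

# Hypothetical scores and labels, for illustration only.
similarity = torch.randn(8, 400)          # e.g. 8 videos, 400 Kinetics classes
labels = torch.randint(0, 400, (8,))

top1 = topk_accuracy(similarity, labels, k=1)
top5 = topk_accuracy(similarity, labels, k=5)
print("UCF-101-style metric (top-1):", top1)
print("Kinetics-style metric (mean of top-1 and top-5):", (top1 + top5) / 2)
```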
|
### Zero-shot Retrieval Evaluation

<div align="center">

<table>
  <tr>
    <td rowspan=2>Dataset</td>
    <td rowspan=2>Model</td>
    <td colspan=3>Text-to-Image Retrieval</td>
    <td colspan=3>Image-to-Text Retrieval</td>
  </tr>
  <tr>
    <td>R@1</td>
    <td>R@5</td>
    <td>R@10</td>
    <td>R@1</td>
    <td>R@5</td>
    <td>R@10</td>
  </tr>
  <tr>
    <td rowspan=4>Flickr30k</td>
    <td>OpenAI CLIP-L</td>
    <td>65.18</td>
    <td>87.28</td>
    <td>92</td>
    <td>85.2</td>
    <td>97.3</td>
    <td>99</td>
  </tr>
  <tr>
    <td>Open CLIP-H</td>
    <td><b>77.78</b></td>
    <td><b>94.14</b></td>
    <td><b>96.62</b></td>
    <td><b>90.8</b></td>
    <td><b>99.3</b></td>
    <td>99.7</td>
  </tr>
  <tr>
    <td>Open CLIP-g</td>
    <td>76.52</td>
    <td>93.62</td>
    <td>96.28</td>
    <td>90.8</td>
    <td>99.1</td>
    <td><b>99.8</b></td>
  </tr>
  <tr>
    <td>EVA CLIP-g</td>
    <td>72.64</td>
    <td>91.6</td>
    <td>95.12</td>
    <td>88.3</td>
    <td>98.3</td>
    <td>99.3</td>
  </tr>
  <tr>
    <td rowspan=4>MSCOCO</td>
    <td>OpenAI CLIP-L</td>
    <td>36.51</td>
    <td>61.01</td>
    <td>71.11</td>
    <td>56.34</td>
    <td>79.32</td>
    <td>86.66</td>
  </tr>
  <tr>
    <td>Open CLIP-H</td>
    <td><b>49.47</b></td>
    <td><b>73.4</b></td>
    <td><b>81.53</b></td>
    <td><b>65.96</b></td>
    <td><b>86.06</b></td>
    <td><b>91.9</b></td>
  </tr>
  <tr>
    <td>Open CLIP-g</td>
    <td>47.99</td>
    <td>72.37</td>
    <td>80.75</td>
    <td>64.96</td>
    <td>85.3</td>
    <td>91.46</td>
  </tr>
  <tr>
    <td>EVA CLIP-g</td>
    <td>44.07</td>
    <td>68.5</td>
    <td>77.33</td>
    <td>61.76</td>
    <td>83.28</td>
    <td>89.96</td>
  </tr>
</table>

</div>

> The zero-shot retrieval performance of EVA-CLIP is relatively inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
> - The size / capacity of the language tower in EVA-CLIP is much smaller / weaker than that of Open CLIP-H and Open CLIP-g, *i.e.*, `124M` *vs.* `354M` parameters, and is only `~1/8` of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
> - Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B used by Open CLIP), while we only leverage LAION-400M for EVA-CLIP training.
>
> Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder & training data to improve the retrieval performance.
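
For readers unfamiliar with the R@K metrics in the table above, the following is a minimal sketch of how Recall@K is typically computed for CLIP-style retrieval. It assumes L2-normalized image and text embeddings (e.g. produced by `encode_image` / `encode_text` as in the Usage section below) and a one-to-one ground-truth mapping; the random embeddings are placeholders, and this is not the exact evaluation code used for the table.

```python
import torch

def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor,
                gt_index: torch.Tensor, k: int) -> float:
    """Recall@K: fraction of queries whose ground-truth gallery item
    appears among the K most similar gallery items.

    Both embedding matrices are assumed to be L2-normalized, so the dot
    product is cosine similarity. `gt_index[i]` is the gallery row matching
    query i (a simplification: Flickr30k / MSCOCO have several captions per
    image, which a full benchmark accounts for).
    """
    sims = query_emb @ gallery_emb.T                       # (N_query, N_gallery)
    topk = sims.topk(k, dim=-1).indices                    # (N_query, k)
    hit = (topk == gt_index.unsqueeze(-1)).any(dim=-1)     # did we retrieve the match?
    return hit.float().mean().item()

# Purely illustrative random embeddings (1000 captions vs. 1000 images).
text_emb = torch.nn.functional.normalize(torch.randn(1000, 1024), dim=-1)
image_emb = torch.nn.functional.normalize(torch.randn(1000, 1024), dim=-1)
gt = torch.arange(1000)                                    # caption i matches image i

for k in (1, 5, 10):
    print(f"Text-to-Image R@{k}: {recall_at_k(text_emb, image_emb, gt, k):.4f}")
```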
|
## Usage

The use of EVA-CLIP is similar to [OpenAI CLIP](https://github.com/openai/CLIP) and [Open CLIP](https://github.com/mlfoundations/open_clip).
Here we provide a showcase of zero-shot image classification.

First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as a few small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

```bash
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
```

The training code of our 1.1B EVA-CLIP will be available at [FlagAI](https://github.com/FlagAI-Open/FlagAI). Please stay tuned.


An example:
```python
import torch
from eva_clip import build_eva_model_and_transforms
from clip import tokenize
from PIL import Image

eva_clip_path = "/path/to/eva_clip_psz14.pt"  # https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt
model_name = "EVA_CLIP_g_14"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]

# Build the model and its preprocessing transform, then move it to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = build_eva_model_and_transforms(model_name, pretrained=eva_clip_path)
model = model.to(device)

image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenize(caption).to(device)

with torch.no_grad():
    # Encode and L2-normalize both modalities so the dot product is cosine similarity.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale by 100 and softmax over the candidate captions.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [1.0000e+00, 2.0857e-10, 4.8534e-12]
```
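
For full zero-shot classification benchmarks such as ImageNet-1K, the usual CLIP-style recipe is to build a fixed text classifier from class names and prompt templates, then compare each encoded image against it. The sketch below reuses `model`, `tokenize`, `device`, and `image_features` from the example above; the two templates and three class names are placeholders, not the prompt set used for the reported numbers.

```python
# Minimal sketch of a prompt-ensembled zero-shot classifier (illustrative only;
# the class names and templates are placeholders, not our evaluation setup).
classnames = ["goldfish", "tabby cat", "golden retriever"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    classifier = []
    for name in classnames:
        prompts = tokenize([t.format(name) for t in templates]).to(device)
        class_emb = model.encode_text(prompts)                 # (n_templates, dim)
        class_emb /= class_emb.norm(dim=-1, keepdim=True)
        class_emb = class_emb.mean(dim=0)                      # average over templates
        classifier.append(class_emb / class_emb.norm())        # re-normalize the mean
    classifier = torch.stack(classifier, dim=1)                # (dim, n_classes)

    logits = 100.0 * image_features @ classifier               # image_features from above
    print("Predicted class:", classnames[logits.argmax(dim=-1).item()])
```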


## Acknowledgement

EVA-CLIP is built with [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip) and [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark). Thanks for their awesome work!