
Commit e8d6e2f

aszalaj-min authored and committed

Initial commit

0 parents · commit e8d6e2f

File tree

91 files changed: +32971 −0 lines


.gitignore

+7

@@ -0,0 +1,7 @@
__pycache__/
checkpoints/
pretrained_weights/
data/eva_clip_features/
data/ASR/
data/ASR_feats_all-MiniLM-L6-v2/
custom_video_pipeline/

EVA_clip/CLIP.png

247 KB

EVA_clip/README.md

+247
@@ -0,0 +1,247 @@
# Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)

**Table of Contents**

- [Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)](#contrastive-language-image-pre-training-with-eva-eva-clip)
  - [Model Card](#model-card)
  - [Performance of EVA-CLIP Vision Encoder on ImageNet-1K](#performance-of-eva-clip-vision-encoder-on-imagenet-1k)
  - [EVA-CLIP Zero-shot Evaluation Results](#eva-clip-zero-shot-evaluation-results)
    - [**All 35 Benchmark Results in Details**](#all-35-benchmark-results-in-details)
    - [Zero-shot Image Classification Evaluation](#zero-shot-image-classification-evaluation)
    - [Zero-shot Video Action Recognition Evaluation](#zero-shot-video-action-recognition-evaluation)
    - [Zero-shot Retrieval Evaluation](#zero-shot-retrieval-evaluation)
  - [Usage](#usage)
  - [Acknowledgement](#acknowledgement)

## Model Card

<div align="center">

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `eva_clip_psz14` | 1.1B | `fp16` | [LAION-400M](https://laion.ai/laion-400-open-dataset/) | 41K | 78.5 | [🤗 HF link](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt) (`2GB`) |

</div>
We choose to train a 1.1B CLIP model, not because it is easy, but because it is hard. Please refer to [this note](https://docs.google.com/document/d/1FXosAZ3wMrzThgnWR6KSkXIz4IMItq3umDGos38pJps/edit) for a glance at the challenges in training very large CLIP.

To our knowledge, EVA-CLIP is **the largest performant open-sourced CLIP model** evaluated via zero-shot classification performance, especially on mainstream classification benchmarks such as ImageNet along with its variants.
For more details about EVA-CLIP, please refer to Section 2.3.5 of [our paper](https://arxiv.org/pdf/2211.07636.pdf).

We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, *etc.*, and we hope our solution for scaling up CLIPs can provide insight for practitioners studying large foundation models.
## Performance of EVA-CLIP Vision Encoder on ImageNet-1K

<div align="center">

| model | zero-shot @ 224px | linear probing @ 224px | linear probing @ 336px | fine-tuning @ 224px | fine-tuning @ 336px |
|:-----:|:------:|:------:|:------:|:------:|:------:|
| EVA-CLIP | **78.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt) \| [log](https://wandb.ai/baaivision/eva-clip/reports/ViT-g-14--VmlldzoyOTkwMDYy)) | **86.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz224_lincls_86p5.pth) \| [log](../logs/cls/linear_eva_clip_vision_enc_1k_cls_sz224_86p5.txt)) | **86.5** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz336_lincls_86p5.pth) \| [log](../logs/cls/linear_eva_clip_vision_enc_1k_cls_sz336_86p5.txt)) | **89.1** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz224_ftcls_89p1.pt) \| [log](../logs/cls/ft_eva_clip_vision_enc_1k_cls_sz224_89p1.txt)) | **89.4** ([weight](https://huggingface.co/BAAI/EVA/blob/main/eva_clip_vis_enc_sz336_ftcls_89p4.pt) \| [log](../logs/cls/ft_eva_clip_vision_enc_1k_cls_sz336_89p4.txt)) |

</div>

EVA-CLIP achieves state-of-the-art top-1 accuracy on ImageNet-1K among all self-supervised learning approaches.
We will provide instructions for reproducing these results soon.
## EVA-CLIP Zero-shot Evaluation Results

<div align="center">

### [**All 35 Benchmark Results in Details**](./benchmark.md)

</div>

### Zero-shot Image Classification Evaluation

The top-1 accuracy on ImageNet-1K variants and ObjectNet.

<div align="center">

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/self-supervised-image-classification-with)](https://paperswithcode.com/sota/self-supervised-image-classification-with?p=eva-exploring-the-limits-of-masked-visual)

| model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
|-------|:-----:|:-----:|:----:|:------:|:-------:|:---------:|
| OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
| Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
| Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
| EVA CLIP-g | **78.53** | **71.52** | **73.59** | **92.5** | **67.31** | **72.33** |

</div>
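
These zero-shot numbers follow the standard CLIP evaluation recipe: each class name is turned into a text prompt, and an image is assigned to the class whose text embedding it matches best. Below is a minimal sketch of that recipe (not our exact benchmark script); `model` is built as in the [Usage](#usage) section, while the prompt template, `class_names`, and `loader` are illustrative placeholders.

```python
# Minimal sketch of zero-shot top-1 evaluation (illustrative, not the exact
# benchmark script). `model` is built as in the Usage section; `class_names`
# and `loader` (preprocessed images + integer labels) are placeholders.
import torch
from clip import tokenize

@torch.no_grad()
def zero_shot_top1(model, loader, class_names, device="cuda"):
    # One text embedding per class, from a simple prompt template.
    prompts = tokenize([f"a photo of a {name}" for name in class_names]).to(device)
    text_feats = model.encode_text(prompts)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

    correct = total = 0
    for images, labels in loader:
        image_feats = model.encode_image(images.to(device))
        image_feats /= image_feats.norm(dim=-1, keepdim=True)
        preds = (image_feats @ text_feats.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```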
### Zero-shot Video Action Recognition Evaluation

The performance on video action recognition benchmarks.

<div align="center">

| model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
|-------|:-----:|:-----:|:----:|:----:|
| OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
| Open CLIP-H | **78.16** | 63.06 | 63.58 | 56.09 |
| Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
| EVA CLIP-g | 76.05 | **65.23** | **64.38** | **58.4** |

</div>

> For video action recognition, we sample only a single center frame from each video, turning it into an image classification task.
> Following the conventional settings, we report the top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
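
To make this protocol concrete, here is a minimal sketch of the two steps it describes; the helpers are illustrative assumptions, not the repository's evaluation code.

```python
# Illustrative sketch of the protocol above: pick one center frame per video,
# then report top-1 for UCF-101 and the mean of top-1 and top-5 for Kinetics.
import torch

def center_frame(frames):
    """Return the middle frame of a decoded video (e.g. a list of PIL images)."""
    return frames[len(frames) // 2]

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the top-k predictions."""
    topk = logits.topk(k, dim=-1).indices                     # (N, k)
    return (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean().item()

def kinetics_metric(logits, labels):
    """Conventional Kinetics metric: mean of top-1 and top-5 accuracy."""
    return 0.5 * (topk_accuracy(logits, labels, 1) + topk_accuracy(logits, labels, 5))
```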
### Zero-shot Retrieval Evaluation

<div align="center">

<table>
   <tr>
      <td rowspan=2>Dataset</td>
      <td rowspan=2>Model</td>
      <td colspan=3>Text-to-Image Retrieval</td>
      <td colspan=3>Image-to-Text Retrieval</td>
   </tr>
   <tr>
      <td>R@1</td>
      <td>R@5</td>
      <td>R@10</td>
      <td>R@1</td>
      <td>R@5</td>
      <td>R@10</td>
   </tr>
   <tr>
      <td rowspan=4>Flickr30k</td>
      <td>OpenAI CLIP-L</td>
      <td>65.18</td>
      <td>87.28</td>
      <td>92</td>
      <td>85.2</td>
      <td>97.3</td>
      <td>99</td>
   </tr>
   <tr>
      <td>Open CLIP-H</td>
      <td><b>77.78</b></td>
      <td><b>94.14</b></td>
      <td><b>96.62</b></td>
      <td><b>90.8</b></td>
      <td><b>99.3</b></td>
      <td>99.7</td>
   </tr>
   <tr>
      <td>Open CLIP-g</td>
      <td>76.52</td>
      <td>93.62</td>
      <td>96.28</td>
      <td>90.8</td>
      <td>99.1</td>
      <td><b>99.8</b></td>
   </tr>
   <tr>
      <td>EVA CLIP-g</td>
      <td>72.64</td>
      <td>91.6</td>
      <td>95.12</td>
      <td>88.3</td>
      <td>98.3</td>
      <td>99.3</td>
   </tr>
   <tr>
      <td rowspan=4>MSCOCO</td>
      <td>OpenAI CLIP-L</td>
      <td>36.51</td>
      <td>61.01</td>
      <td>71.11</td>
      <td>56.34</td>
      <td>79.32</td>
      <td>86.66</td>
   </tr>
   <tr>
      <td>Open CLIP-H</td>
      <td><b>49.47</b></td>
      <td><b>73.4</b></td>
      <td><b>81.53</b></td>
      <td><b>65.96</b></td>
      <td><b>86.06</b></td>
      <td><b>91.9</b></td>
   </tr>
   <tr>
      <td>Open CLIP-g</td>
      <td>47.99</td>
      <td>72.37</td>
      <td>80.75</td>
      <td>64.96</td>
      <td>85.3</td>
      <td>91.46</td>
   </tr>
   <tr>
      <td>EVA CLIP-g</td>
      <td>44.07</td>
      <td>68.5</td>
      <td>77.33</td>
      <td>61.76</td>
      <td>83.28</td>
      <td>89.96</td>
   </tr>
</table>

</div>

> The zero-shot retrieval performance of EVA-CLIP is relatively inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
> - The size / capacity of the language tower in EVA-CLIP is much smaller / weaker than that of Open CLIP-H and Open CLIP-g, *i.e.*, `124M` *vs.* `354M`, and is only `~1/8` of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
> - Retrieval tasks seem to benefit more from the training dataset size (LAION-2B used by Open CLIP), while we only leverage LAION-400M for EVA-CLIP training.
>
> Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder & training data to improve the retrieval performance.
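
For reference, R@k in the table above follows the usual definition: a query counts as a hit if its ground-truth match appears among its top-k retrieved items. Below is a minimal sketch, under the simplifying assumption of one paired caption per image (the actual Flickr30k / MSCOCO protocol uses multiple captions); it is not our evaluation script.

```python
# Minimal sketch of recall@k for cross-modal retrieval (illustrative; assumes
# image i is paired with caption i, and features are L2-normalized, shape (N, D)).
import torch

def recall_at_k(image_feats, text_feats, k):
    """Text-to-image R@k: fraction of captions whose paired image is in the top-k."""
    sims = text_feats @ image_feats.T                  # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices                # top-k image indices per caption
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```

Image-to-text R@k is the same computation with the roles of the two feature matrices swapped.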
## Usage

The use of EVA-CLIP is similar to [OpenAI CLIP](https://github.com/openai/CLIP) and [Open CLIP](https://github.com/mlfoundations/open_clip).
Here we provide a showcase of zero-shot image classification.

First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

```bash
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
```

The training code of our 1.1B EVA-CLIP will be available at [FlagAI](https://github.com/FlagAI-Open/FlagAI). Please stay tuned.

An example:
```python
import torch
from eva_clip import build_eva_model_and_transforms
from clip import tokenize
from PIL import Image

eva_clip_path = "/path/to/eva_clip_psz14.pt"  # https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt
model_name = "EVA_CLIP_g_14"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = build_eva_model_and_transforms(model_name, pretrained=eva_clip_path)
model = model.to(device)

image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenize(caption).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [1.0000e+00, 2.0857e-10, 4.8534e-12]
```
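
As a small follow-up to the example (not part of the original snippet), the best-matching caption can be read off directly from `text_probs`:

```python
# Pick the caption with the highest probability for the image.
top_idx = text_probs.argmax(dim=-1).item()
print("Best caption:", caption[top_idx])  # expected: "a diagram" for CLIP.png
```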
## Acknowledgement

EVA-CLIP is built with [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip) and [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark). Thanks for their awesome work!

EVA_clip/__init__.py

+2

@@ -0,0 +1,2 @@
from .clip import *
from .eva_clip import *
