<script src="http://www.google.com/jsapi" type="text/javascript"></script>
<script type="text/javascript">google.load("jquery", "1.3.2");</script>
<style type="text/css">
body {
font-family: "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
font-weight:300;
font-size:18px;
margin-left: auto;
margin-right: auto;
width: 1100px;
}
h1 {
font-size:32px;
font-weight:300;
}
.disclaimerbox {
background-color: #eee;
border: 1px solid #eeeeee;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
padding: 20px;
}
video.header-vid {
height: 140px;
border: 1px solid black;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
img.header-img {
height: 140px;
border: 1px solid black;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
img.rounded {
border: 1px solid #eeeeee;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
a:link,a:visited
{
color: #1367a7;
text-decoration: none;
}
a:hover {
color: #208799;
}
td.dl-link {
height: 160px;
text-align: center;
font-size: 22px;
}
.caption {
margin-top: 8px; /* Space between image and caption */
font-style: italic; /* Italicize the caption text */
}
.layered-paper-big { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35), /* The top layer shadow */
5px 5px 0 0px #fff, /* The second layer */
5px 5px 1px 1px rgba(0,0,0,0.35), /* The second layer shadow */
10px 10px 0 0px #fff, /* The third layer */
10px 10px 1px 1px rgba(0,0,0,0.35), /* The third layer shadow */
15px 15px 0 0px #fff, /* The fourth layer */
15px 15px 1px 1px rgba(0,0,0,0.35), /* The fourth layer shadow */
20px 20px 0 0px #fff, /* The fifth layer */
20px 20px 1px 1px rgba(0,0,0,0.35), /* The fifth layer shadow */
25px 25px 0 0px #fff, /* The fifth layer */
25px 25px 1px 1px rgba(0,0,0,0.35); /* The fifth layer shadow */
margin-left: 10px;
margin-right: 45px;
}
.paper-big { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35); /* The top layer shadow */
margin-left: 10px;
margin-right: 45px;
}
.layered-paper { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35), /* The top layer shadow */
5px 5px 0 0px #fff, /* The second layer */
5px 5px 1px 1px rgba(0,0,0,0.35), /* The second layer shadow */
10px 10px 0 0px #fff, /* The third layer */
10px 10px 1px 1px rgba(0,0,0,0.35); /* The third layer shadow */
margin-top: 5px;
margin-left: 10px;
margin-right: 30px;
margin-bottom: 5px;
}
.vert-cent {
position: relative;
top: 50%;
transform: translateY(-50%);
}
hr
{
border: 0;
height: 1px;
background-image: linear-gradient(to right, rgba(0, 0, 0, 0), rgba(0, 0, 0, 0.75), rgba(0, 0, 0, 0));
}
</style>
<html>
<head>
<title>GPT-4V's Robustness</title>
<meta property="og:image" content="./assets/gpt4v.png"/>
<meta property="og:title" content="Robustness of GPT-4V to Cross-Modal Insertions" />
<meta property="og:description" content="Title: Assessing the Robustness of GPT-4V to Cross-Modal Insertions. Authors: Gaurav Verma and Srijan Kumar; Affiliation: Georgia Institute of Technology" />
<!-- Get from Google Analytics -->
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src=""></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-75863369-6');
</script>
</head>
<body>
<br>
<center>
<span style="font-size:36px"><b>How Robust is GPT-4V to Cross-Modal Insertions?</b></span><br/>
<span style="font-size:20px">Part 1: Quick Assessment of the Robustness of GPT-4V's Visual Entailment Capabilities (SNLI-VE)</span><br/><br/>
<span><a href="https://gaurav22verma.github.io/">Gaurav Verma</a> <em><span style="font-size: 10pt;">and</span></em> <a href="https://faculty.cc.gatech.edu/~srijan/">Srijan Kumar</a></span><br/>
<span style="font-size: 11pt;">Georgia Institute of Technology</span><br/>
<a href="https://www.cc.gatech.edu/"><img src="./assets/gt-logo.png" width=170px></a><br/><br/>
<!--
<span>⭐ Slides from EMNLP 2022 Oral are now available [<a href="./assets/Verma_RobustnessFinal.pdf">slides (pdf)</a>]</span><br/>
<span>⭐ Code and Colab notebook released [<a href="https://github.com/claws-lab/multimodal-robustness">GitHub link</a>]</span><br/>
<span>⭐ Paper accepted at EMNLP 2022 (main) [<a href="./assets/MultimodalRobustness_EMNLP2022.pdf">paper pdf</a>]</span><br/><br/>
<a href="https://2022.emnlp.org/"><img src="./assets/emnlp-logo.png" width=150px></a><br/><br/> -->
<hr>
<center>
<table align=center width=850px>
<tr>
<td width=260px>
<!-- <center> -->
<center><h1>Summary (tl;dr)</h1></center>
<b>How robust is GPT-4V to realistic changes in multimodal data?</b> A simple way to check is to introduce realistic changes in the input data and see if GPT-4V's predictions remain the same. For instance, instead of asking GPT-4V whether the input image depicts "a boy wearing a shirt playing the piano," we can ask whether it depicts "a <em>young</em> boy wearing a <em>green</em> shirt playing the piano." Provided that the input image does depict a young boy wearing a green shirt, GPT-4V's output should remain the same. In our <a href="https://aclanthology.org/2023.acl-long.890/">ACL'23</a> work, we refer to these changes in the input as <em>cross-modal insertions</em>.<br/><br/>
<b>On a visual entailment task</b> – i.e., given an image and a textual description, GPT-4V's job is to predict whether the description entails, contradicts, or is neutral with respect to the information in the image – we find that GPT-4V's performance drops by 21.59% when faced with cross-modal insertions. This is a notable drop. Qualitatively, since the inserted attributes are demographic attributes for some of the multimodal examples, it is possible that GPT-4V is "cautious" in its predictions. But beyond those cases, there are examples where GPT-4V struggles with multimodal reasoning in the presence of cross-modal insertions.<br/><br/>
<b>Lastly</b>, as a passing remark, we also find that even without any changes to the input, GPT-4V's zero-shot performance on the visual entailment task is <em>not</em> close to that of models that learn with full supervision.<br/><br/>
<em>Continue reading for more details!</em>
<br/><br/>
<hr/>
<center><h1>Introduction and Takeaways</h1></center>
<b>Background</b>: We have investigated the robustness of vision-language models when faced with realistic changes in the text that are cross-modally grounded (<em>dilutions</em> in <a href="https://claws-lab.github.io/multimodal-robustness/" target="_blank">Verma et al. (EMNLP 2022)</a> and <em>insertions</em> in <a href="https://github.com/claws-lab/multimodal-robustness-xmai">Ramshetty et al. (ACL 2023)</a>). Collectively, experiments across multiple multimodal classification, cross-modal retrieval, and visual entailment tasks have indicated that the performance of leading <a href="https://github.com/openai/CLIP" target="_blank">CLIP</a>-like vision-language models drops notably – i.e., anywhere between 15% and 25% – when faced with cross-modal dilutions or insertions. The following figure provides qualitative examples of dilutions and insertions that lead to misclassification of the multimodal input.<br/><br/>
<center>
<img src="./assets/dilution_insertion.png" width="800">
<div class="caption"><b>Figure 1</b>: Examples of cross-modal dilution and cross-modal insertion.</div><br/><br/>
</center>
<b>The question</b>: Now, we are interested in assessing how robust GPT-4V is to these cross-modal changes. In the first part of this series, we focus on robustness to cross-modal insertions and present results on a subset of the SNLI-VE dataset. Put concisely, SNLI-VE involves a premise (image) and a hypothesis (text), and the task is to predict whether the information in the text entails, contradicts, or is neutral with respect to the information in the image; essentially a 3-class classification task. GPT-4V's performance on this task is assessed under two settings: (i) when the textual hypothesis is unaltered (denoted as <span style="color: blue;">original</span>), and (ii) when the textual hypothesis is altered by inserting attributes of objects in the corresponding image, i.e., the insertion is cross-modally grounded (denoted as <span style="color: red;">w/ cross-modal insertion</span>). We show the results in the following figure. For comparison, we include a CLIP-style model trained using the METER framework for the visual entailment task and assessed under the same two settings. <br/><br/>
<b>Takeaways</b>: We have two key takeaways: (i) While the zero-shot visual entailment prediction capabilities of GPT-4V are impressive, they are not close to those of models that learn with full supervision (<a href="https://paperswithcode.com/sota/visual-entailment-on-snli-ve-test" target="_blank">view the SNLI-VE leaderboard here</a>). (ii) GPT-4V's performance drops by 21.59% when faced with cross-modal insertions. This is a notable drop. We present some qualitative insights into what it is about cross-modal insertions that makes GPT-4V predict incorrectly. <br/><br/>
<center>
<img src="./assets/gpt4v.png" alt="Performance comparison between the robustness of zero-shot GPT-4V model and a model fine-tuned based on the METER framework." width="500">
<div class="caption"><b>Figure 2</b>: While a model that is fine-tuned on the SNLI-VE train set using the METER framework (<a href="https://openaccess.thecvf.com/content/CVPR2022/papers/Dou_An_Empirical_Study_of_Training_End-to-End_Vision-and-Language_Transformers_CVPR_2022_paper.pdf">Dou et al.; CVPR 2022</a>), experiences a drop of 18.81% in the presence of cross-modal insertions, GPT-4V, when prompted in a zero-shot setting, experiences a drop of 21.59%. Results for the METER fine-tuned model are taken from <a href="https://github.com/claws-lab/multimodal-robustness-xmai">Ramshetty et al.; ACL 2023</a>. Due to rate limits, we conducted the GPT-4V experiments on a random subset of the test set (<em>n = 300</em>).</div><br/><br/>
</center>
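As a side note on how such numbers can be read, below is a minimal sketch of computing a performance drop like the ones in Figure 2, assuming the reported drops are <em>relative</em> decreases with respect to the accuracy in the original setting (our assumption; the accuracy values in the usage comment are placeholders, not our measurements).<br/><br/>
<pre style="font-size: 14px; background-color: #f7f7f7; border: 1px solid #eeeeee; border-radius: 10px; padding: 12px; overflow-x: auto; text-align: left;"><code>
# Minimal sketch: compute a relative performance drop (%), assuming the drops
# reported in Figure 2 are relative decreases with respect to the original setting.

def relative_drop(acc_original, acc_with_insertion):
    """Relative drop (%) when moving from original to perturbed hypotheses."""
    return 100.0 * (acc_original - acc_with_insertion) / acc_original

# Illustration with placeholder accuracies (not our measured values):
print(relative_drop(0.80, 0.65))  # 18.75
</code></pre><br/>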
<b>What kind of cross-modal insertions confuse GPT-4V?</b> Of the 61 examples (out of 300) that were miscategorized by GPT-4V <em>after</em> cross-modal insertions were made (i.e., these examples were categorized accurately in their original form), quite a few involved the insertion of attributes that relate to demographics. For instance, for the following two examples, the predicted label changed when the men/person were specified to be white. Such cases can be argued to be an artifact of our approach rather than a limitation of GPT-4V's robustness. <br/><br/>
<center>
<table>
<tr>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
<center>
<img src="./assets/climbing.jpg" width="270px">
</center>
</td>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
<center>
<img src="./assets/camera.jpg" width="300px">
</center>
</td>
</tr>
<tr>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
original: <span style="color: blue;">a person is climbing (<em>neutral</em>) <span style="color: green;">✓</span></span> <br>
modified: <span style="color: red;">a white person is climbing (<em>contradiction</em>) <span style="color: red;">✗</span></span>
</td>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
original: <span style="color: blue;">two men look toward a camera. (<em>entailment</em>) <span style="color: green;">✓</span></span> <br>
modified: <span style="color: red;"> two white men look toward a press camera. (<em>contradiction</em>) <span style="color: red;">✗</span></span>
</td>
</tr>
</table>
</center><br/>
However, for some examples, the change in prediction was not completely justifiable – see the two examples below. Perhaps examining the rationales (obtained via chain-of-thought prompting) could help explain such misclassifications. Additionally, we believe that in-context learning could improve both the original and the cross-modal insertion accuracies of GPT-4V. <br/><br/>
<center>
<table>
<tr>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
<center>
<img src="./assets/waiter.jpg" width="270px">
</center>
</td>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
<center>
<img src="./assets/young_child.jpg" width="300px">
</center>
</td>
</tr>
<tr>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
original: <span style="color: blue;">the waiter waits for the women to odrder and his shift will be over soon (<em>neutral</em>) <span style="color: green;">✓</span></span> <br>
modified: <span style="color: red;">the waiter waits for the women to odrder and his first shift will be over soon (<em>contradiction</em>) <span style="color: red;">✗</span></span>
</td>
<td style="border: 2px solid black; border-radius: 10px; padding: 10px;">
original: <span style="color: blue;">a child is watching a movie on the dvd player. (<em>neutral</em>) <span style="color: green;">✓</span></span> <br>
modified: <span style="color: red;"> a young child is watching a movie on the dvd player. (<em>contradiction</em>) <span style="color: red;">✗</span></span>
</td>
</tr>
</table>
</center><br/>
<br/>
<!-- </center> -->
</td>
</tr>
</table>
<hr>
</center>
<table align=center width=850px>
<center><h1>Methodology and Resources</h1></center>
<tr>
<td>
<b>Data</b>: We conducted experiments on a subset of 300 examples from the SNLI-VE test set, randomly sampled from the original test set while ensuring 100 examples from each of the three classes (contradiction, entailment, and neutral). We used the cross-modal insertions applied to the corresponding original texts in our prior study (<a href="https://github.com/claws-lab/multimodal-robustness-xmai" target="_blank">Ramshetty et al. (ACL 2023)</a>). The choice of experimenting with 300 examples was largely driven by the rate limits imposed by the OpenAI API [<a href="https://community.openai.com/t/rate-limit-for-gpt-4v-preview/476851" target="_blank">more here</a>]. For the presented experiments, the GPT-4V model was accessed during November 7-14, 2023.<br/><br/>
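As an illustration of this sampling step, here is a minimal sketch of drawing a class-balanced subset of 100 examples per class; the file name and column names below are placeholders and do not reflect the exact format of the released data.<br/><br/>
<pre style="font-size: 14px; background-color: #f7f7f7; border: 1px solid #eeeeee; border-radius: 10px; padding: 12px; overflow-x: auto; text-align: left;"><code>
# Minimal sketch: sample a class-balanced subset (100 per class) of the SNLI-VE test set.
# The file name and column names are placeholders, not the released data format.
import pandas as pd

test_df = pd.read_csv("snli_ve_test.csv")  # placeholder columns: image_id, hypothesis, label

subset = (
    test_df.groupby("label", group_keys=False)
           .apply(lambda g: g.sample(n=100, random_state=0))
           .reset_index(drop=True)
)
print(subset["label"].value_counts())  # expect 100 each: contradiction, entailment, neutral
</code></pre><br/>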
<b>Prompt</b>: This is the prompt we used for doing zero-shot visual entailment inference with GPT-4V:<br/><br/>
<code style="color: purple;">
Given the image as premise and the following sentence as hypothesis, assign one of the three labels: entailment, contradiction, or neutral, based on the relationship conveyed by the image and the text. The definitions of the labels are provided below.<br/>
entailment: if there is enough evidence in the image to conclude that the sentence is true.<br/>
contradiction: if there is enough evidence in the image to conclude that the sentence is false.<br/>
neutral: if the evidence in the image is insufficient to draw a conclusion about the text.<br/><br/>
You must respond only with one of the three words: entailment, contradiction, or neutral, and nothing else.<br/><br/>
sentence: {text goes here}<br/>
label:<br/>
</code><br/>
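For completeness, here is a minimal sketch of how this prompt can be sent to GPT-4V together with an image using OpenAI's Python SDK. The model name (<code>gpt-4-vision-preview</code>), image path, and hypothesis string are placeholders, and the exact request parameters may differ from the ones we used; see the GitHub repository below for the actual code.<br/><br/>
<pre style="font-size: 14px; background-color: #f7f7f7; border: 1px solid #eeeeee; border-radius: 10px; padding: 12px; overflow-x: auto; text-align: left;"><code>
# Minimal sketch (not our exact implementation): zero-shot visual entailment with GPT-4V.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = "Given the image as premise and the following sentence as hypothesis, ..."  # full prompt shown above

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def predict_label(image_path, hypothesis):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT + "\nsentence: " + hypothesis + "\nlabel:"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64," + encode_image(image_path)}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()

# Example usage (placeholder image path and hypothesis):
# predict_label("example.jpg", "a young boy wearing a green shirt is playing the piano")
</code></pre><br/>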
<b>Resources</b>: All our code, data, and other resources used for this analysis are available in this <a href="https://github.com/claws-lab/multimodal-robustness-xmai">GitHub repository</a>. Please feel free to reach out to us if you have any questions or comments.<br/><br/>
</td>
</tr>
</table>
<br>
<hr>
<table align=center width=850px>
<center><h1>Related Papers on Multimodal Robustness</h1></center>
<span style="font-size:11pt">The experiments and insights presented here are developed on top of our prior work.<br/> Please refer to the following papers and the associated resources for more details:</span><br/><br/>
<tr>
<td><img class="layered-paper-small" style="height:155px" src="./assets/acl2023.png"/></td>
<td><span style="font-size:11pt">
<b>Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning</b><br>
Shivaen Ramshetty*, Gaurav Verma*, Srijan Kumar<br>
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).<br><br/>
code: <a href="https://github.com/claws-lab/multimodal-robustness-xmai">https://github.com/claws-lab/multimodal-robustness-xmai</a><br/>
arXiv: <a href="https://arxiv.org/abs/2306.11065">https://arxiv.org/abs/2306.11065</a><br><br/><br/>
</td>
</tr>
<tr>
<td><img class="layered-paper-small" style="height:155px" src="./assets/emnlp2022.png"/></td>
<td><span style="font-size:11pt">
<b>Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions</b><br>
Gaurav Verma, Vishwa Vinay, Ryan A. Rossi, Srijan Kumar<br>
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022).<br><br/>
webpage: <a href="https://claws-lab.github.io/multimodal-robustness/">https://claws-lab.github.io/multimodal-robustness/</a><br/>
code: <a href="https://github.com/claws-lab/multimodal-robustness">https://github.com/claws-lab/multimodal-robustness</a><br/>
arXiv: <a href="https://arxiv.org/abs/2211.02646">https://arxiv.org/abs/2211.02646</a><br><br/><br/>
</td>
</tr>
</table>
<span style="font-size: 11pt;"><b><em>What's next?</em></b><br/> Stay tuned for more follow-up parts from us within this series where we investigate multimodal language models!</span><br/>
<br>
<hr>
<br>
<table align=center width=600px>
<tr>
<td width=400px>
<center>
<span style="font-size: 8pt;">The template was originally made by <a href="http://web.mit.edu/phillipi/">Phillip Isola</a> and <a href="http://richzhang.github.io/">Richard Zhang</a>; the code can be found <a href="https://github.com/richzhang/webpage-template">here</a>.</span>
</center>
</td>
</tr>
</table>
<br>
</body>
</html>