A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; the original URL is https://arxiv.org/pdf/2012.08673v2.pdf (application/pdf).
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
[article]
2021-03-30
arXiv (pre-print)
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although achieving impressive performance on standard tasks, to date, it still remains unclear how robust these pre-trained models are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, by standard model finetuning, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Differing from previous studies focused on one specific type of robustness, Mango is task-agnostic, and enables universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts robustness of pre-trained models into sharper focus, pointing new directions for future study.

arXiv:2012.08673v2 (https://arxiv.org/abs/2012.08673v2)
fatcat:orl3dt3r3fg3xjac2rt4xwqxxu (https://fatcat.wiki/release/orl3dt3r3fg3xjac2rt4xwqxxu)
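To make the core idea of adversarial perturbation in the embedding space concrete, below is a minimal sketch. It is not the paper's MANGO implementation (which learns a noise generator network); instead it swaps in a simpler PGD-style perturbation directly on pooled text/image embeddings. The toy model, tensor shapes, and hyperparameters are assumptions chosen purely for illustration.

```python
# Illustrative only: additive adversarial noise on V+L embeddings,
# crafted by ascending the task loss (PGD-style), then used for
# adversarial fine-tuning. All names below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLModel(nn.Module):
    """Stand-in for a pre-trained V+L transformer that consumes embeddings."""
    def __init__(self, dim=64, num_answers=10):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, text_emb, img_emb):
        # text_emb, img_emb: (batch, dim) pooled embeddings
        fused = torch.relu(self.fuse(torch.cat([text_emb, img_emb], dim=-1)))
        return self.head(fused)

def embedding_space_attack(model, text_emb, img_emb, labels,
                           eps=0.1, steps=3, step_size=0.05):
    """Craft additive noise on both embeddings that increases the task loss."""
    delta_t = torch.zeros_like(text_emb, requires_grad=True)
    delta_i = torch.zeros_like(img_emb, requires_grad=True)
    for _ in range(steps):
        logits = model(text_emb + delta_t, img_emb + delta_i)
        loss = F.cross_entropy(logits, labels)
        grad_t, grad_i = torch.autograd.grad(loss, [delta_t, delta_i])
        # Ascend the loss, then project back into an L-inf ball of radius eps.
        delta_t = (delta_t + step_size * grad_t.sign()).clamp(-eps, eps).detach().requires_grad_(True)
        delta_i = (delta_i + step_size * grad_i.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta_t.detach(), delta_i.detach()

# Usage: fine-tune the model on the perturbed embeddings.
model = ToyVLModel()
text_emb, img_emb = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))
dt, di = embedding_space_attack(model, text_emb, img_emb, labels)
loss = F.cross_entropy(model(text_emb + dt, img_emb + di), labels)
loss.backward()  # gradients for a parameter update on adversarial examples
```

Perturbing embeddings rather than raw pixels or tokens keeps the attack task-agnostic and applicable to both modalities, which is the property the abstract attributes to Mango; the learned-generator variant described in the paper would replace the inner PGD loop above.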
Web Archive [PDF]: https://web.archive.org/web/20210410234437/https://arxiv.org/pdf/2012.08673v2.pdf