Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
[article] 2021, arXiv pre-print
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales well and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time.
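To make the dual-encoder trade-off concrete, here is a minimal sketch of that retrieval pattern: random vectors stand in for the outputs of learned text and vision encoders, and a FAISS inverted-file index provides the approximate nearest neighbour search. All names and parameter values (d, nlist, nprobe, k) are illustrative assumptions, not settings from the paper.

```python
# Sketch of dual-encoder retrieval with approximate nearest neighbour search.
# Placeholder random embeddings stand in for learned transformer encoders;
# FAISS supplies the ANN index over precomputed image embeddings.
import numpy as np
import faiss  # pip install faiss-cpu

d = 256            # dimensionality of the joint embedding space (assumed)
n_images = 100_000
n_queries = 5
k = 10             # retrieve top-k images per text query

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Unit-normalise rows so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-ins for encoder outputs: in practice these would come from the
# independently applied vision encoder (offline) and text encoder (online).
image_embeddings = l2_normalize(
    rng.standard_normal((n_images, d)).astype("float32"))
text_embeddings = l2_normalize(
    rng.standard_normal((n_queries, d)).astype("float32"))

# Offline step: index all image embeddings once with an IVF ANN index.
nlist = 1024  # number of coarse clusters (assumed)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(image_embeddings)
index.add(image_embeddings)

# Online step: each text query costs one encoder pass plus an ANN lookup,
# independent of the number of images -- unlike cross-attention models,
# which must process every (text, image) pair at test time.
index.nprobe = 32  # clusters scanned per query; trades recall for speed
scores, ids = index.search(text_embeddings, k)
print(ids[0], scores[0])  # top-k image ids and similarities for query 0
```

The sketch highlights why dual encoders scale: the expensive vision encoding and indexing happen once offline, so query cost stays constant as the collection grows, whereas a cross-attention scorer would have to be re-run over the full collection for every query.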
arXiv:2103.16553v1