2,218 Hits in 5.3 sec

Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [article]

Nicola Messina, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet
2021 arXiv   pre-print
Our main objective is to lay the groundwork for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence feature extractor.  ...  Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities.  ...  yet effective cross-modal retrieval using deep features.  ... 
arXiv:2106.00358v1 fatcat:mhmnmtiq2bgy7oeh5y7q647mcm

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [article]

Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet
2021 arXiv   pre-print
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task.  ...  Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated.  ...  ACKNOWLEDGMENTS This work was partially supported by "Intelligenza Artificiale per il Monitoraggio Visuale dei Siti Culturali" (AI4CHSites) CNR4C program, CUP B15J19001040004, by the AI4EU project, funded  ... 
arXiv:2008.05231v2 fatcat:h5ybwbeukjamviphhfykrcbpnu

Transformer Reasoning Network for Image-Text Matching and Retrieval [article]

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato
2021 arXiv   pre-print
However, this precludes extracting the separate visual and textual features needed for later indexing steps in large-scale retrieval systems.  ...  Thanks to this design, the implemented network produces compact yet very rich visual and textual features for the subsequent indexing step.  ...  VISUAL-TEXTUAL REASONING USING TRANSFORMER ENCODERS: Our work relies almost entirely on the TE architecture, both for the visual and the textual data pipelines.  ... 
arXiv:2004.09144v3 fatcat:5u4vvdhxdza5djkr66lf4guwum

New Ideas and Trends in Deep Multimodal Content Understanding: A Review [article]

Wei Chen and Weiping Wang and Li Liu and Michael S. Lew
2020 arXiv   pre-print
These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering  ...  The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  1) Cross-modal retrieval: Single-modal and cross-modal retrieval have been researched for decades [61].  ... 
arXiv:2010.08189v1 fatcat:2l7molbcn5hf3oyhe3l52tdwra

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Wei Chen, Weiping Wang, Li Liu, Michael S. Lew
2020 Neurocomputing  
These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering  ...  The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  The achieved progress of cross-modal hash retrieval on the MIRFlickr25k [206] and the NUS-WIDE [207] datasets. Hashing methods achieve higher retrieval efficiency by using binary hash codes.  ... 
doi:10.1016/j.neucom.2020.10.042 fatcat:hyjkj5enozfrvgzxy6avtbmoxu
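The hashing point in the snippet above is concrete: once images and texts are mapped to binary codes, retrieval reduces to Hamming distance, i.e. counting differing bits. A minimal NumPy sketch with toy hand-written 8-bit codes (in a real system the codes would come from a learned deep hashing model; all names here are illustrative):

```python
import numpy as np

def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Number of differing bits between one query code and each database code."""
    return np.count_nonzero(db_codes != query_code, axis=1)

# Toy 8-bit binary hash codes for three database items.
db = np.array([
    [0, 1, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0, 1, 0],
], dtype=np.uint8)
query = np.array([0, 1, 1, 0, 0, 1, 0, 1], dtype=np.uint8)

# Rank database items by Hamming distance; index 0 is the closest code.
ranking = np.argsort(hamming_distances(query, db))
```

Because the codes are short bit vectors, this comparison is far cheaper than dense float similarity, which is the efficiency advantage the survey refers to.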

Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [article]

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, Chunjing Xu, Hang Xu
2022 arXiv   pre-print
Their success heavily relies on the scale of pre-trained cross-modal datasets.  ...  In this work, we release a large-scale Chinese cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.  ...  Secondly, we find that cross-modal token-wise similarity from FILIP complements various patch-based visual encoders like SwinT and can contribute to better visual and textual representations.  ... 
arXiv:2202.06767v2 fatcat:ddopguzsnrcsneihwxnndcuu6a
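The cross-modal token-wise similarity mentioned in the snippet (as in FILIP) is a late-interaction score: each text token is matched to its most similar image patch, and the per-token maxima are averaged. A minimal NumPy sketch; the function name and the toy 2-d vectors are hypothetical:

```python
import numpy as np

def tokenwise_similarity(text_tokens: np.ndarray, image_patches: np.ndarray) -> float:
    """Late-interaction score: every text token is matched to its most
    similar image patch, then the per-token maxima are averaged."""
    # Normalise rows so dot products become cosine similarities.
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    v = image_patches / np.linalg.norm(image_patches, axis=1, keepdims=True)
    sim = t @ v.T  # (num_tokens, num_patches) cosine similarity matrix
    return float(sim.max(axis=1).mean())

# Toy example: two text tokens and three image patches in a 2-d embedding space.
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
score = tokenwise_similarity(tokens, patches)
```

Note that, unlike a single global embedding, this score still needs all patch vectors at query time, which is the trade-off such token-wise schemes make for finer-grained matching.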

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities.  ...  Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data.  ...  Transformers in Cross-Modal Research: Onset of Transformers for Capturing Temporal Data Characteristics. Transformers are architectures that take advantage of two separate networks, namely encoder and  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [article]

Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang
2022 arXiv   pre-print
Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation.  ...  This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension  ...  The extracted visual and textual tokens are directly combined and fed into Transformers, where cross-modality fusion can be performed implicitly.  ... 
arXiv:2203.01922v1 fatcat:vnjfetgkpzedpfhklufooqet7y

Image-text Retrieval: A Survey on Recent Research and Development [article]

Min Cao, Shiping Li, Juntao Li, Liqiang Nie, Min Zhang
2022 arXiv   pre-print
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased interest in the research community due to its excellent research value and broad real-world application.  ...  It is designed for the scenarios where the queries are from one modality and the retrieval galleries from another modality.  ...  ., 2019] adopted a visual saliency detection module to guide the cross-modal correlation. integrated intra- and cross-modal knowledge to learn the image and text features jointly.  ... 
arXiv:2203.14713v2 fatcat:acvezdy23nfobhy5vh7m4hdghq

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [article]

Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
2021 arXiv   pre-print
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images  ...  An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale  ...  Note that the main difference with VirTex is in the vision-text Transformer architecture: PixelBERT uses a deep 12-layer Transformer encoder while VirTex uses a shallow 3-layer Transformer decoder to merge  ... 
arXiv:2103.16553v1 fatcat:rw2av5leebdx7kcrqowxv6yo54
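The dual-encoder design described in the snippet owes its efficiency to the fact that image embeddings can be computed and indexed offline, so a query costs only one text encoding plus a matrix product. A toy NumPy sketch of that split, using random linear maps as stand-ins for the two encoders (all shapes and names are illustrative, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two independent encoders.
W_text = rng.standard_normal((16, 8))
W_image = rng.standard_normal((32, 8))

def encode(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project into the joint space and L2-normalise, so dot product = cosine."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Image embeddings are computed once, offline, and can be indexed.
images = rng.standard_normal((1000, 32))
image_emb = encode(images, W_image)

# At query time: encode the text, then one matrix product over the index.
query = rng.standard_normal((1, 16))
scores = encode(query, W_text) @ image_emb.T  # cosine similarities
top5 = np.argsort(-scores[0])[:5]             # best-matching image indices
```

A cross-attention model, by contrast, must run the full transformer on every (query, image) pair, which is what makes it impractical at billion-image scale.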

Survey on Deep Multi-modal Data Analytics: Collaboration, Rivalry and Fusion [article]

Yang Wang
2020 arXiv   pre-print
Substantial empirical studies are carried out to demonstrate the advantages gained from deep multi-modal methods, which can essentially deepen the fusion across multi-modal deep feature spaces  ...  With the development of web technology, multi-modal or multi-view data has surged as a major stream of big data, where each modal/view encodes an individual property of the data objects.  ...  [136] proposed to use deep convolutional visual features to address cross-modal retrieval.  ... 
arXiv:2006.08159v1 fatcat:g4467zmutndglmy35n3eyfwxku

All in One: Exploring Unified Video-Language Pre-training [article]

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
2022 arXiv   pre-print
Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video-question answering, multiple choice and visual commonsense  ...  They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in more parameters and lower efficiency in downstream tasks.  ...  We would like to thank David Junhao Zhang for his kind help on Transformer training.  ... 
arXiv:2203.07303v1 fatcat:ypguqswusnhf5nxqoiiq275vga

Start from Scratch

Hanwang Zhang, Yang Yang, Huanbo Luan, Shuicheng Yan, Tat-Seng Chua
2014 Proceedings of the ACM International Conference on Multimedia - MM '14  
The discovery is based on a novel deep architecture, named Independent Component Multimodal Autoencoder (ICMAE), that can continually learn shared higher-level representations across the visual and textual modalities.  ...  fail to learn useful cross-modality correlations.  ... 
doi:10.1145/2647868.2654915 dblp:conf/mm/ZhangYLYC14 fatcat:sfr7zr2akngtdl5nem2uckh76e

Multi-Modal Retrieval using Graph Neural Networks [article]

Aashish Kumar Misraa, Ajinkya Kale, Pranav Aggarwal, Ali Aminian
2020 arXiv   pre-print
This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks.  ...  Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual embedding based retrieval.  ...  Increasing weight towards the visual features captures the retro effect of the query.  ... 
arXiv:2010.01666v1 fatcat:mtp43eajpbabnhownf6tqaxhki

Predicting Visual Features from Text for Image and Video Caption Retrieval

Jianfeng Dong, Xirong Li, Cees G.M. Snoek
2018 IEEE transactions on multimedia  
Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron.  ...  Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input.  ...  While both visual and textual modalities are used during training, Word2VisualVec performs a mapping from the textual to the visual modality.  ... 
doi:10.1109/tmm.2018.2832602 fatcat:ypowsjvjyvbhfhtf42l6rokhii
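The Word2VisualVec idea in the snippet, mapping a sentence vector into a visual feature space with a small multi-layer perceptron and then retrieving by nearest visual feature, can be sketched as follows. This is an illustrative NumPy toy with random weights and assumed dimensions (300-d sentence vector in, 2048-d visual feature out), not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-layer MLP weights: 300 -> 512 -> 2048.
W1, b1 = rng.standard_normal((300, 512)) * 0.05, np.zeros(512)
W2, b2 = rng.standard_normal((512, 2048)) * 0.05, np.zeros(2048)

def text_to_visual(sent_vec: np.ndarray) -> np.ndarray:
    """Map a sentence vector into the visual feature space via a small MLP."""
    h = np.maximum(sent_vec @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

# Retrieval: compare the predicted visual feature against stored image features.
image_feats = rng.standard_normal((100, 2048))
pred = text_to_visual(rng.standard_normal(300))
nearest = int(np.argmin(np.linalg.norm(image_feats - pred, axis=1)))
```

The design choice the abstract highlights is that matching happens entirely in the visual feature space: only the text side is learned at query time, while image features stay whatever the chosen visual backbone produces.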
Showing results 1 — 15 out of 2,218 results