12,551 Hits in 6.6 sec

Multi-modal Memory Enhancement Attention Network for Image-Text Matching

Zhong Ji, Zhigang Lin, Haoran Wang, Yuqing He
2020 IEEE Access  
by constructing a Multi-Modal Memory Enhancement (M3E) module.  ...  Image-text matching is an attractive research topic in the community of vision and language.  ...  Conclusion: In this paper, we proposed a novel Multi-modal Memory Enhancement Attention Network (M3A-Net) for achieving image-text matching.  ... 
doi:10.1109/access.2020.2975594 fatcat:ciiubythzzevpkw2ip5csnjwf4

Review of Recent Deep Learning Based Methods for Image-Text Retrieval

Jianan Chen, Lu Zhang, Cong Bai, Kidiyo Kpalma
2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)  
In this paper, we highlight key points of recent cross-modal retrieval approaches based on deep learning, especially in the image-text retrieval context, and classify them into four categories according  ...  Extracting relevant information efficiently from large-scale multi-modal data is becoming a crucial problem of information retrieval.  ...  for more effective image-text matching.  ... 
doi:10.1109/mipr49039.2020.00042 dblp:conf/mipr/ChenZBK20 fatcat:fps5wiw4ezf7teko3vegaxq4tq

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Keyu Wen, Xiaodong Gu, Qingrong Cheng
2020 IEEE Transactions on Circuits and Systems for Video Technology  
Image-Text Matching is one major task in cross-modal information processing. The main challenge is to learn the unified visual and textual representations.  ...  Thus, a novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, which mainly consists of two modules, a separate semantic relations module  ...  [41] proposed a cross memory network with pair discrimination to capture the common knowledge between image and text modalities. More special mechanisms are used in the global-wise matching.  ... 
doi:10.1109/tcsvt.2020.3030656 fatcat:ymindb2imnbgnlmitnkziskkmi

LILE: Look In-Depth before Looking Elsewhere – A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives [article]

Danial Maleki, H. R. Tizhoosh
2022 arXiv   pre-print
Most contemporary works apply cross attention to highlight the essential elements of an image or text in relation to the other modalities and try to match them together.  ...  Furthermore, the age of networks that used multiple modalities separately has practically ended.  ...  Attention and Gated Memory Blocks: After the representation for each modality instance has been extracted, a multi-head self-attention module is applied to obtain m enhanced feature maps for extracted features  ... 
arXiv:2203.01445v2 fatcat:onogf45adrgcvjd5psnm25sbam

New Ideas and Trends in Deep Multimodal Content Understanding: A Review [article]

Wei Chen and Weiping Wang and Li Liu and Michael S. Lew
2020 arXiv   pre-print
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  These models go beyond simple image classifiers in that they can do uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering)  ...  To compensate for these limitations, word-level attention [53], hierarchical text-to-image mapping [46], and memory networks [59] have been explored.  ... 
arXiv:2010.08189v1 fatcat:2l7molbcn5hf3oyhe3l52tdwra

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Wei Chen, Weiping Wang, Li Liu, Michael S. Lew
2020 Neurocomputing  
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  These models go beyond simple image classifiers in that they can do uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering)  ...  To compensate for these limitations, word-level attention [53], hierarchical text-to-image mapping [46], and memory networks [59] have been explored.  ... 
doi:10.1016/j.neucom.2020.10.042 fatcat:hyjkj5enozfrvgzxy6avtbmoxu

IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [article]

Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han
2020 arXiv   pre-print
In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple  ...  Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language.  ...  Conclusion In this paper, we propose an Iterative Matching method with a Recurrent Attention Memory network (IMRAM) for cross-modal image-text retrieval to handle the complexity of semantics.  ... 
arXiv:2003.03772v1 fatcat:s2hqfom3ira4blfaxazzaso73a

Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [article]

Shen Gao, Xiuying Chen, Li Liu, Dongyan Zhao, Rui Yan
2020 arXiv   pre-print
Specifically, PESRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain the representation of stickers and utterances.  ...  Then, we model the user preference by using the recently selected stickers as input, and use a key-value memory network to store the preference representation.  ...  As for sticker recommendation, existing works such as [42] and apps like Hike or QQ directly match the text typed by the user to the short text tag assigned to each sticker.  ... 
arXiv:2011.03322v1 fatcat:krkee37danaipbpbeozwfgc644

Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources [article]

Sahar Abdelnabi, Rakibul Hasan, Mario Fritz
2022 arXiv   pre-print
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking, and significantly outperforms previous baselines that did not leverage external evidence.  ...  To integrate evidence and cues from both modalities, we introduce the concept of 'multi-modal cycle-consistency check'; starting from the image/caption, we gather textual/visual evidence, which will be  ...  We also thank Rebecca Weil for helpful advice and feedback.  ... 
arXiv:2112.00061v3 fatcat:7w5ndinlbjht7b5e7elzyozycy

Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, Luo Zhong
2021 IEEE Transactions on Services Computing  
We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) for learning a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance  ...  Third, we further incorporate TFIDF enhanced category semantics to improve the mapping of the image modality and to regulate the similarity loss function during the iterative learning of the cross-modal joint  ...  Stacked Attention Networks (SAN) [7]: SAN applied a stacked attention network to simultaneously locate ingredient regions in the image and learn multi-modal embedding features between ingredient features  ... 
doi:10.1109/tsc.2021.3098834 fatcat:p6qstgiejbe53p7gnyl2mrfxce

Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network [article]

Md Aminul Haque Palash, Md Abdullah Al Nasim, Aditi Dhali, Faria Afrin
2021 arXiv   pre-print
Considering that, we propose a Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation.  ...  For the first time, a fine-grained image is generated from Bangla text using an attentional GAN. Bangla ranks 7th among the 100 most spoken languages.  ...  Second, a deep attentional multi-modal similarity model is presented for training the generator, which computes a fine-grained image-text matching loss for the generated images.  ... 
arXiv:2109.11749v1 fatcat:vezzdd6dyzd4lleltsk5ix2ho4

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features [article]

Byeonghu Na, Yoonsik Kim, Sungrae Park
2022 arXiv   pre-print
This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances.  ...  Based on the spatial encoding, visual and semantic features are enhanced by referring to related features in the other modality.  ...  To answer the question, this paper proposes a new STR model, named Multi-modAl Text Recognition Network (MATRN), that enhances visual and semantic features by referring to features in both modalities.  ... 
arXiv:2111.15263v2 fatcat:sgewvxzf2jfnrah7knkpekrthu

Holistic Multi-modal Memory Network for Movie Question Answering [article]

Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, Vijay Chandrasekhar
2018 arXiv   pre-print
In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop.  ...  Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering.  ...  CONCLUSION We presented a Holistic Multi-modal Memory Network framework that learns to answer questions with context from multi-modal data.  ... 
arXiv:1811.04595v1 fatcat:xlxphlnk4rdspixvveeqftl7pu

Why Do We Click: Visual Impression-aware News Recommendation [article]

Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, Fei Wu
2021 arXiv   pre-print
Besides, existing research pays little attention to the click decision-making process in designing multi-modal modeling modules.  ...  To accurately capture users' interests, we propose to model multi-modal features, in addition to the news titles that are widely used in existing works, for news recommendation.  ...  ., IMRec, for multi-modal news recommendation.  ... 
arXiv:2109.12651v1 fatcat:pcjk6p7c4rbbrgc2hovl6zyfku

Enterprise Strategic Management From the Perspective of Business Ecosystem Construction Based on Multimodal Emotion Recognition

Wei Bi, Yongzhen Xie, Zheng Dong, Hongshen Li
2022 Frontiers in Psychology  
Through the comparative analysis of the accuracy of single-modal and multi-modal ER, the self-attention mechanism is applied in the experiment.  ...  Then, two datasets, CMU-MOSI and CMU-MOSEI, are selected to design the scheme for multimodal ER based on self-attention mechanism.  ...  For this model, it is only necessary to apply the self-attention mechanism and Bi-GRU to three modalities of text, image and audio.  ... 
doi:10.3389/fpsyg.2022.857891 pmid:35310264 pmcid:PMC8927019 doaj:82cf2c71b7bf4e4f9bdeda763b6e1939 fatcat:hssh4dpwzbahvpv5vyupuuoxuu
Showing results 1 — 15 out of 12,551 results