1,780 Hits in 10.3 sec

Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval

Sheng Zeng, Changhong Liu, Jun Zhou, Yong Chen, Aiwen Jiang, Hanxi Li
2022 Proceedings of the 2022 International Conference on Multimedia Retrieval  
Cross-modal image-text retrieval is a fundamental task in information retrieval.  ...  Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges.  ...  Therefore, fine-grained matching methods have effectively improved the accuracy of cross-modal retrieval.  ... 
doi:10.1145/3512527.3531358 fatcat:v6cspdscwff6jjrkf66ygds54q

Cross-media Multi-level Alignment with Relation Attention Network [article]

Jinwei Qi, Yuxin Peng, Yuxin Yuan
2018 arXiv   pre-print
Relation understanding is essential for cross-media correlation learning, which is ignored by prior cross-media retrieval works.  ...  We aim to not only exploit cross-media fine-grained local information, but also capture the intrinsic relation information, which can provide complementary hints for correlation learning.  ...  Cross-modal correlation learning (CCL) [Peng et al., 2017] utilizes fine-grained information, and adopts multitask learning strategy for better performance.  ... 
arXiv:1804.09539v1 fatcat:7bpfoixw2rbfji3b4h7jytyvly

Cross-media Multi-level Alignment with Relation Attention Network

Jinwei Qi, Yuxin Peng, Yuxin Yuan
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
Relation understanding is essential for cross-media correlation learning, which is ignored by prior cross-media retrieval works.  ...  We aim to not only exploit cross-media fine-grained local information, but also capture the intrinsic relation information, which can provide complementary hints for correlation learning.  ...  Cross-modal correlation learning (CCL) [Peng et al., 2017] utilizes fine-grained information, and adopts multi-task learning strategy for better performance.  ... 
doi:10.24963/ijcai.2018/124 dblp:conf/ijcai/QiPY18 fatcat:anifftwhbrec7oackglkbn2qra

Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [article]

Xiao Dong, Xunlin Zhan, Yunchao Wei, Xiaoyong Wei, Yaowei Wang, Minlong Lu, Xiaochun Cao, Xiaodan Liang
2022 arXiv   pre-print
Our goal in this research is to study a more realistic environment in which we can conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.  ...  Specifically, a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level commodity retrieval, that explicitly injects entity knowledge in both node-based and subgraph-based  ...  Existing pre-trained models for vision-language often learn image-text semantic alignment using a multi-layer Transformer architecture, such as Bert [68] , on multi-modal input in a shared cross-modal  ... 
arXiv:2206.08842v1 fatcat:23zbjfvrqvfx7n4xc6rqgglqd4

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [article]

Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas
2020 arXiv   pre-print
In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.  ...  Specifically, we employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found  ...  Additionally, we show experiments of fine-grained image retrieval, using the same multi-modal representation, in the two evaluated datasets.  ... 
arXiv:2009.09809v1 fatcat:mrl2m2cc2farjayr5rxrrpesiy

Fine-Grained Image Analysis with Deep Learning: A Survey [article]

Xiu-Shen Wei and Yi-Zhe Song and Oisin Mac Aodha and Jianxin Wu and Yuxin Peng and Jinhui Tang and Jian Yang and Serge Belongie
2021 arXiv   pre-print
image recognition and fine-grained image retrieval.  ...  The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it a challenging problem.  ...  ACKNOWLEDGMENTS The authors would like to thank the editor and the anonymous reviewers for their constructive comments.  ... 
arXiv:2111.06119v2 fatcat:ninawxsjtnf4lndtqquuwl3weq

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Keyu Wen, Xiaodong Gu, Qingrong Cheng
2020 IEEE Transactions on Circuits and Systems for Video Technology (Print)  
Image-Text Matching is one major task in cross-modal information processing. The main challenge is to learn the unified visual and textual representations.  ...  DSRAN performs graph attention in both modules respectively for region-level relations enhancement and regional-global relations enhancement at the same time.  ...  Further in DSPE [34] , correlation learning between cross-modal encoded features is enhanced by constructing a triplet ranking loss.  ... 
doi:10.1109/tcsvt.2020.3030656 fatcat:ymindb2imnbgnlmitnkziskkmi

Deep Multi-Semantic Fusion-Based Cross-Modal Hashing

Xinghui Zhu, Liewu Cai, Zhuoyang Zou, Lei Zhu
2022 Mathematics  
However, the existing deep hashing methods cannot consider multi-label semantic learning and cross-modal similarity learning simultaneously.  ...  That means potential semantic correlations among multimedia data are not fully excavated from multi-category labels, which also affects the original similarity preserving of cross-modal hash codes.  ...  How to utilize the extensive multi-modal data to improve cross-modal retrieval performance has attracted increasing attention [6, 7] .  ... 
doi:10.3390/math10030430 fatcat:yri6dbd53zglhc77wtpswxgsoa

New Ideas and Trends in Deep Multimodal Content Understanding: A Review [article]

Wei Chen and Weiping Wang and Li Liu and Michael S. Lew
2020 arXiv   pre-print
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  These models go beyond the simple image classifiers in which they can do uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering  ...  It is the first benchmark with 4 media types for fine-grained cross-media retrieval. However, this direction is still far from satisfactory.  ... 
arXiv:2010.08189v1 fatcat:2l7molbcn5hf3oyhe3l52tdwra

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Wei Chen, Weiping Wang, Li Liu, Michael S. Lew
2020 Neurocomputing  
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.  ...  These models go beyond the simple image classifiers in which they can do uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering  ...  Each iteration provides newly relevant information to discover more fine-grained correlations between image and text.  ... 
doi:10.1016/j.neucom.2020.10.042 fatcat:hyjkj5enozfrvgzxy6avtbmoxu

AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval

Muhammad Shahid Jabbar, Jitae Shin, Jun-Dong Cho
2022 Electronics  
However, it lacks shared cross-modality attention features to model fine-grained relationships.  ...  The test results reflect that the shared attention parameters alleviate fine-grained attribute recognition, and the proposed approach is a significant step towards automatic multi-modal retrieval for improved  ...  and the shared attention parameters learning for multi-modal image-poem data.  ... 
doi:10.3390/electronics11081275 fatcat:4vzquef2mbg3xdatmpqoiurwyi

Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval [article]

Peng Xu, Qiyue Yin, Yongye Huang, Yi-Zhe Song, Zhanyu Ma, Liang Wang, Tao Xiang, W. Bastiaan Kleijn, Jun Guo
2017 arXiv   pre-print
This naturally motivates us to explore the effectiveness of cross-modal retrieval methods in SBIR, which have been applied in the image-text matching successfully.  ...  In this paper, we introduce and compare a series of state-of-the-art cross-modal subspace learning methods and benchmark them on two recently released fine-grained SBIR datasets.  ...  Their performance rankings for subcategory-level SBIR tasks are almost consistent with those in cross-modal retrieval for image and text.  ... 
arXiv:1705.09888v1 fatcat:amgvdpunhfcdth7cuovobgdqn4

Bit-aware Semantic Transformer Hashing for Multi-modal Retrieval

Wentao Tan, Lei Zhu, Weili Guan, Jingjing Li, Zhiyong Cheng
2022 Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval  
multi-modal semantic gaps. 3) Direct coarse pairwise semantic preserving cannot effectively capture the fine-grained semantic correlations.  ...  for multi-modal hash learning on the concept-level.  ...  and perform multi-modal fusion on the fine-grained concept-level for multi-modal hash learning.  ... 
doi:10.1145/3477495.3531947 fatcat:5ndrlr5t35fwlo4kvj7o6s3bdm

Scalable Multi-grained Cross-modal Similarity Query with Interpretability

Mingdong Zhu, Derong Shen, Lixin Xu, Xianfang Wang
2021 Data Science and Engineering  
The main contributions are as follows: (1) By integrating coarse-grained and fine-grained semantic learning models, a multi-grained cross-modal query processing architecture is proposed to ensure the adaptability  ...  and generality of query processing. (2) In order to capture the latent semantic relation between images and texts, the framework combines LSTM and attention mode, which enhances query accuracy for the  ...  Acknowledgements We would like to thank selfless friends and professional reviewers for all the insightful advices.  ... 
doi:10.1007/s41019-021-00162-4 fatcat:7tdgbtoq2jc45ixrdltrl4nofu

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [article]

Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
2022 arXiv   pre-print
Specifically, ViSTA utilizes transformer blocks to directly encode image patches and fuse scene text embedding to learn an aggregated visual representation for cross-modal retrieval.  ...  Visual appearance is considered to be the most important cue to understand images for cross-modal retrieval, while sometimes the scene text appearing in images can provide valuable information to understand  ...  the performance for fine-grained image classification in specific scenarios.  ... 
arXiv:2203.16778v1 fatcat:hldin76ql5hqtiq7ppvwmf6pmy
Showing results 1 — 15 out of 1,780