
Identity-Aware Textual-Visual Matching with Latent Co-attention [article]

Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang
2017 arXiv   pre-print
In this paper, we propose an identity-aware two-stage framework for the textual-visual matching problem. ... The stage-2 CNN-LSTM network refines the matching results with a latent co-attention mechanism. ... Textual-visual matching aims at conducting accurate verification for images and language descriptions. ...
arXiv:1708.01988v1 fatcat:3ad7voibi5agrdg6mtwg7fd5hi
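
The snippet above names a latent co-attention mechanism between the CNN image branch and the LSTM text branch but does not spell out its computation. As a minimal, generic sketch (not the authors' exact stage-2 architecture; feature dimensions and the average pooling are assumptions), bilateral co-attention can be written as:

```python
import torch
import torch.nn.functional as F

def co_attention(word_feats, region_feats):
    """Minimal bilateral (co-)attention between text and image features.

    word_feats:   (num_words,   d) word features, e.g. from an LSTM
    region_feats: (num_regions, d) region features, e.g. from a CNN
    Returns one attended vector per modality for the matching score.
    """
    # Affinity between every word and every image region.
    affinity = word_feats @ region_feats.t()               # (num_words, num_regions)

    # Text-guided attention over regions and image-guided attention over words.
    attn_over_regions = F.softmax(affinity, dim=1)
    attn_over_words = F.softmax(affinity.t(), dim=1)

    attended_regions = attn_over_regions @ region_feats    # (num_words, d)
    attended_words = attn_over_words @ word_feats          # (num_regions, d)

    # Average-pool into single vectors (a simplification of the paper's scheme).
    return attended_regions.mean(dim=0), attended_words.mean(dim=0)

# Toy usage with random features.
text_repr, image_repr = co_attention(torch.randn(12, 256), torch.randn(36, 256))
match_score = F.cosine_similarity(text_repr, image_repr, dim=0)
```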

Context-Aware Attention Network for Image-Text Retrieval

Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. ... Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. ... By exploiting the context-aware attention, our model can simultaneously perform image-assisted textual attention and text-assisted visual attention. ...
doi:10.1109/cvpr42600.2020.00359 dblp:conf/cvpr/ZhangLZL20 fatcat:vuwfvpeo2vbpxkypyj675evhcq

Visual-Textual Association with Hardest and Semi-Hard Negative Pairs Mining for Person Search [article]

Jing Ge, Guangyu Gao, Zhen Liu
2019 arXiv   pre-print
In this paper, we propose a novel visual-textual association approach with visual and textual attention, and cross-modality hardest and semi-hard negative pair mining. ... Intuitively, for person search, the core issue should be visual-textual association, which is still an extremely challenging task due to the contradiction between the high abstraction of textual description ... Meanwhile, the authors in [2] proposed an identity-aware two-stage framework for textual-visual matching, in which the first stage learned an identity-aware representation and matched salient image ...
arXiv:1912.03083v1 fatcat:3duo3voh6ngctlopcjdgksbje4
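
The snippet above mentions cross-modality hardest and semi-hard negative pair mining. A minimal in-batch sketch, under the usual convention that image i and caption i form the positive pair (batch size, dimensions, and the margin value are assumptions), could look like:

```python
import torch
import torch.nn.functional as F

def mine_negatives(img_emb, txt_emb, margin=0.2):
    """Flag the hardest and the semi-hard negative texts for every image.

    img_emb, txt_emb: (batch, d) L2-normalised embeddings; pair i matches pair i.
    Hardest negative:    the most similar non-matching text.
    Semi-hard negatives: non-matching texts less similar than the positive but
    still within the margin (pos_sim - margin < neg_sim < pos_sim).
    """
    sim = img_emb @ txt_emb.t()                        # (batch, batch) cosine similarities
    pos_sim = sim.diag().unsqueeze(1)                  # similarity of each matched pair

    eye = torch.eye(sim.size(0), dtype=torch.bool)     # exclude the positives themselves
    neg_sim = sim.masked_fill(eye, float('-inf'))

    hardest_idx = neg_sim.argmax(dim=1)                # index of the hardest negative
    semi_hard_mask = (neg_sim < pos_sim) & (neg_sim > pos_sim - margin)
    return hardest_idx, semi_hard_mask

img = F.normalize(torch.randn(8, 128), dim=1)
txt = F.normalize(torch.randn(8, 128), dim=1)
hardest_idx, semi_hard_mask = mine_negatives(img, txt)
```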

Semi-Supervised Variational User Identity Linkage via Noise-Aware Self-Learning [article]

Chaozhuo Li, Senzhang Wang, Zheng Liu, Xing Xie, Lei Chen, Philip S. Yu
2021 arXiv   pre-print
Existing approaches usually first embed the identities as deterministic vectors in a shared latent space, and then learn a classifier based on the available annotations. ... To address these limitations, in this paper we propose a novel Noise-aware Semi-supervised Variational User Identity Linkage (NSVUIL) model. ...
arXiv:2112.07373v1 fatcat:e6zcnwrxifasbdmffrvqzmejiq

Research on event perception based on geo-tagged social media data

Ruoxin Zhu, Chenyu Zuo, Diao Lin
2019 Proceedings of the ICA  
However, widely used social media services provide a unique approach to event study, with individuals acting as smart sensors. ... Then, how to detect and trace events, how to analyze event impact, and how to visually express the obtained knowledge are discussed respectively. ... The co-occurring terms were ordered continuously, and the occurring event was summarized with the top-n co-occurrence terms. ...
doi:10.5194/ica-proc-2-157-2019 fatcat:i2vo6okebvfgjl5xuiid235a3m
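
The snippet above summarizes events with the top-n co-occurrence terms. As a purely illustrative sketch (the posts, tokenization, and n are placeholders, not the authors' pipeline):

```python
from collections import Counter
from itertools import combinations

def top_cooccurrences(posts, n=5):
    """Count term pairs appearing together in the same geo-tagged post and
    return the n most frequent pairs as a crude event summary."""
    counts = Counter()
    for tokens in posts:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts.most_common(n)

posts = [
    ["flood", "rain", "munich"],
    ["rain", "flood", "traffic"],
    ["flood", "munich", "help"],
]
print(top_cooccurrences(posts, n=3))
```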

CPGAN: Full-Spectrum Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis [article]

Jiadong Liang and Wenjie Pei and Feng Lu
2020 arXiv   pre-print
Particularly, we design a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images during ... Meanwhile, the synthesized image is parsed to learn its semantics in an object-aware manner. ... LeicaGAN [31] adopts text-visual co-embeddings to replace input text with corresponding visual features. Lao et al. ...
arXiv:1912.08562v2 fatcat:xmc5jqrkuvbp3hfviqbqvrz57q

Multi-modal Deep Analysis for Multimedia

Wenwu Zhu, Xin Wang, Hongzhi Li
2019 IEEE Transactions on Circuits and Systems for Video Technology
On knowledge-guided fusion, we discuss the approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge, including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, and multi-modal recommendation. ... [111] propose a hierarchical co-attention (HieCoAtt) model that combines "visual attention" and "question attention" by conducting a question-guided attention on the image and an image-guided attention on ...
doi:10.1109/tcsvt.2019.2940647 fatcat:l4tchrkgrnaeradvc4nhfan2w4

Learning to Match on Graph for Fashion Compatibility Modeling

Xun Yang, Xiaoyu Du, Meng Wang
2020 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
Existing methods have primarily learned visual compatibility from dyadic co-occurrence or co-purchase information of items to model the item-item matching interaction. ... Understanding the mix-and-match relationships between items has received increasing attention in the fashion industry. ... margin ranking loss with a margin of 0.5, evaluated in both the single-modality setting (visual only) and the multi-modality setting (visual and textual). ...
doi:10.1609/aaai.v34i01.5362 fatcat:rpzqrjyiarbejmidblcznqez4i
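
The snippet above trains with a margin ranking loss using a margin of 0.5. A minimal sketch of that loss on compatibility scores (the scores below are random stand-ins for the model's real outputs):

```python
import torch
import torch.nn as nn

# Margin ranking loss with margin 0.5, as mentioned in the snippet.
criterion = nn.MarginRankingLoss(margin=0.5)

pos_scores = torch.tensor([0.9, 0.7, 0.8])    # compatibility of matching outfits
neg_scores = torch.tensor([0.4, 0.6, 0.5])    # compatibility of mismatched outfits
target = torch.ones_like(pos_scores)          # +1: pos should rank above neg

# Averages max(0, -(pos - neg) + 0.5) over the batch.
loss = criterion(pos_scores, neg_scores, target)
```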

Disentangled Motif-aware Graph Learning for Phrase Grounding [article]

Zongshen Mu, Siliang Tang, Jie Tan, Qiang Yu, Yueting Zhuang
2021 arXiv   pre-print
Finally, the cross-modal attention network is utilized to fuse intra-modal features, where the similarity between each phrase and the regions can be computed to select the best-grounded one. ... In contrast, we pay special attention to different motifs implied in the context of the scene graph and devise the disentangled graph network to integrate the motif-aware contextual information into representations ... The dataset's co-occurrence bias, i.e., irrelevant relations, may lead to incorrect attention. ...
arXiv:2104.06008v1 fatcat:lilye6cjgrdy3embtluuk7kjwu

Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [article]

Chao Zeng, Tiesong Zhao, Sam Kwong
2021 arXiv   pre-print
Our empirical tests show that I^2CE trained with a dual-branch structure achieves better consistency with human judgments than contemporary image captioning evaluation metrics. ... Most of the current captioning metrics rely on token-level matching between the candidate caption and the ground-truth label sentences, which usually neglects sentence-level information. ... In summary, the image captioning system contains three parts: the visual encoder, the language decoder, and the visual-textual interactive part. ...
arXiv:2106.15312v1 fatcat:r54o6bp4grhw3gcaihdw377msy
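
The snippet above contrasts token-level matching with sentence-level semantics. Below is a hedged sketch of scoring a candidate caption by cosine similarity of sentence embeddings; the embeddings here are random stand-ins, whereas I^2CE obtains them from its intrinsic auto-encoder (not reproduced):

```python
import torch
import torch.nn.functional as F

def caption_score(candidate_emb, reference_embs):
    """Score a candidate caption by its highest cosine similarity to the
    sentence embeddings of the ground-truth reference captions."""
    sims = F.cosine_similarity(candidate_emb.unsqueeze(0), reference_embs, dim=1)
    return sims.max().item()

# Random stand-ins for sentence embeddings produced by a trained encoder.
candidate_emb = torch.randn(384)
reference_embs = torch.randn(5, 384)   # five reference captions
print(caption_score(candidate_emb, reference_embs))
```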

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

Hedi Ben-younes, Remi Cadene, Matthieu Cord, Nicolas Thome
2017 IEEE International Conference on Computer Vision (ICCV)
We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. ... With MUTAN, we control the complexity of the merging scheme while keeping nice interpretable fusion relations. ... The hierarchical co-attention network [17], after extracting multiple textual and visual features, merges them with concatenations and sums. ...
doi:10.1109/iccv.2017.285 dblp:conf/iccv/Ben-younesCCT17 fatcat:pykbnnbvmnhntjulueqsj5e2ea
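
The snippet above describes MUTAN as a Tucker decomposition that parametrizes bilinear text-visual interactions. The sketch below shows a generic Tucker-style fusion (project each modality, interact through a small learned core tensor, project to the output); it omits MUTAN's additional rank constraint, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Generic Tucker-style bilinear fusion of a text vector q and an image
    vector v: project both modalities, combine through a core tensor, map out."""
    def __init__(self, dim_q, dim_v, dim_core=160, dim_out=512):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, dim_core)
        self.proj_v = nn.Linear(dim_v, dim_core)
        self.core = nn.Parameter(0.01 * torch.randn(dim_core, dim_core, dim_out))
        self.proj_out = nn.Linear(dim_out, dim_out)

    def forward(self, q, v):
        q_tilde = torch.tanh(self.proj_q(q))   # (batch, dim_core)
        v_tilde = torch.tanh(self.proj_v(v))   # (batch, dim_core)
        # Bilinear interaction: z[b, o] = sum_ij q_tilde[b, i] * v_tilde[b, j] * core[i, j, o]
        z = torch.einsum('bi,bj,ijo->bo', q_tilde, v_tilde, self.core)
        return self.proj_out(z)

fusion = TuckerFusion(dim_q=2400, dim_v=2048)                 # e.g. sentence and CNN feature sizes
fused = fusion(torch.randn(4, 2400), torch.randn(4, 2048))    # (4, 512)
```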

MUTAN: Multimodal Tucker Fusion for Visual Question Answering [article]

Hedi Ben-younes, Rémi Cadene, Matthieu Cord, Nicolas Thome
2017 arXiv   pre-print
We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. ... With MUTAN, we control the complexity of the merging scheme while keeping nice interpretable fusion relations. ... The hierarchical co-attention network [17], after extracting multiple textual and visual features, merges them with concatenations and sums. ...
arXiv:1705.06676v1 fatcat:k5x426j2rzhszilpvblnjdmruq

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

Ubaid Ullah, Jeong-Sik Lee, Chang-Hyeon An, Hyeonjin Lee, Su-Yeong Park, Rock-Hyun Baek, Hyun-Chul Choi
2022 Sensors  
Similarly, text and visual data (images and videos) are two distinct data domains that have seen extensive research in the past. ... For decades, correlating different data domains to attain the maximum potential of machines has driven research, especially in neural networks. ... the matching-aware pair loss. ...
doi:10.3390/s22186816 fatcat:mqhcrujj5bbebgo2brdnad3p6m

Language with Vision: a Study on Grounded Word and Sentence Embeddings [article]

Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen
2022 arXiv   pre-print
Our model aligns textual embeddings with vision while largely preserving the distributional statistics that characterize word use in text corpora. ... Despite many attempts at language grounding, it is still unclear how to effectively inject visual knowledge into the word embeddings of a language in such a way that a proper balance of textual and visual ... We believe one reason is that word vectors still need to be aware of the textual context they occur in when they are being coupled with their corresponding visual information in images. ...
arXiv:2206.08823v2 fatcat:tecvsmw4xrevvoj4ycoba4g7im
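
The snippet above aims to align textual embeddings with vision while largely preserving their distributional statistics. One generic way to express that trade-off (a sketch under assumptions, not necessarily the authors' formulation; the dimensions and the weight alpha are hypothetical) is an alignment term plus a penalty on drifting from the original vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions: 300-d word embeddings, 512-d image features.
mapper = nn.Linear(300, 300)      # produces the grounded word embedding
to_visual = nn.Linear(300, 512)   # projects it into the visual space

def grounding_loss(word_vecs, image_feats, alpha=0.5):
    """Alignment term pulls mapped word vectors toward their paired image
    features; the preservation term keeps them close to the original vectors."""
    grounded = mapper(word_vecs)
    align = 1.0 - F.cosine_similarity(to_visual(grounded), image_feats, dim=1).mean()
    preserve = F.mse_loss(grounded, word_vecs)
    return align + alpha * preserve

loss = grounding_loss(torch.randn(32, 300), torch.randn(32, 512))
loss.backward()   # gradients reach both linear layers
```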

Context-aware Image Tweet Modelling and Recommendation

Tao Chen, Xiangnan He, Min-Yen Kan
2016 Proceedings of the 2016 ACM on Multimedia Conference - MM '16  
... low-level SIFT or high-level detected objects are far from adequate in interpreting the necessary semantics latent in image tweets. ... We start with the tweet's intrinsic contexts, namely 1) text within the image itself and 2) its accompanying text; then we turn to the extrinsic contexts: 3) the external web page linked to by the tweet's ... Unlike their textual counterparts, images in microblogs have only recently started attracting academic attention. ...
doi:10.1145/2964284.2964291 dblp:conf/mm/ChenHK16 fatcat:nepnjdu5vbffbhrhkisac5p5fe
Showing results 1–15 of 5,241