844 Hits in 4.8 sec

Deep Multimodal Embedding Model for Fine-grained Sketch-based Image Retrieval

Fei Huang, Yong Cheng, Cheng Jin, Yuejie Zhang, Tao Zhang
2017 Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '17  
The difficulties of this task come not only from the ambiguous and abstract characteristics of sketches, which carry less useful information, but also from the cross-modal gap at both the visual and semantic levels.  ...  However, images on the web are always exhibited with multimodal contents.  ...  multimodal embedding model is a combination of classification loss and multimodal ranking loss, which is formulated as $\mathcal{L} = \lambda \mathcal{L}_{cls} + (1 - \lambda) \mathcal{L}_{rank} + \eta \lVert \theta \rVert_2^2$ (9), where $\theta$ denotes the parameters of the embedding neural network  ... 
doi:10.1145/3077136.3080681 dblp:conf/sigir/HuangCJZZ17 fatcat:nifju3tqpjg4xkkobmzocmn33e
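
The snippet's formula combines a classification loss and a multimodal ranking loss with an L2 penalty on the embedding parameters. A minimal PyTorch sketch of that weighted objective follows; the weights lam and mu, the margin, and the use of a triplet margin loss for the ranking term are assumptions, since the snippet does not name them.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, anchor, positive, negative,
                  params, lam=0.5, mu=1e-4, margin=0.2):
    """Weighted sum of a classification loss and a multimodal
    ranking loss, plus an L2 penalty on the embedding parameters."""
    l_cls = F.cross_entropy(logits, labels)
    l_rank = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    l2 = sum(p.pow(2).sum() for p in params)  # squared L2 norm of parameters
    return lam * l_cls + (1 - lam) * l_rank + mu * l2
```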

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann
2019 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)  
We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space.  ...  Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images.  ...  We would like to thank the anonymous reviewers for their constructive suggestions.  ... 
doi:10.18653/v1/d19-1154 dblp:conf/emnlp/HuangCH19 fatcat:a3zs43mshrcvzj7b4bmfhbblty
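
A hedged sketch of one plausible attention-diversity penalty: it discourages different heads from producing similar attention maps by penalizing their pairwise cosine similarity. This illustrates the idea only; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def attention_diversity_penalty(attn):
    """attn: (batch, heads, query_len, key_len) attention weights.
    Penalizes cosine similarity between the attention maps of
    different heads, pushing them to focus on distinct inputs.
    Assumes more than one head."""
    b, h, q, k = attn.shape
    flat = F.normalize(attn.reshape(b, h, q * k), dim=-1)
    sim = flat @ flat.transpose(1, 2)               # (b, h, h) pairwise cosine
    off_diag = sim - torch.eye(h, device=attn.device)
    return off_diag.abs().sum(dim=(1, 2)).mean() / (h * (h - 1))
```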

Joint Visual-Textual Embedding for Multimodal Style Search [article]

Gil Sadeh, Lior Fritz, Gabi Shalev, Eduard Oks
2019 arXiv   pre-print
We introduce a multimodal visual-textual search refinement method for fashion garments.  ...  This joint visual-textual embedding space enables manipulating catalog images semantically, based on textual refinement requirements.  ...  [13] employed a triplet-based ranking loss in order to learn a similar embedding space for images and text, for caption generation and ranking tasks. Karpathy et al.  ... 
arXiv:1906.06620v1 fatcat:hma2wxq4rfaurov7s53zwovmay
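
The snippet cites a triplet-based ranking loss for learning a shared embedding space for images and text. Below is a common batch-wise hinge formulation of that idea in PyTorch; the margin value and cosine scoring are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_embedding_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking over a batch of matched
    (image, text) pairs: each image should score higher with its
    own caption than with any other caption in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                     # cosine similarities
    pos = scores.diag().unsqueeze(1)           # matched-pair scores
    cost = (margin + scores - pos).clamp(min=0)
    cost.fill_diagonal_(0)                     # ignore the positive itself
    return cost.mean()
```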

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations [article]

Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann
2019 arXiv   pre-print
We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space.  ...  Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images.  ...  We would like to thank the anonymous reviewers for their constructive suggestions.  ... 
arXiv:1910.00058v1 fatcat:2huqze5s5zbj7gxkbfj63aaqc4

Diachronic Cross-modal Embeddings

David Semedo, Joao Magalhaes
2019 Proceedings of the 27th ACM International Conference on Multimedia - MM '19  
To achieve this, we trained a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces neighbour instances' temporal alignment through subspace  ...  Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time.  ...  One such constraint would then be enforced for each sampled triplet. In the next section we detail how the ranking loss is extended to cope with the temporal dimension.  ... 
doi:10.1145/3343031.3351036 dblp:conf/mm/SemedoM19a fatcat:sv6uekobmbfxteqybxt6tnv26i
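
The snippet says the ranking loss is extended to cope with the temporal dimension, with one constraint per sampled triplet. A loose sketch of how a standard triplet term might be combined with a temporal-alignment term follows; the specific alignment term (an MSE pull between temporally neighbouring embeddings) and the weight alpha are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def temporal_ranking_loss(anchor, positive, negative,
                          emb_t, emb_t_neighbor, alpha=0.5, margin=0.2):
    """Standard triplet ranking term plus a temporal-alignment term
    that pulls the embeddings of temporally neighbouring instances
    closer together (illustrative form only)."""
    rank = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    temporal = F.mse_loss(emb_t, emb_t_neighbor)
    return rank + alpha * temporal
```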

Learning Social Image Embedding with Deep Multimodal Attention Networks

Feiran Huang, Xiaoming Zhang, Zhoujun Li, Tao Mei, Yueying He, Zhonghua Zhao
2017 Proceedings of the on Thematic Workshops of ACM Multimedia 2017 - Thematic Workshops '17  
To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images.  ...  However, for social images, which contain both link information and multimodal contents (e.g., text description and visual content), simply employing the embedding learnt from network structure or data  ...  Hinge rank loss and cross-entropy loss are used to learn network information and multimodal contents, respectively; (c) various applications can be conducted on the learnt embedding.  ... 
doi:10.1145/3126686.3126720 dblp:conf/mm/HuangZLMHZ17 fatcat:uuj6zj2ahjhlnexauciynqslya
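
Per the snippet, a hinge rank loss models the link structure while a cross-entropy loss models multimodal content. A minimal sketch of that combination, assuming a Siamese-Triplet setup in which an anchor image should sit closer to a linked image than to an unlinked one; the margin and the weighting beta are assumptions.

```python
import torch
import torch.nn.functional as F

def social_embedding_loss(anchor, linked, unlinked,
                          content_logits, content_labels,
                          margin=1.0, beta=1.0):
    """Hinge rank (triplet) loss over the link structure, combined
    with a cross-entropy loss over multimodal content."""
    d_pos = F.pairwise_distance(anchor, linked)
    d_neg = F.pairwise_distance(anchor, unlinked)
    link_loss = F.relu(margin + d_pos - d_neg).mean()
    content_loss = F.cross_entropy(content_logits, content_labels)
    return link_loss + beta * content_loss
```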

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images [article]

Junhua Mao, Jiajing Xu, Yushi Jing, Alan Yuille
2016 arXiv   pre-print
Experiments show that our model benefits from incorporating the visual information into the word embeddings, and a weight sharing strategy is crucial for learning such multimodal embeddings.  ...  In this paper, we focus on training and evaluating effective word embeddings with both text and visual information.  ...  Acknowledgement We are grateful to James Rubinstein for setting up the crowdsourcing experiments for dataset cleanup.  ... 
arXiv:1611.08321v1 fatcat:z3pxvpbxgbenvjocgklfsic4qa
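
The snippet stresses that a weight-sharing strategy is crucial for learning these multimodal word embeddings. One common realization is tying the input embedding matrix to the output projection of an RNN language model, sketched below; whether this matches the authors' exact sharing scheme is an assumption.

```python
import torch
import torch.nn as nn

class TiedEmbeddingLM(nn.Module):
    """RNN language model with the output projection tied to the
    input word-embedding matrix, so both views train one matrix."""
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size, bias=False)
        self.out.weight = self.embed.weight   # weight sharing

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)
```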

Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization

Hai He, Haibo Yang, Pier Luigi Mazzeo
2021 Mathematical Problems in Engineering  
We utilize EDA for text data augmentation and word embedding initialization for the text encoder based on recurrent neural networks, and minimize the gap between the two spaces by a triplet ranking loss with  ...  Multimodality methods like visual semantic embedding have been widely studied recently; they unify images and corresponding texts into the same feature space.  ...  Visual semantic embedding (VSE) is proposed for tackling the problem.  ... 
doi:10.1155/2021/6654071 fatcat:2wyvjcg2hffdheeahgm5lurcmy
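
A minimal sketch of the word-embedding initialization the snippet mentions: the embedding layer of an RNN text encoder is copied from pretrained vectors rather than randomly initialized. The GRU encoder and the use of GloVe-style vectors are assumptions.

```python
import torch
import torch.nn as nn

def init_text_encoder(vocab_size, dim, pretrained_vectors):
    """GRU text encoder whose embedding layer is initialized from
    pretrained word vectors (e.g. GloVe-style) instead of random
    weights; pretrained_vectors has shape (vocab_size, dim)."""
    embed = nn.Embedding(vocab_size, dim)
    with torch.no_grad():
        embed.weight.copy_(torch.as_tensor(pretrained_vectors,
                                           dtype=torch.float))
    encoder = nn.GRU(dim, dim, batch_first=True)
    return embed, encoder
```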

Image-Text Multi-Modal Representation Learning by Adversarial Backpropagation [article]

Gwangbeen Park, Woobin Im
2016 arXiv   pre-print
We use only category information, in contrast with most previous methods, which use image-text pair information for multi-modal embedding.  ...  We also show that our multi-modal feature carries universal semantic information, even though it was trained for category prediction.  ...  ., 2016) ) use VGG-net for image feature extraction and a neural language model for text feature extraction, and apply a ranking loss or triplet ranking loss.  ... 
arXiv:1612.08354v1 fatcat:pvixvhdeejfyvkqdqhjdeovtwa
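
Adversarial backpropagation for modality-invariant features is typically implemented with a gradient reversal layer: the forward pass is the identity, while the backward pass flips the gradient so the embedding learns to fool a modality discriminator. A standard PyTorch sketch (the coefficient lam is an assumption):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so a modality discriminator trains the shared
    embedding adversarially."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```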

Deep Learning based Enhanced Triplet Network Model for Landmark Classification in Image Retrieval

K. Shanmuga Sundari
2019 International Journal for Research in Applied Science and Engineering Technology  
Finally, candidate images are ranked by the classification result as well as the semantic consistency between their visual and text content.  ...  Geo-tagged images are normally annotated for classifier learning.  ...  RELATED WORK AND LITERATURE SURVEY This work relates to a vast area of literature, particularly one-shot learning through generative models, learning an embedding space for one-shot learning, and visual-based tracking  ... 
doi:10.22214/ijraset.2019.8138 fatcat:a3voqhbueffdfnfh6sgdw5r3k4

Learning Multimodal Representations by Symmetrically Transferring Local Structures

Bin Dong, Songlei Jian, Kai Lu
2020 Symmetry  
Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering.  ...  The bidirectional retrieval loss based on multi-layer neural networks is utilized to align the two modalities.  ...  Funding: This work is supported by the National High-level Personnel for Defense Technology Program (2017-JCJQ-ZQ-013), NSF 61902405, and the Hunan Province Science Foundation 2017RS3045.  ... 
doi:10.3390/sym12091504 fatcat:htfhw5equff7hng4gkglz7sf6m
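
The bidirectional retrieval loss the snippet mentions can be sketched as a hinge ranking loss applied symmetrically in both retrieval directions over a batch of aligned pairs; unlike the one-directional sketch earlier in this listing, both modality-A-to-B and B-to-A rankings are penalized. The margin and cosine scoring are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_retrieval_loss(a_emb, b_emb, margin=0.2):
    """Hinge ranking loss applied symmetrically in both retrieval
    directions (A -> B and B -> A) over a batch of matched pairs."""
    a = F.normalize(a_emb, dim=-1)
    b = F.normalize(b_emb, dim=-1)
    scores = a @ b.t()
    pos = scores.diag()
    cost_ab = (margin + scores - pos.unsqueeze(1)).clamp(min=0)  # A -> B
    cost_ba = (margin + scores - pos.unsqueeze(0)).clamp(min=0)  # B -> A
    cost_ab.fill_diagonal_(0)
    cost_ba.fill_diagonal_(0)
    return cost_ab.mean() + cost_ba.mean()
```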

Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval [article]

Zhenguo Yang, Zehang Lin, Peipei Kang, Jianming Lv, Qing Li, Wenyin Liu
2019 arXiv   pre-print
In this paper, we propose to learn a shared semantic space with correlation alignment (S^3CA) for multimodal data representations, which aligns nonlinear correlations of multimodal data distributions in  ...  Furthermore, we project the multimodal data into a shared semantic space for cross-modal (event) retrieval, where the distances between heterogeneous data samples can be measured directly.  ...  We contribute a weakly-aligned, unpaired Wiki-Flickr Event dataset as a complement to the existing paired datasets for cross-modal retrieval.  ... 
arXiv:1901.04268v3 fatcat:hipjb7ba2fg3hp5g5d3oq3kaki
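
S^3CA aligns nonlinear correlations of multimodal distributions; the classic linear form of correlation alignment (CORAL), which matches second-order statistics of the two modalities, gives the flavor of the idea. A sketch follows, with the caveat that the paper's actual alignment is nonlinear:

```python
import torch

def coral_loss(src, tgt):
    """Correlation alignment: match the covariances of two
    modalities' features in the shared space (linear CORAL form).
    src, tgt: (n, d) feature batches."""
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)
    d = src.size(1)
    return ((cov(src) - cov(tgt)) ** 2).sum() / (4 * d * d)
```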

HUSE: Hierarchical Universal Semantic Embeddings [article]

Pradyumna Narayana, Aniket Pednekar, Abishek Krishnamoorthy, Kazoo Sone, Sugato Basu
2019 arXiv   pre-print
This paper proposes a novel method, HUSE, to learn cross-modal representation with semantic information.  ...  Works in the domain of visual semantic embeddings address this problem by first constructing a semantic embedding space based on some external knowledge and projecting image embeddings onto this fixed  ...  Triplet: This baseline uses the triplet loss with semi-hard online learning to decrease the distance between embeddings corresponding to similar classes, while increasing the distance between embeddings  ... 
arXiv:1911.05978v1 fatcat:dvpuwqrdpngj5l5cnieamacpi4
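
The Triplet baseline in the snippet uses semi-hard online learning. A compact sketch of online semi-hard mining over a batch: for each anchor, take the hardest positive and the closest negative that is still farther away than that positive. The distance metric and margin are assumptions.

```python
import torch

def semi_hard_triplet_loss(emb, labels, margin=0.2):
    """Online semi-hard mining: the positive is the hardest
    same-class sample; the negative is the closest different-class
    sample that is still farther from the anchor than the positive."""
    dist = torch.cdist(emb, emb)                       # (n, n) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos_dist = dist.masked_fill(~same | eye, -1.0).max(dim=1).values
    # semi-hard candidates: different class AND farther than the positive
    cand = dist.masked_fill(same | (dist <= pos_dist.unsqueeze(1)),
                            float("inf"))
    neg_dist = cand.min(dim=1).values
    valid = torch.isfinite(neg_dist) & (pos_dist >= 0)
    if not valid.any():
        return emb.new_zeros(())                       # no usable triplet
    return (pos_dist[valid] - neg_dist[valid] + margin).clamp(min=0).mean()
```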

Revisiting Cross Modal Retrieval [article]

Shah Nawaz, Muhammad Kamran Janjua, Alessandro Calefati, Ignazio Gallo
2018 arXiv   pre-print
Most multimodal architectures employ separate networks for each modality to capture the semantic relationship between them.  ...  To our knowledge, this work is the first of its kind in terms of employing a single network and a fused image-text embedding for cross-modal retrieval.  ...  Ranking Supervision Signals Many different multimodal approaches employ some kind of ranking loss function as a supervision signal.  ... 
arXiv:1807.07364v1 fatcat:l4pp5cq6f5e77gnwd3tcbggnne
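
The single network over a fused image-text embedding could be realized, in its simplest form, by concatenating the two feature vectors and passing them through one shared MLP. This generic sketch is an assumption: the snippet does not specify how the fusion is done, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class FusedImageTextNet(nn.Module):
    """One shared network over a fused image-text input, instead of
    one branch per modality: concatenate features, then one MLP."""
    def __init__(self, img_dim=2048, txt_dim=300, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim))

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))
```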

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval [article]

Donghuo Zeng, Yi Yu, Keizo Oyama
2021 arXiv   pre-print
In particular, two significant contributions are made: i) a better representation can be generated by constructing a deep triplet neural network with triplet loss for optimal projections, to maximize correlation  ...  The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings from a shared subspace for computing similarity across different modalities, where generating new  ...  The deep triplet neural network consists of 4 fully connected layers each for audio embedding and visual embedding, and outputs a feature vector with a size of 10.  ... 
arXiv:1908.03737v3 fatcat:qgldi32rrng27gltfbefqay4rq
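
The snippet is explicit that each branch has 4 fully connected layers and outputs a 10-d feature vector. Below is a sketch consistent with that description; the hidden sizes and the input dimensions are assumptions, as the snippet does not state them.

```python
import torch
import torch.nn as nn

def make_branch(in_dim, out_dim=10):
    """One branch of the triplet network: 4 fully connected layers
    ending in a 10-d embedding, as described in the snippet."""
    return nn.Sequential(
        nn.Linear(in_dim, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, out_dim))

audio_net = make_branch(in_dim=128)     # input dims are assumptions
visual_net = make_branch(in_dim=1024)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
```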
Showing results 1 — 15 out of 844 results