
Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning [article]

Yang Yang, Hongchen Wei, Hengshu Zhu, Dianhai Yu, Hui Xiong, Jian Yang
2021 arXiv   pre-print
To solve this problem, we propose a novel image captioning method by exploiting Cross-modal Prediction and Relation Consistency (CPRC), which aims to utilize the raw image input to constrain the generated  ...  CPRC utilizes the prediction of the raw image as a soft label to distill useful supervision for the generated sentence, rather than employing traditional pseudo-labeling; 2) Relation consistency.  ...  by exploiting Cross-modal Prediction and Relation Consistency (CPRC).  ... 
arXiv:2110.11767v2 fatcat:nv76mbsz5rdl5h4fmuxsfhmp6a
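The CPRC snippet above contrasts soft-label distillation with conventional pseudo-labeling. A minimal numpy sketch of that distinction, with hypothetical logits and function names (not the paper's actual implementation):

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(image_logits, caption_logits, t=2.0):
    """KL(teacher || student): the raw image's predicted class distribution
    (a soft label) supervises the distribution predicted from the caption."""
    p = softmax(image_logits, t)    # soft target from the image branch
    q = softmax(caption_logits, t)  # prediction from the generated caption
    return float(np.sum(p * (np.log(p) - np.log(q))))

def hard_pseudo_label_loss(image_logits, caption_logits):
    """Conventional pseudo-labeling: the argmax of the image prediction
    becomes a one-hot target for the caption branch (cross-entropy)."""
    y = int(np.argmax(image_logits))
    return float(-np.log(softmax(caption_logits)[y]))
```

The soft target preserves the image branch's full class distribution, which matters when it is uncertain between classes; the hard pseudo-label collapses that distribution to a single class.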

Deep Matching Autoencoders [article]

Tanmoy Mukherjee, Makoto Yamada, Timothy M. Hospedales
2017 arXiv   pre-print
This framework elegantly spans the full spectrum of fully supervised, semi-supervised, and unsupervised (no paired data) multi-modal learning.  ...  We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning.  ...  But unlike prior approaches, it can be generalized to the semi-supervised and unsupervised cases for exploiting unpaired data.  ... 
arXiv:1711.06047v1 fatcat:cujzxw5pwbg53oyb4w2x3jje4q

Multimodal Chain: Cross-Modal Collaboration Through Listening, Speaking, and Visualizing

Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
2021 IEEE Access  
We designed a single model that performs both tasks to exploit this relation between the ASR and IC tasks.  ...  These results suggest that the improvement from multimodal chains is positively related to how much more data is used in the semi-supervised step by leveraging the cross-modal augmentation.  ...  For more information, see  ... 
doi:10.1109/access.2021.3077886 fatcat:leqvf3ukxjebrfa2gge33sg2lu

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions [article]

Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha
2021 arXiv   pre-print
The modeling of a (resource-poor) modality is aided by exploiting knowledge from another (resource-rich) modality using transfer of knowledge between modalities, including their representations and predictive  ...  Multimodal deep learning systems that employ multiple modalities like text, image, audio, video, etc., show better performance than individual-modality (i.e., unimodal) systems.  ...  Conformal prediction is used for semi-supervised learning.  ... 
arXiv:2107.13782v2 fatcat:s4spofwxjndb7leqbcqnwbifq4

Localized Vision-Language Matching for Open-vocabulary Object Detection [article]

Maria A. Bravo, Sudhanshu Mittal, Thomas Brox
2022 arXiv   pre-print
Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information.  ...  It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner, and second specializes  ...  We exploit the multi-modal information by using a cross-attention model and an Image-Caption matching loss L_ICM, the masked language modeling loss L_MLM, and a consistency-regularization loss L_Cons.  ... 
arXiv:2205.06160v2 fatcat:apmon75v6jf5nff3o24d5ivimi
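Consistency regularization, as mentioned in the snippet, typically penalizes disagreement between predictions for two augmented views of the same input. A toy numpy sketch of one common MSE variant (the paper's actual L_Cons may differ):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_a, logits_b):
    """Penalize disagreement between the class distributions predicted for
    two augmented views of the same image-caption pair."""
    return float(np.mean((softmax(logits_a) - softmax(logits_b)) ** 2))
```

The loss is zero when both views yield identical distributions and grows as the predictions drift apart, so unlabeled pairs still provide a training signal.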

Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon
2019 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)  
In this paper, we develop a novel data-efficient semi-supervised framework for training an image captioning model. We leverage massive unpaired image and caption data by learning to associate them.  ...  To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples via Generative Adversarial Networks to learn the joint distribution of image and caption.  ...  Related Work The goal of our work is to deal with unpaired image-caption data for image captioning. Therefore, we mainly focus on image captioning and unpaired data handling literature.  ... 
doi:10.18653/v1/d19-1208 dblp:conf/emnlp/KimCOK19 fatcat:zhofvbzjovgu5absnsuxh5rtxy
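Assigning pseudo-labels to unpaired samples, as this entry describes, can be sketched as threshold-gated matching; the scores below stand in for the GAN discriminator's output, and all names and numbers are illustrative:

```python
import numpy as np

def assign_pseudo_pairs(match_scores, threshold=0.8):
    """match_scores[i, j]: a discriminator's confidence that unpaired image i
    and candidate caption j form a real pair. Returns {image_index:
    caption_index} only for matches that clear the confidence threshold."""
    match_scores = np.asarray(match_scores, dtype=float)
    pairs = {}
    for i, row in enumerate(match_scores):
        j = int(np.argmax(row))      # best candidate caption for image i
        if row[j] >= threshold:      # keep only confident pseudo-pairs
            pairs[i] = j
    return pairs
```

Gating on confidence keeps low-quality pseudo-pairs out of training, at the cost of discarding some genuinely matched but low-scoring samples.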

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Wenhao Chai, Gaoang Wang
2022 Applied Sciences  
Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning.  ...  We also introduce several practical challenges such as missing modalities and noisy modalities.  ...  [53] present a novel self-supervised framework consisting of multiple teachers that leverage diverse modalities, including RGB, depth, and thermal images, to simultaneously exploit complementary  ... 
doi:10.3390/app12136588 fatcat:bokdxwkcwbgjlpblfrwbj4mtxm

Turbo Learning for Captionbot and Drawingbot [article]

Qiuyuan Huang, Pengchuan Zhang, Dapeng Wu, Lei Zhang
2018 arXiv   pre-print
Furthermore, the turbo-learning approach enables semi-supervised learning, since the closed loop can provide pseudo-labels for unlabeled samples.  ...  We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly train an image-to-text generator (a.k.a.  ...  Hence, many images on the Internet have no captions, and semi-supervised learning for CaptionBot is desirable.  ... 
arXiv:1805.08170v2 fatcat:dc3xedekljhrha6r3f3lfaciqa
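The closed loop this entry describes (caption the image, regenerate the image, use round-trip agreement as supervision) can be sketched with toy stand-in models; this is an illustrative reading, not the paper's training procedure:

```python
import numpy as np

def turbo_pseudo_label(image, img_to_txt, txt_to_img, max_err=0.1):
    """One pass around the closed loop: caption the image, re-render an image
    from the caption, and accept the caption as a pseudo-label only if the
    round trip reproduces the input closely. img_to_txt / txt_to_img are
    placeholders for the CaptionBot and DrawingBot models."""
    caption = img_to_txt(image)
    recon = txt_to_img(caption)
    err = float(np.mean((np.asarray(image, dtype=float)
                         - np.asarray(recon, dtype=float)) ** 2))
    return (caption if err <= max_err else None), err
```

With faithful models the reconstruction error is small and the caption is kept; a poor round trip rejects the pseudo-label, which is what lets unlabeled images contribute supervision.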

A Survey of Multi-View Representation Learning [article]

Yingming Li, Ming Yang, Zhongfei Zhang
2017 arXiv   pre-print
sparse coding, and multi-view latent space Markov networks, to neural network-based methods including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural  ...  This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion.  ...  This approach learns multi-modal embeddings for language and visual data and then exploits their complementary information to predict a variable-sized text given an image.  ... 
arXiv:1610.01206v4 fatcat:xsi7ufxnlbdk5lz6ykrsnexfvm

Learning to Recognize Objects from Unseen Modalities [chapter]

C. Mario Christoudias, Raquel Urtasun, Mathieu Salzmann, Trevor Darrell
2010 Lecture Notes in Computer Science  
This allows us to predict the missing data for the labeled examples and exploit all modalities using multiple kernel learning.  ...  We demonstrate the effectiveness of our approach on several multi-modal tasks including object recognition from multi-resolution imagery, grayscale and color images, as well as images and text.  ...  However, these approaches have focused on supervised or semi-supervised scenarios where at least some labels are provided for each modality, and cannot exploit additional unsupervised modalities available  ... 
doi:10.1007/978-3-642-15549-9_49 fatcat:l4y2hqy3f5fcffbqgu53lmeozi
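Multiple kernel learning, mentioned in the snippet above, combines per-modality Gram matrices; a minimal fixed-weight sketch (real MKL learns the weights jointly with the classifier, which this does not do):

```python
import numpy as np

def combined_kernel(kernels, weights):
    """Convex combination of per-modality Gram matrices: each kernel captures
    similarity in one modality (e.g., image features vs. text features)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize to a convex combination
    return sum(wi * np.asarray(K, dtype=float) for wi, K in zip(w, kernels))
```

A convex combination of positive semi-definite kernels is itself a valid kernel, so the combined matrix can be handed directly to any kernel classifier.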

TriReID: Towards Multi-Modal Person Re-Identification via Descriptive Fusion Model

Yajing Zhai, Yawen Zeng, Da Cao, Shaofei Lu
2022 Proceedings of the 2022 International Conference on Multimedia Retrieval  
Particularly, we implement an image captioning model under the active learning paradigm to generate sentences suitable for ReID, in which the quality scores of the three levels are customized.  ...  The cross-modal person re-identification (ReID) aims to retrieve one person from one modality to the other single modality, such as text-based and sketch-based ReID tasks.  ...  ACKNOWLEDGEMENT The authors are highly grateful to the anonymous referees for their careful reading and insightful comments.  ... 
doi:10.1145/3512527.3531397 fatcat:avnfa64cbzfdrh2dvbv5jpfnxm

Aesthetic Image Captioning From Weakly-Labelled Photographs [article]

Koustav Ghosal, Aakanksha Rana, Aljosa Smolic
2019 arXiv   pre-print
Aesthetic image captioning (AIC) refers to the multi-modal task of generating critical textual feedback for photographs.  ...  We propose a probabilistic caption-filtering method for cleaning the noisy web data, and compile a large-scale, clean dataset, "AVA-Captions" (230,000 images with 5 captions per image).  ...  Related Work Due to the multi-modal nature of the task, the problem spans many different areas of image and text analysis, and thus related literature abounds.  ... 
arXiv:1908.11310v1 fatcat:6stzxiowb5dmheme4yrhnpdvku

Attributes as Semantic Units between Natural Language and Visual Recognition [article]

Marcus Rohrbach
2016 arXiv   pre-print
However, it remains a challenge to find the best point of interaction for these very different modalities.  ...  Specifically, we discuss how attributes allow using knowledge mined from language resources for recognizing novel visual categories, how we can generate sentence descriptions of images and video, how  ...  Interestingly, this allows providing supervision for only a subset of the phrases (semi-supervised) or all phrases (fully supervised). For supervised grounding, Plummer et al.  ... 
arXiv:1604.03249v1 fatcat:a5dpwgoddvcsvkovik2gupbri4

Aesthetic Image Captioning From Weakly-Labelled Photographs

Koustav Ghosal, Aakanksha Rana, Aljosa Smolic
2019 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)  
We propose a probabilistic caption-filtering method for cleaning the noisy web data, and compile a large-scale, clean dataset, 'AVA-Captions' (~230,000 images with ~5 captions per image).  ...  Aesthetic image captioning (AIC) refers to the multimodal task of generating critical textual feedback for photographs.  ...  Related Work Due to the multi-modal nature of the task, the problem spans many different areas of image and text analysis, and thus related literature abounds.  ... 
doi:10.1109/iccvw.2019.00556 dblp:conf/iccvw/GhosalRS19 fatcat:w243egxxo5d5nkbdw7k7sjlsay

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions [article]

Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
2022 arXiv   pre-print
Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities.  ...  As background, we present a common structure of functional components in TSGV, in tutorial style: from feature extraction from the raw video and language query, to answer prediction of the target moment  ...  CRM [150] uses a cross-sentence relation mining strategy to explicitly model cross-sentence relations in a paragraph and explore cross-moment relations in a video.  ... 
arXiv:2201.08071v1 fatcat:2k2if6dsyveinec2dmmujcmhkq
Showing results 1 — 15 out of 1,120 results