3,466 Hits in 5.8 sec

Self-Supervised Learning from Web Data for Multimodal Retrieval [article]

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
2019 arXiv   pre-print
Self-supervised learning from multimodal image and text data allows deep neural networks to learn powerful features without the need for human-annotated data.  ...  Web and social media platforms provide a virtually unlimited amount of this multimodal data.  ...  Union, grant agreement No 712949 (TECNIOspring PLUS), and the Agency for Business Competitiveness of the Government of Catalonia (ACCIO).  ... 
arXiv:1901.02004v1 fatcat:wpibqwyf2rax7ltrahjnw6vvxy

Multimodal Learning for Web Information Extraction

Dihong Gong, Daisy Zhe Wang, Yang Peng
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
More specifically, our system learns reliable relationships between multimodal information by performing multimodal relation analysis on big unstructured data.  ...  Based on the learned relationships, we further train a set of multimodal rules for information extraction.  ...  In the final stage, we apply the learned multimodal rules to extract information from real-world data.  ... 
doi:10.1145/3123266.3123296 dblp:conf/mm/GongWP17 fatcat:6jamtqpdwrbm7nftyhksc25afa

Table of Contents

2018 IEEE transactions on multimedia  
Wang 914  ...  Big Data Support for Multimedia: Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval ... Yan 985  ...  Knowledge and Semantics Modeling for Multimedia Databases: Predicting Microblog Sentiments via Weakly Supervised Multimodal Deep Learning  ... 
doi:10.1109/tmm.2018.2812678 fatcat:kzbwi44ibjgqxar3ddumod6kwq

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [article]

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi
2021 arXiv   pre-print
In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.  ...  Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions.  ...  During weakly-supervised fine-tuning, we follow the same strategy as image-text retrieval except that we do not perform random cropping, and train the model for 5 epochs.  ... 
arXiv:2107.07651v2 fatcat:o7pwaj3b5bhpffklhp2obbxr7m
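The momentum-distillation idea described in the entry above (learning from pseudo-targets produced by a momentum model) can be sketched as follows. This is a minimal illustration of the general EMA-teacher pattern, not the authors' implementation; all function names and parameter values are made up for the example.

```python
import numpy as np

def softmax(z, t=1.0):
    """Numerically stable softmax with temperature t."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ema_update(teacher, student, m=0.995):
    """Momentum update: the teacher is an exponential moving
    average of the student's parameters."""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def distill_targets(onehot, teacher_probs, alpha=0.4):
    """Mix the hard one-hot labels with the momentum model's soft
    predictions to form pseudo-targets for the student's loss."""
    return (1 - alpha) * onehot + alpha * teacher_probs
```

Because the teacher evolves slowly, its soft predictions are more stable than the student's and can down-weight noisy web labels, which is the motivation the snippet gives for distillation.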

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces

Lluis Gomez, Yash Patel, Marcal Rusinol, Dimosthenis Karatzas, C. V. Jawahar
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
We put forward the idea of performing self-supervised learning of visual features by mining a large scale corpus of multimodal (text and image) documents.  ...  Our experiments demonstrate state of the art performance in image classification, object detection, and multimodal retrieval compared to recent self-supervised or natural-supervised approaches.  ...  Acknowledgment We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.  ... 
doi:10.1109/cvpr.2017.218 dblp:conf/cvpr/Gomez-BigordaPR17 fatcat:paymsbngcbfblfp5ehxgnk3gpm

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning [article]

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein
2020 arXiv   pre-print
Recently, self-supervised multimodal methods that combine vision and language were proposed to learn multimodal representations without annotation.  ...  In this work, we show that the problem of noise estimation for multimodal data can be reduced to a multimodal density estimation task.  ...  This scenario is very common in the case of self-supervised multimodal learning and even when learning from unlabeled instructional videos.  ... 
arXiv:2003.03186v3 fatcat:p576x72txrhuzgesvvgs7gbsui

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions [article]

Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha
2021 arXiv   pre-print
This challenge is addressed by a learning paradigm called multimodal co-learning.  ...  Multimodal machine learning involves multiple aspects: representation, translation, alignment, fusion, and co-learning.  ...  Thus, multimodal co-learning is benefiting from contrastive learning for performance and weakly supervised data from the web.  ... 
arXiv:2107.13782v2 fatcat:s4spofwxjndb7leqbcqnwbifq4

Learning to Learn from Web Data Through Deep Semantic Embeddings [chapter]

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
2019 Lecture Notes in Computer Science  
In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval.  ...  Union, grant agreement No 712949 (TECNIOspring PLUS), and the Agency for Business Competitiveness of the Government of Catalonia (ACCIO).  ... 
doi:10.1007/978-3-030-11024-6_40 fatcat:crzepesrz5bglj3bmunkldk6ey

Learning to Learn from Web Data through Deep Semantic Embeddings [article]

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
2018 arXiv   pre-print
In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval.  ...  Union, grant agreement No 712949 (TECNIOspring PLUS), and the Agency for Business Competitiveness of the Government of Catalonia (ACCIO).  ... 
arXiv:1808.06368v1 fatcat:m4dtjcxevnfmdgz7z47rzulv3e

Self-Paced Cross-Modal Subspace Matching

Jian Liang, Zhihang Li, Dong Cao, Ran He, Jingdong Wang
2016 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval - SIGIR '16  
However, it is usually time-consuming to manually label large-scale data. This paper proposes a Self-Paced Cross-Modal Subspace Matching (SCSM) method for unsupervised multimodal data.  ...  Self-paced learning, which learns samples from 'easy' to 'complex', is further introduced to refine the grouping result.  ...  (IBP) for integrating multimodal data in a latent space.  ... 
doi:10.1145/2911451.2911527 dblp:conf/sigir/LiangLCHW16 fatcat:ogq5jwnxtvhlbkbbzbm7rzzgoa

Multimodal Co-Training for Selecting Good Examples from Webly Labeled Video [article]

Ryota Hinami, Junwei Liang, Shin'ichi Satoh, Alexander Hauptmann
2018 arXiv   pre-print
We tackle the problem of learning concept classifiers from videos on the web without using manually labeled data.  ...  In this paper, we propose an approach called multimodal co-training (MMCo) for selecting good examples from noisy training data.  ...  Learning from search engine results: While we tackle webly labeled learning [26, 27] that learns from web videos with weak annotations, there is another type of webly supervised classifier learning which  ... 
arXiv:1804.06057v1 fatcat:z6ir5vtylfeupiili6yeotf6ym

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval [article]

Zhixiong Zeng, Wenji Mao
2022 arXiv   pre-print
Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval, which takes one type of data as the query to retrieve relevant data of another type.  ...  rarely explored due to the lack of common representation for the multimodal class-level associations.  ...  Existing cross-modal retrieval methods can be broadly categorized into two groups: the unsupervised methods for paired multimodal data and the supervised methods for labeled multimodal data.  ... 
arXiv:2201.02772v2 fatcat:tel7wfh3oncmlmxhzdqsdvq2gi

A Review of Hashing Methods for Multimodal Retrieval

Wenming Cao, Wenshuo Feng, Qiubin Lin, Guitao Cao, Zhihai He
2020 IEEE Access  
With the advent of the information age, the amount of multimedia data has exploded, making fast and efficient retrieval of multimodal data an urgent requirement.  ...  Among many retrieval methods, hashing is widely used for multimodal data retrieval due to its low storage cost and its fast, effective lookups.  ...  B ∈ {−1, +1}^{c×n} 3) SELF-SUPERVISED ADVERSARIAL HASHING (SSAH) [48]: This method introduces mechanisms such as self-supervised semantic generation and adversarial learning, and has made breakthrough progress  ... 
doi:10.1109/access.2020.2968154 fatcat:e3vmte5hrnhu3b3lf5ws4gwnhm
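The snippet above writes the database codes as B ∈ {−1, +1}^{c×n}, i.e. a c-bit binary code per database item. A minimal sketch of why this makes retrieval cheap follows; the random projection used as the hash function is purely illustrative, not a method from the surveyed paper.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n, d = 16, 100, 8                 # c-bit codes, n items, d-dim features
W = rng.normal(size=(c, d))          # stand-in hash function: random projection
X_db = rng.normal(size=(d, n))       # database features
B = np.sign(W @ X_db)                # database codes, B in {-1, +1}^{c x n}

def hamming_rank(query_feat, k=5):
    """Rank database items by Hamming distance to the query's code.

    For codes in {-1, +1}^c, Hamming distance is (c - b_q . b_i) / 2,
    so the whole ranking is one matrix-vector product.
    """
    b_q = np.sign(W @ query_feat)    # query code in {-1, +1}^c
    dist = (c - b_q @ B) / 2
    return np.argsort(dist)[:k]
```

The dot-product trick is what gives hashing its speed: no floating-point similarity over raw features, just bit operations (or one small matrix product) over compact codes.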

Self-supervised learning of visual features through embedding images into text topic spaces [article]

Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
2017 arXiv   pre-print
We put forward the idea of performing self-supervised learning of visual features by mining a large scale corpus of multi-modal (text and image) documents.  ...  Our experiments demonstrate state of the art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or natural-supervised approaches.  ...  Acknowledgment We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.  ... 
arXiv:1705.08631v1 fatcat:c7pu7heiobcexhftqemuuye6pi

Supervised Multimodal Bitransformers for Classifying Images and Text [article]

Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine
2020 arXiv   pre-print
We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks  ...  Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks.  ...  transformers are straightforward and intuitive, and importantly, are easy to implement even for existing self-supervised encoders.  ... 
arXiv:1909.02950v2 fatcat:ujdciqsh5faq3jh2yd4lopraae
Showing results 1 — 15 out of 3,466 results