206 Hits in 4.4 sec

Multimodal Machine Learning: A Survey and Taxonomy [article]

Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
2017 arXiv   pre-print
We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and  ...  It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.  ...  The semantic correlation maximization method [248] also encourages semantic relevance while retaining correlation maximization and orthogonality of the resulting space; this leads to a combination of CCA  ... 
arXiv:1705.09406v2 fatcat:262fo4sihffvxecg4nwsifoddm

Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval [article]

Jiwei Zhang, Yi Yu, Suhua Tang, Jianming Wu, Wei Li
2021 arXiv   pre-print
as constraints to reinforce the mutuality of audio-visual information.  ...  On the one hand, audio encoder and visual encoder separately encode audio data and visual data into two different latent spaces.  ...  data (such as text, audio, and visual).  ... 
arXiv:2112.02601v1 fatcat:iowujbzu4vfqhei5gwo3hma2d4

Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network [article]

Yuxin Peng, Jinwei Qi, Yuxin Yuan
2017 arXiv   pre-print
Different modalities such as image and text have imbalanced and complementary relationships, containing unequal amounts of information when describing the same semantics.  ...  Finally, the complementarity between the semantic spaces for different modalities is explored by adaptive fusion of the modality-specific cross-modal similarities to perform cross-modal retrieval.  ...  of adaptive fusion on different semantic spaces.  ... 
arXiv:1708.04776v1 fatcat:276kjyo5vvdhbmsvoqjc4vsc6i

Fusing Music and Video Modalities Using Multi-timescale Shared Representations

Bing Xu, Xiaogang Wang, Xiaoou Tang
2014 Proceedings of the ACM International Conference on Multimedia - MM '14  
The effectiveness of our method is demonstrated through MV classification and retrieval.  ...  We propose a deep learning architecture to solve the problem of multimodal fusion of multi-timescale temporal data, using music and video parts extracted from Music Videos (MVs) in particular.  ...  CCA of deep rep. uses deep representations of videos and audio, and outperforms CCA using initial features as input, which proves deep representations are easier to fuse than initial features.  ... 
doi:10.1145/2647868.2655069 dblp:conf/mm/XuWT14 fatcat:hhkmvyfpzndrllcalllwp7x5oq

An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges [article]

Yuxin Peng, Xin Huang, Yunzhen Zhao
2017 arXiv   pre-print
It is noted that we have constructed a new dataset, XMedia, which is the first publicly available dataset with up to five media types (text, image, video, audio and 3D model).  ...  However, the requirements of users are highly flexible, such as retrieving relevant audio clips with an image query.  ...  For the Wikipedia, XMedia and Clickture datasets, we take the same strategy as [27] to generate both text and image representations, and the representations of video, audio and 3D model are the same as  ... 
arXiv:1704.02223v4 fatcat:z7ez63kodvejpfrodeszdtkccy

Kernel cross-modal factor analysis for multimodal information fusion

Yongjin Wang, Ling Guan, A. N. Venetsanopoulos
2011 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Experimental results show that the proposed approach outperforms concatenation-based feature-level fusion, the linear CFA, as well as the canonical correlation analysis (CCA) and kernel CCA methods  ...  The effectiveness of the introduced solution is demonstrated through experimentation on an audiovisual emotion recognition problem.  ...  [4] , and fusion of text and image for spectral clustering [5] .  ... 
doi:10.1109/icassp.2011.5946963 dblp:conf/icassp/WangGV11 fatcat:5iuhhhpymrdhbpkcyrjqpttaxi

Cross-media analysis and reasoning: advances and directions

Yu-xin Peng, Wen-wu Zhu, Yao Zhao, Chang-sheng Xu, Qing-ming Huang, Han-qing Lu, Qing-hua Zheng, Tie-jun Huang, Wen Gao
2017 Frontiers of Information Technology & Electronic Engineering  
However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the  ...  To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge  ...  Acknowledgements The authors would like to thank Peng CUI, Shi-kui WEI, Ji-tao SANG, Shu-hui WANG, Jing LIU, and Bu-yue QIAN for their valuable discussions and assistance.  ... 
doi:10.1631/fitee.1601787 fatcat:dqnizhdlbfhpvodzkhv5nlarxq

Semantic Correlation based Deep Cross-Modal Hashing for Faster Retrieval

2019 International Journal of Innovative Technology and Exploring Engineering, Volume 8, Issue 10 (Regular Issue)  
The MNIST dataset is a multi-view dataset, and the results show that DCCA outperforms CCA and KCCA by learning representations with higher correlations.  ...  In this paper, experiments are performed using correlation methods such as CCA, KCCA and DCCA on the MNIST dataset.  ...  We have two multi-layer perceptrons, one for each view, and the final output representations are related through CCA.  ... 
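The DCCA setup described in this entry — one network per view, with the two output representations related through CCA — bottoms out in plain linear CCA. As a minimal sketch of that final step (this is illustrative code, not the paper's implementation; function name, regularizer, and dimensions are assumptions), the canonical correlations between two views are the singular values of the whitened cross-covariance:

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Top-k canonical correlations between views X (n, dx) and Y (n, dy).

    A small ridge term `reg` keeps the covariance factorizations stable.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance Lx^{-1} Sxy Ly^{-T}; its singular
    # values are the canonical correlations.
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T
    return np.linalg.svd(T, compute_uv=False)[:k]
```

In a DCCA-style pipeline, `X` and `Y` would be the outputs of the two per-view MLPs; a shared latent factor in both views shows up as a canonical correlation near 1, while unrelated dimensions stay near 0.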
doi:10.35940/ijitee.i8157.0881019 fatcat:jy2hfk7vx5dpzcr74lbvtvml7i

Building Multi-model Collaboration in Detecting Multimedia Semantic Concepts

Hsin-Yu Ha, Fausto Fleites, Shu-Ching Chen
2013 Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing  
It has been shown that multimodal fusion plays an important role in elevating the performance of both multimedia content-based retrieval and semantic concept detection.  ...  The correlation between the medoid of a feature cluster and a semantic concept is introduced to identify the capability of a classification model.  ...  [16] proposed an audio-visual fusion framework, in which CCA is used to project the audio and visual features into more compact subspaces.  ... 
doi:10.4108/icst.collaboratecom.2013.254110 dblp:conf/colcom/HaFC13 fatcat:bddsyn2cajgk7h675s5m34hddm

Deep multimodal representation learning: a survey

Wenzhong Guo, Jianwen Wang, Shiping Wang
2019 IEEE Access  
methods into three frameworks: joint representation, coordinated representation, and encoder-decoder.  ...  INDEX TERMS Multimodal representation learning, multimodal deep learning, deep multimodal fusion, multimodal translation, multimodal adversarial learning.  ...  TABLE 2. A summary of typical applications of the three frameworks. Each application may include some of the modalities such as audio, video, image, and text, which are denoted by their first letter.  ... 
doi:10.1109/access.2019.2916887 fatcat:ms4wcgl5rncsbiywz27uss4ysq

VideoStory Embeddings Recognize Events when Examples are Scarce [article]

Amirhossein Habibian, Thomas Mensink, Cees G.M. Snoek
2015 arXiv   pre-print
The key in such a challenging setting is a semantic video representation.  ...  predictability. We show how learning the VideoStory using a multimodal predictability loss, including appearance, motion and audio features, results in a more predictable representation.  ...  Then, the event classifiers are trained and applied on the semantic video representations. 4. VideoStory late fusion.  ... 
arXiv:1511.02492v1 fatcat:67urjusqc5dktce54kmape7mja

Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis [article]

Zhongkai Sun, Prathusha Sarma, William Sethares, Yingyu Liang
2019 arXiv   pre-print
correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video.  ...  Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden)  ...  Text-based audio and text-based video features derived from the two CNNs are input into a CCA layer, which consists of two projections and a CCA loss calculator.  ... 
arXiv:1911.05544v2 fatcat:pevkekm5mrgzbpbs5637cggxue

Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

Zhongkai Sun, Prathusha Sarma, William Sethares, Yingyu Liang
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)  
correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video.  ...  Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden)  ...  Outer-product matrices of text-audio and text-video are used as input to the Deep CCA network.  ... 
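The "text-based audio" construction in this entry is a per-utterance outer product of the text feature vector with the audio (or video) feature vector, flattened into one input for the Deep CCA network. A minimal sketch of that feature step (the function name and dimensions are assumptions, not the authors' code):

```python
import numpy as np

def text_based_features(text, other):
    """Flattened per-utterance outer products, e.g. 'text-based audio'.

    text:  (n, dt) text features for n utterances.
    other: (n, da) audio or video features for the same utterances.
    Returns an (n, dt * da) matrix where row i is the flattened outer
    product of text[i] and other[i], suitable as input to a CCA layer.
    """
    return np.einsum('ni,nj->nij', text, other).reshape(text.shape[0], -1)
```

For a single utterance with text vector [1, 2] and audio vector [3, 4, 5], this yields the six pairwise products [3, 4, 5, 6, 8, 10], so every text dimension modulates every audio dimension before the correlation is learned.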
doi:10.1609/aaai.v34i05.6431 fatcat:ybwaoags4bhtppi4nxwehgtkpq

Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA [article]

Donghuo Zeng, Yi Yu, Keizo Oyama
2019 arXiv   pre-print
audio and video.  ...  To this end, we propose a novel audio-visual embedding algorithm by Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between  ...  The first author would like to thank Francisco Raposo for discussing how to implement CCA.  ... 
arXiv:1908.03744v1 fatcat:2l2vdm7a7zatvdbk6ja2tfdmqi

Deep Learning Techniques for Future Intelligent Cross-Media Retrieval [article]

Sadaqat ur Rehman, Muhammad Waqas, Shanshan Tu, Anis Koubaa, Obaid ur Rehman, Jawad Ahmad, Muhammad Hanif, Zhu Han
2020 arXiv   pre-print
In this paper, we provide a novel taxonomy according to the challenges faced by multi-modal deep learning approaches in solving cross-media retrieval, namely: representation, alignment, and translation  ...  It plays a significant role in big data applications and consists of searching and finding data across different types of media.  ...  Supervised Semantic Hashing (DVSH) model for cross-media retrieval.  ... 
arXiv:2008.01191v1 fatcat:t63bg55w2vdqjcprzaaidrmprq
Showing results 1–15 of 206.