1,190 Hits in 2.0 sec

Strong and Simple Baselines for Multimodal Utterance Embeddings [article]

Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, Louis-Philippe Morency
2020 arXiv   pre-print
In this paper, we propose two simple but strong baselines to learn embeddings of multimodal utterances. The first baseline assumes a conditional factorization of the utterance into unimodal factors.  ...  Overall, we believe that our strong baseline models offer new benchmarking options for future research in multimodal learning.  ...  We would also like to acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments on this paper.  ... 
arXiv:1906.02125v2 fatcat:olp2n6ewqvg2zcuy3wo4twas6i

Strong and Simple Baselines for Multimodal Utterance Embeddings

Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, Louis-Philippe Morency
2019 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
In this paper, we propose two simple but strong baselines to learn embeddings of multimodal utterances. The first baseline assumes a conditional factorization of the utterance into unimodal factors.  ...  Overall, we believe that our strong baseline models offer new benchmarking options for future research in multimodal learning.  ...  We would also like to acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments on this paper.  ... 
doi:10.18653/v1/n19-1267 dblp:conf/naacl/LiangLTSM19 fatcat:h3d66l3b2zevthf5j4khobij2i

Which is Making the Contribution: Modulating Unimodal and Cross-modal Dynamics for Multimodal Sentiment Analysis [article]

Ying Zeng, Sijie Mai, Haifeng Hu
2021 arXiv   pre-print
cross-modal embedding.  ...  To address the above-mentioned problems, we propose a novel MSA framework Modulation Model for Multimodal Sentiment Analysis (M^3SA) to identify the contribution of modalities and reduce the impact of  ...  To be consistent with prior works, we use 1,284 utterances for training, 229 utterances for validation, and 686 utterances for testing. 2) CMU-MOSEI is a large dataset of multimodal sentiment analysis  ... 
arXiv:2111.08451v1 fatcat:yntxyipgsndo5biphjrdbyfnie

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [article]

Devamanyu Hazarika, Roger Zimmermann, Soujanya Poria
2020 arXiv   pre-print
Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.  ...  Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos.  ...  Traditionally, language modality features have been GloVe [38] embeddings for each token in the utterance.  ...
arXiv:2005.03545v3 fatcat:nyomobnpojcefpyllea3scjdpq

Tensor Fusion Network for Multimodal Sentiment Analysis [article]

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
2017 arXiv   pre-print
In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.  ...  The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice.  ...  We would like to thank the reviewers for their valuable feedback.  ... 
arXiv:1707.07250v1 fatcat:cfxb23yjunbh3pe6oi7wik7iu4

Tensor Fusion Network for Multimodal Sentiment Analysis

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
2017 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing  
In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.  ...  The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice.  ...  We would like to thank the reviewers for their valuable feedback.  ...
doi:10.18653/v1/d17-1115 dblp:conf/emnlp/ZadehCPCM17 fatcat:fwvifa4lpbgnpdyaqat7edunc4
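
The two Tensor Fusion Network entries above propose fusing language, audio, and visual cues for sentiment analysis; the core fusion step of the published model is a three-way outer product of the unimodal embeddings, each extended with a constant 1. The sketch below illustrates only that fusion operation, assuming the per-modality vectors z_l, z_a, z_v were already produced by unimodal sub-networks (names and dimensions here are illustrative, not taken from the paper).

```python
import numpy as np

def tensor_fusion(z_l: np.ndarray, z_a: np.ndarray, z_v: np.ndarray) -> np.ndarray:
    """Three-way outer product of modality embeddings, each extended with a 1.

    Appending the constant 1 keeps the unimodal and bimodal sub-tensors inside
    the trimodal product, so a single tensor carries uni-, bi-, and tri-modal terms.
    """
    l = np.append(z_l, 1.0)                   # (d_l + 1,)
    a = np.append(z_a, 1.0)                   # (d_a + 1,)
    v = np.append(z_v, 1.0)                   # (d_v + 1,)
    fused = np.einsum('i,j,k->ijk', l, a, v)  # (d_l + 1, d_a + 1, d_v + 1)
    return fused.ravel()                      # flattened input to downstream layers

# Toy dimensions; the real model uses learned per-modality sub-network outputs.
print(tensor_fusion(np.random.randn(4), np.random.randn(3), np.random.randn(2)).shape)
# (60,)  ->  (4 + 1) * (3 + 1) * (2 + 1)
```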

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition [article]

Hang Li, Wenbiao Ding, Zhongqin Wu, Zitao Liu
2021 arXiv   pre-print
The results demonstrate that our approach is superior on the prediction tasks for multimodal speech utterances, and it outperforms a wide range of baselines in terms of prediction accuracy.  ...  Speech emotion recognition is a challenging task because the emotion expression is complex, multimodal and fine-grained.  ...  Unimodal Embedding / Acoustic Embedding: For each utterance, we first transform it into n frames {f_i}_{i=1}^{n} of width 25ms and step 10ms.  ...
arXiv:2010.12733v2 fatcat:grht6xtwbja23echyjmmoccbva
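
The acoustic-embedding snippet in the entry above splits each utterance into frames of width 25ms with a step of 10ms. A minimal sketch of that framing step only, assuming a mono waveform sampled at 16 kHz and loaded as a NumPy array (the paper's subsequent per-frame feature extraction is not shown):

```python
import numpy as np

def frame_utterance(waveform: np.ndarray, sample_rate: int = 16000,
                    frame_ms: float = 25.0, step_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames (25 ms width, 10 ms step)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    step_len = int(sample_rate * step_ms / 1000)    # hop size in samples
    n_frames = 1 + max(0, (len(waveform) - frame_len) // step_len)
    # Each slice is one frame f_i; stacked, they form {f_i}_{i=1}^{n}.
    return np.stack([waveform[i * step_len: i * step_len + frame_len]
                     for i in range(n_frames)])

# Example: 2 seconds of audio at 16 kHz -> 198 frames of 400 samples each.
print(frame_utterance(np.zeros(32000)).shape)  # (198, 400)
```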

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features [article]

Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi
2018 arXiv   pre-print
multimodal clips.  ...  We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance-level audio and text cues to recognize human sentiment from  ...  In the OMG dataset, both our unimodal and multimodal models outperform the baseline methods significantly.  ...
arXiv:1805.00625v2 fatcat:m2i3twh3jff5nmw5vmg3ij5zqq

DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder [article]

Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, Sunghun Kim
2019 arXiv   pre-print
In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling.  ...  to a relatively simple (e.g., unimodal) scope.  ...  Average: cosine similarity between the averaged word embeddings in the two utterances (Mitchell and Lapata, 2008).  ...
arXiv:1805.12352v2 fatcat:lxiyzgvo3nettdthjl2adw4xua
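
The last snippet in the entry above refers to the "Average" metric: the cosine similarity between the averaged word embeddings of two utterances (Mitchell and Lapata, 2008). A minimal sketch of that computation, assuming each utterance is already given as a list of per-word embedding vectors (the embedding lookup itself, e.g. from a pretrained table, is not shown):

```python
import numpy as np

def average_embedding_similarity(utt_a: list, utt_b: list) -> float:
    """Cosine similarity between the mean word embeddings of two utterances."""
    a = np.mean(utt_a, axis=0)
    b = np.mean(utt_b, axis=0)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Toy 3-dimensional vectors standing in for real pretrained word embeddings.
utt1 = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
utt2 = [np.array([1.0, 1.0, 0.0])]
print(average_embedding_similarity(utt1, utt2))  # ~0.577
```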

Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors

Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
2019 Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-19)
Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition.  ...  To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based  ...  We also thank the anonymous reviewers for useful feedback.  ... 
doi:10.1609/aaai.v33i01.33017216 fatcat:cx22rdjwbncf7hpqar6fif6uqe

Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors [article]

Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
2018 arXiv   pre-print
Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition.  ...  To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based  ...  We also thank the anonymous reviewers for useful feedback.  ... 
arXiv:1811.09362v2 fatcat:t6lih6egejcwvgdjzli5jnmyda

Modeling Intent, Dialog Policies and Response Adaptation for Goal-Oriented Interactions [article]

Saurav Sahay, Shachi H Kumar, Eda Okur, Haroon Syed, Lama Nachman
2019 arXiv   pre-print
Our bootstrapped models from limited training data perform better than many baseline approaches we have looked at for intent recognition and dialog action prediction.  ...  We have explored various feature extractors and models for improved intent recognition and looked at leveraging previous user and system interactions in novel ways with attention models.  ...  We gratefully acknowledge and thank the Rasa Team and community developers for the framework and contributions that enabled us to further our research and build newer models for the application.  ...
arXiv:1912.10130v1 fatcat:gsehxtfzwrfcpojo5hak3t5npa

Multimodal Association for Speaker Verification

Suwon Shon, James Glass
2020 Interspeech 2020  
To verify this, we use the SRE18 evaluation protocol for experiments and use out-of-domain data, VoxCeleb, for the proposed multimodal fine-tuning.  ...  In this paper, we propose a multimodal association on a speaker verification system for fine-tuning using both voice and face.  ...  For speaker embeddings, we used the entire utterance as input while the face embedding was extracted from the face in the first frame of each video.  ...
doi:10.21437/interspeech.2020-1996 dblp:conf/interspeech/ShonG20 fatcat:qe5jqjazeff2rbccn4dg4ykq7q

Improving Context Modelling in Multimodal Dialogue Generation [article]

Shubham Agarwal, Ondrej Dusek, Ioannis Konstas, Verena Rieser
2018 arXiv   pre-print
We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics.  ...  We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output.  ...  ., Toronto, Canada and the MaDrIgAL EPSRC project (EP/N017536/1). The Titan Xp used for this work was donated by the NVIDIA Corp.  ... 
arXiv:1810.11955v1 fatcat:7eb55m7iejfsbo5qvu4huhhgcq

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis [article]

Sijie Mai, Ying Zeng, Shuangjia Zheng, Haifeng Hu
2021 arXiv   pre-print
In the field of multimodal sentiment analysis (MSA), most previous works focus on exploring intra- and inter-modal interactions.  ...  Moreover, HyCon can naturally generate a large amount of training pairs for better generalization and reduce the negative effect of limited datasets.  ...  Following previous works [10] , [12] , we utilize 1,284 utterances for training, 229 utterances for validation, and 686 utterances for testing.  ... 
arXiv:2109.01797v1 fatcat:oqw3edzt7fha5pmzokp36crvwe
Showing results 1–15 out of 1,190 results