Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and  ...  Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors.  ...  designed for) multimodal language sequences.  ... 
doi:10.18653/v1/p19-1656 pmid:32362720 pmcid:PMC7195022 fatcat:acl65gg2wncfljqfjumpe5q7gi
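
The mechanism behind this result is cross-modal attention: the target modality supplies the queries while the (possibly longer, unaligned) source modality supplies the keys and values, so no word-level alignment step is needed. A minimal PyTorch sketch of one directed block; the dimensions and names are illustrative, not taken from the authors' code:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One directed cross-modal block (source -> target), MulT-style.

    Queries come from the target modality; keys and values come from
    the source modality, so the two streams may differ in length.
    """
    def __init__(self, d_model: int = 40, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_tgt, d_model); source: (batch, T_src, d_model)
        fused, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + fused)  # residual connection

# Unaligned streams: 50 language steps vs. 375 audio frames.
lang, audio = torch.randn(2, 50, 40), torch.randn(2, 375, 40)
out = CrossModalAttention()(lang, audio)  # audio -> language; shape (2, 50, 40)
```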

LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [article]

Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, Aimin Zhou
2021 arXiv   pre-print
In this paper, we propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences.  ...  Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging in multimodal emotion recognition.  ...  Conclusion and Future Work In this paper, we propose a neural network to learn modalityfused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences  ... 
arXiv:2112.01697v1 fatcat:g6gjcik3dfeu7dln4igvzjp6iu

Low Rank Fusion based Transformers for Multimodal Sequences [article]

Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman
2020 arXiv   pre-print
transformers.  ...  We present two methods for the Multimodal Sentiment and Emotion Recognition results on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have fewer parameters, train faster, and perform  ...  Table 3: Performance results for multimodal emotion recognition on the IEMOCAP dataset with aligned and unaligned multimodal sequences.  ... 
arXiv:2007.02038v1 fatcat:wtpybgfll5e2noebkknzxjdni4
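
The low-rank fusion referenced here factorizes the tensor-fusion outer product: each modality is projected through rank-r factor matrices and the projections are multiplied elementwise, which avoids materializing the full cross-modal tensor. A hedged sketch in the LMF style; the sizes and the rank below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank approximation of tensor fusion (LMF-style).

    Instead of materializing the outer product of modality vectors,
    project each modality through `rank` factors and multiply the
    projections elementwise, then sum over the rank dimension.
    """
    def __init__(self, dims=(32, 16, 16), d_out: int = 64, rank: int = 4):
        super().__init__()
        # One factor tensor per modality: (rank, d_in + 1, d_out).
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, d_out) * 0.05) for d in dims]
        )

    def forward(self, xs):
        # xs: list of (batch, d_in) unimodal summary vectors.
        fused = None
        for x, w in zip(xs, self.factors):
            ones = torch.ones(x.size(0), 1)            # bias term, as in LMF
            xb = torch.cat([x, ones], dim=-1)          # (batch, d_in + 1)
            proj = torch.einsum('bi,rio->bro', xb, w)  # (batch, rank, d_out)
            fused = proj if fused is None else fused * proj
        return fused.sum(dim=1)  # (batch, d_out)

lang, audio, vision = torch.randn(2, 32), torch.randn(2, 16), torch.randn(2, 16)
print(LowRankFusion()([lang, audio, vision]).shape)  # torch.Size([2, 64])
```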

MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences [article]

Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, Louis-Philippe Morency
2021 arXiv   pre-print
We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions across modalities and through time.  ...  MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data.  ...  Acknowledgements We thank Jianing Qian, Xiaochuang Han and Haoping Bai at CMU and the anonymous reviewers at NAACL for providing helpful discussions and feedback.  ... 
arXiv:2010.11985v2 fatcat:c2xcsgu6ibbivnqwa5zqedfnri
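
The graph-construction step this abstract describes can be pictured concretely: every timestep of every modality becomes a node tagged with its modality, and directed edges connect nodes across modalities and through time, typed by modality pair and temporal order. A toy sketch with networkx, loosely following that construction (MTAG's actual edge pruning and graph attention are more involved):

```python
import networkx as nx

def build_modal_temporal_graph(seqs):
    """seqs: dict of modality -> list of feature vectors (unaligned lengths).

    Each timestep becomes a heterogeneous node; each ordered node pair
    gets a directed edge typed by modality pair and temporal order
    (past / present / future).
    """
    g = nx.DiGraph()
    nodes = [(m, t) for m, seq in seqs.items() for t in range(len(seq))]
    for m, t in nodes:
        g.add_node((m, t), modality=m, feat=seqs[m][t])
    for u_m, u_t in nodes:
        for v_m, v_t in nodes:
            if (u_m, u_t) == (v_m, v_t):
                continue
            order = 'past' if u_t < v_t else 'future' if u_t > v_t else 'present'
            g.add_edge((u_m, u_t), (v_m, v_t), etype=(u_m, v_m, order))
    return g

g = build_modal_temporal_graph({'lang': [[0.1]] * 3, 'audio': [[0.2]] * 5})
print(g.number_of_nodes(), g.number_of_edges())  # 8 nodes, 8 * 7 = 56 edges
```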

Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion [article]

Sijie Mai, Songlong Xing, Jiaxuan He, Ying Zeng, Haifeng Hu
2021 arXiv   pre-print
In this paper, we study the task of multimodal sequence analysis which aims to draw inferences from visual, language and acoustic sequences.  ...  In the second stage, given that the multimodal sequences are unaligned, the commonly considered word-level fusion does not pertain.  ...  Particularly, it is of great significance to learn longer temporal dependency for unaligned multimodal sequence analysis because the unaligned sequences are often very long.  ... 
arXiv:2011.13572v3 fatcat:dyjfr5vidnheliuupsntv255da

Integrating Multimodal Information in Large Pretrained Transformers

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque
2020 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics  
While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused  ...  In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis.  ...  MulT (Multimodal Transformer for Unaligned Multimodal Language Sequence) uses three sets of Transformers and combines their output in a late fusion manner to model a multimodal sequence (Tsai et al.,  ... 
doi:10.18653/v1/2020.acl-main.214 pmid:33782629 pmcid:PMC8005298 fatcat:xh5n4xcxkjhwlnjvbjvu7zxmpy
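
The late-fusion pattern this snippet attributes to MulT-style models — one transformer per modality, outputs combined only at the end — is easy to sketch. The dimensions below are illustrative, not the paper's own implementation:

```python
import torch
import torch.nn as nn

class LateFusionTransformers(nn.Module):
    """Three unimodal transformer encoders whose pooled outputs are
    concatenated and classified: the late-fusion pattern described
    in the snippet above."""
    def __init__(self, d_model: int = 32, n_classes: int = 2):
        super().__init__()
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.encoders = nn.ModuleList([make_enc() for _ in range(3)])
        self.head = nn.Linear(3 * d_model, n_classes)

    def forward(self, lang, audio, vision):
        pooled = [enc(x).mean(dim=1)  # mean-pool each stream over time
                  for enc, x in zip(self.encoders, (lang, audio, vision))]
        return self.head(torch.cat(pooled, dim=-1))

model = LateFusionTransformers()
logits = model(torch.randn(2, 50, 32), torch.randn(2, 375, 32),
               torch.randn(2, 150, 32))  # unaligned lengths are fine here
```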

Deep Multimodal Emotion Recognition on Human Speech: A Review

Panagiotis Koromilas, Theodoros Giannakopoulos
2021 Applied Sciences  
In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results on the reported methodologies.  ...  This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information.  ...  Multimodal emotion recognition evaluation results for unaligned sequential input data.  ... 
doi:10.3390/app11177962 fatcat:cezjfmjmvbgapo3tdz5j3iecp4

Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [article]

Zijian Zhang, Chenxi Zhang, Qinpei Zhao, Jiangfeng Li
2021 arXiv   pre-print
Existing approaches mainly focus on the enhancement of multimodal fusion, while ignoring the unalignment among multiple inputs and the emphasis of different segments in feature, which has resulted in the  ...  To alleviate these problems, we propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities (by low-level cross-modal interaction module  ...  METHODOLOGY -MHSF In this section, we describe our proposed Multimodal Hierarchical Selective Transformer (mHsf, in Figure 2 ) for modeling unaligned multimodal streams.  ... 
arXiv:2108.05123v1 fatcat:c37pmbyhujhbjkeeojhgghtsrq

Hierachical Delta-Attention Method for Multimodal Fusion [article]

Kunjal Panchal
2020 arXiv   pre-print
The addition of attention is new to the multimodal fusion field, and it is still under scrutiny at what stage the attention mechanism should be applied; this work achieves competitive accuracy for overall  ...  "Multimodal Transformer for Unaligned Multimodal Language Sequences" [22] uses the cross-modal mechanism at the start to fuse the modalities in the following manner: Visual → Acoustics || Language → Acoustics, Visual → Language || Acoustics → Language, and Acoustics → Visual || Language → Visual.  ... 
arXiv:2011.10916v1 fatcat:vlxhkdnicvbs7arly4muhnkkqq
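
The six directed pairings listed in the snippet (two source modalities attending into each target) can be enumerated mechanically. A compact sketch using PyTorch's multi-head attention; concatenating the incoming streams per target is one simple choice, not necessarily the cited paper's:

```python
import torch
import torch.nn as nn
from itertools import permutations

d = 32
streams = {'L': torch.randn(2, 50, d),   # language
           'A': torch.randn(2, 375, d),  # acoustic
           'V': torch.randn(2, 150, d)}  # visual

# One attention module per directed pair: A->L, V->L, L->A, V->A, L->V, A->V.
attn = {(s, t): nn.MultiheadAttention(d, 4, batch_first=True)
        for s, t in permutations(streams, 2)}

fused = {}
for tgt in streams:
    # Each target receives the two cross-modal streams aimed at it,
    # concatenated along the feature dimension.
    incoming = [attn[(src, tgt)](streams[tgt], streams[src], streams[src])[0]
                for src in streams if src != tgt]
    fused[tgt] = torch.cat(incoming, dim=-1)  # (batch, T_tgt, 2 * d)

print({k: tuple(v.shape) for k, v in fused.items()})
```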

From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation [article]

Dhruv Agarwal, Tanay Agrawal, Laura M. Ferrari, François Bremond
2021 arXiv   pre-print
We introduce for the first time the concept of knowledge distillation in transformers to use only one modality at inference time.  ...  Multimodal Deep Learning has garnered much interest, and transformers have triggered novel approaches, thanks to the cross-attention mechanism.  ...  [22] proposed the Multimodal Transformer (MulT) to learn representations directly from unaligned multimodal data.  ... 
arXiv:2110.08270v2 fatcat:6gcafuownvauphwc6r32laqzdu
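
The distillation idea in this entry — a multimodal teacher supervising a unimodal student so that a single modality suffices at inference — reduces, in its simplest form, to a standard soft-label KD loss. A generic sketch; the temperature and weighting are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label KD: the unimodal student matches the multimodal
    teacher's softened distribution, plus the usual hard-label CE."""
    t = temperature
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction='batchmean') * (t * t)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Teacher saw all modalities during training; the student sees only one.
loss = distillation_loss(torch.randn(8, 4), torch.randn(8, 4),
                         torch.randint(0, 4, (8,)))
```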

Factorized Multimodal Transformer for Multimodal Sequential Learning [article]

Amir Zadeh, Chengfeng Mao, Kelly Shi, Yiwei Zhang, Paul Pu Liang, Soujanya Poria, Louis-Philippe Morency
2019 arXiv   pre-print
In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT), for multimodal sequential learning.  ...  The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand, without encountering difficulties during training (e.g. overfitting) even  ...  Transformer for [Un]aligned Sequences (Tsai et al., 2019).  ... 
arXiv:1911.09826v1 fatcat:b3264cyhubggrckkkjoiwv2s4e
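
The factorization in FMT amounts to running separate self-attentions over the unimodal, bimodal, and trimodal combinations of the three streams — the seven non-empty subsets of {language, acoustic, visual} — rather than one monolithic attention. A rough sketch of that enumeration, far simpler than the paper's actual block:

```python
import torch
import torch.nn as nn
from itertools import combinations

d = 32
streams = {'L': torch.randn(2, 10, d),
           'A': torch.randn(2, 10, d),
           'V': torch.randn(2, 10, d)}

# 7 subsets: (L,), (A,), (V,), (L,A), (L,V), (A,V), (L,A,V)
subsets = [c for r in (1, 2, 3) for c in combinations(streams, r)]

attn = {s: nn.MultiheadAttention(d, 4, batch_first=True) for s in subsets}
outputs = []
for s in subsets:
    x = torch.cat([streams[m] for m in s], dim=1)    # concat along time
    outputs.append(attn[s](x, x, x)[0].mean(dim=1))  # self-attention + pool
summary = torch.cat(outputs, dim=-1)  # (2, 7 * d) factorized summary
```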

Multimodal Transformer with Multi-View Visual Representation for Image Captioning [article]

Jun Yu, Jing Li, Zhou Yu, Qingming Huang
2019 arXiv   pre-print
Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning.  ...  Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework.  ...  Fig. 2: Multimodal Transformer (MT) model for image captioning.  ... 
arXiv:1905.07841v1 fatcat:h3mzklznz5fbfe5no7m5x7dmnq

CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French

AmirAli Bagher Zadeh, Yansheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria, Louis-Philippe Morency
2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)  
Our evaluations on a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further research for multilingual studies in multimodal language.  ...  As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French.  ...  The maximum sequence length is set at 50. Sequences are padded on the left with zeros. For language, we use the one-hot representation of the words.  ... 
doi:10.18653/v1/2020.emnlp-main.141 pmid:33969362 pmcid:PMC8106386 fatcat:rdq566qrk5h5lmweffg6whts2q
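
The preprocessing details quoted in this snippet (maximum length 50, left zero-padding, one-hot word features) are straightforward to reproduce. A numpy sketch; the vocabulary size and the all-zero row for the pad token are illustrative conventions, not specified by the paper:

```python
import numpy as np

MAX_LEN = 50  # maximum sequence length, as stated in the snippet

def left_pad(seq, max_len=MAX_LEN):
    """Truncate to max_len, then pad on the LEFT with zeros."""
    seq = list(seq)[-max_len:]
    return [0] * (max_len - len(seq)) + seq

def one_hot(ids, vocab_size):
    """One-hot rows for word ids; id 0 (padding) stays an all-zero row."""
    out = np.zeros((len(ids), vocab_size))
    for i, w in enumerate(ids):
        if w > 0:
            out[i, w] = 1.0
    return out

ids = left_pad([5, 17, 3])             # 47 leading zeros, then 5, 17, 3
feats = one_hot(ids, vocab_size=100)   # shape (50, 100)
```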

When Language Evolution Meets Multimodality: Current Status and Challenges Toward Multimodal Computational Models

Patrizia Grifoni, Arianna D'Ulizia, Fernando Ferri
2021 IEEE Access  
a multimodal language evolution model.  ...  INDEX TERMS Natural languages, multimodality, computational modeling, agent-based modeling, language evolution.  ...  [56] also explored cross-modal attention mechanisms and proposed a multimodal transformer for modeling unaligned multimodal language sequences following a late fusion strategy.  ... 
doi:10.1109/access.2021.3061756 fatcat:f2gutl4pnvdkli7zdxoubzhzka
Showing results 1 — 15 out of 179 results