Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [article]

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee
2020 pre-print
In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.  ...  All the code and data is available at  ...  Resources supporting this work were provided by the 'Ministry of Science and ICT' and NIPA ("HPC Support" Project).  ... 
doi:10.1145/3414685.3417838 arXiv:2009.02119v1 fatcat:7xm4shpylbfz7oowsor46nsld4

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [article]

Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou
2022 arXiv   pre-print
To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.  ...  To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations.  ...  Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).  ... 
arXiv:2203.13161v1 fatcat:6rnee7ftjjefdpdfpltt34hody

Freeform Body Motion Generation from Speech [article]

Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
2022 arXiv   pre-print
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.  ...  Extensive experiments demonstrate the superior performance against several baselines, in terms of motion diversity, quality and syncing with speech.  ...  network for speech to gesture generation. • Speech Drives Template (Tmpt) [23] learns a set of gesture templates to relieve the ambiguity of the mapping from speech to body motion. • Trimodal-Context  ... 
arXiv:2203.02291v1 fatcat:aqzd5yqpbjebzlubngoqh4gy4y

Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning [article]

Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, Dinesh Manocha
2021 arXiv   pre-print
Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish  ...  We leverage the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective  ...  Acknowledgment This work has been supported in part by ARO Grants W911NF1910069 and W911NF1910315, and Intel.  ... 
arXiv:2108.00262v1 fatcat:qkdnpmnldze6pjnnxlnnahix4e

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [article]

Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
2021 arXiv   pre-print
The existing datasets are collected to cover as many different phonemes as possible instead of sentences, thus limiting the capability of the audio-based model to learn more diverse contexts.  ...  In contrast to prior approaches which learn phoneme-level features from the text, we investigate the high-level contextual text features for speech-driven 3D facial animation.  ...  .; and Battenberg, E.; and Nieto, O. 2015. librosa: Audio and music signal analysis in python.  ...  Lee, G. 2020. Speech gesture generation from the trimodal  ... 
arXiv:2112.02214v2 fatcat:77tyq4cslfatrghj7aypwnmnuy

Multimodal Sentiment Analysis: Addressing Key Issues and Setting up the Baselines [article]

Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Alexander Gelbukh, Amir Hussain
2019 arXiv   pre-print
We also discuss some major issues, frequently ignored in multimodal sentiment analysis research, e.g., role of speaker-exclusive models, importance of different modalities, and generalizability.  ...  This framework illustrates the different facets of analysis to be considered while performing multimodal sentiment analysis and, hence, serves as a new benchmark for future research in this emerging field  ...  [15] fused information from audio, visual and text modalities to extract emotion and sentiment. Metallinou et al. [9] fused audio and text modalities for emotion recognition.  ... 
arXiv:1803.07427v2 fatcat:jytchjl3gnbpjkyvp4kb3ih5tu

Deep Multimodal Emotion Recognition on Human Speech: A Review

Panagiotis Koromilas, Theodoros Giannakopoulos
2021 Applied Sciences  
This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information.  ...  , although in one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies.  ...  The authors propose a deep architecture for the problem of speech emotion recognition, and thus they consider the two modalities of audio and text.  ... 
doi:10.3390/app11177962 fatcat:cezjfmjmvbgapo3tdz5j3iecp4

A review of affective computing: From unimodal analysis to multimodal fusion

Soujanya Poria, Erik Cambria, Rajiv Bajpai, Amir Hussain
2017 Information Fusion  
Multimodality is defined by the presence of more than one modality or channel, e.g., visual, audio, text, gestures, and eye gaze.  ...  In this paper, we focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities.  ...  ] aimed to integrate information from facial expressions, body movement, gestures and speech, for recognition of eight basic emotions.  ... 
doi:10.1016/j.inffus.2017.02.003 fatcat:ytebhjxlz5bvxcdghg4wxbvr6a

Self-reference in early speech of children speaking Slovak

Jana Kesselová
2018 Journal of Language and Cultural Education  
A child's speech is researched from the very first occurrence of a self-reference mean in 16th month up to the upper limit of early age (36th month) and all that is based on audio-visual records transcripts  ...  The study focuses on the process of being aware of own I in children acquiring Slovak language at an early age and living in a Slovak family.  ...  Acknowledgment This work was supported by the project VEGA 1/0099/16 Personal and Social Deixis in Slovak Language.  ... 
doi:10.2478/jolace-2018-0013 fatcat:o47z3fios5b4vky4stcrllzywu

Speech technology for unwritten languages

Odette Scharenborg, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux, Laurent Besacier (+7 others)
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.  ...  In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech  ...  The authors would like to thank Sanjeev Khudanpur and the rest of the Johns Hopkins University team and the local team at Carnegie Mellon University for organizing the JSALT workshop.  ... 
doi:10.1109/taslp.2020.2973896 fatcat:mjhxfnrnq5g73jis6stemoogem

M2R2: Missing-Modality Robust emotion Recognition framework with iterative data augmentation [article]

Ning Wang
2022 arXiv   pre-print
Present models generally predict the speaker's emotions by its current utterance and context, which is degraded by modality missing considerably.  ...  Firstly, a network called Party Attentive Network (PANet) is designed to classify emotions, which tracks all the speakers' states and context.  ...  More structures and techniques with suitable common representation learning methods should be tested, which we plan to explore in the future.  ... 
arXiv:2205.02524v1 fatcat:vh624wdr3bdjfeqogzv5yh2wri

Crossmodal Audio and Tactile Interaction with Mobile Touchscreens

Eve Hoggan
2010 International Journal of Mobile Human Computer Interaction  
The final study involved a longitudinal evaluation of a touchscreen application, CrossTrainer, focusing on longitudinal effects on performance with audio and tactile feedback, the impact of context on  ...  Experiments showed that keyboards with audio or tactile feedback produce fewer errors and greater speeds of text entry compared to standard touchscreen keyboards.  ...  , bimodal or trimodal conditions containing audio.  ... 
doi:10.4018/jmhci.2010100102 fatcat:wbntzbzojbhejgpkcr3jj7fjhy

The Acoustic and Auditory Contexts of Human Behavior

Elizabeth C. Blake, Ian Cross
2015 Current Anthropology  
We propose that a framework requires to be developed in which inferences can be made about the significance of sound in the past that are not bounded by the particularities of current cultural contexts  ...  Such a framework should be multidisciplinary and draw on what is known scientifically about human sensitivities to and uses of sound, including nonverbal vocalizations, speech and music, ethological studies  ...  Acknowledgments We would like to thank the two anonymous reviewers and Professor Steven Feld for their insightful comments and suggestions during the final preparation of this manuscript.  ... 
doi:10.1086/679445 fatcat:v2roy7hcxzgr3d4bgqmf54ub3y

Ubiquitous emotion-aware computing

Egon L. van den Broek
2011 Personal and Ubiquitous Computing  
The combination of heart rate variability and three speech measures (i.e., variability of the fundamental frequency of pitch (F0), intensity, and energy) explained 90% (p < .001) of the participants' experienced  ...  Environment (or context), the personality trait neuroticism, and gender proved to be useful when a nuanced assessment of people's emotions was needed.  ...  Acknowledgments The author gratefully acknowledges the support of the BrainGain Smart Mix Programme of the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture and  ... 
doi:10.1007/s00779-011-0479-9 fatcat:rgxhiqgrafewbabhukji4mn4gu

Asian Social Science, Vol. 5, No. 8, August, 2009. All in one PDF file

Editor ASS
2009 Asian Social Science  
In the eyes of literary critics, rhetorical texts include fictions, poems, prose and dramas and so on while they cover texts of speech and debate in the scholars' eyes of communication.  ...  Socio-cultural Knowledge and Miscommunication From the above examples we can see that the signaling of speech activities is not a matter of unilateral action but rather of speaker-listener coordination  ...  How to make the best resource combination and how to realize sales growth in marketing circular is the main subject of this text.  ... 
doi:10.5539/ass.v5n8p0 fatcat:lu76tvd6pjagzkx5aq5n2hta7e