
Audio-Linguistic Embeddings for Spoken Sentences [article]

Albert Haque, Michelle Guo, Prateek Verma, Li Fei-Fei
2019 arXiv   pre-print
We propose spoken sentence embeddings which capture both acoustic and linguistic content.  ...  Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.  ...  Fig. 1: Audio-linguistic embedding for spoken sentences.  ... 
arXiv:1902.07817v1 fatcat:nsgpzkgaqvcpxieuqvdytewhcq

Measuring Depression Symptom Severity from Spoken Language and 3D Facial Expressions [article]

Albert Haque, Michelle Guo, Adam S Miner, Li Fei-Fei
2018 arXiv   pre-print
Our multi-modal method uses 3D facial expressions and spoken language, commonly available from modern cell phones.  ...  In this work, we present a machine learning method for measuring the severity of depressive symptoms.  ...  This work was supported by a National Institutes of Health, National Center for Advancing Translational Science, Clinical and Translational Science Award (KL2TR001083 and UL1TR001085).  ... 
arXiv:1811.08592v2 fatcat:axnidcyxi5gu7obx42p3xjiexi

Semantic sentence similarity: size does not always matter [article]

Danny Merkx, Stefan L. Frank, Mirjam Ernestus
2021 arXiv   pre-print
This study addresses the question whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge.  ...  We produce synthetic and natural spoken versions of a well-known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity  ...  The research presented here was funded by the Netherlands Organisation for Scientific Research (NWO) Gravitation Grant 024.001.006 to the Language in Interaction Consortium.  ... 
arXiv:2106.08648v1 fatcat:opu6kysygjhzfh2s4mgds3fmpe

Bilingual Prosodic Dataset Compilation for Spoken Language Translation

Alp Öktem, Mireia Farrús, Antonio Bonafonte
2018 IberSPEECH 2018  
The almost fully-automatized process serves for building data for training spoken language models without the need for designing and recording bilingual data.  ...  Both the extraction scripts and the dataset are distributed open-source for research purposes.  ...  Mkv files can hold multiple audio channels and subtitles embedded in them, like DVDs. In order to run our scripts, we first needed to extract the audio and subtitle pairs for both languages.  ... 
doi:10.21437/iberspeech.2018-5 dblp:conf/iberspeech/OktemFB18 fatcat:d534v5ktjnbs3ak3asmafzihwq

Representations of language in a model of visually grounded speech signal

Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi
2017 Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
We use a multi-layer recurrent highway network to model the temporal nature of the speech signal, and show that it learns to extract both form- and meaning-based linguistic knowledge from the input signal.  ...  We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space.  ...  Acknowledgements We would like to thank David Harwath for making the Flickr8k Audio Caption Corpus publicly available.  ... 
doi:10.18653/v1/p17-1057 dblp:conf/acl/ChrupalaGA17 fatcat:tlwabgi43bci3dxwitqqahskyi

CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [article]

Sameer Khurana, Antoine Laurent, James Glass
2020 arXiv   pre-print
Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.  ...  Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity.  ...  The input to the text network is the sequence of word embeddings that make up the sentence.  ... 
arXiv:2006.02814v2 fatcat:sz32yptl3beeffpkqona57mywi

Overview of the EVALITA 2018 Spoken Utterances Guiding Chef's Assistant Robots (SUGAR) Task [chapter]

Maria Di Maro, Antonio Origlia, Francesco Cutugno
2018 EVALITA Evaluation of NLP and Speech Tools for Italian  
The starting point will be therefore to provide authentic spoken data collected in a simulated natural context from which semantic predicates will be extracted to classify the actions to perform.  ...  Acknowledgments We thank the EVALITA 2018 organisers and the SUGAR participants for the interest expressed.  ...  For this purpose, a training corpus of annotated spoken commands was collected.  ... 
doi:10.4000/books.aaccademia.4523 fatcat:n2aq7535knaynmplqgh64nw6fi

Hybrid Attention based Multimodal Network for Spoken Language Classification

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, Ivan Marsic
2018 Association for Computational Linguistics (ACL). Annual Meeting Conference Proceedings  
We examine the utility of linguistic content and vocal characteristics for multimodal deep learning in human spoken language understanding.  ...  The proposed hybrid attention architecture helps the system focus on learning informative representations for both modality-specific feature extraction and model fusion.  ...  Acknowledgments We would like to thank the anonymous reviewers for their valuable comments and feedback. This research was funded by the National Institutes of Health under Award Number R01LM011834.  ... 
pmid:30410219 pmcid:PMC6217979 fatcat:jhg2k65gpnh5bp4s7tpxd6d7wa

Text Matters but Speech Influences: A Computational Analysis of Syntactic Ambiguity Resolution [article]

Won Ik Cho, Jeonghwa Cho, Woo Hyun Kang, Nam Soo Kim
2020 arXiv   pre-print
It is, at the same time, one of the most challenging issues for spoken language understanding (SLU) systems as well.  ...  Analyzing how human beings resolve syntactic ambiguity has long been an issue of interest in the field of linguistics.  ...  Acknowledgments This work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of  ... 
arXiv:1910.09275v3 fatcat:hv53kekik5fdribmzmavm4m4ay

Representation Mixing for TTS Synthesis [article]

Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville
2018 arXiv   pre-print
We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations  ...  However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases.  ...  LJSpeech consists of 13,100 audio files (comprising a total time of approximately 24 hours) of read English speech, spoken by Linda Johnson.  ... 
arXiv:1811.07240v2 fatcat:o5z3i7jfpvfd7molvk5jblradm

CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network [article]

Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, Rob Clark
2019 arXiv   pre-print
At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation.  ...  the prosody embedding of one sentence to generate the speech signal of another.  ...  We compare using an all-zero sentence prosody embedding to an embedding made by encoding the ground truth audio.  ... 
arXiv:1905.07195v2 fatcat:y2oclozlcfdlhoavmkaailxsqy

Are discrete units necessary for Spoken Language Modeling? [article]

Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux
2022 arXiv   pre-print
Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels.  ...  In this work, we show that discretization is indeed essential for good results in spoken language modeling, but that we can omit the discrete bottleneck if we use discrete target features from a higher  ...  Tomasello, Wei-Ning Hsu, Yossef Mordechay Adi, Abdelrahman Mohamed, Maureen de Seyssel, Marvin Lavechin, Robin Algayres, Xuan-Nga Cao, Nicolas Hamilakis, Hadrien Titeux, Gwendal Virlet, Marianne Metais for  ... 
arXiv:2203.05936v1 fatcat:n54zmhdxevb5bc6q3ufogiw6b4

A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization [article]

Graham Neubig, Shruti Rijhwani, Alexis Palmer, Jordan MacKenzie, Hilaria Cruz, Xinjian Li, Matthew Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati, Antonios Anastasopoulos (+12 others)
2020 arXiv   pre-print
In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge  ...  This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida,  ...  book in SJQ Chatino, that highlights words when spoken in the audio, and speaks words when they are clicked.  ... 
arXiv:2004.13203v1 fatcat:qslo5auwuzbfpdega3gjwir5ia

What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure [article]

Jui Shah, Yaman Kumar Singla, Changyou Chen, Rajiv Ratn Shah
2021 arXiv   pre-print
Moreover, the standard methodology is to choose the last-layer embedding for any downstream task, but is it the optimal choice?  ...  We try to answer these questions for the two recent audio transformer models, Mockingjay and wav2vec 2.0.  ...  Surface-level features measure the surface properties of sentences. No linguistic knowledge is required for these features.  ... 
arXiv:2101.00387v2 fatcat:pjjxforqf5ddjfwch6chtxy6rq

A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language [article]

S. Soroush Haj Zargarbashi, Bagher Babaali
2019 arXiv   pre-print
Result: This work designs a multi-modal feature embedding on the spoken-language audio signal using three approaches: N-gram, i-vector, and x-vector.  ...  We use three (statistical and neural) approaches to classify audio signals from spoken language into two classes of dementia and control.  ...  In this paper, a novel framework based on both acoustic and linguistic features of spoken language has been developed, which involves both statistical and neural feature embedding techniques and perplexity  ... 
arXiv:1910.00330v1 fatcat:gt4rvhzvtbci5h57gv7inwcjdy
Showing results 1 — 15 out of 10,568 results