6,486 Hits in 4.1 sec

2020 Index IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol. 28

2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
., +, TASLP 2020 157-170 Active Learning for Sound Event Detection.  ...  ., +, TASLP 2020 785-797 Probability Active Learning for Sound Event Detection.  ...  T Target tracking Multi-Hypothesis Square-Root Cubature Kalman Particle Filter for Speaker Tracking in Noisy and Reverberant Environments. Zhang, Q., +, TASLP 2020 1183-1197  ... 
doi:10.1109/taslp.2021.3055391 fatcat:7vmstynfqvaprgz6qy3ekinkt4

Table of Contents

2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
Huang 1198 Spectro-Temporal Sparsity Characterization for Dysarthric Speech Detection. I. Kodrasi and H.  ...  Amar 1143 Temporarily-Aware Context Modeling Using Generative Adversarial Networks for Speech Activity Detection. T.  ... 
doi:10.1109/taslp.2020.3046148 fatcat:hirdphjf6zeqdjzwnwlwlamtb4

Author Index

2010 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition  
Li, Ruonan Group Motion Segmentation Using a Spatio-Temporal Driving Force Model Li, Shuo Finding Image Distributions on Active Curves Graph Cut Segmentation with a Global Constraint: Recovering  ...  Optimizing One-Shot Recognition with Micro-Set Learning Tang, Lisa Workshop: Graph-based Tracking of Tongue Contour in Ultrasound Sequences with Adaptive Temporal Regularization Tang, Xiaoou Object  ... 
doi:10.1109/cvpr.2010.5539913 fatcat:y6m5knstrzfyfin6jzusc42p54

An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech [article]

Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu
2021 arXiv   pre-print
Thirdly, we present a module, which combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information.  ...  Recent supervised deep learning based methods suffer from noisy CS data annotations especially for hand shape modality.  ...  feature extraction model based on self-supervised contrastive learning and self-attention mechanism is proposed to model spatial and temporal features of CS hand shape. • Experimental results in multi-lingual  ... 
arXiv:2106.14016v1 fatcat:vjyb3ajrlncsteozmh7r6kejxi

An Attention-Enhanced Multi-Scale and Dual Sign Language Recognition Network Based on a Graph Convolution Network

Lu Meng, Ronghui Li
2021 Sensors  
MSA allows the GCN to learn the dependencies between long-distance vertices; MSSTA can directly learn the spatiotemporal features; ATCN allows the GCN network to better learn the long temporal dependencies  ...  We first extracted the skeleton data from them and then used the skeleton data for sign language recognition.  ...  to the Institute of Computing Technology, Chinese Academy (Yuecong Min, Xiujuan Chai, Lei Zhao and Xilin Chen) for their support of public dataset DEVISIGN-L.  ... 
doi:10.3390/s21041120 pmid:33562715 pmcid:PMC7915156 fatcat:ustoyetignannkwlezqqiobpqu

Automated Vision-Based Wellness Analysis for Elderly Care Centers [article]

Xijie Huang, Jeffry Wicaksana, Shichao Li, Kwang-Ting Cheng
2021 arXiv   pre-print
We then process and extract personalized facial, activity, and interaction features from the video data using deep neural networks.  ...  We also summarize technical challenges and additional functionalities and technologies needed for offering a comprehensive system.  ...  For talking detection, we adopt a facial landmark based solution and it surprisingly outperforms state-of-the-art multi-modal speaker detection model Active Speakers in Context (Alcázar et al. 2020).  ... 
arXiv:2112.10381v1 fatcat:x56vn3gmhvb2rcxy3wt2nw367m

Table of Contents

2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
Hu A Graph-to-Sequence Learning Framework for Summarizing Opinionated Texts. P. Wei, J. Zhao, and W.  ...  -R. Dai Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling.  ... 
doi:10.1109/taslp.2021.3137066 fatcat:ocit27xwlbagtjdyc652yws4xa

End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection [article]

Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans
2021 arXiv   pre-print
The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals.  ...  for the ASVspoof 2019 logical access database.  ...  RawGAT-ST model for anti-spoofing and speech deepfake detection In this section, we introduce the proposed raw GAT with spectro-temporal attention (RawGAT-ST) model.  ... 
arXiv:2107.12710v2 fatcat:eyzbz762dbehrdsgjlnt7676ya

The cognitive function and the framework of the functional hierarchy

Ulla Gain
2018 Applied Computing and Informatics  
These applications mimic the human brain functions, for example, recognize the speaker, sense the tone of the text. In this paper, we present the similarities of these with human cognitive functions.  ...  spatial mapping MS Speaker Recognition Table 3.  ...  The service can identify food, and colors, categorize and tag the image, detect the face, age, gender, and for the celebrities it returns the identity, knowledge graph and name.  ... 
doi:10.1016/j.aci.2018.03.003 fatcat:xwtvtur4wjhrzabtrt3nezyfem

DEEP-HEAR: A Multimodal Subtitle Positioning System Dedicated to Deaf and Hearing-Impaired People

Ruxandra Tapu, Bogdan Mocanu, Titus Zaharia
2019 IEEE Access  
INDEX TERMS Active speaker recognition, face recognition, dynamic subtitle positioning, convolutional neural networks, assistive framework for deaf and hearing impaired people. III.  ...  The proposed system exploits both computer vision algorithms and deep convolutional neural networks specifically designed and tuned in order to detect and recognize the identity of the active speaker.  ...  , tracking and recognition, video temporal segmentation, active speaker detection and recognition, background text detection and subtitle positioning.  ... 
doi:10.1109/access.2019.2925806 fatcat:yl6kc6vz6bdobcq4orynf7ig6i

Second-Language Learning Ability Revealed by Resting-State Functional Connectivity

Lydia Vinals
2016 Journal of Neuroscience  
For more information on the format and purpose of the Journal Club, please see http://www.jneurosci.org/misc/ifa_features.shtml.  ...  Over the past decade, the cognitive neuroscience of language has gradually shifted its focus from describing the function of specific brain regions to characterizing the spatial and temporal dynamics of  ...  classroom instruction and interactions with native speakers.  ... 
doi:10.1523/jneurosci.0917-16.2016 pmid:27277791 fatcat:univwq7rwbabvjn3jrhm222fsi

3D skeletal movement-enhanced emotion recognition networks

Jiaqi Shi, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro
2021 APSIPA Transactions on Signal and Information Processing  
We propose an attention-based convolutional neural network which takes the extracted data as input to predict the speakers' emotional state.  ...  We also propose a graph attention-based fusion method that combines our model with the models using other modalities, to provide complementary information in the emotion classification task and effectively  ...  Similarly, not all the temporal-spatial regions of skeletal motion data contribute equally to emotional states.  ... 
doi:10.1017/atsip.2021.11 fatcat:bnrelrqbxnhani7sjb44ljpbou

Speaker detection using the timing structure of lip motion and sound

Yu Horii, Hiroaki Kawashima, Takashi Matsuyama
2008 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops  
Our experimental result shows the effectiveness of using temporal relations of intervals for speaker detection.  ...  Based on the learned model, we realize speaker detection by evaluating the timing structure of the observed video and audio.  ...  Acknowledgement This study is supported by Grant-in-Aid for Scientific Research No.18049046 of the Ministry of Education, Culture, Sports, Science and Technology.  ... 
doi:10.1109/cvprw.2008.4563183 dblp:conf/cvpr/HoriiKM08 fatcat:s4jiyvmcobaozg3udxs7xo46za

Reading in two writing systems: Accommodation and assimilation of the brain's reading network

CHARLES A. PERFETTI, YING LIU, JULIE FIEZ, JESSICA NELSON, DONALD J. BOLGER, LI-HAI TAN
2007 Bilingualism: Language and Cognition  
occipital-temporal and also middle frontal areas when reading Chinese, similar to the pattern of native speakers and different from alphabetic reading.  ...  acquired for one system must be modified to accommodate the demands of a new system.  ...  The critical results for word perception areas are that English speakers learning Chinese showed only left fusiform activation for English-like stimuli, but bilateral fusiform activation when viewing Chinese-like  ... 
doi:10.1017/s1366728907002891 fatcat:usmvido4vza27cos7k3rpphvry

PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound [article]

Zhijian Yang, Xiaoran Fan, Volkan Isler, Hyun Soo Park
2021 arXiv   pre-print
Existing learning based approaches circumvent this issue by reconstructing the 3D pose up to scale.  ...  For example, we cannot measure the exact distance of a person to the camera from a single view image without additional scene assumptions (e.g., known height).  ...  [2], and spatio-temporal graph convolution is used to capture pose and time dependency [4, 9, 33].  ... 
arXiv:2112.00216v2 fatcat:mrvnv6omf5bo5gczot3e43uxli