Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection

Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu
2021 Interspeech 2021   unpublished
To make better use of frequency-domain features in the decoder, we propose using selection weights to select and fuse features from different domains and to unify the features used in the separator and decoder.  ...  Most deep learning-based monaural speech separation models use only spectrograms or only the time-domain speech signal as the input feature.  ...  Conclusion: In this paper, we propose several feature-selection encoders for time- and frequency-domain features in speech separation.  ... 
doi:10.21437/interspeech.2021-2246 fatcat:7sicb7khgjcp7llkaaptus7zxa
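
The selection-and-fusion idea in the snippet above can be illustrated with a minimal numpy sketch, assuming per-channel softmax selection weights over two encoder outputs; the shapes, the random features, and the per-channel weight granularity are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-channel selection logits (learned in a real model).
n_channels, n_frames = 64, 100
logits = np.random.randn(2, n_channels)            # one logit pair per channel
feat_time = np.random.randn(n_channels, n_frames)  # time-domain encoder output
feat_freq = np.random.randn(n_channels, n_frames)  # frequency-domain encoder output

w = softmax(logits, axis=0)                        # weights sum to 1 per channel
fused = w[0][:, None] * feat_time + w[1][:, None] * feat_freq
```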

DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction [article]

Jiangyu Han, Yanhua Long, Lukas Burget, Jan Cernocky
2022 arXiv   pre-print
In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated  ...  In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environment and to wide-domain-coverage tasks.  ...  Introduction: Speech separation (SS) aims to separate each source signal from mixed speech. Traditionally, it has been done in the time-frequency (T-F) domain [1][2][3].  ... 
arXiv:2112.13520v2 fatcat:exilswr2jfb4znstjkfeg4zek4

Improved Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering

Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee
2019 Interspeech 2019  
We construct a time-and-frequency feature map by concatenating a 1-dim convolution-encoded feature map (for the time domain) and the magnitude spectrogram (for the frequency domain), which is then processed by an embedding  ...  Speech separation has been very successful with deep learning techniques.  ...  Conclusions: In this paper, we propose to integrate the time- and frequency-domain features and perform cross-domain joint learning for speech separation.  ... 
doi:10.21437/interspeech.2019-2181 dblp:conf/interspeech/YangTLL19 fatcat:jtfcd4mtzzexbdlkqncbn7buxa
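
A minimal sketch of the cross-domain feature map described above, assuming a random 1-dim convolution basis stands in for the learned time-domain encoder and a Hann-windowed magnitude spectrogram supplies the frequency-domain branch; the window and hop sizes are illustrative choices, not the paper's settings.

```python
import numpy as np

def frames(x, win, hop):
    """Slice a signal into overlapping frames of length win with stride hop."""
    n = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]                                    # (n_frames, win)

x = np.random.randn(16000)                           # stand-in for 1 s of audio
win, hop = 256, 128
F = frames(x, win, hop)

# Time-domain branch: a random 1-dim conv "encoder" basis (learned in practice).
basis = np.random.randn(win, 64)
enc_time = np.maximum(F @ basis, 0.0)                # ReLU, (n_frames, 64)

# Frequency-domain branch: magnitude spectrogram over the same frames.
spec = np.abs(np.fft.rfft(F * np.hanning(win), axis=1))  # (n_frames, 129)

cross_domain = np.concatenate([enc_time, spec], axis=1)  # joint feature map
```

Because both branches share the same framing, the two feature maps align frame by frame and can be concatenated along the feature axis.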

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering [article]

Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee
2019 arXiv   pre-print
We construct a time-and-frequency feature map by concatenating the 1-dim convolution-encoded feature map (for the time domain) and the spectrogram (for the frequency domain), which is then processed by an embedding  ...  Substantial effort has been reported based on approaches over the spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals.  ...  Conclusions: In this paper, we propose to integrate the time- and frequency-domain features and perform cross-domain joint learning for speech separation.  ... 
arXiv:1904.07845v1 fatcat:d4qz6kyooza6xkwwrl3t7xg3li

Music tonality features for speech/music discrimination

Gregory Sell, Pascal Clark
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Even when trained on mismatched data, the new features perform well on their own and also combine with existing features for further improvement.  ...  We introduce a novel set of features for speech/music discrimination derived from chroma vectors, a feature that represents musical tonality.  ...  Evaluations were performed using 8-fold cross-validation with separate files in train and test datasets.  ... 
doi:10.1109/icassp.2014.6854048 dblp:conf/icassp/SellC14 fatcat:g22jq3n3jrh45fiisattpsp5ku
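
Chroma vectors fold spectral energy into the 12 pitch classes of the Western scale, which is why they capture musical tonality. A simple numpy sketch, assuming equal-tempered tuning around A4 = 440 Hz and plain nearest-pitch binning (real chroma implementations typically use smoother weighting):

```python
import numpy as np

def chroma(mag_spec, sr, n_fft):
    """Fold one FFT magnitude frame into 12 pitch classes (a toy chroma)."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    chroma_vec = np.zeros(12)
    for f, m in zip(freqs[1:], mag_spec[1:]):        # skip the DC bin
        pitch = 12 * np.log2(f / 440.0) + 69         # MIDI pitch of this bin
        chroma_vec[int(round(pitch)) % 12] += m
    return chroma_vec / (chroma_vec.sum() + 1e-12)

sr, n_fft = 16000, 512
frame = np.random.randn(n_fft)                       # stand-in for real audio
mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
print(chroma(mag, sr, n_fft))
```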

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation [article]

Jiangyu Han, Yanhua Long
2022 arXiv   pre-print
To address these problems, in this study we propose a novel separation consistency training, termed SCT, to exploit real-world unlabeled mixtures for improving cross-domain unsupervised speech separation  ...  Recently, supervised speech separation has made great progress.  ...  For example, the authors in [24] constructed a time-and-frequency feature map by concatenating both time-domain and time-frequency-domain acoustic features to improve separation performance.  ... 
arXiv:2204.11032v2 fatcat:mf2tfgczv5fjfa4kifledp4lli

SpEx: Multi-Scale Time Domain Speaker Extraction Network

Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
It is common to perform the extraction in the frequency domain and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.  ...  Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into  ...  In a frequency-domain implementation, an STFT module serves as the speech encoder that transforms the time-domain speech signal into a spectrum, with magnitude and phase, while an inverse STFT serves as the speech  ... 
doi:10.1109/taslp.2020.2987429 fatcat:xlsfk6ulufeb3cmxhbrhicnfza
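
The frequency-domain pipeline sketched in the snippet (STFT encoder, magnitude enhancement, phase reuse, inverse STFT) can be written compactly with scipy; the random signal and the random mask below are placeholders for real audio and a real mask estimator.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
mix = np.random.randn(fs)                            # stand-in for mixture audio

# STFT "encoder": complex spectrum split into magnitude and phase.
f, t, Z = stft(mix, fs=fs, nperseg=512, noverlap=256)
mag, phase = np.abs(Z), np.angle(Z)

# A trained mask estimator would predict this; a dummy mask stands in.
mask = np.clip(np.random.rand(*mag.shape), 0, 1)

# Enhance the magnitude, reuse the mixture phase, and iSTFT back to time domain.
Z_hat = (mask * mag) * np.exp(1j * phase)
_, est = istft(Z_hat, fs=fs, nperseg=512, noverlap=256)
```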

Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [article]

Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki
2020 arXiv   pre-print
First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation  ...  SpeakerBeam presents a practical alternative to speech separation, as it enables tracking the speech of a target speaker across utterances and achieves promising speech extraction performance.  ...  More recently, a convolutional time-domain audio separation network (Conv-TasNet) has been proposed and has led to great separation performance improvements, surpassing ideal time-frequency masking [4][5]  ... 
arXiv:2001.08378v1 fatcat:vomyxlidejadxhtdsju3lkx6s4
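
For contrast with the frequency-domain pipeline above, here is a toy numpy sketch of the Conv-TasNet-style time-domain approach the snippet mentions: a learnable 1-dim convolutional encoder replaces the STFT, a mask is applied in the learned latent space, and a learnable decoder with overlap-add replaces the inverse STFT. All bases and the mask are random stand-ins for trained components, and the kernel and stride sizes are illustrative.

```python
import numpy as np

L, N, hop = 40, 128, 20                              # kernel, basis size, stride
enc_basis = np.random.randn(N, L)                    # learned in a real model
dec_basis = np.random.randn(L, N)

x = np.random.randn(8000)                            # stand-in for mixture audio
n_frames = 1 + (len(x) - L) // hop
idx = np.arange(L)[None, :] + hop * np.arange(n_frames)[:, None]
W = np.maximum(x[idx] @ enc_basis.T, 0.0)            # (n_frames, N) latent rep

mask = 1 / (1 + np.exp(-np.random.randn(*W.shape)))  # stand-in separator output
frames_hat = (W * mask) @ dec_basis.T                # (n_frames, L)

# Overlap-add the decoded frames back into a waveform.
y = np.zeros(len(x))
for i, fr in enumerate(frames_hat):
    y[i * hop : i * hop + L] += fr
```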

End-to-End Multi-Channel Speech Separation [article]

Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu
2019 arXiv   pre-print
We demonstrate on the WSJ0 far-field speech separation task that, with the benefit of learnable spatial features, our proposed end-to-end multi-channel model significantly improved the performance of previous  ...  transform (STFT) and inter-channel phase difference (IPD) as a function of time-domain convolution with a special kernel. 3) We further relaxed those fixed kernels to be learnable, so that the entire architecture  ...  For cross-domain training, both LPS and IPDs serve as frequency-domain features. These features are extracted with a 32 ms window length and a 16 ms hop size, using 512 FFT points.  ... 
arXiv:1905.06286v2 fatcat:qyzimil3drbm3m5j3kkbzjutuy
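
The LPS and IPD features quoted above can be computed as follows; the 32 ms window, 16 ms hop, and 512 FFT points come from the snippet, while the 16 kHz sampling rate (which makes 32 ms equal 512 samples) and the two random channels are assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                           # assumed sampling rate
ch0 = np.random.randn(fs)                            # stand-ins for two
ch1 = np.random.randn(fs)                            # microphone channels

# 32 ms window, 16 ms hop, 512-point FFT (the settings quoted above).
kw = dict(fs=fs, nperseg=512, noverlap=256, nfft=512)
_, _, Z0 = stft(ch0, **kw)
_, _, Z1 = stft(ch1, **kw)

lps = np.log(np.abs(Z0) ** 2 + 1e-12)                # log power spectrum
ipd = np.angle(Z1) - np.angle(Z0)                    # inter-channel phase diff
ipd = np.angle(np.exp(1j * ipd))                     # wrap to (-pi, pi]
```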

A Joint Approach for Single-Channel Speaker Identification and Speech Separation

Pejman Mowlaee, Rahim Saeidi, Mads Græsbøll Christensen, Zheng-Hua Tan, Tomi Kinnunen, Pasi Franti, Søren Holdt Jensen
2012 IEEE Transactions on Audio, Speech, and Language Processing  
In this paper, we present a novel system for joint speaker identification and speech separation.  ...  For speech separation, we propose a sinusoidal model-based algorithm.  ...  Deliang Wang for his help in implementing and evaluating gammatone filterbank features and Dr. Jon Barker for his helpful discussion concerning the speech intelligibility test.  ... 
doi:10.1109/tasl.2012.2208627 fatcat:lzscqovuvvfobodqzxbjr5i724

Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation

Yi Luo, Nima Mesgarani
2021 Conference of the International Speech Communication Association  
A feature-level normalized cross correlation (fNCC) feature is also proposed to better match the implicit operation for improved performance.  ...  Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has proven effective in both ad-hoc and fixed microphone array geometries.  ...  Acknowledgments: This work was funded by a grant from the National Institutes of Health, NIDCD, DC014279, and a grant from Marie-Josée and Henry R. Kravis.  ... 
doi:10.21437/interspeech.2021-1158 dblp:conf/interspeech/LuoM21 fatcat:zc3aqyp77fg4nawjawrkjs4vhq

Emotion Recognition from Speech using Prosodic and Linguistic Features

Mahwish Pervaiz, Tamim Ahmed
2016 International Journal of Advanced Computer Science and Applications  
Taken separately, prosodic/temporal and linguistic features of speech do not provide results with adequate accuracy. We can also extract emotions from linguistic features if we can identify the contents.  ...  We extract emotions from word segmentation combined with linguistic features in the second step.  ...  In the time domain, volume and ZCR with a high-order difference are used, and in the frequency domain, the variance and entropy of the spectrum are used for endpoint detection.  ... 
doi:10.14569/ijacsa.2016.070813 fatcat:lp5bqyxjezbx7cgtzfer7b7kry
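
A toy endpoint detector in the spirit of the time-domain cues mentioned above (frame energy plus zero-crossing rate); the thresholds and frame sizes are arbitrary illustrative choices, and the paper's frequency-domain cues (spectral variance and entropy) are omitted for brevity.

```python
import numpy as np

def endpoint_detect(x, win=400, hop=160, e_thr=0.01, z_thr=0.3):
    """Mark frames as speech when energy is high and ZCR is low (a toy rule)."""
    n = 1 + (len(x) - win) // hop
    speech = np.zeros(n, dtype=bool)
    for i in range(n):
        fr = x[i * hop : i * hop + win]
        energy = np.mean(fr ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(fr)))) / 2  # crossings per sample
        speech[i] = (energy > e_thr) and (zcr < z_thr)
    return speech

# Silence followed by a 200 Hz tone: only the tone frames should flag as speech.
x = np.concatenate([0.001 * np.random.randn(4000),
                    np.sin(2 * np.pi * 200 * np.arange(4000) / 16000)])
print(endpoint_detect(x))
```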

Implicit Filter-and-sum Network for Multi-channel Speech Separation [article]

Yi Luo, Nima Mesgarani
2020 arXiv   pre-print
Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has proven effective in both ad-hoc and fixed microphone array geometries.  ...  From the feature extraction perspective, we modify the calculation of sample-level normalized cross correlation (NCC) features into feature-level NCC (fNCC) features.  ...  Not only are the existing neural beamformers mainly designed in the time-frequency domain, but the time-domain systems also utilize a learnable latent space for better signal representations and separation  ... 
arXiv:2011.08401v1 fatcat:wtnhqarcozfs7mwidl5tiafye4
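
The sample-level versus feature-level NCC distinction can be sketched as follows; the random encoder basis is a hypothetical stand-in for FaSNet's learned front-end, and the shapes are illustrative.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

ref = np.random.randn(4000)                          # reference microphone
other = np.roll(ref, 8) + 0.1 * np.random.randn(4000)  # delayed noisy copy

# Sample-level NCC: computed directly on the waveforms.
print(ncc(ref, other))

# Feature-level NCC (fNCC-style): computed on encoded frames instead.
basis = np.random.randn(64, 200)                     # hypothetical encoder
encode = lambda x: np.maximum(x.reshape(-1, 200) @ basis.T, 0.0)
print(ncc(encode(ref).ravel(), encode(other).ravel()))
```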

Use of bimodal coherence to resolve the permutation problem in convolutive BSS

Qingju Liu, Wenwu Wang, Philip Jackson
2012 Signal Processing  
To improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features.  ...  Then with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain.  ...  The authors would like to thank the anonymous reviewers and the guest editors for their insightful comments that considerably improved the quality of this paper.  ... 
doi:10.1016/j.sigpro.2011.11.007 fatcat:pq455md5ajbcheto7duulg5avq
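
The permutation problem arises because frequency-domain BSS separates each frequency bin independently, so the source order can differ from bin to bin. Below is a simple numpy sketch of one classical remedy, aligning bins by maximizing envelope correlation against a smoothed reference; note the paper's bimodal-coherence method is different, and this only illustrates the underlying problem.

```python
import numpy as np
from itertools import permutations

def align_permutations(S):
    """Greedy bin-wise permutation alignment by envelope correlation.
    S: (n_freq, n_src, n_frames) separated complex spectra."""
    n_freq, n_src, _ = S.shape
    env = np.abs(S)
    ref = env[0].copy()                              # running reference envelope
    out = S.copy()
    for f in range(1, n_freq):
        best, best_score = None, -np.inf
        for p in permutations(range(n_src)):
            score = sum(np.corrcoef(ref[i], env[f, p[i]])[0, 1]
                        for i in range(n_src))
            if score > best_score:
                best, best_score = p, score
        out[f] = S[f, list(best)]
        ref = 0.9 * ref + 0.1 * np.abs(out[f])       # smooth reference update
    return out

S = np.random.randn(64, 2, 50) + 1j * np.random.randn(64, 2, 50)
aligned = align_permutations(S)
```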

A Feature Study for Classification-Based Speech Separation at Low Signal-to-Noise Ratios

Jitong Chen, Yuxuan Wang, DeLiang Wang
2014 IEEE/ACM Transactions on Audio Speech and Language Processing  
In classification-based speech separation, supervised learning is employed to classify time-frequency units as either speech-dominant or noise-dominant.  ...  In this study, we systematically evaluate a range of promising features for classification-based separation using six nonstationary noises at a low SNR level, chosen with the goal of improving  ...  This very low SNR level is selected with the goal of improving speech intelligibility in mind.  ... 
doi:10.1109/taslp.2014.2359159 fatcat:6w5g56cezrfhhi4cmtqtv42fny
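
Training targets for such a classifier are commonly ideal binary mask (IBM) labels, marking a T-F unit speech-dominant when its local SNR exceeds a criterion. A minimal sketch, assuming the premixed speech and noise are available and using a 0 dB local criterion (the threshold here is an illustrative choice):

```python
import numpy as np
from scipy.signal import stft

fs = 16000
speech = np.random.randn(fs)                         # stand-ins for the premixed
noise = np.random.randn(fs)                          # speech and noise signals

_, _, S = stft(speech, fs=fs, nperseg=512, noverlap=256)
_, _, N = stft(noise, fs=fs, nperseg=512, noverlap=256)

# Label a T-F unit speech-dominant when its local SNR exceeds the criterion.
local_snr_db = 10 * np.log10((np.abs(S) ** 2) / (np.abs(N) ** 2 + 1e-12) + 1e-12)
ibm = (local_snr_db > 0.0).astype(np.float32)        # ideal binary mask labels
```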