2002 International Journal of Pattern Recognition and Artificial Intelligence
This paper addresses the problem of speech segmentation and enhancement in the presence of noise. We first propose a new word boundary detection algorithm that uses a neural fuzzy network (the ATF-based SONFIN algorithm) to identify islands of word signals in a fixed noise-level environment. We further propose a new RTF-based RSONFIN algorithm for the case where the background noise level varies during recording. The adaptive time-frequency (ATF) and refined time-frequency (RTF)
parameters extend the TF parameter from single-band to multiband spectrum analysis and help to make the distinction between speech and noise signals clear. The ATF and RTF parameters can extract useful frequency information by adaptively choosing proper bands of the mel-scale frequency bank. Owing to the self-learning ability of SONFIN and RSONFIN, the proposed algorithms avoid the need to determine thresholds and ambiguous rules empirically. By processing temporal relations, the RTF-based RSONFIN algorithm can also find the variation of the background noise level and detect correct word boundaries when that level varies. Our experimental results show that, in both fixed and variable noise-level environments, the proposed algorithms achieved a higher recognition rate than several commonly used word boundary detection algorithms and reduced the recognition error rate due to endpoint detection.

[...] robust word boundary detection algorithms. 15,22-24 These algorithms usually use energy (in the time domain), zero-crossing rate and time duration to find the boundary between the word signal and background noise. It has been found that the energy and zero-crossing rate are not sufficient to obtain reliable word boundaries in noisy environments, even if more complex decision strategies are used. 13 In particular, the zero-crossing rate is very sensitive to additive noise. To date, several other parameters have been proposed, such as linear prediction coefficients (LPCs), linear prediction error energy 14,21 and pitch information. 7 Although the LPCs are quite successful in modeling vowels, 4 they are not particularly suitable for nasal sounds, fricatives, etc. The reliability of the LPC parameter depends on the noise environment. The pitch information can help to detect the word boundary, but it is not easy to extract the pitch period correctly in noisy environments. Four endpoint detection algorithms were compared in Ref.
13: an energy-based algorithm with automatic threshold adjustment, 15,23 use of pitch information, 7 a noise adaptive algorithm, and a voiced activation algorithm. The reliability of these four algorithms depends strongly on the noise condition. In this connection, Junqua et al. 13 proposed the time-frequency (TF) parameter. They used the frequency energy in the fixed frequency band 250-3500 Hz to enhance the time-energy information. Based on the TF parameter, a TF-based robust algorithm was proposed in Ref. 13, including noise classification, a refinement procedure and some preset thresholds. The TF-based robust algorithm requires empirically determined thresholds and ambiguous rules, which are not easy for humans to determine. Some researchers have used the learning ability of neural networks to solve this problem. In Refs. 5, 14 and 21, multilayer neural networks are used to classify the speech signal into voiced, unvoiced and silence segments. In the neural network approach, the decision rules take the form of input-output layer mappings and can be learned by the training procedure (supervised learning). However, the proper structure of the network (including the numbers of hidden layers and nodes) is not easy to decide.

Although the aforementioned TF-based algorithm outperforms several commonly used algorithms for word boundary detection in the presence of noise, for variable-level background noise it usually results in inaccurate detection of the beginning or ending boundaries in the recording interval. In the real world, the background noise level is not always fixed and may vary gradually over the recording interval. It is therefore not reasonable to keep these preset thresholds fixed over the recording interval. If the variation of the background noise level is large, these fixed preset thresholds will result in incorrect location of word boundaries.
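The frame-level quantities discussed above can be illustrated with a minimal Python sketch. It computes the short-time energy, the zero-crossing rate, and the spectral energy inside the 250-3500 Hz band for one frame, and then applies a single preset threshold. It is not the TF parameter of Ref. 13 nor the ATF or RTF parameter, whose exact definitions are not given in this excerpt; all function names (`tone`, `frame_signal`, `short_time_energy`, `zero_crossing_rate`, `band_energy`, `fixed_threshold_mask`), the frame sizes and the threshold value are assumptions made for illustration only.

```python
import math

def tone(freq, sample_rate, n, phase=0.0):
    """Hypothetical test signal: n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq * t / sample_rate + phase) for t in range(n)]

def frame_signal(x, frame_len, hop):
    """Split a sequence into overlapping frames (an incomplete tail is dropped)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame (time-domain energy)."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def band_energy(frame, sample_rate, f_lo=250.0, f_hi=3500.0):
    """Sum of |X[k]|^2 / N over DFT bins whose frequency lies in [f_lo, f_hi].

    A naive O(N^2) DFT is used so that the sketch needs no third-party packages.
    """
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(frame))
            im = -sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(frame))
            total += (re * re + im * im) / n
    return total

def fixed_threshold_mask(values, threshold):
    """Mark frames whose feature value exceeds one preset threshold."""
    return [v > threshold for v in values]

if __name__ == "__main__":
    sample_rate = 8000
    signal = tone(1000.0, sample_rate, 160, phase=0.1)
    for frame in frame_signal(signal, frame_len=80, hop=40):
        print(short_time_energy(frame),
              zero_crossing_rate(frame),
              band_energy(frame, sample_rate))
    # Per-frame energies with a noise floor that rises from 0.1 to 0.6:
    # the later noise-only frames also exceed the fixed threshold of 0.5.
    print(fixed_threshold_mask([0.1, 1.0, 0.1, 0.6, 1.5, 0.6], 0.5))
```

In the last call the noise-only frames at level 0.6 are marked in the same way as the word frames, which is the failure of fixed preset thresholds under a varying background noise level described above.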
The main aim of this paper is to develop a new robust word boundary detection algorithm to attack the problem in fixed- and variable-level background noise conditions. To avoid the problems of the above approaches, this paper first proposes a modified TF parameter and then uses a neural fuzzy network to detect word boundaries based on this parameter. By considering multiband analysis of noisy speech
Int. J. Patt. Recogn. Artif. Intell. 2002.16:927-955.
doi:10.1142/s0218001402002076 fatcat:rjgpg3mh5nhfvc5olbxyxeudu4