Speaker Recognition System Using Symbolic Modelling of Voiceprint
International Journal of Signal Processing, Image Processing and Pattern Recognition
The voice biometric trait is used in speaker recognition systems because it combines behavioral and physiological characteristics. This paper presents a symbolic inference system for text-dependent speaker recognition that explores the physiological characteristics embedded in the user's utterance. These characteristics also capture the user's behaviour. The symbolic data object is constructed from different voiceprint features, namely the inter-lexical pause position, complementary spectral
... es such as spectral entropy, spectral centroid and spectral flatness, pitch, loudness and formants. These features are explored in this work because the inter-lexical pause position reflects the articulation capability of the user's vocal tract. The spectral characteristics model the functional properties of the human ear, and the loudness feature provides the strength of the ear's perception. The relation between the physical and perceptual properties of sound is estimated through pitch, whereas formants provide the acoustic reverberation of the human vocal tract. The variability in the features of a speaker's utterance of words is represented with symbolic data. Speaker identification is performed using span, content and position symbolic similarity measures, modified for the current work. The proposed method is evaluated on 100 users from the voice corpus of the VTU-BEC-DB multimodal biometric database. The experimental results demonstrate an overall identification rate of 90.56% and show that the symbolic data representation of voice features provides better speaker recognition.
sounds. Spectral analysis measures the amount of acoustic energy present at different frequencies in a sound. Prosodic features are a measure of accent, intonation and stress. They are estimated by calculating the pitch, energy and duration information from the user's voice. The idiolectal (i.e. syntactical) features measure the way a speaker uses words, e.g. the repetition of the user's "favourite" words. The dialogic features extract the conversational patterns of a speaker. Semantics, pronunciation, diction and idiosyncrasy are learned traits associated with the education, socioeconomic status and birthplace of a user/speaker; they are also used for speaker recognition but are difficult to extract. Speaker recognition systems can be classified into text-dependent and text-independent, based on the text used in the testing phase.
Text-dependent systems are further divided into fixed-phrase and prompted-phrase systems. Fixed-phrase systems are trained on the phrase that is also used for testing. Prompted-phrase systems ask the user/claimant to utter a word sequence (phoneme sequence) not used in the training phase or in previous tests. In text-independent systems, the speech used for testing is unconstrained. In voice biometrics, the speaker voiceprint may vary due to changes in health, environmental conditions and additive noise. The features extracted from the speaker voiceprints in such situations therefore vary in nature. Symbolic object representation is employed to represent this variability in voiceprint features robustly during speaker recognition. Symbolic objects are extensions of classical data types, and real-world objects are better described with symbolic objects. The features extracted from real-world objects are usually represented by complex data. The knowledge embedded in the complex data is easily extracted by representing it in a symbolic data structure. Symbolic data appear in the form of continuous ratio, discrete, absolute, interval, probability distribution, random variable and multi-valued data. In pattern recognition, the variability inside classes of individuals is easily expressed by symbolic data. Symbolic objects offer a better alternative for organizing and summarizing abstract data. Symbolic objects are of three different types: assertion objects, hoard objects and synthetic objects. An assertion object is a conjunction of events pertaining to a given object. An event is a pair which links a feature variable and its feature values. A hoard object is a collection of one or more assertion objects, whereas a synthetic object is a collection of one or more hoard objects [6, 7]. In this work, voiceprints are represented as assertion symbolic objects.
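The three object types described above can be sketched as a small data structure. This is only an illustrative sketch in Python, not the paper's implementation: the class names, the feature names, the numeric values and the choice of summarizing repeated measurements as a [min, max] interval are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound of one feature


@dataclass
class AssertionObject:
    """Conjunction of events; each event links a feature variable to its values."""
    label: str
    events: Dict[str, Interval] = field(default_factory=dict)


@dataclass
class HoardObject:
    """Collection of one or more assertion objects."""
    assertions: List[AssertionObject] = field(default_factory=list)


@dataclass
class SyntheticObject:
    """Collection of one or more hoard objects."""
    hoards: List[HoardObject] = field(default_factory=list)


def build_assertion(label: str, samples: List[Dict[str, float]]) -> AssertionObject:
    """Summarize repeated measurements as one [min, max] interval per feature.

    Assumption: min/max aggregation is an illustrative choice, not the paper's.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    events: Dict[str, Interval] = {}
    for name in samples[0]:
        values = [sample[name] for sample in samples]
        events[name] = (min(values), max(values))
    return AssertionObject(label, events)


# Hypothetical feature values for three utterances of one phrase by one speaker.
utterances = [
    {"pitch_hz": 118.0, "spectral_centroid_hz": 1510.0},
    {"pitch_hz": 124.5, "spectral_centroid_hz": 1475.0},
    {"pitch_hz": 121.0, "spectral_centroid_hz": 1552.0},
]
voiceprint = build_assertion("speaker_01/twenty_one", utterances)
hoard = HoardObject([voiceprint])
print(voiceprint.events)
```

Storing each feature as an interval rather than a single number is what lets one object absorb the utterance-to-utterance variation of a speaker.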
This symbolic representation accommodates the variability in the features of the speaker voiceprint and is one of the novel contributions of the proposed speaker recognition system. In the proposed work, a text-dependent speaker recognition system is presented in which the speaker utterance is represented as a symbolic object. The object covers the features of the speaker's voice utterance: the inter-lexical pause position, complementary spectral features such as spectral entropy, spectral centroid and spectral flatness, pitch, loudness and formant frequencies. These features are employed in this work because the inter-lexical pause position reflects the articulation capability of the user's vocal tract. The spectral characteristics model the functional properties of the human ear, and the loudness feature provides the strength of the human ear's perception. The relation between the physical and perceptual properties of sound is estimated through pitch, whereas formants provide the acoustic reverberation of the human vocal tract. The intra-speaker variations in the features are captured in a symbolic data structure. This representation of the speaker utterance as symbolic objects is a novel technique used by the proposed system. The symbolic knowledge bases for the phrases of English number utterances, namely "Twenty One (21)" to "Twenty Nine (29)", are constructed separately for 100 users. Further, speaker identification is performed using span, content and position symbolic similarity measures adopted from [6, 7]. The experimentation is performed on the voice corpus of the VTU-BEC-DB multimodal biometric database. The experimental results show that the proposed method offers an overall correct identification rate of 90.56% for user recognition using the voice biometric trait. The rest of the paper is organized as follows: section 2 presents the recent developments in text-dependent speaker recognition approaches. 
Section 3 describes the proposed model of user identification using voice symbolic objects. The experimental results and analysis are provided in section 4. Finally, section 5 concludes the work and outlines future directions.
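To make the span, content and position similarity mentioned in the introduction more concrete, the sketch below scores two interval-valued features and picks the best-matching speaker. The modified measures used in this paper are not given in this section, so the formulas here are an assumption: they follow one common formulation for interval-valued symbolic data, with span similarity (l_a + l_b) / (2 l_s), content similarity overlap / l_s and position similarity 1 - |a_l - b_l| / U, where l_s is the length covered by both intervals and U is the length of the feature's domain. All names and numeric values are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (lower, upper)


def interval_similarity(a: Interval, b: Interval, domain_length: float) -> float:
    """Sum of span, content and position similarity for two intervals.

    Assumption: domain_length (> 0) is the length of the feature's full range.
    """
    (al, au), (bl, bu) = a, b
    span_length = max(au, bu) - min(al, bl)        # length covering both intervals
    overlap = max(0.0, min(au, bu) - max(al, bl))  # length of the intersection
    if span_length == 0.0:
        # Both intervals collapse to the same single point.
        s_span, s_content = 1.0, 1.0
    else:
        s_span = ((au - al) + (bu - bl)) / (2.0 * span_length)
        s_content = overlap / span_length
    s_position = 1.0 - abs(al - bl) / domain_length
    return s_span + s_content + s_position


def identify(test: Dict[str, Interval],
             knowledge_base: Dict[str, Dict[str, Interval]],
             domain_lengths: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the speaker whose stored object is most similar to the test object."""
    scores = {
        speaker: sum(
            interval_similarity(test[name], reference[name], domain_lengths[name])
            for name in test
        )
        for speaker, reference in knowledge_base.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical knowledge base for one phrase and two enrolled speakers.
knowledge_base = {
    "speaker_01": {"pitch_hz": (118.0, 124.5), "loudness_db": (60.0, 66.0)},
    "speaker_02": {"pitch_hz": (190.0, 205.0), "loudness_db": (55.0, 61.0)},
}
domain_lengths = {"pitch_hz": 300.0, "loudness_db": 60.0}
test_object = {"pitch_hz": (119.0, 123.0), "loudness_db": (61.0, 65.0)}

best, scores = identify(test_object, knowledge_base, domain_lengths)
print(best, scores)
```

Each feature contributes a score between 0 and 3, so identical intervals score highest and disjoint, distant intervals score lowest. Summing over features with equal weight is a simplification for the example.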