Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU)

Burak TOMBALOĞLU, Hamit ERDEM
2021 GAZI UNIVERSITY JOURNAL OF SCIENCE  
A typical Automatic Speech Recognition (ASR) system consists of feature extraction, feature classification, acoustic modeling and language modeling steps. In the classification and modeling steps, Deep Learning methods have become popular and give better recognition results than conventional methods. In this study, an ASR application for the Turkish language has been developed. The data sets and studies related to the Turkish ASR problem are examined.
... language models in the ASR problems of agglutinative language groups such as Turkish, Finnish and Hungarian are examined. A subword-based model is chosen in order to avoid a very large vocabulary without decreasing recognition performance. Recognition performance is increased by using the Deep Learning methods Long Short-Term Memory (LSTM) neural networks and Gated Recurrent Units (GRU) in the classification and acoustic modeling steps. The recognition performance of the LSTM- and GRU-based systems is compared with previous studies that use traditional methods and Deep Neural Networks. The results show that LSTM- and GRU-based speech recognizers perform better than recognizers built with the previous methods. Final Word Error Rate (WER) values of 10.65% and 11.25% were obtained for LSTM and GRU, respectively. GRU-based systems perform similarly to LSTM-based systems, but their training periods were observed to be shorter: computation times are 73.518 and 61.020 seconds for LSTM and GRU, respectively. The study gives detailed information about the applicability of the latest methods to Turkish ASR research and applications.

Keywords: Turkish, Speech recognition, LSTM, GRU

... the use of technology by everyone will become widespread and it will be easier for people with disabilities to meet their needs. An Automatic Speech Recognition (ASR) system basically translates speech into text. The system extracts features of speech and classifies the phonemes and word components. Application areas include call centers, security, gaming, support for people with disabilities, in-car systems, device control, robotics, dictation, mobile communication applications and home automation. Current speech recognition systems include feature extraction, acoustic model, Language Modeling (LM), vocabulary dictionary and classification sections. In order to recognize the words of sentences, the sound components that form the words must be modeled acoustically.
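The Word Error Rate quoted in the abstract is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring tool; the function name `word_error_rate` and the example sentences are hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("bu bir deneme cümlesidir", "bu bir deneme"))  # 0.25
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.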
Acoustic analysis is performed by Gaussian Mixture Models (GMM) and posterior probabilities are generated. Acoustic models are created using Hidden Markov Models (HMM) and, with the development of computers and advanced microprocessors in recent years, processed by Deep Learning methods. In this way, words or sentences can be predicted. The traditional recognition method in ASR applications is the combination of HMM and GMM. With the development of computer technology and the use of GPUs (Graphics Processing Units) for computing in recent years, Deep Learning has replaced GMM in ASR applications and provided significant performance increases. Classifiers within this scope can be grouped as GMM-HMM and Deep Neural Network (DNN)-HMM. DNN and GMM provide state information for HMM, representing each phonemic track. DNNs provide more state information to HMM, which better represents the differences between phonemes. Replacing GMM with DNN to estimate the probabilities of HMM states has been proposed by many researchers [1-6]. Various voice assistants, such as "Apple-Siri" and "Google Voice Transcription", are used on smart communication devices. These applications use a DNN to convert the acoustic pattern of the user's voice at each instant into a probability distribution over speech sounds. The ASR implementations of these applications run in the cloud. The cloud servers can provide large storage facilities and updates to the acoustic models used by the ASR [7-9].
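The role of the DNN described above, turning the acoustic pattern of one frame into a probability distribution over speech sounds that an HMM can consume, can be illustrated with a small sketch. It assumes the common hybrid DNN-HMM recipe in which softmax posteriors are divided by state priors to obtain scaled likelihoods; the text does not state that the cited systems use exactly this step, and the function names, logits and priors below are hypothetical illustration values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over the DNN outputs for one frame."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def frame_posteriors(logits):
    """P(state | frame): a probability distribution over HMM states."""
    return [math.exp(v) for v in log_softmax(logits)]


def scaled_log_likelihoods(logits, state_priors):
    """log P(state | frame) - log P(state), usable as HMM emission scores."""
    if len(logits) != len(state_priors):
        raise ValueError("need exactly one prior per output state")
    return [lp - math.log(prior)
            for lp, prior in zip(log_softmax(logits), state_priors)]


# Hypothetical network outputs for one frame and three HMM states,
# with priors as they might be estimated from state counts in training data.
logits = [2.0, 0.5, -1.0]
priors = [0.5, 0.3, 0.2]
print(frame_posteriors(logits))
print(scaled_log_likelihoods(logits, priors))
```

Dividing by the prior follows Bayes' rule up to a term that is constant for a given frame, which is why the result can stand in for the emission likelihood that a GMM would otherwise supply to the HMM.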
doi:10.35378/gujs.816499 fatcat:cwbp4d5hyzd7rifrbpw2rjfwka