Modeling Polyphone Context With Weighted Finite-State Transducers

E. Stoimenov, J. McDonough
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
As coarticulation effects are prevalent in all speech, a phone must be modeled in its context to achieve optimal performance in large vocabulary continuous speech recognition systems. Schuster and Hori [7] proposed a technique for modeling polyphone context with weighted finite-state transducers whereby all valid three-state sequences of Gaussian mixture models are enumerated, and thereafter the possible connections between these three-state sequences are determined. Hence, the explicit ... of all possible polyphones is avoided. Rather, Schuster and Hori derive a transducer HC that translates from sequences of Gaussian mixture models directly to phone sequences. The resulting network HC ∘ L ∘ G is much smaller than the conventional network H ∘ C ∘ L ∘ G proposed by Mohri et al. [6]. While Schuster and Hori's approach to modeling polyphone context is quite interesting, it is incorrect for contexts larger than triphones. In this work, we correct the errors of Schuster and Hori. Thereafter we discuss how the intermediate size of the network HC can be held in check. We also present the results of a set of experiments comparing network size and speech recognition performance for networks obtained with Schuster and Hori's technique and with the correct technique.
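The idea of enumerating distinct three-state Gaussian mixture model (GMM) sequences and then determining which sequences may follow one another can be illustrated with a toy triphone sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four-symbol phone set, the vowel/consonant context classes, the `gmm_for` function standing in for a state-clustering decision tree, and the helper names. It covers only the triphone case and does not reproduce the corrected algorithm for larger contexts.

```python
from itertools import product

# Hypothetical toy phone set and context classes (not from the paper).
PHONES = ["a", "b", "c", "d"]
VOWELS = {"a"}


def context_class(phone):
    """Return 'V' for vowels and 'C' for everything else."""
    return "V" if phone in VOWELS else "C"


def gmm_for(left, center, right, state):
    """Toy stand-in for a decision tree that ties HMM states to GMMs.

    State 0 depends on the left context class, state 2 on the right
    context class, and state 1 on the center phone only.
    """
    if state == 0:
        ctx = context_class(left)
    elif state == 2:
        ctx = context_class(right)
    else:
        ctx = "*"
    return f"{center}{state}_{ctx}"


def enumerate_sequences(phones):
    """Map each distinct three-state GMM sequence to the triphones it covers."""
    seqs = {}
    for left, center, right in product(phones, repeat=3):
        seq = tuple(gmm_for(left, center, right, s) for s in range(3))
        seqs.setdefault(seq, set()).add((left, center, right))
    return seqs


def connections(seqs):
    """Return pairs (a, b) where sequence b may directly follow sequence a.

    b may follow a if some triphone (l, c, r) covered by a and some
    triphone (l2, c2, r2) covered by b overlap consistently: l2 == c
    and c2 == r.
    """
    arcs = set()
    for a, tri_a in seqs.items():
        ends = {(c, r) for (_, c, r) in tri_a}
        for b, tri_b in seqs.items():
            starts = {(l, c) for (l, c, _) in tri_b}
            if ends & starts:
                arcs.add((a, b))
    return arcs


if __name__ == "__main__":
    seqs = enumerate_sequences(PHONES)
    arcs = connections(seqs)
    print(len(PHONES) ** 3, "triphones")
    print(len(seqs), "distinct three-state GMM sequences")
    print(len(arcs), "allowed connections between sequences")
```

In this toy setup, many triphones share the same three-state sequence, so the number of objects that must be connected is smaller than the number of triphones. That sharing is the intuition behind avoiding a full listing of polyphones.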
doi:10.1109/icassp.2006.1659972 dblp:conf/icassp/StoimenovM06 fatcat:x3ikdojazjaf5hcoalf6aui56u