Phoneme recognition using time-delay neural networks

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K.J. Lang
1989 IEEE Transactions on Acoustics, Speech, and Signal Processing
In this paper we present a Time-Delay Neural Network (TDNN) approach to phoneme recognition which is characterized by two important properties. 1) Using a 3-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces. The TDNN learns these decision surfaces automatically using error backpropagation [1]. 2) The time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independent of position in time and hence not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes "B," "D," and "G" in varying phonetic contexts was chosen. For comparison, several discrete Hidden Markov Models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5 percent correct while the rate obtained by the best of our HMM's was only 93.7 percent. Closer inspection reveals that the network "invented" well-known acoustic-phonetic features (e.g., F2-rise, F2-fall, vowel-onset) as useful abstractions. It also developed alternate internal representations to link different acoustic realizations to the same concept.
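The shift-invariance property described in the abstract can be illustrated with a minimal sketch. It assumes a single time-delay layer whose weights are shared across all time positions, followed by summation over time. The function name `time_delay_layer`, the layer sizes, the random weights, and the synthetic input are all illustrative assumptions and are not taken from the record above; this is not the authors' implementation.

```python
import numpy as np

def time_delay_layer(x, w, b):
    """Apply one time-delay layer (hypothetical helper, for illustration only).

    x: input of shape (n_features, n_frames)
    w: weights of shape (n_units, n_features, window), shared over time
    b: biases of shape (n_units,)
    Returns sigmoid activations of shape (n_units, n_frames - window + 1).
    """
    n_units, n_features, window = w.shape
    n_out = x.shape[1] - window + 1
    out = np.empty((n_units, n_out))
    for t in range(n_out):
        # Each unit sees the current frame plus (window - 1) delayed frames.
        patch = x[:, t:t + window]
        out[:, t] = np.tensordot(w, patch, axes=([1, 2], [0, 1])) + b
    return 1.0 / (1.0 + np.exp(-out))

# Assumed sizes: 16 input features, 15 frames, 4 units, window of 3 frames.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16, 3))
b = np.zeros(4)

# A synthetic 3-frame "event" placed at two different positions in time.
event = rng.normal(size=(16, 3))
x_early = np.zeros((16, 15))
x_late = np.zeros((16, 15))
x_early[:, 2:5] = event
x_late[:, 7:10] = event

h_early = time_delay_layer(x_early, w, b)
h_late = time_delay_layer(x_late, w, b)

# The hidden activations are the same pattern, shifted by 5 frames.
same_pattern_shifted = np.allclose(h_early[:, :8], h_late[:, 5:13])

# Integrating over time removes the dependence on position.
score_early = h_early.sum(axis=1)
score_late = h_late.sum(axis=1)
scores_match = np.allclose(score_early, score_late)
```

Because the same weights are applied at every time step, moving the event only moves the activation pattern, and the time-integrated scores agree as long as the event is not cut off at the edges of the input.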
doi:10.1109/29.21701 fatcat:galo4ikwzrgapiw4f3lpoa6bnq