Weak top-down constraints for unsupervised acoustic model training

Aren Jansen, Samuel Thomas, Hynek Hermansky
2013 IEEE International Conference on Acoustics, Speech and Signal Processing
Typical supervised acoustic model training relies on strong top-down constraints provided by dynamic programming alignment of the input observations to phonetic sequences derived from orthographic word transcripts and pronunciation dictionaries. This paper investigates a much weaker form of top-down supervision for use in place of transcripts and dictionaries in the zero resource setting. Our proposed constraints, which can be produced using recent spoken term discovery systems, come in the form of pairs of isolated word examples that share the same unknown type. For each pair, we perform a dynamic programming alignment of the acoustic observations of the two constituent examples, generating an inventory of cross-speaker frame pairs that each provide evidence that the same subword unit model should account for them. We find these weak top-down constraints are capable of improving model speaker independence by up to 57% relative over bottom-up training alone.
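
The pairwise alignment step described in the abstract can be illustrated with a minimal sketch. It assumes standard dynamic time warping with a Euclidean frame distance and unconstrained diagonal, vertical, and horizontal steps; the abstract does not state the features, distance, or path constraints actually used, and the function name `dtw_frame_pairs` is hypothetical.

```python
import math


def dtw_frame_pairs(a, b):
    """Align two sequences of feature vectors with plain DTW.

    Returns the list of (i, j) index pairs on the lowest-cost warping path,
    i.e. frame i of `a` is paired with frame j of `b`.
    Assumption: Euclidean frame distance and unconstrained step pattern.
    """
    n, m = len(a), len(b)

    def dist(x, y):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

    inf = float("inf")
    # cost[i][j] = cost of the best path aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + best_prev

    # Backtrace from the end of both sequences to recover the frame pairs.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


# Toy example: two "word examples" with one-dimensional frames.
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [1.0], [1.0], [2.0]]
pairs = dtw_frame_pairs(word_a, word_b)
```

Each returned pair is one item in the inventory of frame pairs: evidence that the two frames, spoken by different speakers, should be explained by the same subword unit model.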
doi:10.1109/icassp.2013.6639241 dblp:conf/icassp/JansenTH13 fatcat:64gexdb5nzgqpbcktz2tbacu4u