On automatic voice casting for expressive speech: Speaker recognition vs. speech classification

Nicolas Obin, Axel Roebel, Gregoire Bachman
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper presents the first large-scale automatic voice casting system, and explores the adaptation of speaker recognition techniques to measure voice similarities. The proposed system is based on the representation of a voice by classes (e.g., age/gender, voice quality, emotion). First, a multi-label system is used to classify speech into classes. Then, the output probabilities for each class are concatenated to form a vector that represents the vocal signature of a speech recording. A similarity search is performed on the vocal signatures to determine the set of target actors that are the most similar to a speech recording of a source actor. In a subjective experiment conducted in the real context of voice casting for video games, the multi-label system clearly outperforms standard speaker recognition systems. This suggests that speech classes successfully capture the principal directions that are used in the perception of voice similarity.
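The pipeline described in the abstract (class probabilities concatenated into a vocal signature, then a similarity search over target actors) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is an assumption, and the function names, category names, actor names, and probability values are all hypothetical.

```python
import math

def vocal_signature(class_probs):
    """Concatenate per-category class probabilities into one flat vector.

    class_probs maps a category name (hypothetical, e.g. "emotion") to the
    list of output probabilities produced by that category's classifier.
    Categories are visited in sorted order so every signature shares a layout.
    """
    signature = []
    for category in sorted(class_probs):
        signature.extend(class_probs[category])
    return signature

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (an assumed choice of metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_targets(source_sig, target_sigs, top_k=2):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(source_sig, sig))
              for name, sig in target_sigs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Illustrative, made-up classifier outputs for one source and three targets.
source = vocal_signature({"age_gender": [0.8, 0.1, 0.1], "emotion": [0.7, 0.3]})
targets = {
    "actor_a": vocal_signature({"age_gender": [0.7, 0.2, 0.1], "emotion": [0.6, 0.4]}),
    "actor_b": vocal_signature({"age_gender": [0.1, 0.1, 0.8], "emotion": [0.2, 0.8]}),
    "actor_c": vocal_signature({"age_gender": [0.8, 0.1, 0.1], "emotion": [0.7, 0.3]}),
}
ranking = rank_targets(source, targets, top_k=2)
print(ranking)
```

With these made-up values, the target whose signature matches the source exactly ranks first, followed by the target with the closest class profile. A real system would replace the hand-written probabilities with the outputs of trained multi-label classifiers.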
doi:10.1109/icassp.2014.6853737 dblp:conf/icassp/ObinRB14 fatcat:wz3ydfi2dvhpfb4rkobxf33bi4