Object Category Recognition Using Probabilistic Fusion of Speech and Image Classifiers [chapter]

Kate Saenko, Trevor Darrell
Machine Learning for Multimodal Interaction  
Multimodal scene understanding is an integral part of human-robot interaction (HRI) in situated environments. Especially useful is category-level recognition, where the system can recognize classes of objects or scenes rather than specific instances (e.g., any chair vs. this particular chair). Humans use multiple modalities to understand which object category is being referred to, simultaneously interpreting gesture, speech and visual appearance, and using one modality to disambiguate the information contained in the others. In this paper, we address the problem of fusing visual and acoustic information to predict object categories, when an image of the object and speech input from the user are available to the HRI system. Using probabilistic decision fusion, we show improved classification rates on a dataset containing a wide variety of object categories, compared to using either modality alone.
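
As a rough illustration of what probabilistic decision fusion of two classifiers can look like, the Python sketch below combines per-category posteriors from a hypothetical speech classifier and a hypothetical image classifier using a weighted product rule. The abstract does not specify the chapter's actual fusion model, so this is not the authors' method; the function name `fuse_posteriors`, the weight `alpha`, and the example probabilities are all assumptions made for illustration.

```python
import math


def fuse_posteriors(p_speech, p_image, alpha=0.5):
    """Weighted product (log-linear) fusion of two posterior distributions.

    p_speech, p_image: dicts mapping category name -> probability.
    alpha: assumed weight on the speech modality, between 0 and 1.
    Returns a dict of fused probabilities that sums to 1.
    """
    if set(p_speech) != set(p_image):
        raise ValueError("both classifiers must score the same categories")
    eps = 1e-12  # guards against log(0)
    log_scores = {
        c: alpha * math.log(p_speech[c] + eps)
        + (1.0 - alpha) * math.log(p_image[c] + eps)
        for c in p_speech
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical posteriors: the spoken words "pen" and "pan" are acoustically
# confusable, while the image classifier clearly separates them.
p_speech = {"pen": 0.45, "pan": 0.50, "chair": 0.05}
p_image = {"pen": 0.70, "pan": 0.10, "chair": 0.20}

fused = fuse_posteriors(p_speech, p_image)
best = max(fused, key=fused.get)
```

With these made-up numbers, speech alone would pick "pan", but the fused decision is "pen", which is the kind of cross-modal disambiguation the abstract describes. In practice a weight such as `alpha` would be tuned on held-out data.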
doi:10.1007/978-3-540-78155-4_4 dblp:conf/mlmi/SaenkoD07 fatcat:7sc4sbqhibaj5bozehs7r4sdnm