Vision and Language Integration Meets Multimedia Fusion

Marie-Francine Moens, Katerina Pastra, Kate Saenko, Tinne Tuytelaars
Proceedings of the 2016 ACM on Multimedia Conference (MM '16), 2016
Multimodal information fusion, at both the signal and the semantics levels, is a core component of most multimedia applications, including multimedia indexing, retrieval, and summarization. Early and late fusion of modality-specific processing results has been addressed in multimedia prototypes since their very early days, through various methodologies including rule-based approaches, information-theoretic models, and machine learning. Vision and language are two of the predominant modalities being fused, and they have attracted special attention in international challenges with a long history of results, such as TRECVid and ImageCLEF. During the last decade, vision-language semantic integration has also attracted attention from traditionally non-interdisciplinary research communities, such as Computer Vision and Natural Language Processing, because one modality can greatly assist the processing of another by providing cues for disambiguation, complementary information, and noise/error filtering. The latest boom of deep learning methods has opened up new directions in the joint modelling of visual and co-occurring verbal information in multimedia discourse. The workshop on Vision and Language Integration Meets Multimedia Fusion was held during the workshop weekend of ACM Multimedia 2016.
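The distinction between early and late fusion mentioned above can be illustrated with a minimal sketch. The function names, toy feature vectors, and fusion weights here are illustrative assumptions, not from the paper: early fusion combines modality features before any decision is made, while late fusion combines per-modality decision scores.

```python
def early_fusion(visual_feats, text_feats):
    """Early fusion: concatenate modality-specific features into one
    joint vector, which a single downstream model would consume."""
    return visual_feats + text_feats


def late_fusion(visual_score, text_score, w_visual=0.6, w_text=0.4):
    """Late fusion: combine the outputs (e.g. classifier confidences)
    of separately trained per-modality models, here by weighted average.
    The weights are arbitrary for illustration."""
    return w_visual * visual_score + w_text * text_score


# Toy example: per-modality confidences that a video shot shows "cooking".
joint = early_fusion([0.2, 0.9], [0.7, 0.1])  # -> [0.2, 0.9, 0.7, 0.1]
fused = late_fusion(0.8, 0.5)                 # -> 0.68
```

Rule-based or information-theoretic combination schemes, as mentioned in the abstract, would replace the weighted average in `late_fusion` with a different combination rule.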
doi:10.1145/2964284.2980537 dblp:conf/mm/MoensPST16