YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension

Weiying Wang, Yongcheng Wang, Shizhe Chen, Qin Jin
2019 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)  
Multimodal semantic comprehension, including tasks such as visual question answering and caption generation, has attracted increasing research interest in recent years. However, due to data limitations, fine-grained semantic comprehension, which requires capturing the semantic details of multimodal content, has not been well investigated. In this work, we introduce "YouMakeup", a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain.
YouMakeup contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of natural language descriptions of instructional steps, grounded in temporal video ranges and spatial facial areas. The annotated steps in a video involve subtle differences in actions, products, and regions, which require fine-grained understanding and reasoning both temporally and spatially. To evaluate models' ability for fine-grained comprehension, we further propose two groups of tasks, generation tasks and visual question answering tasks, that probe it from different aspects. We also establish a baseline for step caption generation for future comparison. The dataset will be publicly available at https://github.com/AIM3-RUC/YouMakeup to support research in fine-grained semantic comprehension.
doi:10.18653/v1/d19-1517 dblp:conf/emnlp/WangWCJ19 fatcat:aqcsqykdufcnthev3i626eiupi