Cascaded MPN: Cascaded Moment Proposal Network for Video Corpus Moment Retrieval
Video corpus moment retrieval aims to localize temporal moments corresponding to textual query in a large video corpus. Previous moment retrieval systems are largely grouped into two categories: (1) anchor-based method which presets a set of video segment proposals (via sliding window) and predicts proposal that best matches with the query, and (2) anchor-free method which directly predicts frame-level start-end time of the moment related to the query (via regression). Both methods have their
... n inherent weaknesses: (1) anchor-based method is vulnerable to heuristic rules of generating video proposals, which causes restrictive moment prediction in variant length; and (2) anchor-free method, as is based on framelevel interplay, suffers from insufficient understanding of contextual semantics from long and sequential video. To overcome the aforementioned challenges, our proposed Cascaded Moment Proposal Network incorporates the following two main properties: (1) Hierarchical Semantic Reasoning which provides video understanding from anchor-free level to anchor-based level via building hierarchical video graph, and (2) Cascaded Moment Proposal Generation which precisely performs moment retrieval via devising cascaded multi-modal feature interaction among anchor-free and anchor-based video semantics. Extensive experiments show state-of-the-art performance on three moment retrieval benchmarks (TVR, ActivityNet, DiDeMo), while qualitative analysis shows improved interpretability. The code will be made publicly available. INDEX TERMS Video corpus moment retrieval, cascaded moment proposal, multi-modal interaction, vision-language system.