A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between vision and linguistic domains. Existing methods independently extract the features of videos and sentences and purely utilize the sentence embedding in the multi-modal fusion stage, which do not make full use of the potential of language. In this paper, wearXiv:2006.10457v2 fatcat:eqnjjsvpvfc2darsrkayd3qvxm