A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2022; you can also visit the original URL.
The file type is application/pdf.
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
[article]
2022
arXiv
pre-print
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to the above limitation. In this work, we for the first time investigate masked …
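The dual-encoder setup described in the abstract can be sketched roughly as follows: two independent encoders produce global embeddings, retrieval is scored by cosine similarity, and training contrasts matched pairs against in-batch negatives. This is a minimal NumPy illustration under those assumptions, not the paper's implementation; the toy embeddings and the symmetric InfoNCE-style loss are stand-ins for whatever encoders and objective the actual model uses.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings so the dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieval_scores(video_emb, text_emb):
    # Global representations from the two separate encoders are compared
    # by cosine similarity; scores[i, j] ranks text j against video i.
    return l2_normalize(video_emb) @ l2_normalize(text_emb).T

def info_nce_loss(scores, temperature=0.07):
    # Symmetric contrastive (InfoNCE-style) loss: the matched pair on the
    # diagonal is the positive; all other in-batch pairs are negatives.
    logits = scores / temperature
    log_sm_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2v = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    pos = np.arange(logits.shape[0])
    return -0.5 * (log_sm_v2t[pos, pos].mean() + log_sm_t2v[pos, pos].mean())

# Toy data: "text" embeddings are slight perturbations of the matched
# "video" embeddings, so the diagonal pairs should score highest.
rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 8))
texts = videos + 0.01 * rng.normal(size=(4, 8))
scores = retrieval_scores(videos, texts)
loss = info_nce_loss(scores)
```

Efficiency of this scheme comes from precomputing video embeddings offline; a query text then needs only one encoder pass plus a matrix product, which is what the "dual-encoder" design trades local detail for.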
arXiv:2204.12408v1
fatcat:hdl63xqcnvb6bj6zjvvv2rqdty