A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) videoarXiv:2106.04632v2 fatcat:zszcuqp6rjexjokioto5riwepy