A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2022; you can also visit the original URL.
The file type is
Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency drawbacks: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), toarXiv:2209.13306v2 fatcat:kq7kvbuy65g2nordvgfbggrxby