Optical Flow in Mostly Rigid Scenes
[article]
2017
arXiv
pre-print
Sevilla et al. [34] perform semantic segmentation and use different models for different semantic classes. ...
arXiv:1705.01352v1
fatcat:qp7nfsubejd4ndbxb62hbftigm
SMART Frame Selection for Action Recognition
[article]
2020
arXiv
pre-print
Additionally, Sevilla-Lara et al. (2019) show that many action classes in standard datasets do not require motion or temporal information to be identified. ...
We use two subsets (Sevilla-Lara et al. 2019) of Kinetics that have been identified as containing mostly temporal information and mostly static information. ...
arXiv:2012.10671v1
fatcat:3jukutjt45akbdlwc22vnvjzr4
On the Integration of Optical Flow and Action Recognition
[article]
2017
arXiv
pre-print
Most of the top performing action recognition methods use optical flow as a "black box" input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: 1) optical flow is useful for action recognition because it is invariant to appearance, 2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance, 3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, 4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and 5) optical flow learned for the task of action recognition differs from traditional optical flow especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
arXiv:1712.08416v1
fatcat:rgk2fahhorc37phnuxx7jglgfu
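The abstract's central quantitative claim concerns end-point-error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal NumPy sketch of the metric (the array shapes are an assumption):

```python
import numpy as np

def average_epe(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """Average end-point error between predicted and ground-truth flow.

    Both arrays have shape (H, W, 2), holding the (u, v) displacement of
    every pixel. EPE is the Euclidean distance between the predicted and
    true displacement vectors, averaged over all pixels.
    """
    diff = flow_pred - flow_gt                    # per-pixel vector error
    epe_map = np.sqrt((diff ** 2).sum(axis=-1))   # per-pixel end-point error
    return float(epe_map.mean())
```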
ALBA: Reinforcement Learning for Video Object Segmentation
[article]
2020
arXiv
pre-print
We consider the challenging problem of zero-shot video object segmentation (VOS). That is, segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We propose a network architecture for tractably performing proposal selection and joint grouping. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grouping decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and YouTube-VOS [27].
arXiv:2005.13039v2
fatcat:myb6tswtqzhbtkuj4vztedx2xu
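The key training idea, optimizing a non-differentiable overlap metric with reinforcement learning, can be sketched with a plain REINFORCE surrogate. This is a generic policy-gradient stand-in under assumed tensor shapes, not the paper's exact estimator:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, iou_reward: float,
                   baseline: float = 0.0) -> torch.Tensor:
    """Policy-gradient surrogate for a non-differentiable IoU reward.

    log_probs: log-probabilities of the grouping decisions the policy
    took for one video (a 1-D tensor, one entry per decision).
    The (reward - baseline) advantage scales the summed log-probs, so
    gradient descent on this loss increases the probability of decision
    sequences that achieve a high overlap score.
    """
    advantage = iou_reward - baseline
    return -(advantage * log_probs.sum())
```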
FASTER Recurrent Networks for Efficient Video Classification
[article]
2019
arXiv
pre-print
Typical video classification methods often divide a video into short clips, do inference on each clip independently, then aggregate the clip-level predictions to generate the video-level results. However, processing visually similar clips independently ignores the temporal structure of the video sequence, and increases the computational cost at inference time. In this paper, we propose a novel framework named FASTER, i.e., Feature Aggregation for Spatio-TEmporal Redundancy. FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities. The FASTER framework can integrate high quality representations from expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. A new recurrent network (i.e., FAST-GRU) is designed to aggregate the mixture of different representations. Compared with existing approaches, FASTER can reduce the FLOPs by over 10× while maintaining the state-of-the-art accuracy across popular datasets, such as Kinetics, UCF-101 and HMDB-51.
arXiv:1906.04226v2
fatcat:45mfn6rmjrc55km56vxbrjbiyu
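The aggregation scheme, mixing a few expensive clip features with many cheap ones and fusing them recurrently, can be sketched with a standard torch.nn.GRU standing in for the paper's FAST-GRU cell. The class name, feature dimensions, and the 1-in-k schedule below are all assumptions:

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Recurrently aggregate clip-level features of mixed cost.

    Only every k-th clip uses the expensive model's feature; the rest
    come from the cheap model. A plain GRU fuses the sequence into a
    video-level prediction (a stand-in for the paper's FAST-GRU).
    """
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512,
                 num_classes: int = 400):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, expensive_feats, cheap_feats, k: int = 4):
        # expensive_feats: (B, ceil(n/k), D); cheap_feats: (B, n, D).
        n_clips = cheap_feats.shape[1]
        mixed = cheap_feats.clone()
        mixed[:, ::k] = expensive_feats[:, : (n_clips + k - 1) // k]
        _, h_last = self.gru(mixed)              # aggregate over time
        return self.classifier(h_last[-1])       # video-level logits
```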
Optical Flow with Semantic Segmentation and Localized Layers
[article]
2016
arXiv
pre-print
Existing optical flow methods make generic, spatially homogeneous, assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
arXiv:1603.03911v2
fatcat:hs7umy3kx5givjj4xtx3g3wwqa
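The per-class motion models are the concrete mechanism here: planar surfaces such as roads are well described by a homography, which can be fit from point correspondences inside a segment using the standard DLT construction. A self-contained sketch (the class-to-model table is illustrative, not the paper's exact taxonomy):

```python
import numpy as np

# Hypothetical per-class motion models, following the abstract's examples.
MOTION_MODEL = {
    "road": "homography",
    "vegetation": "smooth",
    "car": "affine_plus_deviations",
    "plane": "affine_plus_deviations",
}

def fit_homography(pts_src: np.ndarray, pts_dst: np.ndarray) -> np.ndarray:
    """Least-squares homography from >=4 matched points in one segment.

    pts_src/pts_dst: (N, 2) arrays of corresponding pixel coordinates.
    Solves the standard DLT system A h = 0 via SVD.
    """
    a_rows = []
    for (x, y), (u, v) in zip(pts_src, pts_dst):
        a_rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        a_rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(a_rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)
```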
Optical Flow Estimation with Channel Constancy
[chapter]
2014
Lecture Notes in Computer Science
In object tracking, Sevilla-Lara and Learned-Miller [23] use a distribution over grayscale values at each pixel to create an object template that can be smoothed to reach long displacements. ...
doi:10.1007/978-3-319-10590-1_28
fatcat:543p5qcyvfdczjiamxsmitnpvi
Optical Flow in Mostly Rigid Scenes
2017
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Sevilla et al. [33] perform semantic segmentation and use different models for different semantic classes. ...
doi:10.1109/cvpr.2017.731
dblp:conf/cvpr/WulffSB17
fatcat:ldlx422wmzbwlnr54so3ltwgtq
Unsupervised Batch Normalization
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Batch Normalization is a widely used tool in neural networks to improve the generalization and convergence of training. However, it cannot be applied effectively on small datasets due to the difficulty of obtaining unbiased batch statistics. In some cases, even if there is only a small labeled dataset available, there are larger unlabeled datasets from the same distribution. We propose using such unlabeled examples to calculate batch normalization statistics, which we call Unsupervised Batch Normalization (UBN). We show that using unlabeled examples for batch statistic calculations results in a reduction of the bias of the statistics, as well as regularization leveraging the data manifold. UBN is easy to implement, computationally inexpensive and can be applied to a variety of problems. We report results on monocular depth estimation, where obtaining dense labeled examples is difficult and expensive. Using unlabeled samples, and UBN, we obtain an increase in accuracy of more than 6% on the KITTI dataset, compared to using traditional batch normalization only on the labeled samples.
doi:10.1109/cvprw50498.2020.00467
dblp:conf/cvpr/KocyigitSHB20
fatcat:32jtqxhmn5d4ze2y5cxxtbwefu
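The mechanism is simple enough to sketch: compute the normalization statistics over the union of labeled and unlabeled activations, then normalize only the labeled ones. A minimal sketch for one feature layer (the (N, C) tensor shapes and parameter handling are assumptions):

```python
import torch

def unsupervised_batchnorm(labeled: torch.Tensor, unlabeled: torch.Tensor,
                           weight: torch.Tensor, bias: torch.Tensor,
                           eps: float = 1e-5) -> torch.Tensor:
    """Batch normalization whose statistics come from labeled AND
    unlabeled examples drawn from the same distribution.

    Only the labeled activations flow on to the loss, but the mean and
    variance are estimated over the combined batch, reducing their bias
    when the labeled set is small.
    """
    combined = torch.cat([labeled, unlabeled], dim=0)
    mean = combined.mean(dim=0)
    var = combined.var(dim=0, unbiased=False)
    # Normalize only the labeled activations with the shared statistics.
    normed = (labeled - mean) / torch.sqrt(var + eps)
    return normed * weight + bias
```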
Distribution Fields with Adaptive Kernels for Large Displacement Image Alignment
2013
Procedings of the British Machine Vision Conference 2013
While region-based image alignment algorithms that use gradient descent can achieve sub-pixel accuracy when they converge, their convergence depends on the smoothness of the image intensity values. Image smoothness is often enforced through the use of multiscale approaches in which images are smoothed and downsampled. Yet, these approaches typically use fixed smoothing parameters which may be appropriate for some images but not for others. Even for a particular image, the optimal smoothing parameters may depend on the magnitude of the transformation. When the transformation is large, the image should be smoothed more than when the transformation is small. Further, with gradient-based approaches, the optimal smoothing parameters may change with each iteration as the algorithm proceeds towards convergence. We address convergence issues related to the choice of smoothing parameters by deriving a Gauss-Newton gradient descent algorithm based on distribution fields (DFs) and proposing a method to dynamically select smoothing parameters at each iteration. DF and DF-like representations have previously been used in the context of tracking. In this work we incorporate DFs into a full affine model for region-based alignment and simultaneously search over parameterized sets of geometric and photometric transforms. We use a probabilistic interpretation of DFs to select smoothing parameters at each step in the optimization and show that this results in improved convergence rates.
doi:10.5244/c.27.17
dblp:conf/bmvc/MearsSL13
fatcat:4vzytw2qlzf3hm6l54n7juu744
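A distribution field itself is straightforward to construct: the image is "exploded" into a stack of intensity-bin channels, then smoothed spatially and across bins so each pixel carries a distribution over intensities. A sketch with illustrative bin counts and smoothing sigmas (the paper additionally selects the sigmas dynamically at each iteration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def distribution_field(gray: np.ndarray, n_bins: int = 16,
                       sigma_space: float = 2.0,
                       sigma_bins: float = 1.0) -> np.ndarray:
    """Explode a grayscale image (H, W) into a distribution field.

    Each pixel's intensity selects one of n_bins channels; the stack is
    then smoothed over space and over the bin dimension, so every pixel
    holds a smoothed distribution over intensity values.
    """
    h, w = gray.shape
    bins = np.minimum((gray.astype(np.float64) / 256.0 * n_bins).astype(int),
                      n_bins - 1)
    df = np.zeros((n_bins, h, w))
    df[bins, np.arange(h)[:, None], np.arange(w)[None, :]] = 1.0
    return gaussian_filter(df, sigma=(sigma_bins, sigma_space, sigma_space))
```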
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
[article]
2021
arXiv
pre-print
Zero-shot action recognition is the task of recognizing action classes without visual examples, only with a semantic embedding which relates unseen to seen classes. The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explains their success as supervised models. However, in zero-shot learning, these highly specialized class boundaries may not transfer well from seen to unseen classes. In this paper we propose a centroid-based representation, which clusters visual and semantic representation, considers all training samples at once, and in this way generalizes well to instances from unseen classes. We optimize the clustering using Reinforcement Learning which we show is critical for our approach to work. We call the proposed method CLASTER and observe that it consistently outperforms the state-of-the-art in all standard datasets, including UCF101, HMDB51 and Olympic Sports; both in the standard zero-shot evaluation and the generalized zero-shot learning. Further, we show that our model performs competitively in the image domain as well, outperforming the state-of-the-art in many settings.
arXiv:2101.07042v2
fatcat:w5qvvnv3rjdotaqadf5k4v6fvq
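At inference the centroid-based representation reduces to nearest-centroid assignment in the joint visual-semantic space; a toy sketch of that step (the RL optimization of the centroids is the paper's contribution and is omitted here):

```python
import numpy as np

def assign_to_centroids(features: np.ndarray,
                        centroids: np.ndarray) -> np.ndarray:
    """Assign each sample to its nearest centroid.

    features: (N, D) joint visual-semantic embeddings.
    centroids: (K, D) cluster centers.
    Returns the index of the closest centroid for each sample.
    """
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :],
                           axis=-1)            # (N, K) pairwise distances
    return dists.argmin(axis=1)
```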
A New Split for Evaluating True Zero-Shot Action Recognition
[article]
2021
arXiv
pre-print
Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap with classes in the zero-shot evaluation datasets. As a result, classes which are supposed to be unseen are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was previously noted several years ago for image-based zero-shot recognition but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZe splits are significantly harder than comparable random splits as nothing is leaking from pre-training, i.e. unseen performance is consistently lower, up to 8.9% for zero-shot action recognition. In an additional evaluation we also find that similar issues exist in the splits used in few-shot action recognition, where we see differences of up to 17.1%. We publish our splits and hope that our benchmark analysis will change how the field is evaluating zero- and few-shot action recognition moving forward.
arXiv:2107.13029v2
fatcat:dirm6mbfajardfxmi3fjgxkhfy
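The leakage the paper targets can be approximated with a crude class-name overlap check between the pre-training and evaluation label sets. The fuzzy-matching threshold below is an illustrative choice, not necessarily the paper's matching criterion:

```python
from difflib import SequenceMatcher

def overlapping_classes(pretrain_classes, eval_classes, threshold=0.8):
    """Flag evaluation classes that (nearly) match a pre-training class.

    Class names are compared with a fuzzy string ratio; matches above
    the threshold are treated as effectively 'seen' during pre-training.
    """
    leaked = {}
    for ev in eval_classes:
        for pre in pretrain_classes:
            if SequenceMatcher(None, ev.lower(), pre.lower()).ratio() >= threshold:
                leaked.setdefault(ev, []).append(pre)
    return leaked

# Usage: overlapping_classes(kinetics_class_names, ucf101_class_names)
```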
Optical Flow with Semantic Segmentation and Localized Layers
2016
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Figure 1: (a) Initial segmentation [9]: semantic segmentation breaks the image into regions such as road, bike, person, sky, etc. (b) Our segmentation: the flow also helps refine the segmentation of the foreground objects. (c) DiscreteFlow [38]: existing optical flow algorithms do not have access to either the segmentations or the semantics of the classes. (d) Semantic Optical Flow: our semantic optical flow algorithm computes motion differently in different regions, depending on the semantic class label, resulting in more precise flow, particularly at object boundaries.
Abstract: Existing optical flow methods make generic, spatially homogeneous, assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
doi:10.1109/cvpr.2016.422
dblp:conf/cvpr/Sevilla-LaraSJB16
fatcat:e3vtev72uvgbznkkctfflwcy2i
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
[article]
2019
arXiv
pre-print
Understanding temporal information and how the visual world changes over time is a fundamental ability of intelligent systems. In video understanding, temporal information is at the core of many current challenges, including compression, efficient inference, motion estimation or summarization. However, in current video datasets it has been observed that action classes can often be recognized without any temporal information from a single frame of video. As a result, both benchmarking and training in these datasets may give an unintentional advantage to models with strong image understanding capabilities, as opposed to those with strong temporal understanding. In this paper we address this problem head on by identifying action classes where temporal information is actually necessary to recognize them and call these "temporal classes". Selecting temporal classes using a computational method would bias the process. Instead, we propose a methodology based on a simple and effective human annotation experiment. We remove just the temporal information by shuffling frames in time and measure if the action can still be recognized. Classes that cannot be recognized when frames are not in order are included in the temporal dataset. We observe that this set is statistically different from other static classes, and that performance in it correlates with a network's ability to capture temporal information. Thus we use it as a benchmark on current popular networks, which reveals a series of interesting facts. We also explore the effect of training on the temporal dataset, and observe that this leads to better generalization in unseen classes, demonstrating the need for more temporal data. We hope that the proposed dataset of temporal categories will help guide future research in temporal modeling for better video understanding.
arXiv:1907.08340v2
fatcat:iqft6dmjbbc6nj3ghnq66e7e4a
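The annotation probe is easy to reproduce in code: shuffle a clip's frames so per-frame appearance is preserved but temporal order is destroyed. A minimal sketch:

```python
import random

def shuffle_frames(frames, seed=None):
    """Destroy temporal order while keeping per-frame appearance.

    Mirrors the paper's human annotation experiment: if an action is
    still recognizable after its clip's frames are shuffled in time,
    the class does not require temporal information. `frames` is any
    list of decoded frames; only their order changes.
    """
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled
```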
Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition
[article]
2022
arXiv
pre-print
In contrast, TSN performs worse on datasets that require explicit temporal reasoning (Goyal et al. 2017; Sevilla-Lara et al. 2021) . ...
... been reported that simply increasing the number of RGB frames does not necessarily improve performance or it is very marginal (Zhou et al. 2018), and can even deter performance (Gowda, Rohrbach, and Sevilla-Lara ...
arXiv:2201.10394v2
fatcat:4yjqtncwpjgfdfzd64432ibaja
Showing results 1 — 15 out of 1,120 results