
Optical Flow in Mostly Rigid Scenes [article]

Jonas Wulff, Laura Sevilla-Lara, Michael J. Black
2017 arXiv   pre-print
Sevilla et al. [34] perform semantic segmentation and use different models for different semantic classes.  ... 
arXiv:1705.01352v1 fatcat:qp7nfsubejd4ndbxb62hbftigm

SMART Frame Selection for Action Recognition [article]

Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara
2020 arXiv   pre-print
Additionally, Sevilla-Lara et al. (2019) show that many action classes in standard datasets do not require motion or temporal information to be identified.  ...  We use two subsets (Sevilla-Lara et al. 2019) of Kinetics that have been identified as containing mostly temporal information and mostly static information.  ... 
arXiv:2012.10671v1 fatcat:3jukutjt45akbdlwc22vnvjzr4

On the Integration of Optical Flow and Action Recognition [article]

Laura Sevilla-Lara, Yiyi Liao, Fatma Guney, Varun Jampani, Andreas Geiger, Michael J. Black
2017 arXiv   pre-print
Most of the top performing action recognition methods use optical flow as a "black box" input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: 1) optical flow is useful for action recognition because it is invariant to appearance, 2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance, 3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, 4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and 5) optical flow learned for the task of action recognition differs from traditional optical flow, especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
arXiv:1712.08416v1 fatcat:rgk2fahhorc37phnuxx7jglgfu
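
Observation 4 above (training optical flow to minimize classification error rather than EPE) amounts to backpropagating an action-classification loss through a differentiable flow network. A minimal PyTorch sketch under that reading, with toy stand-in modules rather than the networks used in the paper:

    import torch
    import torch.nn as nn

    class TinyFlowNet(nn.Module):
        """Toy stand-in for a differentiable flow estimator (frame pair -> flow)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(6, 2, kernel_size=3, padding=1)
        def forward(self, f1, f2):
            return self.net(torch.cat([f1, f2], dim=1))   # (B, 2, H, W)

    class TinyActionNet(nn.Module):
        """Toy classifier over a stack of flow fields."""
        def __init__(self, n_flows, num_classes=101):     # 101 = UCF101 classes
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2 * n_flows, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
        def forward(self, flows):
            return self.net(flows)

    B, T, H, W = 4, 5, 32, 32
    frames = torch.rand(B, T, 3, H, W)                    # toy video batch
    labels = torch.randint(0, 101, (B,))
    flow_net, action_net = TinyFlowNet(), TinyActionNet(T - 1)
    opt = torch.optim.Adam(
        list(flow_net.parameters()) + list(action_net.parameters()), lr=1e-4)

    flows = torch.cat([flow_net(frames[:, t], frames[:, t + 1])
                       for t in range(T - 1)], dim=1)     # (B, 2*(T-1), H, W)
    loss = nn.functional.cross_entropy(action_net(flows), labels)  # no EPE term
    opt.zero_grad(); loss.backward(); opt.step()          # gradients reach the flow net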

ALBA : Reinforcement Learning for Video Object Segmentation [article]

Shreyank N Gowda, Panagiotis Eustratiadis, Timothy Hospedales, Laura Sevilla-Lara
2020 arXiv   pre-print
We consider the challenging problem of zero-shot video object segmentation (VOS): segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We propose a network architecture for tractably performing proposal selection and joint grouping. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grouping decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and YouTube-VOS [27].
arXiv:2005.13039v2 fatcat:myb6tswtqzhbtkuj4vztedx2xu
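
Directly optimizing the non-differentiable overlap metric mentioned above is typically done with a score-function (REINFORCE) gradient; the paper's exact formulation may differ. A minimal sketch under that assumption, with random placeholder proposals and masks standing in for the real proposal pipeline:

    import torch
    import torch.nn as nn

    N, D, H, W = 8, 64, 32, 32
    feats = torch.rand(N, D)               # one feature vector per proposal (toy)
    masks = torch.rand(N, H, W) > 0.7      # binary proposal masks (toy)
    gt = torch.rand(H, W) > 0.5            # ground-truth segmentation (toy)

    policy = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    probs = torch.sigmoid(policy(feats)).squeeze(1)   # keep probability per proposal
    dist = torch.distributions.Bernoulli(probs)
    keep = dist.sample()                              # non-differentiable decision

    pred = (masks & keep.bool().view(N, 1, 1)).any(0) # union of kept proposal masks
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float().clamp(min=1)
    reward = inter / union                            # IoU: the evaluation metric itself

    loss = -(dist.log_prob(keep).sum() * reward)      # policy-gradient surrogate loss
    opt.zero_grad(); loss.backward(); opt.step()

The reward here is exactly the overlap metric used for evaluation, which is the point: no differentiable surrogate is needed.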

FASTER Recurrent Networks for Efficient Video Classification [article]

Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, Heng Wang
2019 arXiv   pre-print
Typical video classification methods often divide a video into short clips, do inference on each clip independently, then aggregate the clip-level predictions to generate the video-level results. However, processing visually similar clips independently ignores the temporal structure of the video sequence, and increases the computational cost at inference time. In this paper, we propose a novel framework named FASTER, i.e., Feature Aggregation for Spatio-TEmporal Redundancy. FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities. The FASTER framework can integrate high quality representations from expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. A new recurrent network (i.e., FAST-GRU) is designed to aggregate the mixture of different representations. Compared with existing approaches, FASTER can reduce the FLOPs by over 10× while maintaining the state-of-the-art accuracy across popular datasets, such as Kinetics, UCF-101 and HMDB-51.
arXiv:1906.04226v2 fatcat:45mfn6rmjrc55km56vxbrjbiyu
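
A rough sketch of the aggregation scheme described above, with a standard nn.GRU standing in for FAST-GRU, toy linear maps standing in for the expensive and cheap backbones, and an illustrative 1-in-4 expensive/cheap schedule:

    import torch
    import torch.nn as nn

    clips = torch.rand(8, 512)                 # 8 clips, pooled features (toy input)
    expensive = nn.Linear(512, 256)            # stands in for a heavy 3D CNN
    cheap = nn.Linear(512, 256)                # stands in for a light 2D CNN
    aggregator = nn.GRU(256, 256, batch_first=True)   # stand-in for FAST-GRU
    classifier = nn.Linear(256, 400)           # e.g. Kinetics-400 classes

    # Expensive features on every 4th clip, cheap features elsewhere.
    seq = torch.stack([expensive(c) if i % 4 == 0 else cheap(c)
                       for i, c in enumerate(clips)]).unsqueeze(0)  # (1, 8, 256)
    _, h = aggregator(seq)                     # recurrently fuse the clip features
    logits = classifier(h[-1])                 # single video-level prediction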

Optical Flow with Semantic Segmentation and Localized Layers [article]

Laura Sevilla-Lara, Deqing Sun, Varun Jampani, Michael J. Black
2016 arXiv   pre-print
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
arXiv:1603.03911v2 fatcat:hs7umy3kx5givjj4xtx3g3wwqa
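
A small numpy sketch of the per-class motion idea above: fit a parametric model (affine here; the paper also uses homographies for planar regions such as roads) to the flow inside a rigid-object region, while the vegetation region keeps its smooth non-parametric flow. All data below are toy placeholders:

    import numpy as np

    def fit_affine(xy, uv):
        """Least-squares affine motion: uv ~ [x, y, 1] @ A."""
        X = np.hstack([xy, np.ones((len(xy), 1))])      # (N, 3)
        A, *_ = np.linalg.lstsq(X, uv, rcond=None)      # (3, 2) affine parameters
        return X @ A                                    # fitted flow, (N, 2)

    H, W = 64, 64
    flow = np.random.randn(H, W, 2)                     # toy dense flow field
    labels = np.zeros((H, W), dtype=int)
    labels[32:] = 1                                     # 0 = vegetation, 1 = car

    ys, xs = np.nonzero(labels == 1)                    # pixels of the rigid region
    xy = np.stack([xs, ys], axis=1).astype(float)
    flow[ys, xs] = fit_affine(xy, flow[ys, xs])         # regularize the car's motion
    # the vegetation region (label 0) keeps its spatially smooth flow unchanged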

Optical Flow Estimation with Channel Constancy [chapter]

Laura Sevilla-Lara, Deqing Sun, Erik G. Learned-Miller, Michael J. Black
2014 Lecture Notes in Computer Science  
In object tracking, Sevilla-Lara and Learned-Miller [23] use a distribution over grayscale values at each pixel to create an object template that can be smoothed, to reach long displacements.  ... 
doi:10.1007/978-3-319-10590-1_28 fatcat:543p5qcyvfdczjiamxsmitnpvi

Optical Flow in Mostly Rigid Scenes

Jonas Wulff, Laura Sevilla-Lara, Michael J. Black
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Sevilla et al. [33] perform semantic segmentation and use different models for different semantic classes.  ... 
doi:10.1109/cvpr.2017.731 dblp:conf/cvpr/WulffSB17 fatcat:ldlx422wmzbwlnr54so3ltwgtq

Unsupervised Batch Normalization

Mustafa Taha Kocyigit, Laura Sevilla-Lara, Timothy M. Hospedales, Hakan Bilen
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Batch Normalization is a widely used tool in neural networks to improve the generalization and convergence of training. However, it cannot be applied effectively on small datasets, due to the difficulty of obtaining unbiased batch statistics. In some cases, even if only a small labeled dataset is available, there are larger unlabeled datasets from the same distribution. We propose using such unlabeled examples to calculate batch normalization statistics, which we call Unsupervised Batch Normalization (UBN). We show that using unlabeled examples for batch statistic calculations results in a reduction of the bias of the statistics, as well as regularization leveraging the data manifold. UBN is easy to implement, computationally inexpensive and can be applied to a variety of problems. We report results on monocular depth estimation, where obtaining dense labeled examples is difficult and expensive. Using unlabeled samples and UBN, we obtain an increase in accuracy of more than 6% on the KITTI dataset, compared to using traditional batch normalization only on the labeled samples.
doi:10.1109/cvprw50498.2020.00467 dblp:conf/cvpr/KocyigitSHB20 fatcat:32jtqxhmn5d4ze2y5cxxtbwefu
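
The core of UBN as described can be sketched in a few lines of PyTorch: compute per-channel statistics on a large unlabeled batch and normalize the small labeled batch with them. Shapes and data are illustrative placeholders:

    import torch
    import torch.nn.functional as F

    def ubn(labeled, unlabeled, weight, bias, eps=1e-5):
        # per-channel statistics from the unlabeled activations (B, C, H, W)
        mean = unlabeled.mean(dim=(0, 2, 3))
        var = unlabeled.var(dim=(0, 2, 3), unbiased=False)
        # training=False makes batch_norm use the supplied statistics as-is
        return F.batch_norm(labeled, mean, var, weight, bias,
                            training=False, eps=eps)

    C = 16
    labeled = torch.rand(2, C, 8, 8)        # tiny labeled batch -> noisy stats
    unlabeled = torch.rand(256, C, 8, 8)    # large unlabeled batch -> stable stats
    weight, bias = torch.ones(C), torch.zeros(C)
    out = ubn(labeled, unlabeled, weight, bias)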

Distribution Fields with Adaptive Kernels for Large Displacement Image Alignment

Benjamin Mears, Laura Sevilla-Lara, Erik Learned-Miller
2013 Proceedings of the British Machine Vision Conference
While region-based image alignment algorithms that use gradient descent can achieve sub-pixel accuracy when they converge, their convergence depends on the smoothness of the image intensity values. Image smoothness is often enforced through the use of multiscale approaches in which images are smoothed and downsampled. Yet these approaches typically use fixed smoothing parameters, which may be appropriate for some images but not for others. Even for a particular image, the optimal smoothing parameters may depend on the magnitude of the transformation. When the transformation is large, the image should be smoothed more than when the transformation is small. Further, with gradient-based approaches, the optimal smoothing parameters may change with each iteration as the algorithm proceeds towards convergence. We address convergence issues related to the choice of smoothing parameters by deriving a Gauss-Newton gradient descent algorithm based on distribution fields (DFs) and proposing a method to dynamically select smoothing parameters at each iteration. DF and DF-like representations have previously been used in the context of tracking. In this work we incorporate DFs into a full affine model for region-based alignment and simultaneously search over parameterized sets of geometric and photometric transforms. We use a probabilistic interpretation of DFs to select smoothing parameters at each step in the optimization and show that this results in improved convergence rates.
doi:10.5244/c.27.17 dblp:conf/bmvc/MearsSL13 fatcat:4vzytw2qlzf3hm6l54n7juu744
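
A minimal sketch of building a distribution field (DF) as used in these two papers: explode a grayscale image into one channel per intensity bin and smooth each channel spatially, which widens the basin of convergence for gradient-based alignment. The bin count and sigmas below are illustrative; selecting the smoothing adaptively at each iteration is the paper's contribution:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def distribution_field(img, bins=8, sigma=2.0):
        edges = np.linspace(0, 1, bins + 1)
        idx = np.clip(np.digitize(img, edges) - 1, 0, bins - 1)
        df = np.zeros((bins,) + img.shape)
        for b in range(bins):
            # indicator image for this intensity bin, spatially smoothed
            df[b] = gaussian_filter((idx == b).astype(float), sigma)
        return df                        # (bins, H, W): a distribution per pixel

    img = np.random.rand(48, 48)         # toy grayscale image in [0, 1]
    df_coarse = distribution_field(img, sigma=4.0)   # large motion: smooth more
    df_fine = distribution_field(img, sigma=1.0)     # near convergence: smooth less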

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition [article]

Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach
2021 arXiv   pre-print
Zero-shot action recognition is the task of recognizing action classes without visual examples, only with a semantic embedding which relates unseen to seen classes. The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explains their success as supervised models. However, in zero-shot learning, these highly specialized class boundaries may not transfer well from seen to unseen classes. In this paper we propose a centroid-based representation, which clusters visual and semantic representations, considers all training samples at once, and in this way generalizes well to instances from unseen classes. We optimize the clustering using Reinforcement Learning, which we show is critical for our approach to work. We call the proposed method CLASTER and observe that it consistently outperforms the state-of-the-art on all standard datasets, including UCF101, HMDB51 and Olympic Sports, in both the standard zero-shot evaluation and generalized zero-shot learning. Further, we show that our model performs competitively in the image domain as well, outperforming the state-of-the-art in many settings.
arXiv:2101.07042v2 fatcat:w5qvvnv3rjdotaqadf5k4v6fvq
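
Leaving aside the reinforcement-learning refinement, the centroid-based representation described above can be approximated with off-the-shelf k-means over concatenated visual and semantic features; everything below is a toy placeholder, not the paper's pipeline:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    vis = rng.normal(size=(500, 128))            # toy visual features
    sem = rng.normal(size=(500, 64))             # toy semantic embeddings
    joint = np.hstack([vis, sem])                # joint visual+semantic space

    # Centroids are shaped by all training samples at once.
    centroids = KMeans(n_clusters=10, n_init=10).fit(joint).cluster_centers_

    def claster_repr(x):
        """Represent a joint feature by its (negative) distances to all centroids."""
        return -np.linalg.norm(centroids - x, axis=1)

    rep = claster_repr(joint[0])                 # 10-dim centroid-based code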

A New Split for Evaluating True Zero-Shot Action Recognition [article]

Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach
2021 arXiv   pre-print
Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap with classes in the zero-shot evaluation datasets. As a result, classes which are supposed to be unseen are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was noted several years ago for image-based zero-shot recognition, but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZe splits are significantly harder than comparable random splits, as nothing leaks from pre-training, i.e. unseen performance is consistently lower, by up to 8.9% for zero-shot action recognition. In an additional evaluation we also find that similar issues exist in the splits used in few-shot action recognition, where we see differences of up to 17.1%. We publish our splits and hope that our benchmark analysis will change how the field evaluates zero- and few-shot action recognition moving forward.
arXiv:2107.13029v2 fatcat:dirm6mbfajardfxmi3fjgxkhfy
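
The leakage check motivating TruZe can be sketched as a filter over candidate unseen classes. Plain name matching below is a crude stand-in for the paper's matching procedure, and the class lists are tiny illustrative samples:

    # Before declaring a class "unseen", verify it does not overlap with the
    # pre-training vocabulary (e.g. Kinetics class names).
    kinetics = {"archery", "playing trumpet", "riding a bike", "surfing water"}
    candidates = ["archery", "playing trumpet", "fencing", "knitting"]

    def leaks(cls, pretrain_classes):
        cls = cls.lower()
        # exact match or substring match in either direction counts as overlap
        return any(cls == k or cls in k or k in cls for k in pretrain_classes)

    truly_unseen = [c for c in candidates if not leaks(c, kinetics)]
    print(truly_unseen)     # ['fencing', 'knitting'] -- safe to use as unseen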

Optical Flow with Semantic Segmentation and Localized Layers

Laura Sevilla-Lara, Deqing Sun, Varun Jampani, Michael J. Black
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Figure 1 (panels: (a) initial segmentation [9], (b) our segmentation, (c) DiscreteFlow [38], (d) semantic optical flow): (a) Semantic segmentation breaks the image into regions such as road, bike, person, sky, etc. (c) Existing optical flow algorithms do not have access to either the segmentations or the semantics of the classes. (d) Our semantic optical flow algorithm computes motion differently in different regions, depending on the semantic class label, resulting in more precise flow, particularly at object boundaries. (b) The flow also helps refine the segmentation of the foreground objects.
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
doi:10.1109/cvpr.2016.422 dblp:conf/cvpr/Sevilla-LaraSJB16 fatcat:e3vtev72uvgbznkkctfflwcy2i

Only Time Can Tell: Discovering Temporal Data for Temporal Modeling [article]

Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, Lorenzo Torresani
2019 arXiv   pre-print
Understanding temporal information and how the visual world changes over time is a fundamental ability of intelligent systems. In video understanding, temporal information is at the core of many current challenges, including compression, efficient inference, motion estimation or summarization. However, in current video datasets it has been observed that action classes can often be recognized without any temporal information, from a single frame of video. As a result, both benchmarking and training in these datasets may give an unintentional advantage to models with strong image understanding capabilities, as opposed to those with strong temporal understanding. In this paper we address this problem head on by identifying action classes where temporal information is actually necessary to recognize them, and call these "temporal classes". Selecting temporal classes using a computational method would bias the process. Instead, we propose a methodology based on a simple and effective human annotation experiment: we remove just the temporal information by shuffling frames in time and measure whether the action can still be recognized. Classes that cannot be recognized when frames are not in order are included in the temporal dataset. We observe that this set is statistically different from other, static classes, and that performance on it correlates with a network's ability to capture temporal information. Thus we use it as a benchmark for current popular networks, which reveals a series of interesting facts. We also explore the effect of training on the temporal dataset, and observe that this leads to better generalization to unseen classes, demonstrating the need for more temporal data. We hope that the proposed dataset of temporal categories will help guide future research in temporal modeling for better video understanding.
arXiv:1907.08340v2 fatcat:iqft6dmjbbc6nj3ghnq66e7e4a
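
The annotation probe described above is straightforward to reproduce: shuffle a clip's frames so that only temporal order is destroyed, leaving each frame intact, then test whether the action is still recognizable. A sketch with placeholder frame handles:

    import random

    def temporal_shuffle(frames, seed=None):
        """Return the same frames in a random order (temporal info removed)."""
        shuffled = list(frames)
        random.Random(seed).shuffle(shuffled)
        return shuffled

    clip = [f"frame_{i:03d}" for i in range(16)]   # placeholder frame handles
    probe = temporal_shuffle(clip, seed=0)
    # show `probe` to annotators; classes that stay recognizable are "static",
    # classes that do not are candidate "temporal classes"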

Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition [article]

Kiyoon Kim, Shreyank N Gowda, Oisin Mac Aodha, Laura Sevilla-Lara
2022 arXiv   pre-print
In contrast, TSN performs worse on datasets that require explicit temporal reasoning (Goyal et al. 2017; Sevilla-Lara et al. 2021). ... been reported that simply increasing the number of RGB frames does not necessarily improve performance or it is very marginal (Zhou et al. 2018), and can even deter performance (Gowda, Rohrbach, and Sevilla-Lara ...
arXiv:2201.10394v2 fatcat:4yjqtncwpjgfdfzd64432ibaja
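
One channel-sampling strategy in the spirit of this paper can be sketched directly: pack the grayscale of three consecutive frames into the R, G and B channels of a single image, so a 2D network sees short-range motion at no extra input cost. The BT.601 luma weights and toy frames are assumptions of the sketch, not necessarily the paper's exact recipe:

    import numpy as np

    def gray(frame):                      # frame: (H, W, 3) in [0, 1]
        # standard BT.601 luma conversion
        return frame @ np.array([0.299, 0.587, 0.114])

    def stack_temporal_channels(f0, f1, f2):
        # three time steps become the three channels of one "frame"
        return np.stack([gray(f0), gray(f1), gray(f2)], axis=-1)  # (H, W, 3)

    frames = [np.random.rand(32, 32, 3) for _ in range(3)]   # toy video frames
    packed = stack_temporal_channels(*frames)                # one image, 3 times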
Showing results 1–15 of 1,120.