
Temporal Bilinear Networks for Video Action Recognition [article]

Yanghao Li, Sijie Song, Yuqi Li, Jiaying Liu
2018 arXiv   pre-print
One dilemma of two-stream networks lies in the inefficient extraction of optical flow, especially for large-scale datasets (Kay et al. 2017) and practical applications.  ...  Following the bilinear models in image recognition (Lin, RoyChowdhury, and Maji 2015; Li et al. 2017), we define a generic temporal bilinear operation in deep neural networks as: y_c = x_i^T W^c x_{i+...}  ...
arXiv:1811.09974v1 fatcat:vvpgecnfezcarpv43d7vo7kpky
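The temporal bilinear operation quoted in the snippet can be written out directly; below is a minimal NumPy sketch, assuming per-output-channel weight matrices W^c and a temporal offset t (the snippet is truncated after x_{i+}, so the exact second index here is an assumption, not the paper's definition):

```python
import numpy as np

def temporal_bilinear(x, W, t=1):
    """Sketch of a generic temporal bilinear operation, y_c = x_i^T W^c x_{i+t}.

    x: (T, D) per-frame feature vectors; W: (C, D, D) one weight matrix per
    output channel; t: assumed temporal offset. Returns a (T - t, C) array.
    """
    T, D = x.shape
    C = W.shape[0]
    y = np.empty((T - t, C))
    for i in range(T - t):
        for c in range(C):
            # bilinear form pairing frame i with frame i + t
            y[i, c] = x[i] @ W[c] @ x[i + t]
    return y
```

In practice the double loop would be a single batched `einsum`; the loop form just mirrors the equation term by term.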

Masked Autoencoders As Spatiotemporal Learners [article]

Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
2022 arXiv   pre-print
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
arXiv:2205.09113v1 fatcat:l5vqo5paynh5pdaped5tsqq2we
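The spacetime-agnostic random masking described in the abstract is straightforward to sketch; below is a minimal NumPy version, where the flat token layout, function name, and return convention are illustrative assumptions rather than the paper's API:

```python
import numpy as np

def random_spacetime_masking(tokens, mask_ratio=0.9, rng=None):
    """Spacetime-agnostic random masking, sketched from the MAE-for-video recipe.

    tokens: (N, D) spacetime patch embeddings (N = T' * H' * W' patches).
    Keeps a random (1 - mask_ratio) subset; only these would be encoded.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    N = tokens.shape[0]
    n_keep = max(1, int(round(N * (1.0 - mask_ratio))))
    perm = rng.permutation(N)
    keep_idx = np.sort(perm[:n_keep])   # visible patches fed to the encoder
    mask = np.ones(N, dtype=bool)
    mask[keep_idx] = False              # True = masked, to be reconstructed
    return tokens[keep_idx], keep_idx, mask
```

With mask_ratio=0.9 the encoder processes only 10% of the tokens, which is where the reported wall-clock speedup comes from.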

Demystifying Neural Style Transfer [article]

Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou
2017 arXiv   pre-print
In [Li et al., 2017], Li et al. proposed a parameter-free deep adaptation method that simply modulates the statistics in all Batch Normalization (BN) layers.  ...  What makes this problem special is that we treat the feature at each position of the feature map as an individual data sample, instead of the traditional domain adaptation setting, in which we treat  ...
arXiv:1701.01036v2 fatcat:mxjuftjjonafxhwzvoyeua6mwi
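The parameter-free BN modulation the snippet refers to amounts to recomputing the normalization statistics on target-domain activations while keeping the learned affine parameters fixed. A minimal sketch, where shapes and names are assumptions for illustration:

```python
import numpy as np

def adapt_bn_statistics(target_feats, gamma, beta, eps=1e-5):
    """Parameter-free BN-style adaptation sketch: replace the source-domain
    running statistics with statistics computed on target-domain features,
    leaving the learned affine parameters (gamma, beta) untouched.

    target_feats: (N, C) target-domain activations for one BN layer.
    """
    mu = target_feats.mean(axis=0)   # target-domain per-channel mean
    var = target_feats.var(axis=0)   # target-domain per-channel variance
    normalized = (target_feats - mu) / np.sqrt(var + eps)
    return gamma * normalized + beta
```

No new parameters are learned, which is why the method is described as parameter-free.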

Temporal Network Representation Learning via Historical Neighborhoods Aggregation [article]

Shixun Huang, Zhifeng Bao, Guoliang Li, Yanghao Zhou, J.Shane Culpepper
2020 arXiv   pre-print
Guoliang Li was partially supported by the 973 Program of China (2015CB358700), NSFC (61632016, 61521002, 61661166012), Huawei, and TAL education.  ... 
arXiv:2003.13212v1 fatcat:me7k6zbln5cmhnxzoltalskzhu

Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [article]

Sijie Song, Jiaying Liu, Yanghao Li, Zongming Guo
2020 arXiv   pre-print
One main challenge for this task lies in how to effectively leverage their complementary information.  ... 
arXiv:2001.11657v1 fatcat:xcmed5yqx5bwrjpqzxsyguo7ga

Factorized Bilinear Models for Image Recognition [article]

Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou
2017 arXiv   pre-print
Related Work: The Tao of tuning the layer-wise capacity of a DNN lies in the balance between model complexity and computational efficiency.  ...
arXiv:1611.05709v2 fatcat:jmv36zeyhreojpm7yjth42qdm4

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [article]

Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman
2021 arXiv   pre-print
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
arXiv:2104.07905v1 fatcat:t5644cc6nfhcdnuwz5r2dp6aae
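The knowledge-distillation idea in the abstract, auxiliary losses that push third-person features toward egocentric-specific target signals, might be sketched as below; the squared-error form, the signal names, and the weights are all assumptions for illustration, not the paper's actual heads or targets:

```python
import numpy as np

def distillation_loss(student_feat, teacher_signals, weights):
    """Weighted sum of auxiliary regression losses, a generic sketch of using
    distillation targets during pre-training.

    student_feat: (D,) feature from the video model being pre-trained.
    teacher_signals: dict of name -> (D,) target signal (hypothetical names).
    weights: dict of name -> loss weight.
    """
    loss = 0.0
    for name, target in teacher_signals.items():
        # mean squared error against each egocentric-specific target signal
        loss += weights[name] * float(np.mean((student_feat - target) ** 2))
    return loss
```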

Exploring Plain Vision Transformer Backbones for Object Detection [article]

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
2022 arXiv   pre-print
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
arXiv:2203.16527v2 fatcat:va2cangsdvdqxaeyvl6nm5ka7e
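The "simple feature pyramid from a single-scale feature map" can be illustrated at the shape level with plain array ops; here nearest-neighbor upsampling and max-pooling stand in for the (de)convolutions ViTDet actually uses, so this is a sketch of the idea, not the paper's implementation:

```python
import numpy as np

def simple_feature_pyramid(feat):
    """Derive multi-scale maps from one single-scale backbone output.

    feat: (H, W, C) feature map, e.g. the stride-16 output of a plain ViT.
    Returns a dict keyed by stride: finer (stride 8) and coarser (stride 32)
    levels built directly from the single map, with no hierarchical FPN.
    """
    H, W, C = feat.shape
    up2 = feat.repeat(2, axis=0).repeat(2, axis=1)  # stride 8 (2x upsample)
    down2 = (feat[:H - H % 2, :W - W % 2]           # stride 32 (2x max-pool)
             .reshape(H // 2, 2, W // 2, 2, C)
             .max(axis=(1, 3)))
    return {8: up2, 16: feat, 32: down2}
```

The point mirrored from the abstract is that every level comes from the same single-scale map, so the backbone itself stays non-hierarchical.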

EGO-TOPO: Environment Affordances from Egocentric Video [article]

Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
2020 arXiv   pre-print
First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.
arXiv:2001.04583v2 fatcat:c2cud3xg55ehdnts54gcgzemai

Scale-Aware Trident Networks for Object Detection [article]

Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang
2019 arXiv   pre-print
Scale variation is one of the key challenges in object detection. In this work, we first present a controlled experiment to investigate the effect of receptive fields for scale variation in object detection. Based on the findings from the exploration experiments, we propose a novel Trident Network (TridentNet) aiming to generate scale-specific feature maps with a uniform representational power. We construct a parallel multi-branch architecture in which each branch shares the same transformation parameters but with different receptive fields. Then, we adopt a scale-aware training scheme to specialize each branch by sampling object instances of proper scales for training. As a bonus, a fast approximation version of TridentNet could achieve significant improvements without any additional parameters and computational cost compared with the vanilla detector. On the COCO dataset, our TridentNet with ResNet-101 backbone achieves state-of-the-art single-model results of 48.4 mAP. Codes are available at
arXiv:1901.01892v2 fatcat:mb5n2jo5zngoxbspmba4oydnha
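The weight-sharing multi-branch idea, one kernel applied at several dilation rates so each branch has a different receptive field but identical parameters, can be sketched in 1D; single-channel 1D convolution is a deliberate simplification of the paper's 2D detection backbone:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid-mode 1D correlation with a dilated kernel."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field of the kernel
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

def trident_branches(x, w, dilations=(1, 2, 3)):
    """Apply the SAME kernel w at several dilation rates: each branch sees a
    different receptive field while sharing all parameters, the core
    TridentNet trick sketched in 1D."""
    return {d: dilated_conv1d(x, w, d) for d in dilations}
```

Because the branches share w, adding branches adds receptive-field diversity without adding parameters, which is the property the abstract highlights.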

Negative Frames Matter in Egocentric Visual Query 2D Localization [article]

Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel Perez-Rua, Tao Xiang
2022 arXiv   pre-print
Participated Challenge: Visual Queries 2D Localization. Participants: Mengmeng Xu (KAUST), Juan-Manuel Pérez-Rúa (Facebook), Cheng-Yang Fu (Facebook), Yanghao Li (Facebook), Bernard Ghanem (KAUST), Tao Xiang (Facebook).  ...
arXiv:2208.01949v1 fatcat:jakkqu3wk5b3pdwp75evs44etq

Co-occurrence Feature Learning for Skeleton based Action Recognition using Regularized Deep LSTM Networks [article]

Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, Xiaohui Xie
2016 arXiv   pre-print
The key to this problem lies mainly in two aspects.  ... 
arXiv:1603.07772v1 fatcat:tujbkxlfmbc7tayerx5jq2sata


Yipeng Zhang, Zhifeng Bao, Songsong Mo, Yuchen Li, Yanghao Zhou
2019 Proceedings of the VLDB Endowment  
In this paper, we demonstrate an Intelligent Trajectory-driven outdoor Advertising deployment Assistant (ITAA), which assists users in finding an optimal strategy for outdoor advertising (ad) deployment. The challenge lies in how to measure the influence of ads on moving trajectories; optimizing the placement of ads among billboards to maximize this influence has been proven NP-hard. Therefore, we develop a framework based on two trajectory-driven influence models. ITAA is built upon this framework with a user-friendly UI. It serves both ad companies and their customers. We enhance interpretability to improve the user's understanding of the influence of ads. The interactive functions of ITAA are made interpretable and easy to engage with.
doi:10.14778/3352063.3352067 fatcat:kkjragvrozax7lvj7chrvbm5l4
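Since the abstract notes that influence-maximizing billboard placement is NP-hard, such systems typically rely on a greedy heuristic with an approximation guarantee for coverage-style objectives. The sketch below is a generic greedy coverage routine, not ITAA's actual algorithm, and every name in it is hypothetical:

```python
def greedy_billboard_selection(billboards, influenced, budget):
    """Greedy sketch for a coverage-style placement problem: repeatedly pick
    the billboard that influences the most not-yet-covered trajectories.

    billboards: list of billboard ids.
    influenced: function mapping a billboard id to the set of trajectory ids
                it reaches (hypothetical interface).
    budget: maximum number of billboards to place.
    """
    chosen, covered = [], set()
    pool = list(billboards)
    for _ in range(budget):
        # marginal gain = newly covered trajectories
        best = max(pool, key=lambda b: len(influenced(b) - covered), default=None)
        if best is None or not (influenced(best) - covered):
            break  # no billboard adds coverage
        chosen.append(best)
        covered |= influenced(best)
        pool.remove(best)
    return chosen, covered
```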

PyTorchVideo: A Deep Learning Library for Video Understanding [article]

Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick (+4 others)
2021 arXiv   pre-print
Li, Yilei Li, Zhengxing Chen, Zhicheng Yan.  ...  Christoph Feichtenhofer, Dave Schnizlein, Haoqi Fan, Heng Wang, Jackson Hamburger, Jitendra Malik, Kalyan Vasudev Alwala, Matt Feiszli, Nikhila Ravi, Ross Girshick, Tullie Murrell, Wan-Yen Lo, Weiyao Wang, Yanghao  ... 
arXiv:2111.09887v1 fatcat:fruhqcc65ffl3pl4zsghom27v4

Learning Model-Blind Temporal Denoisers without Ground Truths [article]

Yanghao Li, Bichuan Guo, Jiangtao Wen, Zhen Xia, Shan Liu, Yuxing Han
2021 arXiv   pre-print
If it is disabled, lighting variations l_{i-1}, l_i are set to 0. "od": online denoising. "wl": warping loss regularizer.  ...  Forward o_{i-1} and l_{i-1} are computed similarly with i-1 and i exchanged. Step 5: Construct the final input and target from Eqs. (10)-(12), crop the above quantities at the same spatial locations, and add them to  ...
arXiv:2007.03241v2 fatcat:o4cquubfpjhobgckwsyjwopmp4
Showing results 1 — 15 out of 112 results