8,128 Hits in 3.0 sec

Scaling Autoregressive Video Models [article]

Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
2020 arXiv   pre-print
In contrast, we show that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular  ...  We also present results from training our models on Kinetics, a large scale action recognition dataset comprised of YouTube videos exhibiting phenomena such as camera movement, complex object interactions  ...  against prior work on autoregressive video modeling.  ... 
arXiv:1906.02634v3 fatcat:eu3jxenc5jg3dfvfxin3hdpnm4
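As a quick annotation for readers skimming these hits: autoregressive video models like the one above factorize the joint distribution over pixels into an ordered product of conditionals, log p(x) = Σ_i log p(x_i | x_<i). A minimal pure-Python sketch of that chain-rule scoring, with a toy conditional rather than the paper's self-attention architecture:

```python
import math

def autoregressive_log_likelihood(tokens, conditional):
    """Chain-rule scoring: log p(x) = sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = conditional(tokens[:i])  # distribution over the vocabulary
        total += math.log(probs[tok])
    return total

# Toy conditional: uniform over a 4-symbol vocabulary, ignoring context.
def uniform(context):
    return [0.25, 0.25, 0.25, 0.25]

# Three factors, each 1/4, so log p = 3 * log(0.25).
ll = autoregressive_log_likelihood([0, 3, 1], uniform)
```

A real video model replaces `uniform` with a learned network conditioned on all previously generated pixels.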

Hierarchical Autoregressive Modeling for Neural Video Compression [article]

Ruihan Yang, Yibo Yang, Joseph Marino, Stephan Mandt
2021 arXiv   pre-print
We draw a connection between such autoregressive generative models and the task of lossy video compression.  ...  Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.  ...  VIDEO COMPRESSION THROUGH DEEP AUTOREGRESSIVE MODELING We identify commonalities between hierarchical autoregressive flow models (Marino et al., 2020) and state-of-the-art neural video compression architectures  ... 
arXiv:2010.10258v2 fatcat:76yes2d5qjfc5arzx37hpuqdbm

Insights from Generative Modeling for Neural Video Compression [article]

Ruihan Yang, Yibo Yang, Joseph Marino, Stephan Mandt
2021 arXiv   pre-print
In a similar spirit, we view recently proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling.  ...  Since our improvements are compatible with a large class of existing models, we provide further evidence that the generative modeling viewpoint can advance the neural video coding field.  ...  Second, we improve a popular model for neural video compression, Scale-Space Flow (SSF) [4].  ... 
arXiv:2107.13136v1 fatcat:mdx27avdzbabxayvhpbtbjx74a

Predicting Deeper into the Future of Semantic Segmentation

Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, Yann LeCun
2017 2017 IEEE International Conference on Computer Vision (ICCV)  
We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames.  ...  Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future.  ...  We compare different strategies: batch models, autoregressive models (AR), and models with autoregressive fine-tuning (AR fine-tune).  ... 
doi:10.1109/iccv.2017.77 dblp:conf/iccv/LucNCVL17 fatcat:qr524wy4sndzran22xhh4aytoy

Predicting Deeper into the Future of Semantic Segmentation [article]

Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, Yann LeCun
2017 arXiv   pre-print
We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames.  ...  Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future.  ...  In our work we do not explicitly model objects  ...  semantic video segmentation.  ... 
arXiv:1703.07684v3 fatcat:4v5rcjh5f5fjve2556d5deoefy

Axial Attention in Multidimensional Transformers [article]

Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans
2019 arXiv   pre-print
We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors.  ...  Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of  ...  Weissenborn et al. (2019) similarly scale video autoregressive models by restricting the context, again preventing their model from expressing all joint distributions over pixels.  ... 
arXiv:1912.12180v1 fatcat:hrydpt7n6jevvilykzo5zqtkry
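The compute argument behind axial attention in the entry above is easy to verify by counting attention-pair interactions: full self-attention over a T×H×W token grid scores N² pairs with N = T·H·W, while attending along one axis at a time scores only N·(T + H + W). A back-of-the-envelope sketch (the grid dimensions are illustrative, not from the paper):

```python
def attention_pairs_full(t, h, w):
    n = t * h * w
    return n * n  # every token attends to every token

def attention_pairs_axial(t, h, w):
    n = t * h * w
    # One pass per axis; each token attends only along that axis.
    return n * t + n * h + n * w

# A 16-frame 64x64 clip has 65,536 tokens.
full = attention_pairs_full(16, 64, 64)
axial = attention_pairs_axial(16, 64, 64)
ratio = full / axial  # full attention is hundreds of times costlier here
```

The gap widens with resolution, which is why restricting attention to axes (or to limited contexts, as the snippet notes for Weissenborn et al.) is what makes these models tractable.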

Predicting Video with VQVAE [article]

Jacob Walker, Ali Razavi, Aäron van den Oord
2021 arXiv   pre-print
Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video.  ...  With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables.  ...  “The Pose Knows: Video Forecasting by Generating Pose Futures.” In: ICCV. 2017. [57] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. “Scaling Autoregressive Video Models.”  ... 
arXiv:2103.01950v1 fatcat:fmbirgg4bnh25akw2nm2yfnxjm

VideoGPT: Video Generation using VQ-VAE and Transformers [article]

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas
2021 arXiv   pre-print
We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos.  ...  A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.  ...  For simplicity, ease of reproduction and presenting the first VQ-VAE based video generation model with minimal complexity, we stick with a single scale of discrete latents and transformers for the autoregressive  ... 
arXiv:2104.10157v2 fatcat:b6icaie7ljc6lj5zrl4cxdvcpy
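The VQ-VAE stage shared by VideoGPT and the "Predicting Video with VQVAE" entry above boils down to nearest-neighbour assignment against a codebook: each continuous feature vector is replaced by the index of its closest codebook entry, and the transformer then models those discrete indices. A hypothetical minimal sketch of the quantization step, with a hand-picked 2-D codebook instead of a learned one:

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance), i.e. the discrete latent token."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Toy codebook with three entries; a real VQ-VAE learns these jointly
# with the encoder and decoder.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
token = quantize((0.9, 0.1), codebook)  # nearest entry is index 1
```

The payoff noted in both abstracts is dimensionality: modeling a short grid of integer tokens autoregressively is far cheaper than modeling raw pixels.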

VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation [article]

Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, Durk Kingma
2020 arXiv   pre-print
Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive  ...  We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modelling of video.  ...  ACKNOWLEDGEMENTS We would like to thank Ryan Sepassi and Lukasz Kaiser for their extensive help in using Tensor2Tensor, Oscar Täckström for finding a bug in our evaluation pipeline that improved results across all models  ... 
arXiv:1903.01434v3 fatcat:xv75l3eewrajna4itq5kz5id64

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [article]

Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan
2021 arXiv   pre-print
To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.  ...  However, existing methods for lipreading mainly build on autoregressive (AR) models, which generate target tokens one by one and suffer from high inference latency.  ...  Following previous works [1, 2, 34], the input video frames are converted to grey scale and centrally cropped into 114 × 114 images.  ... 
arXiv:2008.02516v4 fatcat:23kdp4folndajlolj67kanw2ba

Parallel Multiscale Autoregressive Density Estimation [article]

Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas
2017 arXiv   pre-print
We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive  ...  density models that allow efficient sampling.  ...  These led to improved autoregressive models for video generation and machine translation.  ... 
arXiv:1703.03664v1 fatcat:h3nz2vwrkrgv3na5llqq2f632u

Predicting 3D Human Dynamics from Video [article]

Jason Y. Zhang, Panna Felsen, Angjoo Kanazawa, Jitendra Malik
2019 arXiv   pre-print
Inspired by the success of autoregressive models in language modeling tasks, we learn an intermediate latent space on which we predict the future.  ...  In this work, we present perhaps the first approach for predicting a future 3D mesh model sequence of a person from past video input.  ...  We use a weak-perspective camera model Π = [s, t x , t y ] that represents scale and translation.  ... 
arXiv:1908.04781v2 fatcat:kmpoy2mqcraujbtite6mnoje7m

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing [article]

Zhaofeng Shi
2021 arXiv   pre-print
Furthermore, the model adopts both mel-scale spectrograms and linear-scale spectrograms as the representation of the audio waveform. [3, 15] also proposed models for the lipreading task.  ...  SampleRNN [57] is another autoregressive model based on RNN.  ... 
arXiv:2108.00443v1 fatcat:5xkj7lf7pfgpppvfqwynoqkqjm

A Study on the Autoregressive and non-Autoregressive Multi-label Learning [article]

Elham J. Barezi, Iacer Calixto, Kyunghyun Cho, Pascale Fung
2020 arXiv   pre-print
Experimental results show that although autoregressive models, which use a given order of the labels for chain-order label prediction, work well for small-scale labels or the prediction of the  ...  We apply our models to four standard extreme classification natural language data sets, and one news videos dataset for automated label detection from a lexicon of semantic concepts.  ...  We propose a probabilistic latent-variable based model to utilize high-order label dependencies, and then provide a comprehensive comparison over the large-scale multi-label learning using high-order autoregressive  ... 
arXiv:2012.01711v1 fatcat:pcrdnjwrgfcrnpxkyopeopjv3m

Video Compression With Rate-Distortion Autoencoders [article]

Amirhossein Habibian, Ties van Rozendaal, Jakub M. Tomczak, Taco S. Cohen
2019 arXiv   pre-print
In this paper we present a deep generative model for lossy video compression.  ...  We employ a model that consists of a 3D autoencoder with a discrete latent space and an autoregressive prior used for entropy coding.  ...  While being powerful and flexible, this model scales rather poorly to larger videos, and can only be used for lossless compression.  ... 
arXiv:1908.05717v2 fatcat:ntnd7l5b3ze73ogdxcajck65cy
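The role of the autoregressive prior in entropy coding, as in the rate-distortion autoencoder above, is that an arithmetic coder can spend roughly -log2 p(symbol | context) bits per symbol, so a sharper prior means a smaller bitstream. A toy illustration of that ideal-rate estimate (the prior's probabilities are made up for the example):

```python
import math

def estimated_bits(symbols, prior):
    """Ideal code length under a probabilistic model:
    sum of -log2 p(s_i | s_<i), the rate an arithmetic coder approaches."""
    bits = 0.0
    for i, s in enumerate(symbols):
        bits += -math.log2(prior(symbols[:i])[s])
    return bits

# A prior that always predicts symbol 0 with probability 0.9,
# ignoring context; a real prior conditions on previous latents.
def skewed(context):
    return {0: 0.9, 1: 0.1}

cheap = estimated_bits([0, 0, 0, 0], skewed)   # stream matches the prior
costly = estimated_bits([1, 1, 1, 1], skewed)  # stream contradicts it
```

The cheap stream costs about 0.15 bits per symbol and the mismatched one over 3 bits per symbol, which is exactly why learning a better prior over the discrete latents improves compression rate.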
Showing results 1 — 15 out of 8,128 results