
Optimal memory-aware backpropagation of deep join networks

Olivier Beaumont, Julien Herrmann, Guillaume Pallez (Aupy), Alena Shilova
2020 Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences  
In this work, we propose to use techniques from memory-aware scheduling and automatic differentiation (AD) to execute a backpropagation graph with a bounded memory requirement at the cost of extra recomputations  ...  This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.  ...  All authors participated to the model, derivations of theoretical results and writing of the manuscript. JH implemented the main algorithm, AS and GP performed the evaluations.  ... 
doi:10.1098/rsta.2019.0049 pmid:31955681 pmcid:PMC7015292 fatcat:rhugm5fgp5cwpawfxwsomu3m3m

The Reversible Residual Network: Backpropagation Without Storing Activations [article]

Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse
2017 arXiv   pre-print
Therefore, the activations for most layers need not be stored in memory during backpropagation.  ...  However, memory consumption becomes a bottleneck, as one needs to store the activations in order to calculate gradients using backpropagation.  ...  The full theoretical efficiency can be realized by reusing the F and G graphs' activations that were computed in the reconstruction steps (lines 3 and 4 of Algorithm 1).  ... 
arXiv:1707.04585v1 fatcat:wuqs7o6txrgixcm6xy3qpfr6ei

Memory Optimization for Deep Networks [article]

Aashaka Shah, Chao-Yuan Wu, Jayashree Mohan, Vijay Chidambaram, Philipp Krähenbühl
2021 arXiv   pre-print
MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation.  ...  For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at  ...  (2018) use the most memory-efficient convolution algorithms in Gist and compare its memory saving against a baseline which also chooses the most memory-efficient convolution algorithm.  ... 
arXiv:2010.14501v3 fatcat:4ohehe2qsjaehonjjzvwvowwdm

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [article]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
2019 arXiv   pre-print
In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure.  ...  To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers  ...  Nguyen, Xiaoqiang Zheng, Yonghui Wu, Noam Shazeer, Barret Zoph, Ekin Cubuk, Tianqi Chen, and Vijay Vasudevan for helpful discussions and inspirations; and the larger Google Brain team.  ... 
arXiv:1811.06965v5 fatcat:33fkmob5knakbgo6fjl5x3mvdu

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization [article]

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph E. Gonzalez
2020 arXiv   pre-print
We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies.  ...  algorithm, then uses these schedules to accelerate millions of training iterations.  ...  Ng for help in evaluation, and the paper and artifact reviewers for helpful suggestions.  ... 
arXiv:1910.02653v3 fatcat:niyvcoldpnf2rb7zygfp54lqoe

Survey on Large Scale Neural Network Training [article]

Julia Gusak, Daria Cherniuk, Alena Shilova, Alexander Katrutsa, Daniel Bershatsky, Xunyi Zhao, Lionel Eyraud-Dubois, Oleg Shlyazhko, Denis Dimitrov, Ivan Oseledets, Olivier Beaumont
2022 arXiv   pre-print
This survey provides a systematic overview of the approaches that enable more efficient DNNs training.  ...  We analyze techniques that save memory and make good use of computation and communication resources on architectures with a single or several GPUs.  ...  A branch of the Rotor framework provides the implementation of the combined offloading and rematerialization algorithms from [Beaumont et al., 2021a].  ... 
arXiv:2202.10435v1 fatcat:likjpgsn2ndnxejw2b7kzhmcou

Dynamic Tensor Rematerialization [article]

Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock
2021 arXiv   pre-print
We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and  ...  Checkpointing enables the training of deep learning models under restricted memory budgets by freeing intermediate activations from memory and recomputing them on demand.  ...  Pollock, Samuel Ainsworth, and Sam Kaufman for providing feedback and useful comments on various drafts of this work.  ... 
arXiv:2006.09616v4 fatcat:4omyfg5kkvcyza5jey6hdmysga

Differentiable Programming Tensor Networks [article]

Hai-Jun Liao, Jin-Guo Liu, Lei Wang, Tao Xiang
2019 arXiv   pre-print
By formulating the tensor network algorithm as a computation graph, one can compute higher order derivatives of the program accurately and efficiently using AD.  ...  We present theory and practice of programming tensor network algorithms in a fully differentiable way.  ...  We thank Philippe Corboz and Laurens Vanderstraeten for providing the reference data shown in Fig. 4.  ... 
arXiv:1903.09650v1 fatcat:n7l5zjfhhvemhianohbig7l4di

Profile-guided memory optimization for deep neural networks [article]

Taro Sekiyama, Takashi Imamichi, Haruki Imai, Rudy Raymond
2018 arXiv   pre-print
We address this challenge by developing a novel profile-guided memory optimization to efficiently and quickly allocate memory blocks during the propagation in DNNs.  ...  The optimization utilizes a simple and fast heuristic algorithm based on the two-dimensional rectangle packing problem.  ...  We develop a simple heuristic algorithm to DSA to obtain efficient and fast memory allocation, and incorporate the heuristic in Chainer.  ... 
arXiv:1804.10001v1 fatcat:uv75yc75crgh5p7dvtqfibzo5u

Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory [article]

Julien Herrmann, Olivier Beaumont (HiePACS, UB, LaBRI), Lionel Eyraud-Dubois, Julien Hermann, Alexis Joly
2019 arXiv   pre-print
this uses more memory, but requires fewer recomputations in the backward phase), and we provide an algorithm to compute the optimal computation sequence for this model.  ...  This paper introduces a new activation checkpointing method which allows to significantly decrease memory usage when training Deep Neural Networks with the back-propagation algorithm.  ...  efficient to checkpoint a i+1 since it avoids to recompute F i+1 .  ... 
arXiv:1911.13214v1 fatcat:ku7spslh45gnxbhskufmqpw6ny

Improving the expressiveness of deep learning frameworks with recursion

Eunji Jeong, Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Byung-Gon Chun
2018 Proceedings of the Thirteenth EuroSys Conference on - EuroSys '18  
In this paper, we add recursion to the programming model of existing frameworks by complementing their design with recursive execution of dataflow graphs as well as additional APIs for recursive definitions  ...  However, embedded control flow deep learning frameworks such as TensorFlow, Theano, Caffe2, and MXNet fail to efficiently represent and execute such neural networks, due to lack of support for recursion  ...  Technically, we could recompute the forward operation values during backpropagation instead of retaining them to save memory.  ... 
doi:10.1145/3190508.3190530 dblp:conf/eurosys/JeongJKYC18 fatcat:yjt6dn2sw5fdppk664dtgliqg4

In-place Activated BatchNorm for Memory-Optimized Training of DNNs

Samuel Rota Bulo, Lorenzo Porzi, Peter Kontschieder
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
In this work we present In-Place Activated Batch Normalization (INPLACE-ABN) - a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient  ...  for existing deep learning frameworks.  ...  Efficient deep learning frameworks like TensorFlow, MxNet or PyTorch follow distinct memory allocation strategies.  ... 
doi:10.1109/cvpr.2018.00591 dblp:conf/cvpr/BuloPK18 fatcat:inqltbivvjcurecpbc7fca24cy

Data Movement Is All You Need: A Case Study on Optimizing Transformers [article]

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
2021 arXiv   pre-print
We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training a BERT encoder layer and 1.19x for the entire BERT.  ...  Using these insights, we present a recipe for globally optimizing data movement in transformers.  ...  ., 2018), and recomputation for memory reduction (Chen et al., 2016; Jain et al., 2020) are all also applicable.  ... 
arXiv:2007.00072v3 fatcat:sseikiiyhne37lidozromcq6ai

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling [article]

Nanda K. Unnikrishnan, Keshab K. Parhi
2021 arXiv   pre-print
However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization.  ...  These loops hinder efficient pipelining and scheduling of the tasks within the layer and between consecutive layers.  ...  Section III describes the LayerPipe framework for efficient pipeline parallelism using proposed intra-layer and inter-layer optimizations.  ... 
arXiv:2108.06629v1 fatcat:2zwpwe6sr5ddtaazspi6phmehi

Machines of finite depth: towards a formalization of neural networks [article]

Pietro Vertechi, Mattia G. Bergomi
2022 arXiv   pre-print
We provide a unifying framework where artificial neural networks and their architectures can be formally described as particular cases of a general mathematical construction--machines of finite depth.  ...  Machines of finite depth are modular (they can be combined), efficiently computable and differentiable.  ...  This is inefficient, as some connections need to be recomputed several times, as encoded by the line width of the edges of the bottom graph.  ... 
arXiv:2204.12786v1 fatcat:usefnawxxzcqhekamf2ybaflzm