1,164 Hits in 4.8 sec

Stochastic data sweeping for fast DNN training

Wei Deng, Yanmin Qian, Yuchen Fan, Tianfan Fu, Kai Yu
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Experiments showed that combining SDS with asynchronous stochastic gradient descent (ASGD) can achieve an almost 3.0-times speed-up on 2 GPUs with no loss of recognition accuracy.  ...  In this paper, a novel stochastic data sweeping (SDS) framework is proposed from a different perspective to speed up DNN training with a single GPU.  ...  Asynchronous stochastic gradient descent (ASGD) uses multiple GPUs to compute gradients on different data using the latest model independently, and updates the model on the host server asynchronously.  ... 
doi:10.1109/icassp.2014.6853594 dblp:conf/icassp/DengQFFY14 fatcat:ayglelau4jbhbny2ak4qidm4cq
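The ASGD pattern quoted in this entry (workers compute gradients on different data against the latest model and push updates to a host server without waiting for each other) can be illustrated with a small sketch. The toy least-squares objective, the thread-based "host server", and all names and constants below are illustrative assumptions, not code from the paper.

```python
# Minimal asynchronous-SGD sketch (illustrative only, not the paper's code).
# Each worker thread repeatedly reads the latest shared weights, computes a
# stochastic gradient on its own data shard, and pushes its update without
# waiting for the other workers, mimicking the host-server pattern above.
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(10)            # shared "host server" parameters
lock = threading.Lock()     # keeps the numpy update itself atomic
lr, steps, batch = 0.01, 500, 32

def worker(seed, shard):
    global w
    local_rng = np.random.default_rng(seed)
    Xs, ys = X[shard], y[shard]
    for _ in range(steps):
        idx = local_rng.integers(0, len(Xs), size=batch)
        w_snapshot = w.copy()                 # read the latest (possibly stale) model
        grad = Xs[idx].T @ (Xs[idx] @ w_snapshot - ys[idx]) / batch
        with lock:                            # asynchronous push of the update
            w -= lr * grad

shards = np.array_split(np.arange(1000), 4)   # different data for each worker
threads = [threading.Thread(target=worker, args=(i, s)) for i, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("parameter error:", np.linalg.norm(w - true_w))
```

Because each worker reads a snapshot that other workers may already have updated, the gradients it pushes can be slightly stale, which is exactly the trade-off the entries below discuss.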

Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation [article]

Nikolay Bogoychev, Marcin Junczys-Dowmunt, Kenneth Heafield, Alham Fikri Aji
2018 arXiv   pre-print
In order to extract the best possible performance from asynchronous stochastic gradient descent, one must increase the mini-batch size and scale the learning rate accordingly.  ...  Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous stochastic gradient descent (SGD), which degrades model convergence.  ...  Nikolay Bogoychev was funded by an Amazon faculty research award to Adam Lopez.  ... 
arXiv:1808.08859v2 fatcat:6hopnc2tlzhgnf2pnevvletnme

Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

Nikolay Bogoychev, Kenneth Heafield, Alham Fikri Aji, Marcin Junczys-Dowmunt
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
In order to extract the best possible performance from asynchronous stochastic gradient descent, one must increase the mini-batch size and scale the learning rate accordingly.  ...  Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous stochastic gradient descent (SGD), which degrades model convergence.  ...  Nikolay Bogoychev was funded by an Amazon faculty research award to Adam Lopez.  ... 
doi:10.18653/v1/d18-1332 dblp:conf/emnlp/BogoychevHAJ18 fatcat:wtnk5yjkwfabnhhccsl62fy6ju
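The two records above are the preprint and EMNLP versions of the same paper; the snippet mentions increasing the mini-batch size and scaling the learning rate accordingly. A common convention for this is linear scaling with a warm-up period; the sketch below assumes that convention, and its function name, constants, and schedule are illustrative rather than taken from the paper.

```python
# Linear learning-rate scaling with warm-up for larger mini-batches
# (a common convention; the paper's exact schedule may differ).
def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=4000):
    """Scale base_lr linearly with batch size and ramp it up over warmup_steps."""
    lr = base_lr * (batch / base_batch)          # linear scaling rule
    warmup = min(1.0, step / warmup_steps)       # gradual warm-up avoids early divergence
    return lr * warmup

# Example: quadrupling the batch roughly quadruples the target learning rate.
print(scaled_lr(base_lr=0.0003, base_batch=1000, batch=4000, step=8000))
```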

Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling [article]

Wenpeng Li, BinBin Zhang, Lei Xie, Dong Yu
2017 arXiv   pre-print
(BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on 1000-hour LibriSpeech corpora using feed-forward deep neural networks (DNNs) and convolutional, long  ...  In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering  ...  Many parallel training algorithms have been proposed to speed up training.  ... 
arXiv:1703.05880v2 fatcat:cf5o75pwmfbwjn7q45iwghoxmy
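Of the four algorithms this comparison covers, EASGD has a particularly compact update: each worker takes local gradient steps while being pulled toward a shared center variable by an elastic term, and the center drifts toward the workers. The sketch below shows that standard update on a toy quadratic objective; the elastic coefficient and other constants are assumptions for illustration, not values from the paper.

```python
# Elastic Averaging SGD (EASGD) sketch on a toy quadratic objective.
# Each worker takes local gradient steps with an elastic pull toward a
# shared center variable; the center in turn drifts toward the workers.
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=5)

def grad(w):
    return w - target            # gradient of 0.5 * ||w - target||^2

n_workers, rho, lr, rounds = 4, 0.1, 0.1, 200
workers = [np.zeros(5) for _ in range(n_workers)]
center = np.zeros(5)

for _ in range(rounds):
    for i in range(n_workers):
        elastic = rho * (workers[i] - center)                    # pull toward the center
        workers[i] = workers[i] - lr * (grad(workers[i]) + elastic)
    center = center + lr * rho * sum(w - center for w in workers)  # center update

print("distance of center from optimum:", np.linalg.norm(center - target))
```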

Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition [chapter]

Ikuro Sato, Ryo Fujisaki, Yosuke Oyama, Akihiro Nomura, Satoshi Matsuoka
2017 Lecture Notes in Computer Science  
Asynchronous Stochastic Gradient Descent provides a possibility of large-scale distributed computation for training such networks.  ...  However, asynchrony introduces stale gradients, which are considered to have negative effects on training speed.  ...  Two strategies mainly exist in data-parallel neural network training: Synchronous Stochastic Gradient Descent (SSGD) [7, 9] and Asynchronous Stochastic Gradient Descent (ASGD) [6, 8, 10, 11].  ... 
doi:10.1007/978-3-319-70096-0_32 fatcat:njad52dzvbakxjwasyzkvp5zny
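The chapter's title refers to a linear prediction model for the parameter transition; the snippet itself only describes the staleness problem. The sketch below is a hypothetical illustration of the general flavor of that idea, namely extrapolating the parameters forward along their recent trajectory before computing a gradient, and is not the authors' actual model.

```python
# Hypothetical sketch: compensate for gradient staleness by linearly
# extrapolating the parameters from their last update before computing the
# gradient. Illustrative only; the chapter's prediction model may differ.
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=5)
w = np.zeros(5)
prev_w = w.copy()
lr, staleness, steps = 0.1, 3, 200

for _ in range(steps):
    # predict where the parameters will be after `staleness` more updates
    predicted = w + staleness * (w - prev_w)
    g = predicted - target        # gradient of 0.5 * ||w - target||^2 at the prediction
    prev_w = w.copy()
    w = w - lr * g

print("distance from optimum:", np.linalg.norm(w - target))
```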

On Distributed Deep Network for Processing Large-Scale Sets of Complex Data

Qin Chao, Gao Xiao-Guang, Chen Da-Qing
2016 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)  
We have successfully used our system to train a distributed deep network, and achieve state-of-the-art performance on MNIST, a handwritten digit database.  ...  We show that these techniques dramatically accelerate the training of this kind of distributed deep network.  ...  The Bagging-Down SGD algorithm: Stochastic gradient descent (SGD) is perhaps the most commonly used optimization procedure for training deep neural networks [4, 26, 27].  ... 
doi:10.1109/ihmsc.2016.55 fatcat:j36fhxklpbcmxpjtcxklvpdolm

Large-Scale Stochastic Learning using GPUs [article]

Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis, Haris Pozidis
2017 arXiv   pre-print
Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU implementation of the widely used stochastic coordinate descent/ascent algorithm that can provide up to 35x speed-up over a sequential CPU implementation.  ...  ACKNOWLEDGMENT: The authors would like to thank Evangelos Eleftheriou, IBM Research - Zurich for his support of this work and Martin Jaggi, EPFL for useful discussions regarding distributed learning algorithms  ... 
arXiv:1702.07005v1 fatcat:eckljceftzbyzdpolhedym32pm
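The stochastic coordinate descent mentioned in this entry updates one randomly chosen coordinate at a time, solving exactly along that coordinate. The sketch below is a minimal sequential version for ridge regression; the paper's contribution is a massively parallel, asynchronous GPU implementation, and everything here (problem, constants, update form) is only an illustration of the basic coordinate step.

```python
# Minimal (sequential) stochastic coordinate descent for ridge regression:
# repeatedly pick a random coordinate and minimize the objective exactly
# along it, maintaining the residual incrementally.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=200)
lam = 1.0
w = np.zeros(20)
residual = y - X @ w                      # kept consistent with w throughout

for _ in range(2000):
    j = int(rng.integers(20))             # random coordinate
    col = X[:, j]
    # closed-form minimizer of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2 over w_j
    w_j_new = (col @ (residual + col * w[j])) / (col @ col + lam)
    residual += col * (w[j] - w_j_new)    # update residual for the change in w_j
    w[j] = w_j_new

print("objective:", 0.5 * residual @ residual + 0.5 * lam * w @ w)
```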

Fast Parallel Training of Neural Language Models

Tong Xiao, Jingbo Zhu, Tongran Liu, Chunliang Zhang
2017 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence  
On four NVIDIA GTX1080 GPUs, it achieves a speedup of 2.1+ times over the standard asynchronous stochastic gradient descent baseline, yet with no increase in perplexity.  ...  Our approach yields significant speed improvements on a recurrent neural network-based language model.  ...  The authors would like to thank anonymous reviewers, Fuxue Li, Yaqian Han, Ambyer Han and Bojie Hu for their comments.  ... 
doi:10.24963/ijcai.2017/586 dblp:conf/ijcai/XiaoZLZ17 fatcat:t2lsbylq4jbprdsa23j7lcrey4

AutoAssist: A Framework to Accelerate Training of Deep Neural Networks [article]

Jiong Zhang, Hsiang-fu Yu, Inderjit S. Dhillon
2019 arXiv   pre-print
In this paper, we propose AutoAssist, a simple framework to accelerate training of a deep neural network.  ...  Deep neural networks have yielded superior performance in many applications; however, the gradient computation in a deep model with millions of instances leads to a lengthy training process even with modern  ...  At each stochastic gradient step, an instance (x_i, y_i) or a batch of instances {(x_i, y_i)}_{i∈B} is sampled from the training data and a gradient descent step is conducted based on the stochastic gradient  ... 
arXiv:1905.03381v1 fatcat:r3wlowr47bcafn2ogthdacrbri
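The snippet above describes the standard stochastic gradient step over a sampled instance or mini-batch. The sketch below shows that step, plus a toy filter that skips instances whose loss is already very small as a loose illustration of focusing computation on harder examples; the filtering rule and all constants are assumptions and are not the AutoAssist algorithm itself.

```python
# Stochastic gradient step over a sampled mini-batch, with an illustrative
# filter that skips near-zero-loss instances (not the AutoAssist framework).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.05 * rng.normal(size=5000)

w = np.zeros(10)
lr, batch, threshold = 0.05, 64, 1e-3

for step in range(1000):
    idx = rng.integers(0, len(X), size=batch)        # sample {(x_i, y_i)}_{i in B}
    Xb, yb = X[idx], y[idx]
    per_example_loss = 0.5 * (Xb @ w - yb) ** 2
    keep = per_example_loss > threshold              # skip "easy" instances
    if not keep.any():
        continue
    Xk, yk = Xb[keep], yb[keep]
    grad = Xk.T @ (Xk @ w - yk) / len(yk)            # stochastic gradient on kept instances
    w -= lr * grad

print("parameter error:", np.linalg.norm(w - w_true))
```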

Gossip training for deep learning [article]

Michael Blot, David Picard, Matthieu Cord, Nicolas Thome
2016 arXiv   pre-print
We address the issue of speeding up the training of convolutional networks. Here we study a distributed method adapted to stochastic gradient descent (SGD).  ...  The parallel optimization setup uses several threads, each applying individual gradient descents on a local variable.  ...  This method, called stochastic gradient descent (SGD), has proved to be very efficient for training neural networks in general.  ... 
arXiv:1611.09726v1 fatcat:n7tdch7o7nhzbaik4htdalx6qy
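The gossip approach described above lets each thread run SGD on its own local copy of the weights and occasionally exchange information with a randomly chosen peer. The sketch below simulates that pattern sequentially with simple pairwise averaging; the exchange rule and constants are assumptions for illustration (the paper's gossip protocol uses its own specific weighted exchange).

```python
# Gossip-style SGD sketch: each worker keeps a local copy of the weights,
# takes noisy gradient steps on a toy objective, and occasionally averages
# with one randomly chosen peer. Simulated sequentially.
import numpy as np

rng = np.random.default_rng(5)
target = rng.normal(size=5)

n_workers, lr, p_gossip, steps = 8, 0.05, 0.2, 500
workers = [rng.normal(size=5) for _ in range(n_workers)]

for _ in range(steps):
    for i in range(n_workers):
        noise = 0.1 * rng.normal(size=5)
        workers[i] = workers[i] - lr * ((workers[i] - target) + noise)  # local SGD step
        if rng.random() < p_gossip:                  # occasional gossip exchange
            j = int(rng.integers(n_workers))
            avg = 0.5 * (workers[i] + workers[j])    # pairwise averaging
            workers[i], workers[j] = avg.copy(), avg.copy()

consensus = np.mean(workers, axis=0)
print("distance of consensus from optimum:", np.linalg.norm(consensus - target))
```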

Optimizing Machine Learning on Apache Spark in HPC Environments

Zhenyu Li, James Davis, Stephen A. Jarvis
2018 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)  
To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based allreduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent  ...  With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict that a further 2x speedup (i.e. 22.4x accumulated speedup) is obtainable with the new asynchronous  ...  ACKNOWLEDGMENT: This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1).  ... 
doi:10.1109/mlhpc.2018.8638643 fatcat:r2wutc5fzvbpzg6zrqoqe5erna

On parallelizability of stochastic gradient descent for speech DNNs

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, Dong Yu
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs.  ...  We arrive at an estimated possible end-to-end speed-up of 5 times or more.  ...  TRAINING CONTEXT-DEPENDENT DEEP-NEURAL-NETWORK HMMS: A deep neural network (DNN) is a conventional multi-layer perceptron (MLP [12]) with many layers, where training is commonly initialized by a pretraining  ... 
doi:10.1109/icassp.2014.6853593 dblp:conf/icassp/SeideFDLY14 fatcat:yqf3byctxjb45bl67lhjgwlpum

Revisiting Distributed Synchronous SGD [article]

Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz
2017 arXiv   pre-print
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise  ...  Our approach is empirically validated and shown to converge faster and to better test accuracies.  ...  Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models.  ... 
arXiv:1604.00981v3 fatcat:fnfrhsyakjfxxho4f3s2rwnurq

Revisiting Distributed Synchronous SGD [article]

Xinghao Pan, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz
2017 arXiv   pre-print
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise  ...  Our approach is empirically validated and shown to converge faster and to better test accuracies.  ...  Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models.  ... 
arXiv:1702.05800v2 fatcat:s2yrnfe7cneetib6rbak25slii
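The two arXiv records above are versions of the same study on synchronous optimization. One widely cited remedy for stragglers in synchronous SGD, and the idea usually associated with this line of work, is to launch a few backup workers and aggregate only the first N gradients to arrive each step. The sketch below is a generic simulation of that aggregation rule, not necessarily the paper's exact procedure; the timing model and constants are assumptions.

```python
# Synchronous SGD with backup workers (generic sketch): launch N + b workers
# per step, aggregate the first N gradients to finish, drop the stragglers.
import numpy as np

rng = np.random.default_rng(6)
target = rng.normal(size=5)
n_workers, n_backup, lr, steps = 8, 2, 0.1, 100
w = np.zeros(5)

for _ in range(steps):
    # simulate per-worker completion times; stragglers simply finish later
    finish_times = rng.exponential(scale=1.0, size=n_workers + n_backup)
    fastest = np.argsort(finish_times)[:n_workers]       # first N to report back
    grads = []
    for _ in fastest:                                     # in this toy, only the count matters
        noise = 0.1 * rng.normal(size=5)
        grads.append((w - target) + noise)                # noisy gradient of the toy objective
    w -= lr * np.mean(grads, axis=0)                      # synchronous aggregated update

print("distance from optimum:", np.linalg.norm(w - target))
```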

Benchmarking Decoupled Neural Interfaces with Synthetic Gradients [article]

Ekaba Bisong
2018 arXiv   pre-print
This paper performs a benchmark to compare the speed and accuracy of SG-DNI against a standard neural interface using a multilayer perceptron (MLP).  ...  To solve this problem, synthetic gradients (SG) with decoupled neural interfaces (DNI) are introduced as a viable alternative to the backpropagation algorithm.  ...  Acknowledgments: The author would like to thank Andrew Miles for help and support with the GPU computing facilities at the Carleton School of Computer Science.  ... 
arXiv:1712.08314v3 fatcat:k53n7uyr5rb2pmcujbddfyzaxy
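Decoupled neural interfaces replace the true backpropagated gradient at a layer boundary with the output of a small trained module, so the layer below can update without waiting for the full backward pass. The sketch below shows that idea on a two-layer linear toy model; the architecture, the linear synthetic-gradient module, and all constants are illustrative assumptions, not the benchmark setup from the paper.

```python
# Synthetic-gradient / decoupled-neural-interface sketch on a two-layer
# linear model: a small linear module predicts dL/dh so the first layer can
# update without the true backward pass; the module itself is trained to
# regress the true gradient once it becomes available.
import numpy as np

rng = np.random.default_rng(7)
d_in, d_hid, n = 10, 16, 512
X = rng.normal(size=(n, d_in))
y = X @ rng.normal(size=(d_in, 1))

W1 = 0.1 * rng.normal(size=(d_in, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, 1))
S = np.zeros((d_hid, d_hid))             # synthetic-gradient module: g_hat = h @ S
lr, lr_s, batch = 0.05, 0.01, 64

for step in range(2000):
    idx = rng.integers(0, n, size=batch)
    xb, yb = X[idx], y[idx]
    h = xb @ W1                           # layer 1 forward
    g_hat = h @ S                         # synthetic gradient w.r.t. h
    W1 -= lr * xb.T @ g_hat / batch       # layer 1 updates immediately (decoupled)

    pred = h @ W2                         # layer 2 forward
    err = (pred - yb) / batch             # dL/dpred for L = 0.5 * mean((pred - y)^2)
    g_true = err @ W2.T * batch           # true per-example gradient w.r.t. h
    W2 -= lr * h.T @ err                  # layer 2 true gradient step
    S -= lr_s * h.T @ (g_hat - g_true) / batch   # train the synthetic-gradient module

print("final loss:", float(np.mean((X @ W1 @ W2 - y) ** 2)))
```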
Showing results 1 — 15 out of 1,164 results