Stochastic data sweeping for fast DNN training
2014
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Experiments showed that combining SDS with asynchronous stochastic gradient descent (ASGD) can achieve almost a 3.0-times speed-up on 2 GPUs with no loss of recognition accuracy. ...
In this paper, a novel stochastic data sweeping (SDS) framework is proposed from a different perspective to speed up DNN training with a single GPU. ...
Asynchronous stochastic gradient descent (ASGD) uses multiple GPUs to independently compute gradients on different data with the latest model, and updates the model on the host server asynchronously ... (a minimal sketch of this scheme follows this entry)
doi:10.1109/icassp.2014.6853594
dblp:conf/icassp/DengQFFY14
fatcat:ayglelau4jbhbny2ak4qidm4cq
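The ASGD scheme summarized in this entry (independent workers compute gradients on different data with the latest model, and a host applies their updates asynchronously) can be illustrated with a minimal sketch. The quadratic loss, synthetic data, thread-based "workers", and hyperparameters below are illustrative assumptions, not taken from the paper; real ASGD runs the workers on separate GPUs or machines.

```python
# Minimal ASGD sketch: worker threads read the latest shared parameters,
# compute a gradient on their own data shard, and push an update without
# waiting for the other workers, so some updates use stale parameters.
# Loss, data, and hyperparameters are assumptions for illustration only.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1024, 2))
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = np.zeros(2)                    # shared model held by the "host server"
lock = threading.Lock()            # guards reads/writes of the shared model
lr, steps_per_worker, batch = 0.05, 200, 32

def worker(shard, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps_per_worker):
        with lock:
            w_local = w.copy()     # fetch the latest model (may go stale)
        idx = local_rng.choice(shard, size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w_local - y[idx]) / batch
        with lock:
            w[:] = w - lr * grad   # asynchronous update on the host copy

shards = np.array_split(rng.permutation(1024), 4)
threads = [threading.Thread(target=worker, args=(s, i))
           for i, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned:", w, "target:", true_w)
```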
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
[article]
2018
arXiv
pre-print
In order to extract the best possible performance from asynchronous stochastic gradient descent, one must increase the mini-batch size and scale the learning rate accordingly. ... (a small scaling sketch follows this entry)
Unfortunately, increasing the mini-batch size worsens the stale-gradient problem in asynchronous stochastic gradient descent (SGD), which hurts model convergence. ...
Nikolay Bogoychev was funded by an Amazon faculty research award to Adam Lopez. ...
arXiv:1808.08859v2
fatcat:6hopnc2tlzhgnf2pnevvletnme
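The abstract above notes that larger mini-batches call for a correspondingly scaled learning rate. A common heuristic for this is linear scaling; the helper below is a hedged illustration of that heuristic, not necessarily the rule used in the paper.

```python
# Linear learning-rate scaling heuristic (an assumption for illustration,
# not necessarily the exact rule used in the paper): when the mini-batch
# grows by a factor k, grow the learning rate by the same factor.
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    return base_lr * (batch / base_batch)

# A baseline tuned at lr=0.1 with batch 64, scaled up to batch 512:
print(scaled_lr(0.1, 64, 512))  # 0.8
```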
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
2018
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
In order to extract the best possible performance from asynchronous stochastic gradient descent, one must increase the mini-batch size and scale the learning rate accordingly. ...
Unfortunately, increasing the mini-batch size worsens the stale-gradient problem in asynchronous stochastic gradient descent (SGD), which hurts model convergence. ...
Nikolay Bogoychev was funded by an Amazon faculty research award to Adam Lopez. ...
doi:10.18653/v1/d18-1332
dblp:conf/emnlp/BogoychevHAJ18
fatcat:wtnk5yjkwfabnhhccsl62fy6ju
Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling
[article]
2017
arXiv
pre-print
In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on 1000-hour LibriSpeech corpora using feed-forward deep neural networks (DNNs) and convolutional, long ...
Many parallel training algorithms have been proposed to speed up training. ...
arXiv:1703.05880v2
fatcat:cf5o75pwmfbwjn7q45iwghoxmy
Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition
[chapter]
2017
Lecture Notes in Computer Science
Asynchronous Stochastic Gradient Descent provides a possibility of large-scale distributed computation for training such networks. ...
However, asynchrony introduces stale gradients, which are considered to have negative effects on training speed. ...
Two strategies mainly exist in data-parallel neural network training: Synchronous Stochastic Gradient Descent (SSGD) [7, 9] and Asynchronous Stochastic Gradient Descent (ASGD) [6, 8, 10, 11]. ... (a minimal synchronous step is sketched after this entry for contrast)
doi:10.1007/978-3-319-70096-0_32
fatcat:njad52dzvbakxjwasyzkvp5zny
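For contrast with the asynchronous sketch after the first entry, the other strategy named in the snippet, synchronous data-parallel SGD (SSGD), can be sketched as follows: every worker computes a gradient on its own mini-batch, the gradients are averaged (an allreduce in real systems), and one update is applied before any worker proceeds. The loss, synthetic data, and loop structure are illustrative assumptions.

```python
# Minimal synchronous data-parallel SGD (SSGD) loop: per step, one gradient
# per "worker", a barrier that averages them, then a single shared update.
# Loss, data, and hyperparameters are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.0, 4.0])
X = rng.normal(size=(512, 2))
y = X @ true_w

w = np.zeros(2)
lr, batch, n_workers = 0.05, 32, 4
shards = np.array_split(rng.permutation(512), n_workers)

for step in range(100):
    grads = []
    for shard in shards:                      # one gradient per "worker"
        idx = rng.choice(shard, size=batch)
        grads.append(2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch)
    w -= lr * np.mean(grads, axis=0)          # barrier: average, then update

print("learned:", w, "target:", true_w)
```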
On Distributed Deep Network for Processing Large-Scale Sets of Complex Data
2016
2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)
We have successfully used our system to train a distributed deep network, and achieve state-of-the-art performance on MNIST, a visual handwriting font library. ...
We show that these techniques dramatically accelerate the training of this kind of distributed deep network. ...
The Bagging-Down SGD algorithm: Stochastic gradient descent (SGD) is perhaps the most commonly used optimization procedure for training deep neural networks [4, 26, 27]. ...
doi:10.1109/ihmsc.2016.55
fatcat:j36fhxklpbcmxpjtcxklvpdolm
Large-Scale Stochastic Learning using GPUs
[article]
2017
arXiv
pre-print
Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU implementation of the widely used stochastic coordinate descent/ascent algorithm that can provide up to 35x speed-up over a sequential CPU implementation. ...
ACKNOWLEDGMENT: The authors would like to thank Evangelos Eleftheriou, IBM Research - Zurich, for his support of this work and Martin Jaggi, EPFL, for useful discussions regarding distributed learning algorithms ...
arXiv:1702.07005v1
fatcat:eckljceftzbyzdpolhedym32pm
Fast Parallel Training of Neural Language Models
2017
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
On four NVIDIA GTX1080 GPUs, it achieves a speedup of 2.1+ times over the standard asynchronous stochastic gradient descent baseline, yet with no increase in perplexity. ...
Our approach yields significant speed improvements on a recurrent neural network-based language model. ...
The authors would like to thank anonymous reviewers, Fuxue Li, Yaqian Han, Ambyer Han and Bojie Hu for their comments. ...
doi:10.24963/ijcai.2017/586
dblp:conf/ijcai/XiaoZLZ17
fatcat:t2lsbylq4jbprdsa23j7lcrey4
AutoAssist: A Framework to Accelerate Training of Deep Neural Networks
[article]
2019
arXiv
pre-print
In this paper, we propose AutoAssist, a simple framework to accelerate training of a deep neural network. ...
Deep neural networks have yielded superior performance in many applications; however, the gradient computation in a deep model with millions of instances leads to a lengthy training process even with modern ...
At each stochastic gradient step, an instance (x_i, y_i) or a batch of instances {(x_i, y_i)}_{i∈B} is sampled from the training data and a gradient descent step is conducted based on the stochastic gradient ... (this update rule is written out after this entry)
arXiv:1905.03381v1
fatcat:r3wlowr47bcafn2ogthdacrbri
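The batch notation in the snippet above describes the standard mini-batch stochastic gradient step. Written out, with the per-example loss, model, and learning rate symbols assumed here (the paper's exact notation is not shown in the snippet):

```latex
% Mini-batch stochastic gradient step described in the snippet above.
% The notation (per-example loss \ell, model f, learning rate \eta_t) is
% assumed for illustration.
w_{t+1} \;=\; w_t \;-\; \eta_t \,\frac{1}{|B|} \sum_{i \in B}
  \nabla_{w}\,\ell\!\left(f(x_i; w_t),\, y_i\right)
```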
Gossip training for deep learning
[article]
2016
arXiv
pre-print
We address the issue of speeding up the training of convolutional networks. Here we study a distributed method adapted to stochastic gradient descent (SGD). ...
The parallel optimization setup uses several threads, each applying individual gradient descent steps on a local variable. ... (a gossip-style sketch follows this entry)
This method, called stochastic gradient descent (SGD), has proved to be very efficient for training neural networks in general. ...
arXiv:1611.09726v1
fatcat:n7tdch7o7nhzbaik4htdalx6qy
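The gossip scheme summarized above has each worker run local gradient descent on its own parameter copy and occasionally exchange information with a peer. The sketch below shows one plausible variant based on pairwise parameter averaging; the mixing rule, loss, data, and schedule are illustrative assumptions and may differ from the paper's method.

```python
# Gossip-style SGD sketch: each worker keeps its own copy of the parameters,
# takes local gradient steps, and occasionally averages its copy with one
# randomly chosen peer instead of synchronizing with everyone.
# Loss, data, and hyperparameters are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([0.5, -1.5])
X = rng.normal(size=(1024, 2))
y = X @ true_w

n_workers, lr, batch, gossip_every = 4, 0.05, 32, 5
W = [np.zeros(2) for _ in range(n_workers)]      # one local model per worker

for step in range(300):
    for k in range(n_workers):
        idx = rng.integers(0, 1024, size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ W[k] - y[idx]) / batch
        W[k] = W[k] - lr * grad                  # local gradient descent step
    if step % gossip_every == 0:                 # occasional pairwise gossip
        a, b = rng.choice(n_workers, size=2, replace=False)
        avg = (W[a] + W[b]) / 2
        W[a], W[b] = avg.copy(), avg.copy()

print("consensus estimate:", np.mean(W, axis=0), "target:", true_w)
```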
Optimizing Machine Learning on Apache Spark in HPC Environments
2018
2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
To this end we introduce: (i) the application of MapRDD, a fine grained distributed data representation; (ii) a task-based allreduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent ...
With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict a 2x further speedup (i.e. 22.4x accumulated speedup) is obtainable with the new asynchronous ...
ACKNOWLEDGMENT: This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1). ...
doi:10.1109/mlhpc.2018.8638643
fatcat:r2wutc5fzvbpzg6zrqoqe5erna
On parallelizability of stochastic gradient descent for speech DNNs
2014
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs. ...
We arrive at an estimated possible end-to-end speed-up of 5 times or more. ...
TRAINING CONTEXT-DEPENDENT DEEP-NEURAL-NETWORK HMMS: A deep neural network (DNN) is a conventional multi-layer perceptron (MLP [12]) with many layers, where training is commonly initialized by a pretraining ...
doi:10.1109/icassp.2014.6853593
dblp:conf/icassp/SeideFDLY14
fatcat:yqf3byctxjb45bl67lhjgwlpum
Revisiting Distributed Synchronous SGD
[article]
2017
arXiv
pre-print
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise ...
Our approach is empirically validated and shown to converge faster and to better test accuracies. ...
Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models. ...
arXiv:1604.00981v3
fatcat:fnfrhsyakjfxxho4f3s2rwnurq
Revisiting Distributed Synchronous SGD
[article]
2017
arXiv
pre-print
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise ...
Our approach is empirically validated and shown to converge faster and to better test accuracies. ...
Section 2 describes asynchronous stochastic optimization and presents experimental evidence of gradient staleness in deep neural network models. ...
arXiv:1702.05800v2
fatcat:s2yrnfe7cneetib6rbak25slii
Benchmarking Decoupled Neural Interfaces with Synthetic Gradients
[article]
2018
arXiv
pre-print
This paper performs a speed benchmark to compare the speed and accuracy capabilities of SG-DNI as opposed to a standard neural interface using a multilayer perceptron (MLP). ...
To solve this problem, synthetic gradients (SG) with decoupled neural interfaces (DNI) are introduced as a viable alternative to the backpropagation algorithm. ...
Acknowledgments: The author would like to thank Andrew Miles for help and support with the GPU computing facilities at the Carleton School of Computer Science. ...
arXiv:1712.08314v3
fatcat:k53n7uyr5rb2pmcujbddfyzaxy
Showing results 1 — 15 out of 1,164 results