Decoupled Asynchronous Proximal Stochastic Gradient Descent with Variance Reduction [article]

Zhouyuan Huo, Bin Gu, Heng Huang
2016 arXiv   pre-print
In this paper, we propose a faster method, decoupled asynchronous proximal stochastic variance reduced gradient descent (DAP-SVRG).  ...  Asynchronous optimization algorithms come out as a promising solution. Recently, decoupled asynchronous proximal stochastic gradient descent (DAP-SGD) was proposed to minimize a composite function.  ...  DAP-SGD constant denotes the decoupled asynchronous proximal stochastic gradient descent method with a constant learning rate, DAP-SGD decay denotes decoupled asynchronous proximal stochastic gradient descent  ... 
arXiv:1609.06804v2 fatcat:v6woa635mjaunfxk6vlptrqece

Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [article]

Runxue Bao, Xidong Wu, Wenhan Xian, Heng Huang
2022 arXiv   pre-print
Distributed optimization has been widely used as one of the most efficient approaches for model training with massive samples.  ...  However, large-scale learning problems with both massive samples and high-dimensional features widely exist in the era of big data.  ...  On variance reduction in stochastic gradient descent and its asynchronous variants. In NeurIPS, 2015. [Shevade and Keerthi, 2003] Shirish Krishnaj Shevade and S Sathiya Keerthi.  ... 
arXiv:2204.10981v1 fatcat:7bmotznwhfbtbebllt354pjekm

The Sound of APALM Clapping: Faster Nonsmooth Nonconvex Optimization with Stochastic Asynchronous PALM [article]

Damek Davis, Brent Edmunds, Madeleine Udell
2016 arXiv   pre-print
We introduce the Stochastic Asynchronous Proximal Alternating Linearized Minimization (SAPALM) method, a block coordinate stochastic proximal-gradient method for solving nonconvex, nonsmooth optimization  ...  SAPALM is the first asynchronous parallel optimization method that provably converges on a large class of nonconvex, nonsmooth problems.  ...  The following theorem guarantees convergence of asynchronous stochastic block gradient descent with a constant minibatch size. See the appendix for a proof.  ... 
arXiv:1606.02338v1 fatcat:jpphrmyfojcpjif5adaiqmtby4

A Primer on Coordinate Descent Algorithms [article]

Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, Wotao Yin
2017 arXiv   pre-print
This monograph presents a class of algorithms called coordinate descent algorithms for mathematicians, statisticians, and engineers outside the field of optimization.  ...  Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing.  ...  Variance Reduction Techniques Alternatively, we can also consider stochastic variance-reduced gradients, which use a combination of stale gradients with new gradients to reduce the variance in the chosen  ... 
arXiv:1610.00040v2 fatcat:fo3xzcsx4rb4xauip34j5jbm3y

Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate [article]

Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, Wu-Jun Li
2018 arXiv   pre-print
In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L_1 regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework  ...  Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data.  ...  Decoupled asynchronous proximal stochastic gradient descent with variance reduction. CoRR, abs/1609.06804, 2016. [10] Rie Johnson and Tong Zhang.  ... 
arXiv:1803.05621v2 fatcat:k5qpjftigbfablarhodfpngxae

99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it [article]

Konstantin Mishchenko and Filip Hanzely and Peter Richtárik
2019 arXiv   pre-print
Namely, we develop a new variant of parallel block coordinate descent based on independent sparsification of the local gradient estimates before communication.  ...  The average is broadcast back to the workers, which use it to perform a gradient-type step to update the local version of the model.  ...  Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018. Hanzely, F. and Richtárik, P.  ... 
arXiv:1901.09437v2 fatcat:up2xhbyfojhgndvomvuhq3qxd4

Cogradient Descent for Dependable Learning [article]

Runqi Wang, Baochang Zhang, Li'an Zhuo, Qixiang Ye, David Doermann
2021 arXiv   pre-print
Conventional gradient descent methods compute the gradients for multiple variables through the partial derivative.  ...  In this paper, we propose a dependable learning based on Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem, providing a systematic way to coordinate the gradients of coupling  ...  In the deep learning era, with large-scale dataset, stochastic gradient descent (SGD) and its variants are practical choices.  ... 
arXiv:2106.10617v1 fatcat:s5gthlcvfzh6jgqx3mwm2obtbq

Direct Acceleration of SAGA using Sampled Negative Momentum [article]

Kaiwen Zhou
2019 arXiv   pre-print
Among existing variance reduction methods, SVRG and SAGA adopt unbiased gradient estimators and are the most popular variance reduction methods in recent years.  ...  Variance reduction is a simple and effective technique that accelerates convex (or non-convex) stochastic optimization.  ...  Inspired by the acceleration technique proposed in Nesterov's accelerated gradient descent [Nesterov, 2004], accelerated variants of stochastic variance reduced methods have been proposed in recent years  ... 
arXiv:1806.11048v4 fatcat:srtwhsgm4rawnlktoqg335zm6m

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks [article]

Karanbir Chahal, Manraj Singh Grover, Kuntal Dey
2018 arXiv   pre-print
More specifically, we explore the synchronous and asynchronous variants of distributed Stochastic Gradient Descent, various All Reduce gradient aggregation strategies and best practices for obtaining higher  ...  Training a benchmark dataset like ImageNet on a single machine with a modern GPU can take up to a week; distributing training on multiple machines has been observed to drastically bring this time down.  ...  s [24] proposed algorithm combines merits of Delayed Proximal Gradient algorithm [25] and Stochastic Variance Reduced Gradient [26] which guarantees convergence to optimal solution at fast linear  ... 
arXiv:1810.11787v1 fatcat:wy36x3sdwvhvfdrnc5tvzn7sty

Stochastic, Distributed and Federated Optimization for Machine Learning [article]

Jakub Konečný
2017 arXiv   pre-print
First, we propose novel variants of stochastic gradient descent with a variance reduction property that enables linear convergence for strongly convex objectives.  ...  We propose a communication-efficient framework which iteratively forms local subproblems that can be solved with arbitrary local optimization algorithms.  ...  Part II Parallel and Distributed Methods with Variance Reduction Mini-batch Semi-Stochastic Gradient Descent in the Proximal Setting Introduction In this work we are concerned with the problem of minimizing  ... 
arXiv:1707.01155v1 fatcat:t6uqrmnssrafze6l6c7gk5vcyu

Optimization Methods for Large-Scale Machine Learning [article]

Léon Bottou, Frank E. Curtis, Jorge Nocedal
2018 arXiv   pre-print
A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional  ...  gradient-based nonlinear optimization techniques typically falter.  ...  ., the iteration is equivalent to applying a proximal mapping to the result of a gradient descent step.  ... 
arXiv:1606.04838v3 fatcat:7gksju7azndy5almouyzycayci

On Seven Fundamental Optimization Challenges in Machine Learning [article]

Konstantin Mishchenko
2021 arXiv   pre-print
The exchange of ideas between these fields has worked both ways, with machine learning building on standard optimization procedures such as gradient descent, as well as with new directions in the optimization  ...  The fifth challenge is the development of an algorithm for distributed optimization with quantized updates that preserves linear convergence of gradient descent.  ...  Above all, with Peter Richtárik's help, I quickly became able to work independently and collaborate with people from different countries and backgrounds, which lays a solid foundation for my future work  ... 
arXiv:2110.12281v1 fatcat:c4oc7xv6fvdqdik4hwegrcnsqm

Quasi-hyperbolic momentum and Adam for deep learning [article]

Jerry Ma, Denis Yarats
2019 arXiv   pre-print
Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning.  ...  We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step.  ...  Accelerating stochastic gradient descent. CoRR, abs/1704.08227, 2017. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.  ... 
arXiv:1810.06801v4 fatcat:tq3iul7mdnhjhjjtq5d7edjacm

Relative Entropy Regularized Policy Iteration [article]

Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, Martin Riedmiller
2018 arXiv   pre-print
We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with learned action-value function.  ...  Our comparison on 31 continuous control tasks from parkour suite [Heess et al., 2017], DeepMind control suite [Tassa et al., 2018] and OpenAI Gym [Brockman et al., 2016] with diverse properties, limited  ...  (s_t, a_t), which we optimize via gradient descent.  ... 
arXiv:1812.02256v1 fatcat:bfwxwdrtejed7hjewdheh2yoxy

Deep Reinforcement Learning Overview of the state of the Art

Youssef Fenjiro, Houda Benbrahim
2018 Journal of Automation, Mobile Robotics & Intelligent Systems  
Artificial intelligence has made big steps forward with reinforcement learning (RL) in the last century, and with the advent of deep learning (DL) in the 90s, especially, the breakthrough of convolutional  ...  this new and promising field, by browsing a set of algorithms (Value optimization, Policy optimization and Actor-Critic), then, giving an outline of current challenges and real-world applications, along with  ...  , instead of 104 iterations with SGD (stochastic gradient descent).  ... 
doi:10.14313/jamris_3-2018/15 fatcat:wn5i7y7tgfhvnhz3u5xkqlgvpe