ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [article]

Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
2021 arXiv   pre-print
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.  ...  Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam.  ...  CONCLUSIONS In this work, we proposed ADAHESSIAN, an adaptive Hessian based optimizer.  ... 
arXiv:2006.00719v3 fatcat:fqd7xykr3jbmne3ybwsx57xbxu
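A minimal sketch of the Hutchinson-style diagonal Hessian estimate that AdaHessian builds on, assuming a PyTorch setting; the function name `hutchinson_diag` and its arguments are illustrative, and the authors' released optimizer additionally applies spatial averaging and exponential moving averages on top of this estimate.

```python
# Illustrative only (not the authors' code): estimate diag(H) via Hutchinson's
# identity E[z * (Hz)] = diag(H) for Rademacher probe vectors z.
import torch


def hutchinson_diag(loss, params, n_samples=1):
    """Approximate the Hessian diagonal of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # z in {-1, +1}
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hvp in zip(diag, zs, hvps):
            d.add_(z * hvp / n_samples)  # running average of z * (Hz)
    return diag
```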

Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information [article]

Majid Jahani, Sergey Rusakov, Zheng Shi, Peter Richtárik, Michael W. Mahoney, Martin Takáč
2021 arXiv   pre-print
We present a novel adaptive optimization algorithm for large-scale machine learning problems.  ...  first-order and second-order methods.  ...  Michael Mahoney would like to acknowledge the US NSF and ONR via its BRC on RandNLA for providing partial support of this work.  ... 
arXiv:2109.05198v1 fatcat:uh5yudwqarbztf3tafsjkuhfte

Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization [article]

Xuezhe Ma
2021 arXiv   pre-print
Importantly, the update and storage of the diagonal approximation of the Hessian is as efficient as adaptive first-order optimization methods, with linear complexity in both time and memory.  ...  In this paper, we introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal  ...  methods, and Agarwal et al. (2019) proposed an efficient method for full-matrix adaptive regularization. Stochastic Second-Order Hessian-Free Methods.  ... 
arXiv:2009.13586v6 fatcat:amo5fj3uingldbsnvr5ubl6dpq
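For illustration, a generic diagonal quasi-Newton update based on the weak secant equation s^T B s = s^T y; this sketches the kind of linear-time, linear-memory diagonal curvature update the abstract refers to, but it is not claimed to be Apollo's exact rule, and the names below are hypothetical.

```python
# Generic weak-secant diagonal quasi-Newton sketch (illustrative, not Apollo's exact update).
import numpy as np


def update_diag_hessian(B_diag, s, y, eps=1e-12):
    """Update diagonal curvature B_diag from step s = x_t - x_{t-1} and y = g_t - g_{t-1}."""
    gap = s @ y - s @ (B_diag * s)                        # violation of the weak secant condition
    return B_diag + gap / (np.sum(s**4) + eps) * s**2     # spread correction over coordinates
```

A step would then rectify the curvature before use, e.g. d = -g / np.maximum(np.abs(B_diag), delta) for some small threshold delta.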

Adaptive Second Order Coresets for Data-efficient Machine Learning [article]

Omead Pooladzandi, David Davini, Baharan Mirzasoleiman
2022 arXiv   pre-print
We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by AdaCore.  ...  We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning.  ...  Acknowledgements This research was supported in part by UCLA-Amazon Science Hub for Humanity and Artificial Intelligence.  ... 
arXiv:2207.13887v1 fatcat:5cbipmcdkrcnxpn4sm74rejo5y
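As a rough illustration of the "subsets that preserve curvature-aware gradient information" idea, the hypothetical sketch below greedily picks examples whose summed Hessian-preconditioned gradients track the full-data preconditioned gradient; the actual AdaCore selection (weighted subsets chosen via submodular optimization) is more involved.

```python
# Simplified coreset-selection sketch: greedily pick points whose summed
# curvature-scaled per-example gradients best match the full-data gradient.
# This only illustrates the general idea, not the AdaCore algorithm.
import numpy as np


def greedy_coreset(per_example_grads, hessian_diag, k):
    """per_example_grads: (n, d); hessian_diag: (d,) curvature preconditioner; k: subset size."""
    precond = per_example_grads / (hessian_diag + 1e-8)   # curvature-aware per-example vectors
    target = precond.sum(axis=0)                          # full-data preconditioned gradient
    selected, current = [], np.zeros_like(target)
    for _ in range(k):
        # Residual norm if each candidate were added next
        residuals = np.linalg.norm(target - (current + precond), axis=1)
        residuals[selected] = np.inf                      # do not pick the same point twice
        best = int(np.argmin(residuals))
        selected.append(best)
        current = current + precond[best]
    return selected
```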

Better SGD using Second-order Momentum [article]

Hoang Tran, Ashok Cutkosky
2021 arXiv   pre-print
We develop a new algorithm for non-convex stochastic optimization that finds an ϵ-critical point in the optimal O(ϵ^-3) stochastic gradient and Hessian-vector product computations.  ...  In contrast to prior work, we do not require excessively large batch sizes, and are able to provide an adaptive algorithm whose convergence rate automatically improves with decreasing variance in the gradient  ...  Introduction First-order algorithms such as Stochastic Gradient Descent (SGD) or Adam (Kingma & Ba, 2014) have emerged as the main workhorse for modern Machine Learning (ML) tasks.  ... 
arXiv:2103.03265v2 fatcat:gthbvysbk5b4jdslhdfsjzzlbi
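One common way to write a Hessian-corrected momentum estimator, shown here as a hedged sketch rather than the paper's exact algorithm: the old momentum is "transported" to the new iterate with a Hessian-vector product before being mixed with the fresh gradient. All names and the value of beta below are illustrative.

```python
# Sketch of Hessian-corrected momentum (illustrative; the paper's estimator,
# step sizes, and adaptivity are not reproduced here).
import torch


def hvp(loss, params, vecs):
    """Hessian-vector products H @ v via double backward."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.autograd.grad(grads, params, grad_outputs=vecs)


def corrected_momentum(momentum, grads, hvps, beta=0.9):
    """m_t = (1 - beta) * g_t + beta * (m_{t-1} + H_t (x_t - x_{t-1})), per parameter tensor."""
    return [(1.0 - beta) * g + beta * (m + h) for m, g, h in zip(momentum, grads, hvps)]
```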

Adaptive Second Order Coresets for Data-efficient Machine Learning

Omead Pooladzandi, David Davini, Baharan Mirzasoleiman
2022 International Conference on Machine Learning  
We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by ADACORE.  ...  We propose ADACORE, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning.  ...  Acknowledgements This research was supported in part by UCLA-Amazon Science Hub for Humanity and Artificial Intelligence.  ... 
dblp:conf/icml/PooladzandiDM22 fatcat:27h5y6pgwnfnzgmm4cujipl45y

AdaCN: An Adaptive Cubic Newton Method for Nonconvex Stochastic Optimization

Yan Liu, Maojun Zhang, Zhiwei Zhong, Xiangrong Zeng, Paolo Gastaldo
2021 Computational Intelligence and Neuroscience  
In this work, we introduce AdaCN, a novel adaptive cubic Newton method for nonconvex stochastic optimization.  ...  It requires only first-order gradients, and its updates have linear complexity in both time and memory.  ...  Introduction Stochastic gradient descent (SGD) [1] is the workhorse method for nonconvex stochastic optimization in machine learning, particularly for training deep neural networks (DNNs).  ... 
doi:10.1155/2021/5790608 pmid:34804146 pmcid:PMC8598341 fatcat:wlftz755hbc2reqibqyk7aluj4
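To make the "cubic Newton with linear complexity" idea concrete, here is a hedged per-coordinate sketch: with a diagonal curvature estimate b and cubic penalty sigma, the model g*s + 0.5*b*s^2 + (sigma/3)*|s|^3 has a closed-form coordinate-wise minimizer. This is not AdaCN's exact update; its moment estimates and adaptive penalty are omitted.

```python
# Coordinate-wise cubic-regularized step (illustrative sketch, not AdaCN itself).
import numpy as np


def cubic_step(g, b_diag, sigma=1.0):
    """Minimize g*s + 0.5*b*s^2 + (sigma/3)*|s|^3 in each coordinate and return the step."""
    r = (-b_diag + np.sqrt(b_diag**2 + 4.0 * sigma * np.abs(g))) / (2.0 * sigma)
    return -np.sign(g) * r
```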

Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction [article]

Yun Yue, Yongchao Liu, Suo Tong, Minghao Li, Zhen Zhang, Chunyang Wen, Huanjun Bao, Lihong Gu, Jinjie Gu, Yixiang Mu
2021 arXiv   pre-print
We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, AdaHessian, and create a new class of optimizers, which are named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad and Group AdaHessian, etc., accordingly.  ...  Adaptive Optimization Methods Due to their simplicity and effectiveness, adaptive optimization methods [20, 17, 4, 27, 8, 19, 26] have become the de-facto standard algorithms used in deep learning.  ... 
arXiv:2107.14432v1 fatcat:nn4cpgvfvzdllmqzkbhond7yqe
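The group-sparsity mechanism can be pictured as a group soft-thresholding (proximal) step on each parameter group; the sketch below is a plain proximal step with hypothetical names, whereas the paper folds the regularizer into the adaptive updates themselves.

```python
# Group soft-thresholding sketch (illustrative; the paper integrates the
# sparse-group-lasso regularizer directly into the adaptive optimizer update).
import numpy as np


def group_prox(w, groups, lam, lr):
    """Shrink each index group of w toward zero; a group with small norm is zeroed entirely."""
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] *= max(0.0, 1.0 - lr * lam / (norm + 1e-12))
    return w
```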

LogGENE: A smooth alternative to check loss for Deep Healthcare Inference Tasks [article]

Aryaman Jeendgar, Aditya Pola, Soma S Dhavala, Snehanshu Saha
2022 arXiv   pre-print
In our work, we develop methods for Gene Expression Inference based on Deep neural networks.  ...  We adopt the Quantile Regression framework to predict full conditional quantiles for a given set of housekeeping gene expressions.  ...  AdaHessian (Yao et al. [2021]), in the Deep Learning context, is one promising second-order method.  ... 
arXiv:2206.09333v1 fatcat:iit7y3wntffwrkorhodku5dnt4
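For context, the check (pinball) loss used in quantile regression is non-smooth at zero; a standard smooth surrogate is shown below as an assumption-laden sketch (the paper's specific smooth loss may take a different form).

```python
# Exact pinball loss and one standard smooth surrogate (illustrative only).
import torch
import torch.nn.functional as F


def check_loss(u, tau):
    """Exact pinball loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return torch.maximum(tau * u, (tau - 1.0) * u)


def smooth_check_loss(u, tau, kappa=0.1):
    """Smooth approximation tau*u + kappa*softplus(-u/kappa); recovers the check loss as kappa -> 0."""
    return tau * u + kappa * F.softplus(-u / kappa)
```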

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information [article]

Elias Frantar, Eldar Kurtic, Dan Alistarh
2021 arXiv   pre-print
These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.  ...  The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction  ...  Second Order Comparison.  ... 
arXiv:2107.03356v5 fatcat:qmvcglffezcrbcfqy26z2q3dhi
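A simplified, hedged version of the inverse-Fisher-vector product idea: with a window of m stored gradients G (m x d) and F = lam*I + (1/m) G^T G, the Woodbury identity gives F^{-1} v from an m x m solve, without ever forming the d x d matrix. M-FAC's recursive formulas avoid even this solve; the function name below is hypothetical.

```python
# Matrix-free inverse-(empirical Fisher)-vector product via Woodbury (simplified sketch).
import numpy as np


def ifvp(G, v, lam=1e-3):
    """G: (m, d) window of gradients, v: (d,). Returns (lam*I + G^T G / m)^{-1} v."""
    m = G.shape[0]
    small = lam * m * np.eye(m) + G @ G.T      # (m, m) system, cheap since m << d
    return (v - G.T @ np.linalg.solve(small, G @ v)) / lam
```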

Neural Nets with a Newton Conjugate Gradient Method on Multiple GPUs [article]

Severin Reiz, Tobias Neckel, Hans-Joachim Bungartz
2022 arXiv   pre-print
Our goal is (1) to enhance this by enabling second-order optimization methods with fewer hyperparameters for large-scale neural networks and (2) to perform a survey of the performance of optimizers for specific  ...  For the largest setup, we efficiently parallelized the optimizers with Horovod and applied them to an 8-GPU NVIDIA P100 (DGX-1) machine.  ...  AdaHessian uses the Hutchinson method for adapting the learning rate [7], other work involves inexact Newton methods for neural networks [8] or a comparison of optimizers [9].  ... 
arXiv:2208.02017v1 fatcat:hjnc3fyyvbfshm5fot2io7k6fu
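The Newton-CG idea this entry refers to can be sketched in a few lines: solve the damped Newton system (H + damping*I) d = -g with conjugate gradients, touching H only through Hessian-vector products. The code below is a single-process illustration; the paper's Horovod/multi-GPU parallelization is not shown.

```python
# Hessian-free Newton-CG sketch: only Hessian-vector products are needed.
import numpy as np


def newton_cg_direction(hvp, grad, damping=1e-2, max_iter=50, tol=1e-6):
    """hvp(v) returns H @ v; returns an approximate solution of (H + damping*I) d = -grad."""
    d = np.zeros_like(grad)
    r = -grad.copy()                      # residual at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = hvp(p) + damping * p
        alpha = rs_old / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d
```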

Dual Averaging is Surprisingly Effective for Deep Learning Optimization [article]

Samy Jelassi, Aaron Defazio
2020 arXiv   pre-print
First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks.  ...  However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance.  ...  Gower and Michael Rabbat for helpful discussions and Anne Wu for her precious help in setting some numerical experiments.  ... 
arXiv:2010.10502v1 fatcat:6j6ytgrjqjd3jhox7r3dckskh4
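For reference, a minimal unconstrained dual-averaging loop, assuming a numpy gradient oracle `grad_fn`: each iterate is computed from the initial point and the running gradient sum rather than from the previous iterate. The weighting and momentum used in the paper's modernized variant are omitted.

```python
# Minimal unconstrained dual averaging (illustrative sketch).
import numpy as np


def dual_averaging(x0, grad_fn, steps, gamma=0.1):
    """x_{t+1} = x0 - gamma / sqrt(t+1) * sum_{s<=t} g_s."""
    x, grad_sum = x0.copy(), np.zeros_like(x0)
    for t in range(steps):
        grad_sum += grad_fn(x)                       # stochastic gradient at the current iterate
        x = x0 - gamma / np.sqrt(t + 1) * grad_sum   # step taken from the *initial* point
    return x
```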

Low Rank Saddle Free Newton: A Scalable Method for Stochastic Nonconvex Optimization [article]

Thomas O'Leary-Roseberry, Nick Alger, Omar Ghattas
2021 arXiv   pre-print
Additionally, due to perceived costs of forming and factorizing Hessians, second order methods are not used for these problems.  ...  on large scale deep learning tasks in terms of generalizability for equivalent computational work.  ...  For first order methods, these two effects can be easily decoupled, but as we will show there is an interplay between optimization geometry and stochastic errors for stochastic second order methods.  ... 
arXiv:2002.02881v3 fatcat:u3us3hyhkjedxbr22je3nvq6d4
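A toy dense sketch of the low-rank saddle-free Newton step: keep the dominant eigenpairs of a Hessian approximation, replace their eigenvalues by absolute values plus damping, and damp the orthogonal complement. The paper constructs the low-rank factors matrix-free at scale; everything below (names, rank, damping) is illustrative.

```python
# Toy dense low-rank saddle-free Newton step: (V|L|V^T + damping*I)^{-1} g.
import numpy as np


def lrsfn_step(H, g, rank=10, damping=1e-1):
    eigvals, eigvecs = np.linalg.eigh(H)
    idx = np.argsort(-np.abs(eigvals))[:rank]         # dominant eigenpairs by magnitude
    lam, V = eigvals[idx], eigvecs[:, idx]
    coeff = V.T @ g
    # |lambda| + damping on the captured subspace, plain damping on the complement
    step = V @ (coeff / (np.abs(lam) + damping)) + (g - V @ coeff) / damping
    return -step
```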

Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers [article]

Robin M. Schmidt, Frank Schneider, Philipp Hennig
2021 arXiv   pre-print
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods.  ...  To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices.  ...  Introduction Large-scale stochastic optimization drives a wide variety of machine learning tasks.  ... 
arXiv:2007.01547v6 fatcat:643sm5zgtremjnuyvqq5jqsr7q

FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid? [article]

Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha
2022 arXiv   pre-print
We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture.  ...  We also thank Xiaorun Wu for initial discussions.  ... 
arXiv:2205.11656v1 fatcat:msq4drt2rbg7pmiinf6ygxsrvq
Showing results 1 – 15 of 21