544 Hits in 8.4 sec

Bad Global Minima Exist and SGD Can Reach Them [article]

Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas
2021 arXiv   pre-print
The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization.  ...  We find that if we do not regularize explicitly, then SGD can be easily made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling of the data,  ...  Acknowledgements Dimitris Papailiopoulos is supported by an NSF CAREER Award #1844951, two Sony Faculty Innovation Awards, an AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, and an NSF TRIPODS  ... 
arXiv:1906.02613v2 fatcat:bh3pgrt3jvddxhknmreimoipl4
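A minimal sketch of the two-phase recipe described in the snippet above (fit a random labeling first, then train on the true labels, with no explicit regularization). The tiny MLP, synthetic data, and hyperparameters are illustrative assumptions, not the paper's setup:

    import torch
    import torch.nn as nn

    def make_model():
        return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 2))

    def train(model, x, y, steps, lr=0.1):
        # plain SGD: no weight decay, no augmentation (i.e., no explicit regularization)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        return loss.item()

    torch.manual_seed(0)
    x = torch.randn(512, 20)
    y_true = (x[:, 0] > 0).long()           # placeholder "real" labels
    y_rand = torch.randint(0, 2, (512,))    # random labeling of the same inputs

    model = make_model()
    train(model, x, y_rand, steps=500)            # phase 1: memorize the random labeling
    final = train(model, x, y_true, steps=500)    # phase 2: continue on the true labels
    print("final training loss on true labels:", final)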

Distributed Gradient Methods for Nonconvex Optimization: Local and Global Convergence Guarantees [article]

Brian Swenson, Soummya Kar, H. Vincent Poor, José M. F. Moura, Aaron Jaech
2020 arXiv   pre-print
The article discusses distributed gradient-descent algorithms for computing local and global minima in nonconvex optimization.  ...  For global optimization, we discuss annealing-based methods in which slowly decaying noise is added to D-SGD. Conditions are discussed under which convergence to global minima is guaranteed.  ...  Under what conditions can D-SGD be guaranteed to converge to local minima (or not converge to saddle points)? 2. Can simple variants of D-SGD converge to global minima?  ... 
arXiv:2003.10309v2 fatcat:fxwhaljvu5e3fbasgznjnnooqe
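A rough sketch of the annealing idea mentioned in the snippet: decentralized gradient descent in which each agent mixes with its neighbors and slowly decaying noise is added to the update. The ring topology, local objectives, and decay schedules are assumptions chosen only for illustration:

    import numpy as np

    n_agents, dim = 6, 2
    rng = np.random.default_rng(0)

    def grad_f(i, x):
        # local nonconvex objective f_i(x) = 0.25 * ||x - c_i||^4 (placeholder)
        c = np.array([np.cos(i), np.sin(i)])
        return np.linalg.norm(x - c) ** 2 * (x - c)

    # ring-topology mixing matrix (doubly stochastic)
    W = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        W[i, i] = 0.5
        W[i, (i - 1) % n_agents] = 0.25
        W[i, (i + 1) % n_agents] = 0.25

    X = rng.normal(size=(n_agents, dim))
    for t in range(1, 2001):
        step = 1.0 / t                             # decaying step size
        noise_std = 1.0 / np.sqrt(np.log(t + 2))   # slowly decaying (annealed) noise
        grads = np.stack([grad_f(i, X[i]) for i in range(n_agents)])
        noise = noise_std * rng.normal(size=X.shape)
        X = W @ X - step * (grads + noise)         # mix with neighbors, descend, perturb

    print("agent iterates after annealed D-SGD:")
    print(np.round(X, 3))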

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data [article]

Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz
2017 arXiv   pre-print
Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size.  ...  Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model.  ...  Acknowledgments This research is supported by the Blavatnik Computer Science Research Fund and the European Research Council (TheoryDL project).  ... 
arXiv:1710.10174v1 fatcat:y7z4a3hdtfhxni42jcf2vi7rhq
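A toy sketch of the setting in this snippet: an over-parameterized one-hidden-layer network trained with SGD on linearly separable data. The Leaky-ReLU activation, fixed output layer, hinge loss, and all sizes below only loosely follow the paper and should be read as assumptions:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d, width, n = 10, 200, 400
    w_star = torch.randn(d)
    x = torch.randn(n, d)
    y = torch.sign(x @ w_star)                      # linearly separable labels in {-1, +1}

    W = torch.randn(width, d, requires_grad=True)   # trainable hidden-layer weights
    v = torch.cat([torch.ones(width // 2), -torch.ones(width // 2)])  # fixed output layer

    opt = torch.optim.SGD([W], lr=0.01)
    for epoch in range(50):
        for i in torch.randperm(n):                 # one example per SGD step
            score = F.leaky_relu(W @ x[i], 0.1) @ v
            loss = F.relu(1.0 - y[i] * score)       # hinge loss
            opt.zero_grad()
            loss.backward()
            opt.step()

    with torch.no_grad():
        pred = torch.sign(F.leaky_relu(x @ W.t(), 0.1) @ v)
        print("training accuracy:", (pred == y).float().mean().item())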

The Global Landscape of Neural Networks: An Overview [article]

Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, R Srikant
2020 arXiv   pre-print
Second, we discuss a few rigorous results on the geometric properties of wide networks such as "no bad basin", and some modifications that eliminate sub-optimal local minima and/or decreasing paths to  ...  In this article, we review recent findings and results on the global landscape of neural networks.  ...  How to leverage the insight obtained from the theory to design better methods/architectures is also an interesting question.  ... 
arXiv:2007.01429v1 fatcat:4j4qnvsfdfeirp4fvxjbszgaxu

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

Gang Wang, Georgios B. Giannakis, Jie Chen
2019 IEEE Transactions on Signal Processing  
Leveraging the power of random noise perturbation, this paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.  ...  It is evident that the developed Algorithm 1 trained all considered ReLU networks to global optimality, while plain-vanilla SGD can get stuck with bad local minima, for small k in particular.  ... 
doi:10.1109/tsp.2019.2904921 fatcat:p2cshe3w3vbx3lqlchkxzuspby
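The paper's Algorithm 1 is not reproduced in the snippet, so the following is only a generic illustration of the noise-perturbation idea it alludes to: perturbed gradient descent on a toy saddle, where a random kick is applied whenever the gradient becomes small. The toy loss, threshold, and perturbation radius are all assumed:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(w):
        # toy nonconvex loss f(w) = w0^2 - w1^2 + w1^4, with a saddle point at the origin
        return np.array([2 * w[0], -2 * w[1] + 4 * w[1] ** 3])

    w = np.array([1.0, 0.0])     # on the saddle's attracting manifold: plain GD keeps w1 = 0 forever
    lr, radius, threshold = 0.05, 0.1, 1e-3
    last_kick = -10 ** 9
    for t in range(500):
        g = grad(w)
        if np.linalg.norm(g) < threshold and t - last_kick > 50:
            w = w + radius * rng.normal(size=2)   # random perturbation near a flat/critical region
            last_kick = t
        w = w - lr * g

    print("final iterate:", np.round(w, 3), " (minima of the toy loss sit at w1 = ±1/sqrt(2))")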

Explorations on high dimensional landscapes [article]

Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun
2015 arXiv   pre-print
We finally observe that both gradient descent and stochastic gradient descent can reach this level within the same number of steps.  ...  Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science.  ...  ACKNOWLEDGEMENTS We thank David Belius for valuable discussions, Taylan Cemgil and Atilla Yılmaz for valuable feedback, and reviewers for valuable suggestions.  ... 
arXiv:1412.6615v4 fatcat:oechrrjw4fd7zjxdm3pjtd6uqm

Gradient Omissive Descent is a Minimization Algorithm

Gustavo A. Lado, Enrique C. Segura
2019 International Journal on Soft Computing, Artificial Intelligence and Applications  
The method requires no manual selection of global hyperparameters and is capable of dynamic local adaptations using only first-order information at a low computational cost.  ...  Its semistochastic nature makes it fit for mini-batch training and robust to different architecture choices and data distributions.  ...  Special thanks to her for the invitation and to the Emerging Leaders in the Americas Program (ELAP) of the Canada's government for making that visit possible.  ... 
doi:10.5121/ijscai.2019.8103 fatcat:ei4xh3rn2bdbfgfg36imat3kgu

On the loss landscape of a class of deep neural networks with no bad local valleys [article]

Quynh Nguyen, Mahesh Chandra Mukkamala, Matthias Hein
2018 arXiv   pre-print
We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero.  ...  Due to this property, we do not study the global minima of the cross-entropy loss but the question if and how one can achieve zero training error.  ... 
arXiv:1809.10749v2 fatcat:zvv47luo5bfcvajcnvpzyba5pm
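Stated symbolically, the "no bad local valley" property quoted above can be written as follows (with L the cross-entropy training loss and Θ the parameter space; this is just a restatement of the snippet, not a result taken from the paper's formal statement):

    \forall\, \theta_0 \in \Theta \;\; \exists\, c \in C\bigl([0,\infty), \Theta\bigr)
    \text{ with } c(0) = \theta_0 \text{ such that }
    t \mapsto L\bigl(c(t)\bigr) \text{ is non-increasing and }
    \inf_{t \ge 0} L\bigl(c(t)\bigr) = 0 .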

How noise affects the Hessian spectrum in overparameterized neural networks [article]

Mingwei Wei, David J Schwab
2019 arXiv   pre-print
We test our results with experiments on toy models and deep neural networks.  ...  Stochastic gradient descent (SGD) forms the core optimization method for deep neural networks.  ...  ACKNOWLEDGMENTS We thank Boris Hanin, Sho Yaida, and Dan Roberts for valuable discussions.  ... 
arXiv:1910.00195v2 fatcat:5ynvf4n64ne7nb4vmgoamiiagi
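A small sketch, in the spirit of the toy-model experiments mentioned in the snippet, of how one might inspect the loss Hessian's eigenvalue spectrum at the end of training an overparameterized model. The tiny tanh network, data sizes, and step counts are assumptions:

    import torch

    torch.manual_seed(0)
    n, d, width = 8, 3, 20                      # fewer samples than parameters
    x = torch.randn(n, d)
    y = torch.randn(n, 1)

    def loss_fn(params):
        w1 = params[: d * width].reshape(width, d)
        w2 = params[d * width:].reshape(1, width)
        pred = torch.tanh(x @ w1.t()) @ w2.t()
        return ((pred - y) ** 2).mean()

    P = d * width + width
    theta = (0.1 * torch.randn(P)).requires_grad_()
    opt = torch.optim.SGD([theta], lr=0.1)      # full-batch gradient steps on the toy problem
    for _ in range(2000):
        opt.zero_grad()
        loss_fn(theta).backward()
        opt.step()

    # Hessian of the training loss at the final iterate, and its spectrum
    H = torch.autograd.functional.hessian(loss_fn, theta.detach())
    eigs = torch.linalg.eigvalsh(H)
    print("largest eigenvalues:", [round(v, 4) for v in eigs[-5:].tolist()])
    print("eigenvalues with |lambda| < 1e-4:", int((eigs.abs() < 1e-4).sum()))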

A Walk with SGD [article]

Chen Xing, Devansh Arpit, Christos Tsirigotis, Yoshua Bengio
2018 arXiv   pre-print
Specifically, we study the DNN loss surface along the trajectory of SGD by interpolating the loss surface between parameters from consecutive iterations and tracking various metrics during training.  ...  Based on this and other metrics, we deduce that for most of the training update steps, SGD moves in valley-like regions of the loss surface by jumping from one valley wall to another at a height above  ...  loss with good generalization despite the existence of numerous bad minima.  ... 
arXiv:1802.08770v4 fatcat:m5tnke5zvjhlbgw2jnq2c7cafi
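A sketch of the diagnostic described in the snippet: after each SGD update, evaluate the training loss along the straight line between the previous and the new parameter vector. The model, data, learning rate, and number of interpolation points are illustrative assumptions:

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(256, 10)
    y = torch.randint(0, 2, (256,))
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    def flat_params(m):
        return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

    def loss_at(theta):
        # load a flat parameter vector into a throwaway copy of the model
        m = copy.deepcopy(model)
        torch.nn.utils.vector_to_parameters(theta, m.parameters())
        with torch.no_grad():
            return loss_fn(m(x), y).item()

    for step in range(5):
        theta_prev = flat_params(model)
        idx = torch.randint(0, 256, (32,))          # one mini-batch SGD step
        opt.zero_grad()
        loss_fn(model(x[idx]), y[idx]).backward()
        opt.step()
        theta_new = flat_params(model)
        line = [loss_at(theta_prev + a * (theta_new - theta_prev))
                for a in torch.linspace(0, 1, 11)]
        print(f"step {step}: loss along the interpolation line:",
              [round(v, 3) for v in line])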

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation [article]

Binghui Chen, Weihong Deng, Junping Du
2017 arXiv   pre-print
In this paper, we first emphasize that the early saturation behavior of softmax will impede the exploration of SGD, which sometimes is a reason for the model converging at a bad local minimum, then propose  ...  and help to find a better local minimum.  ...  Adding annealed noise can help the solver escape from a bad local minimum and find a better one. We follow these inspiring ideas to address individual saturation and encourage SGD to explore more.  ... 
arXiv:1708.03769v1 fatcat:5sjoz4rxjbc2vezy7hygea5c5i
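One simple way to realize the idea in the snippet (not necessarily the paper's exact formulation): during training, subtract annealed, non-negative Gaussian noise from the logit of the true class before the softmax so that it does not saturate early; the noise scale decays to zero over training. The schedule and toy usage below are assumptions:

    import torch
    import torch.nn.functional as F

    def noisy_softmax_loss(logits, targets, noise_scale):
        # cross-entropy with annealed, non-negative noise subtracted from the true-class logit
        if noise_scale > 0:
            noise = noise_scale * torch.randn(logits.size(0), device=logits.device).abs()
            onehot = F.one_hot(targets, num_classes=logits.size(1)).float()
            logits = logits - onehot * noise.unsqueeze(1)
        return F.cross_entropy(logits, targets)

    # toy usage with a linearly annealed noise scale (assumed schedule)
    torch.manual_seed(0)
    logits = torch.randn(4, 10, requires_grad=True)
    targets = torch.tensor([1, 3, 5, 7])
    for epoch in range(3):
        scale = 1.0 - epoch / 3.0                  # anneal the injected noise toward zero
        loss = noisy_softmax_loss(logits, targets, scale)
        loss.backward()
        print(f"epoch {epoch}: noise scale {scale:.2f}, loss {loss.item():.3f}")
        logits.grad = None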

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

Binghui Chen, Weihong Deng, Junping Du
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
In this paper, we first emphasize that the early saturation behavior of softmax will impede the exploration of SGD, which sometimes is a reason for the model converging at a bad local minimum, then propose  ...  and help to find a better local minimum.  ...  Adding annealed noise can help the solver escape from a bad local minimum and find a better one. We follow these inspiring ideas to address individual saturation and encourage SGD to explore more.  ... 
doi:10.1109/cvpr.2017.428 dblp:conf/cvpr/ChenDD17 fatcat:mzqopc2vorhbdlmmnwal6nibrq

Generalization Performance of Empirical Risk Minimization on Over-parameterized Deep ReLU Nets [article]

Shao-Bo Lin, Yao Wang, Ding-Xuan Zhou
2021 arXiv   pre-print
Since over-parameterization is crucial to guarantee that the global minima of ERM on deep ReLU nets can be realized by the widely used stochastic gradient descent (SGD) algorithm, our results indeed fill  ...  Using a novel deepening scheme for deep ReLU nets, we rigorously prove that there exist perfect global minima achieving almost optimal generalization error bounds for numerous types of data under mild  ...  Due to the nonlinear nature of (2.2), we rigorously prove the existence of bad minima and perfect global ones.  ... 
arXiv:2111.14039v2 fatcat:5ozxnksdyjflrhbadi4kuptvhe

Stochastic Training is Not Necessary for Generalization [article]

Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
2022 arXiv   pre-print
In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures.  ...  To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline.  ...  training via navigating the optimization landscape, finding global minima, and avoiding bad local minima and saddle points.  ... 
arXiv:2109.14119v2 fatcat:izkob2pvcfefhaqospgdzjnr7e
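A sketch of replacing SGD's implicit regularization with an explicit one, in the spirit of the snippet: full-batch gradient descent on the training loss plus a penalty on the squared gradient norm. The specific penalty, its coefficient, and the toy model are generic stand-ins taken from the gradient-penalty regularizers discussed in this literature, not a claim about the paper's exact recipe:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(256, 10)
    y = torch.randint(0, 2, (256,))
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    lam = 0.01                                     # regularization strength (assumed)

    for step in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), y)                # full batch: no mini-batch sampling noise
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        penalty = sum((g ** 2).sum() for g in grads)   # explicit ||grad L||^2 penalty
        (loss + lam * penalty).backward()
        opt.step()

    print("final full-batch training loss:", loss_fn(model(x), y).item())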

Deep Networks on Toroids: Removing Symmetries Reveals the Structure of Flat Regions in the Landscape Geometry [article]

Fabrizio Pittorino, Antonio Ferraro, Gabriele Perugini, Christoph Feinauer, Carlo Baldassi, Riccardo Zecchina
2022 arXiv   pre-print
are closer to each other and that the barriers along the geodesics connecting them are small.  ...  This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them.  ...  Therefore, in practical applications, bad minima are seldom reported or observed, even though they exist in the landscape.  ... 
arXiv:2202.03038v2 fatcat:wjpmbduepzhbxg4m677ia2w6xm
Showing results 1 — 15 out of 544 results