18,068 Hits in 7.7 sec

Distributed SGD Generalizes Well Under Asynchrony [article]

Jayanth Regatti, Gaurav Tendolkar, Yi Zhou, Abhishek Gupta, Yingbin Liang
2019 arXiv   pre-print
Such an adaptive learning rate strategy improves the stability of the distributed algorithm and reduces the corresponding generalization error.  ...  In particular, our results suggest reducing the learning rate as we allow more asynchrony in the distributed system.  ...  We use a batch size of 64 per worker, and the gradient updates are performed using (1).  ... 
arXiv:1909.13391v1 fatcat:26e2nk5xdjeannnnrpz765clnu
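The entry above describes damping the learning rate as asynchrony (gradient staleness) grows. A minimal sketch of that idea, assuming an illustrative 1/(1 + staleness) scaling rule rather than the paper's exact schedule:

```python
def adaptive_lr(eta0, staleness):
    """Shrink the step size as gradient staleness grows; the 1/(1 + staleness)
    rule is an illustrative choice, not the paper's exact schedule."""
    return eta0 / (1.0 + staleness)

def async_sgd_step(w, grad, eta0, staleness):
    """One asynchronous worker update applying a possibly stale gradient."""
    eta = adaptive_lr(eta0, staleness)
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# A gradient delayed by 3 steps gets a quarter of the fresh step size.
w = async_sgd_step([1.0, -2.0], [0.5, 0.5], eta0=0.1, staleness=3)
```

Fresh gradients (staleness 0) keep the full step size, so the rule reduces to plain SGD in the synchronous limit.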

Is SGD a Bayesian sampler? Well, almost [article]

Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis
2020 arXiv   pre-print
A function probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed new light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser  ...  Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error.  ...  Changing batch size and learning rate In a well-known study, Keskar et al. (2016) showed that, for a fixed learning rate, using smaller batch sizes could lead to better generalisation.  ... 
arXiv:2006.15191v2 fatcat:4z6qvgaeizbvhojnl7n364ywai

Understanding Why Neural Networks Generalize Well Through GSNR of Parameters [article]

Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang
2020 arXiv   pre-print
As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored many aspects of why they generalize well.  ...  Based on several approximations, we establish a quantitative relationship between model parameters' GSNR and the generalization gap.  ...  Previous work has used GSNR for theoretical analysis of deep learning.  ... 
arXiv:2001.07384v2 fatcat:dhgmajab5jdrrng3w7vf4ywjfu
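The GSNR mentioned above is, per parameter, the squared mean of the per-sample gradients divided by their variance. A small self-contained sketch (the toy gradient values are illustrative):

```python
def gsnr(per_sample_grads):
    """Gradient signal-to-noise ratio of a single parameter: the squared mean of
    its per-sample gradients divided by their variance across samples."""
    n = len(per_sample_grads)
    mean = sum(per_sample_grads) / n
    var = sum((g - mean) ** 2 for g in per_sample_grads) / n
    return (mean ** 2) / var if var > 0 else float("inf")

aligned = gsnr([1.0, 1.1, 0.9, 1.0])    # samples agree on direction: high GSNR
noisy = gsnr([1.0, -1.1, 0.9, -0.8])    # samples conflict: GSNR near zero
```

High GSNR means the training examples agree on how the parameter should move, which is the quantity the paper links to a small generalization gap.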

Sequential Tests for Large-Scale Learning

Anoop Korattikara, Yutian Chen, Max Welling
2016 Neural Computation  
The statistical properties of this subsampling process can be used to control the efficiency and accuracy of learning or inference.  ...  We argue that when faced with big data sets, learning and inference algorithms should compute updates using only subsets of data items.  ...  Since the acceptance rate approaches 1 as α goes to 0, we can keep the bias under control by keeping α small. However, we have to use a reasonably large α to keep the mixing rate high.  ... 
doi:10.1162/neco_a_00796 pmid:26599710 fatcat:d3mq76obmrab5kdxbi6heu23i4
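The subsampling idea above can be sketched as a sequential test that reads the data in growing subsets and stops as soon as a decision is statistically safe. The normal-approximation test and the names below are illustrative assumptions, not the authors' exact procedure:

```python
import math
import random
import statistics

def sequential_accept(diffs, mu0, batch=10, z=1.96):
    """Decide whether the mean of `diffs` exceeds mu0, reading data in batches
    and stopping early once the decision is clear at roughly the 5% level."""
    seen = []
    for i in range(0, len(diffs), batch):
        seen.extend(diffs[i:i + batch])
        if len(seen) < 2:
            continue
        mean = statistics.fmean(seen)
        sem = statistics.stdev(seen) / math.sqrt(len(seen))
        if sem == 0.0 or abs(mean - mu0) / sem > z:
            return mean > mu0, len(seen)            # early, confident decision
    return statistics.fmean(seen) > mu0, len(seen)  # fell back to all the data

random.seed(0)
diffs = [1.0 + random.gauss(0.0, 0.1) for _ in range(1000)]
accept, n_used = sequential_accept(diffs, mu0=0.0)
```

Because the true mean sits far from the threshold here, the test decides after the first batch of 10 items instead of touching all 1,000, which is exactly the efficiency/accuracy trade the snippet describes.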

Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders [article]

T. Anderson Keller, Qinghe Gao, Max Welling
2021 arXiv   pre-print
... and discuss both theoretical and empirical similarities.  ...  with objects or other generic stimuli.  ...  Acknowledgments and Disclosure of Funding We would like to thank the reviewers for providing helpful constructive feedback, and the organizers of the workshop for their service.  ... 
arXiv:2110.13911v2 fatcat:qp3k5rek2zhgtisxtwtxj6droi

Well placement optimization using imperialist competitive algorithm

Mohammad A. Al Dossary, Hadi Nasrabadi
2016 Journal of Petroleum Science and Engineering  
The empires then compete with each other, which causes the weak empires to collapse and the powerful empires to dominate and take over their colonies.  ...  parameters generally led to acceptable performances in our examples.  ...  placement, it would be beneficial for the framework to couple well placement with the optimization of well control parameters (flow rates/bottomhole pressures). • To generate a multiobjective Pareto surface  ... 
doi:10.1016/j.petrol.2016.06.017 fatcat:67bamanmgbgs3oqgtp7vp5zogm
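For context, one generation of the imperialist competitive algorithm (assimilation, possible takeover, and imperialistic competition) can be sketched on a toy 1-D cost function. All names, parameters, and the cost function here are illustrative, not the paper's well-placement setup:

```python
import random

def ica_generation(empires, cost, beta=2.0, seed=0):
    """One ICA generation: colonies drift toward their imperialist, a colony
    that beats its imperialist takes over, and the weakest empire surrenders
    its worst colony to the strongest."""
    rng = random.Random(seed)
    for emp in empires:
        # assimilation: each colony moves a random fraction toward its imperialist
        emp["colonies"] = [c + beta * rng.random() * (emp["imp"] - c)
                           for c in emp["colonies"]]
        # takeover: a colony with lower cost than its imperialist swaps roles
        best = min(emp["colonies"], key=cost)
        if cost(best) < cost(emp["imp"]):
            emp["colonies"].remove(best)
            emp["colonies"].append(emp["imp"])
            emp["imp"] = best
    # competition: the weakest empire (by imperialist cost) loses its worst colony
    weakest = max(empires, key=lambda e: cost(e["imp"]))
    strongest = min(empires, key=lambda e: cost(e["imp"]))
    if weakest is not strongest and weakest["colonies"]:
        worst = max(weakest["colonies"], key=cost)
        weakest["colonies"].remove(worst)
        strongest["colonies"].append(worst)
    return empires

cost = lambda x: (x - 3.0) ** 2
empires = [{"imp": 2.5, "colonies": [0.0, 5.0]},
           {"imp": 10.0, "colonies": [8.0]}]
empires = ica_generation(empires, cost)
```

After one generation the empire whose imperialist sits near the optimum has absorbed a colony from the weaker one, illustrating the collapse dynamic described in the snippet.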

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio [chapter]

Stanislaw Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
2018 Lecture Notes in Computer Science  
We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size.  ...  We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD.  ...  However, batch size does not enter their analysis. In contrast, our analysis makes the role of batch size evident and shows the dynamics are set by the ratio of learning rate to batch size.  ... 
doi:10.1007/978-3-030-01424-7_39 fatcat:q6sk2gbltnhkli2gc4qwujf25i
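The learning-rate-to-batch-size claim can be illustrated with a toy 1-D quadratic (an assumed setup, not the chapter's experiments): averaging a mini-batch of size B divides the per-example gradient-noise variance by B, so the stationary spread of the SGD iterates tracks eta/B, and two runs sharing that ratio settle into minima of similar width:

```python
import random
import statistics

def stationary_variance(eta, batch_size, sigma=1.0, steps=60_000, seed=0):
    """Run SGD on the quadratic loss w**2 / 2 with mini-batch gradient noise of
    variance sigma**2 / batch_size, and return the variance of the late iterates."""
    rng = random.Random(seed)
    w = 0.0
    tail = []
    for t in range(steps):
        noise = rng.gauss(0.0, sigma / batch_size ** 0.5)
        w -= eta * (w + noise)          # gradient of w**2/2 is w, plus batch noise
        if t >= steps // 2:             # discard the burn-in half
            tail.append(w)
    return statistics.pvariance(tail)

v_small = stationary_variance(eta=0.02, batch_size=8)    # eta/B = 0.0025
v_large = stationary_variance(eta=0.16, batch_size=64)   # same eta/B ratio
```

Despite an 8x difference in both learning rate and batch size, the two stationary variances come out close, which is the ratio effect the chapter formalizes.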

A Clean, Well-Lighted Place

Polyxeni Potter
2010 Emerging Infectious Diseases  
Ferrets were fed fish in small batches.  ...  We necropsied the ferrets and examined the recovered Dracunculus worms to determine sex and whether females were mated or gravid.  ...  We compared case villages to both sets of controls in terms of human and bat population size, as well as human behavior patterns regarding date palm sap and fruit consumption.  ... 
doi:10.3201/eid1604.000000 fatcat:53fmnxwgyfgxvdjy5iactnlzka

Factor Analysis of Well Logs for Total Organic Carbon Estimation in Unconventional Reservoirs

Norbert P. Szabó, Rafael Valadez-Vergara, Sabuhi Tapdigli, Aja Ugochukwu, István Szabó, Mihály Dobróka
2021 Energies  
The estimation method is applied both to synthetic and real datasets from different reservoir types and geologic basins, i.e., Derecske Trough in East Hungary (tight gas); Kingak formation in North Slope  ...  Uncorrelated factors are extracted from well logging data using Jöreskog's algorithm, and then the factor logs are correlated with estimated petrophysical properties.  ...  The batch size is the number of training examples in one forward/backward pass. The higher the batch size, the more memory space we need.  ... 
doi:10.3390/en14185978 fatcat:cnhzhcrzx5b7lktnirs5mejlne

Herding as a Learning System with Edge-of-Chaos Dynamics [article]

Yutian Chen, Max Welling
2016 arXiv   pre-print
The herding algorithm can also be generalized to models with latent variables and to a discriminative learning setting.  ...  This chapter studies the distinct statistical characteristics of the herding algorithm and shows that the fast convergence rate of the controlled moments may be attributed to edge of chaos dynamics.  ...  φ = xy, and we use a mini-batch of size 1 at every iteration.  ... 
arXiv:1602.03014v2 fatcat:pxweht2l7vgvjozwvd4re3qomi
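The herding update referenced above is deterministic: repeatedly pick the state that maximizes the current weights' score, then correct the weights by the moment-matching error. A minimal sketch on a two-state space (the target moments below are illustrative):

```python
def herd(states, phi, mu, steps):
    """Deterministic herding: greedily pick the max-scoring state, then update the
    weights by the gap between the target moments mu and the chosen features."""
    w = list(mu)                          # conventional start w_0 = mu
    counts = {s: 0 for s in states}
    for _ in range(steps):
        x = max(states, key=lambda s: sum(wi * fi for wi, fi in zip(w, phi(s))))
        counts[x] += 1
        w = [wi + mi - fi for wi, mi, fi in zip(w, mu, phi(x))]
    return counts

# Target moment E[x] = 0.75 on states {0, 1} with feature phi(s) = [s].
counts = herd(states=[0, 1], phi=lambda s: [float(s)], mu=[0.75], steps=100)
```

The visit frequencies match the target moment exactly here (75 visits to state 1 out of 100), reflecting the fast O(1/T) moment convergence the chapter attributes to edge-of-chaos dynamics.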

The Break-Even Point on Optimization Trajectories of Deep Neural Networks [article]

Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras
2020 arXiv   pre-print
Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.  ...  In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the  ...  When varying the batch size, we use a learning rate of 1.0. Experiments are repeated with two different seeds that control initialization and data shuffling. DenseNet on ImageNet.  ... 
arXiv:2002.09572v1 fatcat:qyrskuopzrex7f2zz5mrq6w764

Three Factors Influencing Minima in SGD [article]

Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
2018 arXiv   pre-print
We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD.  ...  Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the  ...  evidence that the learning rate to batch size ratio is theoretically important in SGD.  ... 
arXiv:1711.04623v3 fatcat:335mzui2rnekzklii5ckzejdhe

A Bayesian Perspective on Generalization and Stochastic Gradient Descent [article]

Samuel L. Smith, Quoc V. Le
2018 arXiv   pre-print
Consequently, the optimum batch size is proportional to both the learning rate and the size of the training set, B_opt ∝ ϵN. We verify these predictions empirically.  ...  We consider two questions at the heart of machine learning: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well?  ...  ACKNOWLEDGMENTS We thank Pieter-Jan Kindermans, Prajit Ramachandran, Jascha Sohl-Dickstein, Jon Shlens, Kevin Murphy, Samy Bengio, Yasaman Bahri and Saeed Saremi for helpful comments on the manuscript.  ... 
arXiv:1710.06451v3 fatcat:ypgtckf4zrfltebmml5dy6idgu
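The stated scaling B_opt ∝ ϵN can be written down directly. The proportionality constant is problem-dependent, so `k` below is a placeholder assumption:

```python
def optimal_batch(eps, n_train, k=0.01):
    """B_opt grows linearly in both the learning rate eps and dataset size N;
    k stands in for the problem-dependent proportionality constant."""
    return k * eps * n_train

b_base = optimal_batch(eps=0.1, n_train=50_000)
b_hot = optimal_batch(eps=0.2, n_train=50_000)       # doubling eps doubles B_opt
b_bigdata = optimal_batch(eps=0.1, n_train=100_000)  # doubling N doubles B_opt
```

The practical reading is that if you raise the learning rate or grow the training set, the batch size that best balances gradient noise against speed should grow in proportion.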

Super-convergence: very fast training of neural networks using large learning rates

Leslie N. Smith, Nicholay Topin, Tien Pham
2019 Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications  
The existence of super-convergence is relevant to understanding why deep networks generalize well.  ...  One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate.  ...  ratio of the learning rate to batch size alone controls the entropic regularization term.  ... 
doi:10.1117/12.2520589 fatcat:jvkiuhrajrf2plx2sabrnu4zee
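The cyclical schedule central to super-convergence can be sketched as a one-cycle policy: ramp linearly from a base rate up to a large maximum at the midpoint of training, then back down. The endpoint values below are illustrative, not the paper's settings:

```python
def one_cycle_lr(step, total_steps, base_lr=0.01, max_lr=1.0):
    """Linear warmup to max_lr at the midpoint, then linear decay back down."""
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return base_lr + frac * (max_lr - base_lr)

schedule = [one_cycle_lr(s, total_steps=100) for s in range(101)]
```

The schedule spends most of its budget at unusually large rates, which is the ingredient the paper argues enables very fast training while still generalizing well.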

Local Regularizer Improves Generalization

Yikai Zhang, Hui Qu, Dimitris Metaxas, Chao Chen
2020 Proceedings of the AAAI Conference on Artificial Intelligence
Our thorough theoretical analysis is supported by experimental evidence. It advances our theoretical understanding of deep learning and provides new perspectives on designing training algorithms.  ...  Regularization plays an important role in the generalization of deep learning. In this paper, we study the generalization power of an unbiased regularizer for training algorithms in deep learning.  ...  This work was partially supported by NSF IIS-1855759, CCF-1855760, and CCF-1733843.  ... 
doi:10.1609/aaai.v34i04.6167 fatcat:nlyri6hieve3tm6dprpfh4fxwa
Showing results 1 — 15 out of 18,068 results