
Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory [article]

Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein
2020 arXiv pre-print
In this work, we: (1) prove the widespread existence of suboptimal local minima in the loss landscape of neural networks, and we use our theory to find examples; (2) show that small-norm parameters are ... plays a role; (4) find that rank does not correlate with generalization or robustness in a practical setting. ... In this section, we investigate the existence of suboptimal local minima from a theoretical perspective and an empirical one. ...
arXiv:1910.00359v3 fatcat:oas2iunoyfantiepiklcz5pude

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent [article]

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith
2021 arXiv pre-print
Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. ... Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. ... ., 2020), suggesting the benefit of weight decay may arise from more subtle effects on the GD trajectory. ...
arXiv:2010.09697v4 fatcat:gp7hyvv6xjefvbeh4ouq5q4bwu

Piecewise linear activations substantially shape the loss surfaces of neural networks [article]

Fengxiang He, Bohan Wang, Dacheng Tao
2020 arXiv pre-print
Understanding the loss surface of a neural network is fundamentally important to the understanding of deep learning. ... We first prove that the loss surfaces of many neural networks have infinite spurious local minima, which are defined as the local minima with higher empirical risks than the global minima. ... Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory. In International Conference on Learning Representations, 2020. Benjamin D. Haeffele and Rene Vidal. ...
arXiv:2003.12236v1 fatcat:r54rh2tczzbkhgbduhwcocl3zm

Pareto Probing: Trading Off Accuracy for Complexity [article]

Tiago Pimentel, Naomi Saphra, Adina Williams, Ryan Cotterell
2020 arXiv pre-print
Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations: e.g., why should the non-contextual fastText representations encode more morpho-syntactic ... To measure complexity, we present a number of parametric and non-parametric metrics. ... Acknowledgments: Thanks to Max Balandat for comments on an earlier version of this work. ...
arXiv:2010.02180v2 fatcat:d6w36kqm5zbndafg7jupirk26u