Generalization in Deep Networks: The Role of Distance from Initialization

Vaishnavh Nagarajan, J. Zico Kolter
2019 arXiv preprint
Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that depends on a given random initialization of the network, and not just on the training algorithm and the data distribution. We provide empirical evidence demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the ℓ_2 distance from the initialization. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
arXiv:1901.01672v2
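
As a rough illustration of the quantity the abstract refers to, the sketch below tracks the ℓ_2 distance between the current parameters and the random initialization, ||θ_t − θ_0||_2, over the course of SGD training. This is a minimal assumed setup (placeholder model, synthetic data, and hyperparameters chosen only for illustration), not the authors' experimental code.

```python
# Minimal sketch: monitor the l2 distance from initialization during
# SGD training. The model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn


def distance_from_init(model: nn.Module, init_params: list) -> float:
    """l2 distance between current parameters and the initialization,
    computed over all parameters flattened into a single vector."""
    sq_sum = 0.0
    for p, p0 in zip(model.parameters(), init_params):
        sq_sum += (p.detach() - p0).pow(2).sum().item()
    return sq_sum ** 0.5


# Placeholder network and synthetic regression data (assumptions).
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
init_params = [p.detach().clone() for p in model.parameters()]

x, y = torch.randn(128, 10), torch.randn(128, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 20 == 0:
        print(f"step {step}: ||theta_t - theta_0||_2 = "
              f"{distance_from_init(model, init_params):.4f}")
```

In such a sketch, one would expect the printed distance to grow quickly early in training and then flatten out, which is the kind of behavior the abstract's implicit-regularization claim concerns; the exact curves reported in the paper come from its own experiments, not from this toy setup.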