Convergence Theory of Learning Over-parameterized ResNet: A Full Characterization
The ResNet architecture has achieved great empirical success since its debut. Recent work established the convergence of learning an over-parameterized ResNet with a scaling factor τ = 1/L on the residual branch, where L is the network depth. However, it is not clear how learning ResNet behaves for other values of τ. In this paper, we fully characterize the convergence theory of gradient descent for learning over-parameterized ResNet with different values of τ. Specifically, hiding logarithmic factors and constant coefficients, we show that for τ < 1/√L gradient descent is guaranteed to converge to the global minima, and in particular when τ < 1/L the convergence is independent of the network depth. Conversely, we show that for τ > L^(−1/2+c) with a constant c > 0, the forward output grows in expectation at a rate of at least L^c, and learning then fails for large L because of gradient explosion. This means the bound τ < 1/√L is sharp for learning ResNet of arbitrary depth. To the best of our knowledge, this is the first work that studies learning ResNet over the full range of τ.
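The scaling behavior described above can be illustrated numerically. The sketch below is not the paper's exact setting: it uses a toy linear residual update x ← x + τ·W·x with random Gaussian W at every layer (the function `forward_norm` and all its parameters are illustrative assumptions), to show how the forward output norm depends on the choice of τ at large depth L.

```python
import numpy as np

def forward_norm(L, tau, d=64, seed=0):
    """Toy linear ResNet forward pass: x_{l+1} = x_l + tau * W_l x_l.

    W_l has i.i.d. N(0, 1/d) entries, so E||W_l x||^2 = ||x||^2 and the
    squared norm grows roughly like (1 + tau^2)^L in expectation.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d) / np.sqrt(d)  # unit-norm-scale input
    for _ in range(L):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        x = x + tau * W @ x
    return np.linalg.norm(x)

L = 1000
small = forward_norm(L, tau=1.0 / L)            # tau = 1/L: norm stays O(1)
critical = forward_norm(L, tau=1.0 / np.sqrt(L))  # tau = 1/sqrt(L): bounded
large = forward_norm(L, tau=L ** (-0.25))       # tau = L^(-1/2+c), c = 1/4: explodes
print(small, critical, large)
```

Since (1 + τ²)^L tends to 1 for τ = 1/L, to e for τ = 1/√L, and to e^(L^(2c)) for τ = L^(−1/2+c), the first two output norms stay of constant order while the third blows up with depth, matching the dichotomy stated in the abstract.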