Approximation and estimation bounds for artificial neural networks

Andrew R. Barron
Machine Learning, 1994
For a common class of artificial neural networks, the mean integrated squared error between the estimated network and a target function f is shown to be bounded by O(C_f^2/n) + O((nd/N) log N), where n is the number of nodes, d is the input dimension of the function, N is the number of training observations, and C_f is the first absolute moment of the Fourier magnitude distribution of f. The two contributions to this total risk are the approximation error and the estimation error. Approximation error refers to the distance between the target function and the closest neural network function of a given architecture, and estimation error refers to the distance between this ideal network function and an estimated network function. With n ~ C_f (N/(d log N))^{1/2} nodes, the order of the bound on the mean integrated squared error is optimized to be O(C_f ((d/N) log N)^{1/2}). The bound demonstrates surprisingly favorable properties of network estimation compared to traditional series and nonparametric curve estimation techniques when d is moderately large. Similar bounds are obtained when the number of nodes n is not preselected as a function of C_f (which is generally not known a priori), but rather is optimized from the observed data by the use of a complexity regularization or minimum description length criterion. The analysis involves Fourier techniques for the approximation error, metric entropy considerations for the estimation error, and a calculation of the index of resolvability of minimum complexity estimation for the family of networks.

Keywords: neural nets, approximation theory, estimation theory, complexity regularization, statistical risk

If attention were restricted to approximation by linear combinations of a fixed set of n basis functions, then by a result in (Barron 1993) there is no such basis for which the integrated squared approximation error is less than order (C/d)(1/n)^{2/d} uniformly over all functions with C_f < C, for any C > 0. Consequently, it is seen that for the class of functions considered here, (adaptive) neural network estimation has approximation and estimation properties superior to traditional linear expansions for each dimension d ≥ 3.
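The trade-off behind the bound can be sketched numerically. The following is a minimal illustration, not code from the paper, and it takes the unspecified constants in both O(·) terms as 1: balancing the approximation term C_f^2/n against the estimation term (nd/N) log N yields the stated choice of n and the overall rate 2 C_f ((d/N) log N)^{1/2}.

```python
import math

def risk_bound(n, C_f, d, N):
    # Two-term risk: approximation error C_f^2 / n plus
    # estimation error (n d / N) log N (both constants set to 1 here).
    return C_f**2 / n + (n * d / N) * math.log(N)

def optimal_n(C_f, d, N):
    # Setting the derivative in n to zero balances the two terms,
    # giving n ~ C_f (N / (d log N))^{1/2}.
    return C_f * math.sqrt(N / (d * math.log(N)))

# Hypothetical values chosen only for illustration.
C_f, d, N = 10.0, 20, 100_000
n_star = optimal_n(C_f, d, N)

# At n_star the two terms are equal, so the bound is
# 2 C_f ((d/N) log N)^{1/2}.
assert math.isclose(risk_bound(n_star, C_f, d, N),
                    2 * C_f * math.sqrt((d / N) * math.log(N)))
```

Note that the optimized risk decays like (d log N / N)^{1/2}, with the dimension d appearing only inside the square root rather than in the exponent of N, which is the sense in which the rate is favorable for moderately large d.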
doi:10.1007/bf00993164