### Uniform asymptotic inference and the bootstrap after model selection

Ryan J. Tibshirani, Alessandro Rinaldo, Rob Tibshirani, Larry Wasserman
Annals of Statistics, 2018
Recently, Tibshirani et al. (2016) proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed. Our asymptotic result holds uniformly over a
wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension $d$ is allowed to grow.

for all $t \in [0, 1]$. The interpretation is that the TG tests of $H_{0,M}: v_M^T \theta_0 = \mu_M$, $M \in \mathcal{M}$, have the correct conditional size in a suitable weighted-average sense, where the weights $w_M = \mathbb{P}(M(Y) = M)$, $M \in \mathcal{M}$, are given by the model selection probabilities. Finally, it is worth emphasizing once again that the testing property in (12) is written in such a way that it is easy to establish the confidence interval property in (13). Thus, a third way to address any concerns about interpreting (12) (or even (14) or (15)) is to switch the focus from unconditional hypothesis testing to unconditional confidence intervals; in many ways, we find the latter to be the more natural of the two perspectives, from an unconditional point of view.

#### The master statistic

Given a response $y$ and predictors $X$, our description thus far of the selected model $M(y)$, the statistics $T(y; M, v, \mu)$ and $T(y; V, U)$, and so on, has ignored the role of $X$. This was done for simplicity. The theory to come in Section 4 will consider $X$ to be nonrandom, but asymptotically $X$ must (of course) grow with $n$, so it will help to be precise about the dependence of the selected model and statistics on $X$. We will denote these quantities by $M(X, y)$, $T(X, y; M, v, \mu)$, and $T(X, y; V, U)$ to emphasize this dependence. We define a $d(d+3)/2$-dimensional quantity, $\Omega_n = (\frac{1}{n} X^T X, \frac{1}{n} X^T y)$, that we will call the master statistic. As its name might suggest, it plays an important role: all normalized coefficients from regressing $y$ onto subsets of the variables in $X$ can be written in terms of $\Omega_n$.
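To make this concrete, here is a minimal numpy sketch (illustrative, not from the paper) verifying that the least squares coefficients on any subset $A$ of the variables can be recovered from $\Omega_n = (\frac{1}{n} X^T X, \frac{1}{n} X^T y)$ alone, without access to the raw data. For simplicity it uses plain (unnormalized) coefficients; the normalized versions are likewise functions of $\Omega_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.standard_normal(n)

# The master statistic: Omega_n = (X^T X / n, X^T y / n).
G = X.T @ X / n      # d x d Gram matrix
g = X.T @ y / n      # d-vector

def coef_from_master(G, g, A):
    """Least squares coefficients of y on X_A, computed from the
    master statistic alone (no access to X or y): solve the normal
    equations (X_A^T X_A / n) beta = X_A^T y / n."""
    A = list(A)
    return np.linalg.solve(G[np.ix_(A, A)], g[A])

# Direct least squares on the raw data, for comparison.
A = [0, 1, 4]
beta_direct, *_ = np.linalg.lstsq(X[:, A], y, rcond=None)

assert np.allclose(coef_from_master(G, g, A), beta_direct)
```

The point of the sketch is exactly the dependence claimed in the text: the sub-regression never touches $(X, y)$ directly, only the submatrices $G_{AA}$ and $g_A$ of the master statistic.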
That is, for an arbitrary set $A \subseteq \{1, \ldots, d\}$, the $j$th normalized coefficient from the regression of $y$ onto $X_A$ depends on $(X, y)$ only through $\Omega_n$. The same dependence holds, it turns out, for the selected models from FS, LAR, and the lasso. We defer the proof of the next lemma, as with all proofs in this paper, to the appendix.

**Lemma 3.** For each of the FS, LAR, and lasso procedures, run for $k$ steps on data $(X, y)$, the selected model $M(X, y)$ only depends on $(X, y)$ through the master statistic $\Omega_n = (\frac{1}{n} X^T X, \frac{1}{n} X^T y)$. In more detail, for any fixed $M \in \mathcal{M}$, the matrix $Q_M(X)$ such that $M(X, y) = M \iff Q_M(X)\, y \geq 0$ can be written as $Q_M(X) = P_M(\frac{1}{n} X^T X)\, \frac{1}{n} X^T$, where $P_M$ depends only on $\frac{1}{n} X^T X$.

This lemma asserts that the master statistic governs model selection, as performed by FS, LAR, and the lasso. It is also central to the TG pivot for these procedures. Denoting $M = M(X, y)$, the statistic $T(X, y; M, v, \mu)$ in (10) only depends on $(X, y)$ through three quantities. The third quantity is always a function of $\Omega_n$, by Lemma 3. When $v$ is chosen so that $v^T y$ is a normalized coefficient in the regression of $y$ onto a subset of the variables in $X$, the first two quantities are also functions of $\Omega_n$. Thus, in this case, the TG pivot only depends on $(X, y)$ through the master statistic $\Omega_n$; in fact, for fixed $\frac{1}{n} X^T X$, it is a smooth function of $\frac{1}{n} X^T y$.

**Lemma 4.** Fix any model $M \in \mathcal{M}$, and suppose that $v$ is chosen so that $v^T y$ is a normalized coefficient from projecting $y$ onto a subset of the variables in $X$. Then the TG statistic only depends on $(X, y)$ by means of $\Omega_n$, so that we may write $T(X, y; M, v, \mu) = \psi_M(\frac{1}{n} X^T X, \frac{1}{n} X^T y)$. The function $\psi_M$, with its first argument fixed at an arbitrary value of $\frac{1}{n} X^T X$, is continuous in its second argument on the interior of the cone $\{\eta : P_M(\frac{1}{n} X^T X)\, \eta \geq 0\} \subseteq \mathbb{R}^d$.
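The idea behind Lemma 3 can be illustrated with forward stepwise selection, which admits a direct implementation from the master statistic. The sketch below (a plain FS variant with correlation-style normalization; the paper's exact step rule may differ in details) runs FS twice, once from $(G, g) = (\frac{1}{n} X^T X, \frac{1}{n} X^T y)$ alone and once from the raw data, and checks that the selected models agree.

```python
import numpy as np

def fs_from_master(G, g, k):
    """Forward stepwise run entirely from the master statistic
    (G, g) = (X^T X / n, X^T y / n): at each step, pick the inactive
    variable with the largest normalized correlation with the
    current residual.  All required quantities are functions of (G, g)."""
    d = len(g)
    active = []
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in range(d):
            if j in active:
                continue
            if active:
                GA = G[np.ix_(active, active)]
                beta = np.linalg.solve(GA, g[active])
                num = g[j] - G[j, active] @ beta           # proportional to X_j^T residual
                den2 = G[j, j] - G[j, active] @ np.linalg.solve(GA, G[active, j])
            else:
                num, den2 = g[j], G[j, j]
            score = abs(num) / np.sqrt(max(den2, 1e-12))
            if score > best_score:
                best_j, best_score = j, score
        active.append(best_j)
    return active

def fs_direct(X, y, k):
    """The same selection rule computed from the raw data (X, y)."""
    active = []
    for _ in range(k):
        cand = [j for j in range(X.shape[1]) if j not in active]
        if active:
            XA = X[:, active]
            r = y - XA @ np.linalg.lstsq(XA, y, rcond=None)[0]
            resid = {j: X[:, j] - XA @ np.linalg.lstsq(XA, X[:, j], rcond=None)[0]
                     for j in cand}
        else:
            r = y
            resid = {j: X[:, j] for j in cand}
        active.append(max(cand, key=lambda j: abs(X[:, j] @ r)
                          / np.linalg.norm(resid[j])))
    return active

rng = np.random.default_rng(0)
n, d = 60, 6
X = rng.standard_normal((n, d))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + rng.standard_normal(n)
G, g = X.T @ X / n, X.T @ y / n

assert fs_from_master(G, g, 3) == fs_direct(X, y, 3)

# An orthogonal rotation of the rows leaves (G, g) unchanged,
# hence the selected model too:
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
assert fs_direct(Q @ X, Q @ y, 3) == fs_from_master(G, g, 3)
```

The two per-step scores differ only by a common factor of $1/\sqrt{n}$, so the argmax (and hence the selected model) coincides, as Lemma 3 asserts for the function $P_M(\frac{1}{n} X^T X)$.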
Consider now the behavior of the unconditional TG statistic $T(X, y; V, U)$ over the entire sample space $y \in \mathbb{R}^n$. Given a catalog $V = \{v_M : M \in \mathcal{M}\}$ such that, for each $M \in \mathcal{M}$, $v_M^T y$ is a normalized regression coefficient from projecting $y$ onto some subset of the variables, Lemmas 3 and 4 combine to imply that the unconditional TG statistic is still only a function of $(X, y)$ through the master statistic $\Omega_n$. When $\frac{1}{n} X^T X$ is fixed, the discontinuities of this function over $\frac{1}{n} X^T y$ lie on the boundaries of the partition elements, which have measure zero.

**Lemma 5.** Suppose that the catalog $V = \{v_M : M \in \mathcal{M}\}$ is chosen so that each $v_M^T y$ gives a normalized coefficient from regressing $y$ onto a subset of the variables in $X$, for $M \in \mathcal{M}$. Then the unconditional TG statistic only depends on $(X, y)$ by means of $\Omega_n$, so that we may write $T(X, y; V, U) = \psi(\frac{1}{n} X^T X, \frac{1}{n} X^T y)$. The function $\psi$, with its first argument fixed at an arbitrary value of $\frac{1}{n} X^T X$, is continuous in its second argument on a set $D \subseteq \mathbb{R}^d$ with full Lebesgue measure (i.e., $\mathbb{R}^d \setminus D$ has measure zero).
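For readers new to the TG construction underlying these lemmas, here is a minimal sketch of the conditional pivot in the style of Tibshirani et al. (2016): if, conditional on selection, $v^T y$ is Gaussian truncated to an interval $[a, b]$, then the survival-function transform below is exactly Unif(0, 1). In the paper, $a$ and $b$ are computed from the polyhedral selection event $\{Q_M(X)\, y \geq 0\}$; here they are simply taken as given.

```python
import numpy as np
from scipy.stats import norm

def tg_pivot(vy, mu, tau, a, b):
    """Truncated-Gaussian (TG) pivot: with v^T y ~ N(mu, tau^2)
    truncated to [a, b] conditional on selection, this quantity
    is exactly uniform on (0, 1)."""
    num = norm.cdf((b - mu) / tau) - norm.cdf((vy - mu) / tau)
    den = norm.cdf((b - mu) / tau) - norm.cdf((a - mu) / tau)
    return num / den

# With no truncation, the pivot reduces to the usual normal p-value.
assert abs(tg_pivot(0.0, 0.0, 1.0, -np.inf, np.inf) - 0.5) < 1e-12

# Monte Carlo check of uniformity under truncation to [0.5, 2.0].
rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
z = z[(z > 0.5) & (z < 2.0)]        # rejection-sample the truncated normal
p = tg_pivot(z, 0.0, 1.0, 0.5, 2.0)
assert abs(p.mean() - 0.5) < 0.02   # Unif(0,1) has mean 1/2
```

Lemmas 4 and 5 concern exactly this pivot viewed as a function of the data: once $\frac{1}{n} X^T X$ is fixed, the map from $\frac{1}{n} X^T y$ to the pivot is smooth within each selection region, with discontinuities only on the (measure-zero) region boundaries.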