DTDA: An R Package to Analyze Randomly Truncated Data
Journal of Statistical Software
In this paper, the R package DTDA for analyzing truncated data is described. The package contains tools implementing three different but related algorithms to compute the nonparametric maximum likelihood estimator (NPMLE) of the survival function in the presence of random truncation. More precisely, the package implements the algorithms proposed by Efron and Petrosian (1999) and Shen (2008) for analyzing randomly one-sided and two-sided (i.e., doubly) truncated data. These algorithms and some recent extensions are briefly reviewed. Two real data sets are used to show how the DTDA package works in practice.

Under left-truncation, the truncation set is an interval unbounded from above. In epidemiological studies and industrial life-testing, left-truncation arises, e.g., under cross-sectional sampling, where only individuals "in progress" at a given date (also referred to as prevalent cases) are eligible. As a result, large progression times are more likely to be observed, and this may severely distort the observed distribution function (DF) of interest. For left-truncated data, the NPMLE has an explicit form and can be computed by a simple algorithm that goes back to Lynden-Bell (1971). See Woodroofe (1985) and Stute (1993) for the statistical analysis of this estimator. The right-truncated scenario, under which the truncation sets are intervals unbounded from below, can be dealt with similarly by means of a sign change.

Inference becomes more complicated, however, when other forms of truncation appear. In many applications, the truncation sets are bounded intervals, that is, the variable of interest X* is only observed when it falls in a (subject-specific) random interval [U*, V*]. Efron and Petrosian (1999) motivated this double-truncation issue by means of data on quasars, which are only detected when their luminosity lies between two observational limits. Doubly-truncated data are also encountered in epidemiology. For example, databases of acquired immunodeficiency syndrome (AIDS) incubation times (measured from human immunodeficiency virus (HIV) infection) report information restricted to those individuals diagnosed prior to some specific date. This typically introduces a strong observational bias associated with right-truncation, i.e., relatively small incubation times are more likely to be observed. Besides, since HIV was unknown before 1982, there is some left-truncation effect too.
Bilker and Wang (1996) noticed this problem and discussed the relative impact of each type of truncation on the final sample. Moreira and de Uña-Álvarez (2010) motivated the random double-truncation phenomenon by analyzing the age at diagnosis of childhood cancer patients; as in the AIDS example, double truncation arises there because the recruited subjects are those whose terminating event falls within a given observational window. Note that left (respectively right) truncation is obtained from double truncation by letting V* (respectively U*) be degenerate at infinity (respectively minus infinity).

A cumbersome issue with doubly-truncated data is that the NPMLE has no explicit form and must be computed iteratively. This complicates the analysis of its statistical properties and also poses a challenge in the design of suitable algorithms for its practical computation. See Efron and Petrosian (1999) and Shen (2008) for technical details.

To the best of our knowledge, there is no existing package oriented to the computation of the NPMLE under double-truncation. The DTDA package described in this work fills this gap. DTDA has been implemented in the R system for statistical computing (R Development Core Team 2010). The package also allows for the analysis of one-sided (left or right) truncated data. It contains three different algorithms for the approximation of the NPMLE under double-truncation (in its more general version), as well as some recent extensions, e.g., bootstrap confidence bands (Moreira and de Uña-Álvarez 2010). As described below, it also provides useful numerical outputs and automatic graphical displays. Results in this document have been obtained with version 2.1-1, available from http://CRAN.R-project.org/package=DTDA.

The paper is organized as follows. In Section 2, a brief review of the existing algorithms to compute the NPMLE under double-truncation is given. In Section 3, the DTDA package is described and its usage is illustrated through the analysis of two real data sets. Finally, Section 4 is devoted to conclusions and possible future extensions of the package.
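
To make the explicit left-truncated case mentioned in the introduction concrete, the following is a minimal base R sketch of a Lynden-Bell type product-limit estimate. It is an illustration only and does not use the DTDA interface: the function name lynden_bell, its arguments, and the toy data are hypothetical. The sketch assumes that each sampled pair satisfies u <= x, and that the risk set at a value t consists of the pairs with u <= t <= x. Note that the estimate degenerates if the risk set coincides with the number of observations at t before the largest observed value.

```r
# Lynden-Bell type product-limit estimate for left-truncated data (base R only).
# x: observed values of the variable of interest
# u: observed left-truncation values, with u[i] <= x[i] for every sampled pair
# (hypothetical helper for illustration; not a function of the DTDA package)
lynden_bell <- function(x, u) {
  stopifnot(length(x) == length(u), all(u <= x))
  t <- sort(unique(x))                                  # distinct observed values
  # number of observations equal to each distinct value
  d <- vapply(t, function(s) sum(x == s), integer(1))
  # size of the risk set at each distinct value: pairs with u <= s <= x
  r <- vapply(t, function(s) sum(u <= s & s <= x), integer(1))
  surv <- cumprod(1 - d / r)                            # estimate of P(X* > t)
  data.frame(time = t, n.risk = r, n.event = d, cdf = 1 - surv)
}

# Toy data (made up for this sketch)
x <- c(2, 3, 5, 7, 8)
u <- c(0, 1, 0, 4, 2)
est <- lynden_bell(x, u)
print(est)

# Without truncation (u degenerate at minus infinity) the estimate
# coincides with the ordinary empirical distribution function.
ecdf_like <- lynden_bell(x, rep(-Inf, length(x)))
print(ecdf_like$cdf)
```

On the toy data the risk sets have sizes 4, 3, 3, 2, 1, so the estimated DF takes the values 1/4, 1/2, 2/3, 5/6, 1 at the ordered observations. Setting every truncation value to minus infinity recovers the jumps of size 1/n of the empirical DF, which mirrors the remark that one-sided truncation is a degenerate case of the general setting.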