Invited Commentary: Surveilling Surveillance--Some Statistical Comments

L. A. Waller
2004 American Journal of Epidemiology  
I congratulate Kleinman et al. (1) on their thoughtful application of generalized linear mixed models (GLMM) to disease surveillance in space and time. In this commentary, I amplify some appealing features of the approach, provide an overview of data issues that may affect the field performance of a surveillance system based on such a method, and discuss several technical issues.

Attractive features of the approach

The authors' approach offers a movement from statistical testing to statistical modeling for disease surveillance. Traditionally, statistical methods for surveillance tend to evolve from a hypothesis-testing framework, wherein one "detects" an outbreak (anomaly, cluster, etc.) as a "statistically significant" departure from a null hypothesis defined as the absence of an outbreak (e.g., constant age-specific incidence proportions or monthly seasonal incidence proportions based on historical data). The current approach instead uses a GLMM to predict the expected number of cases in the absence of an outbreak and then compares observed case counts with these model-based expected values.

On one level, the goals appear the same, but the end result of a hypothesis-testing approach tends to be an assessment of statistical significance (e.g., a p value) reflecting whether or not we have sufficient evidence to reject the null hypothesis, while the end result of the model-based approach is a description of which data appear to deviate from the model and by how much. As a result, a testing approach tends to focus on a "yes/no" ("detect"/"nondetect") assessment, but basic features of the surveillance problem considerably complicate formal statistical inference in this setting. For instance, the ongoing temporal nature of surveillance is somewhat similar to sequential analysis, but without an endpoint, and somewhat similar to statistical quality control, but conducted for multiple regions and/or outcomes simultaneously. A model-based approach does not solve these problems per se, but it places the emphasis on describing patterns rather than solely on assessing their significance via a binary decision. Modeling also offers the opportunity to improve the model (through inclusion of additional covariates, etc.) in order to better characterize and understand patterns in the data, rather than reach a simple "significant"/"nonsignificant" conclusion.
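To make the contrast concrete, a minimal sketch of the model-based view follows. This is not the authors' GLMM; it is a deliberately simplified stand-in (all region names and counts are hypothetical, and the "model" is just each region's historical mean) showing how one reports, for each region, the observed count, the model-based expected count, and the size of the deviation, rather than a single detect/nondetect verdict.

```python
import math

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), by summing the lower tail directly."""
    cdf = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# Hypothetical historical daily syndrome counts for three regions (14 days each).
history = {
    "A": [2, 3, 1, 2, 4, 2, 3, 1, 2, 3, 2, 2, 1, 3],
    "B": [0, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 0, 1, 1],
    "C": [5, 6, 4, 5, 7, 5, 6, 4, 5, 6, 5, 5, 4, 6],
}
today = {"A": 3, "B": 6, "C": 5}

for region, counts in history.items():
    expected = sum(counts) / len(counts)      # model-based expected count
    observed = today[region]
    excess = observed - expected              # which data deviate, and by how much
    tail = poisson_sf(observed, expected)     # P(count >= observed) under the model
    print(f"{region}: observed={observed}, expected={expected:.2f}, "
          f"excess={excess:+.2f}, P(>=obs)={tail:.3f}")
```

Note that the output is a per-region description of departures from the fitted baseline; a testing approach would reduce each line to a binary decision at some significance threshold.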
While admittedly glossing over a myriad of technical details regarding the mathematical subtleties of statistical testing and modeling, I personally find the model-based approach better suited to the exploration and understanding of observed patterns.

The proposed GLMM approach has several appealing features. First, it allows ready incorporation of covariate effects, permitting adjustment for regional and temporal variations in known risk factors (e.g., age and sex). In addition, the random effects provide a powerful tool for incorporating possible correlations within and between small geographic areas and/or time periods. The "shrinkage" aspect of GLMM estimation, as described by the authors, coupled with the specification of random effects, allows the model to "borrow strength" to improve precision where it is needed most (i.e., the approach borrows more "outside" information for the least precise crude estimates). Very generally, one can consider the specification of random effects to define "from whom each estimate should borrow information." That is, the random effects define which observations have similarities and correlations unaccounted for by the "fixed-effect" covariates. The authors define random intercept terms for each small region, but one could also consider random intercepts for each neighborhood (to allow within-neighborhood correlations due to unmeasured behavioral similarities among residents of each city neighborhood) or spatially correlated random effects (to permit spatial correlation between regions, allowing for broad spatial trends unaccounted for by the covariates in the model).

While random effects offer a broad set of possibilities, not all of them are easily fitted with current software. In the spatial setting, users often build from the work of Clayton and Kaldor (2) and Besag et al. (3) and use Markov chain Monte Carlo algorithms to fit GLMMs with spatial random effects. As Kleinman et al. noted (1), such methods (while increasing in popularity) currently do not provide the sort of quick and repetitive analyses (e.g., analyzing today's data by tomorrow) that routine surveillance requires.
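The shrinkage idea described above can be illustrated with a small sketch. This is not the authors' GLMM fit; it uses the conjugate gamma-Poisson shrinkage estimator in the spirit of Clayton and Kaldor (2), with hypothetical region data and an assumed prior strength, to show that the least precise crude rates borrow the most "outside" information.

```python
# Hypothetical small-area data: observed cases y_i and population n_i.
regions = {
    "small":  {"y": 3,   "n": 500},      # imprecise crude rate
    "medium": {"y": 40,  "n": 10_000},
    "large":  {"y": 400, "n": 100_000},  # precise crude rate
}

global_rate = sum(r["y"] for r in regions.values()) / sum(r["n"] for r in regions.values())

# Conjugate gamma prior on each region's rate, centered on the global rate.
# prior_n acts as a "prior population": larger values shrink harder (assumed here
# purely for illustration).
prior_n = 5_000
a = global_rate * prior_n   # prior shape
b = prior_n                 # prior rate

for name, r in regions.items():
    crude = r["y"] / r["n"]
    shrunk = (r["y"] + a) / (r["n"] + b)   # posterior mean: borrows strength
    weight = r["n"] / (r["n"] + b)         # weight placed on the region's own data
    print(f"{name:6s}: crude={crude:.5f}, shrunk={shrunk:.5f}, "
          f"own-data weight={weight:.2f}")
```

The posterior mean is a precision-weighted compromise between the crude rate and the global rate: the small region's estimate is pulled strongly toward the global rate, while the large region's estimate barely moves. Specifying which units share a random effect (region, neighborhood, spatial neighbors) determines "from whom" each estimate borrows.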
doi:10.1093/aje/kwh030 pmid:14742280