Clustering in General Measurement Error Models

Raymond Carroll, Ya Su, Jill Reedy
2018 Statistica Sinica
This paper is dedicated to the memory of Peter G. Hall. It concerns a deceptively simple question: if one observes variables corrupted with measurement error of possibly very complex form, can one recreate asymptotically the clusters that would have been found had there been no measurement error? We show that the answer is yes, and that the solution is surprisingly simple and general. The method itself is to simulate, by computer, realizations with the same distribution as that of the true variables, and then to apply clustering to these realizations. Technically, we show that if one uses K-means clustering or any other risk-minimizing clustering, and a multivariate deconvolution device with certain smoothness and convergence properties, then, in the limit, the cluster means based on our method converge to the same cluster means as if there were no measurement error. Along with the method and its technical justification, we analyze two important nutrition data sets, finding patterns that make sense nutritionally.

Dedication to the memory of Peter G. Hall

The last author, Raymond J. Carroll, was very fond of Peter and visited him many times. His facets included his brilliance, dedication, kindness, sense of humor, graciousness to young researchers, puzzle solving, madcap driving to take photos of trains, discussions about airplanes, love of cats and photographic advice. As Peter said in his Statistical Science interview (Delaigle and Wand, 2016), "I always like working with Ray, because I felt I could contribute something from the problem solving side, the theoretical side, whereas he is more an applied person, ... in working with Ray we bring to the table things that don't overlap, and which complemented each other very well." In his interview, Peter also mentioned that a lot of his joint work with Raymond grew out of nutrition research, and hence this paper is an appropriate contribution to this special issue. It involves a deceptively simple question: if one observes variables corrupted with measurement error of possibly complex form, such as occurs in nutritional and radiation applications, can one recreate the clusters that would have been found had there been no measurement error? We show that the answer is yes, and that the solution is surprisingly simple and general.
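The two-step method described in the abstract (simulate realizations from the estimated distribution of the true variables, then cluster those realizations) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a one-dimensional classical error model with two replicates, W_ij = X_i + U_ij, and it replaces the multivariate deconvolution device with a much cruder stand-in: X and U are assumed normal and their moments are estimated from the replicates. The function names `kmeans_1d` and `simulate_true_draws`, the sample size, and the variances are all hypothetical choices made for illustration.

```python
import random
import statistics


def kmeans_1d(points, k=2, iters=30):
    """Lloyd's algorithm in one dimension; returns the sorted cluster means."""
    pts = sorted(points)
    # Deterministic initialisation at evenly spaced sample quantiles.
    centers = [pts[int((j + 0.5) * len(pts) / k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [statistics.fmean(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)


def simulate_true_draws(w1, w2, n_draws, rng):
    """Toy stand-in for deconvolution (ASSUMPTION: X and U are normal)."""
    wbar = [(a + b) / 2 for a, b in zip(w1, w2)]
    # Var(W1 - W2) = 2 Var(U), so halving estimates the error variance.
    var_u = statistics.variance([a - b for a, b in zip(w1, w2)]) / 2
    # With two replicates, Var(Wbar) = Var(X) + Var(U) / 2.
    var_x = max(statistics.variance(wbar) - var_u / 2, 1e-12)
    mu_x = statistics.fmean(wbar)
    return [rng.gauss(mu_x, var_x ** 0.5) for _ in range(n_draws)]


rng = random.Random(0)
n = 4000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # unobserved true values
w1 = [xi + rng.gauss(0.0, 1.0) for xi in x]        # first error-prone replicate
w2 = [xi + rng.gauss(0.0, 1.0) for xi in x]        # second error-prone replicate

oracle = kmeans_1d(x)                               # clusters without error
naive = kmeans_1d([(a + b) / 2 for a, b in zip(w1, w2)])
corrected = kmeans_1d(simulate_true_draws(w1, w2, n, rng))

print("oracle   :", oracle)
print("naive    :", naive)
print("corrected:", corrected)
```

In this toy setting the naive cluster means, computed from the replicate averages, are pushed outward by the extra error variance, while the means computed from the simulated draws should lie closer to the error-free ones. The paper's result concerns far more general error structures and deconvolution estimators than the normal-moment shortcut used here.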
doi:10.5705/ss.202017.0093 fatcat:v6iqa5k6yzcezl75qvamrakbrm