The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data [post]

Ayesha Sania, Nicolo Pini, Morgan Nelson, Michael Myers, Lauren Shuffrey, Maristella Lucchini, Amy J. Elliott, Hein J. Odendaal, William Fifer
2021 unpublished
Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods: The Safe Passage study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing ... data using a machine learning algorithm, "K Nearest Neighbor" (K-NN). K-NN imputes missing values for a participant using data from the participants closest to that participant. Imputed values were weighted by distance from the nearest neighbors and matched for day of week. Validation was done on data randomly deleted for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from first-trimester data with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
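The imputation and validation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `knn_impute` and `delete_segment`, the inverse-distance weighting, the Euclidean distance, and the synthetic Poisson data are all assumptions. Day-of-week matching is assumed to be handled by aligning every row to start on the same weekday, so that each column refers to the same day of the week.

```python
import numpy as np


def knn_impute(data, k=5):
    """Fill NaNs in each row from the k nearest complete rows.

    data: 2-D array, rows = participants, columns = consecutive days.
    Neighbors are weighted by inverse distance (an assumed scheme).
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    # Donor rows: participants with no missing days in this window.
    complete_rows = np.flatnonzero(~np.isnan(data).any(axis=1))
    for i in range(data.shape[0]):
        missing = np.isnan(data[i])
        observed = ~missing
        donors = complete_rows[complete_rows != i]
        if not missing.any() or not observed.any() or donors.size == 0:
            continue
        # Euclidean distance over the days the target row did report.
        diffs = data[donors][:, observed] - data[i, observed]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        order = np.argsort(dists, kind="stable")[:k]
        nearest, d = donors[order], dists[order]
        weights = 1.0 / (d + 1e-6)  # small constant avoids division by zero
        filled[i, missing] = weights @ data[nearest][:, missing] / weights.sum()
    return filled


def delete_segment(row, rng, min_len=5, max_len=15):
    """Blank out one random run of consecutive days for validation."""
    length = int(rng.integers(min_len, max_len + 1))
    start = int(rng.integers(0, row.size - length + 1))
    masked = row.copy()
    masked[start:start + length] = np.nan
    return masked, slice(start, start + length)


# Synthetic drinks-per-day data (not study data): 40 rows of 55 days.
rng = np.random.default_rng(0)
complete = rng.poisson(1.0, size=(40, 55)).astype(float)

# Delete a 5-15 day segment from one row, impute it, and compare.
test = complete.copy()
test[0], seg = delete_segment(complete[0], rng)
imputed = knn_impute(test, k=5)
abs_error = np.abs(imputed[0, seg] - complete[0, seg])
print("mean absolute error (drinks/day):", abs_error.mean())
```

The 5 neighbors and 55-day window mirror the values reported in the Results. On random synthetic data the error is not meaningful; the sketch only shows the mechanics of masking a segment and scoring the imputed values against the held-out ones.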
doi:10.21203/rs.3.rs-153387/v1 fatcat:s2gkdotm2bejpjk7xot5ve6fgi