Reverse Engineered Virtual Patient Populations as Surrogates for Real Patient-Level Data [article]

Francis J Alenghat
2018 bioRxiv pre-print
Dissemination of clinical data for research has limitations. The most coveted data is richly descriptive at the individual level, but acquiring such granularity comes with significant cost, effort, or time. De-identification of individual records is not foolproof, with potential for privacy breaches, especially for "real-world" data derived from electronic health records. Also, the open data movement has progressed slowly for clinical trials, partly due to concerns about data ownership. Here, reverse engineered virtual patient populations (RE-ViPPs) are described, based on aggregate cross-tabulated categorical data from populations. The method does not require end-user access to individual-level data. Rather, using sequential linear regressions and random number generation, it generates virtual individual patients to comprise populations that, on average, closely resemble the real population in question. The method is validated by applying it to aggregated data derived from the seminal SPRINT trial, for which the individual-level data is known. The method yields virtual populations, each with 9361 patients, faithfully mimicking the 9361 real SPRINT participants. Multiple logistic regression on 100 such populations shows that, just as in SPRINT, risk factors with the highest odds ratio for the primary event are, in descending order, past clinical cardiovascular disease, age ≥ 75, chronic kidney disease, high non-HDL, and smoking history. Factors associated with fewer events are female sex and intensive blood pressure treatment (the trial's intervention). Application of RE-ViPPs to trials, registries, and health record databases could reduce the cost, time, ownership, and de-identification burdens hindering open data by encouraging dissemination of aggregate, richly cross-tabulated real data that investigators can use to construct virtual patients and make meaningful conclusions.
doi:10.1101/308403 fatcat:2u4ysne7n5aj5cte7civsnntwm
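
The abstract describes generating virtual patients from aggregate cross-tabulated data with sequential linear regressions and random number generation, but gives no algorithmic detail. The following is a minimal sketch of one way such a procedure could work, not the author's implementation. It assumes binary (0/1) variables, and it assumes the aggregate table supplies each variable's prevalence plus every pairwise co-occurrence proportion. The function names, the variable ordering, and the numbers in MEANS and JOINT are hypothetical illustrations and are not SPRINT data.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def generate_virtual_population(means, joint, n, rng):
    """Draw n virtual patients as lists of 0/1 values.

    means[i]    = P(X_i = 1), taken from the aggregate table
    joint[i][j] = P(X_i = 1 and X_j = 1), taken from the cross-tabulation
    """
    k = len(means)
    # Covariances implied by the aggregates; Bernoulli variance on the diagonal.
    cov = [[means[i] * (1 - means[i]) if i == j
            else joint[i][j] - means[i] * means[j]
            for j in range(k)] for i in range(k)]
    # Linear-regression coefficients of variable j on variables 0..j-1,
    # obtained from the covariances alone (no individual-level data needed).
    betas = [[]]
    for j in range(1, k):
        a = [row[:j] for row in cov[:j]]
        b = [cov[i][j] for i in range(j)]
        betas.append(solve_linear(a, b))
    population = []
    for _ in range(n):
        patient = []
        for j in range(k):
            # Predicted probability given the values already drawn.
            p = means[j] + sum(bt * (patient[i] - means[i])
                               for i, bt in enumerate(betas[j]))
            p = min(1.0, max(0.0, p))  # keep the linear prediction a probability
            patient.append(1 if rng.random() < p else 0)
        population.append(patient)
    return population


# Hypothetical aggregates for three binary factors (illustrative only).
MEANS = [0.30, 0.40, 0.10]
JOINT = [[0.30, 0.15, 0.05],
         [0.15, 0.40, 0.06],
         [0.05, 0.06, 0.10]]

population = generate_virtual_population(MEANS, JOINT, 50_000, random.Random(0))
```

Under these assumptions, each variable is drawn in turn from a probability predicted linearly from the variables drawn before it, so the virtual population reproduces the supplied prevalences and pairwise co-occurrences on average. Higher-order interactions are not captured by this sketch, and handling multi-level categorical variables would need an extension that the abstract does not specify.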