Evaluation of a Deidentification (De-Id) Software Engine to Share Pathology Reports and Clinical Documents for Research

Dilip Gupta, Melissa Saul, John Gilbertson
2004 American Journal of Clinical Pathology  
A b s t r a c t We evaluated a comprehensive deidentification engine at the University of Pittsburgh Medical Center (UPMC), Pittsburgh, PA, that uses a complex set of rules, dictionaries, pattern-matching algorithms, and the Unified Medical Language System to identify and replace identifying text in clinical reports while preserving medical information for sharing in research. In our initial data set of 967 surgical pathology reports, the software did not suppress outside (103) , UPMC (47), and
more » ... non-UPMC (56) accession numbers; dates (7); names (9) or initials (25) of case pathologists; or hospital or laboratory names (46). In 150 reports, some clinical information was suppressed inadvertently (overmarking). The engine retained eponymic patient names, eg, Barrett and Gleason. In the second evaluation (1,000 reports), the software did not suppress outside (90) or UPMC (6) accession numbers or names (4) or initials (2) of case pathologists. In the third evaluation, the software removed names of patients, hospitals (297/300), pathologists (297/300), transcriptionists, residents and physicians, dates of procedures, and accession numbers (298/300). By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information. Collaboration between pathology domain experts and system developers and continuous quality assurance are needed to optimize ongoing deidentification processes.
doi:10.1309/e6k3-3gbp-e5c2-7fyu pmid:14983930 fatcat:r4vocaer5zbcjarbgik7dbbnfe