De-Identifying Swedish EHR Text Using Public Resources in the General Domain

Taridzo Chomutare, Kassaye Yitbarek Yigzaw, Andrius Budrionis, Alexandra Makhlysheva, Fred Godtliebsen, Hercules Dalianis
2020 Studies in Health Technology and Informatics  
Sensitive data is normally required to develop rule-based models or to train machine learning-based models for de-identifying electronic health record (EHR) clinical notes, and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data: (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model using recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed that precision and recall improved from 55.62% and 80.02%, respectively, with the base EHR embedding layer to 85.01% and 87.15% when Wikipedia word vectors are added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text, which could be useful in cases where the data is sensitive and the language is low-resource.
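
For readers less familiar with the two reported metrics, the following is a minimal sketch of how precision and recall can be computed for a de-identification tagger. It assumes token-level, binary labels (sensitive or not); the abstract does not say whether the study scored tokens or whole entities, and the function name `precision_recall` and the toy labels are hypothetical, not taken from the paper.

```python
def precision_recall(gold, predicted):
    """Token-level precision and recall for a binary 'sensitive' label.

    Assumption: one boolean per token, True meaning the token is
    sensitive and should be de-identified.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # sensitive, flagged
    fp = sum(1 for g, p in pairs if not g and p)    # harmless, flagged
    fn = sum(1 for g, p in pairs if g and not p)    # sensitive, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels invented for illustration; not data from the study.
gold = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

In a de-identification setting, recall is usually the more privacy-critical of the two, because every false negative is a sensitive token left in the text, whereas low precision mainly costs readability by masking harmless tokens.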
doi:10.3233/shti200140 pmid:32570364 fatcat:7o3gyfiuhzen5cqjuwncnxmsq4