Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

Thomas Vakili, Hercules Dalianis
2021 AAAI Fall Symposia  
Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with conditions. The sentence sets are then compared to determine if there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.
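The abstract compares two decoding strategies for generating sentences from the model. As a rough illustration of the difference, the sketch below (not from the paper; names and the toy distribution are invented for the example) filters a token probability distribution with top-k sampling, which keeps a fixed number of candidates, and with nucleus (top-p) sampling, which keeps the smallest set of candidates covering a cumulative probability mass:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    idx = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[idx] = probs[idx]
    return out / out.sum()

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # include the token crossing the threshold
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Toy next-token distribution over five tokens.
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_filter(probs, 2))      # always exactly 2 candidates survive
print(nucleus_filter(probs, 0.8))  # candidate count adapts to the distribution
```

Unlike top-k, the nucleus cutoff adapts per step: a peaked distribution yields few candidates, a flat one yields many, which is why the two methods can differ in the linguistic quality of generated text.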