Automatic Breast Cancer Survivor Detection from Social Media for Studying Latent Factors Affecting Treatment Success [article]

Abeed Sarker, Mohammed Ali Al-Garadi, Yuan-Chi Yang, Sahithi Lakamana, Jie Lin, Sabrina Li, Angel Xie, Whitney Hogg-Bremer, Mylin Torres, Imon Banerjee, Abeed Sarker
2020 medRxiv   pre-print
Breast cancer patients often discontinue their long-term treatments, such as hormone therapy, increasing the risk of cancer recurrence. These discontinuations are often caused by adverse patient-centered outcomes (PCOs) due to hormonal drug side effects or other factors. PCOs are not detectable through laboratory tests and are sparsely documented in electronic health records. Thus, there is a need to explore other sources of information for PCOs associated with breast cancer treatments. Social
more » ... edia is a promising resource, but extracting true PCOs from it first requires the accurate detection of breast cancer patients. We describe a natural language processing (NLP) architecture for automatically detecting breast cancer patients from Twitter based on their self-reports. The architecture employs breast cancer-related keywords to collect streaming data from Twitter, applies NLP patterns to pre-filter noisy posts, and then employs a machine learning classifier trained using manually-annotated data (n=5019) for distinguishing firsthand self-reports of breast cancer from other tweets. A classifier based on bidirectional encoder representations from transformers (BERT) showed human-like performance and achieved F1-score of 0.857 (inter-annotator agreement: 0.845; Cohen's kappa) for the positive class, considerably outperforming the next best classifier--a deep neural network (F1-score: 0.665). Qualitative analyses of posts from automatically-detected users revealed discussions about side effects, non-adherence, and mental health conditions, illustrating the feasibility of our social media-based approach for studying breast cancer-related PCOs from a large population.
doi:10.1101/2020.05.17.20104778 fatcat:vwncebqdvbbh3cu25m2v2atnq4