Construction of a Personal Experience Tweet Corpus for Health Surveillance

Keyuan Jiang, Ricardo Calix, Matrika Gupta
2016 Proceedings of the 15th Workshop on Biomedical Natural Language Processing  
Studies have shown that Twitter can be used for health surveillance, and personal experience tweets (PETs) are an important source of information for health surveillance. Mining Twitter data requires a relatively balanced corpus, and constructing such a corpus is challenging because annotating large data sets is labor-intensive. We developed a bootstrap method of finding PETs using a machine learning-based filter. Through a few iterations, our approach can efficiently improve the balance of a two-class dataset with a reduced amount of annotation work. To demonstrate the usefulness of our method, a PET corpus related to effects caused by 4 dietary supplements was constructed. In 3 iterations, a corpus of 8,770 tweets was obtained from 108,528 tweets collected, and the imbalance of the two classes was significantly reduced from 1:31 to 1:3. In addition, two of the three classifiers used showed improved performance over the iterations. It is conceivable that our approach can be applied to various other health surveillance studies that use machine learning-based classification of imbalanced Twitter data.
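The bootstrap idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-level log-odds scorer, the function names (`train_filter`, `looks_like_pet`, `bootstrap`), the `annotate` stand-in for human annotation, and the toy tweets are all assumptions made for illustration. The loop annotates a small seed, trains a filter on it, annotates only the tweets the filter flags as likely PETs, and repeats.

```python
import math
from collections import Counter


def train_filter(labeled):
    """Fit a tiny word-level log-odds scorer on (text, is_pet) pairs.

    Hypothetical stand-in for the machine learning-based filter.
    """
    pos, neg = Counter(), Counter()
    for text, is_pet in labeled:
        (pos if is_pet else neg).update(text.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    # Add-one smoothed log-odds per word; unseen words contribute 0 later.
    return {
        w: math.log((pos[w] + 1) / (n_pos + v))
        - math.log((neg[w] + 1) / (n_neg + v))
        for w in vocab
    }


def looks_like_pet(weights, text):
    """Flag a tweet as a likely PET when its summed log-odds is positive."""
    return sum(weights.get(w, 0.0) for w in text.lower().split()) > 0


def bootstrap(pool, annotate, seed_size, iterations):
    """Grow an annotated corpus by labeling only filter-selected tweets."""
    corpus = [(t, annotate(t)) for t in pool[:seed_size]]
    remaining = pool[seed_size:]
    for _ in range(iterations):
        weights = train_filter(corpus)
        candidates = [t for t in remaining if looks_like_pet(weights, t)]
        if not candidates:
            break
        # Only the filtered candidates are sent for (simulated) annotation.
        corpus.extend((t, annotate(t)) for t in candidates)
        chosen = set(candidates)
        remaining = [t for t in remaining if t not in chosen]
    return corpus


# Toy data invented for this sketch: 4 PET-like tweets among 20 ad-like ones.
pets = [f"i took {s} and felt {e}"
        for s in ("melatonin", "biotin") for e in ("sleepy", "dizzy")]
ads = [f"buy {s} online cheap deal{k}"
       for s in ("melatonin", "biotin") for k in range(10)]
pool = [pets[0]] + ads[:3] + pets[1:] + ads[3:]


def annotate(text):
    """Simulated human annotator (assumption): PETs start with 'i took'."""
    return text.startswith("i took")


corpus = bootstrap(pool, annotate, seed_size=4, iterations=3)
n_pet = sum(label for _, label in corpus)
print(f"pool: {len(pets)} PETs vs {len(ads)} others; "
      f"corpus: {n_pet} PETs vs {len(corpus) - n_pet} others")
```

Because only filter-selected tweets are annotated after the seed round, the annotated corpus ends up far less skewed than the raw pool, which is the effect the abstract reports at scale (1:31 reduced to 1:3).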
doi:10.18653/v1/w16-2917 dblp:conf/bionlp/JiangCG16 fatcat:k7njlnekkfgu3cobdy3p4qafxm