Tackling class imbalance and data scarcity in literature-based gene function annotation

Mathieu Blondel, Kazuhiro Seki, Kuniaki Uehara
2011 Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval - SIGIR '11
In recent years, a number of machine learning approaches to literature-based gene function annotation have been proposed. However, due to issues such as lack of labeled data, class imbalance and computational cost, they have usually been unable to surpass simpler approaches based on string-matching. In this paper, we propose a principled machine learning approach based on kernel classifiers. We show that kernels can address the task's inherent data scarcity by embedding additional knowledge and
... we propose a simple yet effective solution to deal with class imbalance. In experiments on the TREC Genomics Track data, our approach achieves a better F1-score than two state-of-the-art approaches based on string-matching and cross-species information.
doi:10.1145/2009916.2010080 dblp:conf/sigir/BlondelSU11 fatcat:rciumsw23fd5rauyoarkh7moai