Predicting enhancers using a small subset of high confidence examples and co-training [post]

Matthew R Huska, Anna Ramisch, Martin Vingron, Annalisa Marsico
2016 unpublished
Enhancers are important regulatory regions located throughout the genome, primarily in non-coding regions. Several experimental methods have been developed in recent years to identify their locations, but the search space is large and the overlap between the putative enhancers identified using these methods tends to be very small. Computational methods for enhancer prediction often use one large set of experimentally identified enhancer regions as input, and therefore rely critically on their correctness. We chose to take a different approach, and start with a high-confidence set of 21 enhancers that are in the intersection of enhancers identified using three completely unrelated experimental approaches: deepCAGE, HiCap and classical enhancer reporter assays. Because this starting set is so small, we use a semi-supervised approach called co-training rather than a fully supervised approach to progressively predict enhancers from unlabeled regions. Using this approach we are able to outperform supervised learning as well as simpler semi-supervised learning methods and achieve an average area under the ROC curve of 0.84.
doi:10.7287/peerj.preprints.2407 fatcat:y3dha3pdfnc67jconpks5g4k44
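
The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which feature views or classifiers were used, so the two views, the nearest-centroid classifier, the synthetic data, and every function name below (`make_data`, `fit`, `score`, `co_train`) are assumptions chosen only to show the general loop, in which two classifiers trained on different views of the same regions take turns labeling the unlabeled examples they are most confident about.

```python
import random


def make_data(n_per_class=20, seed=0):
    """Synthetic stand-in for two feature views of the same regions (assumption)."""
    rng = random.Random(seed)
    truth, view_a, view_b = [], [], []
    for c in (0, 1):
        mu = 3.0 if c == 1 else -3.0
        for _ in range(n_per_class):
            truth.append(c)
            view_a.append([rng.gauss(mu, 0.5), rng.gauss(mu, 0.5)])
            view_b.append([rng.gauss(mu, 0.5), rng.gauss(mu, 0.5)])
    return view_a, view_b, truth


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(view, labels):
    """Nearest-centroid model for one view: {class: centroid of labeled rows}."""
    return {
        c: centroid([view[i] for i, y in labels.items() if y == c])
        for c in (0, 1)
    }


def score(model, x):
    """Signed margin: positive favours class 1; magnitude acts as confidence."""
    return sq_dist(x, model[0]) - sq_dist(x, model[1])


def co_train(view_a, view_b, seed_labels, rounds=20, per_round=2):
    """Grow a small labeled set by alternating between two view-specific models.

    seed_labels maps example index -> class and must contain both classes.
    """
    labels = dict(seed_labels)
    for _ in range(rounds):
        for view in (view_a, view_b):
            unlabeled = [i for i in range(len(view)) if i not in labels]
            if not unlabeled:
                return labels
            model = fit(view, labels)
            # Most confident unlabeled examples according to this view only.
            ranked = sorted(
                unlabeled, key=lambda i: abs(score(model, view[i])), reverse=True
            )
            for i in ranked[:per_round]:
                # The new label is shared, so the other view trains on it next.
                labels[i] = 1 if score(model, view[i]) > 0 else 0
    return labels


if __name__ == "__main__":
    view_a, view_b, truth = make_data()
    seeds = {0: 0, 1: 0, 20: 1, 21: 1}  # tiny high-confidence starting set
    labels = co_train(view_a, view_b, seeds)
    agree = sum(labels[i] == truth[i] for i in labels)
    print(f"{agree} of {len(labels)} labels match the synthetic ground truth")
```

The sketch labels by the sign of the margin and picks the most confident examples overall; a variant that adds a fixed number of positives and negatives per round would keep the classes balanced, which may matter when the seed set is as small as the 21 enhancers described above.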