DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

Fabian Hausmann, Stefan Kurtz
2021 Algorithms for Molecular Biology  
Background: Repetitive elements make up a large part of eukaryotic genomes. For example, about 40 to 50% of the human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.

Results: We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. DeepGRP combines the basic concepts of Li (Bioinformatics 35:4408–4410, 2019) with the attention mechanism, a current technique developed for neural machine translation, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP, when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats annotated in the Dfam database, but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database.

Conclusions: By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as implementation framework and the use of GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
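
The abstract reports prediction quality as a Matthews correlation coefficient (MCC) over nucleotide-level annotations. As an illustration only, the following minimal Python sketch computes the MCC for the simplified binary case (repeat = 1, non-repeat = 0). The function name, the binary simplification, and the toy label vectors are assumptions made for this example; the paper's own evaluation covers several repeat classes and may be computed differently.

```python
import math


def binary_mcc(truth, pred):
    """Matthews correlation coefficient for two equal-length 0/1 label sequences.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    # Count the four cells of the confusion matrix, one position at a time.
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the matrix is empty.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy per-nucleotide labels (made up for this example): 1 = repeat, 0 = non-repeat.
truth = [0, 0, 1, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]
print(binary_mcc(truth, pred))  # 0.5
```

The MCC ranges from -1 to 1 and stays informative when repeat and non-repeat positions are unevenly represented, which is why it is a common choice for per-position annotation tasks.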
doi:10.1186/s13015-021-00199-0 fatcat:i4c5y6cm4zahzoogftd2djl4re