Feature Extraction Approaches for Biological Sequences: A Comparative Study of Mathematical Models [article]

Robson Parmezan Bonidia, Lucas Dias Hiera Sampaio, Fabrício Martins Lopes, André Carlos Ponce de Leon Ferreira de Carvalho, Danilo Sipoli Sanches
2020 bioRxiv   pre-print
The number of available biological sequences has increased significantly in recent years due to various genomic sequencing projects, creating a huge volume of data. Consequently, new computational methods are needed to analyze and extract information from these sequences. Machine learning methods have shown broad applicability in computational biology and bioinformatics. The utilization of machine learning methods has helped to extract relevant information from various biological datasets.
more » ... ical datasets. However, there are still several problems that motivate new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes to study and analyze a feature extraction pipeline based on mathematical models (Numerical Mapping, Fourier, Entropy, and Complex Networks). As a case study, we analyze Long Non-Coding RNA sequences. Moreover, we divided this work into two studies, e.g., (I) we assessed our proposal with the most addressed problem in our review, e.g., lncRNA vs. mRNA; (II) we tested its generalization on different classification problems, e.g., circRNA vs. lncRNA. The experimental results demonstrated three main contributions: (1) An in-depth study of several mathematical models; (2) a new feature extraction pipeline and (3) its generalization and robustness for distinct biological sequence classification.
doi:10.1101/2020.06.08.140368 fatcat:lavdmhv3inecfnvlgwmig5ptoy