Biological Sequences Encoding for Supervised Classification [chapter]

Rabie Saidi, Mondher Maddouri, Engelbert Mephu Nguifo
Bioinformatics Research and Development  
The classification of biological sequences is one of the significant challenges in bioinformatics, for protein sequences as well as for nucleic acid sequences. The sheer volume of these data, their ambiguity and, above all, the high cost of in vitro analysis in terms of time and money make the use of data mining a necessity rather than a rational choice. However, data mining techniques, which often process data in a relational format, are confronted with the inappropriate format of biological sequences. Hence, a pre-processing step is unavoidable. This work presents the encoding of biological sequences as a preparation step before their classification. We present three existing encoding methods based on motif extraction. We also propose an improvement to one of these methods and carry out a comparative study that takes into account not only the effect of each method on classification accuracy but also the number of generated attributes and the CPU time.
doi:10.1007/978-3-540-71233-6_18 dblp:conf/bird/SaidiMN07 fatcat:7qeinrcpzfco7ethw5z3a53yqa