A COMPARATIVE STUDY ON GENE SELECTION METHODS FOR TISSUES CLASSIFICATION ON LARGE SCALE GENE EXPRESSION DATA

Farzana Kabir Ahmad
2016 Jurnal Teknologi  
Abstract
Deoxyribonucleic acid (DNA) microarray technology is a recent invention that provides enormous opportunities to measure a large number of gene expressions simultaneously. However, interpreting large-scale gene expression data remains challenging because of its inherent "high dimensional, low sample size" nature. Microarray data typically involve thousands of genes, n, measured on a very small number of samples, p, which complicates the analysis. For this reason, feature selection methods, also known as gene selection methods, have become necessary to select significant genes that offer the maximum discriminative power between cancerous and normal tissues. Feature selection methods fall into three basic groups: a) filter methods; b) wrapper methods; and c) embedded methods. Among these, filter gene selection methods provide an easy way to identify informative genes and can simply reduce large-scale microarray datasets. Although filter-based gene selection techniques are commonly used in analyzing microarray datasets, they have been tested separately in different studies. Therefore, this study investigates and compares the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise Ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test, in selecting informative genes that can distinguish cancerous from normal tissues. In this experiment, a common classifier, the Support Vector Machine (SVM), is trained on the selected genes. The gene selection methods are tested on three large-scale gene expression datasets: a breast cancer dataset, a colon dataset, and a lung dataset. The study found that IG and SNR are more suitable for use with SVM. Furthermore, SVM performance remained largely unaffected unless a very small number of genes was selected.
Jurnal Teknologi (Sciences & Engineering) 78:5-10 (2016) 116-125
Keywords: DNA microarray, gene selection, classification, feature selection, filter-based gene selection methods
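The abstract names the filter scores but does not state their formulas. The sketch below ranks genes with commonly used textbook definitions of three of them (SNR, Fisher Criterion, and a Welch-style t statistic), which are assumptions here and may differ from the exact variants used in the paper. The function names, the toy expression values, and the use of sample standard deviation are all illustrative, and Information Gain is omitted because it depends on a discretization scheme the abstract does not specify.

```python
from math import sqrt
from statistics import mean, stdev, variance


def split_by_class(values, labels):
    """Split one gene's expression values into class 1 and class 0 samples."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def snr(values, labels):
    """Signal-to-Noise Ratio (assumed form): (mu1 - mu0) / (sd1 + sd0)."""
    pos, neg = split_by_class(values, labels)
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher(values, labels):
    """Fisher Criterion (assumed form): (mu1 - mu0)^2 / (var1 + var0)."""
    pos, neg = split_by_class(values, labels)
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))


def welch_t(values, labels):
    """t statistic with unequal variances (assumed variant of the t-Test)."""
    pos, neg = split_by_class(values, labels)
    pooled = variance(pos) / len(pos) + variance(neg) / len(neg)
    return (mean(pos) - mean(neg)) / sqrt(pooled)


def top_k_genes(expr, labels, score, k):
    """Rank genes by the absolute value of a filter score and keep the top k."""
    ranked = sorted(expr, key=lambda g: abs(score(expr[g], labels)), reverse=True)
    return ranked[:k]


# Toy data (hypothetical): 1 = cancerous tissue, 0 = normal tissue.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "gene_a": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "gene_b": [2.0, 4.0, 3.0, 3.0, 2.0, 4.0],
    "gene_c": [1.0, 1.5, 2.0, 4.0, 4.5, 5.0],
}

selected = top_k_genes(expr, labels, snr, k=2)
print(selected)
```

In a pipeline like the one the abstract describes, the selected gene subset would then be passed to a classifier such as an SVM. The scores above assume each class has at least two samples and nonzero variance; a real implementation would need to guard against division by zero.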
doi:10.11113/jt.v78.8843 fatcat:kv44eunb4bgkhpovq2vchi7iny