Clustering Genes Using Heterogeneous Data Sources

Erliang Zeng, Chengyong Yang, Tao Li, Giri Narasimhan
2010 International Journal of Knowledge Discovery in Bioinformatics  
Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. Such sources include protein-protein interaction data, transcription factor and regulatory element data, comparative genomics data, protein expression data, and more. These data provide us with a means to begin elucidating the large-scale modular organization of the cell.
... conclusions drawn from more than one data source is likely to lead to new insights. Data sources may be complete or incomplete, depending on whether they provide information about every gene in the genome. With a view toward a combined analysis of heterogeneous sources of data, we consider the challenging task of developing exploratory analytical techniques that deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm we developed performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, we adopted MPCK-means, a constrained clustering algorithm, to perform exploratory analysis on one complete source (such as gene expression data) together with other, potentially incomplete sources provided in the form of constraints. We show that, when integrating gene expression data and text data, the MSC algorithm produces clusters that are biologically more meaningful than those identified using only one source of data. For the constrained clustering algorithm, we studied the effectiveness of various constraint sets. To address the problem of automatically generating constraints from the biological literature, we considered two methods (cluster-based and similarity-based). The novelty of the research presented here lies in three contributions: the development of a new clustering algorithm, MSC, that performs exploratory analysis using two or more diverse but complete data sources; a study of the effectiveness of constraint sets and the robustness of the constrained clustering algorithm using multiple sources of incomplete biological data; and the incorporation of such incomplete data into the constrained clustering algorithm in the form of constraint sets.
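The abstract names a similarity-based method for generating constraints from text but does not give its details. The following is only a minimal sketch of the general idea, assuming each gene has a sparse term-weight profile derived from the literature. The function names, the cosine measure, and both thresholds are hypothetical choices made for illustration, not the authors' procedure.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_constraints(text_vectors, must_thr=0.8, cannot_thr=0.1):
    """Derive pairwise constraints from text profiles.

    text_vectors maps gene name -> {term: weight}. Genes absent from the
    mapping receive no constraints, which is how an incomplete source can
    still contribute. Both thresholds are assumed values, not taken from
    the paper.
    """
    must_link, cannot_link = [], []
    for a, b in combinations(sorted(text_vectors), 2):
        s = cosine(text_vectors[a], text_vectors[b])
        if s >= must_thr:
            must_link.append((a, b))      # very similar text: same cluster
        elif s <= cannot_thr:
            cannot_link.append((a, b))    # dissimilar text: different clusters
    return must_link, cannot_link


# Toy profiles (made-up terms and weights, for illustration only).
profiles = {
    "geneA": {"ribosome": 2.0, "translation": 1.0},
    "geneB": {"ribosome": 2.0, "translation": 1.0},
    "geneC": {"kinase": 3.0},
}
must_link, cannot_link = similarity_constraints(profiles)
```

The resulting must-link and cannot-link pairs could then be passed, alongside the complete gene expression matrix, to a constrained clustering routine such as MPCK-means.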
doi:10.4018/jkdb.2010040102 fatcat:i65e5huzurcord6yaojw44jknu