Functional annotation and network reconstruction through cross-platform integration of microarray data

Xianghong Jasmine Zhou, Ming-Chih J Kao, Haiyan Huang, Angela Wong, Juan Nunez-Iglesias, Michael Primig, Oscar M Aparicio, Caleb E Finch, Todd E Morgan, Wing Hung Wong
2005 Nature Biotechnology  
The rapid accumulation of microarray data translates into a need for methods to effectively integrate data generated with different platforms. Here we introduce an approach, 2nd-order expression analysis, that addresses this challenge by first extracting expression patterns as meta-information from each data set (1st-order expression analysis) and then analyzing them across multiple data sets. Using yeast as a model system, we demonstrate two distinct advantages of our approach: we can identify genes of the same function yet without coexpression patterns and we can elucidate the cooperativities between transcription factors for regulatory network reconstruction by overcoming a key obstacle, namely the quantification of activities of transcription factors. Experiments reported in the literature and performed in our lab support a significant number of our predictions.
Microarray gene expression profiling is now done in many laboratories, resulting in the rapid accumulation of data in public repositories 1,2. Despite recent advances in analysis techniques, several important challenges remain. (i) There is an urgent need for methods to effectively integrate multiple microarray data sets. Gene expression values generated with different platforms (such as spotted cDNA or Affymetrix high-density oligonucleotide arrays) are not directly comparable. Even within the same technology, alternative experimental parameters result in systematic variations among data sets, often beyond the capability of statistical normalization. (ii) There is a lack of algorithms that can identify functionally related genes that do not have similar expression patterns. Most methods for functional analysis of microarray data make the implicit assumption that genes with similar expression profiles have similar functions 3,4. However, among genes involved in the same pathway, many gene pairs do not show similar expression profiles 5. (iii) The reconstruction of transcriptional regulatory networks remains the key challenge for microarray analysis. A major issue is the measurement of transcription factor activities, because changes in their expression are often subtle and their activities are often controlled at levels other than expression. This further leads to difficulties in the elucidation of cooperativity between transcription factors.
Recently, several approaches have been proposed to address some of these individual problems 5-7, yet there remains a lack of unified frameworks that can simultaneously respond to these challenges. Here we introduce an approach termed 2nd-order expression analysis, which we will show to be useful in overcoming the three aforementioned problems. We define 1st-order expression analysis as the extraction of expression patterns from one microarray data set, which contains a set of expression profiles measured under relevant conditions. We propose 2nd-order expression analysis as a study of the correlated occurrences of those expression patterns across multiple data sets measured under different types of conditions (e.g., starvation, heat shock). By first extracting expression patterns as meta-information from each data set and then analyzing them comparatively, the results are not affected by variations among data sets. This allows integration of multiple microarray data sets in a platform-independent manner. Here, we apply 2nd-order analysis to 618 yeast expression profiles comprising 39 cDNA or Affymetrix array data sets to group genes that have the same function but may not be coexpressed, to annotate their functions, to quantify the activity profiles of transcription factors and to reconstruct regulatory networks.
We illustrate 2nd-order expression analysis with a simple case, the analysis of expression patterns of coexpressed gene pairs. If a pair of genes is tightly coexpressed in multiple data sets, the genes are likely to be functionally linked. We term such gene pairs doublets. Our first objective is to find pairs of such doublets that simultaneously exhibit either high or low expression correlations across multiple data sets, that is, simultaneously turn on or off their functional links over different types of conditions.
Such a set of four genes, termed a quadruplet, is likely to be functionally related, even though the global expression profiles of those genes do not exhibit gross similarities (see an example in Fig. 1). We identify quadruplets using a two-step procedure: (i) calculate the expression correlations of the doublet in each of the data sets and store them in a vector, termed the 1st-order expression correlation profile; (ii) calculate the correlation between two 1st-order profiles to generate the 2nd-order expression correlation, and define those pairs of doublets with high 2nd-order correlations as quadruplets. Throughout the paper, an expression correlation or a
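
The two-step quadruplet procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes Pearson correlation for both steps, represents each data set as a genes-by-conditions NumPy array, and uses synthetic toy data. The function names (`first_order_profile`, `second_order_correlation`, `toy_dataset`) are hypothetical, and no cutoff for a "high" 2nd-order correlation is applied because the text here does not give one.

```python
import numpy as np

def first_order_profile(datasets, i, j):
    """Step (i): correlation of genes i and j within each data set.

    Each data set is a 2-D array (genes x conditions). Pearson correlation
    is an assumption made for this sketch.
    """
    return np.array([np.corrcoef(d[i], d[j])[0, 1] for d in datasets])

def second_order_correlation(profile_1, profile_2):
    """Step (ii): correlation between two 1st-order correlation profiles."""
    return float(np.corrcoef(profile_1, profile_2)[0, 1])

def toy_dataset(k, n_conditions=32):
    """Synthetic data set k with four genes (rows 0..3).

    In even-numbered data sets both doublets (0,1) and (2,3) are "linked";
    in odd-numbered ones neither is. Genes 0 and 2 are never coexpressed.
    """
    angle = 2 * np.pi * np.arange(n_conditions) / n_conditions
    linked = (k % 2 == 0)
    g0 = np.sin(angle)
    g1 = g0 if linked else np.cos(angle)        # cos is uncorrelated with sin
    g2 = np.sin(2 * angle)
    g3 = g2 if linked else np.cos(2 * angle)
    # A different scale and offset per data set mimics platform differences;
    # correlations within a data set are unaffected by them.
    return (k + 1.0) * np.vstack([g0, g1, g2, g3]) + 10.0 * k

datasets = [toy_dataset(k) for k in range(6)]
profile_01 = first_order_profile(datasets, 0, 1)   # doublet (gene 0, gene 1)
profile_23 = first_order_profile(datasets, 2, 3)   # doublet (gene 2, gene 3)
profile_02 = first_order_profile(datasets, 0, 2)   # genes from different doublets
second_order = second_order_correlation(profile_01, profile_23)
print(profile_01.round(3), profile_23.round(3), round(second_order, 3))
```

In this toy setting the two doublets switch their links on and off in the same data sets, so their 1st-order profiles are strongly correlated even though genes 0 and 2 show no coexpression in any single data set. Because only within-data-set correlations are carried forward, the per-data-set scale and offset do not affect the result.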
doi:10.1038/nbt1058 pmid:15654329 fatcat:hpvz6njlf5cftfeapjedzsqv4e