Low-level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization

Benjamin Bolstad, Terence Speed
2004 unpublished
Microarray experiments are currently widely applied in many areas of biomedical research. The Affymetrix GeneChip® system is a commercial high-density oligonucleotide microarray platform which measures gene expression using hundreds of thousands of 25-mer oligonucleotide probes. This dissertation addresses how probe intensity data from GeneChips® are processed to produce gene expression values and shows how better pre-processing leads to gene expression measures that, after further analysis, yield biologically meaningful conclusions. An ideal expression measure is one which is both precise and accurate. A three-stage procedure for producing an expression measure is proposed. For each of the three stages, background correction, normalization and summarization, numerous methods are developed and assessed using spike-in datasets. Bias and variance criteria are used to compare the different methods of producing expression values. The methods are also judged by how well they identify the differential genes. The background correction method has a significant effect on both the bias, which is reduced, and the variability, which is usually increased. Non-linear normalization methods are found to reduce the non-biological variability between multiple arrays without introducing any significant bias. Robust multi-chip linear models are found to fit the data well and provide the recommended summarization method. The summarization methodology is extended to produce test statistics for determining differential genes. These test statistics perform favorably at correctly detecting differential genes when compared with alternative methods based on expression values. Finally, using case study data, no statistical benefit is found for using arrays hybridized with mRNA from a pool rather than from a single biological source.
fatcat:4jyotm62hjfavdk6w6ggxgigse