KAOS: a new automated computational method for the identification of overexpressed genes

Angelo Nuzzo, Giovanni Carapezza, Sebastiano Di Bella, Alfredo Pulvirenti, Antonella Isacchi, Roberta Bosotti
2016 BMC Bioinformatics  
Background: Kinase over-expression and activation as a consequence of gene amplification or gene fusion events is a well-known mechanism of tumorigenesis. The search for novel rearrangements of kinases or other druggable genes may contribute to understanding the biology of carcinogenesis, as well as lead to the identification of new candidate targets for drug discovery. However, this requires the ability to query large datasets to identify rare events occurring in very small fractions (1-3%) of different
tumor subtypes. This task differs from what conventional tools normally do, which is to find genes differentially expressed between two experimental conditions.
Results: We propose a computational method aimed at the automatic identification of genes which are selectively over-expressed in a very small fraction of samples within a specific tissue. The method does not require a healthy counterpart or a reference sample for the analysis and can therefore also be applied to transcriptional data generated from cell lines. In our implementation the tool can use gene-expression data from microarray experiments, as well as data generated by RNA-Seq technologies.
Conclusions: The method was implemented as a publicly available, user-friendly tool called KAOS (Kinase Automatic Outliers Search). The tool enables the automatic execution of iterative searches for the identification of extreme outliers and the graphical visualization of the results. Filters can be applied to select the most significant outliers. The performance of the tool was evaluated using a synthetic dataset and compared to state-of-the-art tools. KAOS performs particularly well in detecting genes that are overexpressed in only a few samples, or when an extreme outlier stands out against a highly variable expression background. To validate the method on real case studies, we used publicly available tumor cell line microarray data, and we were able to identify genes which are known to be overexpressed in specific samples, as well as novel ones.
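The idea of a reference-free, iterative search for extreme outliers can be illustrated with a small sketch. The abstract does not state which statistical rule KAOS applies, so the code below substitutes Tukey's outer fence (Q3 + 3 × IQR) as a stand-in; it is not the authors' algorithm. The function names (`upper_fence`, `find_extreme_outliers`, `screen_genes`), the parameters `k`, `max_rounds` and `max_fraction`, and the toy expression values are all assumptions made for illustration. The 3% cap is borrowed from the 1-3% range quoted in the Background.

```python
from statistics import quantiles


def upper_fence(values, k=3.0):
    """Tukey-style upper fence: Q3 + k * IQR (k=3 marks 'extreme' outliers)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def find_extreme_outliers(values, k=3.0, max_rounds=10):
    """Iteratively flag samples above the fence.

    After each round the fence is recomputed on the samples not yet flagged,
    so one very large value cannot mask a second, smaller outlier.
    Assumed rule for illustration only; not taken from the KAOS paper.
    """
    flagged = set()
    for _ in range(max_rounds):
        remaining = [v for i, v in enumerate(values) if i not in flagged]
        if len(remaining) < 4:  # too few samples left to estimate quartiles
            break
        fence = upper_fence(remaining, k)
        new = {i for i, v in enumerate(values)
               if i not in flagged and v > fence}
        if not new:
            break
        flagged |= new
    return sorted(flagged)


def screen_genes(expression, k=3.0, max_fraction=0.03):
    """Keep genes whose outliers occur in at most max_fraction of the samples.

    expression maps a gene name to its expression values across samples of
    one tissue. No healthy or reference sample is needed: each gene is
    compared only against its own distribution.
    """
    hits = {}
    for gene, values in expression.items():
        outliers = find_extreme_outliers(values, k)
        if outliers and len(outliers) <= max_fraction * len(values):
            hits[gene] = outliers
    return hits


# Toy data (hypothetical): 100 samples per gene, log-scale expression.
gene_a = [5.0 + (i % 10) * 0.1 for i in range(100)]
gene_a[7], gene_a[42] = 11.5, 12.0          # two samples stand out
gene_b = [i * 0.1 for i in range(100)]      # variable, but no outlier

hits = screen_genes({"GENE_A": gene_a, "GENE_B": gene_b})
print(hits)  # {'GENE_A': [7, 42]}
```

Here `GENE_A` is reported because only 2 of 100 samples exceed the fence, while `GENE_B` is ignored despite its wide spread. This mirrors the contrast with differential-expression tools, which compare two predefined groups rather than looking for rare samples inside one group.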
doi:10.1186/s12859-016-1188-1 pmid:28185541 pmcid:PMC5123341 fatcat:h5oz5n46mva7pcnn5dbmdoovna