Using logical entailments of gene annotations for biological discovery [article]

William Baumgartner
2021
Enrichment analysis is the primary method biologists use for the initial interpretation of genome-scale experimental data. With the hallmark of improved explanatory power through complexity reduction, knowledge base-driven enrichment analysis is used ubiquitously in the biomedical community to lend insight into the biological mechanisms underlying complex biological phenomena. By combining statistical reasoning approaches common to biology with the powerful deductive reasoning ... offered by description logics, the work presented in this thesis significantly advances the state of the art of knowledge-based enrichment analysis. We present several methodologies that, when used collectively, vastly increase available gene annotations in both number and type. Using the maturing community of biomedical ontologies, we demonstrate that with careful consideration it is possible to integrate a large portion of the Open Biomedical Ontologies while maintaining logical soundness. Our method takes advantage of available GO and phenotype ontology annotations and uses the principle of deductive entailment to mine this integrated set of ontologies to produce novel, high-quality annotations to a variety of biomedical ontologies previously not annotated to genes. Taking advantage once again of the logical definitions integrating the ontologies, our method improves on the lists of enriched concepts typically returned by many tools by enabling the return of enriched modules of biology. Interconnected modules of enriched concepts afford the researcher larger pieces of biology to incorporate into their hypotheses. Novel gene annotations are validated quantitatively through an intrinsic analysis that evaluates entailed gene annotations against experimentally verified protein localization data as well as curated gene-chemical interactions. Overall performance is gauged extrinsically through retrospective analyses of previously published research as well as the analysis of a number of targeted gene lists. Our methodology overcomes clear limitations of previous approaches and is complementary to many of the recent enrichment efforts that have begun to integrate disparate data types.
Our method responds to past calls for enrichment methodologies to incorporate more than just the Gene Ontology, and in doing so we have addressed a number of the current challenges that face the field of contemporary enrichment analysis. Given that integration of ontologies by the biomedical community through the use of logical definitions is an ongoing process, the utility of our methodology will only improve over time, thus enabling a more comprehensive, intuitive, and adaptable resource to help biologists better interpret and understand their genome-scale experimental data.

The form and content of this abstract are approved. I recommend its publication. Approved: Lawrence E. Hunter

To my wife, Heather, for being my partner; for your unending love, support, and encouragement that fueled me to finish; for selflessly taking on our boys single-handedly for the past six months. The work herein is very much a joint effort and would not have been possible without you.

To my boys, Billy and James, for providing brief breaks of normalcy during the final push, for your surprising patience and understanding of why I haven't been able to throw the football with you as much as both of us would have liked
doi:10.25677/1fsc-th88 fatcat:hhdrrlxdvjbpdm53y3w2g64axu