Peer Review #1 of "HPO2GO: prediction of human phenotype ontology term associations for proteins using cross ontology annotation co-occurrences (v0.2)" [peer_review]

I Kahanda
2018 unpublished
Analysing the relationships between biomolecules and genetic diseases is a highly active area of research, where the aim is to identify the genes and gene products that cause a particular disease through functional changes originating from mutations. Biological ontologies are frequently employed in these studies, providing researchers with extensive opportunities for knowledge discovery through computational data analysis. In this study, a novel approach is proposed for the identification of relationships between biomedical entities by automatically mapping HPO terms, which define phenotypic abnormalities, to GO terms, which define biomolecular functions, where each association indicates the occurrence of the abnormality due to the loss of the biomolecular function expressed by the corresponding GO term. The proposed HPO2GO mappings were extracted by calculating the frequency of the co-annotations of the terms on the same genes/proteins, using existing curated HPO and GO annotation sets. This was followed by filtering out the unreliable mappings that could be observed by chance, through statistical resampling of the co-occurrence similarity distributions. Furthermore, the biological relevance of the finalized mappings was discussed over selected cases, using the literature. The resulting HPO2GO mappings can be employed in different settings to predict and analyse novel gene/protein - ontology term - disease relations. As an application of the proposed approach, HPO term - protein associations (i.e., HPO2protein) were predicted. To test the predictive performance of the method on a quantitative basis, and to compare it with the state of the art, the CAFA2 challenge HPO prediction target protein set was employed. The results of the benchmark indicated the potential of the proposed approach, as HPO2GO performed better than all of the models from 38 participating groups (with Fmax = 0.402), by a margin of 12.6% compared to the top performer. The automated cross-ontology mapping approach developed in this work may be extended to other ontologies as well, to identify unexplored relation patterns at the systemic level. The datasets, results and the source code of HPO2GO are available for download at: https://github.com/cansyl/HPO2GO.
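The co-annotation counting step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in the repository linked above). The function name `co_annotation_counts`, the `(gene, term)` input layout and the gene names are assumptions made for illustration, and the term pairings in the toy data are arbitrary. The similarity measure and the resampling-based filter mentioned in the abstract are not specified here and are therefore not reproduced.

```python
from collections import Counter, defaultdict
from itertools import product


def co_annotation_counts(hpo_annotations, go_annotations):
    """Count, for each (HPO term, GO term) pair, the genes annotated with both.

    Both inputs are iterables of (gene, term) tuples (an assumed layout).
    """
    hpo_by_gene = defaultdict(set)
    for gene, hpo_term in hpo_annotations:
        hpo_by_gene[gene].add(hpo_term)

    go_by_gene = defaultdict(set)
    for gene, go_term in go_annotations:
        go_by_gene[gene].add(go_term)

    counts = Counter()
    # Only genes that carry at least one HPO and one GO annotation contribute.
    for gene in hpo_by_gene.keys() & go_by_gene.keys():
        for pair in product(hpo_by_gene[gene], go_by_gene[gene]):
            counts[pair] += 1
    return counts


# Toy data: hypothetical genes, arbitrary pairings (not biological claims).
toy_hpo = [("GENE1", "HP:0001631"), ("GENE2", "HP:0001631"), ("GENE3", "HP:0001631")]
toy_go = [
    ("GENE1", "GO:0016887"),
    ("GENE2", "GO:0016887"),
    ("GENE2", "GO:0005975"),
    ("GENE4", "GO:0016020"),
]
toy_counts = co_annotation_counts(toy_hpo, toy_go)
```

In this toy input, GENE3 has no GO annotation and GENE4 has no HPO annotation, so neither contributes to any pair. A filter for chance co-occurrences would operate on counts of this kind.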
PeerJ reviewing PDF | Manuscript to be reviewed
ABSTRACT
Analysing the relationships between biomolecules and genetic diseases is a highly active area of research, where the aim is to identify the genes and gene products that cause a particular disease through functional changes originating from mutations. Biological ontologies are frequently employed in these studies, providing researchers with extensive opportunities for knowledge discovery through computational data analysis. In this study, a novel approach is proposed for the identification of relationships between biomedical entities by automatically mapping HPO terms, which define phenotypic abnormalities, to GO terms, which define biomolecular functions, where each association indicates the occurrence of the abnormality due to the loss of the biomolecular function expressed by the corresponding GO term. The proposed HPO2GO mappings were extracted by calculating the frequency of the co-annotations of the terms on the same genes/proteins, using existing curated HPO and GO annotation sets. This was followed by filtering out the unreliable mappings that could be observed by chance, through statistical resampling of the co-occurrence similarity distributions. Furthermore, the biological relevance of the finalized mappings was discussed over selected cases, using the literature. The resulting HPO2GO mappings can be employed in different settings to predict and analyse novel gene/protein - ontology term - disease relations. As an application of the proposed approach, HPO term - protein associations (i.e., HPO2protein) were predicted. To test the predictive performance of the method on a quantitative basis, and to compare it with the state of the art, the CAFA2 challenge HPO prediction target protein set was employed.
The results of the benchmark indicated the potential of the proposed approach, as HPO2GO performance was among the best (Fmax = 0.35). The automated cross-ontology mapping approach developed in this work may be extended to other ontologies as well, to identify unexplored relation patterns at the systemic level. The datasets, results and the source code of HPO2GO are available for download at: https://github.com/cansyl/HPO2GO.
1. INTRODUCTION AND BACKGROUND
Systematic definition of biomedical entities (e.g., diseases, abnormalities, symptoms, traits, gene and protein attributes, activities, functions, etc.) is crucial for computational studies in biomedicine. Ontological systems, composed of standardized controlled vocabularies, are employed for this purpose. The Human Phenotype Ontology (HPO) system annotates disease records (i.e., terms and definitions about diseases together with related information) with a standardized phenotypic vocabulary (Robinson et al., 2008; Köhler et al., 2017). HPO is composed of five independent sub-ontologies, namely phenotypic abnormality (i.e., the main sub-ontology defining the basic qualities of diseases), mode of inheritance (i.e., annotates diseases in terms of Mendelian or non-Mendelian principles), mortality / aging (i.e., information related to age of death due to the corresponding disease), frequency (i.e., frequency of the disease in a patient cohort) and clinical modifier (i.e., additional disease characterization such as lethality, severity, etc.). Within each sub-ontology, all terms are related to each other with a parent-child relationship, where each child term defines a specific aspect of its parent. HPO has a directed acyclic graph (DAG) structure. The sources of the disease information in HPO are the Orphanet (Rath et al., 2012), DECIPHER (Firth et al., 2009), and OMIM (Amberger et al., 2014) databases.
Each term in the phenotypic abnormality sub-ontology defines a specific type of abnormality encountered in human diseases (e.g., HP:0001631 - atrial septal defect). The generation of HPO terms (and their associations with diseases) is carried out via both manual curation efforts and automated procedures (e.g., text mining). The curation is usually done by experts, who review the relevant publications along with the disease-centric information at various biomedical data resources. For each association between a disease term and an HPO term, there is an evidence code tag to specify the source of the information (i.e., curated or automated). The evidence codes used in HPO are IEA (inferred from electronic annotation), PCS (published clinical study), ICE (individual clinical experience), ITM (inferred by text mining) and TAS (traceable author statement). As of January 2018, the growing library of HPO contains nearly 12,000 phenotype terms, providing more than 123,000 annotations to 7,000 different rare (mostly Mendelian) diseases and the newly added 132,000 annotations to 3,145 common diseases (Groza et al., 2015). A long-term goal of the HPO project is for the system to be adopted for clinical diagnostics. This will both provide a standardized approach to medical diagnostics and present structured, machine-readable biomedical data for the development of novel computational methods. Apart from phenotype-disease associations, which are the main aim of the HPO project, HPO also provides phenotype-gene associations by using the known rare disease - gene relations (i.e., information in the form of: "certain mutation(s) in Gene X causes the hereditary Disease Y"), taken directly from the above-mentioned disease-centric resources (e.g., Orphanet and OMIM).
The disease-gene associations in the source databases are produced by expert curation from the publications of clinical molecular studies. The associations between HPO terms and biomolecules, together with the downstream analysis of these associations, help in disease gene identification and prioritization (Köhler et al., 2009). With the mapping of phenotypes to human genes, HPO currently (January 2018) provides 122,166 annotations between 3,698 human genes and 6,729 HPO terms.
The Gene Ontology (GO) is an ontological system to define gene/protein attributes with an extensive controlled vocabulary (GO Consortium, 2014). Each GO term defines a unique aspect of biomolecular attributes. Similar to other ontological systems, GO has a directed acyclic graph (DAG) structure, where terms are related to each other mostly with "is_a" or "part_of" relationships. GO is composed of three categories (i.e., aspects), according to the type of gene product / protein attribute defined: (i) molecular function - MF (i.e., the fundamental function of the protein at the molecular level; e.g., GO:0016887 - ATPase activity), (ii) biological process - BP (i.e., the high-level process in which the protein plays a role; e.g., GO:0005975 - carbohydrate metabolic process), and (iii) cellular component - CC (i.e., the subcellular location where the protein carries out its intended activity; e.g., GO:0016020 - membrane). Similar to the other ontological systems, the basic way of annotating a gene or protein with a GO term is manual curation by reviewing the relevant literature. GO also employs the concept of "evidence codes", where all annotations are labelled with descriptions indicating the quality of the source information used for the annotation (e.g., ECO:0000006 - experimental evidence, ECO:0000501 - IEA: evidence used in automatic assertion).
The UniProt-GOA (Gene Ontology Annotation) database (Huntley et al., 2015) houses an extensive collection of GO annotations for the records of the UniProt protein sequence and annotation knowledgebase. In the UniProtKB/Swiss-Prot database (i.e., housing manually reviewed protein entries with highly reliable annotation) version 2018_02, there are a total of 2,850,015 GO term annotations for 529,941 protein records; whereas in the UniProtKB/TrEMBL database (i.e., housing mostly electronically translated uncharacterized protein entries) version 2018_02, there are a total of 189,560,296 GO term annotations for 67,760,658 protein records. Most of the annotations for the UniProtKB/TrEMBL database entries are produced by automated predictions (UniProt Consortium, 2017).
Due to the high volume of experimental research that (i) discovers new associations between biomolecules and ontological terms, and (ii) produces completely new and uncharacterized gene/protein sequences, curation efforts are having a hard time keeping up with the annotation process. To aid manual curation efforts, automated computational methods come into play. These computational methods exploit the approaches and techniques widely used in the fields of data mining, machine learning and statistics to produce probabilistic associations between biomedical entities. The Critical Assessment of Functional Annotation (CAFA) challenge (Radivojac et al., 2013; Jiang et al., 2016) aims to evaluate the automated methods that produce GO and HPO term association predictions for protein entries, on standard temporal hold-out benchmarking datasets.
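The abstract reports benchmark performance as Fmax, the measure commonly used in CAFA evaluations. As a rough illustration, the sketch below computes a protein-centric Fmax as it is usually defined. At each score threshold, precision is averaged over the proteins that have at least one prediction at or above that threshold, recall is averaged over all benchmark proteins, and Fmax is the largest resulting F-measure. The function name `fmax`, the dictionary layout and the toy protein and term names are assumptions for illustration. Propagation of terms to their ontology ancestors and the official CAFA evaluation modes are omitted, so this is not the assessment code used in the challenge.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax over a range of score thresholds.

    predictions: {protein: {term: score in [0, 1]}}
    truth: {protein: non-empty set of true terms} for every benchmark protein
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]  # 0.01, 0.02, ..., 1.00

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, score in scores.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at this threshold.
                precisions.append(hits / len(predicted))
            # Recall counts every benchmark protein, predicted or not.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy benchmark with hypothetical proteins and terms.
toy_truth = {"P1": {"a", "b"}, "P2": {"c"}}
toy_predictions = {"P1": {"a": 0.9, "x": 0.4}, "P2": {"c": 0.6}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

In the toy data the best trade-off occurs at thresholds above 0.40 and up to 0.60, where the false positive "x" is dropped while "c" is still predicted for P2.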
doi:10.7287/peerj.5298v0.2/reviews/1 fatcat:fzb4whi75vgknkteumlbsqag74