Analysis and prediction of the effects of single amino acid variants in human disease

Christopher Michael Yates, Michael Sternberg, Medical Research Council (Great Britain)
Over the past fifty years, the genetic bases for many human diseases have been discovered. Genome-wide association studies (GWAS) have pinpointed genetic loci containing disease-associated variants, which can be probed in more detail using targeted DNA sequencing and imputation. The falling cost of DNA sequencing has led to the adoption of whole-genome and whole-exome sequencing, enabling more of the genome to be investigated. However, with the masses of data being produced, it is vital to ... tools to identify truly interesting disease-associated variation. This thesis describes the development of three tools to help researchers identify and characterise disease-associated genetic variants. Whole-genome sequencing data contain many reported variants that may be due to errors in sequencing, read-mapping or variant calling. It is therefore vital to filter out these incorrect variants. Chapter 6 describes the development of an ensemble machine learning method for this filtering, which is able to remove over 80% of incorrect variants at the expense of under 4% of true variation. Single amino acid variants (SAVs) are one of the best-studied groups of variants in human disease, and there are a number of tools available to predict whether an SAV will be deleterious. SuSPect (disease-susceptibility–based SAV phenotype predictor) outperforms other tested methods, thanks to its incorporation of information from protein-protein interaction (PPI) networks. Finally, knowing that a variant is potentially damaging to a protein's function does not tell a researcher whether or not it will cause the disease of interest, and even healthy human exomes contain many SAVs predicted to be deleterious. PriMe SuSPect (prioritisation method using SuSPect) is an extension to SuSPect that provides disease-specific scores for over 5,000 diseases by leveraging information from PPI networks. This enables researchers to determine which SAVs across an entire exome are likely to be involved in their specific disease of interest.
doi:10.25560/44953 fatcat:bc6pz3z2vfhftk74sw5jh4vpe4