Evaluation of Model Fit of Inferred Admixture Proportions [article]

Genís Garcia-Erill, Anders Albrechtsen
2019 bioRxiv   pre-print
Model based methods for genetic clustering of individuals such as those implemented in structure or ADMIXTURE allow to infer individual ancestries and study population structure. The underlying model makes several assumptions about the demographic history that shaped the analyzed genetic data. One assumption is that all individuals are a result of K ancestral homogeneous populations that are all represented well in the data while another assumption is that no drift happened after the admixture
event. The histories of many real world populations do not conform to that model, and in that case taking the inferred admixture proportions at face value might be misleading. We propose a method to evaluate the fit of admixture models based on calculating the genotypes predicted by the admixture model, and obtaining the residuals as the difference between the true and predicted genotypes. The correlation of residuals between pairs of individuals can then be used as a measure of model fit. When the model assumptions are not violated and the inferred admixture proportions are accurate then the residuals from a pair of individuals are not correlated. In case of a bad fit, individuals with similar histories have a positive correlation of their residuals. Using simulated and real data, we show how the method is able to detect a bad fit of inferred admixture proportions due to using an insufficient number of clusters K or to demographic histories that deviate significantly from the admixture model assumptions, such as admixture from ghost populations, drift after admixture events and non-discrete ancestral homogeneous populations.
doi:10.1101/708883 fatcat:dbjqm7omonaxfkcwlolo75q6ly

Ancestry specific association mapping in admixed populations [article]

Line Skotte, Emil Joersboe, Thorfinn Sand S Korneliussen, Ida Moltke, Anders Albrechtsen
2015 bioRxiv   pre-print
During the last decade genome-wide association studies have proven to be a powerful approach to identifying disease-causing variants. However, for admixed populations, most current methods for performing association testing are based on the assumption that the effect of a genetic variant is the same regardless of its ancestry. This is a reasonable assumption for a causal variant, but may not hold for the genetic variants that are tested in genome-wide association studies, which are usually not
causal. The effects of non-causal genetic variants depend on how strongly their presence correlates with the presence of the causal variant, which may vary between ancestral populations because of different linkage disequilibrium patterns and allele frequencies. Motivated by this, we here introduce a new statistical method for association testing in recently admixed populations, where the effect size is allowed to depend on the ancestry of a given allele. Our method does not rely on accurate inference of local ancestry, yet using simulations we show that in some scenarios it gives a dramatic increase in statistical power to detect associations. In addition, the method allows for testing for difference in effect size between ancestral populations, which can be used to help determine if a SNP is causal. We demonstrate the usefulness of the method on data from the Greenlandic population.
doi:10.1101/014001 fatcat:w4prwt6cwjbjdeewd7it6ltm5u

Testing for Hardy-Weinberg Equilibrium in Structured Populations using NGS Data [article]

Jonas Meisner, Anders Albrechtsen
2018 bioRxiv   pre-print
Testing for Hardy-Weinberg Equilibrium (HWE) is a common practice for quality control in genetic studies. Variable sites violating HWE may be identified as technical errors in the sequencing or genotyping process, or they may be of special evolutionary interest. Large-scale genetic studies based on next-generation sequencing (NGS) methods have become more prevalent as cost is decreasing but these methods are still associated with statistical uncertainty. The large-scale studies usually consist
of samples from diverse ancestries that make the existence of some degree of population structure almost inevitable. Precautions are therefore needed when analyzing these datasets, as population structure causes deviations from HWE. Here we propose a method that takes population structure into account in the testing for HWE, such that other factors causing deviations from HWE can be detected. We show the effectiveness of our method in NGS data, as well as in genotype data, for both simulated and real datasets, where the use of genotype likelihoods enables us to model the uncertainty for low-depth sequencing data.
doi:10.1101/468611 fatcat:lykcflm7nzd6rg2dzljowknbmq

Inferring Population Structure and Admixture Proportions in Low Depth NGS Data [article]

Jonas Meisner, Anders Albrechtsen
2018 bioRxiv   pre-print
PCAngsd was only seen to converge to a single Meisner and Albrechtsen .  ... 
doi:10.1101/302463 fatcat:jbjgl6essvdzzfqwyptnott7fm

ANGSD: Analysis of Next Generation Sequencing Data

Thorfinn Sand Korneliussen, Anders Albrechtsen, Rasmus Nielsen
2014 BMC Bioinformatics  
High-throughput DNA sequencing technologies are generating vast amounts of data. Fast, flexible and memory efficient implementations are needed in order to facilitate analyses of thousands of samples simultaneously. Results: We present a multithreaded program suite called ANGSD. This program can calculate various summary statistics, and perform association mapping and population genetic analyses utilizing the full information in next generation sequencing data by working directly on the raw
sequencing data or by using genotype likelihoods. Conclusions: The open source C/C++ program ANGSD is available at http://www.popgen.dk/angsd. The program is tested and validated on GNU/Linux systems. The program facilitates multiple input formats including BAM and imputed beagle genotype probability files. The program allows the user to choose between combinations of existing methods and can perform analysis that is not implemented elsewhere.
doi:10.1186/s12859-014-0356-4 pmid:25420514 pmcid:PMC4248462 fatcat:xujrgymmmrg2vhehih35egbc4q

Estimating IBD tracts from low coverage NGS data

Filipe G. Vieira, Anders Albrechtsen, Rasmus Nielsen
2016 Bioinformatics  
Motivation: The amount of IBD in an individual depends on the relatedness of the individual's parents. However, it can also provide information regarding mating system, past history and effective size of the population from which the individual has been sampled. Results: Here, we present a new method for estimating inbreeding IBD tracts from low coverage NGS data. Contrary to other methods that use genotype data, the one presented here uses genotype likelihoods to take the uncertainty of the
data into account. We benchmark it under a wide range of biologically relevant conditions and show that the new method provides a marked increase in accuracy even at low coverage. Availability and implementation: The methods presented in this work were implemented in C/C++ and are freely available for non-commercial use from https://github.com/fgvieira/ngsF-HMM.
doi:10.1093/bioinformatics/btw212 pmid:27153648 fatcat:edqelkwunbcnnguzhkmgsmzdnq

Powerful Inference with the D-statistic on Low-Coverage Whole-Genome Data [article]

Samuele Soraggi, Carsten Wiuf, Anders Albrechtsen
2017 bioRxiv   pre-print
The detection of ancient gene flow between human populations is an important issue in population genetics. A commonly used tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness is assessed by evaluating specific coincidences of alleles between the groups. When working with high throughput sequencing data it is not always possible to accurately call genotypes. When
genotype calling is not possible the D-statistic that is currently used samples a single base from the reads of one chosen individual per population. This method has the drawback of ignoring much of the information in the data. Those issues are especially striking in the case of ancient genomes, often characterized by low sequencing depth and high error rates for the sequenced bases. Here we provide a significant improvement to overcome the problems of the present-day D-statistic by considering all reads from multiple individuals in each population. Moreover we apply type-specific error correction to combat the problems of sequencing errors and show a way to correct for introgression from an external population that is not part of the supposed genetic relationship, and how this method leads to an estimate of the admixture rate. We prove that the improved D-statistic, as well as the traditional one, is approximated by a standard normal. Furthermore we show that our method outperforms the traditional D-statistic in detecting admixtures. The power gain is most pronounced for low/medium sequencing depth (1-10X) and performances are as good as with perfectly called genotypes at a sequencing depth of 2X. We also show the reliability of error correction on scenarios with simulated errors and ancient data, and correct for introgression in known scenarios to verify the correctness of the estimation of the admixture rates.
doi:10.1101/127852 fatcat:vyjntyxxozdi5kcpov3sf6susy

Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data [article]

Jonas Meisner, Anders Albrechtsen
2020 bioRxiv   pre-print
Accurate inference of population structure is important in many studies of population genetics. In this paper we present, HaploNet, a novel method for performing dimensionality reduction and clustering in genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or genotype data. By utilizing a Gaussian mixture prior in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster
haplotypes along the genome in a highly scalable manner. We demonstrate that we can use encodings of the latent space to infer global population structure using principal component analysis with haplotype information. Additionally, we derive an expectation-maximization algorithm for estimating ancestry proportions based on the haplotype clustering and the neural networks in a likelihood framework. Using different examples of sequencing data, we demonstrate that our approach is better at distinguishing closely related populations than standard principal component analysis and admixture analysis. We show that HaploNet performs similarly to ChromoPainter for principal component analysis while being much faster and allowing for unsupervised clustering.
doi:10.1101/2020.12.28.424587 fatcat:lerfmpn5b5avhmw2zsceq4erai

Large-scale Inference of Population Structure in Presence of Missingness using PCA [article]

Jonas Meisner, Anders Albrechtsen, Siyang Liu, Mingxi Huang
2020 bioRxiv   pre-print
Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show
through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, that were shallowly sequenced to around 0.08x. From this data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. EMU is written in Python and is freely available at https://github.com/Rosemeis/emu/.
doi:10.1101/2020.04.29.067496 fatcat:3ges5bonbzg3zpavmbhu4khen4

Estimating Individual Admixture Proportions from Next Generation Sequencing Data

Line Skotte, Thorfinn Sand Korneliussen, Anders Albrechtsen
2013 Genetics  
Albrechtsen Copyright © 2013 by the Genetics Society of America DOI: 10.1534/genetics.113.154138 Table 1 : 1 Table showing the fraction of times the EM algorithm has converged to the same maximum  ...  ://www.genetics.org/lookup/suppl/doi:10.1534/genetics.113.154138/-/DC1 Estimating Individual Admixture Proportions from Next Generation Sequencing Data Line Skotte, Thorfinn Sand Korneliussen, and Anders  ... 
doi:10.1534/genetics.113.154138 pmid:24026093 pmcid:PMC3813857 fatcat:zb3dnisemrf2vevxvitfcwhu4q

A Genotype Likelihood Framework for GWAS with Low Depth Sequencing Data from Admixed Individuals [article]

Emil Jørsboe, Anders Albrechtsen
2019 bioRxiv   pre-print
., 2015] or PCAngsd [Meisner and Albrechtsen, 2018] , where the population structure between individuals is modelled using principal components rather than a discrete number of ancestral populations.  ... 
doi:10.1101/786384 fatcat:dn3m25bw65ekfmvy4fpxjtiex4

Detecting Selection in Low-Coverage High-Throughput Sequencing Data using Principal Component Analysis [article]

Jonas Meisner, Anders Albrechtsen, Kristian Hanghøj
2021 bioRxiv   pre-print
Identification of selection signatures between populations is often an important part of a population genetic study. Leveraging high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to the need of methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful
units which is why existing methods rely on principal components analysis for inference of the selection signals. These existing methods require called genotypes as input which is problematic for studies based on low-coverage sequencing data. Here, we present two selection statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening for the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. Moreover, we show that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data.
doi:10.1101/2021.03.01.432540 fatcat:2r2iedr44rb2xikvgiim4ousha

Archaic adaptive introgression in TBX15/WARS2 [article]

Fernando Racimo, David Gokhman, Matteo Fumagalli, Amy Ko, Torben Hansen, Ida Moltke, Anders Albrechtsen, Liran Carmel, Emilia Huerta-Sanchez, Rasmus Nielsen
2015 bioRxiv   pre-print
A recent study conducted the first genome-wide scan for selection in Inuit from Greenland using SNP chip data. Here, we report that selection in the region with the second most extreme signal of positive selection in Greenlandic Inuit favored a deeply divergent haplotype that is closely related to the sequence in the Denisovan genome, and was likely introgressed from an archaic population. The region contains two genes, WARS2 and TBX15, and has previously been associated with adipose tissue
differentiation and body-fat distribution in humans. We show that the adaptively introgressed allele has been under selection in a much larger geographic region than just Greenland. Furthermore, it is associated with changes in expression of WARS2 and TBX15 in multiple tissues including the adrenal gland and subcutaneous adipose tissue, and with regional DNA methylation changes in TBX15.
doi:10.1101/033928 fatcat:gkt2r7qmfngl5gq5d67zbm4nxu

Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data

Jonas Meisner, Anders Albrechtsen
2018 Genetics  
We here present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing (NGS) data. Inference of population structure is essential in both population genetics and association studies, and is often performed using principal component analysis (PCA) or clustering-based approaches. NGS methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account
for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through PCA in an iterative heuristic approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.
doi:10.1534/genetics.118.301336 pmid:30131346 pmcid:PMC6216594 fatcat:vwmwotzx3zarpdxe4wwjitstca

PCAone: fast and accurate out-of-core PCA framework for large scale biobank data [article]

Zilong Li, Jonas Meisner, Anders Albrechtsen
2022 bioRxiv   pre-print
Other fast methods also exists that can deal with missing data such as PCAngsd (Meisner and Albrechtsen 2018) , EMU (Meisner, Liu, et al. 2021) and ProPCA (Agrawal et al. 2020) .  ...  For instance, we can deal with missingness in the genotypes, genotype dosages and genotype likelihood data, which is achieved with an implementation of the PCAngsd method (Meisner and Albrechtsen 2018  ... 
doi:10.1101/2022.05.25.493261 fatcat:gyuc6iqhzfgsfblecckohriyii