Efficient toolkit implementing best practices for principal component analysis of population genetic data [article]

Florian Privé, Keurcien Luu, Michael G.B. Blum, John J. McGrath, Bjarni J. Vilhjálmsson
2019 bioRxiv preprint
Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and to control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls: (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) shrinkage bias when projecting PCs computed in a reference dataset onto an independent dataset, (3) detecting sample outliers, and (4) handling uneven population sizes. In this work, we explore these potential issues when using PCA and present efficient solutions to them. Following applications to the UK Biobank and 1000 Genomes project datasets, we make recommendations for best practices and provide efficient, user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr. For example, we show that PC19 to PC40 in the UK Biobank capture LD structure; using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. We therefore recommend using only 16-18 PCs from the UK Biobank. We also provide evidence of a shrinkage bias when projecting PCs computed with data from the 1000 Genomes project: although PC1 to PC4 suffer from only moderate shrinkage (1.01-1.09), PC5 (resp. PC10), for example, suffers from a shrinkage factor of 1.50 (resp. 3.14). We provide a fast way to project new individuals that is not affected by this shrinkage bias, and we show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as of other omics data.
doi:10.1101/841452
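
To illustrate the kind of workflow the abstract describes, below is a minimal R sketch using functions from bigsnpr and bigutilsr: PCA of a reference dataset with automatic removal of long-range LD regions, projection of new individuals with a correction for the shrinkage bias, and a simple PC-based outlier filter. This is not code from the paper; the file paths, the number of PCs (k = 20), and the outlier threshold are placeholder assumptions.

```r
## Minimal sketch (not from the paper): PCA with automatic long-range LD removal,
## shrinkage-corrected projection of new samples, and a simple PC outlier filter.
## File paths, k = 20, and the outlier criterion are placeholder assumptions.
library(bigsnpr)   # also loads bigstatsr
library(bigutilsr)

## Memory-map PLINK .bed files (hypothetical paths) without reading them fully.
obj.bed.ref <- bed("data/reference.bed")
obj.bed.new <- bed("data/new_samples.bed")

## PCA of the reference data with iterative clumping and automatic removal of
## long-range LD regions, so that PCs capture population structure, not LD.
svd.ref <- bed_autoSVD(obj.bed.ref, k = 20, ncores = nb_cores())
PCs.ref <- predict(svd.ref)   # PC scores of the reference individuals

## Project new individuals onto the reference PCs, correcting for the
## shrinkage bias that affects naive projection.
proj <- bed_projectPCA(obj.bed.ref, obj.bed.new, k = 20, ncores = nb_cores())
PCs.new <- proj$OADP_proj     # bias-corrected projected scores

## Restrict analyses to individuals of homogeneous ancestry by flagging PC
## outliers, here with a local outlier factor and a Tukey-style upper threshold.
lof  <- LOF(PCs.ref)
keep <- which(lof < tukey_mc_up(lof))
```

The specific outlier statistic and threshold above are one reasonable choice among the tools bigutilsr provides; the paper discusses outlier detection in more detail, so the exact criterion used there may differ.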