Distinguishing among complex evolutionary models using unphased whole-genome data through Approximate Bayesian Computation
Inferring past demographic histories is crucial in population genetics, and the number of complete genomes now available should in principle facilitate this inference. In practice, however, the available inferential methods suffer from severe limitations. Although hundreds of complete genomes can be analyzed simultaneously, complex demographic processes can easily exceed computational constraints, and the procedures used to evaluate the reliability of the estimates further increase the computational effort. Here we present an Approximate Bayesian Computation (ABC) framework, based on the Random Forest algorithm, to infer complex past population processes using complete genomes. To do this, we propose to summarize the data by the full genomic distribution of the four mutually exclusive categories of segregating sites (FDSS), a statistic that is fast to compute from unphased genome data. We constructed an efficient ABC pipeline and tested how accurately it recognizes the true model among models of increasing complexity, using simulated data and considering different sampling strategies in terms of the number of individuals analyzed and the number and size of the genetic loci considered. We tested the power of the FDSS to be informative about even complex evolutionary histories and compared the results with those obtained by summarizing the data through the unfolded Site Frequency Spectrum, thus highlighting, for both statistics, the experimental conditions that maximize inferential power. Finally, we analyzed two datasets, testing models (a) on the dispersal of anatomically modern humans out of Africa and (b) on the evolutionary relationships of the three species of orangutan inhabiting Borneo and Sumatra.
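To make the FDSS summary more concrete, the following is a minimal sketch of how such a statistic could be computed for a pair of populations. It rests on several assumptions that the abstract does not state: that the four categories are sites private to the first population, sites private to the second, shared polymorphisms, and fixed differences; that unphased genotypes are coded as the number of alternative alleles per diploid individual (0, 1 or 2); and that the "distribution" is the number of loci showing 0, 1, 2, ... sites of each category. The function names (`classify_site`, `locus_counts`, `fdss`) and the `max_sites` binning parameter are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter

# Assumed category labels for a pair of populations (not defined in the abstract).
CATEGORIES = ("private_1", "private_2", "shared", "fixed")


def classify_site(geno1, geno2):
    """Classify one site from unphased genotypes (0/1/2 alternative alleles
    per diploid individual). Returns a category name, or None if the site
    does not segregate in the pooled sample."""
    c1, c2 = sum(geno1), sum(geno2)
    n1, n2 = 2 * len(geno1), 2 * len(geno2)  # chromosomes sampled per population
    poly1 = 0 < c1 < n1
    poly2 = 0 < c2 < n2
    if poly1 and poly2:
        return "shared"
    if poly1:
        return "private_1"
    if poly2:
        return "private_2"
    # Both populations are monomorphic: a fixed difference if they carry
    # different alleles, otherwise the site is not segregating.
    if (c1 == 0 and c2 == n2) or (c1 == n1 and c2 == 0):
        return "fixed"
    return None


def locus_counts(locus):
    """Count sites of each category in one locus, given as a list of
    (genotypes_pop1, genotypes_pop2) pairs, one pair per site."""
    counts = Counter()
    for geno1, geno2 in locus:
        category = classify_site(geno1, geno2)
        if category is not None:
            counts[category] += 1
    return {category: counts[category] for category in CATEGORIES}


def fdss(loci, max_sites):
    """For each category, build a histogram over loci of the per-locus site
    count, with bins 0..max_sites (the last bin collects larger counts)."""
    hist = {category: [0] * (max_sites + 1) for category in CATEGORIES}
    for locus in loci:
        for category, k in locus_counts(locus).items():
            hist[category][min(k, max_sites)] += 1
    return hist


# Toy input: two loci, two diploid individuals per population.
loci = [
    [([0, 1], [0, 0]), ([1, 1], [0, 2]), ([0, 0], [2, 2])],
    [([0, 0], [0, 1]), ([0, 0], [0, 0])],
]
summary = fdss(loci, max_sites=2)
```

Because only per-population allele counts enter the classification, no haplotype phase is needed, which is consistent with the abstract's point that the statistic can be computed from unphased data. In an ABC setting, the concatenated histograms would serve as the feature vector compared between observed and simulated genomes.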