View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Erasmus University Digital Repository WollsteinandLaoInvestigativeGenetics (2015) 6:7 DOI10.1186/s13323-015-0019-x REVIEW Open Access Detecting individual ancestry in the human genome Andreas Wollstein1,2*† and Oscar Lao1,3*† Abstract Detecting and quantifyingthe population substructure present ina sampleof individuals are ofmain interest in the fields ofgenetic epidemiology, population genetics, and forensics among others. Todate, severalalgorithms have been proposedfor estimatingthe amount of genetic ancestry within an individual. Inthe present review, we introduce the most widely used methods inpopulation genetics for detectingindividualgeneticancestry. We further show, by means of simulations, the performance ofpopular algorithms for detectingindividualancestryin various controlled demographicscenarios. Finally,weprovidesome hints on how to interpretthe results from thesealgorithms. Keywords: Population substructure,Human geneticvariability, SNPs, Global ancestry, Individual ancestry, ADMIXTURE, fastSTRUCTURE,MDS, PCA, sNMF Review [11]. Some argue that humans should be considered as Introduction one genetically homogeneous group [12]; others suggest The genetic variability among the human species is that, although small, the geographic dependence of hu- known to be relatively low compared to other primate man genetic diversity (at least) supports the existence of species [1]. There are paradoxically more genetic differ- continentalgroups[11,13]. encesbetweenWesternandEasternchimpanzeeindivid- Inferring population substructure in the human gen- uals sampled in the African continent [2] than in any ome is cumbersome and is the main goal for the large genome of two human individuals sampled in different number of genetic ancestry algorithms and approaches continents [3]. Human genetic diversity also tends to be that have been proposed in the last decade. A basic as- positively correlated with the geographic distance be- sumption is that any current individual genome or tween the sampled individuals [4-6], which is mainly a population is a mixture of ancestries from past popula- result from isolation by distance [7]. Studies using clas- tions [14]. Therefore, genetic ancestry is defined at dif- sical partition of the human genetic variance based on ferent scales of complexity: at populations, at individuals analysis of molecular variance (AMOVA [8]), and its within a population, and at a locus within an individual. generalization GAMOVA [9], have consistently shown In the present review, we focus on current methods for that a small proportion (approximately 10% to 15%) of inferring genetic ancestry in the genome of an individ- the total genetic variability is explained by continent of ual. We analyze the performance of some of the most origin, whereas the majority (approximately 80%) is ex- commonly used programs through simulated data and plained by within-individual variation. The remaining show the range of parameters in which each program approximately 5% of the genetic variation is explained providesreliable resultsinthose settings. by the populations [10]. Interpreting these results in terms of human population substructure and individual prediction to a population cluster is still controversial Methodsforidentifyingindividualancestry Methods for estimating ancestry have traditionally fo- *Correspondence:[email protected];[email protected] cused on populations; their main interests are to estab- †Equalcontributors lish the relationship among populations and to quantify 1DepartmentofForensicMolecularBiology,ErasmusMCUniversityMedical the admixture proportions in the admixed populations CenterRotterdam,3000CA,Rotterdam,TheNetherlands Fulllistofauthorinformationisavailableattheendofthearticle [15,16]. Admixture proportions are computed from the ©2015WollsteinandLao;licenseeBioMedCentral.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/4.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycredited.TheCreativeCommonsPublicDomain Dedicationwaiver(http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle, unlessotherwisestated. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page2of12 amount of loci that can be traced back to a certain an- approaches such as multithreading (for example, in AD- cestral population. Population methods are the oldest in MIXTURE and sNMF). However, multithreading can literature [17] and are a large number of available appli- only provide a linear time factor gain, which in the case cations [18-21]. However, it has been suggested that of higher polynomial complexities does not have a there could be hidden population substructure among strongcomputationalimpact. the individuals from an assumed population [22]. The Depending on which methodological approach is used, main goal of global individual ancestry methods is to de- global individual ancestry methods have been divided by scribe the relationship between individuals in terms of Alexander et al. [24] into algorithmic and model-based genetic ancestry. This can either mean the identification methods [24]. We use this classification through the ofthe aprioriunknownancestrycomponents,thequan- manuscript with some modifications. By definition, all tification of the proportions of these components, or the the algorithms are ‘algorithmic’. Therefore, we will use identification of the assumed population of an individ- the term ‘model-free’ for referring to the ancestry ual. Individual ancestry methods can be classified de- methods classified by Alexander et al. [24] as algorith- pending on the assumptions of the method, the scope of mic, and point out that the use of ‘model’ refers here to the algorithms (that is, the whole genome is assigned to a population-based statistical model, as further de- one ancestry versus the whole genome is a mixture of scribed. Nevertheless, we acknowledge that some of the ancestries), and the use of prior ancestry information, newest proposed methods can also be considered as hy- among others (see Table 1). From a technical point of brids of the two classifications or even can be barely view, there is large variation in the speed and computa- assigned to any of them. Model-free methods are based tional requirements of the different methods [16,23]. on the use of multivariate techniques [26] such as Prin- Speed depends on the computational complexity of each cipal component analysis (PCA; [27]) or Multidimen- method which, for example, is O(n m K2) for ADMIX- sional scaling (MDS [28,29]). For a given measured TURE [24] and O(n m K) for sNMF [25], as well as the divergence between any pair of sampled individuals, the possibility to apply divide-and-conquer computational basic idea behind all these techniques is to represent the Table1CommonlyappliedalgorithmstoSNPdataforquantifyingindividualpopulationsubstructureinhumans Type Method Nameof Webaddress Reference package Model-free Principalcomponentanalysis EIGENSOFTa http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html [70] PrincipalcomponentsandMoran’sI adegenet http://adegenet.r-forge.r-project.org/ [71] (Rsoftware) Multidimensionalscaling PLINKa http://pngu.mgh.harvard.edu/~purcell/plink/ [28] Principalcoordinates PCO-MC http://lamar.colostate.edu/~reevesp/PCOMC/PCOMC.html [72] Spectralgraphtheory GemTools http://wpicr.wpic.pitt.edu/WPICCompGen/GemTools/GemTools.htm [43] Spectralgraphtheory SpectralGem http://wpicr.wpic.pitt.edu/WPICCompGen/Spectral-GEM/GEM+.htm [56] Laplacianeigenfunction LAPSTRUCT http://galton.uchicago.edu/~junzhang/LAPSTRUCT.html [57] GeneticalgorithmcoupledtoAMOVA GAGA http://www.erasmusmc.nl/fmb/resources/GAGA/ [73] Model-based Log-likelihoodHWE ADMIXTURE https://www.genetics.ucla.edu/software/admixture/ [24] Log-likelihoodHWE FRAPPE http://med.stanford.edu/tanglab/software/frappe.html [31] BayesianHWE STRUCTURE http://pritchardlab.stanford.edu/structure.html [22] BayesianHWE fastSTRUCTURE http://pritchardlab.stanford.edu/structure.html [59] Nonnegativematrixfactorization sNMF http://membres-timc.imag.fr/Eric.Frichot/snmf/index.htm [25] Bayesian BAPS http://www.helsinki.fi/bsg/software/ [74] ChromopaintingandBayesian fineSTRUCTURE http://www.paintmychromosomes.com [60] classifier Log-likelihoodgenotypic/haplotypic LOCO-LD http://loco.icsi.berkeley.edu/loco/ [37] gradients Log-likelihoodallelicgradients SPA http://genetics.cs.ucla.edu/spa/ [36] ADMIXTUREandlinearregression GPS http://chcb.saban-chla.usc.edu/gps/ [39] Bayesianclusteringwithspatial TESS http://membres-timc.imag.fr/Olivier.Francois/tess.html [38] information aWeprovideoneofthepossibleimplementationspresentintheliterature. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page3of12 geneticrelationshipsbyanewsetoforthogonalvariables case of model-free methods, using their output, a classi- that are ordered by the decreasing amount of explained fier algorithm can be applied in order to identify the variation. Both methods can be considered as equivalent number of genetically homogeneous population clusters if Euclidean distances are used [29]. Visualization of (see for example [41,42], or [43]). One exception is these relationships becomes very meaningful if only the sNMF [25], a new algorithm for inferring ancestry pro- variables with the highest amount of explained variation portions. sNMF models the probability of the observed are considered. Because multivariate methods are ex- genotypes p in individual i at locus l as a fraction q of il ik ploratory, they do not make any assumption about the K ancestral genotype probability g , similar in spirit as kl underlying genetic model of the data [26]. Nevertheless, STRUCTUREorADMIXTURE: in some idealized cases, the proposed coordinates in some of these methods can be interpreted in demo- XK graphic terms (for example, PCA [30]). In contrast, p ðjÞ¼ q g ðjÞ il ik kl model-based methods estimate ancestry coefficients as k¼1 the parameters of a statistical model. This model takes into account basic demographic assumptions, such as where j=0,1,2 denotes the number of alleles. However, the presence of the Hardy-Weinberg equilibrium (HWE; thisalgorithmdoesnotmakeanyassumptionaboutHWE [22]) in the allelic frequencies of the K ‘ancestral’ popu- in the ancestral populations. The corresponding matrix lations that produced the currently observed data representation is P=QG, where the unknown Q and G [22,24].Forexample,intheoriginaldefinitionofindivid- canbeestimatedbynonlinearmatrixfactorization.Thisis ual ancestry provided by STRUCTURE [22], the geno- achievedbymeansofminimizingtwoleastsquarecriteria: type g counted as the number of alleles {0,1,2} in a diploid organism at locus j of individual i is modeled as Ls ¼jX−QGjandLs ¼(cid:7)(cid:7)(cid:5)GT;pffiαffiffi1 (cid:6)QT−(cid:5)XT;0 (cid:6)(cid:7)(cid:7); a mixture of the q fractions of the K ancestral popula- 1 2 K n tions at the allelic frequencies f. The log-likelihood under the assumption of HWE for all the individuals i where alpha is a regularization parameter, and 1 and 0 k n and loci j is then computed using the Alexander et al. describe a column vector with ones and zeros of size K [24]notation as: andn(see[25]forfurtherdetails;thesemicolonindicates a line break). Starting from random matrices as initial ! !! XX X (cid:2) (cid:3) X (cid:2) (cid:3) condition, the algorithm applies both criteria consecu- LðQ;FÞ¼ g ln q f þ 2−g ln q 1−f ij ik kj ij ik kj tively to obtain estimates about Q from Ls and G from i j k k 1 Ls ,respectively,untilconvergencehasbeenreached. 2 Popular methods for estimating the allelic frequencies Since model-based methods explore the space of pos- f in the ancestral populations for all the loci and the an- sible solutions starting from an initial point, it is recom- cestry q proportions in each individual include Bayesian mended to run the algorithm several times at different (for example STRUCTURE [22]) and maximum likeli- initial starting points for each proposed K and to check hood approaches (for example, FRAPPE [31] and AD- for reproducibility of results [44]. Different strategies MIXTURE[24]). have been proposed for combining the results from dif- Recently, new types of global ancestry methods have ferent runs. One possibility is to compute a consensus been proposed. These methods take advantage of the ancestry value by merging all the solutions [44]. Another spatial dependence of human population substructure is just to take the run that provides the best value of [32] to estimate ancestral geographic coordinates of an model performance [24]. individual (BAPS2 [33], GENELAND [34], sPCA [35], Usually, investigators apply both model-free (for ex- SPA[36],LOCO-LD[37],TESS[38],orGPS[39]among ample, PCA or MDS) and model-based methods (for ex- others). ample, ADMIXTURE, FRAPPE, or STRUCTURE) to the There are several ways to estimate the unknown num- same dataset [45,46]. Plots (and further interpretation) ber (K) of ancestral populations from the data (for ex- tend to include the solutions of the optimal/best sup- ample, [40]). In model-based methods, the algorithm is portednumberofclusters. explicitly run by the user at different Ks. The most sup- Further improvements on genotyping technology, with ported number of clusters or ancestral components is the description of millions of single nucleotide polymor- then ascertained by taking the one that optimizes the phisms (SNPs) in the human genome [15], have allowed parameter of performance of the algorithm (for example, the third generation of ancestry methods by modeling it maximizes the log-likelihood of the posterior in the the genetic ancestry of local fragments of the genome, case of STRUCTURE; minimization of cross-validation such as HapMix or StepPCO scripts [14,47] among error is applied in ADMIXTURE among others). In the others. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page4of12 Upsanddownsofindividualgeneticancestryestimation algorithms (such as in the case of PCA, [57]). Further- Individual ancestry methods can depict a detailed pic- more, the outcome from the different algorithms can dif- ture of the genetic landscape of human populations [15]. fersubstantiallyevenforthesamedataset[58].Ultimately, Furthermore, these algorithms are routinely applied to thereisthequestionofwhataproposed‘ancestralpopula- any dataset before conducting a genome-wide associ- tion’ is. By definition, since new populations appear by ation study (GWAS), in order to correct for the putative splitting from previous ones, population ancestry (and presence of hidden population substructure [48]. More- hence genetic admixture) can be defined at different time over, they have been used to test the hypothesis of the scales, takinginto account that all individuals froma spe- ancestry origin of the perpetrator at a crime scene in fo- ciesultimatelyshareacommonancestralorigin.However, rensiccases[49]. thispopulation‘birthanddeath’processisnotreallymod- In principle, averaging the fragments of local ancestry eled in the model-based methods (and by default, neither over the genome of one individual computes the global is it in the model-free methods); in contrast, it is one of ancestry estimation in that individual; similarly, aver- the main goals of population-based methods, conditioned aging all of the global individual ancestries in one popu- totheproperdefinitionof ‘whatacurrentpopulationis’. lation provides a migration/admixture estimation in that Weexemplifysomeofthesecaveatsusingunsupervised population. Moreover, the mean and variance in the analysesfromfourascertainedglobal-basedalgorithmson lengthoftheancestryfragmentsandtheglobalancestry simulated and real data using the default parameter set- proportions can be used to estimate parameters such as tings from each algorithm. In particular, we considerAD- the time or migration rate of the admixture event in MIXTURE [24], sNMF [25], fastSTRUCTURE [59], PCA particular demographic scenarios [50]. Nevertheless, [27], and MDS in PLINK [28]. This selection is based on population-based methods are sometimes preferred methodological,historical,andcomputationalcharacteris- over global or local ancestry methods [18,51]. The main tics. For example, we did not consider fineSTRUCTURE reason is that the results of global and local ancestry [60],arecentlydevelopedalgorithmwithenhancedpower methods can be particularly difficult to interpret for detecting population substructure [61], because of its [21,52]. For example, several demographic scenarios computational burden when the number of SNPs and can produce the same observed admixture pattern in sampled individuals are large (see the manual of fineS- PCA [30,53,54]. In humans, multiple demographic TRUCTUREandchromoPainterfordetails).Thefirsttwo events can be identified in the same geographic area methods represent model-based algorithms. ADMIX- [55];therefore,itis likely tofindanadhocplausibleex- TURE [24] is a maximum likelihood algorithm. It can be planation for any estimated admixture pattern (for ex- considered the gold standard of model-based methods; it ample, see [53]). The presence of unequal sample size is relatively fast and allows for the use of a large number of the (a priori unknown) populations can also bias the of SNPs and samples. fastSTRUCTURE is a new software output of some algorithms, such as PCA [30,56]; the thatimplementsaBayesianframeworksimilartoSTRUC- presenceof highly genetically related individuals and gen- TURE [22]. However, in contrast to STRUCTURE, fas- etic outliers can also bias the output from different tSTRUCTURE allows the fast analysis of a large number Figure1Basicadmixturemodelscommonlyusedinpopulationgenetics.Eachrectanglerepresentsapopulation.Bothmodelsconsider oneinitialancestralpopulation(graycolor)thatsplitsintotwonewpopulationst_splitgenerationsago.Eachofthenewpopulationsevolveswithout exchangingmigrantsforaperiodoftime,duringwhichgeneticdifferentiationbetweenthemcantakeplaceasexemplifiedbythepresenceofa differentcolor.(A)Continuousgeneflow(CGF)model.Thebluepopulationcontributes4Nmchromosomemigrantstotheredpopulationfromtime pointt_splitonwards,replacingthesamenumberofchromosomesfromthispopulation.(B)Hybrid(HI)model.Att_admixture,thereisasingleevent ofadmixture,andanewhybridpopulationiscreatedfrommfractionofchromosomemigrantsfromthebluepopulationand1-mfractionof migrantsfromtheredpopulation.Afterthisevent,eachpopulationcontinuestoevolveindependently.Adaptedfrom[20]. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page5of12 of samples and SNPs. PCA, MDS, and sNFM are model- Table3Resultsfromthetwo-populationmodel freemethods.PCAand MDSarebasedoneigenvaluede- simulations composition.Theyproducealmostidenticalresultsinreal Variable sNMF Admixture fastStructure data [62,63]; therefore, we have used either one or the (R2) (R2) (R2) other indistinctly in the different simulations. sNMF [25] Samplingdepth,n1,n2 isanovel softwarewhich in principle produces verysimi- 8 99.92 100 39.56 lar results to ADMIXTURE [24] but at a computationally 10 99.83 100 34.03 fasterspeed. 20 99.87 100 100 We focused our analyses on two simple, controlled, 40 99.81 100 100 demographic models. The first demographic model de- scribes an ancestral population that splits t generations 100 99.74 100 100 ago in two populations. In one version of the model, the Unevensampling,n1 two descendent populations start evolving independ- 8 98.94 99.45 98.59 ently. In another version, migration between the two 10 99.43 99.78 99.32 populations is allowed. The second model comprises an 20 99.61 100 92.21 ancestral population that splits in two, which after a 40 99.67 100 100 certain number of generations evolving with a genetic barrier, create a new population by admixture (see 100 99.74 100 100 Figure 1). Because of their simplicity, the proposed Sequencingdepth,nsnps demographic models fit better into the assumptions of 10 3.13 0.65 18.51 model-based methods. Furthermore, it has been shown 50 66.56 75.54 74.42 that the first dimension of the PCA can differentiate the 100 85.33 92.95 91.89 genetic ancestry of populations, and it is indicative of 500 96.78 99.87 99.93 the ancestry proportions in the admixed populations [30].Inouranalyses,weusedmarkersinlinkageequilib- 1,000 98.62 99.99 100 rium;thisconditionwaseitherimposedonthesimulator 5,000 99.74 100 100 (case of ms simulations) or achieved by the use of com- Populationsize,theta monly applied LD pruning techniques. Therefore, any 1 99.73 100 100 difference observed in the estimated ancestry propor- 2 99.74 100 100 tions must reflect inner algorithmic assumptions or sen- 5 99.74 100 100 sitivity tothemodification ofthe consideredparameters. 10 99.72 100 100 Effectivepopulationsize, N2 Performanceofglobal-basedalgorithmstoestimate geneticancestryontwosimulatedpopulations 100 99.98 100 100 Twopopulationswithageneticbarrier 2,500 99.94 100 100 The results from the two-population model (Figure 1A) 7,500 99.82 100 100 with a genetic barrier and the details of the implementa- 10,000 99.74 100 100 tion areshown inTables2and3. Divergencetime(Fst), T/(4N) 1 Table2Defaultparameterusedintwo-population 0.000075 0.54 0.38 0.01 models,withandwithoutmigration 0.00025 0.24 0.03 0 Parameter Abbreviation Defaultvalue 0.00125 6.19 0.03 0.24 Samplesizepopulation1 n1 100 0.0025 69.36 95.28 0.53 Samplesizepopulation2 n2 100 0.0125 98.36 100 100 NumberofindependentSNPs nsnps 5,000 0.05 99.74 100 100 Mutationrate(length)a theta 2 Constantmigrationrate, Effectivepopulationsizeb N1,N2 10,000 4Nm Divergencetime T1 2,000 0.1 99.77 100 100 Constantmigrationrate 4Nm 0 1 99.78 100 100 aThescaledmutationratetheta=2*Ne*mu=2describesaregionofabout 5 99.56 100 100 2kbassumingamutationrateof2.5e−8.bTheeffectivepopulationsize correspondsbroadlytothatofAfrica. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page6of12 Table3Resultsfromthetwo-populationmodel Table4Resultsfromadmixturesimulationwithchanging simulations(Continued) parameterintheHImodelfromHapMapIIIdata 10 99.15 99.99 100 Parameter sNMF(R2) ADMIXTURE(R2) fastSTRUCTURE(R2) Samplesize 50 93.95 99.98 33.3 8 98.2 99 86.66 100 41.61 94.06 0.56 10 99.5 99.52 98.69 Wesimulatedtwopopulationsusingms[75],whichsplittedandevolved independentlytgenerationsago.SeeTable1fordefaultparameters.Each 20 99.74 99.82 99.71 simulationcomprises1,000independentregionsof2kb,fromwhichoneSNP perregionissampledatrandom.Eachparametersetwasreplicatedtentimes. 40 99.85 99.9 99.86 Foreachalgorithm,theestimatedancestryproportionsoverthedifferentruns 50 99.87 99.93 99.9 weresortedaccordingtotheexpectedancestrymatrixdenotingthetrue populationlabelsusingCLUMPP[44].Fromthis,standarddenoteddemographic 100 99.91 99.95 99.95 parametersweresuccessivelyvariedtoexemplifytheimpactontheestimates. Wereportthecoefficientofdeterminationthatcanbeunderstoodasthe nsnps percentageofthetrueoutcome. 5 4.56 15.38 19.44 10 15.92 47.37 46.2 Overall, sNMF and ADMIXTURE show similar results and outperform fastSTRUCTURE for most of the con- 50 80.62 86.31 86.89 sidered demographic values (see Table 4). Nevertheless, 100 89.67 93.04 93.33 the predictive power of ADMIXTURE is slightly higher 500 98.46 99.07 99.11 than that of sNMF (100% compared to 99% in most 1,000 99.19 99.54 99.56 cases). Low sample size decreases the power mostly in 5,000 99.84 99.92 99.91 fastSTRUCTURE (for n=8, fastSTRUCTURE: 35%, 10,000 99.91 99.95 99.95 sNMF:99%,ADMIXTURE:100%),whereasunevensam- pling does not influence the estimates of the ancestry Nbreaks components with any of the programs. The number of 5 88.82 88.37 87.46 SNPs has a strong impact on all programs. When only 10 94.38 94.86 94.43 very few sites are available (that is, less than 50 snps), 50 98.74 98.87 98.8 fastSTRUCTURE produces the best outcome. This is 100 99.33 99.41 99.38 not surprising, as ADMIXTURE and sNMF have been 500 99.81 99.85 99.84 particularly developed to consider a dense number of markers [25]. The effective population size and differ- 1,000 99.86 99.91 99.9 ences in population size did not show any direct impact 5,000 99.91 99.94 99.94 on the results, which however might matter in combin- 10,000 99.91 99.95 99.95 ation with divergence time. The power for all programs alpha decreases dramatically for populations that do not ex- 0.01 99.94 99.99 99.99 hibit substantialpopulationsubdivisionduetolowdiver- 0.03 99.93 99.97 99.95 gence times or high migration rates, mostly for fastSTRUCTURE. Reliable ancestry estimates are pos- 0.07 99.93 99.97 99.92 sible for t>0.0125 that correspond to F >0.0124 [64]. 0.1 99.93 99.97 99.91 st The counter effect of constant migration becomes evi- 0.3 99.91 99.96 99.95 dent for a migration rate of 4Nm>10 (see Figure 2B), 0.5 99.91 99.95 99.95 which homogenizes the population. Sampling more sites TheadmixedpopulationwasgeneratedfromtheAfrican(YRI)andEuropean islikely toincreasethesensitivitytodetectboth effects. (CEU)populationfromHapMapIII.Asamplefromanadmixedpopulationis knowntoconsistofamosaicofchromosomalregionsorblocksfromthe ancestralpopulation.Withincreasingtimesincetheadmixtureevent,these Migrationbetweenthetwodescendentpopulations regionsarebecomingbrokenupintosmallerpiecesthroughrecombination (continuousgeneflowmodel) thatisdenotedbythenumberofbreakpoints(Nbreaks).Individualsfromthe In addition, we studied the parameter range where mi- syntheticallyadmixedpopulationweresampledrandomlyfromblocksfrom sourcepopulations,respectively(thedefinedadmixtureproportions,alpha). gration becomes detectable depending on the start time Finally,asubsample(nsnps)ofuniformlydistributedsiteswaschosen.The and rate of migration in the continuous gene flow (CGF) distanceofthesiteshasbeenchosentobegreaterthan1Mbtoassure linkageequilibrium. model (see Figure 1A for the model and Figure 2 for re- sults).Keepingthemigrationratefixedathighmigration to be panmictic. In contrast, when fixing the start time rate (4Nm=2,000), the populations become distinguish- of migration at ten generations, we observe that all able if the migration starts before 100 generations back- populations become recognizable by all programs for 4 ward intime (Figure 2B).Beyondthatvalue,the effect of Nm<500. The estimated proportions of ancestry do migration is so strong that the two populations appear not match the proportion of migrants over time. A WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page7of12 Figure2Estimatedproportionsofancestryfromthecontinuousgeneflow(CGF)model(seemaintext).SeeTable2fordefault parameters.(A)Resultsforvaryingdivergencetimewhilekeepingthemigrationrateconstantat4Nm=50.(B)Theestimatedancestryproportionsfor keepingthedivergencetimeconstantatT=10whilevaryingthemigrationrate.Errorbarsdenotethestandarddeviationoftheestimatedancestry proportionperpopulation.Simulationswereproducedusingthefollowingmscommand[75]:ms2005000-t2-I2100100-em122000-n21-ej21. possible reason is that there is a continuous gene flow values (F <0.1), we observe that sNMF and fastSTRUC- is from one population into the other so that recombin- TUREindicatecorrectlytheeffectofmigrationintheires- ation has not enough time to produce the homoge- timates(seeFigure3).Incontrast,forhighF values(F > is is neous mosaic of ancestral fragments that is emerging 0.1), the genetic variation is more divergent in sNMF and from the HI model (see below). Therefore, the migra- fastSTRUCTURE; in contrast, both populations appear tion rate cannot be inferred from this analysis. more similar with ADMIXTURE. Therefore, sNMF and We further investigated how the presence of hidden in- fastSTRUCTURE seem to provide better ancestry esti- breedingaffectstheestimatedgeneticancestryproportions mates compared to ADMIXTURE, particularly when in- from each algorithm. We used the two-population model breeding is high (F >0.1). If migration is absent, is with constant migration (4Nm=100) as previously de- inbreeding has a minor effect on the ancestry estimates scribed. In each simulation, a fraction of heterozygote ge- fromthedifferentalgorithms(datanotshown). notypeswasdecreasedproportionaltotheF (forexample, For completeness, we studied the running time per- is [65])byreplacingthembyrandomhomozygotegenotypes formance of each algorithm as a function of the number in one population. We estimated the genetic ancestry by of considered SNPs and for either K=2 or K=4 as- the different programs (see Figure 3 for results). The mi- sumedancestralpopulations (seeFigure4).Weobserved gration has a homogenizing effect on the genetic variation that sNMF shows the lowest running times for a given in both populations, whereas the inbreeding in one of the number of SNPs and K, followed by ADMIXTURE. In populations results in the opposite pattern. For low F contrast, fastSTRUCTURE exhibits the worst runtime is WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page8of12 Figure3Migrationandinbreedingusingthetwo-populationmodel(seelegendofFigure2formscommand).Inbreedingwassimulated byareductionoftheheterozygotegenotypesproportionaltothegivenFisvalue(seemaintextfordetails). and scaling with higher K as expected from the com- model (Figure 1B). We compared them with the real plexitydescribed above. proportions of genomic admixture in each individual; this measure was estimated for each simulation by tra- cing back the ancestry of the genomic fragments that Performanceofthealgorithmsonthehybridadmixture compose the genome of each admixed individual to ei- (HI)model ther of the two parental populations. Therefore, in con- Simulateddata trast to other approaches, which produce admixed Analyses focused on the estimated individual ancestry individuals in forward generations from sampled real proportions in the hybrid population using the HI populations (that is, African Americans have been modeled as a mixture of CEU and YRI individuals from HapMap III [66]; also see the next section), we avoid the artificial introduction of strong bottlenecks. As seen in Figure 5, the error of the estimated ances- try proportions differ based on the software, the amount of genetic differentiation present among the parental populations, and the ratios of sampled individ- uals between the parental populations. With the same number of sampled individuals by parental population, ancestry proportions estimated by fastSTRUCTURE show the largest deviation to the real ancestral propor- tions in all the simulations. In all cases, admixture pro- portions in the admixed population tend to be better estimated if the parental populations are genetically dif- ferentiated (F >0.1); nevertheless, even in that case, st the mean difference between the estimated and the real admixture proportion can reach 5% in the case of sNMF and MDS, and 6% in the case of fastSTRUC- TURE. Unequal sample sizes of the parental popula- Figure4NettotimeestimatesforfastSTRUCTURE,sNMF,and tions also affect the performance of the different ADMIXTURE.Meantimeestimatesoftheterminationofthe algorithms. ADMIXTURE and fastSTRUCTURE show respectiveprogramsfromtenindependentreplications.Wesimulated 100chromosomesfromtwopopulationswithaneffectivepopulation a systematic error bias in the estimation of the admix- sizeof10,000andaNe*m=20usingms[75](seelegendofFigure2 ture proportions in the hybrid population when there is forcommanddetails).Theterminationtimecanbeexpectedtoscale unequal sample size in the parental populations, inde- similarlyasthenumberofusedSNPsgiventhecomplexityofthe pendently of the amount of population differentiation programs. among the parental populations. WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page9of12 Figure5Estimatederrorintheestimatedindividualadmixtureproportionsfromthesimulatedadmixedpopulation(HImodel).We usedanextendedversionofthebackwarddemographicsimulatordescribedin[76]thatincludesrecombinationanddifferenttypesofmating andallowsforancestrypainting[14].Overallparametersthataredefinedinthismodel[19],wevariedthetimeofsplitoftheancestral populations,whichrangedbetween50and2,000generationsamongsimulations.Eachsimulationgenerated75(25bypopulation)fullhuman genomeswith22diploidchromosomes(l)withthefollowingsizes:13.65,13.15,11.20,10.65,10.20,9.65,9.35,8.50,8.40,8.95,7.95,8.65,6.35,5.80, 6.30,6.75,6.50,5.95,5.40,5.40,3.10,and3.65Mb[77].Themutationratewassetto2.5×10−8[78]andtherecombinationrateto1.8×10−8.PLINK wasappliedtoexcludeSNPswithminorallelefrequencylessthan0.05andLD(defaultPLINK–indep5052).Theeffectivepopulationsizesof theparentalandhybridpopulationsweresetto5,000diploidindividuals;thetimeofadmixturewastengenerationsago,andeachparental populationequallycontributedtotheadmixedpopulation.Bythisway,weminimizedtheputativeeffectofgeneticdriftintheadmixture proportionsofthehybridpopulation.Furthermore,inordertoincludetheeffectsofbiassamplesize,werepeatedalltheanalyseswith1:1(A) and1:5(B)parentalpopulationsizeratios.Fourdifferentalgorithmswereconsidered:sNMF,ADMIXTURE,fastSTRUCTURE,andMDS.Inthecaseof MDS,ancestryproportionsofeachindividualfromtheadmixedpopulationwereestimatedastherelativepositioninthefirstdimensionin relationtothemeanestimatedcoordinateoftheparentalpopulations. RealdatafromHapMapIIIdata power of sNMF and ADMIXTURE is quite comparable. Simulations from synthetically generated admixed popu- fastSTRUCTURE loses power more rapidly with lower lations from African (YRI) and European (CEU) as an- sample size and maintains a better power for low num- cestral populations were produced (see Table 4 for bers of SNPs. All programs have an equally high power resultsandclarification ofthe applied methodology).We toestimatethe ancestry components. use the number of breakpoints to mimic the time of ad- mixture [14] and sampled SNPs with a minimum dis- Conclusions tance of 1 Mb to ensure linkage equilibrium. The results Identifying hidden population substructure in the gen- for sample size, number of SNPs, and admixture time, ome of an individual is important for a number of scien- represented here as the number of breaks, are quite tific disciplines. So far, the proposed algorithms are similar to the two-population simulations above. The invaluable tools for detecting and controlling for the WollsteinandLaoInvestigativeGenetics (2015) 6:7 Page10of12 presence of hidden population substructure. In the sim- Authors’contributions plest demographic models, these methods can also be AWandOLparticipatedinthedesignofthestudyandperformedthe statisticalanalyses.AWandOLequallyparticipatedinthedraftingofthe usedtoestimatedemographicparameters.However,inter- manuscript.Bothauthorsreadandapprovedthefinalmanuscript. pretingtheoutputofeachalgorithmfromanevolutionary point of view can be difficult. Different demographic sce- Acknowledgements narioscanleadtothesameancestryestimates,anddiffer- ThisstudywasfundedinpartbytheErasmusUniversityMedicalCenter Rotterdam.AWwasadditionallysupportedbyVolkswagenFoundation(ref ent estimates can be retrieved when applied to the same 80462).WewouldliketothankSusanWalshandWolfgangStephanfor dataset.Extrapolatingthe results from our simple simula- helpfulcommentsonthemanuscript. tions to real data (that is, suggesting which is the best al- Authordetails gorithm) can be misleading; except for cases such as the 1DepartmentofForensicMolecularBiology,ErasmusMCUniversityMedical admixture of European and Sub-Saharan African popula- CenterRotterdam,3000CA,Rotterdam,TheNetherlands.2Sectionof tionsintheUS[67],admixtureusuallyinvolvesmorethan EvolutionaryBiology,DepartmentofBiologyII,UniversityofMunich,82152 Planegg-Martinsried,Germany.3Currentaddress:CentroNacionaldeAnálisis two parental populations (for example, Latin America, al- Genómico,BaldiriReixac,4,BarcleonaSciencePark-TowerI,08028Barcelona, thoughsee[68]).Inaddition,parentalpopulationstendto Spain. showanon-negligiblegeneflow[61] withadmixedpopu- Received:13October2014Accepted:12January2015 lationsthatcansubstantiallydifferintheeffectivepopula- tion size compared to the parental populations (for example,seetheEuropeanRomani[46]),whileusuallythe References parentalpopulationsareunknown. 1. Cavalli-SforzaLL.Humanevolutionanditsrelevanceforgenetic epidemiology.AnnuRevGenomicsHumGenet.2007;8:1–15. The number of SNPs and sample size seem to be a 2. ChimpanzeeSequencingandAnalysisConsortium.Initialsequenceofthe limiting factor in all the algorithms that we have tested; chimpanzeegenomeandcomparisonwiththehumangenome.Nature. therefore, it would be recommended to use as many 2005;437:69–87. 3. BarbujaniG,ColonnaV.Humangenomediversity:frequentlyasked markers (conditioned in the absence of LD when re- questions.TrendsGenet.2010;26:285–95. quired by the algorithm) and samples as possible. How- 4. HandleyLJL,ManicaA,GoudetJ,BallouxF.Goingthedistance:human ever, in our simple model, we observe already good populationgeneticsinaclinalworld.TrendsGenet.2007;23:432–9. 5. RamachandranS,DeshpandeO,RosemanCC,RosenbergNA,FeldmanMW, estimates for >10 samples and >1,000 markers. In case Cavalli-SforzaLL.Supportfromtherelationshipofgeneticandgeographic fewer markers are available, fastSTRUCTURE provides distanceinhumanpopulationsforaserialfoundereffectoriginatingin the best estimates followed by ADMIXTURE and sNMF. Africa.ProcNatlAcadSciUSA.2005;102:15942–7. 6. RosenbergNA.Algorithmsforselectinginformativemarkerpanelsfor Furthermore, it is recommendable to run more than one populationassignment.JComputBiol.2005;12:1183–201. algorithm on the same data at the same time given the 7. JayF,SjödinP,JakobssonM,BlumMGB.Anisotropicisolationbydistance: observed diversity of results, different sensitivity to themainorientationsofhumangeneticdifferentiation.MolBiolEvol. 2013;30:513–25. biased sample size of the different algorithms, and an- 8. ExcoffierL,SmousePE,QuattroJM.Analysisofmolecularvarianceinferred cestry noise. In this sense, combining global ancestry frommetricdistancesamongDNAhaplotypes:applicationtohuman and population ancestry methods (for example, [69]), or mitochondrialDNArestrictiondata.Genetics.1992;131:479–91. 9. NievergeltCM,LibigerO,SchorkNJ.Generalizedanalysisofmolecular using the output from these algorithms as summary sta- variance.PLoSGenet.2007;3:e51. tistics [40], can improve the identification of population 10. LewontinRC.Theapportionmentofhumandiversity.In:Evolutionary substructure. Finally, although they can be used to pro- biology.US:Springer;1995.p.381–98. 11. EdwardsAWF.Humangeneticdiversity:Lewontin’sfallacy.Bioessays. vide hypotheses about the origin and evolution of popu- 2003;25:798–801. lations, it is recommended to test the evolutionary 12. BarbujaniG.Humanraces:classifyingpeoplevsunderstandingdiversity. hypotheses by means of other methods [46], rather than CurrGenomics.2005;6:215–26. 13. RischN.Dissectingracialandethnicdifferences.NEnglJMed. providing an ad hoc interpretation; in particular, any 2006;354:408–11. demographic interpretation from these methods should 14. PugachI,MatveyevR,WollsteinA,KayserM,StonekingM.Datingtheage be further validated by means of demographic simula- ofadmixtureviawavelettransformanalysisofgenome-widedata.Genome Biol.2011;12:R19. tions, showing that the proposed demographic model 15. NovembreJ,RamachandranS.Perspectivesonhumanpopulationstructure can producethe observedoutputofgeneticancestry. atthecuspofthesequencingera.AnnuRevGenomicsHumGenet. 2011;12:245–74. 16. LiuY,NyunoyaT,LengS,BelinskySA,TesfaigziY,BruseS.Softwaresand Abbreviations methodsforestimatinggeneticancestryinhumanpopulations.Hum AMOVA:Analysisofmolecularvariance;CGF:Continuousgeneflowmodel; Genomics.2013;7:1. GAMOVA:Generalizedanalysisofmolecularvariance;GWAS:Genome-wide 17. BernsteinF.DiegeographischeVerteilungderBlutgruppenundihre associationstudy;HWE:Hardy-Weinbergequilibrium;MDS:Classical anthropologischeBedeutung.1932. multidimensionalscaling,alsocalledprincipalcoordinateanalysis; 18. PickrellJK,PritchardJK.Inferenceofpopulationsplitsandmixturesfrom PCA:Principalcomponentanalysis;SNP:Singlenucleotidepolymorphism. genome-wideallelefrequencydata.PLoSGenet.2012;8:e1002967. 19. BertorelleG,ExcoffierL.Inferringadmixtureproportionsfrommolecular data.MolBiolEvol.1998;15:1298–311. Competinginterests 20. LongJC.Thegeneticstructureofadmixedpopulations.Genetics. Theauthorsdeclarethattheyhavenocompetinginterests. 1991;127:417–28.
Description: