Copyright2002bytheGeneticsSocietyofAmerica Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci Ziheng Yang1 GaltonLaboratory,DepartmentofBiology,UniversityCollegeLondon,LondonWC1E6BT,England ManuscriptreceivedApril18,2002 AcceptedforpublicationSeptember6,2002 ABSTRACT Polymorphismsinanancestralpopulationcancauseconflictsbetweengenetreesandthespeciestree. Suchconflictscanbeusedtoestimateancestralpopulationsizeswhendatafrommultiplelociareavailable. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data fromthreespeciesunderafinite-sitenucleotidesubstitutionmodel.Bothmaximum-likelihood(ML)and Bayesmethodsareimplementedforjointestimationofthetwospeciationdatesandthetwopopulation sizeparameters.Bothmethodsaccountforuncertaintiesinthegenetreeduetofewinformativesitesat eachlocusandmakeanefficientuseofinformationinthedata.TheBayesalgorithmusingMarkovchain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporatingpriorinformationabouttheparameters.Themethodsareappliedtoadatasetof53nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methodsare(cid:1)12,000–21,000,comparabletoestimatesformodernhumans,anddonotsupportthenotion of a dramatic size reduction in early human populations. Estimates published previously from the same dataareseveraltimeslargerandappeartobebiasedduetomethodologicaldeficiency.Thedivergence betweenhumansandchimpanzeesisdatedat(cid:1)5.2millionyearsagoandthegorilladivergence1.1–1.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestralpopulationsizesandthatitisadvantageoustoanalyzedataofseveralspeciessimultaneously. THEamountofsequencepolymorphisminapopula- etal.2000forextensivereviews).Theyrequiredatafrom tion is mainly determined by the parameter (cid:1) (cid:2) multiple loci and make use of the information in the 4N(cid:3), where N is the (effective) population size and (cid:3) conflictofgenetreesamongloci.Forexample,Takahata isthemutationratepernucleotidesitepergeneration. (1986) suggested a method for estimating the popula- Inadditiontoitseffectontheamountofneutralvaria- tionsizeofthecommonancestoroftwocloselyrelated tion maintained in a population, the population size specieswhenapairofhomologousDNAsequencesare is also important in affecting the efficiency of natural availablefromthetwospeciesateachlocus.Theaverage selectionandisacriticalparameterinpopulationgenet- divergence at many loci provides information on the icsmodelsofmutationandselection.Itisimportantto speciesdivergencetime,whilethevariationofsequence competing theoriesof originand evolutionof modern divergenceamonglocireflectstheancestralpopulation humans.Whenanestimateofthemutationrateisavail- size. The method of Takahata (1986) uses summary able, as is assumed here, the population size can be statistics from the data and was later extended to a obtained from estimates of (cid:1). The population size of a full-likelihoodanalysisandtothecaseofthreespecies, present-dayspeciescanbeestimatedbyusingobserved wherethepopulationsizesofthetwoextinctancestors DNAsequencevariationinthepopulation(e.g.,Kreit- aswellasthetwospeciationdateswereestimated(Taka- man1983;Fu1994;Takahataetal.1995;Yang1997a). hata et al. 1995). The population size of modern humans has been esti- Inthecaseofthreespecies,anothersimplermethod matedtobe(cid:1)10,000(e.g.,Takahataetal.1995;Zhao hasalsobeenused(Nei1987;Wu1991;Hudson1992). et al. 2000; Yu et al. 2001). Thisapproachusestheproportionoflociatwhichthe The population size of an extinct ancestral species, gene tree does not match the species tree to estimate suchasthecommonancestorofhumansandchimpan- the ancestral population size and exploits the fact that zees,ishardertoestimate,butafewmethodshavebeen ancestral polymorphism creates conflicts between the developed (see Edwards and Beerli 2000 and Satta gene tree and the species tree. I refer to it as the tree- mismatch method. Its application to human and great apesequencedatahasledtoestimatesofthepopulation size for the ancestor of humans and chimpanzees that 1Addressforcorrespondence:DepartmentofBiology,UniversityCollege are much larger than estimated population sizes for London,DarwinBldg.,GowerSt.,LondonWC1E6BT,England. E-mail:[email protected] modern humans (Ruvolo 1997; Chen and Li 2001). Genetics162:1811–1823(December2002) 1812 Z.Yang For example, Chen and Li (2001) sequenced 53 non- data contain DNA sequences from multiple loci, with codingcontigsfromhuman,commonchimpanzee,go- oneindividualsequencedfromeachofthethreespecies rilla,andorangutan,andestimatedthepopulationsize ateachlocus.Itisassumedthatthereisnorecombina- ofthecommonancestorofhumansandchimpanzeesto tion within a locus and free recombination between rangefrom52,000to150,000,dependingonthegenera- loci. The population is assumed to be random mating, tiontimeused(15or20years)andonwhetherparsimony with no gene flow after species divergence. or neighbor joining was used to infer the gene trees. Thelikelihoodfunction:Considertheprobabilitydis- The tree-mismatch approach has room for improve- tribution of the gene tree and its branch lengths at ment. First, the conflicts in the estimated gene trees any locus i. Two cases are possible and are dealt with amonglocicanbeduetobothancestralpolymorphism separately. Case I is represented by Figure 1b, where and sampling errors in tree reconstruction. As the se- the gene tree is T (cid:2) ((12)3), identical to the species 0 quencesarehighlysimilar,withfewvariableorinforma- tree, and sequences 1 and 2 coalesce between the two tivesitesateachlocus,thereconstructedgenetree,no speciationeventsC andD.Thecoalescent timest and 0 matter what method was used to infer it, is unreliable. t are defined in Figure 1b. Case I occurs when t (cid:4) (cid:5), 1 1 0 Ignoringthesamplingerrorintheestimatedgenetree with probability loevaedrsesttoimoavteiroenstoimftahtieoannocfestthrealcpoonpfluiclattaiomnosnizgel(osceieabned- (cid:6) (cid:2) (cid:1)(cid:5)0 1 e(cid:7)t1/(2N1)dt1 (cid:2) 1 (cid:7) e(cid:7)2(cid:8)0/(cid:1)1 (1) 2N low).Second,the branchlengthsinthe genetreepro- 0 1 vide, in a probabilistic sense, information about the (e.g.,Takahataetal.1995).IncaseII,bothcoalescent ancestralpopulationparameters,butareignoredbythe events occur in the population ancestral to all three method. The method clearly makes poor use of the species (Figure 1, c–e). The three gene trees T (cid:2) 1 information in the data. Third, the method assumes ((12)3), T (cid:2) ((23)1), and T (cid:2) ((31)2) occur with 2 3 thatoneknowsthespeciesdivergencetimes,whilethey equal probability (1 (cid:7) (cid:6))/3. mightnotbeknown.Indeed,someauthorsargueforthe In case I (tree T in Figure 1b), the coalescent time 0 importance of accounting for ancestral polymorphism t is an exponential random variable truncated at t (cid:4) 1 1 whenestimatingspeciationdates(Takahataetal.1995). (cid:5), and t is an independent random variable with an 0 0 Itseemsadvantageoustouseinformationinboththe exponential distribution conflicting gene genealogies as well as the branch 1 lengths while accounting for their uncertainties. This fI(t1|T0) (cid:2) e(cid:7)t1/(2N1)/(cid:6), 0 (cid:4) t1 (cid:4) (cid:5)0, canbeachievedbyusingthelikelihoodmethodundera 2N1 coalescentmodeldevelopedbyTakahataetal.(1995). 1 However,theinfinite-sitesmodelassumedinthemethod fI(t0|T0) (cid:2) e(cid:7)t0/(2N0), 0 (cid:4) t0 (cid:4) ∞. 2N (2) 0 issometimesviolatedbythedata.Inthisarticle,Iextend the method of Takahata et al. for estimating ancestral Let branchlengths b andb in thegene treeof Figure 0 1 population sizes, using data from three species. I use 1b be defined as follows: thenucleotidesubstitutionmodelofJukesandCantor (1969)tocorrectformultiplesubstitutionsatthesame b0 (cid:2) ((cid:5)0 (cid:9) t0 (cid:7) t1)(cid:3) (cid:2) (cid:8)0 (cid:9) t0(cid:3) (cid:7) t1(cid:3), site. The model is also implemented in a Bayes frame- b (cid:2) ((cid:5) (cid:9) t)(cid:3) (cid:2) (cid:8) (cid:9) t(cid:3). (3) 1 1 1 1 1 work, with Markov chain Monte Carlo (MCMC) used to achieve efficient calculation. The methods are ap- Notethatb0isthelengthoftheinternalbranchABand plied to the data of Chen and Li (2001). b1isthedistancefromsequence1toancestorB(Figure 1b).GiventhatthegenetreeatlocusiisT (cid:2)T andits i 0 branchlengthsareb andb ,theprobabilityofobserving i0 i1 MAXIMUM-LIKELIHOOD ESTIMATION sequencedataD atlocusi,P(D|T,b ,b ),isthelikelihood i i 0 i0 i1 The species phylogeny is represented ((12)3) and is intraditionalmolecularphylogenetics(Felsenstein1981). assumedknown.Species1and2diverged(cid:5) generations Calculation of this probability is discussed in the next 1 agowhilespecies3diverged(cid:5) generationsearlier(Fig- section. By averaging over the distribution of the ran- 0 ure 1a). In this article, species 1, 2, and 3 represent domvariablest0andt1(orb0andb1),theprobabilityof human (H), chimpanzee (C), and gorilla (G), respec- observing the data at locus i for case I is tively. The effective population size of the ancestor of (cid:1)∞(cid:1)(cid:5) f(D|T)(cid:2) 0P(D|T,(cid:8) (cid:9)t(cid:3)(cid:7)t(cid:3),(cid:8) (cid:9)t(cid:3)) species1and2isN,andthatofthecommonancestor i 0 i 0 0 0 1 1 1 1 0 0 of all three species is N. The mutation rate (cid:3) is mea- suredasthenumberofn0ucleotidesubstitutionspersite (cid:10)fI(t1|T0)fI(t0|T0)dt1dt0 psuecrhgeannaelryastiiso,nt.heAspraartaemaentedrstiomfethaeremcoodnefoluanrede(cid:1)d0 i(cid:2)n (cid:2)(cid:6)1(cid:1)∞(cid:1)2(cid:8)0/(cid:1)1P(Di|T0,(cid:8)0(cid:9)21(cid:1)0x0(cid:7)21(cid:1)1x1,(cid:8)1(cid:9)21(cid:1)1x1) 4N(cid:3), (cid:1) (cid:2) 4N(cid:3), (cid:8) (cid:2) (cid:5)(cid:3), and (cid:8) (cid:2) (cid:5)(cid:3) (Figure 1a). 0 0 0 1 1 0 0 1 1 Collectively they are denoted (cid:11) (cid:2) {(cid:1)0, (cid:1)1, (cid:8)0, (cid:8)1}. The (cid:10)e(cid:7)(x0(cid:9)x1)dx1dx0, (4) AncestralPopulationSizesinHominoids 1813 Figure 1.—(a) The species tree ((12)3) for three species: 1 (human), 2 (chimpanzee), and 3 (gorilla). Species 1 and 2 diverged (cid:5) generations ago 1 andspecies3diverged(cid:5) gener- 0 ations earlier. The population sizesareN inthecommonan- 0 cestor of all three species and N inthecommonancestorof 1 species 1 and 2. The four pa- rametersinthemodelare(cid:1) (cid:2) 0 4N(cid:3),(cid:1) (cid:2)4N(cid:3),(cid:8) (cid:2)(cid:5)(cid:3),and 0 1 1 0 0 (cid:8) (cid:2)(cid:5)(cid:3),where(cid:3)isthemuta- 1 1 tion rate per site per genera- tion. There are four possible genetrees,representedbyb–e. In case I (b), sequences 1 and 2 coalescence between the two speciation events and the gene tree T (cid:2) ((12)3) is consistent 0 withthespeciestree.ThisisreferredtoascaseIinthetext.Inc–e,bothcoalescenceeventsoccurinthecommonancestorof allthreespecies.Inthiscase(caseII),thegenetreecanbeT (cid:2)((12)3),T (cid:2)((23)1),orT (cid:2)((31)2),withequalprobability. 1 2 3 after a change of variables. f(D|(cid:1),(cid:1),(cid:8),(cid:8))(cid:2)(cid:6)f(D|T)(cid:9)1(cid:7)(cid:6)(cid:2)3 f(D|T) In case II (Figure 1, c–e), both coalescence events i 0 1 0 1 i 0 3 k(cid:2)1 i k TochceutrhirneethgeenpeotpreuelastTio1,nTa2,nacnedstrTa3lhtaovealelqtuharlepersopbeacbieilsi.- (cid:2)(cid:1)0∞(cid:1)02(cid:8)0/(cid:1)1P(Di|T0,(cid:8)0(cid:9)21(cid:1)0x0(cid:7)21(cid:1)1x1,(cid:8)1(cid:9)21(cid:1)1x1) ties.Thecoalescenttimest andt,definedinFigure1, 0 1 c–e, are independent random variables with exponen- (cid:10)e(cid:7)(x0(cid:9)x1)dx1dx0 tial distributions fII(t1|Tk) (cid:2) 3 e(cid:7)3t1/(2N0), 0 (cid:4) t1 (cid:4) ∞, (cid:9)e(cid:7)2(cid:8)0/(cid:1)1(cid:1)0∞(cid:1)0∞k(cid:2)(cid:2)31P(Di|Tk,21(cid:1)0x0,(cid:8)0(cid:9)(cid:8)1(cid:9)21(cid:1)0x1) 2N 0 (cid:10)e(cid:7)(x0(cid:9)3x1)dx1dx0. (8) 1 fII(t0|Tk) (cid:2) 2Ne(cid:7)t0/(2N0), 0 (cid:4) t0 (cid:4) ∞, (5) The log likelihood is a sum over all the L loci, 0 for k (cid:2) 1, 2, or 3. Similarly, the branch lengths in the (cid:1)((cid:1)0, (cid:1)1, (cid:8)0, (cid:8)1|D) (cid:2) (cid:2)L log(cid:3)f(Di|(cid:1)0, (cid:1)1, (cid:8)0, (cid:8)1)(cid:4), (9) gene tree are defined as i(cid:2)1 b (cid:2) t(cid:3), whereD(cid:2){Di}representsdataatallLloci.Parameters 0 0 (cid:1), (cid:1), (cid:8), and (cid:8) are estimated by numerical maximiza- 0 1 0 1 b (cid:2) (cid:8) (cid:9) (cid:8) (cid:9) t(cid:3). (6) tion of the log-likelihood function. A C-optimization 1 0 1 0 routine from the PAML package (Yang 1997b) was Calculation of the probability, P(D|T, b , b ), of ob- servingdataD atlocusi,giventhegenei trkeei0T (cid:2)i1 T (k(cid:2) used.Eachlikelihoodcalculationrequiresevaluationof i i k 2L two-dimensional integrals (Equation 8), which are 1,2,3)anditsbranchlengthsb andb ,willbedescribed i0 i1 calculated numerically using Mathematica. The Math- in the next section. The probability of observing the link library was used to establish communication be- data at locus i for case II is tween the C routine and Mathematica. For the data of (cid:1)∞(cid:1)∞ f(D|T) (cid:2) P(D|T, t(cid:3), (cid:8) (cid:9) (cid:8) (cid:9) t(cid:3)) ChenandLi(2001)withL(cid:2)53loci,theMLiteration i k i k 0 0 1 1 0 0 takes (cid:1)1 hr on a Pentium III PC at 1.2 GHz. (cid:10) f (t|T)f (t|T)dtdt Probabilityofdataatalocusgiventhegenetreeand II 1 k II 0 k 1 0 branchlengths:GiventhegenetreeT,whichisoneof (cid:1)∞(cid:1)∞ 1 1 i (cid:2) P(Di|Tk, (cid:1)0x0, (cid:8)0 (cid:9) (cid:8)1 (cid:9) (cid:1)0x1) T0,T1,T2,orT3inFigure1b,anditsbranchlengths(bi0 0 0 2 2 and bi1), the probability of observing the data, P(Di|Ti, (cid:10) 3e(cid:7)(x0(cid:9)3x1)dx1dx0, (7) bi0,bi1),inEquation8canbecalculatedunderanymodel of nucleotide substitution (Lio and Goldman 1998). for k (cid:2) 1, 2, 3. ThemodelofJukesandCantor(1969)isusedinthis Averaging over cases I and II or over the four gene article, which seems sufficient for such highly similar trees T, T, T, T in Figure 1, we obtain the marginal sequences. Under this model, the sequence data can 0 1 2 3 probability of observing data D at locus i as be summarized as counts of five possible site patterns i 1814 Z.Yang TABLE1 with those five site patterns: D (cid:2) {n , n , n , n , n }. i i0 i1 i2 i3 i4 TheobservednumberofcountsfromthedataofChen NumberofsitepatternsfromthedataofChenandLi(2001) and Li (2001) is listed in Table 1. Locus(i) ni ni0 ni1 ni2 ni3 ni4 dHCG-O Rate ForgenetreesT0orT1,theprobabilitiesofobserving the five site patterns are 1-2609 472 462 3 3 4 0 0.0299 1.010 2-1251 531 509 13 4 5 0 0.0662 2.236 p(b,b)(cid:2)prob(xxx) 0 0 1 3-2659 560 551 4 1 4 0 0.0206 0.696 7-2012 528 511 10 2 5 0 0.0284 0.959 (cid:2)(1(cid:9)3e(cid:7)8b1/3(cid:9)6e(cid:7)8(b0(cid:9)b1)/3(cid:9)6e(cid:7)(8b0(cid:9)12b1)/3)/16, 8-1364 475 465 6 3 1 0 0.0295 0.996 9-1386 484 471 3 3 7 0 0.0437 1.476 p(b,b)(cid:2)prob(xxy) 1 0 1 10-1412N 474 462 8 3 1 0 0.0254 0.858 10-2207 480 475 3 2 0 0 0.0337 1.138 (cid:2)(3(cid:9)9e(cid:7)8b1/3(cid:7)6e(cid:7)8(b0(cid:9)b1)/3(cid:7)6e(cid:7)(8b0(cid:9)12b1)/3)/16, 10-2215-3 515 505 4 3 3 0 0.0278 0.939 10-2891-2 545 538 1 3 3 0 0.0227 0.767 p2(b0,b1)(cid:2)prob(yxx) 11-1419-2 474 464 4 1 5 0 0.0317 1.071 11-2224 371 369 0 0 2 0 0.0200 0.675 (cid:2)(3(cid:7)3e(cid:7)8b1/3(cid:9)6e(cid:7)8(b0(cid:9)b1)/3(cid:7)6e(cid:7)(8b0(cid:9)12b1)/3)/16, 11-73646 463 451 5 1 6 0 0.0294 0.993 12-1482-2 368 359 4 2 3 0 0.0299 1.010 p3(b0,b1)(cid:2)prob(xyx) 12-2906 396 390 5 0 1 0 0.0259 0.875 12-2924 492 479 6 3 4 0 0.0229 0.773 (cid:2)(3(cid:7)3e(cid:7)8b1/3(cid:9)6e(cid:7)8(b0(cid:9)b1)/3(cid:7)6e(cid:7)(8b0(cid:9)12b1)/3)/16(cid:2)p2, 12-2927-2 471 461 3 5 2 0 0.0298 1.006 p(b,b)(cid:2)prob(xyz) 14-2960 301 297 3 1 0 0 0.0248 0.838 4 0 1 14-2963 366 360 6 0 0 0 0.0253 0.854 (cid:2)(6(cid:7)6e(cid:7)8b1/3(cid:7)12e(cid:7)8(b0(cid:9)b1)/3(cid:9)12e(cid:7)(8b0(cid:9)12b1)/3)/16 15-2265-2 459 450 5 2 2 0 0.0259 0.875 15-2266 510 497 8 1 4 0 0.0243 0.821 (10) 16-598D4 518 511 3 0 4 0 0.0164 0.554 (Yang1994).ForgenetreesT andT (Figure1,dand 17-0787 450 443 2 3 2 0 0.0264 0.892 2 3 17-0801 491 485 3 2 1 0 0.0321 1.084 e),theprobabilitiescanbeobtainedbyconsideringthe 17-0812-2 374 361 8 2 3 0 0.0348 1.175 symmetry of the problem. Thus with functions p–p 0 4 17-0813 444 436 4 2 2 0 0.0291 0.983 definedabove,theprobabilityofdataatlocusi,condi- 17-1574 359 352 4 1 2 0 0.0226 0.763 tional on the gene tree T, k (cid:2) 1 (or 0), 2, 3, and its 17-2294 514 500 9 4 1 0 0.0277 0.936 k branch lengths b and b, is given by the multinomial 17-2987 497 486 4 1 5 1 0.0285 0.963 0 1 17-2988 494 481 7 4 2 0 0.0306 1.034 distribution 17-0784 462 454 3 1 4 0 0.0484 1.635 17-276O15 433 419 7 3 4 0 0.0324 1.094 P(Di|T1, b0, b1) (cid:2) C (cid:10) p0ni0p1ni1pn2i2(cid:9)ni3p4ni4, 17-2984 502 483 8 7 4 0 0.0245 0.827 17-2986 419 412 2 2 3 0 0.0185 0.625 P(Di|T2, b0, b1) (cid:2) C (cid:10) p0ni0p1ni2pn2i3(cid:9)ni1p4ni4, 18-0864 373 360 7 5 1 0 0.0415 1.402 18-0866 443 434 2 4 3 0 0.0175 0.591 P(Di|T3, b0, b1) (cid:2) C (cid:10) p0ni0p1ni3pn2i1(cid:9)ni2p4ni4, (11) 18-1506 461 456 3 0 2 0 0.0257 0.868 18-1584 481 470 1 5 5 0 0.0400 1.351 where C (cid:2) ni!/(cid:12)j4(cid:2)0nij!, and ni (cid:2) ni0 (cid:9) ni1 (cid:9) ni2 (cid:9) 18-2558 443 434 5 4 0 0 0.0372 1.256 n (cid:9) n . i3 i4 19-0946 320 310 6 4 0 0 0.0169 0.571 Mutationratevariationamongloci:Animportantfac- 19-0953 479 474 2 1 1 1 0.0240 0.811 torthatmayinfluencetheestimationofancestralpopu- 20-1636 535 517 4 10 4 0 0.0447 1.510 lation sizes is the variation of mutation rates among 20-2012 511 502 5 2 2 0 0.0391 1.321 20-2018 535 522 5 4 4 0 0.0319 1.077 loci (Yang 1997a; Chen and Li 2001). For example, 20-2019 410 401 5 2 2 0 0.0299 1.010 estimationofancestralpopulationsizefromcomparison 20-2020 450 439 3 3 5 0 0.0334 1.128 between two species was found to be sensitive to even 20-2064 512 504 6 1 1 0 0.0260 0.878 slight rate variation (Yang 1997a). If relative rates for 20-2085 542 534 1 2 5 0 0.0219 0.740 20-2352 511 502 2 4 3 0 0.0279 0.942 thelociareavailable,itwillbestraightforwardtoincor- 20-2472 452 444 4 2 2 0 0.0520 1.756 poratetheminthelikelihoodcalculation(Yang1997a). 20-2560 517 510 5 1 0 1 0.0286 0.966 Inthisarticle,Iusetheaveragedistancefromtheorang- 20-2563 545 535 1 4 5 0 0.0217 0.733 utan to the three African apes to calculate the relative 20-2568 487 480 2 4 1 0 0.0195 0.659 rate for the locus (Table 1). This ad hoc approach ap- n ,n ,n ,n ,n arecountsofsiteswithpatternsxxx,xxy, pears sensible since the orangutan diverged from the i0 i1 i2 i3 i4 yxx,xyx,andxyzinhuman(H),chimpanzee(C),andgorilla African apes very early and ancestral polymorphism in (G),whilen (cid:2)n (cid:9)n (cid:9)n (cid:9)n (cid:9)n isthetotalnumber i i0 i1 i2 i3 i4 their common ancestor does not seem important. The ofsitesatlocusi. likelihood calculation proceeds as before except that the branch lengths for the gene tree at each locus are (configurations): xxx, xxy, yxx, xyx, and xyz, where x, y, multiplied by the relative rate for that locus. and z are any three different nucleotides. The data at ApplicationtothedataofChenandLi:ChenandLi locus i can thus be represented by the number of sites (2001)sequencedoneindividualfromeachofthefour AncestralPopulationSizesinHominoids 1815 TABLE2 amongloci,Icalculatedtheaveragesequencedistance under the model of Jukes and Cantor (1969) from Maximum-likelihoodestimatesofparameters thehuman,chimpanzee,andgorillatotheorangutan, Variablerates that is, dHCG-O (cid:2) (dH-O (cid:9) dC-O (cid:9) dG-O)/3. This is divided Parameter Onerateforallloci amongloci bytheaverageacrossalllocitogivetherelativeratefor that locus (Table 1). The average distance from the (cid:1)ˆ (Nˆ ) 0.003057(38,000) 0.002348(29,000) (cid:1)ˆ0(Nˆ0) 0.000990(12,000) 0.001650(21,000) orangutan to the African apes is found to be 0.0296; (cid:8)ˆ1 (tim1 einMY) 0.001089(1.1MY) 0.001704(1.7MY) this is consistent with a mutation rate of 10(cid:7)9 substitu- 0 (cid:8)ˆ (timeinMY) 0.005194(5.2MY) 0.004936(4.9MY) tions per site per year and an orangutan divergence 1 (cid:1) (cid:7)3,099.41 (cid:7)3,100.01 date of (cid:1)13 MYA (e.g., Hasegawa et al. 1987) since 0.0296/(2 (cid:10) 13 (cid:10) 106) (cid:2) 1.1387 (cid:10) 10(cid:7)9. The relative Inconverting(cid:1)intoNand(cid:8)intospeciationtime,thegener- ation time is assumed to be g (cid:2) 20 years in all species and rates for loci calculated this way are used as fixed con- themutationratetobe10(cid:7)9substitutionspersiteperyear. stantsinthelikelihoodcalculationforthedataofthree species.TheMLEsofparametersareshowninTable2. Using the same generation time and mutation rate as species,human,chimpanzee,gorilla,andorangutan,at above, we get the estimate of the population size for 53independentnoncodingloci(contigs).Anadvantage theancestorofhumansandchimpanzeestobe21,000, of the data set is that the sequences are expected to larger than the estimate under the assumption of a be outside and far away from coding regions and not constant rate for all loci. The population size for the affected by selection at linked sites or loci. The model ancestor of all three species is estimated to be 29,000, ofthisarticleassumesthemolecularclockandusesonly smallerthanundertheconstant-ratemodel.Thediffer- three species. The counts of site patterns are listed in encesbetweenthetwoanalysesaresomewhatsurprising, Table1.Atsomeloci,thetotalnumberofsitesusedin as one might expectthe population sizes to be smaller this article is larger than that in Chen and Li (2001; when rate variation among loci is accounted for. How- Table1),becausesomesiteshadalignmentgapsinthe ever,itisnotedthatthedistancefromorangutantothe orangutan and were removed by Chen and Li. Africanapeshasonlyaweakcorrelation(0.44)withthe Thethree-speciesdataareanalyzedbythemaximum- averagedistancewithintheH-C-Ggroup,whichappears likelihood (ML) method of this article. The estimates to suggest that the mutation rates are rather homoge- of parameters are given in Table 2. If we assume a neous among loci and that the conflict among loci in generation time of 20 years and a mutation rate of sequencedivergenceismainlycausedbyancestralpoly- 10(cid:7)9substitutionspersiteperyear(e.g.,Nachmanand morphism. Crowell2000),theestimatessuggestapopulationsize If the average distances within the H-C-G group, for the ancestor of humans and chimpanzees of (d (cid:9) d (cid:9) d )/3, are used as relative rates for the HC CG GH (cid:1)12,000.Thisisseveraltimessmallerthantheestimates loci,parameterestimatesbecome(cid:1)ˆ (cid:2)0.0000014,(cid:1)ˆ (cid:2) 0 1 of Chen and Li (2001) from the same data, at a mini- 0.002902, (cid:8)ˆ (cid:2) 0.003229, and (cid:8)ˆ (cid:2) 0.004555, with (cid:1) (cid:2) 0 1 mumof52,000.Theestimateisalsosimilartoestimates (cid:7)3069.73. Those correspond to a population size of ofthepopulationsizeofmodernhumans,forexample, 36,000fortheancestorofhumansandchimpanzees,a 12,000 by Yu et al. (2001). The population size for the population size of only 18 for the ancestor of all three common ancestor of all three species is estimated to species, 4.5 MY for the H-C divergence, and 7.7 MY be (cid:1)38,000. The same analysis estimated the human- for the gorilla divergence. This calculation effectively chimpanzee divergence time at 5.1 million years ago attributes all variation in sequence divergence among (MYA)andthegorilladivergenceat(cid:1)1.1millionyears locitomutationratevariationandcausesunderestima- (MY)earlier.Thoseestimatesarelargelyconsistentwith tion of (cid:1) and (cid:8) and overestimation of (cid:1) and (cid:8) (see 0 1 1 0 those of previous studies (e.g., Hasegawa et al. 1985; also Table 2). Ruvolo 1997; Kumar and Hedges 1998; Yoder and Comparison with the tree-mismatch method: When Yang 2000). Figure 2a shows the likelihood surface as thegenetreeisT orT (Figure1),thereisamismatch 2 3 afunctionof(cid:1) and(cid:1) when(cid:8) and(cid:8) arefixedattheir betweenthespeciestreeandthegenetree.Thisoccurs 0 1 0 1 maximum-likelihoodestimates(MLEs).The(cid:1)95%con- with probability P (cid:2) f(T) (cid:9) f(T) (cid:2) 2(1 (cid:7) (cid:6))/3 (cid:2) SG 2 3 fidenceregionisgivenbythelikelihoodcontourat3.32 2⁄3e(cid:7)2(cid:8)0/(cid:1)1 (e.g., Nei 1987, pp. 288–289). The tree-mis- units below the optimum, that is, at (cid:7)3102.73 (Figure matchmethodestimates(cid:1) byequatingthisprobability 1 2a).Thesamplingerrorsarequitelarge.Analysisofthe to the proportion (pˆ) of loci at which the estimated humanandchimpanzeesequencesatthe53lociusing gene tree differs from the species tree, with (cid:8) being 0 the ML method of Takahata et al. (1995) under the assumedknown;thatis,(cid:1)ˆ (cid:2)(cid:7)2(cid:8)/log{3pˆ/2}.Chenand 1 0 infinite-sitesmodelleadsto(cid:1)ˆ (cid:2)0.0017and(cid:8)ˆ (cid:2)0.0055 Li (2001) used the orangutan to root the H-C-G tree 1 1 (N. Takahata, personal communication). Those esti- and were able to resolve the gene tree at 33 loci, out mates are similar to the MLEs of Table 2. of which 9 mismatches were found, at a proportion of To examine the effect of mutation rate variation 27.3%.Severalcodinglociwereincludedaswell,sothat 1816 Z.Yang clock-rooting approach uses more “informative” sites thanout-grouprootingandresolvesthegenetreeat49 of the 53 loci, out of which 18 are mismatches, at the proportion36.7%.Thisproportionisevenhigherthan thoseofChenandLiandproducesevenlargerestimates of (cid:1) and N. 1 1 To understand the difference between the tree-mis- match method and ML, note that three aspects of the data are ignored by the tree-mismatch method and ac- counted for by ML: (i) uncertainty in the estimated gene tree due to the finite number of nucleotide sites atthelocus,(ii)unresolvedloci(ties),and(iii)branch lengthsinthegenetreereflectedinthesequencediver- gences.Whileallthreeprobablycontributetothelarge differences discussed above, uncertainty in the esti- mated gene tree seems to be the predominant reason. Amore“proper”tree-mismatchmethodshouldequate the observed proportion of mismatches (pˆ) not to P SG but to P , the probability of a mismatch between the SE speciestreeandtheestimatedgenetree.Thisprobability is given by P (cid:2) (cid:2) f(D|(cid:1), (cid:1), (cid:8), (cid:8)), (12) SE i 0 1 0 1 max{ni2,ni3}(cid:13)ni1 where f is the probability of data D (cid:2) {n , n , n , n , i i0 i1 i2 i3 n } in Equation 8, and the summation is over all data i4 configurations in which the ML tree for the locus is eitherT orT (Figure1). UnlikeP , P is dependent 2 3 SG SE on all four parameters, (cid:1), (cid:1), (cid:8), and (cid:8), as well as the 0 1 0 1 sequence length n and appears no easier to calculate i than the full likelihood (Equation 9). Instead I use MonteCarlosimulationtocalculatethoseprobabilities toassesstheimpactoferrorsingenetreereconstruction on the difference between P and P . The MLEs of SG SE parameters in Table 2 (“constant rate”) are used to generate gene trees, which are used to “evolve” se- quences.Thesitesarecountedto obtainthedataD (cid:2) i {n ,n ,n ,n ,n },whichareusedtoestimatethegene i0 i1 i2 i3 i4 tree by ML. Figure2.—Log-likelihoodsurface(contour)asafunction Figure3showstheprobabilitiesthatthespeciestree of(cid:1)0and(cid:1)1when(cid:8)0and(cid:8)1arefixedattheirMLEs.(a)The (S),thegenetree(G),andtheestimatedgenetree(E) samesubstitution(mutation)rateisassumedforallloci.(b) differ from each other. The probability of a mismatch Fixed relative rates obtained from comparison between the between the species tree and the gene tree is P (cid:2) orangutan and the African apes (human, chimpanzee, and SG gorilla) are used to account for possible evolutionary rate 2(1 (cid:7) (cid:6))/3 (cid:2) 0.0739, much lower than the observed variation among loci. Maximum-likelihood estimates of pa- mismatch proportion pˆ (cid:2) 0.367. The probability of a rametersarelistedinTable2. mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), P (cid:2) 0.2028, which is 2.7 times SE 16mismatcheswerefoundatatotalof52resolvedloci, as high as P (Figure 3). Now consider the four gene SG attheproportion30.8%.TheauthorsassumedanH-C trees T, T, T, and T (Figure 1), which occur with 0 1 2 3 divergenceat(cid:8) (cid:2)1.6MYAandarrivedat(cid:1)ˆ (cid:2)0.00414, probabilities f(T) (cid:2) (cid:6) (cid:2) 0.8892 and f(T) (cid:2) f(T) (cid:2) 1 1 0 1 2 which, if the generation time is 20 years, corresponds f(T) (cid:2) (1 (cid:7) (cid:6))/3 (cid:2) 0.0369 (Equation 1). According 3 to a minimum population size of Nˆ (cid:2) 52,000 for the to the simulation, the probability that the topology of 1 ancestor of humans and chimpanzees. In this article, thegenetreeisincorrectlyreconstructedwhenthetrue the molecular clock has been assumed, which can also gene tree is T, T, T, or T is P (cid:2) 0.1453 and P (cid:2) 0 1 2 3 0 1 beusedtoroottheH-C-Gtree.Undertheclock,T,T, P (cid:2) P (cid:2) 0.1981. Note that for the estimated gene 1 2 2 3 orT istheMLtreeifn ,n ,orn isthegreatestamong tree, we consider only the topology and disregard its 3 i1 i2 i3 the three, respectively (Saitou 1988; Yang 1994). The divergence times relative to the speciation times. The AncestralPopulationSizesinHominoids 1817 Figure3.—Tree-mismatchprob- abilities calculated using Monte Carlo simulation plotted as func- tions of the sequence length. MLEs of parameters in Table 2 (onerateforallloci)areusedin thesimulation.NotethatS,G,and E in the subscripts stand for the speciestree,thegenetree,andthe estimatedgenetree(theMLtree), respectively.ThusP istheproba- SG bilitythatthespeciestreeandthe genetreediffer;thisis0.0739for the parameter values used. P is SE theprobabilityofamismatchbe- tweenthespeciestreeandtheesti- mated gene tree, and P is the GE probabilityofamismatchbetween the gene tree and the estimated genetree.P istheproportionof tie replicates in which a tie occurs, that is, the two best trees are equally good. Ties are excluded incalculationofP andP .Ten SE GE million replicates were simulated foreachsequencelength. averageerrorprobabilityofgenetreereconstructionis g(x; (cid:14), (cid:15)) (cid:2) (cid:15)(cid:14)e(cid:7)(cid:15)xx(cid:14)(cid:7)1/(cid:16)((cid:14)), (13) 0th.1u9s8P1GE(cid:2)(cid:2)0(cid:2).1i3(cid:2)501f2(T(is)ePei(cid:2)Fi0g.u8r8e932)(cid:10).W0.h1e4n53g(cid:9)en3e(cid:10)tr0e.e0s3T69o(cid:10)r with mean (cid:14)/(cid:15) and variance (cid:14)/(cid:15)2. The hyperparamet- 0 ers (cid:14) and (cid:15) are chosen to reflect the range and likely T areincorrectlyreconstructed,theestimatedgenetree 1 values of the parameters. will always be a mismatch with the species tree; such errors will cause an overcount of f(T)P (cid:9) f(T)P (cid:2) Instead of the coalescent times t0 and t1, which have 0 0 1 1 differentdefinitions indifferentgenetrees (Figure1), 0.1366.Conversely,whengenetreesT orT areincor- 2 3 branch lengths b and b in the gene tree are used in rectly reconstructed, the estimated tree will not be 0 1 the MCMC. The joint prior distributions of the gene counted as a mismatch one-half of the time, so the undercountisf(T)P/2(cid:9)f(T)P/2(cid:2)f(T)P (cid:2)0.0074. tree T0and its branchlengths b0 andb1 given (cid:11)can be 2 2 3 3 2 2 easily derived from the distributions of the coalescent Thedifferencebetweenthosetwoerrorratesgivesrise totheneterrorduetogenetreereconstruction:P (cid:7) times t0 and t1 (Equation 2), SE PSG (cid:2) f(T0)P0 (cid:2) (cid:6)P0 (cid:2) 0.1290. The above argument f(T0, b0, b1|(cid:11)) (cid:2) f(T0|(cid:11))f(b1|T0, (cid:11))f(b0|T0, b1, (cid:11)) suggeststhatignoringerrorsingenetreereconstruction 2 2 always causes overestimation of the mismatch between (cid:2) (cid:6) (cid:10) (cid:6)(cid:1)e(cid:7)2(b1(cid:7)(cid:8)1)/(cid:1)1 (cid:10) (cid:1)e(cid:7)2(b0(cid:9)b1(cid:7)(cid:8)0(cid:7)(cid:8)1)/(cid:1)0 thespeciestreeandthegenetreeandleadstooveresti- 1 0 mationoftheancestralpopulationsizeN1.Itisinterest- (cid:2) 4/((cid:1)0(cid:1)1) (cid:10) e(cid:7)2(b1(cid:7)(cid:8)1)/(cid:1)1(cid:7)2(b0(cid:9)b1(cid:7)(cid:8)0(cid:7)(cid:8)1)/(cid:1)0, ingthatthebiasinthetree-mismatchmethodiscaused (14) byreconstructionerrorsforgenetreeT aloneandthus 0 can be reduced if (cid:6) is reduced, for example, if the for (cid:8) (cid:4) b (cid:4) (cid:8) (cid:9) (cid:8) and (cid:8) (cid:9) (cid:8) (cid:7) b (cid:4) b (cid:4) ∞. 1 1 1 0 1 0 1 0 two speciation events are very close or if the ancestral Similarly,thejointpriordistributionofgenetreeT, k populationsizeN islarge.Obviouslyfactorsthatreduce k (cid:2) 1, 2, 3, and its branch lengths b and b given (cid:11) is 1 0 1 the reconstruction error P, such as longer sequences 0 f(T, b, b|(cid:11)) (cid:2) f(T|(cid:11))f(b|T, (cid:11))f(b|T, (cid:11)) (Figure 3) and higher mutation rates, will reduce the k 0 1 k 0 k 1 k bias as well. 1(cid:7)(cid:6) 2 6 (cid:2) 3 (cid:10) (cid:1)e(cid:7)2b0/(cid:1)0 (cid:10) (cid:1)e(cid:7)6(b1(cid:7)(cid:8)0(cid:7)(cid:8)1)/(cid:1)0 0 0 4 THE BAYES APPROACH USING MCMC (cid:2) (cid:1)2e(cid:7)2(cid:8)0/(cid:1)1(cid:7)[2b0(cid:9)6(b1(cid:7)(cid:8)0(cid:7)(cid:8)1)]/(cid:1)0, (15) A Bayes approach is implemented under the same 0 model, using MCMC. As parameters (cid:11) (cid:2) {(cid:1), (cid:1), (cid:8), (cid:8)} with 0 (cid:4) b (cid:4) ∞ and (cid:8) (cid:9) (cid:8) (cid:4) b (cid:4) ∞. 0 1 0 1 0 1 0 1 areallpositive,Iuseindependentgammadistributions The variables to be updated in the Markov chain as the prior. The gamma density is includetheparameters(cid:11)(cid:2){(cid:1),(cid:1),(cid:8),(cid:8)}andthegene 0 1 0 1 1818 Z.Yang TABLE3 PriorandposteriordistributionsofparametersintheBayesanalysis Prior Posterior Parameter ((cid:14),(cid:15))a Mean(95%interval)b Mean Mean(95%interval)b Goodpriors (cid:1) (N) (2,2,000) 12,500(1,500,34,800) 0.00158 19,700(2,900,41,600) 0 0 (cid:1) (N) Asabove 0.00100 12,400(1,700,32,100) 1 1 (cid:8) (timeinMY) (4,2,500) 1.6MY(0.44MY,3.51MY) 0.00164 1.6MY(0.7MY,2.7MY) 0 (cid:8) (timeinMY) (20,4,000) 5.0MY(3.1MY,7.4MY) 0.00530 5.3MY(4.4MY,6.1MY) 1 Poorpriors (cid:1) (N) (1.5,300) 62,500(4,500,194,800) 0.00263 32,900(5,300,57,800) 0 0 (cid:1) (N) Asabove 0.00265 33,100(5,300,88,600) 1 1 (cid:8) (timeinMY) (4,2,000) 2.0MY(0.55MY,4.4MY) 0.00190 1.9MY(0.8MY,3.2MY) 0 (cid:8) (timeinMY) (4,800) 5.0MY(1.3MY,11.0MY) 0.00464 4.6MY(3.4MY,5.7MY) 1 aParameters(cid:14)and(cid:15)areforthegammapriors;thepriormeanis(cid:14)/(cid:15)(notshown). bMeanand2.5and97.5%percentilesofthepriororposteriordistributionsforpopulationsizesorspeciation times. In converting (cid:1) and (cid:8) into N and speciation time, the generation time is assumed to be g (cid:2) 20 years andthemutationrate10(cid:7)9substitutionspersiteperyear. trees and branch lengths at all L loci, G (cid:2) {T, b , b }, Ifthenewstateisaccepted,thechainmovestothenew i i0 i1 i (cid:2) 1, 2, . . . , L. The Markov chain is constructed so state((cid:11)*,G*).Otherwisethechainstaysintheoldstate thatitssteady-statedistributionistheposteriordistribu- ((cid:11),G).Notethatf(D)inEquation16cancelsincalcula- tion of those variables. Bayes inference is then based tion of the acceptance ratio R. Calculation of f((cid:11)*, on the joint posterior distribution G*|D)/f((cid:11), G|D) is straightforward due to the condi- f(D|G)f(G|(cid:11))f((cid:11)) tional independence in the model as described above. f((cid:11),G|D)(cid:2) f(D) So the focus here is the proposal mechanism and the (cid:5)Li(cid:2)1P(Di|Ti,bi0,bi1)(cid:10)(cid:5)Li(cid:2)1f(Ti,bi0,bi1|(cid:11))(cid:10)f((cid:1)0)f((cid:1)1)f((cid:8)0)f((cid:8)1) proposal ratio q((cid:11), G|(cid:11)*, G*)/q((cid:11)*, G*|(cid:11), G). (cid:2) f(D) Theproposaldensityqcanberatherflexibleaslong as it specifies an aperiodic and irreducible Markov (16) chain. The algorithm I implemented cycles through The denominator f(D) is the marginal probability of several steps, with each step updating some but not all the data variables. In step 1, the gene tree and branch lengths f(D) (cid:2) (cid:1) (cid:1) f(D|G)f(G|(cid:11))f((cid:11))dGd(cid:11), (17) ateachlocusi(Ti,bi0,bi1)areupdated,whileparameters (cid:11)G (cid:11) are fixed. Each locus is updated once in this step. where the integral over G represents summation over Step 2 updates parameters (cid:11) while the branch lengths the four gene trees (T, T, T, T in Figure 1) and {b , b } are fixed. This step can cause changes to the 0 1 2 3 i0 i1 integrationoverbranchlengthsineachtree.Theposte- genetreesatsomeloci.Step3isamixingstep,inwhich riordistributionofanyparameteristhengivenbyinte- parameters (cid:1), (cid:1), (cid:8), (cid:8) and branch lengths at all loci 0 1 0 1 grating over the joint posterior distribution. For ex- aremultipliedbyaconstantwhilethegenetreesremain ample, unchanged. The MCMC algorithm is tedious and the f((cid:11)|D) (cid:2) (cid:1) f((cid:11), G|D)dG. (18) details are given in the appendix. G The Markov chain is started from a random initial A Metropolis-Hastings algorithm (Metropolis et al. state.Samplingstartsafteracertainnumberofgenera- 1953;Hasting1970)isusedtoupdatevariablesinthe tions, which are discarded as burn-in, and samples are MCMC. Given the current state of the chain ((cid:11), G), taken every certain number of steps, thus “thinning” a new state ((cid:11)*, G*) is proposed through a proposal thechain.Convergenceofthechainischeckedbyexam- distribution q((cid:11)*, G*|(cid:11), G). The new state is then ac- ining the output and also by running multiple chains. cepted with probability The algorithm is also run without sequence data, and theposteriordistributiongeneratedisfoundtobeclose (cid:6) f((cid:11)*,G*|D) q((cid:11),G|(cid:11)*,G*)(cid:7) R(cid:2)min1, (cid:10) to the prior. f((cid:11),G|D) q((cid:11)*,G*|(cid:11),G) Application to the data of Chen and Li: The Bayes (cid:6) f(D|G*)f(G*|(cid:11)*)f((cid:11)*) q((cid:11),G|(cid:11)*,G*)(cid:7) MCMC algorithm is applied to the data of Chen and (cid:2)min1, (cid:10) . (19) f(D|G)f(G|(cid:11))f((cid:11)) q((cid:11)*,G*|(cid:11),G) Li (2001; seeTable 1). I used twosets of priors (Table AncestralPopulationSizesinHominoids 1819 Figure4.—Priorandposterior distributionsforparameters(cid:1),(cid:1), 0 1 (cid:8),and(cid:8).Parameterestimatesare 0 1 showninTable3(goodpriors). 3).Parameters(cid:14)and(cid:15)inthegammapriordistributions the ancestor of humans and chimpanzees is estimated are chosen by considering likely values and ranges of to be 12,400, with the 95% credibility interval (CI) to ancestralpopulationsizesandspeciesdivergencetimes be (1700, 32,100). The H-C divergence is dated at and converting them into parameters (cid:1), (cid:1), (cid:8), and (cid:8) 5.3MY,withthe95%CItobe(4.4,6.1).Theestimates 0 1 0 1 usingagenerationtimeof20yearsandamutationrate of(cid:1) and(cid:8) areverysimilartotheMLEs.Theposterior 1 1 of 10(cid:7)9 substitutions per site per year. The first set is mean of (cid:1) is smaller and that of (cid:8) is larger than the 0 0 considered more realistic and referred to as the “good MLEs (Tables 2 and 3). The correlation coefficients priors”inTable3.AncestralpopulationsizesN andN calculated from the posterior distributions of parame- 0 1 are centered (cid:1)12,500, close to estimates for modern ters(cid:1),(cid:1),(cid:8),and(cid:8) areshowninTable4.Thereisstrong 0 1 0 1 humans,withthe2.5and97.5%percentilesat1500and negative correlation between (cid:1) and (cid:8) and between (cid:1) 0 0 1 34,800, respectively. The divergence time for humans and(cid:8).Comparisonofthepriorandposteriordistribu- 1 and chimpanzees is centered (cid:1)5 MY, while the diver- tions (Figure 4) suggests that the data contain much gence time for the gorilla is centered (cid:1)1.6 MY. Note moreinformationaboutthedivergencetimes,especially that parameters (cid:1), (cid:1), (cid:8), and (cid:8) are all (cid:17)1 but are theH-Cdivergencetime((cid:8)),thanaboutthepopulation 0 1 0 1 1 definitely (cid:13)0; thus values of (cid:14) (cid:13)1 are used so that the sizes. gamma distribution has a mode (cid:13)0. Toseetheeffectofpriorassumptionsontheposterior The posterior distributions of parameters (cid:1), (cid:1), (cid:8), distributions, I used a second set of priors, which are 0 1 0 and(cid:8) areshowninFigure4togetherwiththeirpriors. morespreadoutandalsoassumelargepopulationsizes 1 They are also summarized in Table 3 (good priors). (mean62,500).Theposteriordistributionsaresumma- The means of the posterior distributions for (cid:1), (cid:1), (cid:8), rized in Table 3 under the heading “Poor priors.” The 0 1 0 and (cid:8) are listed, and then the means and the 95% pointestimatesofbothN andN are(cid:1)33,000,smaller 1 0 1 credibilitysetsforthetwopopulationsizes(N andN) than the prior means. The H-C divergence is dated at 0 1 andforthetwospeciationtimesarelisted.Theposterior 4.6MY,andthegorilladivergenceisdated1.9MYear- means and medians are close. The population size for lier. Those estimates appear reasonable, although the 1820 Z.Yang TABLE4 and (cid:8) is not well understood, although Satta et al. 1 (2000) emphasized its possible significance. As the hu- Correlationcoefficientsintheposteriordistribution man,chimpanzee,andgorillasequencesareextremely (cid:1) (cid:1) (cid:8) similar, most of the recombination events will not be 0 1 0 visible in the sequence data, and the few sites at which (cid:1) 0.05 1 morethantwonucleotidesareobservedinthedata(see (cid:8) (cid:7)0.58 0.43 (cid:8)0 (cid:7)0.16 (cid:7)0.60 (cid:7)0.41 counts ni4 for site pattern xyz in Table 1) are probably 1 due to multiple substitutions at the same site. Third, the substitution model of Jukes and Cantor (1969) is simplistic. More complex models, such as those that H-Cdivergencedateistoorecent.Similarstrongcorre- account for variable substitution rates among sites lationsbetweentheparametersarediscoveredasinthe within the locus, can be easily implemented, but are analysisusingthegoodpriors.Thenegativecorrelation expected to have little effect. The most serious issue between (cid:8)1 and (cid:1)1 (calculated to be (cid:7)0.76), combined seemstobe mutationratevariationamong loci.Inthe withtheassumedandestimatedlargepopulationsizes, caseoftwospecies,theancestralpopulationsizeisover- appearstohaveledtoaH-Cdivergencedatethatistoo estimated when mutation rate variation is ignored and recent. accounting for the bias leads to dramatic reduction in the estimated population size (Yang 1997a). In this article, the population size of the ancestor of humans DISCUSSION and chimpanzees is not very large under the constant- ML and Bayes methods of this article estimated the ratemodelandbecomes largerwhenvariableratesfor population size for the common ancestor of humans lociareassumed.Theeffectismuchlessimportantand and chimpanzees to be (cid:1)12,000, similar to estimates also in the opposite direction compared with the two- for modern humans. The estimates are several times speciescase.Lackofstrongcorrelationamongsequence smaller than those obtained by Chen and Li (2001) distances with the orangutan seems to suggest that the from the same data using the tree-mismatch method, rates are relatively homogeneous among those loci. It which range from 52,000 to 150,000. Even the worst- seems that simultaneous analysis of data from three caseestimates—e.g.,36,000byMLundertheassumption species allows the parameters to constrain each other, thatallsequencedivergencevariationamonglociisdue leading to a better use of information in the data. It is to mutation rate variation and 33,000 from the Bayes quitelikelythattheestimationcanbefurtherimproved analysis using the poor priors—are smaller than the bysamplingmultipleindividualsfromthesamespecies. minimumestimateofChenandLi.Thetree-mismatch TheMLandBayesmethodsproducedsimilarresults method used by Chen and Li appears to have serious for the data analyzed in this article. However, the ML biases due to errors in gene tree reconstruction, and calculation is slower than the MCMC algorithm. The the likelihood and Bayes estimates reported here are Bayesapproachalsoprovidesaframeworkforincorpo- probablymorereliable.Thusitmaybeconcludedthat ratingpriorinformationabouttheparameters.Forex- thesequencedataofChenandLi(2001)donotsupport ample, a wealth of information is available about the much larger ancestral populations than the modern divergencetimebetweenhumansandchimpanzees.By humans or the notion that early human populations forcing a very narrow prior distribution for (cid:8), such 1 experienced dramatic size reductions (Hacia 2001; information can be incorporated in the Bayes analysis. Kaessmann et al. 2001). Usinganinformativepriorwillreducetheadverseeffect While the ML and Bayes methods are expected to of strong correlation among parameters when other have better statistical properties than the simple tree- parametersareestimated.Furthermore,theBayesalgo- mismatchmethod,itisworthwhiletoexaminesomeof rithmseemseasierthanMLtoextendtodatathatcon- theassumptionsmadeinthisarticle.First,theevolution- tainmorethanthreespeciesandmorethanoneindivid- ary rate is assumed to be constant over lineages. This ual from each species. assumption seems reasonable as the species compared Program availability: C programs implementing the are very closely related; Chen and Li’s (2001) relative- MCMCalgorithmandcalculatingthemismatchproba- ratetestssupportedthemolecularclock.Thelargedif- bilities(P ,P ,andP ,etc.)areavailablefromtheauthor SG SE GE ferences between the tree-mismatch method and the uponrequest.TheCandMathematicaprogramsforthe likelihoodandBayesmethodsareclearlynotduetothe likelihood method are available as well, but they make useoftheclockassumptioninthisarticle;useofclock use of the Mathlink library and are awkward to use. rooting in the tree-mismatch method produced even largerestimatesofthepopulationsizefortheancestor I am verygrateful to Drs. W.-H. Li andF.-C. Chen for providing thedataanalyzedinthisarticle.IthankM.HasegawaandB.Larget of humans and chimpanzees. Second, the analysis as- fordiscussionsandN.Takahataforcomments.Thisstudyissupported sumes no recombination within a locus. The effect of bygrant31/G13580fromtheBiotechnologyandBiologicalSciences recombination on estimation of parameters (cid:1)0, (cid:1)1, (cid:8)0, ResearchCouncil(UnitedKingdom).
Description: