Reversing Gene Erosion(cid:151)Reconstructing Ancestral Bacterial Genomes from Gene-Content and Order Data 1 2 1 JoelV.Earnest-DeYoung ,EmmanuelleLerat ,andBernardM.E.Moret 1 Dept.ofComputerScience,Univ.ofNewMexico,Albuquerque,NM87131,USA, joeled,[email protected] 2 Dept.ofEcologyandEvolutionaryBiology,Univ.ofArizona,Tucson,AZ85721,USA, [email protected] Abstract. Inthelastfewyears,ithasbecomeroutinetousegene-orderdatato reconstructphylogenies,bothintermsofedgedistances(parsimonioussequences of operations that transform one end point of the edge into the other) and in termsofgenomesatinternalnodes,onsmall,duplication-freegenomes.Current gene-order methodsbreakdown,though,whenthegenomescontainmorethan afewhundredgenes,possesshighcopynumbersofduplicatedgenes,orcreate edge lengths in the tree of over one hundred operations. We have constructed aseriesofheuristicsthatallowustoovercome theseobstacles and reconstruct edgesdistancesandgenomesatinternalnodesforgroupsoflarger,morecomplex genomes.Wepresentresultsfromtheanalysisofagroupofthirteenmodern(cid:13)- proteobacteria,aswellasfromsimulateddatasets. 1 Introduction Althoughphylogeny,theevolutionaryrelationshipsbetweenrelatedspeciesortaxa,is a fundamentalbuildingblockin muchof biology,it has beensurprisinglydif(cid:2)cultto automate the process of inferring these evolutionary relationships from modern data (usually molecular sequence data). These relationships include both the evolutionary distanceswithinagroupofspeciesandthegeneticformoftheircommonancestors. Inthelastdecade,anewformofmoleculardatahasbecomeavailable:gene-content andgene-orderdata;thesenewdatahaveprovedusefulinsheddinglightonthesere- lationships [1(cid:150)4].The orderandthe orientationin which genes lie ona chromosome changesveryslowly,inevolutionaryterms,andthustogetherprovidearichsourceof informationforreconstructingphylogenies.Until recently,however,algorithmsusing suchdatarequiredthatallgenomeshaveidenticalgenecontentwithnoduplications,re- strictingapplicationstoverysimplegenomes(suchasorganelles)orforcingresearchers toreducetheirdatabyequalizingthegenecontent(deletingallgenesnotpresentinev- erygenomeandall(cid:147)copies(cid:148)ofeachgene,e.g.,usingtheexemplar strategy[5]).The formerwas frustratingto biologistswantingtostudymorecomplexorganisms,while thelatterresultedindatalossandconsequentlossofaccuracyinreconstruction[6]. Ourgrouprecentlydevelopedamethodtocomputethedistancebetweentwonearly arbitrarygenomes[7,?]andanothertoreconstructphylogeniesbasedongene-content and gene-orderin the presence of mildly unequal gene content [6]. In this paper, we bringtogetherthesemethodsinaframeworkthatenablesustoreconstructthegenomes 2 Pasteurella multocida Haemophilus influenzae Yersinia pestis−CO92 Yersinia pestis−KIM Salmonella typhimurium Escherichia coli Wigglesworthia brevipalpis Buchnera aphidicola Vibrio cholerae Pseudomonas aeruginosa Xylella fastidiosa Xanthomonas axonopodis Xanthomonas campestris Fig.1.The13gamma-proteobacteriaandtheirreferencephylogeny[8]. ofthecommonancestorsofthe13modernbacteriashowninFig.1(from[8]).Gamma- proteobacteriaareanancient groupofbacteria,at least 500millionyearsold[9];the groupincludesendosymbiotic,commensal,andpathogenicspecies,withmanyspecies playinganimportantmedicaloreconomicrole.Theevolutionaryhistoryofthegroupis quitecomplex,includinghighlevelsofhorizontalgenetransfer[10(cid:150)12]and,inthecase ofB.aphidicolaandW.brevipalpis,massivelevelsofgeneloss.Thesefactorsmakea phylogeneticanalysisofthisgroupbothinterestingandchallenging. Therestofthispaperisorganizedasfollows.Section2presentstheproblem.Sec- tion 3 summarizes prior work on phylogeneticreconstructionfrom gene-content and gene-orderdata.Section4presentsourframeworkfortacklingtheproblemofancestral genomereconstructiongivenareferencephylogeny;itisitselfdividedintothreesub- sections,oneon eachofourthreemaintools: median-(cid:2)nding,contentdetermination, andgeneclustering.Section5discussesourapproachtothetestingofourframework: giventhatwehaveonlyonedatasetandthatancestralgenomesforthatdatasetareen- tirelyunknown,ourtestingwas ofnecessitybasedonsimulations.Section6 presents theresultsofthistesting. 2 The Problem Wephrasethereconstructionproblemintermsofaparsimonycriterion: Giventhegeneordersofagroupofgenomesandgivenarootedtreewiththese genomesat theleaves,(cid:2)ndgeneordersforthe internalnodesofthetreethat minimizethesumofalledgelengthsinthetree. Thelengthofanedgeisde(cid:2)nedintermsofthenumberofevolutionaryevents(permis- sibleoperations)neededtotransformthegenomeatoneendoftheedgeintothegenome attheotherend.Thepermissibleoperationsinourcaseareinversions,insertions(and duplications),anddeletions;alloperationsaregiventhesamecostincomputingedge lengths.Restrictingrearrangementstoinversionsfollowsfrom(cid:2)ndingsthattheinver- sion phylogenyis robust evenwhenotherrearrangements,such as transpositions,are 3 usedincreatingthedata[13].Ourassignmentofunitcoststoalloperationssimplyre- (cid:3)ectsinsuf(cid:2)cientbiologicalknowledgeabouttherelativefrequencyoftheseoperations. Inoursetting, one insertionmay addanarbitrarynumberof genesto a single lo- cationandonedeletionmayremoveacontiguousrunofgenesfromasinglelocation, a convention consistent with biological reality. Gene duplications are treated as spe- cialized insertionsthat onlyinsert repeats.Finally,oneach edgea genecan either be insertedordeleted,butnotboth;thesameholdsformultiplecopiesofthesamegene. Allowingdeletionandinsertionofthesamegenesonthesameedgewouldleadtobio- logicallyridiculousresultssuchasdeletingtheentiresourcegenomeandtheninserting theentiretargetgenomeinjusttwooperations. Finding internal labels that minimize edge distances over the tree has been ad- dressed byourgroup(cid:151)thisis the mainoptimizationperformedbyour softwaresuite GRAPPA [14]. However, even the most recent version of GRAPPA [6] is limited to relativelysmallgenomes(typicallyoforganellarsize,withfewerthan200genes),with modestly unequal content and just a few duplications. In stark contrast, the bacterial genomes in our dataset contain 3,430 different genes and range in size from 540 to 2,987genes,withsevencontainingover2,300genes;moreover,thesegenomescontain alargenumberofduplications,rangingfrom3%to30%ofthegenome.Thus,inour model,mostpairwisegenomicdistancesareverylarge:asimplepairwisecomparison alongthetreeofFig.1indicatesthatsomeedgesofthetreemustrepresentatleast300 events.SuchlengthsareatleastanorderofmagnitudelargerthanGRAPPAcanhandle. Thelargegenomesize,vastlyunequalgenecontent,largenumberofduplications,and largeedgelengthsallcombinetomakethisdatasetordersofmagnitudemoredif(cid:2)cult toanalyzethanpreviouslyanalyzedgenomesets. 3 PriorWork Athoroughrecentreviewofthecurrentworkinphylogeneticreconstructionbasedon genecontentandgeneorderappearsin[15];wereviewonlytherelevantpointshere. TheGRAPPAsoftwarepackage[16]computesinternallabelsintwophases.First, itinitializesinternallabelsofthetreebysomemethod.Thenititerativelyre(cid:2)neslabels until convergence:each newly labeled (or relabeled) node is pushed on a queue and, while the queue is not empty, the node at the head of the queue is removed, a new labelcomputedforit(bycomputingthemedianofitsthreeneighbors),and,ifthenew labelreducesthetotaldistancetothethreeneighbors,theexistinglabelisreplacedwith the improved label and the three neighbors are placed on the queue. Thus GRAPPA relies on the computation of the median of three genomes, that is, a fourth genome whichminimizesthesumofthenumberofoperationsneededtoconvertitintoeachof thethreegivengenomes.GRAPPA(cid:2)ndsoptimalinversionmedianswithanalgorithm that runs in worst-case exponential time, but (cid:2)nishes quickly when the edge lengths aresmall (10to40operationsperedge)[6,17].GRAPPA treats groupsofgenesthat occur in the same order and orientation in all genomes as a single genetic unit; this condensationstepreducescomputationalcostsanddoesnotaffectthe(cid:2)nalresult[18]. Ourgroupdevelopedamethodto(cid:2)ndthedistancebetweentwogenomeswithar- bitrarygenecontent[7,?];thismethodreliesonaduplication-renamingheuristicthat 4 matches multiple copies of genes between genomes and renames each pair and each unmatchedcopytoanew,unusedgenenumber.Thusarbitrarygenomesareconverted intoduplication-freegenomes.Weprovedthat,giventwogenomeswithunequalgene contentandnoduplications,anyoptimalsortingsequencecanberearrangedtocontain (cid:2)rstallinsertions,thenallinversions,and(cid:2)nallyalldeletions(cid:151)atypeofnormalform for edit sequences [7].(Deletions here are genes uniqueto the source genome,while insertionsaregenesfoundonlyinthetargetgenome.) Usingthegenomesproducedby the duplication-renamingmethod,an optimalinversionsequencecanbe calculatedin timequadraticinthesizeoftheconsensusgenomes[19,20].Thenumberofdeletions iscalculatedbycountingthenumberofHannenhalli-Pevznercyclesthatcontaindele- tions,asdescribedin[21].Finally,thenumberofinsertionsisestimatedbycalculating allpossiblepositionsinthesourcegenometowhichtheinversionsequencecouldmove insertions,thenchoosingthe(cid:2)nalpositionforeachinsertionthatminimizesthenumber ofgroupsofinsertedgenes. Insome genomes,especiallybacterial ones,geneswith similar functionareoften locatedtogetherononestrandofachromosome;thesefunctionalunitsarecalledoper- ons.Inbacteria,atleast, whiletheorderofgenesin anoperonmaychange,thegene contentoftheoperonis muchless likelytodoso[22].Ingene-orderdata,anoperon appearsas a cluster of genenumberswith thesame sign, with content,but notorder, preservedacrossgenomes.HeberandStoyedevelopedalinear-timecluster-(cid:2)ndingal- gorithmtoidentifytheseoperon-likeclusterswithinequal-contentgenomes[23]. McLysaght et al. [4] reconstructed ancestral genomes for a group of poxviruses; shedeterminedgenecontentbyassumingthatthephylogenetictreecontainedasingle pointoforiginforeachgenefamilyinthemoderngenomes.Eachpointoforiginwas assignedtothatinternalnodewhichminimizedthenumberoflosseventsnecessaryto achievethegenecontentoftheleafgenomes. 4 DesigninganAlgorithmicFramework Toaddresstheproblemofreconstructingancestralgenomesatthelevelofcomplexity ofgamma-proteobacteria,weusecondensationofgeneclustersinordertoreducethe sizeofthegenomes,describeaproceduresimilartothatofMcLysaghtetal.todeter- minethegenecontentofeveryinternalnode,andpresenta newheuristic to compute themedianofthreeverydifferentgenomes. 4.1 Medians Weusethequeue-basedtree-labelingheuristicdescribedinSection3.Sinceleavescon- taintheonlylabelsknowntobecorrect,weupdatethenodesinorderoftheirdistance fromtheleaves,as showninFig.2.Theheartofthetop-levelheuristicis themedian computation.Exactmedian-(cid:2)ndingalgorithmsarelimitedtosmallgenomes,smalledge lengthsinthetree,andfewchangesincontent(cid:151)andnoneofthesepropertiesholdsin ourproblem.Wethereforepursueasimpleheuristicinspiredbygeometry.Themedian ofatriangleintheplanecanbefoundbydrawingalinefromonevertextothemiddle oftheoppositesegment,thenmovingtwothirdsofthewayalongthisline.Byanalogy, 5 R 4 3 2 1 1 1 1 L L L L L L L L L Fig.2.Internalnodesorderedbytheirdistancefromtheleaves.Nodeswithlowerindiceswillbe labeled(cid:2)rst;nolabelisgeneratedfortheroot. wegenerateasortingsequencefromonegenometoanother(anedgeofthetriangle), thenchooseagenomehalfwayalongthissortingsequenceandgenerateanewsorting sequencefromittothethirdgenome,stoppingone-thirdalongtheway. We extend the method of Marron et al. [7] to enumerate all possible positions, orientations,andorderingsofgenesaftereachoperation.Deletedgenesattheendpoint ofaninversionaremovedtotheotherendpointifdoingsoavoids(cid:147)trapping(cid:148)thedeleted genes between two consensus genes that are adjacent in the target genome. Inserted genesaremovedsoastoremainadjacenttooneofthetwoconsensusgenesbetween which they lie in the target genome.We can thus generate the genomesproducedby (cid:147)running(cid:148)a portionofthe sortingsequence,thenuse these intermediategenomesfor themedianheuristicjustdescribed,allinpolynomialtime. Thishandlingofinsertedgenesleadstoanoverestimateoftheeditdistance,which Marron et al. showed at most doubles the number of operations [7]. Their original method calculates all possible positions in the source genome to which the inversion sequence could moveinsertions and chooses the (cid:2)nal position (for each insertion)to minimizethenumberofgroupsofinsertedgenes;itmayunderestimatetheeditdistance becausethegroupingofinsertedgenesmayrequireaninversiontojoininsertedgenes andsimultaneouslysplit deleted genes,which is not possible. We comparedpairwise distancesproducedbyourmethodandbytheirstogetanupperboundontheoveresti- mation:averageandmaximumdifferencesbetweentheoverestimateandunderestimate were11.3%and24.1%,respectively. 4.2 GeneContent Wepredeterminethegenecontentofeveryinternalnodebeforecomputinganymedian: oncethegenecontentofaninternalnodeisassigned,itremainsunchanged.Sincethe tree is rooted,we knowthe directionoftime (cid:3)owoneach treeedge;we also assume thatdeletionsarefarmorelikelythaninsertions,Thenumberofcopiesofeachgeneg isdecidedindependentlyofallothergenes;atinternalnodei,itissettothemaximum numberofcopiesofgfoundinanyoftheleavesini’ssubtreeif:(i)thereareleavesboth insideandoutsidei’ssubtreethatcontaing;or(ii)thereareleavescontaininggineach halfofi’ssubtree.Otherwisethenumberofcopiesofgeneginnodeiissettozero. ThisvaluecanbecalculatedinO(NG) time,whereN is thenumberofnodesin thetreeandGisthenumberofdistinctgenesinalltheleaves,asfollows.Foreachnode inthetree,wedeterminethemaximumnumberofcopiesofeachgenefromamongthe 6 0 0 2 0 1 0 0 0 2 1 0 Fig.3.Thenumberofcopiesofageneininternalnodes. leavesofthatnode’ssubtree,usingasingledepth-(cid:2)rsttraversal.Weuseaseconddepth- (cid:2)rsttraversaltosettheactualnumberofcopiesofeachgeneateachinternalnode.If eitheroftheroot’schildrenhasasubtreemaximumofzero,thenwesettheroot’sactual numbertozeroaswell.Foreachinternalnodeotherthantheroot,ifitsparent’sactual numberofcopiesiszeroandatleastoneofitstwochildren’ssubtreemaximumiszero, thenwesetthenumberofcopiesforthegenetozero;otherwisewesetthenumberof copiestothenode’ssubtreemaximumforthegene. Internalnodes will thus possess at least as manycopies ofa geneas the majority consensus of their neighbors’ gene contents. An internal node will always possess a copy of a gene if two or more of its neighbors do. (We consider the two children of the rootto be neighbors.) Moreover,if a nodeis the nearest commonancestorof all genomespossessingthegene,itmayhavemorecopiesofthegenethanitsparentand oneofitschildren,asinthecaseoftheblacknodeinFig.3.Thegenecontentofinter- mediategenomesalongsortingsequencesisaunionofthegenecontentsofthestarting genomes,becausethesortingsequenceofoperationsthatweusealwaysinvolves(cid:2)rst insertions,theninversions,and(cid:2)nallydeletions.Therefore,whencalculatingmedians fromsortingsequences,weface threecases in whichthenumberofcopiesofa gene differbetweentheintermediategenome,themediangenome,andthemedian’sparent(cid:151) see Fig.4. InFig.4a,the intermediategenomehas thesame numberofcopies as the median,butfewerthantheparent,aswiththeblacknode’srightchildinFig.3.Each copy in the parentthat is not matchedby the duplication-renamingalgorithmwill be excludedfromthe mediangenome.The case of Fig. 4b onlyarises whenthe median genome is the nearest common ancestor of all genomes containingthe gene in ques- tion,aswiththeblacknodeinFig.3.Genomesalongtheintermediatesequencehave thesamenumberofcopiesasthemedian,whiletheparentofthemediancontainsno >m 0 0. PSfragreplacements PSfragreplacements PSfragreplacements m m 0 m m (cid:20)m m m (cid:20)m 0 >0 >0 0 (a) (b) (c) Fig.4.Caseswherethemediananditsneighborshavedifferentnumbersofcopiesofagene.Solid 1 1 linesaretreeedges;dashedanddottedlinesarefractions( and ,resp.)ofsortingsequences. 2 3 7 copyatall.Finally,thecaseofFig.4ccanonlyarisewhentherightchildofthemedian isthenearestcommonancestorofallgenomescontainingthegene,aswiththeparent oftheblacknodeinFig.3. Biologically,thisprocessof(cid:2)ndingwhichduplicatestoincludeinthemediancor- responds to matching orthologous duplicates of each gene between genomes and to discarding unmatchedparalogousduplicates. Since the original nucleotidesequences are abstracted away beforethe analysis begins,this orthologmatchingis decided en- tirely on the basis of which other genes are located next to the different homologs. Fortunately, orthologs and paralogs that can be distinguished by a nucleotide-based analysisareassigneddifferentgenenumbersbeforeouranalysisbegins.Therefore,our methodrepresentsareasonablewaytointegratebothnucleotideandgene-orderdatain differentiatingorthologousandparalogoushomologsofgenes. 4.3 ClusterCondensation Toextractinformationfromlargerandmorecomplexbiologicaldatasets,weneedfast algorithmswithfastimplementations;fasterprocessingalsoenablesa morethorough analysisandthusproducesresultsofhigherquality.Thekeyfactorhereisthesizeof thegenomes(cid:151)theirnumberisamuchsmallerissue.Wethusdevelopedatechniqueto identifyandcondensegeneclustersinordertoreducethesizeofthegenomes.Ourap- proachgeneralizesthatusedingenomeswithequalcontent[23];incontrast,GRAPPA onlycondensesidenticalsubsequencesofgenes,becauseitaimstopreservetheidentity ofeditsequences.Ourmethodallowsthecondensationofclustersbasedonlyoncontent (notorder,atleastaslongasgenesstayonthesamestrand)andalsohandlesthedif(cid:2)cult casesthatariseoutofunequalgenecontent(suchasaninsertionwithinacluster). Toidentifyclusters,we(cid:2)rstusetheduplication-renamingtechniqueofMarronetal. tocreateduplication-freegenomes.Afterrenaming,weremoveanygenesnotpresent in all of the genomes under examination.This step creates a group of genomes with equalgenecontent.Wethenusethecluster-(cid:2)ndingalgorithmofHeberandStoye[23] to (cid:2)nd equivalent clusters of genes within the equal-content genomes. Once clusters are identi(cid:2)ed, each one is condensed out of the original genomes and replaced with a single marker (as if it were a single gene). In a set of genomes with unequal gene content, there can be genes inside a cluster that are not present in the corresponding equal-contentgenomes.Wedealwiththesegenesinoneoftwoways.Ifeveryoccur- rence of that gene is located inside the cluster in each of the genomes that possesses thegene,thenthegeneiscondensedalongwiththerestofthecluster.Otherwise,the extrageneismovedtoonesideoftheclusterandtheclustercondensed.Whena me- diangenomeiscomputed,amedianforeachclusterisalsocomputedandeachcluster’s marker in the median genome is replaced with the cluster’s median. At this point, if anyextragenesmovedtothesideoftheclusterarestillbesideit,theyaremovedback insidetheclustertoapositionsimilartotheiroriginalone. 4.4 PuttingItAllTogether Ancestral genomereconstructionsare performedusing these three main components. Initialization of the internal nodes of the tree is done from the leaves up by taking 8 eitherthemidpointoroneofthetwoendpoints(alongtheinversionportionofanedit sequence)ofaninternalnode’stwochildrenanddiscardinganygenesnotallowedby themediangenecontent.ThismethodaccountsforallthreeofthecasesinFig.4and produceslabelswiththedesiredgenecontent.Newmediansarecomputedlocallynode by node in a postorder traversal of the tree, so as to propagate information from the leaves towards the root. Whenever a median is found that reduces the local score at a node, it immediately replaces the previous label at that node; that node and all its neighborsarethenmarkedforfurtherupdate. 5 Testing Weusedourlabelreconstructionmethodonthebacterialdatasetaswellasonsimulated datasets. With simulated datasets, we know the true labels for the internal nodes as wellastheexactevolutionaryeventsalongeachedge,sothatwecantesttheaccuracy of the reconstruction. The goal of our experiments was to generate datasets roughly comparabletoourbiologicaldatasetsothatourexperimentalresultswouldenableus topredictarangeofaccuracyfortheresultsonthebiologicaldataset. ThesimulateddatawerecreatedusingthetreeofFig.1;edgelengthswereassigned tothetreebasedonourbestestimateoftheedgelengthsforthebacterialgenomes.To keepthedataconsistent,edgelengthswereinterpretedasthenumberofoperationsper generatherthanasanabsolutenumber,allowingustousethesamerelativevaluefor genomesof differentsizes. The tree was labeled by (cid:2)rst constructinga root genome. Thenumberofgenesgandthetotalsizenoftherootgenomeweresetasvariableuser parameters.Oneofeachgenefrom1tog wasaddedtotherootgenome,afterwhich n(cid:0)gadditionalgeneswerechosenuniformlyatrandomintherange[1;g]andadded totherootgenome.Therootgenomewasthenrandomlypermutedandeachgeneas- signedarandomsign.Theothernodeswerethenlabeledfromtherootbyevolvingthe genomesdownthetreeaccordingtotheprescribednumberofoperations.Theallowed operationswereinsertions,deletions,andinversions.Althoughthetotalnumberofop- erationswas(cid:2)xed,theproportionofeachofthethreetypesofoperationswasleftasa variableparameterbysettingtheratioofinversionstoinsertionstodeletions.Thismix ofoperationswasusedoveralledgesofthetree. Thecharacteristicsofeachtypeofoperationweredeterminedseparately.Thelength ofeachinversionwas chosenuniformlyat randombetween1andhalfthesize ofthe genome, with a start point chosen uniformly at random from the beginning of the genometo the size of the genomeminus the lengthof the inversion.The average in- sertionlengthwassetviaauserparameterasaportionofthesizeoftherootgenome andwasusedunchangedovertheentiretree,whiletheactuallengthofeachinsertion wasdrawnfromaPoissondistributionwiththisexpectationanditslocationwaschosen uniformlyatrandomfromthebeginningtotheendofthegenome.Inmovingfromthe roottotheleaves,itwasassumedthataparticulargenecouldonlybeinsertedalongone edgeofthetree(cid:151)multipleinsertionsofthesamegene,evenalongseparatepaths,were notallowed.Theaveragedeletionlengthwaschosenasauser-speci(cid:2)edportionofthe genomefromwhichgeneswerebeingdeleted,thusvaryingfromedgetoedgeaswell asalongeachedgewitheachsuccessivedeletion,whiletheactualsizeofeachdeletion 9 was drawnfroma Poisson distributionwith this expectationandwith a start location chosenuniformlyatrandomfromthebeginningofthegenometothesizeofthegenome minusthelengthofthedeletion.Withtheconstantexpectedinsertionlength,genomes growlinearlyintheabsenceofdeletions,while,withaproportionalexpecteddeletion length,genomesshrinkexponentiallyintheabsenceofinsertions.Whenbothinsertions anddeletionsareused,genomesfartherfromtheroottendtowardsastablesize.Along eachedge,theprescribednumberofinsertionsareperformed(cid:2)rst,theninversions,and (cid:2)nallydeletions.Onceall nodes havebeenbeenassignedgenomes,theresultingleaf genomesare fed into our reconstructionprocedure.The results of the reconstruction, intermsofgenecontentandgeneorderateachinternalnode,arecomparedwiththe (cid:147)true(cid:148)tree,i.e.,thatgeneratedinthesimulation. Weconstructedtreesusing(cid:2)vedifferentmodels:an(cid:147)inversion-only(cid:148)model,a(cid:147)no- deletions(cid:148) modelwith a 6:1inversion-to-insertionratio,a (cid:147)no-insertions(cid:148)modelwith a 6:1 inversion-to-deletionratio, a (cid:147)low-insertion/deletion(cid:148)model with a 40:4:1 ratio of inversions to deletions to insertions, and a (cid:147)high-insertion/deletion(cid:148) model with a 30:10:3ratioofinversionstodeletionstoinsertions.Theaverageinsertionlengthwas setto2%oftherootgenomeandtheaveragedeletionlengthto3%ofthelocalgenome. In order to test the ef(cid:2)cacy of cluster condensation, we tested the technique on triples amongthe bacterial genomesthat lie close to each otheron the tree in Fig. 1. Tripleswerechosenbyselectinginternalnodes,then,foreachofthethreeedgesleading outfromtheinternalnode,bychoosinganearbyleafreachablebyfollowingtheedge. Foreachsetofthreegenomes,wemeasuredthesumofthelengthsofallclustersthat werefound. 6 Results Our discussion and summaries of results refer to Fig. 5. Reconstruction of ancestral genomesforthebacterialgenomestakesaround24hoursonatypicaldesktopcomputer. Themidpoint-initializationprovedquitestrong:theonlygenomestobeupdatedinthe 14 6 Pasteurella multocida 7 15 Haemophilus influenzae 16 10 Yersinia pestis−CO92 4 8 17 3 8 Yersinia pestis−KIM 5 18 11 Salmonella typhimurium 7 9 19 2 4 Escherichia coli 2 20 9 Wigglesworthia brevipalpis 10 21 1 Buchnera aphidicola 5 Vibrio cholerae 3 1 Pseudomonas aeruginosa 12 Xylella fastidiosa 6 13 22 Xanthomonas axonopodis 11 23 Xanthomonas campestris Fig.5.Thebacterialtreewithnumberededgesandinternalnodes. 10 Table1.ErrorPercentageinTreeScores AvgerrorMinerrorMaxerror Inversiononly 63.2% 57.3% 67.4% Nodeletions 62.6% 54.8% 70.7% Noinsertions 45.2% 37.6% 54.3% Lowinsertion/deletion 56.4% 46.7% 64.8% Highinsertion/deletion 34.9% 25.1% 46.4% subsequentlocalimprovementprocedurewerethetwochildrenoftheroot(nodes1and 6inFig.5),theonlyneighboringgenomesinwhichoneneighborwasnotusedtocre- atetheother.Whenweusedendpoint-initialization,threeinternalnodeswereupdated (nodes3,4,and6inFig.5)andthescoreoftheentiretreewasloweredby2.8%.This (cid:2)ndingmayindicatethattheinitializationisverygood,butitmayalsore(cid:3)ectthelarge numbersoflocaloptimainthesearchspace(cid:151)asimilar(cid:2)ndingwasreportedforthesim- plerGRAPPA[18].Itshouldbenotedthat,whencalculatingmedians,onlyfourdiffer- entmidpointsinthechild-to-childsortingsequenceareused;fromeachofthesemid- points,onlythreemidpointsinthesortingsequencefromtheintermediategenometo theparentaretested.Thusweonlyperformaveryshallowsearchandcouldeasilymiss a better solution. Interestingly,though,when we did a slightly more thoroughsearch withtenmidpointsfromchildtochildandfourmidpointsfromintermediatetoparent, using endpoint-initialization, the tree score was slightly worse than in the shallower analysis,althoughthesearch,whichtookabout3.5timeslonger,updatedthesamethree internalnodes.Ofcourse,thislargersearchremainsveryshallow;goingbeyonditwill requireamuchmoreef(cid:2)cientimplementationoftheduplication-renamingheuristicof Marronetal.[7](cid:151)inourcurrentversion,itusesupover90%ofthecomputingtime. We simulated 100 labelings of the tree with a root genomesize of 200 genes for eachofthe(cid:2)vepreviouslydescribedscenarios.Endpoint-initializationwasusedinall scenarios.Theleafgenomesproducedinoursimulationsrangedinsizefrom70genes to 400 genes.We comparedthepredictedgenecontentof the internalnodes with the actual gene content.As expected(due to our restriction on generation),the predicted genecontentalwaysmatched,exceptwhenagenecopythatwaspresentataninternal nodewaslostinallleaves.Failuretodetectthiskindofmissinggeneisunavoidablein ananalysissincethedeletionfromallleavesmeansthatnohistoricalrecordis leftto attestthepresenceofthatgeneinancestralgenomes.Whenwecomparedthenumber of operations over all edges in reconstructed trees versus the original simulated tree, thescoreforthetreewasfairlyinaccurate,consistentlyoverestimatingthetruescore, asillustratedinTable1.Therathertightdistributionforthetreeindicatesthattheerror isnotarandomprocess,butaresultofsomeaspectofourreconstructionmethod,one thatmaylenditselftoreversemapping. We compared edge lengths in the reconstructed trees with those in the true trees bycalculatingtheratioofthelengthsforeachedge(Fig. 6).Aperfectreconstruction wouldgivearatioof1.0;asthe(cid:2)gureshows,mostratiosarehigher,withedgesfurther fromleaveshavinglargerratios(andalsolargervariances).Abouthalfofthe23edges arewithinafactoroftwoandanotherquarterarewithinafactoroffour.
Description: