8 Comparative Genomics

Asli Ismihan Ozen · Tammi Vesth · David W. Ussery
Department of Systems Biology, Center for Biological Sequence Analysis, Kemitorvet, The Technical University of Denmark, Lyngby, Denmark

E. Rosenberg et al. (eds.), The Prokaryotes – Prokaryotic Biology and Symbiotic Associations, DOI 10.1007/978-3-642-30194-0_11, © Springer-Verlag Berlin Heidelberg 2013

Contents
Prokaryotic Classification
Current Taxonomy of Prokaryotes
The Explosion of Sequenced Bacterial Genomes
Statistics on Prokaryotic Genomes
Data Growth over the Years
Taxonomy Analysis, Most Sequenced Phyla and Genera
Basic Genome Statistics
Thousands of Genome Sequences
Whole-Genome-Based Tools for Taxonomy
rRNA Phylogenetic Trees
Average Nucleotide Identities (ANI) and Tetra Nucleotide Frequency Calculations
BLAST Matrix Using Reciprocal Best Hits
Composition Vector Trees (CVTree)
Pan-genome Trees
Summary

Prokaryotic Classification

Classification covers the theory and practice of how to order characterized organisms into different groups based on their degree of relatedness. Together with identification and nomenclature, classification is a part of taxonomy, a science that deals with the relatedness of organisms. The goal of many taxonomists is to have a classification system that reflects the natural relationships among organisms. This natural system has been depicted mostly as a phylogeny (Doolittle 1999), or evolutionary tree, which is a diagram that shows ancestor-descendant relationships of organisms based on their evolutionary history. However, inferring a true phylogeny for prokaryotic organisms is very challenging due to the diversity of these organisms, as well as frequent horizontal transfer of genes.

Prokaryotes, known as unicellular organisms with no nuclear membrane structure, have a history of more than 3.5 billion years on earth, yet humans have been aware of them for only the past few centuries, after they were first described by Robert Hooke in the seventeenth century. Louis Pasteur and other scientists of the nineteenth century described microorganisms in detail and began to categorize them. Their classification was dependent on the development of microbial techniques such as isolating and growing microorganisms in pure cultures, staining, and microscope observations. In these early observations, the lack of guidelines on naming inevitably led to a vast number of invalid names and synonyms. Ferdinand Cohn made the first classification system of bacteria in 1872; six genera of bacteria were classified based on their shape, cellular structures, pigmentation, and metabolic activities (Cohn 1872).

At the beginning of the twentieth century, physiological and biochemical information could be incorporated in addition to morphology. European scientists also proposed physiology, metabolism, pigments, and pathogenicity as new systems for classification. However, some of these methods were then criticized as not being important for assessing taxonomic ranks. Later, advances in biochemistry and molecular biology, from the isolation of nucleic acids to the elucidation of the macromolecular structure of proteins and nucleic acids, led to the foundation of the genomic sciences. The development of computers in the 1950s was another important step in bacterial taxonomy, where they were first used for the analysis of phenotypic and molecular data. Between 1960 and 1980, numerical taxonomy and chemotaxonomy were on the rise (Stackebrandt 2006; Schleifer 2009).

In the late 1950s, scientists were able to identify molecules conserved throughout the history of life, such as proteins, DNA, or RNA molecules. The idea of using these molecules as blueprints of the evolutionary history of organisms emerged in the 1960s (Zuckerkandl et al. 1962). Tertiary structure and sequence analysis of molecules such as cytochrome C, ferredoxins, and fibrinopeptides, as well as immunological approaches, were used afterward. However, the interest in these methods decreased as rapid sequencing techniques for DNA became more significant.

The first genotypic approach that allowed bacteriologists to classify prokaryotes on the basis of their phylogenetic relatedness was DNA-DNA hybridization (DDH) (Wayne et al. 1987). In the following years, more genotypic studies, including comparative analysis of ribosomal RNA (ribonucleic acid) genes and protein-coding gene sequences, gave more insight into the relationships of prokaryotes (Schleifer 2009). The small subunit rRNA (16S rRNA in prokaryotes) was shown to be one of the universally conserved molecules and became the primary molecule of interest. Its ubiquity, functional consistency, genetic stability, appropriate size, and independently evolving domains caused this molecule to be chosen for phylogenetic analysis, and this approach became a classical tool for taxonomy (Harayama and Kasai 2006). An important study by Carl Woese revolutionized bacterial taxonomy, proposing the new kingdom of Archaebacteria (Woese and Fox 1977). His later studies concluded in a phylogenetic scheme of three main branches of life (Bacteria, Archaea, and Eukarya) that he called Domains (Woese et al. 1990).

In other genotypic classifications, many protein-coding genes were used for phylogenetic relatedness, among them recA, gyrB, genes of some chaperonins, RNA polymerase subunits (e.g., rpoB) or sigma factors (rpoD), and elongation factor G (fus). The most accepted criteria for the selection of these proteins are that they should not be subject to horizontal gene transfer (HGT), should be present in all bacteria, preferably in single copies, and should have at least two highly conserved regions for the design of PCR primers (Yamamoto and Harayama 1996). These properties make them more appropriate than 16S rRNA analysis for phylogenetic analysis of closely related bacteria.

In addition to the single-gene-based methods, MultiLocus Sequence Typing (MLST) has been widely used for genotypic characterization and classification of prokaryotes by comparing multiple housekeeping gene sequences (Maiden et al. 1998). However, a different set of genes is usually useful for each set of organisms, and difficulties occur in designing primers that amplify the genes in all strains if the analysis is not conducted entirely in silico. A widely used website and database currently is mlst.net (Aanensen and Spratt 2005).

Current Taxonomy of Prokaryotes

Classification is done by comparing a newly identified organism with the collection of previously classified organisms and then assigning it to a previously described or new species. If a bacterial species is considered novel, the proper naming of the new or existing taxa is made by nomenclature that is based on the International Code of Nomenclature of Bacteria (Lapage et al. 1992), also known as the Bacteriological Code. Nomenclature is, however, subject to change because classification is a dynamic process. Names for novel prokaryotic taxa are published in the International Journal of Systematic and Evolutionary Microbiology (IJSEM), which is the official journal for this purpose. IJSEM also publishes "Validation Lists," which contain new names published in other journals (Tindall et al. 2006). An updated list of approved names for microorganisms based on the international rules can also be found at the DSMZ (Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH, German Collection of Microorganisms and Cell Cultures) depository (http://www.dsmz.de).

In taxonomy, groups of organisms that are brought together based on shared properties are called "taxa" or "ranks," and prokaryotic taxonomy makes use of several ranks or levels. The current classification scheme has a hierarchical structure, where the higher taxonomic ranks consist of the lower ranked groups. In other words, higher taxa (e.g., genus) contain lower taxa (e.g., species). In an ideal classification system based on evolutionary history, clear clusters of taxonomic units are seen in a phylogenetic tree, such that species that share a common ancestor form a genus, genera that share a common ancestor form a family, and so forth. Major sources for bacterial names and taxonomical order are Bergey's Manual of Systematic Bacteriology (Brenner et al. 2005a), Bergey's Taxonomic Outlines (http://www.bergeys.org/outlines.html), and the comprehensive list available at The Taxonomic Outline of Bacteria and Archaea (TOBA) journal (Garrity et al. 2007).

Until molecular approaches and sequencing technologies were developed, taxonomy tools were mainly based on laborious laboratory experiments that characterized bacteria by their phenotypic and biochemical properties. Today, such research can be handled using robotic and computational techniques, where most of the knowledge gained from the results relies on the data being handled.

The Explosion of Sequenced Bacterial Genomes

Biological data generated by researchers worldwide has been growing at a tremendous rate, especially with the advances in molecular biology techniques in the past 50 years. Much of this vast information can now be accessed through biological databases that hold records for experimental data, sequence data, classification schemes, and literature; some also provide computational analysis tools.

A part of this huge body of biological information is genomic sequence. In modern molecular biology and genetics, a "genome" is the entirety of an organism's hereditary information. Therefore, genomics can be referred to as the science of genome analysis. As such, the field of comparative microbial genomics (CMG) works with comparing the entire DNA material of a microbial organism to that of other organisms. The first two complete bacterial genome sequences were published in 1995. As the technologies advanced and the sequencing cost went down, many more sequences were published and more databases were established to handle this information. One of the most used databases is based on GenBank, now located as part of the National Center for Biotechnology Information, NCBI (http://www.ncbi.nlm.nih.gov/genome/browse/). NCBI GenBank holds nucleotide sequence data from expressed sequence tags (ESTs), genome survey sequences, and other high-throughput sequences such as whole-genome sequences, together with genome annotations for thousands of organisms. Both prokaryotic and eukaryotic data are available (Benson et al. 2008). GenBank is a part of an international collaboration called the International Sequence Database Collaboration, which also consists of the DNA DataBase of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). Another part of NCBI that is highly related to this chapter is NCBI Taxonomy. Although it does not claim to be a primary source, NCBI provides taxonomical information that is gathered from various sources.

Fig. 8.1 Genomes published and deposited to public NCBI GenBank since 1995, plotted as number of genomes per year from 1995 to 2012 (data gathered from NCBI (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, Jan. 2012))

In May 2011, NCBI GenBank contained around 1,500 genome sequences labeled as "finished." Six months later, at the time of writing (November 2011), this number had gone up to 1,790. Currently (Feb. 2012), the NCBI "Genome Projects" is changing to "BioProjects," in order to relate genomic information to other data types, such as the transcriptome, proteome, and metagenome.

In addition to NCBI (GenBank) and EMBL (Nucleotide Sequence Database), a source for genomic information is the Genomes Online database (GOLD). GOLD aims to provide an accurate and complete set of finished and ongoing genome projects with a broad range of information on each project. The sequence data itself is not stored in the database; instead, external links to where the data can be found are given, most of which point to the NCBI Genome Project pages. GOLD also provides taxonomical information, though it is not the primary source (Bernal et al. 2001; Liolios et al. 2010).

One part of comparative microbial genomics is to monitor the available microbial genomic data. Even though the sequences available may only be a small fraction of the real world, the information gathered is growing every day. It took 14 years to sequence the first thousand bacterial genomes (1995–2009), and already in 2012, less than 3 years later, the two thousandth genome sequence has been deposited to GenBank. Not only has the cost of genome sequencing decreased dramatically, but the time and effort put into the task have also gone down. The computational power and software to handle sequencing data are also being revolutionized, and fast assembly and interpretation are increasing the number of published genomes (Ansorge 2009). The increase in genome data has given rise to a whole new area of problems when it comes to publication and sharing of data. Databases usually have their own formatting of the raw data, and though some are more used than others, a clear standard for genome publication has yet to be established (Médigue and Moszer 2007).

Statistics on Prokaryotic Genomes

With such a large amount of data, it is interesting to see the trends in the basic statistics of the genomic data and the comparisons across different taxonomic levels and years of sequence publication. The data presented in this section is taken from the NCBI complete genomes list in Jan. 2012, and GenBank files for 1,500 sequenced genomes were downloaded (November 2011).

Data Growth over the Years

Figure 8.1 illustrates how many genomes were published each year since 1995. The first two complete genomes to be sequenced and deposited were Mycoplasma genitalium G37 (Fraser et al. 1995) and Haemophilus influenzae Rd KW20 (Fleischmann et al. 1995). From 1995 until 1999, only 25 genomes were published as complete, and they covered 14 phyla. Of these, the archaeal genomes constitute a large portion compared to the fraction today (around 31% of the 25, compared to 99 out of 1,500 (6.6%)). The genomes from this first period of genome sequencing cover a large span of the microbial landscape with no obvious medical bias. From 2000 to 2010 the number steadily increased from 26 to 1,423, with the major phyla covered being Firmicutes and the Gamma and Alpha subdivisions of Proteobacteria. It is possible that producing a complete genome sequence is becoming less popular due to the improvement in software that can work on draft genomes (Chain et al. 2009).
Fig. 8.2 Number of genome sequences from each phylum (panel a) and genus (panel b); the vertical axis is the count of occurrence, genomes in progress. Only genera with more than 10 representative genomes are shown

The cost of sequencing and the development of cheaper sequencing methods have most definitely had an impact on the rate of sequencing (Sboner et al. 2011).

Taxonomy Analysis, Most Sequenced Phyla and Genera

Figure 8.2a shows the number of genomes within each phylum; Firmicutes and Gamma Proteobacteria are the most highly represented. The plot in Fig. 8.2b shows genera with more than 10 sequenced genomes. The genus Streptococcus is highly overrepresented (63 genomes), while the closest other group is Escherichia (45 genomes). According to supporting data from the GOLD database (http://genomesonline.org, March 2011), over 73% of the listed Streptococcus and more than 64% of Escherichia are labeled as pathogens.

It is likely that some organisms are sequenced because of their medical relevance. Organisms belonging to the Escherichia genus have a considerable role within the medical world, with Escherichia coli being the cause of serious food poisoning. The tendency of pathogens to be sequenced more often is, however, not as strong overall. The fraction of pathogens within each genus varies from 7% (Lactobacillus) to 98% (Listeria) for the six different genera belonging to the Firmicutes. For all the genera listed in Fig. 8.2b, the range covers everything from 3% (Synechococcus) to 100% (Borrelia and Rickettsia; supporting data from GOLD). It should be noted that the annotation of "pathogen" is not always accurate; for example, the first sequenced genome, Haemophilus influenzae, is listed as a "pathogen," although the strain sequenced (Rd KW20) is a nonpathogenic strain.

In any event, it is clear that there is a strong sequencing bias, making considerably more data available for certain phyla and genera than for others. It could be expected that more pathogens than nonpathogens would be sequenced, as the immediate interest in these organisms is larger. However, the bias is not directly linked to pathogenicity, as some genera are sequenced more often without being serious pathogens. For example, species like Escherichia coli (urinary tract infections, simple diarrhea, dysentery-like conditions) include pathogens, but these are not as severe as other species like Borrelia (Lyme disease) or Listeria (listeriosis in newborn infants, elderly patients, and the immunocompromised). Escherichia coli is, however, a significant player in the financial aspect of medical relevance. These organisms, though rarely lethal, can occur frequently in the population and still require treatment; this is a burden on any healthcare system. Another factor could be the economic cost, as some organisms grow less easily or replicate very slowly, making experiments long and expensive. Some pathogens require extreme safety procedures when cultured, and this consumes time, space, and money. The historic factor could also be partly responsible for sequencing bias. Some organisms became model organisms at an early stage of microbiology, and as such, many procedures are optimized for these organisms. Unfortunately, due to the large variety within the microbial world, many organisms will not respond well to procedures developed with Escherichia coli as the template. Taxonomical bias in sequencing data is, as stated, a complex and multifaceted discussion that will probably never end. However, these tendencies should be kept in mind when accessing the data available for analysis.

Basic Genome Statistics

The sequences of 1,500 genomes have been obtained from NCBI GenBank and analyzed according to basic statistical parameters. Here, basic genome statistics refers to certain DNA properties of the genome, such as genome size, frequencies of A and T bases in the DNA, and bias on the third codon positions for the open reading frames.

Figure 8.3a is a box and whisker plot showing the variation of genome sizes within each phylum. As seen, several phyla have a wide distribution of sizes. Phyla containing only a few genomes (fewer than five sequences) show very little size variation, which could be the result of sequencing several closely related strains. For most phyla size is not a key feature, although Chlamydia and Nanoarchaeota are expected to have small genomes. The genomes within the Firmicutes are distributed over a broad spectrum (580–8,300 kb), while Gamma Proteobacteria genome size varies between 32 and 7,215 kb. Large genomes are often seen within Planctomycetes, Beta Proteobacteria, and Actinobacteria, while small genomes are found within Epsilon Proteobacteria, Chlamydiae/Verrucomicrobia, and Nanoarchaeota. Of the largest genomes (more than 8,000 kb, 32 genomes), members of Actinobacteria are prevalent (14 genomes), ranging from 925 kb to 11,937 kb. The largest genome, as of May 2011, at 13,033.779 kb, was Sorangium cellulosum So ce 56 (soil-dwelling bacteria) (Schneiker et al. 2007).

An interesting perspective on genome size is the focus on the minimal genome for a free-living organism. Defining the minimal genome is a science in itself and has been heavily discussed in the scientific community (Galperin 2006). In 1995, the genome of Mycoplasma genitalium (a parasite) was published, and at that time this was thought to be the smallest genome of any free-living organism (Fraser et al. 1995). Of the 1,500 genomes in this study, M. genitalium is the 18th smallest genome. Upon closer inspection, the eight smallest "genomes" are described as phage, integrating and conjugative elements (ICEs), pathogenicity island, or genomic island, so they are not free-living organisms. These nongenomic sequences had been reported to GenBank and have since been deleted from the list of genomes during this work. Other genomes smaller than M. genitalium consist of Buchnera (an endosymbiont; Pérez-Brocal et al. 2006) and Nanoarchaeum (another symbiont; Waters et al. 2003). The remaining seven genomes are Candidatus species, from proposed genera, and all of these are described as symbionts (McCutcheon et al. 2009). It is worth mentioning that the smallest genome of a "true" free-living organism (as opposed to parasites like M. genitalium) is considerably larger, containing more than a thousand protein-encoding genes. Two proposed "minimal free-living organisms" are Pelagibacter ubique (heterotroph, 133rd smallest in this study; DeWall and Cheng 2011) and Prochlorococcus marinus (autotroph, 209th smallest in this study; Moya et al. 2009). Note that these genomes are still smaller than the largest viral genomes (Arslan et al. 2011).

Another commonly used genome statistic is the percentage of AT (Fig. 8.3b), which is calculated as the average AT content of all the DNA sequence. Genomes with high AT content include Candidatus Zinderia insecticola CARI (86%, Beta Proteobacteria), Candidatus Carsonella ruddii PV DNA (83%, Gamma Proteobacteria), and Buchnera aphidicola str. Cc (Cinara cedri; 80%, Gamma Proteobacteria). These are all extremely small genomes. They are also all symbiotic organisms living inside insects: the spittlebug Clastoptera arizonana, jumping plant lice, and plant lice, respectively. Genomes with low AT (high GC) content include Anaeromyxobacter dehalogenans (Delta Proteobacteria; Sanford et al. 2002) and Cellulomonas flavigena (Actinobacteria; Abt et al. 2010), both with around 25% AT. These are an anaerobic and an aerobic soil bacterium, respectively. The AT content within each phylum shows some specificity, with phyla like Acidobacteria, Actinobacteria, and Deinococcus/Thermus having a significant skew toward low AT, and Fusobacteria, Epsilon Proteobacteria, and Aquificae having a skew toward high AT (Fig. 8.3b).

Fig. 8.3 Box plots showing the distribution of genome size (in kilobase-pairs, panel a) and AT content (in percentage, panel b) for each phylum (as described by NCBI Taxonomy). The middle bar is the 50% percentile, the bottom and top of the box are the 25% and 75% percentiles (Q1 and Q3, respectively). Whisker bars extend to the most extreme data point which is no more than ±1.58 IQR/sqrt(n), where IQR is the interquartile range (IQR = Q3 − Q1). Any data point that exceeds this limit is plotted as an individual data point (outlier). The genome size was calculated as the sum of lengths of all contigs

Can the AT content of an organism be related to its size? The answer can be both "yes" and "no." The numbers from 1,500 genomes show that these two properties are not always proportional to each other. However, for very large and very small genomes, the answer can be "yes." Figure 8.4 shows a scatterplot of genome size and AT content (Pearson correlation coefficient of −0.48), showing that small genomes have high AT content and large genomes have low AT content. The analysis also shows a cloud around the middle values, indicating that genomes of average size have AT contents that fluctuate widely.

On the other hand, an interesting relation is seen between AT content and bias in the third codon position. Figure 8.5 illustrates a strong correlation between these two properties of a genome (Pearson correlation coefficient of −0.94). The third codon position is the most variable position of the codon, and this is where the largest variation in base use would be expected. The correlation was therefore expected and shows that high AT content in a genome correlates with a bias close to −1 (which is 100% AT in the third codon position) and low AT content correlates with a bias close to +1 (which is 100% GC in the third codon position).
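The two statistics discussed here, percent AT and third-codon-position bias, together with the Pearson correlation coefficient used to compare them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chapter does not give the exact bias formula, so the sketch assumes (GC3 − AT3) / (GC3 + AT3), which reproduces the described end points of −1 and +1. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
import math


def at_percent(seq):
    """Return the percentage of A and T among the A/C/G/T bases of seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * at / (at + gc)


def third_position_bias(orfs):
    """Return (GC3 - AT3) / (GC3 + AT3) over all third codon positions.

    Assumed formula (not stated in the chapter): it yields -1 when every
    third position is A or T and +1 when every third position is G or C.
    """
    at3 = gc3 = 0
    for orf in orfs:
        orf = orf.upper()
        complete = len(orf) - len(orf) % 3  # ignore a trailing partial codon
        for i in range(2, complete, 3):     # indices of third codon positions
            if orf[i] in "AT":
                at3 += 1
            elif orf[i] in "GC":
                gc3 += 1
    if at3 + gc3 == 0:
        raise ValueError("no usable third codon positions")
    return (gc3 - at3) / (gc3 + at3)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy open reading frames, invented for illustration only.
    orfs = ["ATGAAATTT", "ATGGCCGCG"]
    print(at_percent("".join(orfs)))
    print(third_position_bias(orfs))
```

In a real analysis the open reading frames would come from the genome annotation, and one (percent AT, bias) pair per genome would be passed to `pearson`.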
Fig. 8.4 Scatterplot showing percentage AT compared to total genome size (kb) for 1,500 genome sequences, with points grouped by phylum. The Pearson Correlation Coefficient (PCC) for this data is −0.48, which shows a medium correlation. PCC is often used to measure the linear dependence between two variables, and takes a value between +1 and −1, where 0 reflects no linear correlation

Fig. 8.5 Scatterplot showing percentage AT compared to base bias in third codon position for 1,500 genome sequences, with points grouped by phylum. Bias is calculated so that 100% A or T in third codon position gives a score of −1, 100% G or C in third position gives a score of +1. The Pearson Correlation Coefficient for this data is −0.94, which shows a strong correlation. PCC is often used to measure the linear dependence between two variables, and takes a value between +1 and −1, where 0 reflects no linear correlation

Thousands of Genome Sequences

The availability of thousands of genomes makes it possible to investigate phylogenies based on genomic information and see how current taxonomy is affected. The development of many computational tools and increasing computational power make it possible to compare whole genomes in a reasonable time, yet comparison of thousands of whole genomes is still a tedious process. Therefore, three data sets were selected that represent different taxonomic levels of prokaryotes. The first data set is chosen to give wide coverage of all the prokaryotic organisms (126 genomes and 23 phyla). The second data set is representative of a well-defined prokaryotic family (the Enterobacteriaceae family, 50 genomes). The third one is chosen as an example of a prokaryotic species and its close relatives (Escherichia coli, Escherichia fergusonii, Shigella). Different computational methods that we have found to fit the current taxonomy of prokaryotes are shown for each data set in the following sections.

Whole-Genome-Based Tools for Taxonomy

The previous section showed the growth in available sequence data as well as the bias and diversity in this data. This large diversity and coverage opens the doors to large-scale phylogenetic analysis of genome sequences. As a result, great insight into bacterial evolution and diversity has come from the comparison of many microbial genome sequences in the last decade. The differences, even between strains of a distinct taxonomic cluster, show that bacteria represent a great diversity, which led to the formation of the hypothetical concepts of "pan-genomes" and "core-genomes." The pan-genome contains the total number of genes found in the gene pool of a set of genomes (Ussery et al. 2009) and can be viewed in three separate parts. The first part consists of conserved essential genes common to all genomes compared (the core-genome). It has been seen that core-genomes of phylogenetically coherent groups contain genes that are less prone to horizontal gene transfer and are more stable, such as housekeeping genes. The genes essential for colonization, survival, or adaptation to a specific environment are thought to form the lifestyle genes, which can also be named the "shell," for frequently occurring genes. The third part is called "accessory" or "cloud" genes, as these are rarely found and are often strain specific and nonessential (Lapierre and Gogarten 2009). Though hypothetical, these terms can be useful for defining and classifying bacteria. These different "genomes" can be used to explain the differences and similarities between species or genera, and can be visualized by pan-genome trees (Snipen and Ussery 2010). An elaborate work on the comparison of genomic DNA using oligonucleotide-based methods, and of proteomes using a pan-genome approach, was presented in a study by Bohlin et al., where Brucella species were classified using 32 genomes (Bohlin et al. 2010).

Other genome-based methods include measures intended to replace the DDH (DNA-DNA hybridization) analysis, such as the Average Nucleotide Identity (ANI) between pairs of genomes or the Average Amino acid Identity (AAI) of the shared genes between two genomes. The study of Goris et al. on pairwise comparison of completely sequenced genomes showed that the ANI of the core genes gives results similar to the analysis of 16S rRNA sequence identity and DDH similarity values, concluding that a 70% DDH value corresponds to 95% ANI. Hence ANI has been shown to be an alternative to the tedious DDH method (Goris et al. 2007). Another genome-based method, AAI, has been shown to correlate strongly with 16S rRNA gene identity, and AAI-based phylogenetic trees are congruent with core-genome-based trees (Konstantinidis and Tiedje 2005).

There are many methods and types of data used to build evolutionary trees. The results of the whole-genome-based tools are usually values representing similarity between organisms, which can then be converted to distance-based phylogenies. Distance methods constitute a major part of phylogenetic analysis. "Least squares" is one of these methods, in which the sum of squared differences between the observed distances and the distances predicted by a tree is minimized. Unweighted (Cavalli-Sforza and Edwards 1967) and weighted (Fitch and Margoliash 1967; Beyer et al. 1974) algorithms have been suggested for least squares. Minimum Evolution, Neighbor Joining, and UPGMA are all distance-based methods. There are also methods that rely on probabilities of evolutionary change. Maximum likelihood is one of them, where different evolutionary rates can be taken into account and several models can be implemented (Felsenstein 2004). The evolutionary models and the distance methods should be chosen carefully when phylogenies are generated, as they might give different results even for a small set.

rRNA Phylogenetic Trees

In this section the 16S rRNA and 23S rRNA comparison of 126 various organisms from all bacterial and archaeal phyla is presented (Fig. 8.6). This data set represents a collection of distantly related prokaryotic organisms. The first criterion for the selection of organisms for this data set was to get the largest and smallest genomes from each phylum (the taxonomy reference is genome metadata from NCBI and GOLD). More organisms from each phylum were selected from different environments or host associations, in order to get a less biased data set in total.

Ribosomal RNA sequences of all 126 genomes were predicted using the RNAmmer program (Lagesen et al. 2007). For each genome, one 16S and one 23S rRNA sequence were selected based on the highest RNAmmer score and appropriate length (Lagesen et al. 2010). The length requirements were between 1,400 and 1,700 bp for 16S rRNA sequences and between 2,500 and 3,800 bp for 23S rRNA sequences. Once the RNA sequences were gathered, they were aligned using CLUSTALW with default parameters (10 for gap opening penalty, 0.20 for gap extension penalty, 30% for delay divergent sequences, 0.5 for DNA transitions weight, IUB for DNA weight matrix) (Larkin et al. 2007). After obtaining the alignments, the phylogenetic trees were constructed using MEGA5 (Tamura et al. 2011) and Neighbor-Joining (NJ) with 1,000 bootstrap resamplings. The bootstrap values in these phylogenies were transformed to percentages. They give a statistical measure of how reliable a branch separation is. Therefore, higher percentages provide stronger evidence of grouping, meaning a more prominent common ancestor, whereas lower percentages mean the separation on that branch is statistically insignificant.

Fig. 8.6 (a) 16S rRNA and (b) 23S rRNA tree with NJ method and 1,000 bootstrap resamplings from ClustalW alignment. The trees are viewed and colored with MEGA5. Branch lengths are measured in the number of substitutions per site. Each phylum is collapsed when possible, except that classes of Proteobacteria were collapsed instead of phyla

Ribosomal RNA phylogenies are usually able to distinguish the domains, phyla, and genera in a given set. The distances on this type of phylogeny show the divergence in the rRNA sequences. According to the bootstrap values in the 16S rRNA phylogenetic tree (Fig. 8.6a), the phylum-level clusters are significant, with bootstrap values higher than 80% on their roots. To better illustrate this, the two different clades were left uncollapsed while the remaining phyla clades were collapsed. The correspondence of the significant clades to phyla in prokaryotes is, however, an expected result. Phylum as a taxonomic unit is not defined by the official nomenclature (International Code of Nomenclature of Bacteria (Lapage et al. 1992)). The highest rank according to the official nomenclature is a class; however, the rank phylum is used quite often in prokaryotic taxonomy and seems to serve a practical use for taxonomists. Historically, phyla were referred to as divisions, and Prokaryotes, as one of the superkingdoms proposed by Whittaker and Margulis, were divided into three divisions based on cell wall structure or absence (Whittaker and Margulis 1978; Gibbons and Murray 1978). Although the classification has largely changed since the division of archaeal phyla (Murray 1989) was discovered, most of these names are still in use today. In the 2nd edition of Bergey's Manual, phylum was defined as the major prokaryotic lineages, based on the 16S rDNA sequence data, and used as the main organizational unit (Brenner et al. 2005b).

... method, to make it more similar to DDH, was made by randomly chopping up the genome sequences into 1,020-nucleotide fragments regardless of whether or not they correspond to any ORFs. The fragments from two genomes are aligned using a BLAST (Altschul et al. 1990) algorithm, or the genomes are aligned without fragmentation using a fast alignment tool such as MUMmer.

In this section, 50 genomes from different genera of the Enterobacteriaceae family (data gathered from NCBI GenBank) are compared based on their ANI values. ANI calculations were performed as explained in the paper by Richter and Rosselló-Móra, using JSpecies (Richter and Rosselló-Móra 2009). The genome sequence comparison was based on nucleotide MUMmer (NUCmer), which is a fast DNA alignment tool for large-scale comparisons (Delcher et al. 1999). MUMmer aligns two given genome sequences based on maximal unique matches (MUMs) between the sequences. A "MUM" is an exact string match that occurs once in each genome. Once the MUMs are identified, they are sorted in ascending order according to their positions in the genomes. After the global MUM alignment, the gaps between the MUMs are closed based on the properties of the gaps. A gap can be a single nucleotide polymorphism, an insertion or deletion where a large sequence is found in one genome but not the other, tandem repeats, or polymorphic regions. If gaps are found, they are aligned using the Smith-Waterman algorithm (Smith and Waterman 1981).

Comparisons based on the tetranucleotide frequencies were calculated using JSpecies and the algorithms from Teeling et al. (2004). In this method, the frequencies of all possible tetranucleotides (256 frequencies) are calculated for each sequence, and their z-scores are computed based on the difference between the observed and the expected frequencies for a genomic fragment.
The similarity between the two RibosomalRNAbasedphylogeniesusuallyinvolvesthe16S sequences (or genomic fragments) in terms of having similar rRNAsubunitcomparisons.Hereweshowthat23SrRNAphy- patternsoftetranucleotidesisaddressedbycalculatingthePear- logeny can also be useful. When the same dataset is analyzed soncorrelationcoefficientfortheirz-scores.Similarpatternsare using23SrRNAgenes,thebootstrapvaluesaregenerallyhigher expectedtocorrelateandthereforehavehighercorrelationcoef- than16SrRNAphylogenies(>Fig.8.6b).Thereareexceptions ficients,whereasthedistantpatternswouldhavelowercorrela- to this, for example, the bootstrap value on the roots of the tion coefficients. Oligonucleotide frequencies are thought to GammaProteobacteriacladethatishigheronthe16SrRNAtree. carry species-specific signals, where longer signatures carry Thegenerallyhighervaluesforthe23SrRNAanalysismightbe more signals. Thus, closely related organisms are expected to due to the size or information content of the sequences and showsimilardistributionoftheusageofthesesignatures. maybe due to the different mutation rates of the genes. The >Figure 8.7 shows a pairwise genome comparison of ANI separation of phyla on >Fig. 8.6b is significant, but the order value(heatmap).Thegenomesaremanuallyorderedbasedon is different, which might lead to the idea of having different 16S rRNA similarities. It is seen that DNA similarity within relationships among different phyla. However, since the boot- agenusishighercomparedtothesimilaritybetweengenera.It strap values are very lowat that level, it is still not relevant to isthereforepossibletodistinguishgroupsofgenusandspecies conclude how close Firmicutes is to Proteobacteria based on basedontheir DNA similarity. Forcomparison, TetraNucleo- rRNAphylogeny. 
tidefrequencieswerecalculatedforthesamedataandordered basedon16SrRNAsimilarity.Thetwoheatmapsareexpectedto showsimilarresults,withANIvaluesabove96%identitywould AverageNucleotideIdentities(ANI)andTetra correspondtoveryhighTetraNucleotidefrequenciescorrelation NucleotideFrequencyCalculations coefficients of (cid:3)0.99 (Richter and Rossello´-Mo´ra 2009). It is seenthatwithingenera,sequencesarehighlycorrelatedbasedon Average nucleotide identity was developed as an alternative to tetranucleotide signature usage. Changing the order of the the DDH values, and was initially based on comparison of all matrixbasedonhierarchicalclusteringmightgiveabetterres- shared genes among two genomes. Lateron an advance in the olutionusingTetraNucleotidefrequencies.
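The tetranucleotide comparison described in this section (z-scores of observed versus expected tetranucleotide counts, followed by a Pearson correlation of the z-score vectors) can be illustrated with a minimal Python sketch. It is not the JSpecies implementation. The function names are hypothetical, only one strand is counted, and the expected count is assumed to come from a maximal-order Markov model built on di- and trinucleotide counts, which is one common reading of "expected" and is stated here as an assumption.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]  # all 256 words


def count_kmers(seq, k):
    """Count overlapping k-mers that consist only of A, C, G, and T."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(BASES):
            counts[word] = counts.get(word, 0) + 1
    return counts


def tetra_zscores(seq):
    """Return 256 z-scores, one per tetranucleotide, in TETRAMERS order.

    Assumption: the expected count of n1n2n3n4 is
    N(n1n2n3) * N(n2n3n4) / N(n2n3), with a matching variance approximation.
    """
    seq = seq.upper()
    c2, c3, c4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = []
    for word in TETRAMERS:
        n_mid = c2.get(word[1:3], 0)
        n_left = c3.get(word[:3], 0)
        n_right = c3.get(word[1:], 0)
        if n_mid == 0:
            zscores.append(0.0)  # no information for this word
            continue
        expected = n_left * n_right / n_mid
        variance = expected * (n_mid - n_left) * (n_mid - n_right) / n_mid ** 2
        observed = c4.get(word, 0)
        zscores.append((observed - expected) / sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0
```

For two genome sequences `g1` and `g2` held as strings, `pearson(tetra_zscores(g1), tetra_zscores(g2))` gives the kind of correlation coefficient discussed above. Values from this simplified sketch are not expected to match JSpecies output exactly.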
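The fragmentation and averaging steps of the ANI calculation described in this section can be sketched in the same way. This is a toy illustration under strong assumptions: the genome is cut into consecutive, non-overlapping 1,020-nucleotide windows rather than random pieces, and the BLAST or MUMmer alignment is replaced by a position-wise comparison of two collinear sequences. All function names are hypothetical.

```python
def fragment_genome(seq, size=1020, keep_tail=False):
    """Cut a sequence into consecutive fragments of `size` nucleotides.

    The final, shorter fragment is dropped unless keep_tail is True.
    """
    fragments = [seq[i:i + size] for i in range(0, len(seq), size)]
    if fragments and len(fragments[-1]) < size and not keep_tail:
        fragments.pop()
    return fragments


def percent_identity(a, b):
    """Percent identity of two equal-length, gap-free sequences.

    Stand-in for a real alignment (assumption): positions are compared one to one.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def toy_ani(query, reference, size=1020):
    """Mean identity over fragment pairs of two collinear sequences (toy model)."""
    pairs = list(zip(fragment_genome(query, size), fragment_genome(reference, size)))
    if not pairs:
        raise ValueError("sequences are shorter than one fragment")
    identities = [percent_identity(q, r) for q, r in pairs]
    return sum(identities) / len(identities)
```

In a real analysis each fragment would be searched against the other genome with an aligner, and only the reported identities of matching fragments would be averaged. The position-wise comparison here only keeps the example self-contained.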