Microbiology(2010),156,3243–3254 DOI10.1099/mic.0.039545-0 Comparative genomics of the genus Bifidobacterium Francesca Bottacini,1,2 Duccio Medini,3 Angelo Pavesi,1 Francesca Turroni,1 Elena Foroni,1 David Riley,4 Vanessa Giubellini,1 Herve´ Tettelin,4 Douwe van Sinderen2 and Marco Ventura1 Correspondence 1Laboratory of Probiogenomics, Department of Genetics, Biology ofMicroorganisms, MarcoVentura Anthropology andEvolution, University of Parma,Italy [email protected] 2Alimentary Pharmabiotic Centre andDepartment ofMicrobiology, Bioscience Institute, DouwevanSinderen National University ofIreland, Western Road, Cork, Ireland [email protected] 3Novartis Vaccines and Diagnostics, Siena,Italy 4Institute for Genome Sciences, Department of Microbiology andImmunology, University of MarylandSchool of Medicine, Baltimore, USA Whole-genome sequencing effortshave revolutionized thestudy of bifidobacterial genetics and physiology. Unfortunately, thesequence ofa single genome does not provide information on bifidobacterial genetic diversity andon howgenetic variabilitysupports improved adaptation of these bacteria tothe environment of thehuman gastrointestinal tract (GIT). Analysis ofnine genomesfrombifidobacterialspeciesshowedthatsuchgenomesdisplayanopenpan-genome structure.Mathematicalextrapolationofthedataindicatesthatthegenomereservoiravailableto the bifidobacterial pan-genome consistsofmore than 5000 genes,manyof which are uncharacterized, butwhich are probably important toprovide adaptiveabilities pertinentto the humanGIT.Wealsodefineacorebifidobacterialgenesetwhichwillundoubtedlyprovideanew baselinefrom which one can examine theevolution ofbifidobacteria. Phylogenetic investigation performed ona total of506orthologues that arecommon tonine complete bifidobacterial genomes allowed theconstruction ofa Bifidobacterium supertreewhich is largelyconcordant withthephylogenetictreeobtainedusing16SrRNAgenes.Moreover,thissupertreeprovideda Received 4March2010 more robust phylogenetic resolution than the 16SrRNA gene-based analysis. Thiscomparative Revised 30June2010 study ofthe genusBifidobacterium thus presentsa foundation for future functional analysesof Accepted 6July2010 this important group ofGITbacteria. INTRODUCTION mechanisms by which these activities are achieved remain largely unknown. Bifidobacteria are commonly found as part of the micro- biota in the human gastrointestinal tract (GIT) (Ventura In recent years, genome sequencing of bifidobacteria and et al., 2007), where their presence has been positively other health-promoting (orprobiotic)bacterial strains has correlated with the health status of their host (Ventura cometothefore,evenleadingtothedesignationofanovel et al., 2009a, 2009b). Bifidobacteria have been claimed to discipline named probiogenomics (Ventura et al., 2007, elicitseveral health-promoting orprobiotic effects, suchas 2009a, 2009b), which aims to improve our knowledge of strengthening of the intestinal barrier, modulation of the the diversity and evolution of commensal bacteria of the immune response and exclusion of pathogens (Marco GITandtounravelthemolecularbasisfortheirpresumed et al., 2006; O’Hara & Shanahan, 2007). Although there health-promoting activities. is some evidence to support each of these functional Of the currently recognized 34 species of the genus claims for particular bifidobacterial strains, the molecular Bifidobacterium, just nine bifidobacterial genomes (repre- senting five species)have beenfully sequenced (Barrangou Abbreviations: COG,clusteredorthologousgroup;GIT,gastrointestinal tract; HGT, horizontal gene transfer; IS, insertion sequence; TUG, truly et al., 2009; Kim et al., 2009; Lee et al., 2008; Schell et al., uniquegene. 2002; Sela et al., 2008; Ventura et al., 2009a, 2009b), with another 17 whose genome sequences are still unfinished One supplementary figure and five supplementary tables are available withtheonlineversionofthispaper. (NCBI). Notably, for a small number of cases, such as 039545G2010SGM PrintedinGreatBritain 3243 F.Bottaciniandothers Bifidobacterium longum subsp. longum (Lee et al., 2008; 2002).However,thequestionastohowmanygenomesare Schell et al., 2002) and Bifidobacterium animalis subsp. necessary to fully describe a bifidobacterial species or the lactis (Barrangou et al., 2009; Kim et al., 2009), two or genusBifidobacteriumhasnotyetbeenanswered. more genome sequences are available. Bacterial genome data have also been used in order to Bifidobacterial genomes range in size from 2.0 to 2.8 Mbp investigate bacterial evolution through the identification of andpossessalargenumberoftRNAmolecules(e.g.from54 genes (constituting the core genome) that appear to be to58),whiledisplayingcommonarchitecturalfeaturesofa conservedamongbacteria.Effortshavealsobeenplacedonthe typical bacterial chromosome (Ventura et al., 2007). B. investigation of the bacterial pan-genome, which represents animalis subsp. lactis DSM10140 and Bifidobacterium the core genome containing genes present in a particular adolescentisATCC15703possessthesmallestchromosomes, taxonomicunitplusthevariableelementsofagenome.These with respective sizes of 2.0 and 2.1 Mbp (Barrangou et al., variableelementsarecomposedofgenesthatareabsentfrom 2009;Kimetal.,2009).Itshouldbepointedoutthat these the genome of at least one member of a taxonomic unit or latter bacteria are extensively used in commercial prepara- genes that are unique to a single member of that taxonomic tions (also known as probiotic foods), and they may unit (Bentley, 2009; Lapierre & Gogarten, 2009; Rasko et al., thereforehavebeensubjectedtogrowthinsyntheticmedia 2008;Tettelinetal.,2005).Thepan-genomesizesandratesof foranextendednumberofgenerations.Suchapracticehas recombination have been suggested to reflect differences in beendemonstratedtocausegenomedecayinbifidobacteria nicheandlifestyleofthedifferentspecies(Bentley,2009). (Leeetal.,2008),withthelossofchromosomalregionsthat In this report, we describe a genomic comparison of are apparently dispensable in an environment different bifidobacterial species in order to highlight the genome from the original ecological niche. Many bacteria contain structure complexity of this group of micro-organisms, as extrachromosomal elements in the form of plasmids and well as to provide a novel genomic perspective on the amongthestrainsthatbelongtotheB.longumphylogenetic evolution of bifidobacteria. species, B. longum subsp. longum DJO10A was shown to harbourtwoplasmids,pDOJH10L(10 kb)andpDOJH10S (3.6 kb),whileB.longumsubsp.longumNCC2705possesses METHODS a single plasmid,pBLO1(3.6 kb).Theother sequencedge- nomesappeartolackplasmids,althoughextrachromosom- Genome sequences. The genomes analysed consisted of the nine completeandpubliclyavailableBifidobacteriumgenomesdescribedin al elements are commonly identified among bifidobacteria Table 1. Functional annotations including clustered orthologous (Alvarez-Mart´ın et al., 2007; Cronin et al., 2007; Lee & group(COG)categorieswereobtainedfromNCBI. O’Sullivan,2006;Sangrador-Vegasetal.,2007).Thereareno obviousunusualspecies-specificfeatureswithregardtothe Pan-genome analysis. The genome comparisons for the pan- coding density, tRNA gene number or rRNA operons. The genome analysis were performed on a larger dataset of 14 publicly number of rRNA loci ranges from two to five (Barrangou available Bifidobacterium genomes, adding five incomplete genome sequencestotheninecompletegenomesequences(Table1).Genomic etal.,2009;Candelaetal.,2004;Kimetal.,2009;Leeetal., comparisons were performed according to a previously described 2008; Schell et al., 2002; Sela et al., 2008; Ventura et al., approach(Tettelinetal.,2005),withthefollowingmodifications.Each 2009a, 2009b), which is more variable than the number of pair of genome sequences was compared by means of i) wu-BLASTP rRNAlociidentifiedinotherhumangut-associatedbacteria searches (all protein vs all protein sequences) and ii) wu-tblastn suchaslactobacilli(Makarovaetal., 2006).Thenumberof searches (all proteins vs translated genome sequences) (http://www. rRNAoperonshasbeen claimedto be positively correlated advbiocomp.com/blast.html). Results from these two comparative searcheswerethencombinedinordertominimizethenumberoffalse withamicro-organism’sabilitytocolonizeandsurviveina negative matches. Hits were filtered such that homologues were particularecologicalniche(Klappenbachetal.,2000).Thus, defined as having 50% sequence similarity over at least 50% of the thevariablenumberofrRNAlociidentifiedinbifidobacter- lengthoftheprotein(Tettelinetal.,2005). ial genomes suggests that different survival strategies may Thesequentialinclusionofupto14strainswasthensimulatedinall havebeenfollowed. possible combinations to determine the number of core and Amongthoseelementsthatareconsideredtobelongtothe dispensable strain-specific and species-specific genes for each set of sequentially added genome data, after which the generated distribu- mobilome, i.e. insertion sequence (IS) elements, transpo- tionsofvalueswereanalysed.Meansandtheinterquartilerangeswere sonsandprophages,thehighestnumberofISelementshas used as point estimates of the central tendency and confidence been identified in the genome of B. longum subsp. infantis intervals,respectively,fortheregressionanalysis. ATCC15697 (Sela et al., 2008), whereas a very limited Theregressionanalysisforthecoregenomewasperformedbyfitting numberofsuchelementswerefoundinthegenomeofthe adouble exponential decay tothe datawith aweighted leastsquare humanoralcavityinhabitant,BifidobacteriumdentiumBd1 regression. (Ventura et al., 2009a, 2009b). (cid:2){N(cid:3) (cid:2){N(cid:3) Thusfar,thegenomesequenceofoneortwostrainsforeach ncore~k1exp t zk2exp t zH 1 2 of the above-mentioned species has provided information about genome variability at intraspecific level (Barrangou Where,n istheaverageofcoregenedistributions,Nisthenumber core et al., 2009; Kim et al., 2009; Lee et al., 2008; Schell et al., ofgenomes, k , t , k , t and Hare free parameters and the inverse 1 1 2 2 3244 Microbiology156 Thebifidobacterialpan-genome Table1.General genome features of Bifidobacterium species included inthis study ND,Notdetermined;C,complete;UC,incomplete. Species Genome G+Ccontent No. No.tRNA No. No.IS Origin GenBank Status size(bp) (mol%) genes genes rRNAloci elements accessionno. B.adolescentisATCC15703 2089645 60 1631 54 5 5 HumanGIT NC_008618 C B.animalissubsp.lactis 1933695 60.5 1528 52 2 1 Infantfaeces NC_011835 C AD011 B.animalissubsp.lactis 1938709 60 1631 52 4 6 Infantfaeces NC_012814 C Bl-04 B.animalissubsp.lactis 1938483 60 1629 51 4 6 Infantfaeces NC_012815 C DSM10140 B.dentiumBd1 2636368 59 2278 55 4 7 Dentalcaries NC_013714 C B.longumsubsp.infantis 2832748 60 2498 79 4 26 Infantfaeces NC_011593 C ATCC15697 B.longumsubsp.longum 2375792 60 1990 58 4 44 HumanGIT NC_010816 C DJO10A B.longumsubsp.longum 2256640 60 1727 57 4 46 HumanGIT NC_004307 C NCC2705 B.longumsubsp.longum 2477838 60 2035 55 3 ND HumanGIT CP002010 C JDM301 B.gallicumDSM20093 2019802 57 1980 ND ND ND HumanGIT ABXB00000000 UC B.pseudocatenulatum 2304808 56 2151 ND ND ND HumanGIT NZ_ABXX00000000 UC DSM20438 B.catenulatumDSM16992 2058429 56 2011 ND ND ND HumanGIT NZ_ABXY00000000 UC B.bifidumNCIM41171 2186140 62 1811 ND ND ND HumanGIT NZ_ABQP00000000 UC B.angulatumDSM20098 2007108 59 1811 ND ND ND HumanGIT NZ_ABYS00000000 UC square of the interquartile ranges of core genome distributions as Identification of orthologues, paralogues and unique genes was weights. The best fit was obtained with R250.999 for k 5254±5, performedfollowingapreliminarystepconsistingofthecomparison 1 t159.6±1.5, k25917±39, t251.19±0.06 and H (the asymptotic of each protein against all other proteins using BLAST analysis core genome size)5967±14. A single exponential decay fit to core (Altschul et al., 1990) (cut-off: E-value 161024 and 30% identity genomesizedataforN¢5(thelasttenpoints)resultedinasimilar overatleast80%ofbothproteinsequences),thenallproteinswere estimateofH,indicatingthatthedoubleexponentialdecaycaptures clustered into protein families using MCL (graph-theory-based correctlytheasymptotictrendofthecoregenome. Markov clustering algorithm) (van Dongen, 2000). Following this, the identified paralogues among the nine bifidobacterial proteomes The regression analysis for new genes and for the pan-genome was werediscardedandthroughMCLtheuniqueproteinsforeachspecies performedaccordingtoapreviouslydescribedmethod(Tettelinetal., of the genus Bifidobacterium were classified. Finally, each unique 2008). The power laws n 5k N2a and n 5k Nb (Heaps’ new new pan pan protein sequence, whose combined coding sequences form the core law) were fitted for N¢5 to new gene and pan-genome data, genome, was functionally annotated according to a COG category respectively,withaweightedleastsquareregression,wheren and new assignment. n are the average of new gene and pan-genome distributions, pan respectively,Nisthenumberofgenomes,k ,k ,aandbarefree new pan Whole-genome alignments at DNA and protein level. parameters,andtheinversesquareoftherespectivedistributionsare Bifidobacterium whole-genome sequence alignments were performed weights.Thebestfitswereobtainedfork 5808±7,a50.64±0.01 new atDNAlevelusingMUMmer(Kurtzetal.,2004). (R250.999) and for k 51797±21, b50.40±0.01 (R250.999). pan Values of a.1 and b.0 both indicate an open pan-genome. Of Comparisons at protein level (Canchaya et al., 2006; Chan et al., note,thevaluesobtainedforthepan-genomeexponentsaandbare 2006) were performed with the PROmer algorithm (part of the ingoodagreementwiththetheoreticalpredictiona512b,andvalues MUMmer package). Distances between the nine bifidobacterial obtainedperformingthesameregressionsforN¢4,6,7,8wereallin genomes were calculated from the PROmer alignment normalized complete agreement with those reported, supporting the solidity of by the sizeofthe shortest genome andwere compared byusing the theresultfortheasymptotictendencyofthissample. followingequation: An open pan-genome of a specific bacterial group indicates that TotalPROmeralignments genomes of such taxa are evolving and diversify at a higher rate Dab~{log2min(cid:4)lengthgenomesequencaeb (cid:5) ab comparedwithaclosedpan-genome. Proteome comparison and extraction of shared and unique Proteome comparison and identification of orthologues. genes. Each of the nine predicted bifidobacterial proteomes was Proteomes of the currently complete actinobacterial genomes searchedfororthologuesagainsttheotherproteome,whereorthology available in public databases were compared firstly using an all- betweentwoproteinswasdefinedasthebestbidirectionalFASTAhits against-allBLAST(Altschuletal.,1990)(cut-off:E-value161024and (Pearson,2000). 30%identityoveratleast80%ofbothproteinsequences),thenthe http://mic.sgmjournals.org 3245 F.Bottaciniandothers BLAST results were clustered into proteins families using MCL (van values of this matrix, and which are thus of most phylogenetic Dongen, 2000). Paralogues were discarded by selecting the families relevance,wereselectedandevaluatedwithax2test. thatcontainedonesingleproteinmemberforeachgenome.Proteins identifiedasbelongingtothegenomemobilome,suchasISelements or phages, were discarded. Orthologues were functionally classified RESULTS AND DISCUSSION usingCOGcategoryassignments. Each set of orthologous proteins were aligned using CLUSTAL_W Whole-genome alignments (Thompson et al., 2002) and phylogenetic trees were constructed using the maximum-likelihood in PhyML (Guindon & Gascuel, Sequencealignmentoftheninepubliclyavailablecomplete 2003).ThesupertreewasbuiltusingSplitsTree(Huson,1998). bifidobacterial genomes was performed in order to establish the genetic relationships among these members 16S rRNA-based phylogenetic tree. A complete phylogenetic of the Bifidobacterium genus. The closest alignments that analysiswasperformedusingthe16SrRNAactinobacterialsequences were obtained for bifidobacteria include those between retrievedfromtheRibosomalDatabaseProject-II(Coleetal.,2005). members of the B. longum phylogenetic species, i.e. B. Whereavailable,16SrRNAgenesequencesofthesamestrainusedin theproteomecomparisonwerechosenforeachspecies.Otherwise,the longum subsp. longum and B. longum subsp. infantis, or 16SrRNAgenesequenceofthetypestrainwasused.Sequenceswere species belonging to the B. adolescentis group such as aligned using CLUSTAL_W (Thompson et al., 2002), and phylogenetic B. adolescentis versus B. dentium (data not shown). In trees were constructed using the maximum-likelihood in PhyML contrast,theleastrelatedgenomes,asconcludedfromtheir (Guindon & Gascuel, 2003) and neighbour-joining in CLUSTAL _ W. low levels of alignment, were those of B. animalis subsp. Furthermore,bootstrapanalysiswasperformedusing100resamplings. lactis (B. pseudolongum group) and genomes from mem- Identification of shared genes between bifidobacteria and bers of the B. longum group, a finding that correlates otherhumangutmicrobiotamembers.Proteins encoded by the well with the evolutionary distance between these species nine bifidobacterial strains isolated from various sections of the based on 16S rRNA or multigene alignments (Ventura humanGIT,suchasB.adolescentis ATCC15703,B.dentiumBd1,B. et al., 2006). longum subsp. infantis ATCC15697, B. longum subsp. longum NCC2705 and B. longum subsp. longum DJO10A, B. longum subsp. Some of the bifidobacterial whole-genome alignments longum JDM301, Bifidobacterium animalis subsp. lactis AD011, B. display as patched regions arranged in an X-shaped animalissubsp.lactisBl-04andB.animalissubsp.lactisDSM10140, distribution across the origin of replication, a finding were compared with the proteins encoded by the genomes of 47 which was previously described for bacteria (Eisen et al., bacterialandtwoarchaealinhabitantsofthehumanGITorbacteria 2000). Such symmetry indicates that paring of sequences isolated from other environments (soil, water and plant) whose tends to occur at the same side as the origin, which is genomeswereusedasoutgroups(SupplementaryTableS1,available with the online version of this paper). The set of clustered genes, explained by the fork replication theory (Tillier & Collins, whichhadbeenobtainedusingtheMCLanalysis(vanDongen,2000) 2000). However, these X-shaped regions are not visible in as described above, was filtered in order to only retain hits that any of the B. animalis alignments. Furthermore, these X- correspond to genes shared by some but not all of the analysed shaped alignments can be observed most clearly in the genomes.Furthermore,abinarymatrixwithvectorsconsistingof0–1 middle region of the xy plots. In fact, the rightmost and valueswasbuiltbyaninternalVisualBasicscriptonaMicrosoftExcel leftmostpartsoftheplotsthatcorrespondtotheoridonot table.Inthismatrix,eachvectorcorrespondstoanorganism,andthe displaysynteny,suggestingthat suchregionsaresubject to 0or1valuesindicateabsenceorpresence,respectively,ofaparticular gene in that organism. The final matrix D represents the genome a higher rate of reshuffling or an increased acquisition of information of 55 organisms and includes data derived from the alien DNA by horizontal gene transfer (HGT) compared analysisofmorethan11000proteins. with the alignment within the B. longum phylogenetic group. Such findings contrast with what has previously Multivariateanalysis.Acorrespondenceanalysiswascarriedouton been described for bacterial genome evolution, where it thisfinalmatrixDusingthestatisticaltoolR2.9.1(Dessau&Pipper, 2008); the product between the transpose of the matrix D (DT with was suggested that the regions around the replication 11000 rows and 55 columns) and the matrix D itself (55 rows and terminusdisplayanincreasedfrequencyofHGTcompared .11000 columns) generates the matrix E (11000 rows and 55 with the remainder of the chromosome (Daubin & columns). Eigenvectors and eigenvalues of the matrix E were Perriere, 2003). calculatedwiththeEIGENsubroutinefromthestatisticaltoolR2.9.1 (Dessau & Pipper, 2008). Only the first 10 eigenvectors were taken These results clearly show that genomes of different into consideration, thus generating matrix F (11000 rows and 10 bifidobacterial species are in general not collinear, thus columns).TheproductbetweenthematrixDandmatrixFgenerated implying the existence of significant genome diversity matrix G (55 rows and 10 columns), which provides the position withinmembersofthegenusBifidobacterium.Thegenome coordinatesofthe55bacteriainthefirst10axesofordination.The alignments also generate data that can be used to percentagevariationfractionassociatedwitheachaxiswascalculated reconstruct phylogenetic relationships between bifidobac- as the ratio of the corresponding eigenvalue to the sum of all teria through the use of PROmer, similar to what was eigenvalues.Finally,theevolutionaryrelationshipsbetweenorganisms werevisualizedbytheconstructionofabidimensionalplotwiththe previously described for other prokaryotic genomes (Henz statisticaltoolSPSS9.0. et al., 2005). A neighbour-joining tree was built using the distances calculated using the PROmer analyses (Fig. 1a). The information carried by matrix F reveals ORFs that maximally contributetotheclusteringof55bacterialspeciesunderexamination The resulting tree, as based on the deduced proteomes, at each axis of ordination. The ORFs corresponding to the highest shows a clustering of bifidobacterial species which is very 3246 Microbiology156 Thebifidobacterialpan-genome Fig. 1. Neighbour-joining tree of sequenced bifidobacteria based on PROmer distances (a). The tree is based on the distances evaluated from the non-overlapping protein aligned regions. (b) Phylogenetic tree of bifidobacteria based on 16S rRNA gene sequences. Bootstraping values are provided. Bars,numberofsubstitutionspersite. similar to that obtained from 16S rRNA comparative sequential addition of each new bifidobacterial genome analysis(Fig.1b)orfromamultigeneapproach(aso-called sequencewasextrapolatedbyfittinganexponentialdecaying supertree,Ventura etal., 2006).Thus, we can arguethat in function to core genes calculated with all the possible the case of bifidobacteria, whole-genome-based phylogen- permutationsintheorderofadditionofgenomesequences etic investigations, also called phylogenomics, generate a (Fig. 2a). As expected, the number of shared genes initially very similar evolutionary pattern to that achieved using decreases with the addition of each new genome sequence. classical molecular markers. Such a phylogenomics-based Nevertheless,extrapolationofthecurveshowsthatthecore analysis provides convincing proof that all bifidobacteria genome content reaches a minimum of approximately 967 share a common ancestor, and in order to expand on this genes,asindicatedinFig.2(a).Thisapparentdifferencefrom finding,amoreindepthanalysisinvolvingtheidentification the number of orthologous proteins shared between all the ofacommonbifidobacterialgenecontentwasperformed. bifidobacterial genomes (506 proteins) is probably a con- sequence of the reduced stringency that is normally em- ployed for the prediction of the core genome sequences The Bifidobacterium conserved core genome (50% similarities over 50% of the total gene nucleotide Analysisofthesetofgenes/proteinsthatwereseparatedfrom length) (Medini et al., 2005; Tettelin et al., 2005; Tettelin thelastcommonancestor,inanevolutionarytime-frame,by etal.,2008),comparedwiththehigherconstraintvaluesthat aspeciationevent(Fitch,1970)allowedtheidentificationof are usually employed in the determination of common 506 orthologues shared by all nine species of the genus orthologous protein sets in bacterial genomes (cut-off E- Bifidobacterium for which the complete genome sequence value 161024 with an identity value of 30% over 80% of wasavailable(datanotshown).Thiscommonsetofproteins both protein sequences). Of note, the actual number of represents the presumed core of bifidobacterial genomic shared genes in each genome varies because of duplicated coding sequences. To estimate an asymptotic limit to the genes and paralogues. Examination of the functional anno- number of genes present in every Bifidobacterium species tation of these genes suggests that the conserved core genes (Bifidobacterium core genome), we included in our analysis encode mostly core housekeeping functions such as those five publicly available but incomplete genomic sequences, involvedinreplication,transcriptionandtranslation,aswell adding to the sample five novel species of the genus: B. as functions related to adaptation/interaction with a parti- pseudocatenulatum, B. gallicum, B. catenulatum, B. bifidum cular environment, such as carbohydrate metabolism, cell and B. angulatum. The number of shared genes found on envelopebiogenesisandsignaltransduction(Table2). http://mic.sgmjournals.org 3247 F.Bottaciniandothers Table 2. Distribution of the genes constituting the core genome sequences of bifidobacteria with respect to COG categories Proportionofcore COGcategory genome(%) 19.2 [J]Translation,ribosomalstructureand biogenesis 3.9 [K]Transcription 7.9 [L]Replication,recombinationandrepair 1.9 [D]Cellcyclecontrol,celldivision, chromosomepartitioning 0.3 [V]Defencemechanisms 3.9 [T]Signaltransductionmechanisms 4.6 [M]Cellwall/membrane/envelope biogenesis 0.7 [U]Intracellulartrafficking,secretionand vesiculartransport 4.3 [O]Post-translationalmodification,protein turnover,chaperones 4.8 [C]Energyproductionandconversion 6.5 [G]Carbohydratetransportandmetabolism 12.2 [E]Aminoacidtransportandmetabolism 6.0 [F]Nucleotidetransportandmetabolism 3.9 [H]Coenzymetransportandmetabolism 2.6 [I]Lipidtransportandmetabolism 2.9 [P]Inorganiciontransportandmetabolism 0.2 [Q]Secondarymetabolitesbiosynthesis, transportandcatabolism 8.4 [R]Generalfunctionpredictiononly 5.7 [S]Functionunknown core genes. The total number of genes identified when all Fig.2.Thedistributionsofthenumber(n)ofcoregenes(a),total 14 genomes are compared is 5125 (ranging between 4960 genes (i.e. pan-genome size; b) and new genes (c) found upon sequential addition of ‘N’ genomes (see Methods) are shown as and 5309 depending on the genome permutation order), box-plotsindicating25–75percentiles,themediansashorizontal more than twice the average number of genes found in a lines, the means as black circles and the whiskers for 10–90 single Bifidobacterium genome (Fig. 2b). The pan-genome percentiles.In(a),adoubleexponentialregressiontocoregenome size, when plotted on a log–log scale versus the number of dataisshownasasolidcurve,asingleexponentialregressionto genomes, shows a clear linear trend in agreement with the N¢5dataisshownasadashedcurve,andtheasymptoticvalue Heaps’lawpan-genomemodel(Tettelinetal.,2008),anda ofthecoregenomeobtained(H5967±14genes)isindicatedby robustfitofthedatawasobtainedforanincreasingpower adottedarrow.In(b)and(c),powerlawregressionstothepan- lawwithpositiveexponentb50.40.Thisindicatesanopen genomesizeandtothenumberofnewgenesareshownassolid Bifidobacterium pan-genome, even though within a single curves, both indicating an open pan-genome (power law species sampled with two or more isolates, very little exponents b50.40±0.01 and a50.64±0.01). For comparison, genomic variation is observed. the limiting trend for an open/closed pan-genome (logarithmic divergence)isshownasadashedcurveinbothpanels. Thenumberofnewgenesdiscoveredbysequentialaddition ofgenomesequencesisshowninFig.2(c).Onaverage,each isolatecontributes521newgenescomparedwithadifferent genome (number of genomes N52), and this number The Bifidobacterium pan-genome decreases as more isolates are added. The last genome Inordertoestimatethetotalgenerepertoireofmembersof (N514) contributes on average 152 new genes compared the genus Bifidobacterium (Bifidobacterium pan-genome) with the other 13, a sizeable amount considering that a weemployedapreviouslydescribedmethod(Tettelinetal., significant proportion of the within-species comparisons 2008) which calculates both the overall number of genes resultinthediscoveryofnonewgenes.Infact,foreachfixed discovered and the expected number of new genes N, the distributions of the genome-order permutations for contributed by each additional genomic sequence, using newgenes(datanotshown)arecomposedofthreemodes:i) the same permutation scheme employed in the analysis of within-species comparisons (very low number of new 3248 Microbiology156 Thebifidobacterialpan-genome genes), ii) comparisons between isolates belonging to very adequately represent the diversity that exists within strains distantspecies(highnumberofnewgenes),andiii)amajor belongingtotheB.animalissubsp.lactisspecies. intermediate mode accounting for ~65% of the compar- isons, which determines the mean of the distribution, and Identification of unique genes representsthetypicalbetween-speciesvariationobservedin theBifidobacteriumgenus.Apowerlawregressiontothelast The pan-genome analysis, restricted to the nine closed ten points of the dataset (N¢5) confirms a behaviour in genomes, also allowed the identification of truly unique agreementwiththeHeaps’lawpredictionsforanopenpan- genes(TUGs),i.e.genespresentonlyinareferencegenome genome (exponent of the power law a50.64) and suggests and absent in any of the other bifidobacterial genomes. thatthecriticalcontributiontothepan-genomeofthegenus Such analyses were carried out as described previously comes from between-species diversity. When we repeated (Raskoetal.,2008).ThenumberofTUGsvariesfrom21to such analyses focussing on specific Bifidobacterium species 230 in the nine genomes analysed. The mean number of for which multiple strain genome sequences are available, TUGs found in the Bifidobacterium genome dataset is such as B. animalis subsp. lactis, different results were 143±133. The large deviation from the mean is indicative achieved.Infact,inthecaseofB.animalissubsp.lactis,the of a high degree of genome diversity in the genus numberofspecificgenesaddedtothepan-genomedropped Bifidobacterium. As expected, the large majority of TUGs to zero after the addition of the third strain. This result do not have a functional annotation, and thus they may probably reflects the fact that B. animalis subsp. lactis is a represent novel biosynthetic or human gut commensal highly clonal, recently evolved taxon from the B. animalis interaction features. Interestingly, the B. longum subsp. species in which genome variability is associated only with infantis ATCC15697 genome contains the largest number very reduced intergenic sequence variation or limited to of TUGs (Supplementary Table S2 and Supplementary clustered regularly interspaced short palindromic repeats Fig. S1). Such a finding was anticipated, as the genome (Barrangouetal.,2009).Alternatively,thesequencedstrains sequence of this strain was shown to exhibit a significant may belong to the same evolutionary clade and may not level of diversity, perhaps due to the presence of RexAB, Fig. 3. (a) Plot representation of shared bifidobacterial genes versus those from intestinal human bacteria using double correspondence analysis. (b) Distribution of the shared bifidobacterial genes versus those from intestinal human bacteria according to COG categories. Each pillar represents the COG distribution in each bifidobacterial genome. (c) Schematic representationofthesharedbifidobacterialgenesversusthosefromintestinalhumanbacteriaaccordingtothebacterialgroups towhichtheybelong.In(a),thedifferentbacterialgroupsarerepresentedwithdifferentcolours;bifidobacteriaarecircled. http://mic.sgmjournals.org 3249 F.Bottaciniandothers which promotes chromosomal rearrangements and inte- adhesion-mediating proteins similar to fimbrial subunits gration of heterologous sequences through recombination andsugarmetabolizingenzymes/transportersarefound.As (Sela et al., 2008). The identification of TUGs in bifido- in the case of other bacterial genomes (Canchaya et al., bacteriamayservetoidentifytargetsforfurtherfunctional 2003), phage-associated genes account for less than 1% of studies on specialized adaptive abilities, particularly with all TUGs in bifidobacteria (data not shown). regard to host interactions and metabolism of host diet components. Additional functional genomic studies (e.g. Dispensable bifidobacterial genome and the whole transcriptional/proteomic profiling and gene inac- human gut microbiome tivation) will be important to understand the function of this large set of unique hypothetical proteins, considering Bifidobacteria are considered to represent a significant the relevance of bifidobacteria in human health, as they component of the intestinal microbiota of mammals could represent putative biomarkers of a ‘healthy’ gut (Turroni et al., 2009a, 2009b; Ventura et al., 2007). The microbiota. initial establishment of bifidobacteria in the human large intestineisbelievedtooccurwithinthefirstcoupleofhours Many of the TUGs are flanked by insertion elements and followingbirth,andfollowingthistheymayremainpresent displaysignificantdivergencefromtheaverageGCgenome during the remainder of the host’s life. In their natural content or atypical codon usage, suggesting acquisition environment, bifidobacteria come into contact with other through HGT. Among their encoded products, putative bacterial components of the intestinal microbiota, with Fig. 4. (a) Phylogenetic supertree based on the sequences of actinobacterial core proteins. The various sequenced bifidobacteria are indicated; the CMN subgroup (Corynebacteria, Mycobacteria and Nocardia) is also highlighted. (b) Phylogenetic tree based on 16S rRNA gene sequences from the same actinobacterial set of species used to build the phylogeneticsupertree. 3250 Microbiology156 Thebifidobacterialpan-genome which they are expected to exchange genetic material amino acids were excluded, the majority of the proteins throughHGT.Inordertoinvestigatethispresumption,we were involved in coenzyme transport and metabolism, analysedthenumberoforthologuesthataresharedbetween signal transduction and unknown functions (Fig. 3b). It bifidobacteria and other human gut bacteria (Supple- seems logical that proteins shared between human gut mentary Table S1). We included in our analysis those bacteria contribute to adaptations that typify specific human gut bacteria whose genomes are fully sequenced, groupsofintestinalcommensals,emphasizingthenecessity which span a broad range of taxonomic groups, i.e. from for functional investigations in order to provide the bio- Archaea to Bacteria. These sequences were retrieved from logical role of these gene products in the human gut databasesandvarioushumangutmicrobiomeprojects,and microbiota. Interestingly, members of the human gut we excluded from our analyses those orthologues that are microbiota that appear to have exchanged genes with commontoallbifidobacterialgenomesandthusconstitute bifidobacteria appear to encompass a wide phylogenetic the core genome sequences. The correspondence analysis distribution including Archaea, such as Methanobrevibac- (seeMethods)identified365genesthatappeartorepresent terium smithii and Methanosphaera stadtmanae, as well as dispensable bifidobacterial genome sequences, i.e. genes varioustaxabelongingtothephylaChlamydiae,Proteobacteria thataresharedwithotherbacterialmembersofthehuman andFirmicutes(Fig.3candSupplementaryTableS4). gut microbiome, while they do not occur in all bifidobac- terialgenomes(Fig.3aandSupplementaryTableS3).When Bifidobacterium phylogeny and phylogenomics theencodedproteinswereclassifiedaccordingtotheirCOG categoriesandcomparedwiththeaverageCOGdistribution To complement the sequence-based and synteny-based in bifidobacterial genomes, we found that, when proteins phylogenetic investigations, we also followed a multigenic involvedinmetabolismandtransportofcarbohydratesand tree phylogeny reconstruction approach (Fig. 4). In fact, http://mic.sgmjournals.org 3251 F.Bottaciniandothers such a strategy provides a solution to the problem of thegenusBifidobacteriumisfoundtobeopen.Theanalysisof combining evidence from different genome loci to infer ninedifferentspeciesinthisworkindicatesthatasubstantial phylogeny without losing information from independent amountofundiscoveredgenesarelikelytobepresentinthe gene histories. A concatenated protein consisting of 506 genomic sequences of isolates belonging to other, as yet sequences, corresponding to each of the core proteins that unsequenced,bifidobacterialspecies.Suchapredicteddiver- wereidentifiedinoneoftheprevioussections,wasusedto sity is not surprising if one considers that sequencing of build a Bifidobacterium supertree, which allows tracing environmentalsamples,suchaswater,soilorfaecalsamples, phylogenybasedon genomic data (phylogenomics) within has enabled the identification of unknown bacterial micro- this group of bacteria. The discriminatory power of the biomes (Eckburg et al., 2005; Venter et al., 2004). These concatenated tree is much higher than that observed with extensive genetic datasets have indicated that the envi- the single 16S rRNA gene, which is also confirmed by the ronmental gene pool available for inclusion by mechanisms analysisofthepairwisedistancesandthestandarddeviation, suchasHGT,transduction,conjugation andtransformation respectively. Each bifidobacterial species was differentiated ismuchlargerthanpreviouslyimagined.Bacteriophagesare as a distinct entity. The average similarity from these con- thought to play a key role in transferring genetic material catenated sequences was 74%, against 95% when the 16S between bacteria that share the same ecological niche. In rRNAgenewasused.Themostcloselyrelatedtaxaexhibited contrast,bacterialspeciesresidinginrestrictedenvironments similaritylevels of more than 90%, i.e.92% forB. longum andlackingmechanismsofgeneexchangemayhaveevolved subsp. longum and B. longum subsp. infantis, which are with considerably less genome variation. Bacteria such as lower than those observed for the same set of strains using BuchneraaphidicolaorBacillusanthracispossessaclosedpan- 16SrRNAgenesequences(99%).Furthermore,theincrease genome, where no or very limited chromosome rearrange- in sequence size led to a considerable increase in tree mentsorgeneacquisitionshaveoccurredduringthecourseof robustness.Infact,theprogressiveconcatenationshowedan evolution(Tamasetal.,2002). increaseindeep-nodebootstrapvalues.Dataconcatenation The current bacterial taxonomy relies on genes associated therefore provides a good means of increasing the robust- with the core genome (e.g. rRNA genes, chaperone- ness of the final tree. The significant increase in bootstrap encoding genes, recA gene) (Stackebrandt et al., 2002). valuesdemonstratesthatthephylogenetictree,ascalculated However, a large proportion of the genetic traits that are fromtheconcatenationofcoreproteinsthatmaybeusedas responsiblefortheadaptationtoaspecificecologicalniche, alternative molecular markers to the 16S rRNA gene, may suchasantibioticresistance,arepresumedtobepartofthe considerably improve the phylogenetic relevance. We also dispensable genome. Therefore, sequencing of multiple performed a phylogenetic investigation based on the con- strains is necessary in order to understand the genetics of catenation of a specific set of core protein sequences from adaptative strategies employed by particular commensal bothbifidobacteriaandothermembersofthegenusActino- micro-organismsinthehumangut,andtoprovideamore bacteria(SupplementaryTableS5).Theproteinsetusedfor consistent definition of the species itself. In this report we thispurposeconsistedofsixhousekeepinggenes,mfc,rpmA, show that sequencing of only a few genomes of a limited atpC,obgE,ileSandlegA,whichencodetranslationinitiation number of bifidobacterial species is not enough to factor IF-3, 50S ribosomal protein, ATP synthase C chain, understand the overall genomic basis and plasticity of the GTPase,isoleucyl-tRNAsynthaseandGTP-bindingprotein, bifidobacterialgenus.Also,ourresultssuggestthatinorder respectively. This phylogenomics approach allowed the to avoid a phylogenetic bias within this genus, further evolutionary positioning of bifidobacteria within the sequencing efforts should concentrate on increasing the phylum Actinobacteria (Fig. 4a), which revealed that number of species sampled rather than the number of bifidobacteriarepresentthedeepestbranchseparatingthem isolates sampled per species. Alternative genome analysis fromotherActinobacteriainamannerthatissimilartothat techniques such as comparative genome hybridization can observedfora16SrRNA-basedtree(Fig.4b). only provide information on the presence/absence and variability of genetic loci that are already known and do not identify genes that are not present in the reference Conclusions genome. Thus, in order to fully explore the genome The genomic era for bifidobacteria has only recently variabilityofbifidobacteria–includingtheidentificationof commenced in earnest, and the limited number of species TUGs,thesizeofwhichcouldbevastlylargerthanthecore for which genome sequences have been completed has genome – a whole genome sequencing approach will be allowed us to take a first look, to our knowledge, at the necessary. pan-genomic structure of this bacterial genus. Regression analysis shows that, in the case of bifidobacteria, ACKNOWLEDGEMENTS singlespeciestendtobeclosedandonlyaverysmallnumber This work was financially supported by the Italian Award for ofnewgenesareexpectedtobeaddedtothespeciesgenepool Outstanding Young Researcher scheme ‘Incentivazione alla mobilita` withtheadditionofnewgenomesequences.Conversely,the di studiosi stranieri e italiani residenti all’estero’ 2005–2009 and a size of the gene pool pertaining to the whole genus is more Marie Curie Reintegration Grant (MERG-CT-2005-03080) to M.V., thantwicethesizeofasinglegenome,andthepan-genomeof by Spinner 2013, Regione Emilia Romagna and the European Fund 3252 Microbiology156
Description: