ancestral de novo PDF

36 Pages·2012·0.9 MB·English
Evolution of Viral Proteins Originated De Novo by Overprinting

Niv Sabath,*,1,2 Andreas Wagner,1,2,3 and David Karlin4
1 Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
2 The Swiss Institute of Bioinformatics, Basel, Switzerland
3 The Santa Fe Institute, Santa Fe, New Mexico
4 Oxford University, South Parks Road, Oxford, United Kingdom
*Corresponding author: E-mail: [email protected].
Associate editor: Daniel Falush

Abstract
New protein-coding genes can originate either through modification of existing genes or de novo. Recently, the importance of de novo origination has been recognized in eukaryotes, although eukaryotic genes originated de novo are relatively rare and difficult to identify. In contrast, viruses contain many de novo genes, namely those in which an existing gene has been “overprinted” by a new open reading frame, a process that generates a new protein-coding gene overlapping the ancestral gene. We analyzed the evolution of 12 experimentally validated viral genes that originated de novo and estimated their relative ages. We found that young de novo genes have a different codon usage from the rest of the genome. They evolve rapidly and are under positive or weak purifying selection. Thus, young de novo genes might have strain-specific functions, or no function, and would be difficult to detect using current genome annotation methods that rely on the sequence signature of purifying selection. In contrast to young de novo genes, older de novo genes have a codon usage that is similar to the rest of the genome. They evolve slowly and are under stronger purifying selection. Some of the oldest de novo genes evolve under stronger selection pressure than the ancestral gene they overlap, suggesting an evolutionary tug of war between the ancestral and the de novo gene.

Keywords: overlapping genes, de novo origin, new genes.
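
To make the notion of overprinting concrete, the following minimal Python sketch scans the two alternative forward reading frames of an existing coding sequence for open reading frames. It is an illustration only, not the procedure used in this study: the function names (`find_orfs`, `overprinting_candidates`), the `min_codons` threshold, and the toy sequence are hypothetical, and the sketch assumes an ATG start codon, the three standard stop codons, and the forward strand only.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
START_CODON = "ATG"                  # assumption: canonical start codon only


def find_orfs(seq: str, frame: int, min_codons: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) positions of ORFs in one forward frame (0, 1 or 2).

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts sense codons (start codon included, stop excluded).
    """
    seq = seq.upper()
    orfs = []
    start = None  # position of the first ATG of the ORF currently open
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                orfs.append((start, i + 3))
            start = None
    return orfs


def overprinting_candidates(
    ancestral_cds: str, min_codons: int = 30
) -> Dict[int, List[Tuple[int, int]]]:
    """Scan frames +1 and +2 of a coding sequence whose own frame is 0."""
    return {frame: find_orfs(ancestral_cds, frame, min_codons) for frame in (1, 2)}


# Toy sequence (hypothetical): frame +1 holds ATG, five GCT codons and a TAA stop.
toy = "A" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(overprinting_candidates(toy, min_codons=3))  # {1: [(1, 22)], 2: []}
```

A scan of this kind only lists candidate frames. Genes expressed through other mechanisms, such as non-ATG initiation, may be missed, and an open frame alone is no evidence of expression, which is why the study restricts itself to experimentally validated genes.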
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Mol. Biol. Evol. 29(12):3767–3780 doi:10.1093/molbev/mss179 Advance Access publication July 19, 2012

Introduction
Novel protein-coding genes can have two fundamental origins (reviewed in Long et al. 2003; Babushok et al. 2007; Zhou and Wang 2008; Bornberg-Bauer et al. 2010; Kaessmann 2010; Tautz and Domazet-Loso 2011). In the first, a gene originates by modification of an existing gene, for example, through gene duplication, exon shuffling, gene fusion, horizontal gene transfer, or transposition. In the second, a gene originates de novo. This mechanism was thought to be highly improbable (Ohno 1970; Jacob 1977), but recent studies have provided experimental evidence that it may be frequent. De novo origination can take place in a previously noncoding region, such as an intergenic region (Cai et al. 2008; Toll-Riera et al. 2009b; Li, Zhang, et al. 2010), or an intron (Sorek 2007). However, a gene can also originate de novo from an open reading frame that already encodes a protein, by a mechanism called “overprinting”, in which mutations lead to the expression of a second reading frame overlapping the first one (Ohno 1984; Keese and Gibbs 1992; Rancurel et al. 2009; Li, Dong, et al. 2010). Genome-scale computational analyses or experimental analyses of RNA transcripts have proposed many candidate genes originated de novo through these mechanisms (Levine et al. 2006; Begun et al. 2007; Zhou et al. 2008; Knowles and McLysaght 2009; Chen et al. 2010; Wu et al. 2011; Yang and Huang 2011).

Most studies in this area have focused on eukaryotes, which are not necessarily the best organisms to study de novo genes. First, the incidence of de novo gene origination may be relatively low in eukaryotes, ranging from 2 to 12% of all new gene origination events according to recent estimates (Zhou et al. 2008; Toll-Riera et al. 2009a; Ekman and Elofsson 2010). Second, direct experimental evidence for the expression of the proteins encoded by candidate de novo genes is not always available in eukaryotes; some might be artifacts of genome annotation (Wang et al. 2003). Third, most eukaryotic candidate genes are structurally and functionally poorly characterized. Finally, current protocols to identify genes created de novo from noncoding sequences in eukaryotic genomes focus on genes with similarity to genes already annotated in the genome sequence, whereas some de novo genes may not be currently annotated, even as hypothetical, which would preclude their discovery (Guerzoni and McLysaght 2011).

The identification of de novo genes in viruses does not suffer from these problems, or to a much lesser extent. This holds especially for genes generated by overprinting. Overlapping genes are very common in viral genomes (Belshaw et al. 2007; Chirico et al. 2010), providing an abundant source of such de novo genes. In addition, in most cases, the expression of their protein product has been proven, and their function is at least partly known (Rancurel et al. 2009). Finally, using overlapping genes allows the identification of proteins originated de novo with high reliability (see later), by avoiding the confounding factors that limit current approaches to identify proteins generated de novo from noncoding sequences (Guerzoni and McLysaght 2011). For brevity, we will refer here to de novo genes as genes that originated through overprinting.

Viral de novo genes often encode proteins that play a role in viral pathogenicity or spreading, rather than proteins central to viral replication or structure (Li and Ding 2006; Rancurel et al. 2009). The majority of these proteins are predicted to be structurally disordered, i.e., they lack a stable three-dimensional structure (Dyson and Wright 2005; Tompa 2005; Sickmeier et al. 2007), but those that are ordered have intriguing structural features (Rancurel et al. 2009). For instance, the protein p19, originated de novo in the plant virus family Tombusviridae (Rancurel et al. 2009), has a previously unknown tertiary structure and a previously unknown mode of binding to small interfering RNAs (Vargason et al. 2003). This suggests that de novo gene origination can lead to evolutionary innovations in protein structure and function (Rancurel et al. 2009; Bornberg-Bauer et al. 2010; Kaessmann 2010; Abroi and Gough 2011).

A prerequisite to identify a de novo gene is that it must have a monophyletic distribution in one clade, the “focal” clade (fig. 1a, taxa T1, T2, and T3), while being absent from organisms outside this clade (fig. 1a, taxa T4 and T5). We note that this prerequisite is necessary but not sufficient. Genes fulfilling it may have an ancient origin, older than the focal clade, but they might have diverged beyond recognition outside this clade (Elhaik et al. 2006). Alternatively, they may have entered the focal clade through horizontal gene transfer. These confounding factors can be easily excluded for genes that arose by overprinting (fig. 1b, blue arrows) within a pre-existing “ancestral” reading frame (red arrows). Specifically, if the ancestral gene is present outside the focal clade (e.g., taxa T4 and T5 in fig. 1b), one can exclude divergence beyond recognition and horizontal gene transfer, because in either case, the ancestral gene would not be present outside the focal clade.

Taking the above considerations into account, we ask the following questions about the evolutionary dynamics of de novo genes: Do de novo genes adapt to their genome, and if so, how rapidly? What is their rate of evolution? How do the de novo genes influence the genes that they overlap? Do de novo genes contribute to viral fitness? To answer these questions, we analyzed the evolution of 12 independent, experimentally validated de novo genes in RNA viruses. We estimated their relative age and compared their evolutionary dynamics to that of the ancestral gene from which they originated.

FIG. 1. Monophyletic distribution of genes originated de novo. (a) A gene that originated de novo (blue arrows) will exhibit a monophyletic distribution among related taxa. However, this distribution could also be the result of divergence of the gene beyond recognition or of acquisition of the gene through horizontal gene transfer (HGT). (b) For a gene that originated de novo (blue arrows) by overprinting an ancestral reading frame (red arrows), these confounding factors can be excluded (see Introduction). Colors are displayed in the electronic version of the article.

Results
As described in Materials and Methods, we assembled a data set of 12 experimentally validated pairs of overlapping protein-coding genes (table 1), in which the ancestral and the de novo genes could be unambiguously identified from the phylogenetic distribution of their homologs. We predicted the structural and functional organization of their protein products (fig. 2, discussed further later, see also Materials and Methods). All overlapping regions were longer than 220 nucleotides (table 1). Among the 12 de novo genes in our data set, nine genes overlap completely with their ancestral gene, whereas three overlap only partially: the machlomovirus p31 gene, the omegatetravirus p17 gene, and the ilarvirus 2b gene (fig. 2).

Table 1, Length of the Overlapping Region (nt), clades 1 to 12: 273, 468, 624, 382, 1,880, 698, 308, 451, 312, 561, 834, 228.

Three Quantifiers Are Useful to Describe the Evolutionary Dynamics of Overlapping Genes
We investigated three properties of the ancestral and de novo genes, and the proteins they encode (see Materials and Methods). The first property is the relative sequence divergence, a proxy for the rate at which a protein changes its sequence. Relative divergence values above 1 indicate that a coding region evolves faster than a reference sequence, in our case the full length sequence of the ancestral gene of the pair considered.

The second property is the selective constraint (dN/dS), which estimates the strength of purifying selection on a gene by its ratio of nonsynonymous to synonymous nucleotide substitutions. Values of dN/dS below 1 are evidence of purifying selection whose strength increases with decreasing dN/dS. In principle, values of dN/dS exceeding 1 might suggest that a gene evolves under positive selection (Nei and Gojobori 1986). However, the method we used to calculate dN/dS (Sabath et al. 2008) does not test statistically for positive selection, which is often limited to few sites within the gene (Nielsen and Yang 1998; Zhang et al. 2005). Therefore, values of dN/dS above 1 should be taken to indicate either neutral evolution or positive selection.

The third and final property
istheCodonSimilarityIndex LengthOverlaRegion 273 468 624 382 1,880 698 308 451 312 561 834 228 (oCfSaI)g,ewnheiachndmtehaastuorefsththeeressimtoilfatrhiteygbeentwomeeenctohnetcaoindionngiuts.aCgSeI isbasedon thesamecalculationsastheCodonAdaptation NumberofSequences(orSequencePairs)intheAnalyses(diver-gence,/,andCSI)dNdS 10,9,and5 3,2,and3 70,33,and11 3,1,and3 120,1,and16 10,0,and5 15,6,and6 3,0,and3 6,3,and4 10,1,and5 28,13,and8 10,9,and6 cI(r(dnSesoeehfdmeeWaenrrxpeMopenav(arficCoaitersAnesogtrsdInei)eac,n.tLloaOseiimnacv1psnoep9tardmea8iarar7lsMmedl).,ddoTeobtetnafuhhblntaeyoloseduudevsssa2o)teet.hsdagsrueetsmmnheeeteepmsaorrseoaefuvrpshirotzeeilegrvotsohefifeltystshcihgoefeeonxdrripofigearnecselunalsunlasotesntsmadlcygoeeegffsaetabstrnsthaiaeelia/sssr Dow n DeNovoFrame PB1-F2 ProteinL* ProteinI p17 ORF69(movementprotein) ORF3(longdistancemovementprotein) Protein2b p31 Pog ProteinC Largeenvelopeprotein(L) ProteinB (1atg1phne00ean(cid:3)(cid:3)cieer34mess)4t,)de(r,0aaaal.nnt6rwd2geCe)oSneu-isIxsenihsodldo.ifebewTdarithn(sec6Wltoe%rwmsoi)ltnce,raoagrwglxenhCorgietneSusrnIeedealsveesisagcl(tntou0hievfe.e6dest6dh)(cirePfoaafen<nndrskdietf5fnret.at4chreieen(cid:2)asntttb,c1eo(e0tPPfw(cid:3)bd<<e5ee)et72nwnt..19hoet(cid:2)(cid:2)eavhnnoe loaded from https://a c n) meanrelativedivergenceoftheancestralanddenovogenesis ad AncestralFrame PB1 Polyprotein Nucleocapsid(N) Capsidprotein Replicase ORF3(movementprotei Polymerase Coatprotein Structuralpolyprotein Phosphoprotein(P) Polymerase(P) ProteinA hodbRyfeiNgaWhnAuno(sce4vipena3osogs%tlkgry)aetem,hldnaeaeeshnriooassdewsrtieqdhgutieenhdedaenontsiomcfeefevdeapor.irdenTognivopecoeneetrfrehgbtsieeiesena(tse4cwcnd7hee%edepo,g)nwe.fentnethdoheemeosstneeiRmlteNh(casAetteieo-teddinmeMtipenheatissnteinendtrcisemieaintlaeyst emic.oup.com/mbe/a of nera) athnids eMsteitmhaotdesd).aWgeeitnhefinguprleostt3edanthde4t.hTreheephroorpizeorntiteaslaagxaisinostf rticle TaxonomicDistributiontheOverlap Singlespecies 3speciesinsamegenus 
2genotypegroups Wholegenus Wholegenus Wholegenus Wholegenus Wholegenus(containsasinglespecies Wholegenus 2generainsamefamily Wholefamily(contains2ge Wholegenus fgntsgaiheneegalennutdeeeecrrdetssCeis,lvdS(a3ewlIetaiabhv(ncnfieeeoodglrnvede.4odsai4tvscrg)ebao.erytignrnNhreevteeonssisrtcpsaCeaeholrSoetng(Ihwfiedciangssnat.ultct3costouaa)ht,ll)etcah,whutesreeileetaodlhllteeraeiffctgttodit.hivrinTveefaephotmdaericoiiorvvonsseenisnrrttsogtigtmfirerlceeanhaeciclnoegoaemtnefxnt(adiolsfieynelsgsodhno.go(ro3otsiwbvhgeuoi)eess-, -abstract/29/12/3767/10 Species InfluenzaAvirusH5N1 Theilovirus SARScoronavirus Dendrolimuspunctatustetravirus Turnipyellowmosaicvirus Tobaccobushytopvirus Spinachlatentvirus Maizechloroticmottlevirus Israelacuteparalysisvirusofbees Measlesvirus HepatitisBvirus StripedJacknervousnecrosisvirus scaewwMponrheaaorcBrlertneeeteetsdrhsrltdeopieraintwaofrlofdslne,tgsrahdvw.eenaeWnneldtutseofie.lesMorsuvps(ertsaoeeetlsbehuddxtoe)oaaasdfaimnnsntfa)ehoid.lndyerTesdihtfsseotwiuhnornosegf,poslcdreeavoeoiogrvtgtshraesg,errnewsieinanseehinseof.epcsingreRr(euoble(airpAnglesuereNede3rst)sCaoisienitOinaossndVensidnaieAccplabi)fhnatigterecopuaosgatrtreeenefronlesey4tr--l, 06311 by guest on 17 Novem heStudy. 
Genus InfluenzavirusA Cardiovirus Betacoronavirus Omegatetravirus Tymovirus Umbravirus Ilarvirus Machlomovirus Aparavirus Morbillivirus Orthohepadnavirus Betanodavirus boRebefsloaertrevivaetciooDnnissvi.deregrienngcethem together and synthesizing our ber 2018 pingGenesint Family Orthomyxoviridae Picornaviridae Coronaviridae Tetraviridae Tymoviridae — Bromoviridae Tombusviridae Dicistroviridae Paramyxoviridae Hepadnaviridae Nodaviridae Fvo(isriefegrsddugige)renennisicofi3enavceaofpoanprrrtlerlytyoshethedneoiiptfnrfsaieszifror(oesbnrnlotuteafealfa)rc(n.ohaTcmltheohsevtzoerreuarerlglgoaphrp,erPstothsh<tieoeein0nrre.esl0giln0(arre1eteis2vods)efi)otaasnhneneqddcauoteneehqncfefiueccspaetilaeodintrroisst- 1.Overlap GenomeAccessionNumber NC_007358 NC_001366 NC_003045 NC_005899 NC_004063 NC_004366 NC_003809 NC_003627 NC_009025 NC_001498 NC_003977 NC_003448 oaannncceee,ssstturrgaaglleppsrrtoointtegeiintnhss.aItenvoconolvanevtreaartsatga,edsteihmneoiolavvroerrpalartopetpeaiinsngsthr(ebegluifouen)lls-slheoonfwgththea Table Clade 1 2 3 4 5 6 7 8 9 10 11 12 hpirgohteerinsr.elAatcicvoerddiinvgelryg,etnhceestlohpanestohfeitrheantcweostrraelgroevsseirolanplpininegs 3769 MBE Sabathet al. . 
are significantly different (P < 3.2 × 10⁻²⁵). The range of values of the relative divergences between pairs of ancestral proteins is generally narrow or moderate. For instance, the relative divergences of the de novo gene of tymovirus (taxon 5), which encodes the movement protein, range between 1.2 and 2.8 and thus vary by less than a factor of three. Two notable exceptions are the youngest de novo proteins of taxa 1 and 3 (respectively, influenza virus A PB1-F2 and betacoronavirus protein I), which exhibit a considerable range of divergences (from 0 up to 10.4).

FIG. 2. Structural and functional organization of the overlapping genes we studied. Proteins encoded by overlapping genes are shown to scale (scale bar, 100 aa; ordered and disordered protein regions are distinguished). For each protein pair, the ancestral protein is shown on the bottom and the de novo protein on top. B1, base domain 1; cc, coiled coil; Le, Leader region; PA2, phospholipase A2 domain; RdRP, RNA-dependent RNA polymerase domain; tm, transmembrane segment; z, zinc-binding region.

Selective Constraint
Figure 3b presents the selective constraint (dN/dS) for the ancestral and de novo genes in each group of viruses we studied. The ancestral and de novo genes are subject to very different constraints, as attested by the significant difference (P < 4.9 × 10⁻¹³) between the slopes of their respective regression lines, and the fact that the ancestral regression line has a positive slope, whereas the de novo regression line has a negative slope. The low values of dN/dS (below 1) for most genes indicate that they are under strong selective constraints (purifying selection), i.e., mutations that change the protein sequence are likely to be selected against. In several viruses, the values of dN/dS for the ancestral genes are particularly low, suggesting an extreme functional or structural constraint. For example, the ratio dN/dS of the ancestral gene in viruses 1, 4, and 5 is below 0.05. In contrast, in some cases, either the de novo gene (the influenza virus A PB1-F2, the betacoronavirus protein I, and the tymovirus movement protein, respectively, from groups 1, 3, and 5) or the ancestral gene (the ilarvirus polymerase and the betanodavirus protein A, from groups 7 and 12, respectively) exhibits dN/dS ≥ 1, which indicates neutral evolution or positive selection. As explained at the beginning of the Results section, the test we employ cannot distinguish between them. For all groups, the trends in relative divergence (fig. 3a) and selective constraints (fig. 3b) were consistent with one another. For example, the de novo proteins of cardioviruses are more highly diverged than the ancestral proteins and they also show a lower selective constraint.

Table 2. Mean Values of Three Evolutionary Properties for Ancestral and De Novo Genes.
                       Ancestral(a)    De Novo(a)     P
CSI                    0.66 (0.09)     0.62 (0.11)    5.4 × 10⁻⁵
Relative divergence    1.00 (0.40)     1.85 (1.20)    7.1 × 10⁻³⁴
Selection intensity    0.46 (0.52)     0.75 (0.55)    2.9 × 10⁻⁴
(a) Numbers in parentheses are standard deviations.

FIG. 3. Evolutionary dynamics of ancestral (red) and de novo genes (blue). The vertical axes show (a) relative divergence and (b) selective constraint (dN/dS) for the 12 taxa. The horizontal axis represents the evolutionary distance from the origin of each de novo gene (i.e., the estimated age of genes within the clade). Regression lines are plotted for visualization of general trends. Low dN/dS values represent strong selective constraints (see text). Note that dN/dS in (b) could only be calculated for gene pairs that have less than 50% divergence at the amino acid level (see Materials and Methods). No selective constraint data could be calculated for cases 6 and 8 (bottom panel) as the sequence pairs in these clades have all diverged beyond 50%. Where neighboring groups had similar ages, we shifted their position slightly for visual clarity (groups 5 and 6).

Codon Similarity Index
Figure 4 presents the CSI values of ancestral and de novo genes. The slopes of the two regression lines show a significant difference (P < 0.032). The slope of the ancestral regression line is not significantly different from zero (P > 0.96), whereas the de novo regression line has a positive slope (P < 0.003). Relative to their ancestral overlapping genes, most de novo genes show lower CSI values, although the ranges of CSI values in de novo and ancestral genes overlap markedly in most groups. An exception is omegatetraviruses (taxon 4), in which the CSI values of the de novo p17 genes are all higher than the CSI values of the ancestral capsid genes. The CSI values for homologous genes can take a wide range of values. For instance, the CSI of the betacoronavirus I gene varies from 0.27 to 0.67. Overall, the data suggest that the codon usage of de novo genes becomes slowly assimilated into that of the host genome.

FIG. 4. Codon Similarity Index (CSI) of ancestral (red) and de novo genes (blue). The horizontal axis represents the evolutionary distance from the origin of each de novo gene (as in fig. 3). Regression lines are plotted for visualization of general trends. High CSI values indicate high similarity between the codon usage of a gene and the codon usage of the rest of a genome. Colors are displayed in the electronic version of the paper.

De Novo Genes Have Different Properties Depending on Their Age of Origin
Overall, we found that the differences between ancestral and de novo genes appear to decrease with time. Taxa 1–5, with the shortest distance from the origin (corresponding to the youngest de novo genes), all exhibit a similar pattern (fig. 3): the de novo genes show higher relative divergence with respect to their ancestral overlapping genes and weaker selective constraint.

The pattern that contrasts most with that of taxa 1–5 occurs in taxa 7 and 12 (Ilarvirus and Betanodavirus, respectively), where the de novo genes have a more ancient origin. Here, the de novo genes evolve more slowly and exhibit stronger purifying selection relative to their ancestral overlapping genes (fig. 3). In other words, here mutations that change the sequence of the de novo proteins are more deleterious, on average, than mutations that change the sequence of the overlapping ancestral proteins.

Overall, our observations suggest that older de novo genes are more adapted to their genome and evolve under stronger purifying selection. This inference is supported by experimental data on the fitness effects of mutations in de novo genes of different ages (table 3). Mutations in the youngest de novo genes (taxa 1–3) have little to moderate effect, whereas suppression of older de novo genes (taxa 6 and 10–12) has severe effects.

Table 3. Evidence of Expression, Function, and Fitness Effect of the De Novo Genes in the Study.

Group 1, Influenzavirus A
  Evidence for expression: Chen et al. (2001b)
  Function(s): Virulence factor (Zamarin et al. 2006). Involved in regulation of polymerase activity (Mazur et al. 2008). Function seems strain-specific and host-specific and is disputed (Krumbholz et al. 2011).
  Fitness effect when the novel gene is suppressed: Little or no effect
  Description of effect and references: Suppression of PB1-F2 neither affected viral replication nor virus loads in the lungs of mice (McAuley et al. 2010).

Group 2, Cardiovirus
  Evidence for expression: van Eyll and Michiels (2002)
  Function(s): Involved in the establishment of permanent infections of the central nervous system (Chen et al. 1995); antiapoptotic effect in cell culture (Ghadge et al. 1998).
  Fitness effect when the novel gene is suppressed: Moderate effect
  Description of effect and references: Suppression of L* decreases the ability of Theiler’s virus to induce a chronic infection of the central nervous system (Stavrou et al. 2010).

Group 3, Betacoronavirus
  Evidence for expression: Senanayake et al. (1992) and Chen et al. (2001a)
  Function(s): Unknown
  Fitness effect when the novel gene is suppressed: Little or no effect
  Description of effect and references: Suppression of Protein I expression led only to a reduced plaque size, suggesting a minor effect on fitness (Fischer et al. 1997).

Group 4, Omegatetravirus
  Evidence for expression: Hanzlik et al. (1995)
  Function(s): Unknown
  Fitness effect when the novel gene is suppressed: Unknown
  Description of effect and references: Unknown

Group 5, Tymovirus
  Evidence for expression: Weiland and Dreher (1989)
  Function(s): Viral movement through the plant (Bozarth et al. 1992).
  Fitness effect when the novel gene is suppressed: Severe effect
  Description of effect and references: A knock-out mutant of the movement protein replicates only at low levels in protoplasts (Weiland and Dreher 1989).

Group 6, Umbravirus
  Evidence for expression: Ryabov et al. (1998)
  Function(s): Long-distance (systemic) movement in plants (Ryabov et al. 1999); stabilizes viral genomic RNA.
  Fitness effect when the novel gene is suppressed: Severe effect
  Description of effect and references: Long-distance movement is abolished in the absence of ORF4 plants (Ryabov et al. 1999).

Group 7, Ilarvirus
  Evidence for expression: Xin et al. (1998)
  Function(s): Unknown
  Fitness effect when the novel gene is suppressed: Unknown
  Description of effect and references: Unknown

Group 8, Machlomovirus
  Evidence for expression: Scheets (2000)
  Function(s): Unknown
  Fitness effect when the novel gene is suppressed: Unknown
  Description of effect and references: Unknown

Group 9, Aparavirus
  Function(s): Unknown
  Fitness effect when the novel gene is suppressed: Unknown
  Description of effect and references: Unknown

Group 10, Morbillivirus
  Evidence for expression: Wardrop and Briedis (1991)
  Function(s): Virulence factor (Patterson et al. 2000).
  Fitness effect when the novel gene is suppressed: Severe effect
  Description of effect and references: Suppression of C results in much milder symptoms and lower mortality in mice (Patterson et al. 2000).

Group 11, Orthohepadnavirus
  Evidence for expression: Peterson (1981)
  Function(s): Viral envelope glycoprotein (Beck and Nassal 2007).
  Fitness effect when the novel gene is suppressed: Severe effect
  Description of effect and references: Deletions within the S domain of the envelope protein drastically reduce infectivity (Le Duff et al. 2009).

Group 12, Betanodavirus
  Evidence for expression: Iwamoto et al. (2005)
  Function(s): Blocks RNA interference (Fenner et al. 2006).
  Fitness effect when the novel gene is suppressed: Severe effect
  Description of effect and references: Suppression of B2 causes a severe impairment in the intracellular accumulation of viral RNA in cell culture (Fenner et al. 2006).

Discussion
Several previous studies have examined the codon usage of individual or small groups of overlapping genes (McGeoch et al. 1985; Keese and Gibbs 1992; Pavesi et al. 1997; McVeigh et al. 2000; Lee et al. 2010), their rates of evolution (Mizokami et al. 1997; Sanz et al. 1999; Jordan et al. 2000; Fujii et al. 2001; Nekrutenko et al. 2005; McGirr and Buehuring 2006; Hernandez et al. 2010), and selective constraints (Fujii et al. 2001; Hughes et al. 2001; Guyader and Ducray 2002; Li et al. 2004; Hughes and Hughes 2005; Narechania et al. 2005; Campitelli et al. 2006; Holmes et al. 2006; McGirr and Buehuring 2006; Obenauer et al. 2006; Pavesi 2006; Suzuki 2006; Pavesi 2007; Zaaijer et al. 2007). Overall, our observations agree with those of previous studies: ancestral and de novo genes differ in these properties. Nevertheless, we were able to improve on these studies in several ways. First, by examining the phylogenetic distribution of overlapping gene pairs, we were able to identify reliably (see later) which of the two genes is ancestral and which one is the de novo gene. Second, our new method allowed us to estimate the relative age of origin of de novo genes and to correlate this age with several quantifiers of their evolutionary dynamics. Thus, it allowed us to analyze how the evolutionary forces affecting de novo genes change over time. Third, we used a larger data set than most studies mentioned earlier, which were carried out on individual genes. Fourth, we used a method specifically tailored to overlapping genes (Sabath et al. 2008) to study the selective constraints these genes are subjected to. Other methods can give misleading results when applied to overlapping genes (Holmes et al. 2006; Suzuki 2006; Pavesi 2007; Sabath et al. 2008).

De Novo Genes Adapt to Their Genomes
Our results suggest that de novo genes do adapt to their genome. More specifically, de novo genes evolve very rapidly shortly after their origin. As they age, they tend to experience increasingly severe selective constraints, and their codon usage tends to approach that of the ancestral gene from which they originate.

Our results are consistent with population genetics theory (Hartl and Clark 1997). Viruses have large population sizes. At such large sizes, natural selection is highly efficient, which has two consequences regarding de novo genes. First, they are likely to become fixed in a population only if they provide some selective advantage. Second, even though a de novo gene might initially only provide a very small fitness benefit, in a large population this fitness benefit can be sufficient to cause the gene’s fixation. Immediately after its origin, the sequence of a de novo gene will typically be far from optimal for the (rudimentary) function it provides, unlike a gene originated through modification of an existing gene. Consequently, one would expect a de novo gene to evolve rapidly shortly after its origin and to become better adapted as it ages, resulting in increased selective constraints and a decreased rate of sequence change.

Our results also suggest that in general, the ancestral genes are more constrained in sequence than the de novo genes which overlap them. However, this pattern can be reversed in old de novo genes. In particular, three de novo genes of our data set, which encode the ilarvirus 2b protein, the morbillivirus C protein, and the betanodavirus B protein (taxa 7, 10, and 12), are subject to stronger selective constraints than their ancestral genes (fig. 3b). We speculate that this reversal could reflect an evolutionary “tug of war” between the two genes over the dominance of the sequence. An ultimate “victory” of a de novo gene in this tug of war would consist in the disappearance of the ancestral gene. We speculate that some viral genes that do not overlap any other gene today may have originated through overprinting but have eventually lost the overlap. A similar scenario has been proposed for an overlapping region of two protein-coding genes within the genome of archaeal Thermoplasma (Rogozin et al. 2002). These overlapping genes are nonoverlapping in other related genomes, possibly because of duplication and consequent loss (Rogozin et al. 2002).

Our Evolutionary Inferences Are Robust and Biologically Coherent
For most overlapping gene pairs in our data set, the identification of the de novo gene is highly reliable, as the ancestral gene has a much wider phylogenetic distribution than the de novo gene. For instance, the de novo movement protein of tymoviruses (taxon 5) has homologs only in this genus, whereas its ancestor, the methyltransferase-guanylyltransferase, has homologs in over a dozen families (Rozanov et al. 1992).

One important caveat of our analysis is that we are unable to estimate the absolute age of de novo genes but only their age relative to the divergence of a viral housekeeping gene that encodes the RNA-dependent RNA polymerase. Our relative age estimates of de novo genes, however, are broadly consistent with their taxonomic distribution (table 1, column 6): as expected, young de novo genes show a more restricted distribution than older de novo genes. The youngest genes (groups 1–3 in fig. 3) are found in less than one genus, with gene 1 occurring only in a single species (these observations are not due to sequencing bias, as each taxon considered contains several species or genera). Conversely, all oldest de novo genes (6–12) are found at least throughout a whole genus (in two genera for cases 10 and 11).

A case where gene age may have been overestimated is that of taxon 3 (betacoronavirus I gene). Our analysis suggested that the I and ORF9b genes have a common origin (see supplementary information, Supplementary Material online, case 3) despite their different lengths (98 aa and 207 aa, respectively) and lack of significant sequence similarity (not shown). This inference is based on assuming functionality for the unannotated ORFs in five closely related genomes (supplementary fig. S4, Supplementary Material online). Consequently, we calculated the age of the I gene by considering the node common to I and ORF9b (marked with blue line, supplementary fig. S5, Supplementary Material online, group 3). In the alternative scenario where these two genes have independent origins, their estimated age would be reduced. Nevertheless, the overall pattern of the results would remain unchanged (not shown). Another caveat to our study is that RNA secondary structure and selection pressure for high protein expression level may be partly responsible for the differences we observed in codon usage (Plotkin and Kudla 2011).

Finally, the estimated rate of evolution of most ancestral and de novo proteins (fig. 3) is generally coherent with their function or effect on viral fitness (table 3). For instance, the ancestral methyltransferase-guanylyltransferase of tymoviruses experiences severe constraints, as expected from an enzyme, whereas the de novo protein I of betacoronaviruses, which is dispensable for replication, experiences low or no constraints. We note two exceptions. First, the orthohepadnavirus replicase gene, encoding an essential reverse transcriptase function, is not subject to very strong selective constraint (e.g., dN/dS between 0.28 and 0.78, taxon 11 in fig. 3b). However, this discrepancy is readily explained. The overlapping region of the replicase gene in fact encodes two domains (fig. 2): a disordered, hypervariable linker and the reverse transcriptase domain. The relaxed selective constraint is the result of a high dN/dS for the linker (average of 0.94) and a very low dN/dS for the reverse transcriptase domain (average of 0.15). Second, the tymovirus movement protein gene has a dN/dS of 1, suggesting an absence of selective pressure, despite encoding an important function that allows the spread of viral RNA between cells. Again, this discrepancy can be attributed to the fact that the movement protein consists of a slowly evolving region (around aa 1–400) and a fast-evolving region (C-terminal 200 aa) (results not shown).

Young De Novo Genes Might Have Strain-Specific Functions and Be Difficult to Detect by Sequence Analysis
Our results have two practical implications. First, they suggest that recently evolved genes might have strain-specific functions, or possibly no function, as suggested previously (Trifonov and Rabadan 2009) for PB1-F2 (taxon 1), the youngest de novo gene in our data set. Experimental studies should thus take this possibility into account. Our current method to estimate dN/dS does not tell us whether the elevated dN/dS values observed in some young de novo genes indicate neutral evolution or positive selection. However, it is possible that they come not only from neutral mutations but also from beneficial mutations subject to positive selection, which would reflect evolutionary adaptations.

Second, our results have implications for the identification of overlapping genes. Current bioinformatics methods to detect overlapping genes use the signature of purifying selection (Firth and Brown 2005, 2006; Sabath et al. 2009; Sabath and Graur 2010). These recently developed methods have had great success and led to discoveries in many viral taxa (Chung et al. 2008; Firth 2008; Firth and Atkins 2008a, b; Sabath et al. 2009; Firth and Atkins 2009a, b, c; Firth and Atkins 2010; Firth et al. 2010). However, the signature of purifying selection is mostly absent in young de novo genes. Thus, the number of young de novo genes may be much larger than it appears, because these methods simply cannot detect many such genes.

Conclusion and Perspectives
In closing, we point to several directions for future research. Approaches to estimate selection pressures in overlapping genes (e.g., Sabath et al. 2008) lag behind those for nonoverlapping genes (reviewed in Anisimova and Kosiol 2009), which can detect lineage-specific and site-specific selection pressures. The development of advanced methods could, for instance, reveal the role of positive selection in de novo gene origination and perhaps predict interactions between proteins encoded by overlapping genes, such as the Rz and Rz1 genes of bacteriophage lambda (Zhang and Young 1999). Finally, further research is needed to shed light on how exactly de novo overlapping genes originate and become established, i.e., the mutational events that result in their expression, their frequency, and their effects on viral fitness.

Materials and Methods
Sequence Analyses
We extracted all fully sequenced viral genomes from the NCBI viral genome database (Bao et al. 2004) in June 2011 and identified all viral proteins annotated in these genomes. All homology searches were carried out against a database of these proteins, using PSI-basic local alignment search tool (BLAST) (Altschul et al. 1997) with an E value cutoff of 10⁻⁶. We performed all multiple sequence alignments using MAFFT (Katoh et al. 2002) and constructed phylogenetic trees with the BIONJ method (Gascuel 1997). We rooted these trees with the mid-point rooting method (Farris 1972). We predicted the domain organization of proteins encoded by overlapping genes using ANNIE (Ooi et al. 2009).

Collection of Viral Overlapping Genes
We identified from the literature a set of 40 overlapping gene pairs for which the expression of a protein product from two reading frames had been experimentally verified. All gene pairs in this data set come from viruses that infect eukaryotes. Among these gene pairs, we selected 29 pairs coming from viruses whose genome encodes an RNA-dependent RNA polymerase (RdRP), to facilitate comparison among clades (see later). We further narrowed the data set to overlapping gene pairs in which we could identify which gene had originated de novo (see procedure described later). In total, we obtained 12 gene pairs that correspond to 12 cases of de novo origin, stemming from 12 families of RNA viruses that met these criteria. The data set shares some genes with a previously published data set (4 cases out of 12: groups 4, 6, 8, and 11 below) (Rancurel et al. 2009). The reason why we could include only a minority of the genes published in the Rancurel data set (4 out of 17) is that we restricted ourselves to considering pairs in which both ancestral and de novo proteins had less than 50% amino acid divergence (percentage of identity). Table 1 lists, for each gene pair, the species taxonomy, the genome accession number, the names of the overlapping genes, and their lengths. In the rest of the article, we

each genome listed in table 1, we identified the RdRP domain by using HHpred (Soding et al. 2005) against the PFAM database (Finn et al. 
2008) with an E-value cutoff of willrefertoeachcaseeitherbyitsgenusorbythenumberof 10(cid:3)10. We identified the orthologous RdRP domains within itsclade,aslistedintable1.Table3listsbibliographicalevi- the other genomes of each clade using PSI-BLAST, aligned dence about the expression, function, and fitness effect of them,andconstructedtheirphylogenetictrees.Supplemen- mutationsinthedenovogene. taryfigureS6,SupplementaryMaterialonline,presentsthese “RdRPtrees.” Identifying De Novo and Ancestral Genes WedefinedthefocalcladeoftheRdRPtreeasthesmallest cladethatcontainsallthetaxafoundwithinthefocalcladeof Toidentifydenovogenecandidates,weappliedthecriterion thephylogenetictreeoftheoverlappinggenes.TheLCAfor ofmonophylystatedintheintroduction:oneofthegenesin theRdRPtreewasdefinedasearlier.Wecomparedthefocal D anoverlappingpair—theancestralgene—mustoccurineach o cladesintheRdRPtreeandthetreeofoverlappinggenesand w member of a viral clade, whereas the other gene—the de n found that in 9 of 12 cases the focal clades were identical, lo novo gene—must be restricted to a single subclade, the a d whereas in three cases (2, 5, and 12, within the genera e focal clade (fig. 1b). To find genes that meet this criterion, d we first identified, for each gene pair, homologous protein Cardiovirus, Tymovirus, and Betanodavirus, respectively), we fro m foundminordifferences(supplementaryfig.S6,Supplemen- productsinrelatedgenomes.(Wefoundnoevidenceofdu- h plicated genes in the genomes under study, hence all our taryMaterialonline).Overall,thiscomparisonsuggestedthat ttps homologsareorthologs.)Wealignedthehomologousprotein theRdRPgenesandtheoverlappinggenepairshavesimilar ://a sequences of the ancestral protein (which is more phyloge- evolutionaryhistories.OnthebasisoftheRdRPtree,wethus cad estimated as a proxy for the age of a de novo gene the se- e neticallywidespread)andconstructedtheirphylogenetictree. 
m quence divergence of the RdRP domain in each focal clade ic Bymanualexaminationofthesetrees,weidentified12cases .o since the origin of the de novo gene, i.e., its accumulated u omfodneopnhoyvloy.oWrigeinsc(atanbnleeds 1thaendre3la)tethdagtemnoemt tehseocfrittheeriofnocoafl genetic distance D along the tree branches since the LCA. p.com clade for unannotated ORFs to overcome missing genes To estimate D, we generated 100 bootstrap RdRP trees. For /m duetofaultannotation.Thesetreesalsoallowedustoinfer eachtreei(1(cid:5)i(cid:5)100),wecalculatedDi,theaveragelength be/a theinternalnodeofatreeclosesttotheoriginofthedenovo ofthephylogenetictreebranchesbetweentheLCAandeach rtic gene. We call this node the last common ancestor (LCA, eaxmtapnletignefnigoumree1tbh,aDt cwoonutladincsaltchueladteeanso[vdo(LgCeAn,eT.1F)o+r tdh(eLCeAx-, le-ab marked with a blue circle in the hypothetical example of i s fig. 1b). To ensure that the identification of the LCA is not T2)+d(LCA, T3)]/3, where d(LCA, T1)=b1+b2, d(LCA, trac biased by genome annotation, we manually examined the T2)=b3+b2, and d(LCA, T3)=b4, and b1, b2, b3, and b4 t/2 9 related genomes for presence of homologous unannotated are the branch lengths shown in the figure. Finally, we esti- /12 ORFs.Weprovidedetailedexplanationsofthechallengesin Dma¼teð1D=10a0sÞPthe100aDve.rWagee eostviemrataeldl athllebrbaonochtstlreanpgthtsrebesy, /376 dinefonromvaotiogenn,eSuapnpdleLmCeAntidareynMtifiactaetriioanloinnlitnhee(ssuupppplleemmeennttaarryy the BIONG mei¼th1odi (Gascuel 1997). In supplementary 7/100 figureS7,SupplementaryMaterialonline,wepresentDand 6 figs.S1–S4,SupplementaryMaterialonline).Thephylogenetic 3 1 the standard deviation within each group. For convenience, 1 treeandthecorrespondinggenomicmapsofthe12casesare b weorderedthecladesintable1accordingtoincreasingD. y presented in supplementary figure S5, Supplementary g u Material online. 
For all gene pairs that met the monophyly es criterion,wealsodeterminedtheDNAsequencealignments Analysis of the Evolutionary Properties of t o n correspondingtotheaminoacidsequencealignments(ofthe Overlapping Genes 17 ancestralproteins),toenablethecalculationsdescribedlater. For each of our 12 overlapping gene pairs (table 1), we col- No v lected the full sequences of the ancestral and the de novo e m Estimation of the Relative Age of Origin of the De genes,thesequenceoftheregionofthegenomewherethey be Novo Genes overlapped, and the sequences of other genes annotated in r 20 the genome in which they occur. We also collected this in- 18 Tounderstandtheevolutionarydynamicsofdenovogenes, formation for homologous overlapping genes in related ge- oneneedstoestimatetheirageoforiginandtocomparethis nomeswithinthefocalclade(seeearlier).Asallsubsequent ageamongdifferentclades.Thisestimationismadedifficult analyses are carried on sequences within the focal clade, by differences in mutation rate, population dynamics, and which is defined by the distribution of the de novo protein selection pressures among different viral genomes and rather than the ancestral protein, it is independent of the genes (Duffy et al. 2008). To alleviate these difficulties, we BLAST cutoff. We used these data to study the following calibrated our estimates with a reference molecule, the threepropertiesofancestralanddenovogenes: RdRPproteindomain,whichiscommontoallcladesinthe study(Bruenn2003).TheRdRPdomainhasacommonorigin 1) The“relativesequencedivergence”betweenpairsofho- andsimilartertiarystructureinthecladeswestudy(Bruenn mologousproteinsthatthegenesencode.Wedefinethe 2003),andthusweassumethatasafirstapproximationitis sequence divergence between two proteins as the pro- subject to similar functional and structural constraints. In portion of amino acids in which they differ and the 3776
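
As an illustration of the age proxy D defined in the section on estimating the relative age of origin, the following minimal Python sketch averages LCA-to-tip path lengths within one tree and then averages the per-tree values over bootstrap replicates. The function names and all branch lengths are hypothetical and chosen only for the example. Reporting the standard deviation across replicates is an assumption made here and is not necessarily the quantity shown in supplementary figure S7.

```python
from statistics import mean, stdev


def mean_lca_distance(paths):
    """Return D_i for one tree.

    `paths` holds one list of branch lengths per extant genome that
    carries the de novo gene; each list is the path from the LCA to
    that tip. D_i is the mean of the summed path lengths.
    """
    return mean(sum(path) for path in paths)


def age_proxy(per_tree_values):
    """Return (D, spread) over bootstrap trees.

    D is the mean of the per-tree D_i values. The spread is the sample
    standard deviation across replicates (an assumption of this sketch).
    """
    return mean(per_tree_values), stdev(per_tree_values)


# Hypothetical branch lengths for the three-tip example of figure 1b:
# d(LCA, T1) = b1 + b2, d(LCA, T2) = b3 + b2, d(LCA, T3) = b4.
b1, b2, b3, b4 = 0.10, 0.20, 0.30, 0.60
d_i = mean_lca_distance([[b1, b2], [b3, b2], [b4]])

# With real data there would be 100 per-tree values, one per bootstrap tree;
# three made-up values stand in for them here.
D, spread = age_proxy([0.4, 0.5, 0.6])
```

Representing each tip by its list of branch lengths keeps the sketch independent of any tree library. With real trees, the paths would first be extracted from each bootstrap tree with a phylogenetics package.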
