Inherent size constraints on prokaryote gene networks due to “accelerating” growth M. J. Gagen and J. S. Mattick ARC Special Research Centre for Functional and Applied Genomics, ∗ Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld 4072, Australia (Dated: February 9, 2008) Networks exhibiting “accelerating” growth have total link numbers growing faster than linearly with network size and can exhibit transitions from stationary to nonstationary statistics and from random to scale-free to regular statistics at particular critical network sizes. However, if for any reason the network cannot tolerate such gross structural changes then accelerating networks are constrained to have sizes below some critical value. This is of interest as the regulatory gene net- 4 works of single celled prokaryotes are characterized by an accelerating quadratic growth and are 0 size constrained to be less than about 10,000 genes encoded in DNA sequence of less than about 0 10 megabases. This paper presents a probabilistic accelerating network model for prokaryotic gene 2 regulationwhichcloselymatchesobservedstatisticsbyemployingtwoclassesofnetworknodes(reg- n ulatory and non-regulatory) and directed links whose inbound heads are exponentially distributed a over all nodes and whose outbound tails are preferentially attached to regulatory nodes and de- J scribedbyascalefreedistribution. Thismodelexplainstheobservedquadraticgrowthinregulator 4 number with gene number and predicts an upper prokaryote size limit closely approximating the observed value. ] N M I. INTRODUCTION these implicit findings. . o Accelerating networks are more prevalent and im- i The rapidly expanding field of network analysis, re- portant in society and in biology than is commonly b viewed in [1, 2], has provided examples of networks ex- realized—see the survey in Ref. [14]. In fact, any net- - q hibiting “accelerating”networkgrowth,where link num- work that requires functional integration and organiza- [ ber grows faster than linearly with network size. For in- tion (where the activity of any given node is dependent 2 stance, the Internet [3] appears to grow by adding links on the state of the network or different subnetworks) is more quickly than sites though the relative change over v bydefinitionanacceleratingnetwork,thatis,asthe net- 1 timeissmallandtheInternetappearstoremainscalefree work expands, the proportion of the network devoted to 2 and well characterized by stationary statistics [4]. Simi- control and regulation expands disproportionately. This 0 larly, the number of links per substrate in the metabolic in turn means that all such networks, sooner or later, 2 networks of organisms appears to increase linearly with must be limited in their size and complexity, which limi- 1 substrate number [5], the average number of links per tationscanonlybebreachedbychangingeitherthephys- 3 scientistincollaborationnetworksincreaseslinearlyover 0 icalnatureofthecontrolarchitecture(astatetransition) time [6, 7, 8, 9, 10], and languages appear to evolve via / or by reducing the functional integration. In the latter o accelerated growth [11]. case, where networks are hitting a complexity limit, fur- i b In the main, the chief focus of these studies has been ther growth in network size will likely display structural q- onlocating parameter regimesallowingacceleratingnet- transitionsfromrandomlyconnected,toscalefreestatis- : works to maintain scale free statistics and thereby to al- tics, to densely connected and perhaps finally to fully v low continued unconstrained growth. For example, an connected statistics. Should such networks be unable i X early study considered a growing network receiving Nα to successfully complete these transitions forany reason, r newlinks forα>0whenthe networksizeisatN nodes, then it is likely that network growth must cease entirely a but restricted analysis to the case α ≤ 1 as “Obviously, or that either a transitionto a nonacceleratingstructure α cannot exceed 1 (the total number of links has to be is required to permit further growth or novel technolo- smallerthanN2/2sinceonemayforbidmultiple links).” giesmustappearallowingthecontinuationofaccelerated and“Thedensityofconnectionsinrealnetworksremains growth. Exemplar accelerating networks displaying such rather low all the time, so one may reasonably assume size limits or structural transitions include (a) all forms that α is small.” [12]. Equivalent limits were considered ofeconomicmarketswherethelatestpriceofferedbyany in Ref. [13]. In such restricted parameter regimes net- participant instantly affects all other participants, (b) works could maintain scale free statistics, though this industrial companies and sectors implementing a Just- result carries the implicit but unexamined finding that In-Time business model where any worker can halt the alternateparameterregimespermittransitionsfromsta- entireproductionsystem,(c)errorpropagationnetworks tionary to nonstationarystatistics. This paper builds on linking anerrorsource with all affected nodes as studied in software analysis and in models of the propagation of diseases, bushfires, cracks, and electricity grid fail- ures, (d) in any dynamical system dependent on relative ∗Electronicaddress: [email protected] quantities so changes in one node instantly affects ev- 2 eryothernode suchasrelativetranscriptionfactorbind- numbers and genomes are indeed of restricted size (less ing probabilities or relative evolutionary fitness, (e) in thanabout10,000geneswithgenomesofbetween0.5and computer hardware and in cluster and grid supercom- 10 megabases [18]), in contrast to the genomes of multi- puter networks, and (f) in organizational networks [14]. cellulareukaryotes(withforhumans,about30,000genes In fact, it is well understood that social networks only and a genome of about 3 gigabases [19, 20]). Ref. [17] take on small world statistics when the network is large predicted the size limit N ≤ 20,000 genes as continued c enough—in small towns everyone one knows everyone genomegrowthrequiresthe number ofnew regulatorsto else so social networks are accelerating, and social net- exceed the number of nonregulatory nodes. works make a transition to small world statistics only as A satisfactory model of prokaryotic gene regulatory individual nodes saturate their connectivity limits [15]. networks requires some novel features. As mentioned Similar observations can be made about the scale free above, we introduce probabilistic link formation to al- Internet and World Wide Web—when sufficiently small, lowrapidacceleratedgrowthandcorrespondinglystricter these networks were likely accelerating until connectiv- size limits. (A different but related mechanism was in- ity capacities saturated forcing a transition to scale free troducedin Refs. [21, 22] which consideredthe effects of structures to permit further growth [14]. stochasticfluctuationsinthenumberofaddedlinkswith This paper develops an accelerating network model of each additional node.) In addition, we employ directed prokaryotic(singlecelled)generegulatorynetworkstoin- links and partition nodes into two classes where “regu- vestigatesizeandcomplexitylimitsinherentintheadop- lators” can source outbound regulatory links to regulate tionofanacceleratingarchitecture. Becauseourfocus is other nodes (both regulators and non-regulators), while onstructuraltransitions,weexplicitlydonotneedtore- “non-regulators” cannot source outbound links. (Ref. strict the degree of acceleration to low values of α ≈ 0. [23] has previously considered networks of distinguish- Rather, we permit this parameter to take on any value able nodes.) Further, experimental evidence presented including α > 1 and ensure that the network is not sat- below indicates that the distribution of inbound links is urated by making link formation probabilistic. The re- compact and exponential while the distribution of out- sulting novel “probabilistic” accelerating networks grow bound links is long-tailed and likely scale-free. As a re- byaddingonaveragepNα newlinkswithα>0andoth- sult, the heads and tails of our directed links are placed erwisearbitraryprovidedtheprobabilityofaddingalink accordingtotwodistinctdistributions. Altogether,these is suitably constrained p ≪ 1 so that total link number features allow the reproduction of the observed features remains less than of order N2. ofprokaryotegeneregulatorynetworksandsatisfactorily Thegeneregulatorymodelpresentedhereismotivated predicts the maximum prokaryotic gene count. by comparative genomics findings that the total number Ourapproachreproducingacceleratingnetworkstatis- of regulatoryproteins controlling gene expression(links) ticsforgrowingprokaryotegenomescomplementsandin- scalesquadraticallywiththe number ofgenesoroperons formsalternatenetworkingapproachesseekingtodeduce (nodes)inprokaryotes[16,17]. Thisquadraticgrowthre- or simulate the regulatory networks of particular organ- sults asthe numberoflinks made bya regulatorexploit- ismsfromgeneperturbationandmicroarrayexperiments ing homology dependent (sequence specific) interactions [24, 25, 26]. scales proportionally to the number of randomly drift- ing promotor sequences or effectively, with gene num- InSectionIIwecanvasstheavailableliteraturetochar- ber [17]. Hence, gene regulatory networks are inherently acterize the statistics of prokaryote gene regulatory net- accelerating—theprobablenumberoflinksperregulator works. This then allows the construction of accelerat- pNα increases linearly with node number with α=1, so ing growth network models in Section III where we use consequently, the total number of links scales quadrat- thecontinuousapproximationandsimulationstoanalyze ically as pNα+1. In small and sparsely connected net- networkstatistics. Thesizeconstraintsinherentinaccel- works, most links come from different regulators sug- erating prokaryote regulatory networks are modelled in gesting that regulator number also scales quadratically Section IV. with gene number, pNα+1. Such an accelerating net- work would be characterized initially by sparse connec- tivity at low gene numbers and subsequently by denser connectivity at high gene numbers as networks attempt a transition to a densely connected regime. If the evolv- II. OVERVIEW OF PROKARYOTE GENE ing networks can successfully make this transition, the NETWORKS evolutionary record will display a transition in network statistics for some critical network size N . Conversely, c if these networks, optimized by evolution in the sparse Ongoing genome projects are now providing sufficient regime, are unable to make the transition to the densely datatousefully constrainanalysisofthe generegulatory connected regime, the evolutionary record would show networks of the simpler organisms. Ref. [16] first noted a strict size limit N ≤ N at some critical network size. theessentiallyquadraticgrowthintheclassoftranscrip- c Butthisisexactlywhatisobserved. Allprokaryoticgene tional regulators (R) with the number of genes (N ) in g 3 bacteria with the observed results 3.0 N1.87±0.13, transcriptional regulation log R g 2.5 R∝ NNg22..0037±±00..1231,, ttwraonsccormipptoionneanltrseygsutleamtison (1) 2.0 800 g 1.5 600 Ng2.16±0.26, transcriptional regulation. 1.0 400 Here,thetoptwolinesrefertodifferentclassesofregula- 200 0.5 torswhile thebottomtwolines arethe resultsofacross- 0 checking analysis of two alternate databases. Quotedin- 0 5000 10000 0.0 tervals reflect 99% confidence limits [16]. The explana- 2.5 3.0 3.5 log Ng 4.0 tion for this quadratic growth was that each additional transcription factor doubles the number of available dy- FIG. 1: Double-logarithmic plot of regulatory protein num- namicalstateswhich,itwasposited,allowsforadoubling ber (R) against total gene number (Ng) for bacteria (circles) in the fixation probabilities for this class of genes. and archaea (triangles), adapted from Ref. [17]. The log- As noted above, Ref. [17] provides an alternate theo- log distribution is well described by a straight line with slope retical analysis predicting quadratic growth in any reg- 1.96±0.15(r2 =0.88,95%confidenceintervalindicated),cor- ulatory network exploiting homology dependent interac- responding toaquadratic relationship between regulator num- tions and analyzed 89 bacterialand archeaelgenomes to ber and genome size. The inset shows the same data before determine the relations log-transformation [17]. Dashed lines show the best linear fit to the data R=(0.055±0.004)Ng (r2 =0.75). aNb =(1.6±0.8)10−5N1.96±0.15 (r2 =0.88) g g R= pNg2 =(1.10±0.06)10−5Ng2 (r2 =0.87) with genome size [31]. Each operon can be either unreg- cN =(0.055±0.004)N (r2 =0.75). ulatedandsoconstitutivelyorstochasticallyexpressedor g g (2) subjecttocombinatoricregulationbymultipleregulatory In this paper, acceleratingnetworkswill be based onthe protein transcription factors binding to each operon’s quadraticsecondline (while nonacceleratingmodels pre- promotor sequence. sented in later work will work with the linear third line Againassumingthat E. coli is typical,anygivenregu- [27]). In all cases, the limits reflect 95% confidence lev- latoryproteinaffectsanaverageofabout5operonswith els. For completeness, the data is shown in Fig. 1. The this distribution being long tailed [32] so the majority observed quadratic growth implies an ever growing reg- of regulatorsaffect only one operonthough some regula- ulatory overhead so there will eventually come a point tors(CRP)canaffectupto71operonsor133genes[33]. where continued genome growth requires the number of (This latter reference estimated that eachregulator con- new regulators to exceed the number of nonregulatory trolsonaverage3genes.) Morerecentestimateshavethe nodes, and based on this, Ref. [17] predicted an upper transcriptionfactorCRP,aglobalsensoroffoodlevelsin sizelimit ofabout20,000genes,withinafactoroftwoof the environment,regulatingupto 197genesdirectlyand the observed ceiling. a further 113 genes indirectly via 18 other transcription Earlier surveys of bacterial genomes noted that larger factors [34]. (To observe the long tailed distribution, see genomes harboured more transcription factors per gene Fig. 2 of Ref. [33] and Fig. 4 of Ref. [34].) than smaller ones [28], with this trend attributed to the need in larger genomes for a more complex network of However, the number of inputs taken by an operon regulatory proteins to achieve coordinated expression of is characterized by a compact exponential distribution alargersetofcellularfunctions, andtoselectionincom- with a rapidly decaying tail so the majority of regulated plexenvironmentsleadingtoenrichmentintranscription operons are controlled by a single regulator while very factors allowing regulation of gene expression and signal few regulated operons are controlled by four, five, six or integration. A similar upward trend in the proportion seven regulators [32, 33, 34]. In particular, Ref. [33] ex- ofregulatorsasa fractionofgenome size with increasing amined 500 regulatory links from about 100 regulators genome size was observed in Ref. [29] attributed to a to almost 300 operons to estimate that each regulated need for an increasing responsiveness in diverse environ- operon takes on average 2 inputs though Fig. 2 of this ments, with confirming observations in Ref. [30]. referencesuggestsanaverageinputnumberofabout1.5. Prokaryotestypically group their DNA encoded genes Similarly, Ref. [32] suggests that 424 regulated operons in operons, co-regulated functional modules of average receive 577 links giving an average input number of 1.4, size 1.70 genes each in E. coli which value we treat as while Ref. [34] estimates that 327 regulated operons re- typical though in reality, operon size decreases slightly ceive 524 links giving an average input number of 1.6. 4 III. ACCELERATING PROKARYOTE resulting rate of growth in the number of links attached NETWORK MODELS tonoden isalsothenproportionaltot . Ifthereisalso j jN someprobabilityoftheappearanceofnovelpromotorse- WeextendthegenenetworkmodelofRef. [33]tocon- quences,these combinedprocesses suffice to produce the struct an accelerating network model of prokaryote reg- observed scale free distributions. This model is roughly ulatory gene networks. Prokaryotes typically pack their consistent with recentestimates of the relative contribu- N genesintoalessernumberofN =N /g co-regulated tions to prokaryote genome growth which suggest that g g o operons where we assume that operons contain exactly horizontalgenetransferratesγh areroughlyonethirdof go = 1.70 genes. Of the existing operons, Or are regu- gene loss rates γh = γl/3 and roughly one half of verti- latedoperonsandOu =N−Or areunregulatedoperons. cal inheritance or gene genesis rates γh = γv/2 leading Of the total number of operons, there are R regulatory to roughly constant sized genomes over long times (as operons whose regulatory interactions are directed links N˙ ≈γ +γ −γ ≈0), while “it is remarkable that phy- h v l fromregulatoryoperonstoregulatedoperons. Underthe logenetic distributions of at least 60% [and up to 75%] assumptionthatthereisonlyoneregulatorygeneperreg- of protein families can be explained merely by vertical ulatory operon, the observedquadratic relation of Eq. 2 inheritance.” [36]. Similarly, three quarters of examined becomes transcription factors in Ref. [34] were two-domain pro- teins with shared domain architectures leading to the R=pg2N2. (3) estimate that about 75% of transcription factors have o arisen as a consequence of gene duplication (though the When regulators and regulatory links are very rare, i.e. joint duplication of regulatory regions and of regulated when genomes are small, it is likely that every new link genes or of transcription factors together with regulated isassociatedwithanew regulatorsothe numberoflinks genes is more rare). A further implication of these gene varies roughly quadratically with operon number. We duplicationprocessesis that, inthe main, regulatorscan write only appear on entry to the genome—a potential regu- lator lacking any target matches in a given genome will L=lN2, (4) never form any links when most operons arise from pro- motor preserving duplication events. This allows us to where l denotes the probability of forming a particular considerably simplify our model, and hereafter, we only beneficiallinkperoperon. Thevalueforlwillbeapprox- allow regulators to appear on their entry to the genome. imately pg2, butthe exactrelationmustbe derivedfrom o Of course, more realistic but considerably more compli- the details of the implemented model. cated models are possible. Each regulatory link between nodes is directed, and characterizedby two distinct distributions describing re- Incontrasttotherelativelysmallnumberofregulatory spectively the placement of the heads and tails of each nodes,allnodes canthemselvesbe regulatedby inbound link. Only a relatively few nodes are regulatory, and of links and in fact, can be multiply regulated as promotor these, the number of outbound link tails per regulatory regionscancontainmorethanonebinding site. Further, node are described by a size dependent long-tailed dis- themanyusedandunusedpromotorregionbindingsites tribution with average about hti≈5. Such a long-tailed broadlysamplethespaceofpossiblebindingsitessoonly distribution requires that link tails be preferentially at- a small fraction of nodes will be regulated by any one tached to an existing regulatory operon or equivalently, regulator. As aresult, the number ofinbound link heads theassociatedregulatedoperonmustpossessonepromo- per node is described by a size dependent exponential torbindingsite(amongothers)thatbindsthatparticular distribution with a low average of hhi ≈ 1.5 as typically regulator. Consequently,thepreferentialselectionofreg- results from the random or non-preferential attachment ulatorsmeansthatthepromotorsequencesofnewlyregu- of inbound links to operon promotor sequences. latednodescannotberandomlychosen—randomlydrift- We suppose that the operon network grows by the se- ing promotor sequences would be as likely to match any quential addition of numbered nodes n for 1 ≤ k ≤ N, k one regulator as another. A plausible physical explana- and that at network size k, node n (1 ≤ i ≤ k) has t i ik tion for the preferentialattachment of link tails to exist- outboundtailsandh inboundheads. Wedo notmodel ik ing regulatorsis thatnewly fixatedoperonscomelargely the manytrials ofpotentialgenesovermanygenerations from gene duplication events [35] where some of the du- and merely include fixated genes in our count—that is, plicated promotor binding sites are under strong selec- drifting sequence is not counted as part of the fixated tive constraint while other binding sites and the operon genome. Thisfurther impliesthe sequenceofestablished genescandriftfreely. Geneduplicationthenimpliesthat nodes is under severe selective constraint and unable to in a genome of size N operons, if some regulator n has driftsoconsequentlynewlinkscannotbeaddedbetween j t outboundregulatorylinksto approximatelyt reg- existingnodes. (IfaproportionfN ofexistingnodescan jN jN ulatedoperons,thentheprobabilitythatanewlyfixated explore novel sequence space in time dt, then the num- operon is also regulated by n is simply the proportion ber of new regulators increases as dR ∝ fN2dt, and as j ofsuchregulatedoperonsinthegenome,ort /N. This N is itself a function of time, this integrates to generate jN implements the required preferential attachment as the a non-quadratic relation between regulator and operon 5 somesmallfixedlengthandanentiregenomescalespro- portionately to genome length—doubling genome length doubles the probability of a match. Hence, the expected numberoflinks formedper regulatorscaleslinearlywith present genome size. As the number of source trial se- quences also scales with genome length, the expected number of matches between all regulators and all regu- latedoperonsscalesquadraticallywithgenomelength,or effectively, with operon number assuming constant sized operons over the evolutionary record. n 1 As a consequence,onentryinto the genome,eachnew n N genehassomeprobabilityofbeingaregulatordependent firstlyonitssuitabilitytobindDNAandsecondlyonthe linearly increasing expected number of acceptable bind- ing targets present in the genome on entry (or at later times). As discussed above, the predominance of verti- calgenegenesiseventsallowsasimplifiedmodelwherein the probability of a new node being regulatory is deter- mined solely by the number of available links present at the time of entry. We assume then that on entry into the genomeeachnewnode n canform2k−1links with k nodesn ,...,n consistingofasingleself-regulatorylink 1 k FIG. 2: An example statistically generated E. coli genome from node nk to itself with probability l, (k −1) regu- using the later results of this paper where (for convenience latory outbound links to the existing nodes each with only) operon nodes numbered n1,...,nN are placed sequen- equal probability l, and, provided that sufficient regula- tially counterclockwise on a circle in their historical order of torsalreadyexist,l(k−1)inboundregulatorylinks from entry into the genome. The filled points on the outer circle some subset of the existing regulators chosen according locateregulatorsandhaveradiusindicatingthenumberofout- topreferentialattachment. (Forconsistency,wecanonly bound regulatory links. The open points on the middle circle add ≈ lk distinct regulatory links to node n provided k locateregulatedoperonsandhaveradiusindicatingthenumber thereareatleastthismanyregulatorsinexistence. From of inbound regulatory inputs. The arrows in the inner circle Eq. 3, the average number of regulators pg2k2 must be show all directed regulatory links. o greater than the number of regulatory links lk, and this will be satisfied for k > l/(pg2) ≈ 1.) As a result, the o totalnumberofheadsortailsattachedtonoden onen- number which is not observed.) k trytothegenomerangesbetween0andk,witheachlink For clarity, Fig. 2 preempts later calculations and de- pictsastatisticallygeneratedversionofanE.coligenome formed with probability l. Hence, the respective proba- bilities that the initial number of heads h = j or the wherenodesareplacedsequentiallycounterclockwiseina kk initial number of tails t =j for node n is circle(forconvenienceonly). Alternativegenomemodels kk k may be distinguished by the age distribution of regula- k tors,regulatedoperonsandtheirlinknumbers,andthese P(j,k)= lj(1−l)k−j, (5) are indicated in this figure. In particular, Fig. 2 shows (cid:18)j(cid:19) a highly nonuniform distribution of regulators and out- bound link numbers with gene age in contrast to a uni- with the proviso that all the inbound links can only form distribution of regulated operons and of inbound be added to node nk if there is a sufficient number link numbers. (It will turn out that these latter age- of regulators among the nodes n1,...,nk. The aver- independentdistributionareonlypresentwhenregulator age number of inbound and outbound links is identical, number grows quadratically with genome size.) htkki = hhkki = lk showing linear growth in link num- These distributions result from the physical processes ber with increasing network size. The addition of node underlying the formation of regulatory links in prokary- nk and its links will increase the probable number of otes. As discussedabove,a substantialproportionofthe heads attached to earlier nodes nj for 1≤j ≤(k−1) so generegulationnetworkofprokaryotesisenactedviaho- hjk ≥ hjj, while the probable number of tails outbound mology dependent interactions as when sequence spec- from node nj increases tjk ≥tjj if and only if that node ified protein transcription factors bind to specific pro- is regulatory with tjj >0. moter sequences. The undirected nature of evolution- As regulators can only be created on entry to the arysearchesmeansthatgeneregulatorynetworksfunda- genome,thedistributionofregulatorsatanytimeisspec- mentally exploitthe samesequence matching algorithms ifiedby the distribution P(j,k) for t . Using Eq. 5, the kk usedincomparativegeneticswheretheprobabilityofob- probability that node n is a regulator is 1−P(0,k), so k taining matches between a single given trial sequence of for a network of N nodes, the predicted total number of 6 regulators is give N l+llnN if α=0 j R = 1−(1−l)k h = (10) jN Xk=1(cid:2) (cid:3) lNα+ l(α−1)jα if α>0. α α 1−l−(1−l)N+1 = N − Integration of these link numbers over all node numbers l j gives the required total number of links as in Eq. 8. l ≈ N(N +1). (6) For0≤α<1,the numberoflinks pernode ismonoton- 2 ically decreasing with node number. However, for α=1 The exact top line shows the expected behaviour for the andonlyinthiscase,thefinaldistributionisindependent numberofregulatorsintherespectivelimitsl→0giving of node number j because earlier nodes receive exactly R→0,andl→1 givingR→N. The approximaterela- enough links from latter nodes to balance the initially tioninthethirdlinecanbecomparedtotheobservedEq. biased distribution of heads h ≈ lj, so in the end, all jj 3 and immediately suggests l ≈ 2pg2, while a fit to the nodes receive on average the same number of inbound o more accurate top line gives the connection probability regulatory links hh i = lN for 1 ≤ j ≤ N. For faster jN as accelerationrates,α>1,thenumberoflinkspernodeis monotonically increasing as later nodes receive a greater l=1.15×2pg2 = 7.31×10−5. (7) number of links on entry to the genome and this imbal- o ance is not corrected. This probability value suggests an average promotor The possibility of monotonically increasing numbers binding site length of −log4l = 6.9 bases. The aver- of links with node number in accelerating networks has age number of links per regulator using the second line not previously been considered. This possibility requires of Eq. 6 is then approximatelyL/R≈2, while the more modifying the usual continuum approach [37, 38, 39] so accurate top line with N =2528 operons for E. coli [31] the final inbound link distribution is obtained via gives L/R = 2.12, about a factor of two from the ob- served value of 5 for E. coli [32]. 1 N H(k,N) = dj δ(k−h ) jN N Z 0 1 ∂h −1 A. Random distribution of regulated operons = ± jN at [j =j(k,N)], (11) N (cid:18) ∂j (cid:19) The distribution of link heads for all nodes (with pos- where j(k,N) is the solution of the equation k = h . jN session of a link head designating a regulated node), The top line is used when all nodes possess the same can be straightforwardly calculated under the assump- average link number while the second line is applicable tion that the tkk ≈ lk new tails added with node nk with the plus (negative) sign when the average numbers are randomly distributed across the k existing nodes so oflinkspernodeismonotonicallyincreasing(decreasing) on average, each existing node receives l links. To build with node number. Non-monotonic cases require alter- insight, it is useful to consider the general case where nate approaches. tkk ≈ hkk ≈ lkα for α ≥ 0. Setting α = 0 adds with Underquadraticgrowthintotallinknumberwhenα= some probability a constant number of links with each 1,andonlyinthiscase,thefinaldistributionoflinkheads newnode,α=1addsalinearlygrowingnumberofprob- is independent of node number and evaluated using Eq. ablelinkswitheachnewnode,α=2addsaquadratically 11 to give growing number of probable links with each new node, and so on. The total number of links present in the net- 1 N H(k,N) = dj δ(k−lN) work is then N Z 0 = δ(k−lN). (12) N 2lNα+1 2lkα = (8) Z α+1 As expected, a compact final link distribution results 0 when all nodes have an average of t = lN inbound jN The continuous approximation [37, 38, 39] for links ran- regulatory links at time N. This distribution calcu- domly distributed over k existing nodes determines the latedunderthecontinuousapproximationequatestoone number of inbound head links for node nj according to where in reality, each node receives a controlling head withprobabilitylfromeveryothernode(thoughinprac- ∂h t jk = kk tise,the totalnumber ofreceivedlinksis oforderunity). ∂k k Hence, for any node in a network of size N, the actual = lkα−1. (9) probability of having k heads is This can be integrated with initial conditions h ≈ ljα N jj H(k,N)= lk(1−l)N−k. (13) at time j and final conditions tjN ≈ lNα at time N to (cid:18)k(cid:19) 7 Anetworksimulationwithlineargrowthoflinknumbers 1 per node model (Eq. 5) serves to validate this predicted 0.9 Ph(k) final distribution. Fig. 3 compares the predicted distri- 0.8 bution H(k,N) against observed distributions for typ- 0.7 ical simulated networks of various sizes with negligible 0.6 discrepancies. 0.5 ForanetworkofsizeN,theprobabilitythatanygiven 0.4 operonisunregulatedisH(0,N)sotheexpectednumber 0.3 of unregulated operons summed over all N nodes is 0.2 O =N(1−l)N. (14) 0.1 u k 0 This determines the number of regulated operons as 1 2 3 4 5 6 O =N −O =N 1−(1−l)N , (15) r u FIG.4: ThepredictedproportionsP (k)oftheregulatedoper- h (cid:2) (cid:3) ons of E. coli taking multiple regulatory inputs for a genome showingtheexpectedbehaviourasl →1givingO →N r of N =2528 operons. This distribution closely approximates and l → 0 giving O ≈ lN2 = L as each of the sparsely r that observed for E. coli in Fig. 2(d) of Ref. [31] and of Fig. distributed links hits a distinct regulated operon. We 5 of Ref. [34]. note that random gene duplication and deletion events will not change the H(k,N) distribution (other than changing N) as all nodes are identically connected on and of Fig. 5 of Ref. [34], though it underestimates the average. The H(k,N) distribution appears in Fig. 2 numbersofregulatedoperonswith4,5,6and7inputs— whichshowsa uniform(age-independent)distributionof essentially no regulators are predicted to have 5 or more regulated nodes over the genome, and this uniformity is inputs for genomes of size N = 2528 operons. In ad- onlyexpectedforα=1correspondingtolineargrowthin dition, the average number of inbound regulatory links linknumberspernodeandquadraticgrowthinregulator per operon (for all operons) is hki = L/N = lN = 0.19, numbers. whiletheaveragenumberofinboundregulatorylinksfor regulated operons is hk i = L/O ≈ 1. A more accu- r r 1 rate calculationusing the specific values for E. coli gives 0.9 H(k,N) hk i=L/O =1.10,verycloseto the E. coli value of1.5 r r or 1.6 noted in Refs. [32, 33, 34]. 0.8 N=2000 0.7 0.6 N=6000 B. Scale-free distribution of regulator operons 0.5 0.4 N=10000 On entry into the genome, node n sources on aver- k 0.3 age lk outbound regulatory links and this linear growth 0.2 in link number means that more recent nodes are more N=14000 0.1 likely to be immediately regulatory and more likely to be highly connectedongenome entry. However,noden 0 k 0 1 2 3 4 5 6 7 k will also receive on average lk inbound regulatory links whosetailswillbe preferentialattachedtoexistingregu- FIG. 3: A comparison of the predicted distribution of in- lators. Thefinaldistributionoflinknumberwithagewill bound link numbers per node H(k,N) (solid lines) against dependontherateatwhichearliernodesunderpreferen- thatobservedinsimulatednetworksofvarioussizes(indicated tial attachment can attract links relative to the linearly points) with quadratic growth in the total probable number of increasing link numbers of later regulatory nodes. randomly attached links. Onentryattimek,noden receivesh ≈lk inbound k kk linksfromexistingregulatorynodesinthesetn ,...,n . 1 k Thesepredictionscanbecomparedtoobservation. For As previously,we gaininsightby consideringthe general the E. coli network of size N = 2528 operons or 4289 case where t ≈ h ≈ lkα for α ≥ 0 (though we con- kk kk genes[31], the predictedproportionofregulatedoperons tinuetousethedistributionP(j,k)ofEq. 5todetermine receiving k >0 inputs is both the number of links j prior to exponentiation and regulatoryprobabilitysoconsequentlythenumberofreg- H(k,N) Ph(k)= , (16) ulators continues to increase quadratically according to 1−H(0,N) Eq. 6). Asaresult,theneedtoensurethatallregulatory and is shownin Fig. 4. Here, the calculateddistribution links tonode n aredistinct requiresthatnew link num- k closely approximates the compact exponential distribu- ber lkα be less than the number of existing regulators tion observed for E. coli shown in Fig. 2(d) of Ref. [31] lk2 requires α ≤ 2. The h new tails added with node kk 8 n are preferentially attached to the existing regulatory As previously, it is straightforwardto calculate the fi- k nodes n with probability proportionalto the number of nal outbound link distribution in the case α = 1 using j existing regulatorylinksfor thatnode attime k,i.e. t . Eq. 11. This gives jk Usingthecontinuousapproximation[37,38,39],therate 1 N of growth in outbound link number for node nj is then T(k,N) = dj δ(k−lN) approximately N Z0 = δ(k−lN). (23) ∂t t jk jk ∂k =hkk kt dj. (17) Again,we find the expected compactdistribution result- 0 jk ing when all nodes possess the same average number of R links. This raises the question however,of how it is that The denominator here is a probability weighting to en- a probabilistic accelerating network subject to preferen- sure normalization and is the total number of outbound tial attachment can end up with all nodes possessing links for all nodes. Following [1], we can evaluate the the same average number of links? The answer lies in denominator using the identity our use of two classes of distinguishable nodes, regula- torsandnon-regulators,whichrequiresthatwetakeinto ∂ k k ∂ t dj = t dj+t . (18) account the known distribution of regulators with node jk jk kk ∂k Z Z ∂k 0 0 number over the genome. The average link number per node at node n (Eq. 22) equates to the product of the This can be evaluated using Eq. 17 noting t ≈ h ≈ j kk kk average number of link tails per regulator at node n , lkα giving j denoted t (j,N), and the average number of regulators r per node atnoden , denoted ρ(j). This latter density is ∂ k j t dj =2lkα, (19) ρ(j)=dR(j)/dj ≈lj by Eq. 6, so by definition, we have jk ∂k Z 0 t =t (j,N)ρ(j), (24) jN r which can be integrateddetermining the denominator of giving Eq. 17 to be α+1 α−3 k 2l tr(j,N)=N 2 j 2 . (25) t dj = kα+1. (20) Z jk α+1 Hence,forα<3,theaveragenumberoflinksperregula- 0 torisadecreasingfunctionofnodenumberjasthegrow- This is in agreement with Eq. 8. Substituting this value ing number of links added to recent nodes is insufficient into Eq. 17 gives to outweigh the effects of preferential attachment which more rapidly increases the number of links attached to ∂t α+1t jk jk early nodes. In particular, for α = 1 with the addition = . (21) ∂k 2 k of a linearly increasing number of links per node, the average number of regulatory links per regulator scales Finally, this can be integrated with initial conditions inversely with node number j. In other words, the den- t ≈ljα attime j and finalconditions t attime N to jj jN sity of regulators is very low at small node numbers j give while the veryfew regulatorynodes inthis stretchofthe α+1 α−1 genomeareheavilyconnecteddue topreferentialattach- tjN =lN 2 j 2 . (22) ment so as to maintain the constant average of Eq. 22. (See Fig. 2.) Again we find that the respective choices α < 1 and The t (j,N) distribution contains information about α > 1 lead to monotonically decreasing and increasing r both node connectivity and node age and so approxi- numbers of links per node as a function of node num- matesgenomestatistics(simulatedorobserved)whenall ber, while setting α=1 ensures the number of links per ofthisinformationisavailable. However,itisusuallythe node is independent of node number. In this case, the case that node age information is unavailable necessitat- preferential attachment of links to earlier nodes does in- ing calculation of connectivity distributions that are not deedacttocanceltheinitialbiasinlinknumbertowards conditioned on node age. This effectively requires bin- later nodes. It is also apparent that when α = 1, the ningtogetherallnodesirrespectiveoftheiragetoobtain limit l → 1 implies all nodes possess exactly N links as a final link distribution. In the case of linearly growing expected for a fully connected regular network. (Prefer- number of links per node, α = 1, the delta function of entialattachmentcannotdistortconnectivitynumbersin Eq. 11 is resolved by the equality j = N/k giving the thiscaseasallnodeshaveanequalnumberoflinks.) Ad- final distribution as ditionally,inthe limit l→0 wehavet =0asrequired jN for an entirely disconnected network. The case α = 0 1 ∂j T(k,N) = − duplicates results found for growing networks which add N (cid:18)∂k(cid:19) a constant number of links with each new node subject 1 to preferential attachment [2]. = , (26) k2 9 ∞ which, as required, is normalized as T(k,N) = 1. 1 The expected proportion of regulatorsRPt(k) possessing -0.5 log Pt(k) N=10000 k links is then obtained by integrating the continuous distribution of Eq. 26 overappropriate ranges[1,3/2]or [k−1/2,k+1/2] to obtain -1.5 N=14000 1 k =1 N=2000 3 P (k)= (27) -2.5 t 4k24−1 k >1. N=6000 Predicted Thesetheoreticalpredictionscomparewelltosimulations -3.5 ofnetworksofvarioussizeswithlinearlyincreasingnum- bers of probable links per node and subject to preferen- -4.5 tial attachment. Fig. 5 shows simulated outbound link 0.0 0.5 1.0 1.5 2.0 log k 2.5 distributions which are long-tailed and scale free with probabilities scaling roughly as P (k) ∝ k−2 for large k. FIG. 5: A simulation of the proportion of outbound links per t The Pt(k) distribution shows a full one third of regula- regulatorPt(k)innetworksofvarioussizeswithlineargrowth torshaveonlyonelink,while60%havetwoorfewerlinks, intheprobablenumberoflinkspernodepreferentiallyattached to regulatory nodes. The log-log plot shows slopes of roughly and71%havethreeorfewerlinks. Fig. 6showsthelong- tailed distribution P (k) expected for a simulated E. coli −2inagreementwiththeoretical predictions(heavysolidline) network of N = 252t8 operons with preferential attach- of a long-tailed scale free distribution Pt(k)∝k−2. ment of links. This figure shows marked similarities to the long-tailed distribution of E. coli shown in Fig. 2(c) 0.35 of Ref. [31]. In particular, the expected number of regu- Pt(k) latorswith k links is P (k)R(N)with the number ofreg- t 0.3 ulatorsR(N)obtainedfromEq. 6(orfromobservation). For E. coli, this predicts the probable existence of about 0.25 one regulator possessing link numbers in each of the re- 0.2 spective ranges between [40,49] links, between [50,64] links, between [65,94] links, between [95,169] links, and 0.15 between[170,700]linksforinstance. (Thisapproximates theconnectivityoftheglobalfoodsensorCRPwhichreg- 0.1 ulates up to 197 genes directly [34].) The average of the 0.05 P(k) distribution (as well as the t (j,N) distribution) is k r formally undefined as long as the integration limits are 0 taken to infinity. However, in a network of N nodes, a 1234567891011121314151617181920212223242526272829303540455055606570 regulatorcanpracticallyonlyregulateatotalofN nodes, and this cutoff allowsus estimate the averageconnectiv- FIG. 6: The predicted proportion of regulatory operons Pt(k) ity per regulator (complementing previous estimates fol- regulating k different operons for a simulated E. coli genome withN =2528 operons. Asexpected, most regulators regulate lowing Eq. 7). Using the cutoff and approximating the only one other operon, though a small number of regulators summation via an integral, the average connectivity per can regulate more than 40 operons. This distribution closely regulator in a network of N nodes is approximates theobserved proportions forE.coliinFig. 2(c) N of Ref. [31] and Fig. 4 of Ref. [34], and predicts the probable hki = kP (K) existence of about one E. coli regulator possessing link num- t Xk=1 bers in each of the respective ranges between [40,49] links, between [50,64] links, between [65,94] links, between [95,169] 1 1 4N2−1 = + ln , (28) links, and between [170,700] links, and so on. 3 2 (cid:18) 15 (cid:19) (or simply lnN using the continuous distribution of Eq. 26.) The average number of links per regulator for E. transitionatsomecriticalnetworksizeeithertoanonac- coli from Eq. 28 is hki=7.51 (or 7.83 using the simpler celerating architecture permitting continued growth or derivation), which again compares well to the observed must cease growth entirely, and we now seek to predict value of 5 in E. coli [32]. thelocationofthistransitionpointandcompareittothe evolutionaryrecord. Webeginbyexamininganoverview ofthe acceleratinggenome model. Fig. 7 shows thatlin- IV. INHERENT PROKARYOTE SIZE LIMITS ear growth in link numbers per node (α = 1) allows a quadratic growth in the total number of links (Eq. 4) The accelerating nature of regulatory gene networks despite eachof the number ofregulators(Eq. 6) andthe necessarily means that these networks must exhibit a numberofregulatednodes(Eq. 15)asymptotingtosome 10 fractionof N after an initial period of quadratic growth. the proportion of regulators which control transcription For large genomes, almost all new nodes will be regu- factors scales linearly with network size and equals 15% lators and densely connected into the existing network for an N = 2000 network, 29% for N = 4000, 44% for which will then multiply regulate almost every node. N = 6000, 59% for N = 8000, 73% for N = 10000, and 88% for N = 12000 operons (after which the ap- proximations made break down). Naturally, when most 16000 regulators themselves control other regulators, then the L 14000 entire regulatory network will consist of a single giant component. These ratios compare reasonably well with 12000 those observed in E. coli where Ref. [34] noted that of 10000 121 transcription factors for which one or more regula- Or tory genes areknown, 38factors or 31.4%regulate other 8000 N transcriptionfactors. TheapproximatesecondlineofEq. 6000 29 with N =2528 for E. coli determines this proportion 4000 Rq R asPrr =18.5%whilethemoreaccuratetoplinegivesthe proportion of regulators which control transcription fac- 2000 tors as P =17.7%, giving a reasonable match between rr 0 prediction and observation. 0 2000 4000 6000 8000 10000 12000 14000 N 16000 Astheproportionofregulatorsoftranscriptionfactors rises, the probable length of regulatory cascades will in- FIG.7: Thequadraticgrowthinthenumberofregulatorylinks crease. In fact, the proportion of regulators taking part L,andtheasymptotingquadraticgrowthofregulatoryoperons in a regulatory cascade of length n≥1 is R and of regulated operons Or in relation to the total number of operons N. Actual numbers ofregulators for89prokaryote p =(1−P )Pn−1. (30) genomes are shown (solid points), while the non-asymptoting n rr rr fitted quadratic curve Rq is shown for comparison. The ob- This equation can be obtained from a tree of all binary served maximum size of prokaryote genomes (of order 10,000 pathwayswhichateachbranchingpointeitherterminate genes or about 6,000 operons) lies near the transition point with probability (1 − P ) or cascade with probability between sparse and dense connectivity as an increasing pro- rr P . As such, the probable cascade length is negligi- portion of operons become linked into the regulatory network. rr ble when the proportion of regulators controlling regu- lators is small P ≪ 1 but can become large as P rr rr Thetransitionfromsparsetodenseconnectivityoccurs itself increases. As Prr is indeed large for networks of asanincreasingproportionofoperonsbecomelinkedinto size N > 6000, this again suggests that long cascades theregulatorynetworkleadingtotheemergenceofasin- of regulatory interactions will lead to the coalescing of a glegiantcomponentoffullyconnectednodes. Onewayto single giant component in this regime. Again, the calcu- highlightthistransitionisbydeterminingtheproportion lated lengths of regulatory cascades can be compared to oftranscriptionfactorwhichcontroldownstreamregula- thoseinE.coliwherethenumberofcascadesofregulated tors as such linkages create the single giant component. transcriptionfactorsobservedina particularsetofregu- The proportion of regulators controlling regulators is latoryinteractionswas23two-levelcascadesor37.7%,32 three-level cascades or 52.5%, and 6 four-level cascades N or 9.8% [34]. As one-level or autoregulatoryinteractions 1 N R(N) P (N) = 1−(1−l)k are not included in this observation, the predicted pro- rr R(N)kX=1(cid:2) (cid:3) k N portionsforE.coliarep¯n =pn/(1−p1)withPrr =17.7% ≈ lN. (29) giving 82% two-level cascades, 15% three-level cascades, 3%four-levelcascades,1%five-levelcascades,andsoon. Here, the first fraction on the RHS normalizes the pro- It is seen that the theoretical predictions overestimate portion in terms of the number of regulators R(N) (Eq. the proportion of two-level cascades and underestimate 6), the first term in the summation is the probability the number of three-level and higher cascades probably that node n is a regulator, the second term is the av- because of selectionpressuresnot included in the model. k erage number of regulatory outbound links for this reg- Lastly,wenotethatthenumberofcyclesinvolvingclosed ulatory node t (k,N) at network size N (Eq. 25 with regulatory loops of size greater than one (i.e. involv- r α=1), and the third term approximatesthe probability ing more than autoregulation) in the examined portion that these nodes link to one of the existing regulators of the E. coli regulatory network is zero reflecting that under random attachment. (If the very first and very feedback loops in these organisms are carried out at the last terms are dropped, the remaining summation over post-transcriptional level involving metabolites such as all nodes of the probability that n is regulatory with appear in the lac operon [32, 33, 34]. k the stated number of links equates to the total number We note that our model is entirely unable to explain of links in the network L≈ lN2. This is the more accu- the highproportionofautoregulationobservedinE. coli rateversionofthecalculationleadingtoEq. 25.) Hence, with various estimates that 28.1% [40], 50% [32] and