
On universal common ancestry, sequence similarity, and phylogenetic structure

25 Pages·2012·0.74 MB·English


Theobald Biology Direct 2011, 6:60
http://www.biology-direct.com/content/6/1/60

RESEARCH Open Access

On universal common ancestry, sequence similarity, and phylogenetic structure: the sins of P-values and the virtues of Bayesian evidence

Douglas L Theobald

Abstract

Background: The universal common ancestry (UCA) of all known life is a fundamental component of modern evolutionary theory, supported by a wide range of qualitative molecular evidence. Nevertheless, recently both the status and nature of UCA have been questioned. In earlier work I presented a formal, quantitative test of UCA in which model selection criteria overwhelmingly choose common ancestry over independent ancestry, based on a dataset of universally conserved proteins. These model-based tests are founded in likelihoodist and Bayesian probability theory, in opposition to classical frequentist null hypothesis tests such as Karlin-Altschul E-values for sequence similarity. In a recent comment, Koonin and Wolf (K&W) claim that the model preference for UCA is “a trivial consequence of significant sequence similarity”. They support this claim with a computational simulation, derived from universally conserved proteins, which produces similar sequences lacking phylogenetic structure. The model selection tests prefer common ancestry for this artificial data set.

Results: For the real universal protein sequences, hierarchical phylogenetic structure (induced by genealogical history) is the overriding reason why the tests choose UCA; sequence similarity is a relatively minor factor. First, for cases of conflicting phylogenetic structure, the tests choose independent ancestry even with highly similar sequences. Second, certain models, like star trees and K&W’s profile model (corresponding to their simulation), readily explain sequence similarity yet lack phylogenetic structure.
However, these are extremely poor models for the real proteins, even worse than independent ancestry models, though they explain K&W’s artificial data well. Finally, K&W’s simulation is an implementation of a well-known phylogenetic model, and it produces sequences that mimic homologous proteins. Therefore the model selection tests work appropriately with the artificial data.

Conclusions: For K&W’s artificial protein data, sequence similarity is the predominant factor influencing the preference for common ancestry. In contrast, for the real proteins, model selection tests show that phylogenetic structure is much more important than sequence similarity. Hence, the model selection tests demonstrate that real universally conserved proteins are homologous, a conclusion based primarily on the specific nested patterns of correlations induced in genetically related protein sequences.

Reviewers: This article was reviewed by Rob Knight, Robert Beiko (nominated by Peter Gogarten), and Michael Gilchrist.

Background

In a recent study, I applied model selection theory to a data set of universally conserved protein sequences, in an attempt to formally quantify the phylogenetic evidence for and against the theory of universal common ancestry (UCA) [1]. For the conserved protein data, this study demonstrated that UCA is a much more probable model than competing independent ancestry models. One of the notable strengths of this study is that it provides evidence for common ancestry without recourse to the common assumption that a high degree of sequence similarity necessarily implies homology.
This UCA study was subsequently criticized in a paper by Koonin and Wolf (hereafter referred to as K&W), in which they argue that the results in favour of UCA are “a trivial consequence of significant sequence similarity between the analyzed proteins” and that my tests “yield results ‘in support of common ancestry’ for any sufficiently similar sequences” [2]. Here I show that K&W’s conclusions are incorrect. While sequence similarity is a highly probable consequence of common ancestry, similarity alone is insufficient to establish homology by the model selection tests. Rather, the phylogenetic pattern of nested, hierarchical, sequence correlations is the dominant factor that forces the conclusion of common ancestry for the real protein data.

Biochemistry Department, Brandeis University, Waltham, MA 02454, USA
© 2011 Theobald; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This is the question I set out to answer: Is there a universal tree (or, more broadly, a universal pattern of genetic relatedness) in the first place?

Several researchers have recently questioned the nature and status of the theory of UCA or have emphasized the difficulties in testing a theory of such broad scope [11,15-18]. For example, Ford Doolittle has disputed whether objective evidence for UCA, as described by a universal tree, is possible even in principle:
Before considering Indeed, one is hard pressed to find some theory-free K&W’s specific arguments in detail, I give an extended body of evidence that such a single universal pattern background to the question of universal common ances- relating all life forms exists independently of our try and provide a setting for understanding why null habit of thinking that it should [19]. hypothesis tests of significance, such as BLAST-style E- values, are inadequate to quantitatively address the evi- This sentiment was echoed also by K&W, who con- dence for and against UCA. cluded that a “formal demonstration of UCA … remains elusive and might not be feasible in principle.” [2]. Such Universal common ancestry: The qualitative evidence and criticisms of UCA point to a need for a formal test, need for a formal test similar to the formal tests of fundamental physical the- Universal common ancestry is the hypothesis that all ories like general relativity and quantum mechanics. extant terrestrial life shares a common genetic heritage. Darwin originally proposed UCA in 1859, yet was The classic arguments for common ancestry include characteristically circumspect, only committing to the many independent, converging lines of evidence from view that “animals are descended from at most only various fields, including biogeography, palaeontology, four or five progenitors, and plants from an equal or les- comparative morphology, developmental biology, and ser number” [3]. The hypothesis of UCA was evidently molecular biology [1,3-14]. The great majority of this an open question at least until the mid 1960’s, when a evidence, however, is qualitative in nature and only debate about UCA and the universality of the genetic directly addresses the relationships of limited sets of code (then as yet undeciphered) played out in the pages higher taxa, such as the common ancestry of metazoans of Science. One of the most celebrated arguments for or the common ancestry of plants. 
UCA is based on the fact that the genetic code is identi- The broader question of universal common ancestry is cal, or nearly so, in all known life. The argument had much more ambitious and correspondingly difficult to been circling informally for some years before Hinegard- assess. Are Europeans, Euryarchaeota, Euglena, Yersinia, ner and Engelberg first presented it in detail [20-23]: yew, and yeast all genetically related? Of course, biolo- gists routinely incorporate all of these taxa into a uni- Because the genetic code should remain invariant, its versal phylogenetic tree, which is an explicit constancy can be used to establish the number of representation of the genealogical relationships among primordial ancestors from which all (present) organ- these diverse taxa. But any group of taxa can be con- isms are derived. If, for example, the code is univer- nected in a tree; one can even make a phylogenetic tree sal … then all existing organisms would be from random sequences or characters. Yet is a tree itself descendants of a single organism or species. If the justifiable in light of the evidence? In a paper that moti- code is not universal, the number of different codes vated my original test of common ancestry, Sober and should represent the number of different primordial Steel set out the issue very clearly [11]: ancestors … When biologists attempt to reconstruct the phyloge- Hinegardner and Engelberg’s reasoning hinges on the netic relationships that link a set of species, they assumption that the genetic code is so important for usually assume that the taxa under study are genea- fundamental genetic processes that any mutations in the logically related. Whether one uses cladistic parsi- code would be lethal. 
Carl Woese criticized this argu- mony, distance measures, or maximum likelihood ment, noting its dependence on the assumption that the methods, the typical question is which tree is the genetic code is a “historical accident” and must not be best one, not whether there is a tree in the first “chemically determined” [23]. Woese was a proponent place. of the “stereochemical hypothesis”, which holds that the association between a certain codon and its respective much is unclear. Furthermore, we also know now that amino acid is dictated by chemical phenomena; that the genetic code is far from “frozen”, and that it con- is, the observed code is required somehow by the laws tinues to evolve [27]. of physics, perhaps by binding affinity of the nucleic Ideally, we would like to evaluate the evidence for acid codon to its corresponding amino acid [23-26]. UCA from the universal genetic code, accounting for Woese was also sceptical that the code was “frozen”, the frozen accident and stereochemical hypotheses and and he postulated plausible mechanisms by which a allowing for code evolution. For instance, we could cal- degenerate code could evolve. If the code were somehow culate the following two different probabilities, and determined by physicochemical principles and evolvable, compare them: (1) the probability of arriving at a near then multiple origins of life could conceivably converge universal genetic code assuming two or more indepen- independently on the same code. However, the stereo- dent origins and convergence due to chemical con- chemical hypothesis was considered and largely disre- straints from the escaped triplet hypothesis, versus (2) garded by most researchers, including Francis Crick, due the probability of arriving at a near universal code to a lack of evidence and difficulty in imagining a possi- assuming a single origin and a “frozen accident”.
Of ble mechanism [20-22,24]. course, reasonable and persuasive verbal arguments can In 1968, Crick still presented the “frozen accident” be made on this point [25], and the qualitative evidence argument for UCA with some reservation [24]. But by compellingly supports universal ancestry. But quantita- 1973, in his famous essay on the explanatory power of tively estimating these probabilities is non-trivial. To evolutionary theory, Theodosius Dobzhansky laid out calculate these probabilities we need formally specified the existing evidence for UCA as if it were beyond dis- stochastic models for how genetic codes evolve under pute [4]. According to Dobzhansky, the primary support both hypotheses — namely, we need well-defined likeli- for UCA is given by several key molecular similarities hood functions, probability distributions for the shared by all known life: (1) the “universal” genetic observed data given each of the competing hypotheses. code, (2) nucleic acid as the genetic material, (3) shared We currently have no such formal models for genetic polymers such as proteins, RNAs, lipids, and carbohy- code evolution [11,12,25,27]. The calculations are drates, and (4) core metabolism. These are today still further complicated by the fact that there are other the main arguments for UCA. plausible hypotheses for the evolution of the code with The standard presentation of this evidence is, how- empirical support [25,30], not all of which are mutually ever, strictly qualitative; it does not quantitatively assess exclusive. Consequently, it is currently impractical to the likelihood that these commonalities could be arrived construct a formal, quantitative test of UCA using evi- at independently from multiple origins. Each of Dobz- dence from the genetic code. 
hansky’s arguments for UCA has its weaknesses, and Fortunately, however, we do in fact have ready-to-use, Sober and Steel provide several criticisms of these stan- well defined, and widely accepted stochastic models for dard arguments [11]. While a detailed analysis of these the evolution of protein sequences. These phylogenetic lines of evidence for UCA is beyond the scope of this models of sequence evolution have been developed in article, as a case study let us briefly revisit the “univer- detail over the past few decades in the field of molecular sal” genetic code, widely considered the most persuasive evolution, and they benefit from both a firm theoretical evidence for UCA [5,6,10,24]. basis and widespread empirical support from genetics The origin and evolution of the genetic code is still an [31]. It is for these very reasons that I based my formal open question [25,27], yet recently a modification of the probabilistic, model-based tests of UCA on protein stereochemical theory has made a comeback, in the sequence data. form of the “escaped triplet hypothesis” of Yarus and Knight [28,29]. Their hypothesis has the great advantage Sequence similarity and homology are not equivalent that it is based on a large body of experimental data One common thread among the various arguments for that has shown a significant chemical association common ancestry is the inference from certain biologi- between amino acids and their corresponding codon/ cal similarities to homology. However, with apologies to anti-codon triplets found in RNA aptamers. Fisher, similarity is not homology. It is widely assumed From Yarus and Knight’s work, we now have empirical that strong sequence similarity indicates genetic kinship. 
support for an association of specific codons with their Nonetheless, as I and many others have argued respective amino acids, which means we have a viable [1,32-40], sequence similarity is strictly an empirical mechanism for evolving similar or identical genetic observation; homology, on the other hand, is a hypoth- codes from independent origins. Hence, the plausibility esis intended to explain the similarity. Common ances- of a stereochemical theory decreases the force of the try is only one possible mechanism that results in “frozen accident” argument for UCA — but by how similarity between sequences. In a landmark paper on TheobaldBiologyDirect2011,6:60 Page4of25 http://www.biology-direct.com/content/6/1/60 the inference of homology from sequence similarity [33], evolution that phylogenetic models readily grasp: the the late Walter Fitch presented the problem as follows: hierarchical structure of nested correlations induced in genetic sequences by genealogical processes. Now two proteins may appear similar because they descend with divergence from a common ancestral Logical problems and misconceptions with BLAST-style gene (i.e., are homologous in a time-honoured null hypothesis tests meaning dating back at the least to Darwin’s Origin A main motivation for my original analysis [1] was to of Species) or because they descend with conver- escape from the logical quagmire posed by frequentist gence from separate ancestral genes (i.e., are analo- null hypothesis tests and bring state-of-the-art probabil- gous). It is nevertheless possible that the restrictions istic methods (Bayesian and likelihoodist) to bear on the imposed by a functional fitness may cause sufficient question of UCA. My tests of common ancestry are very convergence to produce an apparent genetic related- pointedly conducted within a Bayesian and model-selec- ness. 
Therefore, the demonstration that two present- tion framework, eschewing frequentist null hypothesis day sequences are significantly similar, by either che- methodology — a point made very prominently in the mical or genetic criteria, still must necessarily leave original report. undecided the question whether their similarity is Nevertheless, in their criticism K&W state that they the result of a convergent process or all that remains “interpret these tests [of common ancestry] within the from a divergent process. For example, it is at least null hypothesis framework.” The reviewers for K&W’s philosophically possible to argue that fungal cyto- criticism similarly expect a null hypothesis in my ana- chromes c are not truly homologous to the lysis, ask whether I am using the Fisher or Neyman- metazoan cytochromes c, i.e., they just look Pearson methodology of null hypothesis testing, and homologous. claim that small P-values directly support a common ancestry hypothesis [2]. Due to these misconceptions Colin Patterson made a similar argument [39], expli- about both my model selection methodology and the citly pointing out that statistically significant sequence capabilities of frequentist null hypothesis tests, I find it similarity does not necessarily force the conclusion of necessary to explicitly lay out the advantages of the homology: former and the logical problems with latter. Those who are familiar with Bayesian inference and model … given that homologies are hypothetical, how do selection theory, and with the standard criticisms of we test them? How do we decide that an observed frequentist null hypothesis tests, may wish to skip to similarity is a valid inference of common ancestry? If the end of the Introduction where I specifically discuss similarity must be discriminated from homology, its K&W’s simulation. 
assessment (statistically significant or not, for exam- Frequentist hypothesis tests (of both Fisher and Ney- ple) is not necessarily synonymous with testing a man-Pearson flavors) have been highly criticized and hypothesis of homology. disparaged for many decades by professional statisticians [44-71]. These criticisms are well-known in statistics — How, then, would we know if highly similar biological they in fact currently reflect the consensus view — and sequences had independent origins or not? In all but the as such I make no claims of originality in the following. most trivial cases we do not have direct, independent I begin with a representative quote from Jeff Gill, cur- evidence for homology — rather, we conventionally rent Director of the Center for Applied Statistics at infer the answer based on some qualitative argument, WUSTL, writing for social scientists, but the criticism often involving sequence similarity as a premise. One applies generally: purpose of my original analysis [1] was to give this ver- bal argument a formal basis, using modern evolutionary The null hypothesis significance test (NHST) should models and probability theory. Perhaps the E-values not even exist, much less thrive as the dominant from Karlin-Altschul statistics have already solved this method for presenting statistical evidence in the problem from a statistical perspective [41-43]? As I social sciences. It is intellectually bankrupt and dee- explain in detail below, such is not the case. The most ply flawed on logical and practical grounds. More that a low E-value for sequence similarity (say, from a than a few authors have convincing demonstrated BLAST search) can do is show that the observed degree this [54]. of sequence similarity is greater than that expected by random chance — the assessment of homology, on the Gill goes on to list 33 high-profile supporting publica- other hand, is a distinct question. 
tions by mainstream statisticians from over the past half century. Furthermore, Karlin-Altschul tests for similarity miss a critical aspect of

E-values, P-values, and Karlin-Altschul statistics

A BLAST E-value is a Fisherian null hypothesis significance test [41-43,72]. With BLAST searches, the null hypothesis holds that the observed level of sequence similarity is an artefact generated by the optimal alignment of two random sequences [41-43,72].

To conduct a Fisherian null hypothesis significance test, a P-value is calculated based on a “null” distribution of some relevant statistic of the data. The P-value is interpreted as the weight of evidence against the null hypothesis; the smaller the P-value, the greater the evidence against the null. A P-value is defined as the probability of obtaining a value of the test statistic at least as extreme as the actual observed value, assuming that the null hypothesis is true. More formally, we say that the P-value = p(D|N), read as the conditional probability of D given (or assuming) N, where D is data equal to or more extreme than the observed value, and N is the null hypothesis.

In Karlin-Altschul statistics of sequence similarity, two sequences are optimally aligned by maximizing the similarity between them [41-43]. The similarity between two sequences is quantified by a similarity statistic, or similarity score, S. Conventionally, the similarity score is found by a weighted sum of the aligned positions between two sequences, in which the weights are given by a log-odds amino acid scoring matrix, such as the BLOSUM62 matrix [73]. If we take a large number of random sequences and align them (by maximizing the similarity score S), we will obtain a distribution of similarity scores. It has been found empirically that random similarity scores closely follow an extreme value distribution (EVD). An EVD has a bell-shaped curve, somewhat similar to a Gaussian, but is asymmetric with a longer right tail (see Figure 1).

For Karlin-Altschul statistics, then, the null distribution is the EVD of random similarity scores, where the similarity score is the appropriate test statistic. Imagine that we align two protein sequences and obtain the observed, optimal similarity score S for the alignment. The P-value is then the probability of that score or greater, as given by the EVD of random sequence alignments. This probability is also known as the “tail probability”, as it quantifies the probability in the rightmost tail of the EVD (shown as the shaded region of the extreme value distribution in Figure 1).

Figure 1. Extreme value probability distribution for similarity scores. A standard extreme value distribution (EVD) is shown. In Karlin-Altschul statistics, similarity scores (S) for alignments of random sequences are assumed to follow an EVD. P-values are based on the tail probability, shown as the shaded blue area, corresponding to the probability of observing a similarity score greater than or equal to the observed similarity score (S_o) for a given alignment of interest.

Note that there is a close relationship between P-values and E-values. An E-value is similarly defined as the average number of times we expect to obtain a value of the test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. For mostly historical reasons, the E-value is conventionally used more often in sequence similarity tests, like BLAST searches, and it can be directly calculated from the P-value:

E = -ln(1 - P)    (1)

For small values of the P-value (< 0.05), both measures are approximately equivalent. The important point here is that for our purposes E- and P-values are interchangeable, and all the arguments made below apply equally to both.

As mentioned above, a low P-value (or E-value) is conventionally considered to be a measure of evidence against the null hypothesis; the smaller the P-value, the more reason we have to believe that the null is false. The logic behind this interpretation of the P-value is neatly summarized by Ronald Fisher’s famous disjunction: Upon obtaining a small P-value, “Either an exceptionally rare chance has occurred, or the theory of random distribution is not true” (italics in original) [74]. For the moment, let us assume that this reasoning is valid: that a small P-value is indeed evidence against the null. We will revisit this assumption later.

the question of validity or legitimacy of creation science. Surely you realize that not being Mr Williams in no way entails being Mr Ayala! [78]

Clearly, I am not attorney Williams (with high statistical significance, P < 0.01), and just as clearly, the fact that we have established that I am not Williams does not imply that I am Dr Ayala. Likewise, not being an alignment of random sequences in no way entails being homologous.

If our aim is to establish my true identity, we could, one-by-one, rule out the possibilities that I am William Martin, or Eugene Koonin, or Yuri Wolf, or one of the other ~7 billion people on the planet. But wouldn’t it be useful if we had a method that instead could directly provide positive evidence that I am Douglas Theobald?
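The relationship in equation (1) can be checked numerically. The following is a minimal Python sketch, not taken from the article. It assumes the commonly cited Karlin-Altschul form E = K·m·n·exp(-λS) for the expected number of chance alignments scoring at least S between sequences of lengths m and n; the function names and the default values of `lam` and `k` are illustrative placeholders, not parameters reported in the text.

```python
import math

def evalue_from_pvalue(p):
    """Equation (1): E = -ln(1 - P)."""
    return -math.log1p(-p)

def pvalue_from_evalue(e):
    """Inverse of equation (1): P = 1 - exp(-E)."""
    return -math.expm1(-e)

def karlin_altschul_evalue(score, m, n, lam=0.267, k=0.041):
    """Assumed standard form E = K * m * n * exp(-lambda * S).
    lam and k are illustrative values, not taken from the article."""
    return k * m * n * math.exp(-lam * score)

def gumbel_tail(score, mu, beta):
    """Right-tail probability of an extreme value (Gumbel) distribution
    with location mu and scale beta: P(S >= score)."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))

# For small P the two measures nearly coincide; for large P they diverge.
for p in (0.001, 0.01, 0.05, 0.5, 0.99):
    print(f"P = {p:<5}  E = {evalue_from_pvalue(p):.4f}")

# Hypothetical alignment: score 80, query length 300, database size 1e6.
e = karlin_altschul_evalue(score=80, m=300, n=1_000_000)
print(f"E = {e:.3g}, P = {pvalue_from_evalue(e):.3g}")
```

Under these assumptions the tail probability of the EVD and the P-value obtained from E are the same quantity: setting `beta = 1/lam` and `mu = ln(k*m*n)/lam` in `gumbel_tail` reproduces `pvalue_from_evalue(e)`.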
Unfortunately, null hypothesis significance tests are Although the logical problems with null hypothesis incapable by design of providing evidence for any statistical tests are many and profound, here I recount hypothesis — null hypothesis tests are intended to only only three faults that are most pertinent to homology provide evidence against the null, no more nor less [75]. inference using a BLAST-style null hypothesis test for Hence null hypothesis tests, by their own frequentist sequence similarity based on Karlin-Altschul statistics. logic, cannot provide evidence for common ancestry. The False Dichotomy: Rejecting the null does not imply P-values cannot provide evidence for the null hypothesis acceptance of a favoured alternative hypothesis Furthermore, failing to reject the null, by obtaining a According to the logic of null hypothesis testing, a small large P-value (e.g., P ≫ 0.05), does not imply that the P-value allows us to reject the null hypothesis at some null is true [75]. As Sir Ronald Fisher, inventor of the P- specified “level of significance” [47,57,64,65,74-76]. By value and null hypothesis significance test, wrote: “It is a convention, a meaningful level of significance is often fallacy so well known as to be a standard example, to chosen as 0.05 or 0.01, though other values are fre- conclude from a test of significance that the null quently used depending on the application (e.g, PSI- hypothesis is thereby established;… “ (emphasis in origi- BLAST by default uses a 0.005 cut-off [77]). P-values nal) [74]. Similarly, Fisher wrote that “it should be noted smaller than the chosen significance level allow us to that the null hypothesis is never proved or established, “reject the null hypothesis” as false. 
Nevertheless, reject- but is possibly disproved, in the course of experimenta- ing the null hypothesis of an alignment of random tion” so that “experimenters … are prepared to ignore sequences is not logically equivalent to accepting com- all [insignificant] results” [79,80]. This is often stated as mon ancestry. This reasoning could be valid only if ‘ran- the maxim: “failing to reject the null does not mean domness’ and ‘common ancestry’ were mutually accepting the null”, or, more prosaically, “Thou shalt exclusive hypotheses, but the logical complement of ‘a not draw inferences from a nonsignficant result!” [81]. random alignment’ is ‘not a random alignment’, rather For example, according to the logic of null hypothesis than ‘common ancestry’. Non-random, significant significance testing, an E-value of 10 does not mean that sequence similarity can be due to many factors besides the sequence alignment is likely to be random — it sim- common ancestry. ply means that a random alignment could easily have The common belief that a small E-value indicates resulted in the observed level of sequence similarity. sequence homology (i.e., that the two sequences share a Clearly it would be beneficial to have a statistical common ancestor) is based on a false dichotomy. Dur- method that could provide evidence for the null and not ing the 1981 creationist trial in Little Rock, Arkansas, just evidence against it. the attorney for the State, David Williams, committed The “Prosecutor’s Fallacy” and pregnant women: the same logical fallacy by arguing that criticisms of evo- Improbable data does not imply an improbable hypothesis lutionary theory were evidence for creationism. Geneti- Up till now, we have assumed that the frequentist posi- cist Francisco J. Ayala, a witness for the plaintiffs, tion is correct, namely that Fisher’s disjunction has logi- corrected him: cal force and a small P-value is evidence against the null hypothesis. 
But even this key premise is fallacious, as it … negative criticisms of evolutionary theory, even if relies on a notorious error of probabilistic reasoning they carried some weight, are utterly irrelevant to known in law as the “Prosecutor’s Fallacy” [82,83]. A small P-value, which measures the improbability of the hypothesis significance tests, such as Fisherian P-values, data under the null, in fact does not imply that the null which were developed within the frequentist statistics is improbable. paradigm. The advantages of model selection methods The Prosecutor’s Fallacy arises from incorrectly infer- include a firm logical and theoretical basis and elegant ring that a cause is unlikely because the effect is unli- handling of model complexity, both of which make kely. A simple example may suffice to illuminate the them attractive for analyzing complex evolutionary mod- problem. A quick glance around should establish that, at els for biological sequence data. any given time, most women are not pregnant. In my The core idea is elegantly simple: Quantitatively calcu- state of Massachusetts, for example, the frequency of late how well different models explain the data (judged women that are pregnant is approximately 2% [84]. We either by the likelihood of the data or by the posterior can quantify this observation with a conditional prob- probability of a model), and compare the models to ability statement: the probability that a particular person each other. Alternative models compete against each is pregnant, given that the person is a woman, is 0.02. other head-to-head, and the observed data is the judge. Symbolically, we write p(P|W) = 0.02, where P is the From a likelihoodist point of view, such as when using proposition that “this person is pregnant” and W is the the AIC, the preferred model is just the model that proposition that “this person is a woman”.
Note that p explains the data best (by assigning it the highest prob- (P|W) is small — in fact small enough to be “statistically ability). From a Bayesian point of view, the preferred significant” by convention. Nevertheless, the small value model is the model that has the highest probability for p(P|W) does not imply that the inverse probability, given the data, within the set of models being evaluated. p(W|P), is also small. The probability that ‘this person is The critical part is in having explicit stochastic models a woman given that this person is pregnant’ is obviously (likelihood functions) for each of the hypotheses under much greater than 0.02! comparison. The same rules of probability necessarily apply to p(D| One great advantage of model selection methodology N) (the P-value) and the inverse probability p(N|D) (the is how the complexity of competing hypotheses is probability of the null hypothesis given the observed handled. When judging the explanatory power of com- data). Thus, a small P-value does not imply that the null peting hypotheses, two opposing factors must be hypothesis is unlikely. In fact, by itself, the P-value tells accounted for: parsimony and the fit to the observed us nothing at all regarding the probability of the null data. By increasing the number of parameters in a hypothesis. The only way to calculate the probability of model — i.e., by making the hypothesis more complex the null hypothesis, p(N|D), is by using Bayes theorem, — one can always improve the “goodness of fit” to the but frequentist methodology does not allow us to do data. For instance, when fitting a polynomial curve to a that. set of points, the discrepancy between the curve and the The reason why the Prosecutor’s Fallacy is false is data points can be minimized arbitrarily by increasing simply because true hypotheses routinely predict low- the order of the polynomial until it equals the number probability data. 
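The gap between p(P|W) and p(W|P), and between the probability of the data under the null and the probability of the null itself, can be worked through with Bayes' theorem. The sketch below is illustrative only. The value 0.02 comes from the text; the prior p(W) = 0.5, the value p(P|not W) = 0, and the two data probabilities in the second example are assumptions made up for the illustration.

```python
def posterior(prior_h, prob_data_given_h, prob_data_given_not_h):
    """Bayes' theorem for a hypothesis H and its complement:
    p(H|D) = p(D|H) p(H) / [p(D|H) p(H) + p(D|not H) p(not H)]."""
    evidence = (prior_h * prob_data_given_h
                + (1.0 - prior_h) * prob_data_given_not_h)
    return prior_h * prob_data_given_h / evidence

# Pregnancy example: p(P|W) = 0.02 is from the text.
p_pregnant_given_woman = 0.02
p_woman = 0.5                     # assumption: half the population are women
p_pregnant_given_not_woman = 0.0  # assumption for the illustration

p_woman_given_pregnant = posterior(
    p_woman, p_pregnant_given_woman, p_pregnant_given_not_woman
)
print(f"p(P|W) = {p_pregnant_given_woman}, p(W|P) = {p_woman_given_pregnant}")

# Hypothetical null-hypothesis case: the null assigns the data a small
# probability, but the only competing hypothesis assigns it an even
# smaller one, so the null remains the more probable hypothesis.
p_null_given_data = posterior(
    prior_h=0.5,                  # assumption: equal prior weight
    prob_data_given_h=0.01,       # hypothetical
    prob_data_given_not_h=0.001,  # hypothetical
)
print(f"p(N|D) = {p_null_given_data:.3f}")
```

The second calculation needs a prior and a stated alternative, which is exactly the information a P-value on its own does not carry.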
Observing a piece of data that is unli- of data points. However, simpler models are preferred, kely could be nothing more than that — unlikely things since increasing the complexity of the model increases happen all the time (like pregnant women or a winning the uncertainty in the estimates of the parameters. The lottery ticket). The null hypothesis may predict the fewer ad hoc parameters, the better — a principle infor- observed data with a small probability, resulting in a mally known as Occam’s Razor. P-values unfortunately small P-value, but if competing models are even worse have no way to account for model complexity. Model — that is, competing models predict the same data with selection methods, in contrast, weigh these two oppos- even lower probability — then why should we reject the ing factors (goodness-of-fit and complexity) to find the null? For these reasons, there is no logical reason to hypothesis that is jointly the most accurate and the think that the null hypothesis is likely to be false based most precise. solely on a small P-value; additional information is Model selection methodology solves all the problems required. with null hypothesis significance test P-values enumer- ated above. First, in model selection methodology, there A wayout: Bayes, likelihood, and model selection is no null hypothesis. All hypotheses are treated equally, Model selection methods, such as log likelihood ratios none is given special favoured status, and they compete (LLR), the Akaike Information Criterion (AIC), and head-to-head. There is no need to rely on a false dichot- Bayes factors [47], are now used routinely in modern omy, since multiple models are compared directly to biological research, especially in phylogenetics, genetics, each other. A hypothesis is “rejected” only if there is a and bioinformatics [69,85-95]. Model selection theory better alternative. 
Second, model selection scores pro- has largely been developed as an alternative to null vide the evidence for each hypothesis, not just against TheobaldBiologyDirect2011,6:60 Page8of25 http://www.biology-direct.com/content/6/1/60 the “null”. The complexity and ability of a model to model phylogenetic structure or different levels of explain the data is the evidence for and against it, when sequence similarity. K&W’s simulated data has no phy- compared to other competing hypotheses. Third, model logenetic signal in the sense that the sequences lack selection methodology does not fall prey to the Prosecu- nested patterns of similarities. For this artificial data set tor’s fallacy. A particular model A may predict the data generated by the profile model, the model selection tests with a small probability, but if no other hypothesis does choose common ancestry over independent ancestry. any better (after accounting for model complexity), then K&W make three distinct claims based on their simu- model A is the preferred hypothesis. Fortunately, in lated data: model selection theory pregnant women are not (1) the model selection tests prefer UCA solely due to rejected. the high similarity of the protein sequences, regardless In my formulation of the tests for UCA, independent of genealogical history, ancestry models are represented by multiple phyloge- (2) their simulation corresponds to a convergent, inde- netic trees, one tree for each group of taxa that is pendent ancestry model, and therefore the model selec- assumed to be genealogically related under this particu- tion tests err in choosing common ancestry for the lar model [1]. Unrelated taxa are found in different simulated sequences, trees. 
Because each tree is assumed to be independent (3) the demonstration of UCA is dependent on the (by definition of independent ancestry), the complete assumption that proteins with highly similar sequences probability of the data assuming an independent ances- share common ancestry. try model is simply the product of the probabilities of All of these claims are incorrect, and I consider each the several trees that compose it. For instance, consider of them in turn below. two independent, unrelated trees of taxa A and B, and the sequence evidence X. The total probability of the Results and Discussion sequence data X for the independent ancestry hypothesis Claim 1: The conclusion of common ancestry is primarily IA is given by the joint probability of the data from tree due to genealogical structure in the protein sequences, A and tree B: not mere similarity In their Abstract, K&W make their main claim: p(X|IA)=p(X|A&B)=p(X|A)p(X|B) (2) …the purported demonstration of the universal com- Common ancestry models are represented by includ- ing all taxa in unified trees. Once ‘common ancestry’ mon ancestry is a trivial consequence of significant and ‘independent ancestry’ are thus defined for a set of sequence similarity between the analyzed proteins. The nature and origin of this similarity are irrelevant taxa, it is straightforward to apply the model selection for the prediction of “common ancestry” of by the tests to determine which hypothesis is best. model comparison approach. Koonin and Wolf’s rebuttal: Common ancestry models Later they further explain that the model selection win in a simulation of similar sequences lacking results in favour of UCA are “simply a restatement of phylogenetic structure the fact that these proteins display a highly statistically In a recent criticism of my model selection tests of significant sequence similarity”. Reviewer William Mar- UCA, Koonin and Wolf argue that the test results do tin agrees: “They are absolutely right on this”. 
not support the conclusion of universal common ances- I present four different evidences demonstrating that try [1,2]. K&W support this claim by a simulation study K&W’s primary claim — that my results are indepen- in which similar sequences were randomly generated without any phylogenetic structure. In K&W’s simula- dent of phylogenetic history and are simply due to sequence similarity — is incorrect. The first piece of evi- tion, each column of a sequence alignment has a differ- dence is based on mathematical considerations of the ent, independent amino acid distribution (the amino theory behind the phylogenetic models used in the tests, acids are distributed according to a discrete distribution, while the final three are empirical. also called a categorical distribution [96,97]). They then I. Phylogenetic models involve more than mere sequence generated artificial sequences by randomly selecting amino acids from each column’s distribution. The sto- similarity Based on the theory underlying modern phylogenetic chastic model corresponding to this simulation I will call the “profile” model, due to its similarity to common models, we know that sequence similarity is only one component that indirectly affects the model selection sequence profiles [98-100]. The profile model can be tests [31,101,102]. Each of the models, whether common considered a star-tree in which each site has its own ancestry or independent ancestry, involve phylogenetic amino acid substitution matrix, and hence it cannot TheobaldBiologyDirect2011,6:60 Page9of25 http://www.biology-direct.com/content/6/1/60 trees. Trees are mathematical structures that can large test scores favouring UCA: account for both gross similarity (largely in the branch lengths) and subtle patterns of similarities (in the parti- … when comparing a common-ancestry model to a cular topology of the tree) [31,101,102]. 
It is this latter multiple-ancestry model, the large test scores are a component — the complex hierarchical patterns of simi- direct measure of the increase in our ability to accu- larities induced by a genealogical branching process — rately predict the sequence of a genealogically that phylogenetic trees can account for but methods like related protein relative to an unrelated protein. BLAST, which consider only overall sequence similarity, cannot. Later they claim that this sentence is “simply a restate- Consider the phylogenetic tree of six protein ment of the fact that these proteins display a highly sta- sequences shown in Figure 2. In standard Markovian tistically significant sequence similarity”. However, from phylogenetic models, such as the ones used in the the considerations given above, we know that a phyloge- model selection tests, the probability of the six netic model’s ability to predict a given sequence (e.g., sequences depends on the topology of the tree and its sequence D in Figure 2) from other sequences (e.g. branch lengths [31,101]. The sum of the branch lengths sequences A, B, C, E, F in Figure 2) is a function of the between two sequences is proportional to the number of tree topology and the patterns in the sequences, not evolutionary substitutions separating those two simply of the gross similarities among the sequences. sequences. In this sense, the total distance along the II. High sequence similarity is insufficient to force the tree between two sequences is a measure of their simi- conclusion of common ancestry larity. 
Importantly, if you change the topology, while If K&W’s hypothesis is correct — that the model selec- maintaining the same distance between two sequences, tion tests choose common ancestry simply due to or even maintaining the same total tree length, you gen- sequence similarity — then the tests should choose erally change the likelihood of the model (i.e., you common ancestry over independent ancestry for any set change the probability of the sequences). of sequences with highly statistically significant Now imagine that you replace these six sequences sequence similarity. K&W make this very claim in their with a different set that nevertheless has identical per- Discussion: “The likelihood tests of the kind described cent identity and similarity as the original set (perhaps by Theobald … yield results ‘in support of common as measured by an alignment score using a substitution ancestry’ for any sufficiently similar sequences.” How- matrix). In general this replacement will also change the ever, this prediction is directly contradicted by the likelihood, even if the topology and branch lengths are example given in the Supplementary material of my ori- held constant. This means that the likelihood of a tree is ginal Nature letter (section 4.3, pages 16-20 [1]). There not simply a function of the similarities of the sequences I present a simple case in which the model selection [31,101,102]. Rather, the likelihood of a phylogenetic tests choose independent ancestry for sequences with tree is also a function of the nested pattern of similari- highly significant similarity. Here I present another ties in the data. Therefore, the model selection tests, similar example, in which four sequences have highly which are explicitly based on likelihoods of trees, con- significant similarity to each other (in all possible pair- sider information in the sequences that cannot be wise comparisons, Table 1), and yet the model selection reduced to simple sequence similarity. 
tests prefer independent ancestry models over common Hence it is possible for highly similar sequences to ancestry (Table 2). Therefore, the model selection tests nevertheless have conflicting hierarchical structure that necessarily consider factors other than mere sequence does not fit well to a single, global tree. Common ances- similarity, and highly significant sequence similarity is try implies phylogenetic structure; Markovian character not sufficient for the model selection tests to choose evolution along a bifurcating tree results in hierarchical common ancestry. patterns of correlated character changes. If the phyloge- Why do the model selection tests favour independent netic structure in a set of highly similar sequences is origins for these sets of sequences, in spite of significant conflicting, then this is evidence against common ances- similarity? The answer is that these similar sequences try, and it could be evidence for independent ancestry have conflicting phylogenetic structure, and conflicting models that do not force the proteins into a global phy- phylogenetic correlations are unlikely to have been gen- logeny. This is one key advantage of my model-based erated by a common ancestry process. The known pre- tests over simple BLAST-type analyses that only look at sence of conflicting phylogenetic structure in the gross sequence similarity. universal protein data set I used [103] (as indicated by K&W quote what they call a “key sentence” from my suspected horizontal gene transfer events) was in fact a Nature letter where I explain the significance of the major motivation for my model selection tests of UCA TheobaldBiologyDirect2011,6:60 Page10of25 http://www.biology-direct.com/content/6/1/60 Figure2Examplephylogeny.Atoyphylogenyofsixsequences,representedbythelettersA-F. — it is possible that a high enough degree of conflicting III. 
Star tree models account for similarity, but not for phylogenetic structure could indicate that a particular genealogical structure in the real protein data independent ancestry hypothesis is a superior model for K&W suggest that the universal proteins in my dataset the universal protein data. could have been generated by a process completely
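
As an illustration of how equation (2) and the AIC can be combined in a head-to-head comparison of a common ancestry model against an independent ancestry model, the following Python sketch may help. It is a minimal sketch under stated assumptions, not the procedure used in [1]: the log-likelihoods and parameter counts are invented purely for illustration, the function names are hypothetical, and the standard definition AIC = 2k - 2 ln L (with k the number of free parameters and lower scores preferred) is assumed.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion, AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def independent_ancestry_log_likelihood(tree_log_likelihoods):
    """Equation (2) in log form: the product of the per-tree probabilities
    becomes a sum of the per-tree log-likelihoods."""
    return math.fsum(tree_log_likelihoods)


# Hypothetical values, chosen only to illustrate the arithmetic.
ln_l_common, k_common = -10500.0, 40   # one unified tree for all taxa
ln_l_tree_a, k_tree_a = -5600.0, 22    # separate tree for group A
ln_l_tree_b, k_tree_b = -5400.0, 22    # separate tree for group B

aic_common = aic(ln_l_common, k_common)
aic_independent = aic(
    independent_ancestry_log_likelihood([ln_l_tree_a, ln_l_tree_b]),
    k_tree_a + k_tree_b,  # parameters of independent trees add up
)

# Positive difference: the common ancestry model has the lower (better) AIC.
delta_aic = aic_independent - aic_common
preferred = "common ancestry" if delta_aic > 0 else "independent ancestry"

print(f"AIC (common ancestry):      {aic_common}")
print(f"AIC (independent ancestry): {aic_independent}")
print(f"Preferred model:            {preferred}")
```

Working in log-likelihoods avoids numerical underflow, since the probability of a long alignment is far too small to represent directly. With different hypothetical inputs, for instance sequences whose conflicting structure fits two separate trees much better than one, the same comparison would favour independent ancestry.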
