bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 Random Tanglegram Partitions (Random TaPas): An Alexandrian Approach to 2 the Cophylogenetic Gordian Knot 3 Juan Antonio Balbuena1, Óscar Alejandro Pérez-Escobar2, Cristina Llopis-Belenguer1, Isabel 4 Blasco-Costa3 5 1Ecology and Evolution of Symbionts Lab, Cavanilles Institute of Biodiversity and Evolutionary 6 Biology, University of Valencia, Official P.O. Box 22085, 46071 Valencia, Spain. E-mail: 7 [email protected]; 2Identification and Naming Department, Royal Botanic Gardens, Kew, 8 Richmond, TW9 3AB, U.K.; 3Department of Invertebrates, Natural History Museum of Geneva, 9 P.O. Box 6134, CH-1211 Geneva, Switzerland. 10 Abstract 11 Symbiosis is a key driver of evolutionary novelty and ecological diversity, but our 12 understanding of how macroevolutionary processes originate extant symbiotic associations is 13 still very incomplete. In this context, cophylogenetic tools are used to assess the congruence 14 between the phylogenies of two groups of organisms related by extant associations. If 15 phylogenetic congruence is higher than expected by chance, we conclude that there is 16 cophylogenetic signal in the system under study. However, none of the current cophylogenetic 17 methods quantifies the degree of cophylogenetic signal satisfactorily. We present a novel 18 approach, Random Tanglegram Partitions (Random TaPas) that applies a given global-fit 19 method to random partial tanglegrams of a fixed size to identify the associations, terminals and 20 nodes that maximize phylogenetic congruence. By means of simulations, we show that the 21 output value produced is inversely proportional to the number of cospeciation events employed 22 to build simulated tanglegrams. In addition, with time-calibrated trees, Random TaPas is also 23 efficient at distinguishing cospeciation from pseudocospeciation. Random TaPas can handle 24 large tanglegrams in affordable computational time and incorporates phylogenetic uncertainty in 25 the analyses. We demonstrate its application with two real examples. In both systems, Random 26 TaPas revealed low cophylogenetic signal, but mapping its variation onto the tanglegram 27 pointed two different coevolutionary processes accounting for this result. Finally, we suggest 1 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 28 that the recursive partitioning of the tanglegram buffers the effect of phylogenetic 29 nonindependence occurring in current global-fit methods and therefore Random TaPas is more 30 reliable to identify host-symbiont associations that contribute most to cophylogenetic signal 31 than regular global-fit methods. Random TaPas can be implemented in the public-domain 32 statistical software R with scripts provided herein. 33 Keywords: Symbiosis, Coevolution, Codiversification, Cophylogenetic Signal. 34 INTRODUCTION 35 Symbiosis is widespread throughout the tree of life and is considered as a key driver of 36 evolutionary novelty and ecological diversity (Moran 2006; Zook 2015). Because organisms do 37 not evolve in isolation, the evolutionary fate of symbiotic partners is intertwined at ecological 38 and evolutionary levels, but despite the centrality of symbiosis in evolutionary biology, our 39 understanding of how macroevolutionary processes originate extant symbiotic associations is 40 still very incomplete (Weber et al. 2017). However, the recent emergence of robust comparative 41 phylogenetic methods has expanded and facilitated research in this area (Hutchinson et al. 42 2017a). Cophylogeny, in particular, provides a quantitative framework to evaluate the 43 dependency of two evolutionary histories (Hutchinson et al. 2017b). This approach involves 44 some assessment of the congruence between the phylogenies of two groups of species or taxa 45 related by extant associations, where congruence quantifies the degree of both topological and 46 branch-length similarity (Page 2003). If such congruence is higher than expected by chance, it 47 is concluded that there is cophylogenetic signal in the system studied (Mendlová et al. 2012). 48 Although cophylogenetic signal was initially interpreted as evidence of high level of 49 cospeciation, it has been shown that other mechanisms can account for some degree of 50 topological congruence. Particularly, complete host-switching events (i.e., colonization of a new 51 host species followed by speciation) among closely related hosts can result in symbiont 52 diversification mimicking the tree topology of the host, a process that has been termed 53 pseudocospeciation (de Vienne et al. 2013). In any case, even if its causality cannot not be 54 determined, detecting cophylogenetic signal is highly relevant because it implies that 2 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 55 contemporary ecological associations among species have been the product of a coupled 56 evolutionary history such that ancestral forms of each species experienced and responded to 57 shared selection pressures (Hutchinson et al. 2017b). 58 The wide range of cophylogenetic methods currently available can be roughly 59 categorized as either event-based or global-fit (Hutchinson et al. 2017a), but none of them deal 60 with cophylogenetic signal satisfactorily. Event-based methods attempt to reconstruct the 61 coevolutionary history of the organisms involved by assigning costs to each type of event and 62 heuristically search for the solution(s) that minimize(s) the overall sum of costs (Charleston and 63 Libeskind-Hadas 2014). The problem is that cophylogenetic signal can be easily overestimated 64 because the default cost of cospeciation is assumed to be zero, which is often at odds with 65 empirical evidence (de Vienne et al. 2013). In addition, with large datasets, the approach 66 becomes computationally prohibitive and the influence of phylogenetic uncertainty is not 67 explicitly considered (i.e., the single input phylogenetic trees of host and symbionts are assumed 68 to represent the actual evolutionary relationships), which may lead to erroneous conclusions if 69 not all clades are well supported. Global-fit methods, for their part, assess the degree of 70 congruence between two phylogenies and can also identify the specific interactions that 71 contribute most to overall congruence (Balbuena et al. 2013). They can handle large datasets 72 economically in terms of computational time, and the effect of phylogenetic uncertainty can be 73 assessed (Pérez-Escobar et al. 2016). However, current methods such as PACo (Balbuena et al. 74 2013) or ParaFit (Legendre et al. 2002), provide statistical evidence of cophylogenetic signal, 75 but produce no directly interpretable statistic as for its strength and no explicit links with 76 coevolutionary events are made. 77 Therefore, additional work remains to be done in this domain. Particularly, the recent, 78 spectacular expansion of DNA sequencing and phylogenetic reconstruction is leading to 79 increasingly common scenarios involving species-rich trees and complex host-symbiont 80 interactions (e.g., Hutchinson et al. 2017b). Equating these elaborate systems to evolutionary 81 Gordian knots, we present herein an Alexandrian approach to cophylogenetic signal assessment. 82 Random Tanglegram Partitions (Random TaPas) applies a given global-fit method to random 3 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 83 partial tanglegrams of a fixed size to identify the associations, terminals and nodes that 84 maximize phylogenetic congruence. By means of simulations, we show that the output value 85 produced by Random TaPas is inversely proportional to the number of cospeciation events 86 employed to build the tanglegrams. In addition, with time-calibrated trees, Random TaPas is 87 also efficient at distinguishing cospeciation from pseudocospeciation. The method is also useful 88 to identify what host-symbiont associations contribute most to phylogenetic congruence because 89 the variation in cophylogenetic signal can be mapped onto the tanglegram. We illustrate its 90 application with two real datasets and show how it can incorporate phylogenetic uncertainty in 91 the analyses. R scripts (R Core Team 2018) to perform Random TaPas and a user’s guide are 92 also provided. 93 MATERIALS AND METHODS 94 The Random-TaPas algorithm 95 The starting point is a triple (H, S, A), where H and S represent the phylogenies of hosts and 96 symbionts, and A is a binary matrix with rows and columns corresponding to terminals in H and 97 S, respectively, in which extant associations between each terminal are coded as 1 and no 98 associations as 0. The triple is often represented graphically as a tanglegram (Fig. 1), in which H 99 and S usually are displayed face to face and their terminals are connected by lines reflecting the 100 associations encapsulated in A. Figure 1 provides a complete overview of the Random TaPas 101 algorithm. In short, for a large number of times (N), it selects n random unique one-to-one 102 associations, so that each taxon in H is associated to one, and only one, taxon in S, and vice 103 versa. (Since the aim is to find associations that maximize congruence, this is a condition to be 104 met by any perfect host-symbiont cospeciation scenario). Then a global-fit method is applied to 105 the partial tanglegram defined by the n associations and the statistic produced is saved. Finally, 106 a percentile p of the distribution of the N statistics generated is set and the frequency of 107 occurrence of each host-symbiont association of A included in p is computed. 108 The output of Random TaPas is a frequency distribution of host-symbiont associations 109 in which each value is expected to be directly proportional to the contribution of the host- 4 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 110 parasite association to the global pattern of phylogenetic congruence (Figure 1). The shape of 111 this distribution can be characterized by computing the variance to mean ratio (VMR) of the 112 frequencies, whose value is biologically informative. In a perfect cospeciation scenario, VMR 113 = 0 is expected because each host-symbiont association would contribute equally to the global 114 fit, yielding a uniform frequency distribution. Thus, cophylogenetic signal would be significant 115 and maximal. If the association between H and S were completely random, the frequency 116 distribution would follow a Poisson distribution with VMR = 1 and cophylogenetic signal 117 would be not significant. Finally, unequal contributions of the host-symbiont associations would 118 lead to skewed frequency distributions with VMR >> 1 and, if significant, VMR would be 119 inversely proportional to cophylogenetic signal. 120 Global-fit methods 121 Random TaPas can be applied in conjunction with any global-fit method. For the sake of 122 demonstration, we chose here two very different approaches: Procrustes Approach to 123 Cophylogeny (PACo) (Balbuena et al. 2013) and geodesic distances (GD) in tree space (Schardl 124 et al. 2008). PACo uses Procrustean superimposition of Euclidean embeddings of the 125 phylogenetic trees to assess phylogenetic congruence (Balbuena et al. 2013). The second 126 approach entails computing pairwise GDs between the partial phylogenies defined by n. Most 127 global-fit methods translate matrices of evolutionary distances into Euclidean space whose 128 dimensionality is higher than that of tree space (Holmes 2005). So the advantage of this 129 approach is that it skips the potential effect of this dimensional mismatch. In both methods, the 130 value of the statistic produced is inversely proportional to statistical congruence. 131 PACo was performed with package paco (Hutchinson et al. 2017b) employing patristic 132 distances as input. Its use with Random TaPas is aimed at finding the set of partial tanglegrams 133 that minimize the square sum of residuals (m2 ) of the Procrustes superposition of host and XY 134 symbiont spatial configurations in p. The use of GDs with Random TaPas attempts to find the 135 set of partial tanglegrams in p with minimal pairwise distances in tree space. These distances 5 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 136 were computed with function dist.multiPhylo of package distory in R (Chakerian and Holmes 137 2010). 138 Synthetic data 139 The operation of Random TaPas depends on three parameters: N, n and p (Fig. 1). We assessed 140 performance with different parameter combinations and the two aforementioned global fit 141 methods with 200 synthetic tanglegrams generated with CoRe-Gen (Keller-Schmidt et al. 2011). 142 For a given set of input parameters, the program provides a pair of resulting ultrametric trees, a 143 list of associations between the terminals and the number of coevolutionary events 144 (cospeciation, duplication, sorting, and complete host-switching) involved in the construction of 145 the tanglegram. Since it is not possible to accurately control a priori the output based on the 146 input parameters, we first generated a library of 1,000 tanglegrams. The trees were built 147 following a pure-speciation (Yule) model, which has been shown to describe adequately 148 empirical phylogenetic trees (Hagen et al., 2015). For each host-symbiont pair, we specified a 149 different random combination of input parameters sampled from a uniform distribution. The 150 parameters and sampling ranges were: number of generations (100-200), probability of 151 cospeciation event (0.2-1.0), probability of host-switching event (0.2-0.8) and probability of 152 choosing a host for speciation instead of a parasite (0.7-1.0). Because our method is intended for 153 a wider range of settings than those considered by Keller-Schmidt et al. (2011), the parameter 154 ranges are broader than those used originally. Thus, our resulting tanglegrams included 155 scenarios where cospeciation events ranged from very rare to very common. Since some 156 combinations of parameters yielded trees with few terminals (≤ 15) or numbers of host parasite 157 associations (≤ 20), we ran CoRe-Gen 1,274 times to obtain a set of 1,000 workable 158 tanglegrams. 159 Given that Random TaPas is specially intended for large tanglegrams, we chose for 160 subsequent analyses two sets of 100 tanglegrams each involving approximately 50 (mean = 161 50.5, median = 51, range: 47-54) and 100 (mean = 99.7, median = 100, range: 93-107) host- 162 symbiont associations, respectively. For convenience, these subsets will be henceforth referred 6 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 163 to as Set50 and Set100, respectively. Since some trees included terminal polytomies and GD 164 requires fully resolved trees, we added 0.01 arbitrary units of time to all terminal branches. A 165 limitation of Core-gen is that it only generates coevolutionary systems in which each symbiont 166 can be associated with only a single host. Given that many real-world scenarios do not conform 167 to this scenario (e.g., Example 2 below), we modeled the occurrence of colonization (aka host- 168 sharing) by a given symbiont of several hosts following Drinkwater et al. (2016). For each 169 triple, a rate of host sharing of 0%, 10% or 20% of the total available host species was set 170 randomly. Then colonization was simulated by selecting each terminal in S and allowing the 171 corresponding symbiont to colonize a random number x of additional hosts, {x ∈ (cid:0) ∣ 0 ≤ x ≤ R 172 × N }, where R represents the rate of host sharing for the given tanglegram and N the total h h 173 number of host terminals. 174 Since additive trees (i.e., phylograms) estimated from molecular sequence data are 175 commonly used in cophylogenetic analyses, we ran the simulations with the original ultrametric 176 trees and phylograms derived from them. The transformation of the ultrametric trees into 177 phylograms was done by multiplying their branch lengths by varying rates of molecular 178 substitution (Brown and Yang 2011; Paradis 2014), sampled from a log-normal distribution 179 with mean 0.01 substitutions/site/time unit (Brown and Yang 2011). 180 Simulations 181 We applied Random TaPas to Set50 and Set100 (both additive and ultrametric trees) using both 182 PACo and GD with all combinations of the following parameter values: N=104, n= 5, 10 and 20 183 (Set50), and n=10, 20 and 40 (Set100) (i.e., n representing about 10%, 20% and 40% of the total 184 associations), and p= 1% and 5%. In preliminary analyses, N= 105 was also tested with additive 185 trees in combination with the same arrays of n and p values but the results were very similar to 186 those obtained with N=104 (Supplementary Figures S1 and S2 in Supplementary Material). 187 Since the increased computing time resulting from using a larger N did not result in a detectable 188 improvement in performance of the method, N= 105 was not further considered. In simulations 189 involving additive trees, prior to running PACo the patristic distances were rendered Euclidean 7 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 190 by taking the element-wise square root of each cell in order to minimize geometric distortion 191 and avoid negative eigenvalues (de Vienne et al. 2011). 192 For each simulation, the VMR of the frequency distribution of host-symbiont 193 associations produced by Random TaPas was computed. We assessed the relationship between 194 the VMRs of each set of simulations and the number of coevolutionary events (cospeciation, 195 duplication, sorting, host-switching and colonization) associated to each tanglegram in two 196 ways. First, regression trees were built with a recursive-partitioning function of package rpart in 197 R (Therneau and Atkinson 2018). This approach is indicated to provide a general picture of the 198 structure of the data and reveal complex interactions between variables (Crawley 2013). The 199 importance of each coevolutionary event in each tree model (scaled to sum 1) was computed 200 according to Therneau and Atkinson (2018). In addition, the coefficient of determination of 201 each model (R2) was computed as the 1–relative error at the last split (Therneau and Atkinson 202 2018). Second, we computed the correlation coefficients between VMR of frequency 203 distributions and the numbers of each of the coevolutionary events. 204 Pseudocospeciation experiment 205 In order to assess the ability of Random TaPas to distinguish between cospeciation and 206 pseudocospeciation (de Vienne et al. 2007), we selected a triple (#63) in Set50 with high 207 cospeciation signal. This simulated host-symbiont system resulted from 50 cospeciations, 1 208 sorting, 2 duplications, 1 complete host-switching and 0 colonization events. We arbitrarily 209 selected a clade of 17 terminals to simulate rapid colonization and speciation of the symbiont 210 lineage (de Vienne et al. 2007). In both the ultrametric and additive trees versions of the 211 tanglegram, the branch lengths of the clade were shortened to half their original length, whereas 212 its basal branch was lengthened to keep the original height of the clade. We applied Random 213 TaPas to the original and modified tanglegrams using GD, N = 104, p = 5% and n = 5, 10 and 214 20. The ability of Random TaPas to distinguish cospeciation from pseudocospeciation was 215 assessed by plotting as a heatmap on the tanglegram, the frequency of occurrence of each host- 216 symbiont link and the average frequency of occurrence of each terminal in p. 8 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 217 Biological Examples 218 To gain further insight on its performance and provide further guidance to prospective users, we 219 demonstrate the application of Random TaPas to two real-world examples. The trees, 220 association matrices and R scripts used with each example are available at 221 https://github.com/Ligophorus/RandomTaPas. 222 Example 1: Neotropical orchids and their euglossine pollinators.— We used the association 223 data and published chronograms of pollinated orchids and their corresponding bee pollinators 224 (Ramírez et al. 2011). The posterior probability trees of orchids and euglossine bees to build the 225 consensus were kindly supplied by Santiago Ramírez, University of California Davis. 226 Random TaPas was run with GD and PACo, N = 104, n = 29 and p= 5%. This example 227 involves 129 bee-orchid associations. The n value chosen was the maximum number of one-to- 228 one associations that could be formed over the 104 runs, and it is close to the 20% we 229 recommend based on results of the preceding simulations. In order to account for taxonomic 230 uncertainty, we computed 95% confidence intervals of the host-symbiont frequencies of each 231 association using a random sample (excluding the burn-in set) of 1,000 pairs of the posterior- 232 probability trees used to build the consensus trees of euglossine bees and orchids. Random 233 TaPas was run for each tree pair as specified above and the 2.5% and 97.25% percentiles of 234 each host-symbiont association frequency were computed. 235 We also assessed the different contribution of each host-symbiont association to the 236 global cospeciation signal by displaying as heatmap on the tanglegram their frequency of 237 occurrence and the average frequency of occurrence of each terminal occurring in p. Taking the 238 latter as a continuous trait, fast maximum likelihood estimators of ancestral states were 239 computed with fastAnc of package phytools in R (Revell 2012), and their values were displayed 240 on the nodes of the phylogeny as a color scale. We adopted this approach only to assess 241 different levels of cophylogenetic signal across the tanglegram and by no means imply that we 242 consider these estimators to reflect actual ancestral states of host-symbiont associations. In 243 addition, we used negative binomial regressions (glm.nb function from the MASS package in R, 9 bioRxiv preprint doi: https://doi.org/10.1101/481846; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 244 Venables and Ripley 2002) to test for differences in average frequency of occurrence of each 245 terminal in p among three main clades that represent each an orchid subtribe. 246 Example 2: Coitocaecum parvum and its amphipod host.— Data of mitochondrial haplotypes of 247 the cytochrome oxidase subunit 1 of the trematode, Coitocaecum parvum (Crowcroft, 1945), 248 and those of its amphipod host, Paracalliope fluviatilis (Thomson, 1879), in seven locations in 249 South Island, New Zealand from Lagrue et al. (2016). This example is worked out in the 250 Random TaPas User’s Guide (available at https://github.com/Ligophorus/RandomTaPas). 251 Random TaPas was run with GD and PACo, N = 104, n = 8 and p = 5% to evaluate the 252 extent to which genetic differentiation among populations was coupled with local adaptation 253 patterns (Lagrue et al. 2016). We chose n= 8 because it is a good compromise between the 254 number of unique combinations of unique one-to-one associations that can be evaluated over 255 104 runs and the fraction of total number of host-symbiont associations (8 of 75, i.e., ≈ 10%). 256 (The accompanying R script and the user’s guide demonstrates the strategy followed to choose 257 the best n for this example). PACo by default assumes dependency of one phylogeny upon the 258 other. Since in this case codivergence between host and parasite lineages was assumed to be 259 reciprocal, PACo was run in symmetric mode (Hutchinson et al. 2017a). Estimation of 260 confidence intervals of the host-symbiont frequencies and mapping of results on the tanglegram 261 was performed as in Example 1. 262 RESULTS 263 Simulations 264 Not all simulations could be performed because, in some tanglegrams, the number of possible 265 unique one-to-one associations between terminals was less than the number n set in simulations. 266 So the number of tanglegrams produced in Set50 with n = 20, was 79; and those in Set100 with 267 n=20 and n=40 was 99 and 63, respectively. 268 Forty-eight regression tree models describing the relationship between VMR of host- 269 symbiont frequency distributions and the number of evolutionary events of tanglegrams 270 involving the additive and the original ultrametric trees with all combinations of n and p and 10
Description: