Selecting optimal minimum spanning trees that share a topological correspondence with phylogenetic trees. Prabhav Kalaghatgi Thomas Lengauer Max Planck Institute for Informatics Max Planck Institute for Informatics Saarbru¨cken Saarbru¨cken 7 [email protected] [email protected] 1 0 2 n Abstract a Choi et al. (2011) introduced a minimum spanning tree (MST)-based method called CLGrouping, J forconstructingtree-structuredprobabilisticgraphicalmodels,astatisticalframeworkthatiscommonly 1 used for inferring phylogenetic trees. While CLGrouping works correctly if there is a unique MST, we 1 observe an indeterminacy in the method in the case that there are multiple MSTs. In this work we ] remove this indeterminacy by introducing so-called vertex-ranked MSTs. We note that the effectiveness O of CLGrouping is inversely related to the number of leaves in the MST. This motivates the problem of C findingavertex-rankedMSTwiththeminimumnumberofleaves(MLVRMST).Weprovideapolynomial timealgorithmfortheMLVRMSTproblem,andproveitscorrectnessforgraphswhoseedgesareweighted . h with tree-additive distances. t a m 1 Introduction [ 1 Phylogenetic trees are commonly modeled as tree-structured probabilistic graphical models with two types v of vertices: labeled vertices that represent observed taxa, and hidden vertices that represent unobserved 4 ancestors. The length of each edge in a phylogenetic tree quantifies evolutionary distance. If the set of 4 taxa under consideration contain ancestor-descendant pairs, then the phylogenetic tree has labeled internal 8 vertices, and is called a generally labeled tree (Kalaghatgi et al., 2016). The data that is used to infer the 2 0 topology and edge lengths is usually available in the form of gene or protein sequences. . Popular distance-based methods like neighbor joining (NJ; Saitou and Nei (1987)) and BIONJ (Gascuel, 1 0 1997) construct phylogenetic trees from estimates of the evolutionary distance between each pair of taxa. 7 Choi et al. (2011) introduced a distance-basedmethod called Chow-Liu grouping (CLGrouping). Choi et al. 1 (2011) argue that CLGrouping is more accurate than NJ at reconstructing phylogenetic trees with large : v diameter. The diameter of tree is the number of edges in the longest path of the tree. i CLGrouping operates in two phases. The first phase constructs a distance graph G which is a complete X graph over the labeled vertices where each edge is weighted with the distance between each pair of labeled r a vertices. Subsequently a minimum spanning tree (MST) of G is constructed. In the second phase, for each internalvertexv oftheMST,thevertexsetV consistingofv anditsneighborsisconstructed. Subsequently i i i a generally labeled tree T over V is inferred using a distance-based tree construction method like NJ. The i i subtree in the MST that is induced by V is replaced with T . i i Distances are said to be additive in a tree T if the distance between each pair of vertices u and v is equal to the sum of lengths of edges that lie on the path in T between u and v. Consider the set of all phylogenetic trees T such that the edge length of each edge in each tree in T is strictly greater than zero. A distance-based tree reconstruction method is said to be consistent if for each {D,T|T ∈ T} such that D is additive in T, the tree that is reconstructed using D is identical to T. Please note the following well-known result regarding the correspondence between trees and additive distances. Considering all trees in T, if D is additive in a tree T then T is unique (Buneman, 1971). 1 We show that if G has multiple MSTs then CLGrouping is not necessarily consistent. We show that therealwaysexistsanMSTM suchthatCLGroupingreturnsthecorrecttreewhenM isusedinthesecond phase of CLGrouping. We show that M can be constructed by assigning ranks to the vertices in G, and by modifying standard MST construction algorithms such that edges are compared on the basis of both edge weight and ranks of the incident vertices. The MSTs that are constructed in this manner are called vertex-ranked MSTs. Givenadistancegraph,theremaybemultiplevertex-rankedMSTswithvastlydifferentnumberofleaves. Huanget al.(2014)showedthatCLGroupingaffordsahighdegreeofparallelism,because,phylogenetictree reconstructionforeachvertexgroupcanbeperformedindependently. Withrespecttoparallelism,wedefine an optimal vertex-ranked MST for CLGrouping to be a vertex-ranked MST with the maximum number of vertex groups, and equivalently, the minimum number of leaves. We developed an O(n2logn) time algorithm Algo. 1 that takes as input a distance graph and outputs a vertex-ranked MST with the minimum number of leaves (MLVRMST). The proof of correctness of Algo. 1 assumes that the edges in the distance graph are weighted with tree-additive distances. 2 Terminology A phylogenetic tree is an undirected edge-weighted acyclic graph with two types of vertices: labeled vertices that represent observed taxa, and hidden vertices that represent unobserved taxa. Information, e.g., in the formofgenomicsequences,isonlypresentatlabeledvertices. Werefertotheedgeweightsofaphylogenetic tree as edge lengths. The length of an edge quantifies the estimated evolutionary distance between the sequences corresponding to the respective incident vertices. All edge lengths are strictly positive. Trees are leaf-labeledifallthelabeledverticesareleaves. Leaf-labeledphylogenetictreesarethemostcommonlyused models of evolutionary relationships. Generally labeled trees are phylogenetic trees whose internal vertices may be labeled, and are appropriate when ancestor-descendant relationships may be present in the sampled taxa (Kalaghatgi et al., 2016). Each edge in a phylogenetic tree partitions the set of all labeled vertices into two disjoint sets which are referred to as the split of the edge. The two disjoint sets are called to the sides of the split. A phylogenetic tree can be rooted by adding a hidden vertex (the root) to the tree, removing an edge e in the tree, and adding edges between the root and the vertices that were previously incident to e. Edge lengths for the newly added edges must be positive numbers and must sum up to the edge length of the previously removed edge. Rooting a tree constructs a directed acyclic graph in which each edge is directed away from the root. A leaf-labeled phylogenetic tree is clock-like if the tree can be rooted in such a way that all leaves are equidistant from the root. Among all leaf-labeled phylogenetic trees, maximally balanced trees and caterpillar trees have the smallest and largest diameter, respectively, where the diameter of a tree is defined as the number of edges along the longest path in the tree. ThedistancegraphGofaphylogenetictreeT istheedge-weightedcompletegraphwhoseverticesarethe labeled vertices of T. The weight of each edge in G is equal to the length of the path in T that connects the corresponding vertices that are incident to the edge. A minimum spanning tree (MST) of an edge-weighted graph is a tree that spans all the vertices of the graph, and has the minimum sum of edge weights. 3 Chow-Liu grouping Choietal.(2011)introducedtheprocedureChow-Liugrouping(CLGrouping)fortheefficientreconstruction of phylogenetic trees from estimates of evolutionary distances. If the input distances are additive in the phylogenetic tree T then the authors claim that CLGrouping correctly reconstructs T. CLGroupingconsistsoftwostages. Inthefirststage,anMSTM ofGisconstructed. Inthesecondstage, for each internal vertex v, a vertex group Nb(v) is defined as follows: Nb(v) is the set containing v and all the vertices in M that are adjacent to v. For each vertex group, a phylogenetic tree T is constructed using v 2 distances between vertices in Nb(v). Subsequently, the graph in M that is induced by Nb(v) is replaced by T (see Fig. 1e for an illustration). T may contain hidden vertices which may now be in the neighborhood v v of an internal vertex w that has not been visited as yet. If this the case, then we need an estimate of the distance between the newly introduced hidden vertices and vertices in Nb(w). Let h be the hidden vertex v that was introduced when processing the internal vertex v. The distance from h to a vertex k ∈ Nb(w) is v estimated using the following formula, d =d −d . hvk vk vhv The order in which the internal vertices are visited is not specified by the authors and does not seem to be important. CLGrouping terminates once all the internal vertices of M have been visited. This procedure is called Chow-Liu grouping because the MSTs that are constructed using additive dis- tances are equivalent to Chow-Liu trees (Chow and Liu, 1968), for certain probability distributions. Please read Choi et al. (2011) for further detail. 4 Indeterminacy of CLGrouping CLGrouping is not necessarily consistent if there are multiple MSTs. We demonstrate this with the phy- logenetic tree T shown in Fig. 1a. For the corresponding distance graph G of T (see Fig. 1b), two MSTs of G, M and M are shown in Fig. 1c and Fig. 1d, respectively. The intermediate steps, and the final 1 2 result of applying CLGrouping to M and M are shown in Fig. 1e and Fig. 1f, respectively. CLGrouping 1 2 reconstructs the original phylogenetic tree if it is applied to M but not if it is applied to M . 1 2 The notion of a surrogate vertex is central to proving the correctness of CLGrouping. The surrogate vertex of a hidden vertex is the closest labeled vertex, w.r.t. distances defined on the phylogenetic tree. CLGroupingwillreconstructthecorrectphylogenetictreeonlyiftheMSTcanbeconstructedbycontracting all the edges along the path between each hidden vertex and its surrogate vertex. Since the procedure that constructs the MST is not aware of the true phylogenetic tree, the surrogate vertex of each hidden vertex must selected implicitly. In the example shown earlier, M can be constructed by contracting the edges 1 (h ,l ), and (h ,l ). Clearly there is no selection of surrogate vertices such that M can be constructed by 1 1 2 3 2 contracting the path between each hidden vertex and the corresponding surrogate vertex. If there are multiple labeled vertices each of which is closest to a hidden vertex then Choi et al. (2011) assume that the corresponding surrogate vertex is implicitly selected using the following tie-breaking rule. Let the surrogate vertex set Sg(h) of a vertex h be the set of all labeled vertices that are closest to h. If l and l belong to both Sg(h ) and Sg(h ), then the same labeled vertex (either l or l ) is selected 1 2 1 1 1 2 as the surrogate vertex of both h and h . This rule for selecting surrogate vertices cannot be consistently 1 2 applied across all hidden vertices. We demonstrate this with an example. For the tree shown in Fig. 2 we have Sg(h ) = {l ,l }, Sg(h ) = {l ,l }, and Sg(h )={l ,l ,l ,l ,l }. It is clear that there is no selection 1 1 2 2 4 5 3 1 2 3 4 5 of surrogate vertices that satisfies the tie-breaking rule. 5 Ensuring the consistency of CLGrouping In order to construct an MST that is guaranteed to have the desired topological correspondence with the phylogenetic tree, we propose the following tie-breaking rule for selecting the surrogate vertex. Let there be a total order over the set of all labeled vertices. Let R(l) be the rank of vertex l that is given by the order. We define the surrogate vertex Sg(h) of h to be the highest ranked labeled vertex among the set of labeled vertices that are closest to h. That is, Definition 1. Sg(h)= min R(l) ,where, l∈Sg(h) Sg(h)= min d . lh l∈L(T) The inverse surrogate set Sg−1(l) is the set of all hidden vertices whose surrogate vertex is l. 3 A l1 l3 B l1 44 l2 2 1 1 4 4 l2 2 h1 h2 1 l4 l3 4 2 l4 Phylogenetic tree T The complete graph G constructed using distances that are additive in T C D l2 l1 l3 l4 l1 l3 l4 l2 4 4 2 4 2 4 An MST M1 of G An MST M2 of G E F l2 l1 l3 l4 l1 l3 l4 l2 l24 l34 l344 l442 l1l2 l1l3 l1 2 2l3 l4 l1 3 1l4 l2 l2 2 h1 l3 1 h1 ll43h221l23 ll42h131l42 l1 2 1 l3 l1 3 3 l2 1 l2 2h1 h21 l4 l3 1 h11 l4 Correct tree constructed by Incorrect tree constructed by applying CLGrouping to M1 applying CLGrouping to M2 Figure 1: The example used to demonstrate that CLGrouping may not reconstruct the correct tree if there are multiple MSTs. The phylogenetic tree T that is used in this example is shown in panel a. The distance graph G of T is shown in panel b. Two MSTs of G, M and M , respectively, are shown in panels c and 1 2 d. Panels e and f show the intermediate steps and the final result of applying CLGrouping to M and M 1 2 respectively. CLGrouping reconstructs the original phylogenetic tree if it is applied to M , but not if it is 1 applied to M . 2 l1 l5 1 1 1 h1 1 1 h2 1 l2 h3 l4 2 l3 Figure 2: The phylogenetic tree that is used to demonstrate that the tie-breaking rule as defined by Choi et al. (2011) cannot be applied in general. In order to ensure that the surrogate vertices are selected on the basis of both distance from the corre- sponding hidden vertex and vertex rank, it is necessary that information pertaining to vertex rank is used when selecting the edges of the MST. We use Kruskal’s algorithm (Kruskal, 1956) for constructing the de- sired MST. Since Kruskal’s algorithm takes as input a set of edges sorted w.r.t. edge weight, we modify the 4 input by sorting edges with respect to edge weight and vertex rank as follows. It is easy to modify other algorithms for constructing MSTs in such a way that vertex rank is taken into account. Definition 2. Wedefinebelow,whatismeantbysortingedgesonthebasisofedgeweightandvertexrank. Given a edge set E, and a ranking R over vertices in E, let d(u,v) be the weight of the edge {u,v}, and let R(u) be the rank of the vertex u. Let the relative position of each pair of edges in the list of sorted edges be defined using the total order <. That is to say, for each pair of edges {a,b} and {c,d}, {a,b}<{c,d}, if and only if (i) d(a,b)<d(c,d), or if (ii) d(a,b)=d(c,d) and min{R(a),R(b)}<min{R(c),R(d)}, or if (iii)d(a,b)=d(c,d) and min{R(a),R(b)}=min{R(c),R(d)} and max{R(a),R(b)}<max{R(c),R(d)}. TheMSTthatisconstructedbyapplyingKruskal’salgorithmtotheedgesthatareorderedwithrespect to weight and vertex rank is called a vertex-ranked MST (VRMST). Now, we will prove Lemma 1, which is used to prove the correctness of CLGrouping. Lemma1. Adaptedfromparts(i)and(ii)ofLemma8inChoietal.(2011). GivenaphylogenetictreeT and a ranking R over the labeled vertices in T, let G be the distance graph that corresponds to T =(V ,E ) and T T let E be the list of edges of G sorted with respect to edge weight and vertex rank, as defined in Definition 2. ≤ Let M =(V ,E ) be the VRMST that is constructed by applying Kruskal’s algorithm to E . The surrogate M M ≤ vertex of each hidden vertex is defined with respect to distance and vertex rank as given in Definition 1. M is related to T as follows. (i) If j ∈V and h∈Sg−1(j) s.t. h(cid:54)=j, then every vertex in the path in T that connects j and h belongs M to the inverse surrogate set Sg−1(j). (ii) For any two vertices that are adjacent in T, their surrogate vertices, if distinct, are adjacent in M, i.e., for all i,j ∈V with Sg(i)(cid:54)=Sg(j), T {i,j}∈E ⇒{Sg(i),Sg(j)}∈E T M Proof 1. First we will prove Lemma 1 part (i) by contradiction. Assume that there is a vertex u on the path between h and j, such that Sg(u) = k (cid:54)= j. We have d ≤ d (equality holds only if R(k) < R(j)). Similarly, since Sg(h) = j, we have d ≤ d (equality uk uj hj hk holds only if R(j)<R(k)) We consider all eight positions of k w.r.t. h,u, and j (see Fig. 3). For case 1 we have d ≤d (since Sg(h)=j) hj hk ⇔d +d +d ≤d +d hl lu uj hl hk ⇔d +d ≤d lu uj lk ⇔d <d +d uj ul lk ⇔d <d (contradiction since Sg(u)=k). uj uk For case 2 we have d ≤d (equality holds only if R(j)<R(k)) hj hk ⇔d +d +d ≤d +d +d hu ul lj hu ul lk ⇔d +d ≤d +d ul lj ul lk ⇔d ≤d (contradiction since Sg(u)=k). uj uk 5 k k Case 1 Case 5 h l u j h u j k Case 2 Case 6 h k u j h u l j Case 7 h u k j Case 3 k h u j Case 8 Case 4 h u=k j h u j k Figure 3: The cases that were considered in the proof of Lemma 1 part (ii). For some phylogenetic tree T let j be a labeled vertex and let h be a hidden vertex in the inverse surrogate set of j. u is a vertex in the path between h and j. Each case specifies one of the eight possible positions of a labeled vertex k w.r.t h,u, and j. Hidden vertices are represented with white circles and labeled vertices are represented with black circles. Each dashed line represents a path between the two vertices at its end points. For case 3 we have d ≤d hj hk ⇔d +d ≤d hu uj hk ⇔d <d +d uj hk hu ⇔d <d (contradiction since Sg(u)=k). uj uk For case 4 we have d =d +d (see Fig. 3 case 4) uk uj jk ⇔d >d (contradiction since Sg(u)=k). uk uj For case 5 we have d ≤d (equality holds only if R(j)<R(k)) hj hk ⇔d +d ≤d +d hu uj hu uk ⇔d ≤d (contradiction since Sg(u)=k). uj uk For cases 6,7, and 8, we have d ≤d hj hk ⇔d +d ≤d hk kj hk ⇔d <d (contradiction). hk hk Now we will prove part (ii) of Lemma 1. Consider the edge {i,j} in E such that Sg(i) (cid:54)= Sg(j). Let V T i and V be the sides of the split that is induced by the edge {i,j}, such that V and V contain i and j, j i j respectively. Let L and L be sets of labeled vertices that are defined as V ∩V and V ∩V respectively. i j i M j M 6 From part (i) of Lemma 1 we know that Sg(i) ∈ L and Sg(j) ∈ L . Additionally, for any k ∈ L \{Sg(i)} i j i and l∈L \{Sg(j)}, from the definition of surrogate vertex it follows that j d ≥d (equality holds only if R(Sg(i))<R(k)) ki Sg(i)i d ≥d (equality holds only if R(Sg(j))<R(l)) lj Sg(j)j d =d +d +d kj ki ij lj ≥d +d +d Sg(i)i ij Sg(j)j =d . Sg(i)Sg(j) It is clear that min{R(k),R(l)}>min{R(Sg(i)),R(Sg(j))}, (1) and that d ≥d . (2) kl Sg(i)Sg(j) The cut property of MSTs states that given a graph G=(V,E) for each pair V ,V of disjoint sets such 1 2 that V ∪V = V, each MST of G contains one of the smallest edges (w.r.t. edge weight) which have one 1 2 end-point in V and the other end-point in V . 1 2 Note that the vertex-ranked MST M is constructed using edges that are sorted w.r.t. edge weight and the vertex rank R. From equations (1) and (2) it is clear that among all edges with one end point in L and i theotherend-pointinL , theedge{Sg(i),Sg(j)}isthesmallestedgew.r.tedgeweightandvertexrank(see j Definition 2). Since L and L are disjoint sets and L ∪L =V , it follows that {Sg(i),Sg(j)}∈E . i j i j M M CLGrouping can be shown to be correct using Lemma 1 and the rest of the proof that was provided by Choietal.(2011). Thusifthedistancesareadditiveinthemodeltree,CLGroupingwillprovablyreconstruct the model tree provided that the MST that is used by CLGrouping is a vertex-ranked MST (VRMST). The authors of CLGrouping provide a matlab implementation of their algorithm. Their implementation reconstructs the model tree even if there are multiple MSTs in the underlying distance graph. The authors’ implementation takes as input a distance matrix which has the following property: the row index, and the column index of each labeled vertex is equal. The MST that is constructed in the authors implementation is a vertex-ranked MST, with the rank of each vertex being equal to the corresponding row index of the labeled vertex. We implemented their algorithm in python with no particular order over the input distances and were surprised to find out that the reconstructed tree differed from the model tree, even if the input distances were additive in the model tree. Depending on the phylogenetic tree, there may be multiple corresponding vertex-ranked MSTs with vastly different numbers of leaves. In the next section we discuss the impact of the number of leaves in a vertex-ranked MST, on the efficiency of parallel implementations of CLGrouping. 6 Relating the number of leaves in a VRMST to the optimality of the VRMST in the context of CLGrouping Inthecontextofparallelprogramming,Huangetal.(2014)showedthatitispossibletoparallelizeCLGroup- ing by independently constructing phylogenetic trees for each vertex group, and later combining them in order to construct the full phylogenetic tree. In order to relate the balancedness of a phylogenetic tree to the number of leaves in a corresponding vertex-ranked MST, we consider clock-like caterpillar trees and maximally balanced trees such that each hidden vertex of each tree has degree three. Consider the case in which the phylogenetic tree is a caterpillar tree (least balanced). There exists a correspondingVRMSTwhichhasastartopologythatcanbeconstructedbycontractingedgesbetweeneach 7 l4 l3 l2 a root l5 l1 l8 b h6 l6 l7 l4 l5 root h5 l3 l6 l2 Branch length h2h1 h4 ll31 Branch length h1h2 h6 ll12 l1 l7 ll68 l4 h5 l2 l5 h3 h3 l5 h4 l4 l8 l8 l3 l7 l1 l2 l3 l4 l5 l6 l7 l8 l6 l1 l2 l3 l4 l5 l6 l7 l8 l7 Figure4: Bothpanelsshowclock-likephylogenetictreesandVRMSTswiththemaximumandtheminimum numberofleaves,thatareconstructedbycontractingcorrespondingedgesthatarehighlightedinorangeand blue, respectively. The difference between the maximum and the minimum number of leaves in VRMSTs is largest for the caterpillar tree shown in panel a, and smallest for the maximally balanced tree shown in panel b. hidden vertex and one labeled vertex that is in the surrogate vertex set of each hidden vertex (see Fig 4a). A star-shaped VRMST has only one vertex group, comprising all the vertices in the VRMST, and does not afford any parallelism. Instead, if the VRMST was to be constructed by contracting edges between each hidden vertex h and a labeled vertex that is incident to h, then the number of the vertex groups would be n−2, where n is the number of vertices in the phylogenetic tree. The resulting VRMST would have the minimum number of leaves (two). With respect to parallelism, an optimal vertex-ranked MST for CLGrouping is a vertex-ranked MST with the maximum number of vertex groups, and equivalently, the minimum number of leaves. Consider a phylogenetic tree T =(V ,E ) which is maximally balanced. It is clear that the set L(T) of T T labeled vertices of T can be partitioned into a disjoint set C of vertex pairs such that for each vertex pair {u,v}∈C, u and v are adjacent to the same hidden vertex h∈V . Given a vertex ranking R, the surrogate T vertex of h will be max R(l). Thus, independently of vertex ranking, the number of distinct surrogate l∈{u,v} vertices will be L(T)/2. Each labeled vertex that is not selected as a surrogate vertex will be a leaf in the vertex-ranked MST. It follows that all corresponding VRMSTs of T will have L(T)/2 leaves (see Fig 4b). Whether or not the phylogenetic trees that are estimated from real data are clock-like depends on the set of taxa that are being studied. Genetic sequences that are sampled from closely related taxa have been estimatedtoundergosubstitutionsatasimilarrate,resultinginclock-likephylogenetictrees(dosReisetal., 2016). In the context of evolution, trees are caterpillar-like if there is a strong selection; the longest path from the root represents the best-fit lineage. Inthenextsectionwewillpresentanalgorithmforconstructingavertex-rankedMSTwiththeminimum number of leaves. 8 7 Constructing a vertex-ranked MST with the minimum number of leaves Weaimtoconstructavertex-rankedMSTwiththeminimumnumberofleaves(MLVRMST)fromadistance graph. An algorithm for constructing a MLVRMST is presented in subsection 7.3. In the following two subsections we will present two lemmas, which will be used for proving the correctness of the algorithm. 7.1 A common structure that is shared by all MSTs In this section we will prove the existence of a laminar family F over the vertex set of an edge-weighted graphG. AcollectionF ofsubsetsofasetS isalaminarfamilyoverS if,foranytwointersectingsetsinF, onesetcontainstheother. Thatistosay, foreachpairS ,S inF suchthat|S |≤|S |, eitherS ∩S =∅, 1 2 1 2 1 2 or S ⊂S . 1 2 The vertex sets in F define a structure that is common to each MST of G. Furthermore, F can be used to obtain an upper bound on the degree of each vertex in a MST. The notion of a laminar family has been utilized previously by Ravi and Singh (2006), for designing an approximation algorithm for the minimum-degree MST Lemma 2. Given an edge-weighted graph G=(V,E) with k distinct weight classes W ={w ,w ,...,w }, 1 2 k and an MST M of G, let F be the forest that is formed by removing all edges in G that are heavier than w . i i Let C be the collection comprising the vertex set of each component of F . Consider the collection F which i i is constructed as follows: F =(cid:8)∪k C (cid:9)∪V. The following is true: i=1 i (i) F is a laminar family over V (ii) Each vertex set in F induces a connected subgraph in each MST of G Proof. (i). Consider any two vertex sets S and S in F. Let w and w be the weights of the heaviest 1 2 1 2 edges in the subgraphs of M that are induced by S and S , respectively. Let F and F be the forests that 1 2 1 2 are formed by removing all edges in M that are heavier than w and w , respectively. Let C and C be the 1 2 1 2 collections comprising the vertex set of each component in F and F , respectively. 1 2 It is clear that S ∈ C and S ∈ C . Consider the case where w = w . Since C =C , it follows that 1 1 2 2 1 2 1 2 S ∩S =∅. If w (cid:54)=w , then without loss of generality, let w <w . F can be constructed by adding to 1 2 1 2 1 2 2 F all edges in M that are no heavier than w . Each component in F that is not in F induces a connected 1 2 1 2 subgraph in exactly one component of F . If S ∈C ∩C then S ∩S =∅. Otherwise, if S ∈C \C , then 2 1 1 2 1 2 1 1 2 S isasubsetofexactlyonesetinC . ThisimpliesthateitherS ⊂S ,orS ∩S =∅. ThusF isalaminar 1 2 1 2 1 2 family over V. (ii). Let S be the vertex set of a component in the subgraph G of G that is created by removing all i i edges in G that are heavier than w . It is clear that S induces a connected subgraph in each minimum i i i spanning forest of G . For each minimum spanning forest there is a corresponding MST of G, such that the i minimum spanning forest can be constructed by removing from the MST all the edges are heavier than w . i It follows that S induces a connected subgraph in each MST of G. i 7.2 Selecting surrogate vertices on the basis of maximum vertex degree Lemma 3. We are given a phylogenetic tree T, the corresponding distance graph G = (V,E), and the laminar family F of the distance graph. Let the subgraph g =(V ,E ) of G contain all edges that are present g g in at least one MST of G. Let h be a hidden vertex in T such that there is a leaf l in Sg(h), and h is incident to l. Let S be a vertex set in F and let w be the corresponding edge weight. Then the following holds: i i (i) Let J be the set of all vertices that are incident to vertex v in g. Let S be the smallest sub-collection v v of F that covers J but not v. Among all MSTs, the maximum vertex degree δ (v) of v is |S |. v max v (ii) δ (l)≤δ (v) for each vertex v in Sg(h). max max 9 Proof. (i). Let J ={j ,j ,...,j } be the set of all vertices that are incident to v. Let M be some MST of v 1 2 k G. Let S ={S ,S ,...,S } be the smallest sub-collection of F that covers J and does not include v. Let v 1 2 m v S contain a set S that covers multiple vertices in J. Let j and j be any two vertices in S . Let w be the v i 1 2 i i heaviest weight on the path that joins j and j in M. The edges {v,j } and {v,j } are heavier than w . If 1 2 1 1 i they were not, then we would have v ∈ S . Since v, j and j are on a common cycle, each MST of G can i 1 2 only contain one of the two edges {v,j }, and {v,j }. It follows that for each set S ∈ S , each MST can 1 2 i v contain at most one edge which is incident to v and to a vertex in S . Thus the maximum number of edges i that can be incident to v in any MST is the number of vertex sets in S , i.e., δ (v)=|S |. v max v (ii). LetJ andJ bethesetofallverticesthatareincidenttolandving,respectively. Letj ∈J \Sg(h). l v l The weight of the edge {j,l} ∈ E is given by d . d > d since j ∈/ Sg(h). Thus d > d , and g jl jh vh lj lv consequently v ∈ J . We have d = d +d = d +d = d . Consider the MST M = (V ,E ) l jl jh hl jh hv jv M M that contains the edges {l,v} and {l,h}. Consider the spanning tree M(cid:48) that is formed by removing {l,h} from E and adding {v,h}. M(cid:48) and M have the same sum of edge weights. Thus we also have j ∈ J . M v ConsequentlyJ ⊆J . LetS andS bethesmallestsub-collectionsofF suchthatS coversJ butdoesnot l v l v l l containl, andS coversJ butdoesnotcontainv. S coversbothJ andJ sinceJ ⊆J . Thus|S |≤|S |. v v v l v l v l v From part (i), we know that |S |=δ (l) and |S |=δ (v). Thus δ (l)≤δ (v). l max v max max max 10