ALGORITHMS FOR BIOLOGICAL NETWORK ALIGNMENT A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Jason Flannick August 2009 (cid:13)c Copyright by Jason Flannick 2009 All Rights Reserved ii IcertifythatIhavereadthisdissertationandthat,inmyopinion,itisfullyadequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Serafim Batzoglou) Principal Adviser IcertifythatIhavereadthisdissertationandthat,inmyopinion,itisfullyadequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Harley McAdams) IcertifythatIhavereadthisdissertationandthat,inmyopinion,itisfullyadequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (David Dill) Approved for the University Committee on Graduate Studies. iii iv Abstract A major goal in the post-genomic era is to understand how genes and proteins organize to ulti- mately cause complex traits. There are multiple levels of biological organization, such as low-level interactions between pairs of molecules, higher-level metabolic pathways and molecular complexes, and ultimately high-level function of an organism. Interaction networks summarize interactions between pairs of molecules, which are the building blocks for higher levels of molecular organization. As databases of interaction networks continue to grow in size and complexity, new computational tools are needed to search them for these higher levels of organization. Network alignment algorithms search interaction networks from different species to identify con- served functional modules—groups of molecules that cooperate to perform a common biological function. The increasing belief that functional modules are a critical level of molecular organization has led to an upsurge of network alignment research. This dissertation focuses on three network alignment algorithms we developed: (1) Græmlin, the first network aligner to align multiple large interaction networks efficiently; (2) Græmlin 1.1, a network aligner that uses a training set of network alignments to improve Græmlin’s accuracy; and (3) Græmlin 2.0, a network aligner that uses machine learning techniques to find the most accurate network alignments to date. Even though Græmlin 2.0 supersedes Græmlin and Græmlin 1.1, I discuss all three algorithms in detail to provide a complete account of our work. To provide a context for our algorithms, I also discuss the intuition behind network alignment and review past research in the field. In addition, I review methods we developed to build the Stanford Network Database, a database of microbial interaction networks that we have analyzed extensively with our alignment algorithms. Finally, I discuss several specific examples of network alignments. The examples illustrate the advantages of our alignment algorithms over other algorithms. I conclude with illustrations of applications we have developed for network alignment, including its use to predict protein function and characterize functional modules. v Acknowledgements MuchofthecontentinChapter2paraphrasesapreviouslysubmittedarticle[1]. Inthatwork,Balaji SrinivasandesignedthemethodologybehindtheStanfordNetworkDatabase, wrotethearticle, and created Figures 2.1-2.5. I helped with the design and validation of the networks in the Stanford Network Database as well as with the design of the Stanford Network Browser. Chapter 4 contains content from a previously published article [2]. In that work, I designed the Græmlin algorithm, wrote the article, and implemented and tested Græmlin with Tony Novak. Content from that article, which Balaji Srinivasan helped write, also appears in Section 8.1. Chapter6alsocontainscontentfromapreviouslypublishedarticle[3]. Inthatwork, Idesigned, implemented, and tested Græmlin 2.0, and I also wrote the article. Section8.3paraphrasesasubmittedarticle[4]. Inthatwork,Iperformedthenetworkalignment and wrote the article text that describes the alignment and its analysis. Throughout the work described in this dissertation I was supported in part by a Stanford Grad- uate Fellowship. Much of the work also was funded by NSF grant EF-0312459, NIH grant UO1- HG003162, the NSF CAREER Award, and the Alfred P. Sloan Fellowship IwouldliketothankallofthepeopleatStanfordwhohavehelpedmeduringmydoctoralstudies. Foremost,myadvisorSerafimBatzoglouhasgivenmeinvaluablesupportandthefreedomtopursue my own research direction. Also, Balaji Srinivasan and Tony Novak have helped a great deal with the work in this dissertation. Balaji introduced me to the network alignment problem, and without him this dissertation would cover a different topic. Tony helped with much of the design behind ournetworkalignmentalgorithms,andwithouthimthisdissertationwouldonlycoverGræmlin0.9. Finally,mycolleaguesinSerafim’slabhavehelpedimmeasurablyalongthewayandhavehelpedme throughmanystickingpoints. Inparticular,TomDohelpedformulatethekeyideaofGræmlin2.0’s scoring function, George Asimenos resolve computer and cluster issues on numerous occasions, and Eugene Fratkin helped me iron out the details behind many ideas. Finally, I would like to thank my friends and family for all of their support. My parents have given me unquestioned love and support for which I am eternally grateful. And without my wife Katie, thisworkwouldnothavebeenpossible. Theknowledgethatshewasthereformehashelped me through the difficult times in research that come along too often. vi Contents Abstract v Acknowledgements vi 1 Introduction 1 1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 The Stanford Network Database 6 2.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Network integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2 Statistical validation and comparison . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 The Stanford Network Browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1 Single protein analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.2 Multiple protein analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Overview of Network Alignment 17 3.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 A sample network alignment problem . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.1 A simple formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.2 Limitations of the simple formulation . . . . . . . . . . . . . . . . . . . . . . 22 3.3 Network alignment vs. the graph matching problem . . . . . . . . . . . . . . . . . . 23 3.4 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5 Equivalence class formulation of network alignment . . . . . . . . . . . . . . . . . . . 27 3.5.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.5.2 Formal problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 vii 4 Græmlin 30 4.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Scoring function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2.1 Node scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2.2 Edge scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 Search algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3.1 Pairwise alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3.2 Multiple alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4.1 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4.2 Network-to-network alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4.3 Query-to-network alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.4 Multiple network alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Improved Scoring Function for Græmlin 59 5.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Overview of scoring function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.1 Evolutionary events and probabilities . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.3.1 Determination of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.3 Training sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4 Results of parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.5 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5.1 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6 Græmlin 2.0 78 6.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2.1 Stage 1: Global alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2.2 Stage 2: Disjoint local alignment . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2.3 Stage 3: Probabilistic assignment . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3.1 Equivalence class accuracy comparisons . . . . . . . . . . . . . . . . . . . . . 92 6.3.2 Local alignment accuracy comparisons . . . . . . . . . . . . . . . . . . . . . . 95 viii 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.A Global alignment feature function definition . . . . . . . . . . . . . . . . . . . . . . . 98 6.A.1 Node feature function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.A.2 Edge feature function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7 Alignment of Known Modules 102 7.1 Alignments by Græmlin and Græmlin 1.1 . . . . . . . . . . . . . . . . . . . . . . . . 102 7.1.1 “Gaps” are important in network alignment . . . . . . . . . . . . . . . . . . . 103 7.1.2 Noisy networks make network alignment more difficult . . . . . . . . . . . . . 105 7.1.3 A more lenient edge model can uncover novel connections between modules . 105 7.1.4 Strong BLAST hits do not always imply functional orthology . . . . . . . . . 105 7.1.5 Multiple alignments of many species can become unstable . . . . . . . . . . . 106 7.2 Alignments by Græmlin 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 8 Applications of Network Alignment 111 8.1 Analysis of proteins and modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 8.1.1 Functional annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 8.1.2 Module identification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 8.2 Discovery of cryptic modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 8.3 Analysis of inositol metabolism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 8.3.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 8.3.2 The myo-inositol module in C.crescentus . . . . . . . . . . . . . . . . . . . . 119 8.3.3 Alignment of the myo-inositol module to other α-proteobacteria . . . . . . . 121 8.3.4 Prediction of myo-inositol transporter in S. meliloti . . . . . . . . . . . . . . 122 8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 9 Conclusion 124 9.1 Open problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 9.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 ix List of Tables 4.1 We quantified the performance of Græmlin on 10 networks in the Stanford Network Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 The Stanford Network Database contains more alignable KEGG pathways than the eukaryotic networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3 Græmlinperformsnetwork-to-networkalignmentofnetworksthresholdedat0.5with high specificity, sensitivity, and efficiency. . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4 Græmlinperformsnetwork-to-networkalignmentofnetworksthresholdedat0.25with high specificity, sensitivity, and efficiency. . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5 Græmlin efficiently and accurately performs query-to-network alignment of networks thresholded at 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6 Græmlin efficiently and accurately performs query-to-network alignment of networks thresholded at 0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.7 Græmlin efficiently and accurately performs multiple network alignment. . . . . . . . 57 5.1 Græmlin 1.1 is more accurate than Græmlin and MaWISh. . . . . . . . . . . . . . . 76 5.2 Græmlin 1.1 is more specific than Græmlin. . . . . . . . . . . . . . . . . . . . . . . . 77 6.1 Græmlin 2.0 aligns nodes with higher specificity. . . . . . . . . . . . . . . . . . . . . 94 6.2 Græmlin 2.0 aligns nodes with higher sensitivity. . . . . . . . . . . . . . . . . . . . . 95 6.3 Græmlin 2.0 groups proteins into modules with higher accuracy. . . . . . . . . . . . 97 x
Description: