Natural Computing Series SeriesEditors: G.Rozenberg Th.Bäck A.E.Eiben J.N.Kok H.P.Spaink LeidenCenterforNaturalComputing AdvisoryBoard: S.Amari G.Brassard K.A.DeJong C.C.A.M.Gielen T.Head L.Kari L.Landweber T.Martinetz Z.Michalewicz M.C.Mozer E.Oja G.P˘aun J.Reif H.Rubin A.Salomaa M.Schoenauer H.-P.Schwefel C.Torras D.Whitley E.Winfree J.M.Zurada Hans-Joachim Böckenhauer Dirk Bongartz Algorithmic Aspects of Bioinformatics With118Figuresand9Tables 123 Authors SeriesEditors Dr.Hans-JoachimBöckenhauer G.Rozenberg(ManagingEditor) ETHZurich [email protected] InformationTechnologyandEducation CABF11 Th.Bäck,J.N.Kok,H.P.Spaink Universitätsstr.6 LeidenCenterforNaturalComputing 8092Zurich LeidenUniversity Switzerland NielsBohrweg1 [email protected] 2333CALeiden,TheNetherlands Dr.DirkBongartz A.E.Eiben ComputerScienceI VrijeUniversiteitAmsterdam RWTHAachen TheNetherlands Germany [email protected] LibraryofCongressControlNumber:2007924247 ACMComputingClassification(1998):F.2.2,G.2.1,G.2.2,J.3 ISSN 1619-7127 ISBN 978-3-540-71912-0 SpringerBerlinHeidelbergNewYork Thisworkissubjecttocopyright.Allrightsarereserved,whetherthewholeorpartofthematerialisconcerned, specificallytherightsoftranslation, reprinting,reuseofillustrations, recitation, broadcasting,reproduction on microfilmorinanyotherway,andstorageindatabanks.Duplicationofthispublicationorpartsthereofispermitted onlyundertheprovisionsoftheGermanCopyrightLawofSeptember9,1965,initscurrentversion,andpermission forusemustalwaysbeobtainedfromSpringer.ViolationsareliableforprosecutionundertheGermanCopyright Law. SpringerisapartofSpringerScience+BusinessMedia springer.com ©Springer-VerlagBerlinHeidelberg2007 OriginallypublishedintheGermanlanguagebyB.G.TeubnerVerlagas “Hans-JoachimBöckenhauerundDirkBongartz:AlgorithmischeGrundlagenderBioinformatik”. ©B.G.TeubnerVerlag|GWVFachverlageGmbH,Wiesbaden2003 Theuseofgeneraldescriptivenames,registerednames,trademarks,etc.inthispublicationdoesnotimply,evenin theabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevantprotectivelawsandregulationsand thereforefreeforgeneraluse. CoverDesign:KünkelLopka,Werbeagentur,Heidelberg Typesetting:bytheAuthors Production:LE-TEXJelonek,Schmidt&VöcklerGbR,Leipzig Printedonacid-freepaper 45/3100/YL 543210 Preface The discovery of the double-helix structure of DNA by Watson and Crick more than (cid:12)fty years ago was the starting point of a new era in molecular biology. Since then, our knowledge of biological structures and processes has growntremendously.Butmanyoftheseadvanceswouldhavebeenunthinkable without using computational methods. Computer science plays a leading role in the emerging interdisciplinary (cid:12)eld of bioinformatics. Only the interplay between biological methods and concepts from informatics has enabled us to successfully maintainprojectssuch as the Human Genome Project.But com- pleting this project also initiated further challenges for bioinformatics. The main goal now is to analyze and make use of the collected data. Moreover, new areasareexplored, forinstance, the prediction of the spatial structure of proteins. These results will be useful for designing new drugs and improved medical therapies; in this context, computer science faces bold challenges. However, progress in molecular biology also in(cid:13)uences the design and devel- opmentofcomputersciencemethodsandconcepts,asintheexcitingresearch (cid:12)eld of molecular computing. This book introduces some of the fundamental problems from the (cid:12)eld of bioinformatics;itdiscussesthemodelsusedtoformallydescribetheproblems, anditanalyzesthealgorithmicapproachesusedtoattackandeventuallysolve them.Thisbookcanberegardedasatextbook,describingthetopicsindetail and presenting the formal models in a mathematically stringent yet intuitive way. Thus, it is well suited as an introduction into the (cid:12)eld of bioinformatics for students or for preparing introductory lectures. The spectrum of topics includes classical subjects such as string algorithms, fundamental approaches for sequencing DNA and for analyzing the sequencing data, but also more recent subjects such asstructure prediction for biomoleculesand haplotyping models. We havetried to present some fundamental ideas and approachesfor each of the topics in as much detail as possible, and also to give some insight into current research. VI Without help from others, it would have been much harder, if not impos- sible, to write this book. So we would like to thank everyone who helped us in one way or another. Above all, we would like to thank Juraj Hromkovi(cid:20)c for encouraging us to writethistextbook,forsupportinguswithmanyhelpfulsuggestionsandcom- ments, and, last but not least, for carefully proofreading previous versions of the manuscript. We aregrateful to Heidi Imho(cid:11) forher commentsand advice concerning the biological basics, and to Mark Cieliebak for discussions and comments on restriction site mapping. Our cordial thanks to our colleagues JoachimKupke, Sebastian Seibert, Walter Unger, and Viktor Keil for helpful advice and for the discussion on numerous topics, as well as for their support in the technical realization of this book. Moreover, we thank all who helped us to improve on the previous (German) version of the book by sending us their comments and pointing out some mistakes. In particular, we would like tothankKatharinaBalzer,FrankKurth,andNanMungard.Aspecialthanks goes to the team at Springer, particularly to Ronan Nugent, for his support. Finally, we would like to express our thanks to our families and friends, for their enduring supportand motivation duringour workon this book. Cordial thanks to all of you! Zu(cid:127)rich and Mo(cid:127)nchengladbach, January 2007 Hans-Joachim Bo(cid:127)ckenhauer Dirk Bongartz Contents 1 Introduction............................................... 1 Part I Introduction and Basic Algorithms 2 Basics of Molecular Biology................................ 7 2.1 Proteins................................................ 7 2.2 Nucleic Acids ........................................... 9 2.3 Hereditary Information and Protein Biosynthesis ............ 12 2.4 Experimental Techniques ................................. 15 2.4.1 Basic Terms and Methods .......................... 15 2.4.2 Duplication of DNA ............................... 15 2.4.3 Gel Electrophoresis and Direct Sequencing............ 16 2.4.4 DNA Chips....................................... 19 2.5 Bibliographic Notes...................................... 20 3 Basic Concepts: Strings, Graphs, and Algorithms.......... 23 3.1 Strings................................................. 23 3.2 Graphs................................................. 25 3.3 Algorithms and Complexity............................... 28 3.4 Bibliographic Notes...................................... 35 4 String Algorithms ......................................... 37 4.1 The String Matching Problem............................. 37 4.2 String Matching Automata ............................... 39 4.3 The Boyer{MooreAlgorithm.............................. 44 4.4 Su(cid:14)x Trees............................................. 50 4.5 Further Applications of Su(cid:14)x Trees........................ 58 4.5.1 Generalized Su(cid:14)x Trees and the Substring Problem ... 58 4.5.2 Longest Common Substrings........................ 61 4.5.3 E(cid:14)cient Computation of Overlaps................... 63 VIII Contents 4.5.4 Repeats in Strings................................. 66 4.6 Su(cid:14)x Arrays............................................ 68 4.7 Summary............................................... 77 4.8 Bibliographic Notes...................................... 78 5 Alignment Methods ....................................... 81 5.1 Alignment of Two Strings ................................ 82 5.1.1 Basic De(cid:12)nitions .................................. 82 5.1.2 Global Alignment ................................. 84 5.1.3 Local and Semiglobal Alignment .................... 89 5.1.4 Generalized Scoring Functions ...................... 94 5.2 Heuristic Methods for Database Search..................... 97 5.2.1 The FASTA Heuristic.............................. 98 5.2.2 The BLAST Heuristic.............................. 99 5.3 Multiple Alignments .....................................101 5.3.1 De(cid:12)nition and Scoring of Multiple Alignments ........101 5.3.2 Exact Computation of Multiple Alignments...........104 5.3.3 Combining Pairwise Alignments.....................109 5.4 Summary...............................................114 5.5 Bibliographic Notes......................................114 Part II DNA Sequencing 6 Introduction and Overview ................................119 7 Physical Mapping .........................................123 7.1 Restriction Site Mapping .................................123 7.1.1 The Double Digest Approach .......................124 7.1.2 The Partial Digest Approach .......................131 7.1.3 Comparison of Methods for Restriction Site Mapping ..141 7.2 Hybridization Mapping...................................143 7.2.1 Mapping with Unique Probes .......................146 7.2.2 Mapping with Unique Probes and Errors .............157 7.2.3 Mapping with Non-unique Probes ...................165 7.3 Summary...............................................166 7.4 Bibliographic Notes......................................168 8 DNA Sequencing ..........................................171 8.1 Shotgun Sequencing .....................................171 8.1.1 Crucial Points to Be Considered in a Suitable Model...174 8.1.2 The Shortest Common Superstring Problem ..........176 8.1.3 Re(cid:12)ned Models for Fragment Assembly ..............196 8.2 Sequencing by Hybridization..............................201 8.3 Summary...............................................207 Contents IX 8.4 Bibliographic Notes......................................208 Part III Analyzing Biological Data 9 Finding Signals in DNA Sequences ........................213 9.1 Identical and Similar Substrings...........................213 9.2 Tandem Repeats ........................................217 9.3 Frequent and Infrequent Substrings ........................223 9.4 Hidden Markov Models...................................228 9.5 Summary...............................................235 9.6 Bibliographic Notes......................................235 10 Genome Rearrangements ..................................237 10.1 Modeling ...............................................237 10.2 Sorting Undirected Permutations ..........................239 10.3 Sorting Directed Permutations ............................247 10.4 Computing the Syntenic Distance .........................249 10.5 Summary...............................................255 10.6 Bibliographic Notes......................................255 11 Phylogenetic Trees ........................................257 11.1 Ultrametric Distances....................................258 11.2 Additive Trees ..........................................265 11.3 Characters with Binary States ............................268 11.4 The Parsimony Principle and the Quartet Method...........275 11.5 Summary...............................................283 11.6 Bibliographic Notes......................................285 12 Haplotyping ...............................................287 12.1 Inferring Haplotypes from a Population ....................288 12.2 Haplotyping a Single Individual ...........................305 12.3 Summary...............................................316 12.4 Bibliographic Notes......................................316 13 Molecular Structures ......................................319 13.1 RNA Secondary Structure Prediction ......................320 13.1.1 Minimizing the Free Energy ........................322 13.1.2 Stochastic Context-Free Grammars ..................329 13.2 Structure-Based Comparison of Biomolecules................337 13.3 Protein Structure Prediction ..............................349 13.3.1 De Novo Structure Prediction | The HP Model ......352 13.3.2 Protein Threading.................................363 13.4 Summary...............................................369 13.4.1 RNA Secondary Structure Prediction ................369 X Contents 13.4.2 Structure-Based Comparison of Biomolecules .........371 13.4.3 Protein Structure Prediction........................371 13.5 Bibliographic Notes......................................372 13.5.1 RNA Secondary Structure Prediction ................372 13.5.2 Structure-Based Comparison of Biomolecules .........373 13.5.3 Protein Structure Prediction........................374 References.....................................................377 Index..........................................................389 1 Introduction Topics like biotechnology, genetics, and bioinformatics frequently (cid:12)nd their way into today’s headlines. This trend, often seen as the beginning of a new era, was initiated by a seminal discovery more than (cid:12)fty years ago, the dis- covery of the DNA double helix structure by Watson and Crick in 1953. The developmentinmolecularbiologycontinuedtogrowsteadilyfromthen,andit becamemoreandmoreimportanttothe public, forexample,with the launch of the Human Genome Project in the 1990s and with the presentation of the cloned sheep \Dolly." In 2000, scientists eventually announced the complete sequencing of the human genome. Let us take a closer look at the task of DNA sequencing. The goal here is to determine the sequence of nucleotides, which are the elementary build- ing blocks in the human DNA, i.e., in the molecule storing our hereditary information. From the viewpoint of informatics, we are looking for a string madeupfromlettersrepresentingthesenucleotides.Methodsforreadingtheir sequence have already been known since the 1980s, but the length of DNA molecules that can be sequenced using these methods is severely restricted. Current methods enable us to read about 1000 consecutive nucleotides. Nev- ertheless, we are mainly interested in DNA molecules consisting of hundreds ofthousandsofthem.Howcanwereachourgoalofsequencinginthesecases? One possible approach is the following: We generate a multitude of copies of theDNAmoleculeofourinterest.Werandomlybreakeachofthesecopiesinto fragments.Withhighprobability,theresultingfragmentsfromdi(cid:11)erentcopies overlap with each other. Ideally, these fragments are su(cid:14)ciently short for di- rect sequencing. Having performed these sequencing operations, we are left with many string fragments of which we know that they occur as substrings in the DNA sequence, and that these fragmentsmay overlapwith eachother. But we have no clue how to combine these fragmentsto achievethe complete DNA sequence, since their order was lost in this process. Typically, we have to reorder thousands of string fragments, a task we are not capable of doing by hand. This is the point where informatics comes in. First of all, it gives us the appropriate tools for managing the data, i.e., the set of string fragments.