ALGORITHMS IN BIOINFORMATICS A PRACTICAL INTRODUCTION C7033_FM.indd 1 10/26/09 3:13:09 PM CHAPMAN & HALL/CRC Mathematical and Computational Biology Series Aims and scope: This series aims to capture new developments and summarize what is known over the whole spectrum of mathematical and computational biology and medicine. It seeks to encourage the integration of mathematical, statistical and computational methods into biology by publishing a broad range of textbooks, reference works and handbooks. The titles included in the series are meant to appeal to students, researchers and professionals in the mathematical, statistical and computational sciences, fundamental biology and bioengineering, as well as interdisciplinary researchers involved in the field. The inclusion of concrete examples and applications, and programming techniques and examples, is highly encouraged. Series Editors Alison M. Etheridge Department of Statistics University of Oxford Louis J. Gross Department of Ecology and Evolutionary Biology University of Tennessee Suzanne Lenhart Department of Mathematics University of Tennessee Philip K. Maini Mathematical Institute University of Oxford Shoba Ranganathan Research Institute of Biotechnology Macquarie University Hershel M. Safer Weizmann Institute of Science Bioinformatics & Bio Computing Eberhard O. Voit The Wallace H. Couter Department of Biomedical Engineering Georgia Tech and Emory University Proposals for the series should be submitted to one of the series editors above or directly to: CRC Press, Taylor & Francis Group 4th, Floor, Albert House 1-4 Singer Street London EC2A 4BQ UK C7033_FM.indd 2 10/26/09 3:13:09 PM Published Titles Algorithms in Bioinformatics: A Kinetic Modelling in Systems Biology Practical Introduction Oleg Demin and Igor Goryanin Wing-Kin Sung Knowledge Discovery in Proteomics Bioinformatics: A Practical Approach Igor Jurisica and Dennis Wigle Shui Qing Ye Meta-analysis and Combining Cancer Modelling and Simulation Information in Genetics and Genomics Luigi Preziosi Rudy Guerra and Darlene R. Goldstein Combinatorial Pattern Matching Modeling and Simulation of Capsules Algorithms in Computational Biology and Biological Cells Using Perl and R C. Pozrikidis Gabriel Valiente Niche Modeling: Predictions from Computational Biology: A Statistical Statistical Distributions Mechanics Perspective David Stockwell Ralf Blossey Normal Mode Analysis: Theory and Computational Neuroscience: A Applications to Biological and Chemical Comprehensive Approach Systems Jianfeng Feng Qiang Cui and Ivet Bahar Data Analysis Tools for DNA Optimal Control Applied to Biological Microarrays Models Sorin Draghici Suzanne Lenhart and John T. Workman Differential Equations and Pattern Discovery in Bioinformatics: Mathematical Biology Theory & Algorithms D.S. Jones and B.D. Sleeman Laxmi Parida Engineering Genetic Circuits Python for Bioinformatics Chris J. Myers Sebastian Bassi Exactly Solvable Models of Biological Spatial Ecology Invasion Stephen Cantrell, Chris Cosner, and Sergei V. Petrovskii and Bai-Lian Li Shigui Ruan Gene Expression Studies Using Spatiotemporal Patterns in Ecology Affymetrix Microarrays and Epidemiology: Theory, Models, and Hinrich Göhlmann and Willem Talloen Simulation Horst Malchow, Sergei V. Petrovskii, and Glycome Informatics: Methods and Ezio Venturino Applications Kiyoko F. Aoki-Kinoshita Stochastic Modelling for Systems Biology Handbook of Hidden Markov Models in Darren J. Wilkinson Bioinformatics Martin Gollery Structural Bioinformatics: An Algorithmic Approach Introduction to Bioinformatics Forbes J. Burkowski Anna Tramontano The Ten Most Wanted Solutions in An Introduction to Systems Biology: Protein Bioinformatics Design Principles of Biological Circuits Anna Tramontano Uri Alon C7033_FM.indd 3 10/26/09 3:13:09 PM C7033_FM.indd 4 10/26/09 3:13:09 PM ALGORITHMS IN BIOINFORMATICS A PRACTICAL INTRODUCTION WING-KIN SUNG C7033_FM.indd 5 10/26/09 3:13:09 PM Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2010 by Taylor and Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number: 978-1-4200-7033-0 (Hardback) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit- ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging‑in‑Publication Data Sung, Wing-Kin. Algorithms in bioinformatics : a practical introduction / Wing-Kin Sung. p. cm. -- (CHAPMAN & HALL/CRC mathematical and computational biology series) Includes bibliographical references and index. ISBN 978-1-4200-7033-0 (hardcover : alk. paper) 1. Bioinformatics. 2. Genetic algorithms. I. Title. II. Series. QH324.2.S86 2009 572.80285--dc22 2009030738 Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com C7033_FM.indd 6 10/26/09 3:13:09 PM Contents Preface xv 1 Introduction to Molecular Biology 1 1.1 DNA, RNA, and Protein . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.2 DNA . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 RNA . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2 Genome, Chromosome, and Gene . . . . . . . . . . . . . . . 10 1.2.1 Genome. . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.2 Chromosome . . . . . . . . . . . . . . . . . . . . . . 10 1.2.3 Gene . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2.4 Complexity of the Organism versus Genome Size. . 11 1.2.5 Number of Genes versus Genome Size . . . . . . . . 11 1.3 Replication and Mutation of DNA . . . . . . . . . . . . . . 12 1.4 Central Dogma (from DNA to Protein) . . . . . . . . . . . 13 1.4.1 Transcription (Prokaryotes) . . . . . . . . . . . . . 13 1.4.2 Transcription (Eukaryotes) . . . . . . . . . . . . . . 14 1.4.3 Translation . . . . . . . . . . . . . . . . . . . . . . . 15 1.5 Post-TranslationModification (PTM) . . . . . . . . . . . . 17 1.6 Population Genetics . . . . . . . . . . . . . . . . . . . . . . 18 1.7 Basic BiotechnologicalTools . . . . . . . . . . . . . . . . . . 18 1.7.1 Restriction Enzymes . . . . . . . . . . . . . . . . . 19 1.7.2 Sonication . . . . . . . . . . . . . . . . . . . . . . . 19 1.7.3 Cloning . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.7.4 PCR . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.7.5 Gel Electrophoresis . . . . . . . . . . . . . . . . . . 22 1.7.6 Hybridization . . . . . . . . . . . . . . . . . . . . . 23 1.7.7 Next Generation DNA Sequencing . . . . . . . . . . 24 1.8 Brief History of Bioinformatics . . . . . . . . . . . . . . . . 26 1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2 Sequence Similarity 29 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Global Alignment Problem . . . . . . . . . . . . . . . . . . 30 2.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . 32 2.2.2 Running Time Issue . . . . . . . . . . . . . . . . . . 34 2.2.3 Space Efficiency Issue . . . . . . . . . . . . . . . . . 35 vii viii 2.2.4 More on Global Alignment . . . . . . . . . . . . . . 38 2.3 Local Alignment . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Semi-Global Alignment . . . . . . . . . . . . . . . . . . . . 41 2.5 Gap Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.1 General Gap Penalty Model . . . . . . . . . . . . . 43 2.5.2 Affine Gap Penalty Model . . . . . . . . . . . . . . 43 2.5.3 Convex Gap Model . . . . . . . . . . . . . . . . . . 45 2.6 Scoring Function . . . . . . . . . . . . . . . . . . . . . . . . 50 2.6.1 Scoring Function for DNA . . . . . . . . . . . . . . 50 2.6.2 Scoring Function for Protein . . . . . . . . . . . . . 51 2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3 Suffix Tree 57 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.2 Suffix Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3 Simple Applications of a Suffix Tree . . . . . . . . . . . . . 59 3.3.1 Exact String Matching Problem . . . . . . . . . . . 59 3.3.2 Longest Repeated Substring Problem . . . . . . . . 60 3.3.3 Longest Common Substring Problem . . . . . . . . 60 3.3.4 Longest Common Prefix (LCP) . . . . . . . . . . . 61 3.3.5 Finding a Palindrome . . . . . . . . . . . . . . . . . 62 3.3.6 ExtractingtheEmbeddedSuffixTreeofaStringfrom the Generalized Suffix Tree . . . . . . . . . . . . . . 63 3.3.7 Common Substring of 2 or More Strings . . . . . . 64 3.4 Construction of a Suffix Tree . . . . . . . . . . . . . . . . . 65 3.4.1 Step 1: Construct the Odd Suffix Tree . . . . . . . 68 3.4.2 Step 2: Construct the Even Suffix Tree . . . . . . . 69 3.4.3 Step 3: Merge the Odd and the Even Suffix Trees . 70 3.5 Suffix Array . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.5.1 Construction of a Suffix Array . . . . . . . . . . . . 73 3.5.2 Exact String Matching Using a Suffix Array . . . . 73 3.6 FM-Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.6.1 Definition. . . . . . . . . . . . . . . . . . . . . . . . 77 3.6.2 The occ Data Structure . . . . . . . . . . . . . . . . 78 3.6.3 Exact String Matching Using the FM-Index . . . . 79 3.7 Approximate Searching Problem . . . . . . . . . . . . . . . 81 3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4 Genome Alignment 87 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.2 Maximum Unique Match (MUM) . . . . . . . . . . . . . . . 88 4.2.1 How to Find MUMs . . . . . . . . . . . . . . . . . . 89 4.3 MUMmer1: LCS . . . . . . . . . . . . . . . . . . . . . . . . 92 4.3.1 Dynamic Programming Algorithm in O(n2) Time . 93 4.3.2 An O(nlogn)-Time Algorithm . . . . . . . . . . . . 93 ix 4.4 MUMmer2 and MUMmer3 . . . . . . . . . . . . . . . . . . 96 4.4.1 Reducing Memory Usage . . . . . . . . . . . . . . . 97 4.4.2 Employinga New Alternative Algorithmfor Finding MUMs . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.4.3 Clustering Matches . . . . . . . . . . . . . . . . . . 97 4.4.4 Extension of the Definition of MUM . . . . . . . . . 98 4.5 Mutation Sensitive Alignment . . . . . . . . . . . . . . . . . 99 4.5.1 Concepts and Definitions . . . . . . . . . . . . . . . 99 4.5.2 The Idea of the Heuristic Algorithm . . . . . . . . . 100 4.5.3 Experimental Results . . . . . . . . . . . . . . . . . 102 4.6 Dot Plot for Visualizing the Alignment . . . . . . . . . . . . 103 4.7 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 105 4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5 Database Search 109 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.1.1 BiologicalDatabase . . . . . . . . . . . . . . . . . . 109 5.1.2 Database Searching . . . . . . . . . . . . . . . . . . 109 5.1.3 Types of Algorithms. . . . . . . . . . . . . . . . . . 110 5.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . 111 5.3 FastA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.3.1 FastP Algorithm . . . . . . . . . . . . . . . . . . . . 112 5.3.2 FastA Algorithm. . . . . . . . . . . . . . . . . . . . 113 5.4 BLAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.4.1 BLAST1 . . . . . . . . . . . . . . . . . . . . . . . . 115 5.4.2 BLAST2 . . . . . . . . . . . . . . . . . . . . . . . . 116 5.4.3 BLAST1 versus BLAST2 . . . . . . . . . . . . . . . 118 5.4.4 BLAST versus FastA . . . . . . . . . . . . . . . . . 118 5.4.5 Statistics for Local Alignment . . . . . . . . . . . . 119 5.5 Variations of the BLAST Algorithm . . . . . . . . . . . . . 120 5.5.1 MegaBLAST . . . . . . . . . . . . . . . . . . . . . . 120 5.5.2 BLAT . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.5.3 PatternHunter . . . . . . . . . . . . . . . . . . . . . 121 5.5.4 PSI-BLAST (Position-Specific Iterated BLAST) . . 123 5.6 Q-gram Alignment based on Suffix ARrays (QUASAR) . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . 124 5.6.2 Speeding Up and Reducing the Space for QUASAR 127 5.6.3 Time Analysis . . . . . . . . . . . . . . . . . . . . . 127 5.7 Locality-Sensitive Hashing . . . . . . . . . . . . . . . . . . . 128 5.8 BWT-SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.8.1 Aligning Query Sequence to Suffix Tree . . . . . . . 130 5.8.2 Meaningful Alignment. . . . . . . . . . . . . . . . . 133 5.9 Are Existing Database Searching Methods Sensitive Enough? 136 5.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Description: