Yi-Ping Phoebe Chen (Ed.)

Bioinformatics Technologies

With 129 Figures and 50 Tables

Yi-Ping Phoebe Chen (Ed.)
School of Information Technology, Faculty of Science and Technology, Deakin University, Australia
Email: [email protected]

Library of Congress Control Number: 2004115713
ACM Computing Classification (1998): J.3, I.5, H.3
ISBN 3-540-20873-9 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: By the author
Cover design: KünkelLopka, Heidelberg
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Printed on acid-free paper

Preface

This book arose primarily out of a compelling need for a comprehensive reference in bioinformatics that will cater to students, researchers, and industry. We strongly believe that this new field evolved from the active interaction of two fast-developing disciplines: biology and information technology. Solving modern biological problems requires advanced computational methods.
Key techniques include database management, data modeling, pattern recognition, data mining, query processing, and visualization of biological data. Until very recently, virtually all public databases were based on large flat files stored in simple formats. Navigation among databases required expert knowledge and considerable patience. The huge quantities of biological data and the escalating demands of modern biological research increasingly require the sophistication and computing power of information technology (IT) tools. More specifically, optimal use of these tools requires proximal information, that is, knowing which data points lie in the neighborhood of others. In this book, we present methodologies and data structures for arriving at high-quality biological information, which can then be used as a foundation for developing practical tools for clustering and visualization in biological data mining and database management.

Throughout the book, we demonstrate the application of well-established concepts and techniques of information technology to the management and analysis of biological data. Biological analysis requires the integration of software tools used in data mining, such as clustering, classification, decision trees and decision tables, and in sequence and structural modeling, such as data modeling. A distinctive feature of our book is the integration of advanced database technologies with visualization techniques such as query-interactive user interfaces, visual descriptions, and advanced 3-D visual modeling.

Biological data continue to grow exponentially in size and complexity. As a result, they introduce new data types not previously seen even in molecular biology. It is vital and urgent that advanced information technologies, in particular database technologies and visual analysis, be applied to support biological research and innovation based on biological data. Specific IT-motivated activities are taking root in some parts of the biological research community, and we foresee that they will benefit information technology.

Bioinformatics Technologies is a comprehensive book that covers two important areas, IT and biology, which have become interwoven in recent years. Many international experts have contributed to this book. Each article is written so that a practitioner of bioinformatics can easily understand it and then apply the knowledge gained to extract useful information from biological data. Each article covers one topic and can be read independently of the others. The book provides both a general survey of each topic and an in-depth exposition of the state of the art. Practitioners will find this book a useful and handy resource when looking for solutions to practical problems in bioinformatics. Researchers can use it as a source of background information, current trends, and developments; it also provides them with the most important references on these topics.

The book covers the basic principles and applications of bioinformatics technologies. It also contains many articles that specifically address bioinformatics databases and emerging topics in bioinformatics technologies such as pattern discovery, data mining, simulation, and visualization. The central issue in bioinformatics is how to transform biological data into meaningful and valuable information. This implies that the biological knowledge related to the problem domain is incorporated into the requirements analysis phase of bioinformatics.

It has recently been recognized that bioinformatics will play an increasingly important role in the twenty-first century. For this reason, the international conference series on Asia-Pacific Bioinformatics (the first bioinformatics conference in the IT domain) was founded in 2002. The underlying goal of this conference series is to recognize the interdisciplinary nature of bioinformatics in the interplay between biology and IT, and how information technology can be applied to biology.

Even though a great deal of attention is paid to this area in terms of research and investment, the theoretical understanding needs further refinement to bring the outcome of biological analysis effectively to the service of mankind. In editing this book, this viewpoint has been carefully taken into consideration to conceptually organize the recent progress in bioinformatics. The book is organized into twelve chapters that cover twelve important technologies in bioinformatics.

Chapter 1, Introduction to Bioinformatics, provides an overview of bioinformatics technology and of the different techniques within bioinformatics. It also introduces the relationships between the other chapters.

Chapter 2, Overview of Structural Bioinformatics, presents an overview of structural bioinformatics. The chapter describes the organization of structural bioinformatics, the Protein Data Bank, secondary resources and applications, and the use of structural bioinformatics approaches in drug design. It also includes structural classification, structure prediction, functional assignments in structural genomics, protein-protein interactions, and protein-ligand interactions. The role of structural bioinformatics in systems biology is also briefly discussed.

Chapter 3, Database Warehousing in Bioinformatics, deals with the basics of database warehousing, transforming biological data into knowledge, data warehouse architectures, and data quality in bioinformatics.

Chapter 4, Data Mining for Bioinformatics, discusses the basics of data mining applicable to bioinformatics. The main types of data analysis, namely, biomedical data analysis, DNA data analysis, protein data analysis, and microarray data analysis, are elaborated upon.
Biomedical data analysis includes a major nucleotide sequence database, a protein sequence database, a gene expression database, and software tools for bioinformatics research. DNA data analysis covers DNA sequences and DNA data analysis. Protein data analysis encompasses protein and amino acid sequences and protein data analysis.

Chapter 5, Machine Learning in Bioinformatics, dwells on the theory behind machine learning applied to bioinformatics. It includes neural network architectures and applications. We also describe other machine learning techniques, such as genetic algorithms and fuzzy systems.

Chapter 6, Systems Biotechnology: a New Paradigm in Biotechnology Development, describes a new paradigm in biotechnology development called systems biotechnology. It covers integrative approaches and in silico modeling and simulation of cellular processes.

Chapter 7, Computational Modeling of Biological Processes with Petri Net-Based Architecture, describes computational modeling of biological processes with a Petri net-based architecture, a hybrid Petri net and a hybrid dynamic net, and a hybrid functional Petri net. The chapter also covers the implementation of an HFPNe in a genomic object net, and the modeling of biological processes with an HFPNe and a genomic object net and its visualizer.

Chapter 8, Biological Sequence Assembly and Alignment, illustrates biological sequence assembly and alignment. It covers large-scale sequence assembly, Euler sequence assembly, PESA sequence assembly, large-scale pairwise sequence alignment, large-scale multiple sequence alignment, and load balancing and communication overheads.

Chapter 9, Modeling for Bioinformatics, covers the basics of modeling techniques related to bioinformatics. It includes the major modeling techniques, namely, hidden Markov modeling for biological data analysis, comparative modeling, and molecular modeling.
We discuss in detail the application of hidden Markov modeling to biological data for sequence identification, sequence classification, and multiple alignment generation. Comparative modeling comprises protein comparative modeling, comparative genomic modeling, and probabilistic modeling. Probabilistic modeling encompasses Bayesian networks, stochastic context-free grammars, and probabilistic Boolean networks. Finally, we describe molecular modeling, which deals with molecular and related visualization applications, molecular mechanics, and modern computer programs used in molecular modeling.

Chapter 10, Pattern Matching for Motifs, addresses the issues in pattern matching for discovering motifs. Topics include gene regulation and promoter organization, as well as motif recognition and motif detection strategies. The chapter also presents two different approaches, namely, the single gene multi-species approach and the multi-gene multi-species approach.

Chapter 11, Visualization and Fractal Analysis of Biological Sequences, deals with visualization and fractal analysis of biological sequences. It elaborates on fractal analysis, the recurrent iterated function system model, the moment method to estimate the parameters of the IFS (RIFS) model, multifractal analysis, the DNA walk model, and chaos game representation of biological sequences. Two-dimensional portrait representations of DNA sequences and one-dimensional measure representations of biological sequences are also introduced.

Chapter 12, Microarray Data Analysis, discusses the techniques used to analyze microarray data, including microarray technology used for genome expression study, image analysis for data extraction, and data analysis for pattern discovery.

In a rapidly expanding area such as bioinformatics, no book can claim to cover the topics that suit the interests of everyone.
However, we hope that this book is comprehensive enough to serve as a useful and handy guide for both practitioners and researchers. It will help both IT professionals and biologists to understand the bioinformatics world.

We would like to thank all the authors who contributed chapters to this book, without whom the mission would have been impossible. Special thanks go to the reviewers for their professional input. We thank Ricky Chen and Chinnu Subramaniam for helping us check parts of the manuscript at short notice. We have taken care to cite referenced work. If we have missed any citation, we apologize for the lapse. We thank all researchers for their permission to use their figures in this book. We also wish to thank Ralf Gerstner of Springer for the final round of checking and for his timely help before publication. Finally, we wish to thank our families and friends for their support.

Some errors may remain in the book. Your input for improvement will be helpful for future reprints and editions.
Comments, corrections, and constructive suggestions should be sent to Springer or by electronic mail to [email protected]

January 2005
Yi-Ping Phoebe Chen

Contents

Preface

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 Needs of Bioinformatics Technologies
  1.3 An Overview of Bioinformatics Technologies
  1.4 A Brief Discussion on the Chapters
  References

2 Overview of Structural Bioinformatics
  2.1 Introduction
  2.2 Organization of Structural Bioinformatics
  2.3 Primary Resource: Protein Data Bank
    2.3.1 Data Format
    2.3.2 Growth of Data
    2.3.3 Data Processing and Quality Control
    2.3.4 The Future of the PDB
    2.3.5 Visualization
  2.4 Secondary Resources and Applications
    2.4.1 Structural Classification
    2.4.2 Structure Prediction
    2.4.3 Functional Assignments in Structural Genomics
    2.4.4 Protein-Protein Interactions
    2.4.5 Protein-Ligand Interactions
  2.5 Using Structural Bioinformatics Approaches in Drug Design
  2.6 The Future
    2.6.1 Integration over Multiple Resources
    2.6.2 The Impact of Structural Genomics
    2.6.3 The Role of Structural Bioinformatics in Systems Biology
  References

3 Database Warehousing in Bioinformatics
  3.1 Introduction
  3.2 Bioinformatics Data
  3.3 Transforming Data to Knowledge
  3.4 Data Warehousing
  3.5 Data Warehouse Architecture
  3.6 Data Quality
  3.7 Concluding Remarks
  References

4 Data Mining for Bioinformatics
  4.1 Introduction
  4.2 Biomedical Data Analysis
    4.2.1 Major Nucleotide Sequence Database, Protein Sequence Database, and Gene Expression Database
    4.2.2 Software Tools for Bioinformatics Research
  4.3 DNA Data Analysis
    4.3.1 DNA Sequence
    4.3.2 DNA Data Analysis
  4.4 Protein Data Analysis
    4.4.1 Protein and Amino Acid Sequence
    4.4.2 Protein Data Analysis
  References

5 Machine Learning in Bioinformatics
  5.1 Introduction
  5.2 Artificial Neural Network
  5.3 Neural Network Architectures and Applications
    5.3.1 Neural Network Architecture
    5.3.2 Neural Network Learning Algorithms
    5.3.3 Neural Network Applications in Bioinformatics
  5.4 Genetic Algorithm
  5.5 Fuzzy System
  References

6 Systems Biotechnology: a New Paradigm in Biotechnology Development
  6.1 Introduction
  6.2 Why Systems Biotechnology?
  6.3 Tools for Systems Biotechnology
    6.3.1 Genome Analyses
    6.3.2 Transcriptome Analyses
    6.3.3 Proteome Analyses
    6.3.4 Metabolome/Fluxome Analyses
  6.4 Integrative Approaches
  6.5 In Silico Modeling and Simulation of Cellular Processes
    6.5.1 Statistical Modeling
    6.5.2 Dynamic Modeling
  6.6 Conclusion
  References